data.table aggregations that return vectors, such as scale()

I recently started working with a larger data set and began learning and migrating to data.table to improve the performance of aggregation/grouping. I can't get certain expressions or functions to group the way I expect. The following is an example of the basic operation I'm having trouble with.

library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)

If I simply want to calculate the mean of each group by category, this is easy.

dt[,mean(value),by="category"]
    category          V1
 1:        1 -0.67555478
 2:        2 -0.50438413
 3:        3  0.29093723
 4:        4 -0.41684790
 5:        5  0.33921764
 6:        6  0.01970997
 7:        7 -0.23684245
 8:        8 -0.04280998
 9:        9  0.01838804
10:       10  0.44295978

If I try to use the scale function, or even a simple expression that subtracts the group mean from the value, I have problems. The grouping appears to be ignored, and the function/expression is applied to each row. Both of the following return all 100 rows with their category instead of 10 group rows.

dt[,scale(value),by="category"]


dt[,value-mean(value), by="category"]

I think it might be helpful to recreate the scale as a function that returns a numeric vector instead of a matrix.

zScore <- function(x) {
  z = (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
  return(z)
}

dt[,zScore(value),by="category"]

category V1
1: 1 -1.45114132
2: 1 -0.35304528
3: 1 -0.94075418
4: 1 1.44454416
5: 1 1.39448268
6: 1 0.55366652
....
97: 10 -0.43190602
98: 10 -0.25409244
99: 10 0.35496694
100: 10 0.57323480
category V1

This also returns the zScore function applied to all rows (N = 100), ignoring the grouping. What am I missing to make scale() or a custom function use the grouping the way mean() does above?

You clarified in the comments that you want the same behavior as:

ddply(df,"category",transform, zscorebycategory=zScore(value))

This produces:

category value zscorebycategory
1 1 0.28860691 0.31565682
2 1 1.17473759 1.33282374
3 1 0.06395503 0.05778463
4 1 1.37825487 1.56643607
etc

The data.table option you provided gives:

category V1
1: 1 0.31565682
2: 1 1.33282374
3: 1 0.05778463
4: 1 1.56643607
etc

This is exactly the same data. However, you also want the value column repeated in the result, and the V1 variable renamed to something more descriptive. data.table gives you the grouping variables in the result, plus the result of the expression you supply. So let's modify the expression to give you the output you're after:

Your

dt[,zScore(value),by="category"]

becomes:

dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]

The named items in the list become named columns in the result.
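To see why mean() collapses to one row per group while zScore() does not, here is a minimal sketch on hypothetical toy data (the `toy`, `agg`, and `centered` names are made up for illustration). The number of rows data.table returns per group follows the length of what the expression returns for that group: a length-1 result gives one row, a length-n result gives n rows.

```r
library(data.table)

# Hypothetical toy data: 3 groups of 4 values each
set.seed(1)
toy <- data.table(category = rep(1:3, each = 4), value = rnorm(12))

# mean() returns one number per group, so the result has one row per group
agg <- toy[, list(avg = mean(value)), by = "category"]

# value - mean(value) returns one number per input row,
# so the result has as many rows as the original table
centered <- toy[, list(value = value, centered = value - mean(value)),
                by = "category"]

nrow(agg)       # 3
nrow(centered)  # 12
names(centered) # "category" "value" "centered"
```

So the grouping is not being ignored: the mean subtracted in each row is that row's own group mean, and the names given inside list() are what replace V1.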

library(plyr)
plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
> TRUE

(Note that I converted your ddply data.frame result to a data.table so that the identical() comparison works.)
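If the goal is to keep the original table and just attach the per-group z-score, one alternative is data.table's `:=` operator, which adds a column by reference. This is a sketch under that assumption rather than what the question asked for; `toy` and the `scaled` column are hypothetical names. It also shows that wrapping scale() in as.vector() drops the matrix shape, which should make it interchangeable with the custom zScore() here.

```r
library(data.table)

zScore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Hypothetical data shaped like the example in the question
set.seed(1)
toy <- data.table(category = rep(1:10, 10), value = rnorm(100))

# := adds the column in place; rows keep their original order
toy[, zscorebycategory := zScore(value), by = "category"]

# scale() returns a one-column matrix with attributes; as.vector() flattens it
toy[, scaled := as.vector(scale(value)), by = "category"]

# Compare the two per-group standardizations
all.equal(toy$zscorebycategory, toy$scaled)
```

Unlike the list() form, this does not reorder the rows by group, so the result will not be identical() to the ddply output without sorting first.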

