Performance - idata.frame: why do I get the error "is.data.frame(df) is not TRUE"?

I am working with a large data frame called exp (file here) in R. To improve performance, it was suggested that I look at the idata.frame() function from plyr, but I think I am using it wrong.

My original call is slow, but it works:

df.median<-ddply(exp, 
 .(groupname,starttime,fPhase,fCycle), 
 numcolwise(median), 
 na.rm=TRUE)

With idata.frame, I get the error: is.data.frame(df) is not TRUE

library(plyr)
df.median<-ddply(idata.frame(exp), 
 .(groupname,starttime,fPhase,fCycle), 
 numcolwise(median), 
 na.rm=TRUE)

So I thought the problem might be my data, and I tried the baseball data set. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try a call similar to what I want using baseball, it doesn't work:

bb.median<-ddply(idata.frame(baseball), 
 .(id,year,team), 
 numcolwise(median), 
 na.rm=TRUE)
Error: is.data.frame(df) is not TRUE

Maybe my mistake lies in how I specify the grouping? Does anyone know how to make my example work?

ETA:

I also tried:

groupVars <- c("groupname","starttime","fPhase","fCycle")
voi <- c('inadist','smldist','lardist')

i <- idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi]: object of type 'environment' is not subsettable

This uses a faster way to get the median, but it gives a different error. I don't think I understand how to use idata.frame at all.

Given that you are working with "big" data and looking for performance, this seems like a good fit for data.table.

In particular, use lapply(.SD, FUN) together with the .SDcols and by arguments.

Set up the data.table

library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)

Find which columns are numeric

numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))



dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase, 
 fCycle), .SDcols = numeric_columns]
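As a minimal sketch of the same pattern on data small enough to check by hand: the table `toy` below and its values are made up for illustration and only borrow a few column names from the question. It also passes na.rm = TRUE through lapply() to median(), since the original plyr call used na.rm = TRUE while the call above does not; whether that matters depends on whether the real data contains NAs.

```r
library(data.table)

# Toy data standing in for `exp`; the values are invented for illustration.
toy <- data.table(
  groupname = c("a", "a", "a", "b", "b", "b"),
  fPhase    = c("p1", "p1", "p1", "p1", "p1", "p1"),
  inadist   = c(1, 2, 9, 4, NA, 6),
  smldist   = c(10, 20, 30, 40, 50, 60)
)

# Numeric columns, found the same way as above.
numeric_columns <- names(which(unlist(lapply(toy, is.numeric))))

# Per-group medians; lapply() forwards na.rm = TRUE to median().
toy_median <- toy[, lapply(.SD, median, na.rm = TRUE),
                  by = list(groupname, fPhase),
                  .SDcols = numeric_columns]
print(toy_median)
```

Without na.rm = TRUE, the inadist median for group "b" would be NA, because median() returns NA when its input contains missing values.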

Some benchmark tests

library(rbenchmark)
benchmark(
  data.table = DT[, lapply(.SD, median),
                  by = list(groupname, starttime, fPhase, fCycle),
                  .SDcols = numeric_columns],
  plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle),
               numcolwise(median), na.rm = TRUE),
  idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle),
                     function(x) data.frame(
                       inadist = median(x$inadist), smldist = median(x$smldist),
                       lardist = median(x$lardist), inadur = median(x$inadur),
                       smldur = median(x$smldur), lardur = median(x$lardur),
                       emptyct = median(x$emptyct), entct = median(x$entct),
                       inact = median(x$inact), smlct = median(x$smlct),
                       larct = median(x$larct), na.rm = TRUE)),
  aggregate = aggregate(exp[, numeric_columns],
                        exp[, c("groupname", "starttime", "fPhase", "fCycle")],
                        median),
  replications = 5)

##         test replications elapsed relative user.self
## 4  aggregate            5    5.42    1.789      5.30
## 1 data.table            5    3.03    1.000      3.03
## 3 idataframe            5   11.81    3.898     11.77
## 2       plyr            5    9.47    3.125      9.45

