June 2013

June 11, 2013

*[Update June 12: Data.tables functions have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili’s post]*

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I always asked myself, how many people actually download my packages. Now I finally can get an answer (… with some anxiety to get frustrated 😉

Here are the complete, self-contained R scripts to analyze these log data:

## ======================================================================

## Step 1: Download all log files

## ======================================================================

# Here's an easy way to get all the URLs in R

start <- as.Date('2012-10-01')

today <- as.Date('2013-06-10')

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900

urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

# only download the files you don't have:

missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))

dir.create("CRANlogs")

for (i in 1:length(missing_days)) {

print(paste0(i, "/", length(missing_days)))

download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))

}

## Step 1: Download all log files

## ======================================================================

# Here's an easy way to get all the URLs in R

start <- as.Date('2012-10-01')

today <- as.Date('2013-06-10')

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900

urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

# only download the files you don't have:

missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))

dir.create("CRANlogs")

for (i in 1:length(missing_days)) {

print(paste0(i, "/", length(missing_days)))

download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))

}

## ======================================================================

## Step 2: Load single data files into one big data.table

## ======================================================================

file_list <- list.files("CRANlogs", full.names=TRUE)

logs <- list()

for (file in file_list) {

print(paste("Reading", file, "..."))

logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = """,

dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)

}

# rbind together all files

library(data.table)

dat <- rbindlist(logs)

# add some keys and define variable types

dat[, date:=as.Date(date)]

dat[, package:=factor(package)]

dat[, country:=factor(country)]

dat[, weekday:=weekdays(date)]

dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

setkey(dat, package, date, week, country)

save(dat, file="CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table

# load("CRANlogs/CRANlogs.RData")

## Step 2: Load single data files into one big data.table

## ======================================================================

file_list <- list.files("CRANlogs", full.names=TRUE)

logs <- list()

for (file in file_list) {

print(paste("Reading", file, "..."))

logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = """,

dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)

}

# rbind together all files

library(data.table)

dat <- rbindlist(logs)

# add some keys and define variable types

dat[, date:=as.Date(date)]

dat[, package:=factor(package)]

dat[, country:=factor(country)]

dat[, weekday:=weekdays(date)]

dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

setkey(dat, package, date, week, country)

save(dat, file="CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table

# load("CRANlogs/CRANlogs.RData")

## ======================================================================

## Step 3: Analyze it!

## ======================================================================

library(ggplot2)

library(plyr)

str(dat)

# Overall downloads of packages

d1 <- dat[, length(week), by=package]

d1 <- d1[order(V1), ]

d1[package=="TripleR", ]

d1[package=="psych", ]

# plot 1: Compare downloads of selected packages on a weekly basis

agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5))

agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5))

## Step 3: Analyze it!

## ======================================================================

library(ggplot2)

library(plyr)

str(dat)

# Overall downloads of packages

d1 <- dat[, length(week), by=package]

d1 <- d1[order(V1), ]

d1[package=="TripleR", ]

d1[package=="psych", ]

# plot 1: Compare downloads of selected packages on a weekly basis

agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5))

agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5))

TripleR

RSA

. Actually, ~30 downloads per week (from this single mirror) is much more than I’ve expected!

To put things in perspective: package

psych

included in the plot:

Some psychological sidenotes on social comparisons:

- Downward comparisons enhance well-being, extreme upward comparisons are detrimental. Hence, do
*never*includeggplot2into your graphic!

- Upward comparisons instigate your achievement motive, and give you drive to get better. Hence, select some packages, which are slightly above your own.
- Of course, things are a bit more complicated than that …

*All source code on this post is licensed under the FreeBSD license.*

Comments (27) | Trackback

June 6, 2013

Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently the correlation between two variables.

My experience with this practice is usually frustrating, as in small sample sizes (and we will see what “small” means in this context) correlations go up and down, change sign, move from “significant” to “non-significant” and back. As an example, see Figure 1 which shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).

It is simply the order how participants dropped into the study (i.e., data has not been rearranged). In this case, the correlation started really strong (*r* = .69) and continuously decayed until it’s final *r* of .26. The light gray lines show some exemplary bootstrapped alternative trajectories.

In this particular case, at least the sign was stable (“There is a positive relationship in the population”, see also “Type-S errors”). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from “negative significant” to “positive significant”!

Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between *n* = 80 and *n* = 150.

Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported [tippy title=”here”]Schönbrodt, F. D., & Perugini, M. (in press). At what sample size do correlations stabilize? *Journal of Research in Personality*. doi:10.1016/j.jrp.2013.05.009[/tippy] [PDF, 0.39 MB]. In this paper a “corridor of stability” (*COS*) has been utilized: Deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (*POS*) is that sample size from which on a specific trajectory does not leave the COS anymore.

The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.

The bottom line is: For typical scenarios in psychology (i.e., rho = .21, w = .1, confidence = 80%), **correlations stabilize when n approaches 250**. That means, estimates with n > 250 are not only significant, they also are fairly

Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on effect size rho. The horizontal axis is cut at n = 300, although several POS were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100’000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.

Finally, Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.

The dotted lines mark the confidence levels of 80%, 90%, and 95% which were used in the publications. The *n* where the curves intersect these dotted lines indicate the values reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a chance of 50% that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w=.1).

What is the conclusion? Significance tests determine the sign of a correlation. This conclusion can be made with much lower sample sizes. However, when we want to make an accurate conclusion about the *size* of an effect with some confidence (and we do not want to make a “Type M” error), we need much larger samples.

The full R source code for the simulations can be downloaded here.

Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. *Psychological Methods*, *8*, 305–321. [PDF]

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. *Annual Review of Psychology*, *59*, 537–563. doi:10.1146/annurev.psych.59.103006.093735 [PDF]

Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? *Journal of Research in Personality, 47*, 609-612. doi:10.1016/j.jrp.2013.05.009 [PDF]

—-

Send this to a friend