An analysis of 19,000 R scripts hosted by The Dataverse Project
Author
Vincent Arel-Bundock
Published
January 13, 2023
I downloaded 19,162 R scripts from 4,539 projects hosted by The Dataverse Project. This notebook reports usage statistics for R packages in this large sample of real-life scientific applications.
To download data from Dataverse, I adapted a script from Trisovic et al. (2022) and wrote original Python code. Then, I used the renv::dependencies() function from the renv package for R (Ushey, 2022) to extract the names of R packages used in each script.
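As a rough illustration of the extraction step, the sketch below writes a small throwaway script to a temporary file and passes it to renv::dependencies(). The script contents are made up for this example, and this is not the code used for the analysis; it only shows how the function reports package names for a single file.

```r
# Minimal sketch: extract package names from one R script with renv.
# The script contents below are hypothetical, written only for this example.
script = tempfile(fileext = ".R")
writeLines(c(
  "library(data.table)",
  "library(ggplot2)",
  "x = MASS::mvrnorm(10, mu = 0, Sigma = 1)"
), script)

# renv::dependencies() returns a data frame with one row per detected
# dependency; the `Package` column holds the package name.
deps = renv::dependencies(script, quiet = TRUE, progress = FALSE)
pkgs = sort(unique(deps$Package))
print(pkgs)
```

Applied to every downloaded script, a call like this would give one row per script and package, which is the shape of data the code further down appears to assume.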
WARNING: This was a very quick job and I did very little quality control on the data. Please take all this with a grain of salt.
Trisovic, A., Lau, M.K., Pasquier, T. et al. A large-scale study on research code quality and execution. Sci Data 9, 60 (2022). https://doi.org/10.1038/s41597-022-01143-6
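# packages used below; `dat` holds one row per script-package pair
library(data.table)
library(anytime)
library(ggplot2)
library(patchwork)

# assign each observation to the 15th of its month
dat[, month := anytime(format(date, "%Y-%m-15"))]
# number of distinct projects and of package uses per month
projects = dat[, .(N = length(unique(dataset_id))), by = "month"]
packages = dat[, .N, by = "month"]
p1 = ggplot(projects, aes(month, N)) +
  geom_line() +
  labs(x = "", y = "", title = "Projects")
p2 = ggplot(packages, aes(month, N)) +
  geom_line() +
  labs(x = "", y = "", title = "Packages")
# patchwork places the two plots side by side
p1 + p2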
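The monthly binning works by overwriting the day of each date with 15, so that all dates in the same month collapse to one value. A minimal base R sketch of the same idea, using made-up dates and as.Date() in place of anytime():

```r
# Hypothetical dates, for illustration only.
date = as.Date(c("2021-03-02", "2021-03-28", "2021-04-10"))

# Replace the day with 15 so that all dates in a month share one value.
month = as.Date(format(date, "%Y-%m-15"))

# Count observations per monthly bin.
counts = table(month)
print(counts)
```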
Usage statistics for R packages loaded at least twice
# count only one use per project
dat_count = dat[, .(date = min(date)), by = c("dataset_id", "Package")]
dat_count = dat_count[, .(`Number of times loaded` = .N), by = "Package"]
dat_count = dat_count[order(-`Number of times loaded`)]
dat_count = dat_count[`Number of times loaded` > 1]
DT::datatable(dat_count, options = list(pageLength = 50), rownames = FALSE, width = 300)
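To make the counting rule concrete, here is a small sketch of the same logic on toy data. The project identifiers, packages, and dates are invented, and the column name N stands in for the longer label used above. A package loaded in several scripts of one project is counted once, so the final number is the count of projects using each package.

```r
library(data.table)

# Hypothetical toy data: one row per script-package pair.
toy = data.table(
  dataset_id = c("A", "A", "A", "B", "B", "C"),
  Package    = c("ggplot2", "ggplot2", "dplyr", "ggplot2", "MASS", "dplyr"),
  date       = as.Date(c("2020-01-01", "2020-02-01", "2020-01-01",
                         "2021-05-01", "2021-05-01", "2022-03-01"))
)

# Collapse to one row per project-package pair, so that a package loaded
# in several scripts of the same project is counted once.
toy_count = toy[, .(date = min(date)), by = c("dataset_id", "Package")]

# Count the number of projects that use each package.
toy_count = toy_count[, .(N = .N), by = "Package"]

# Sort by decreasing count and keep packages used in more than one project.
toy_count = toy_count[order(-N)][N > 1]
print(toy_count)
```

In this toy example, MASS drops out because only one project uses it, even though ggplot2 appears three times in the raw rows but is credited to just two projects.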