For statistical analyses, the process of validation ensures quality output.
According to the FDA’s Glossary of Computer System Software Development Terminology:
Validation: Establishing documented evidence which provides a high degree of assurance (accuracy) that a specific process consistently (reproducibility) produces a product meeting its predetermined specifications (traceability) and quality attributes.
Since the FDA does not require use of any specific software for statistical analyses, the programming language R can be used if the R installation incorporates all of the following elements:
- Accuracy
- Reproducibility
- Traceability
R Validation Hub outlines how to assess the accuracy of R packages, and how to ensure the reproducibility and traceability of R installations.
Accuracy of R packages
When assessing the accuracy of R packages, the R Validation Hub differentiates R packages by the following types (see German et al, 20131):
- base and recommended (core) packages - developed by the R Foundation and shipped with the basic installation
- contributed (open source) packages - developed by anyone, and may differ in popularity and accuracy
Core packages and contributed packages are managed by different processes. Therefore, different requirements are needed to ensure that both types of packages reliably produces accurate results.
Base R and Recommended Packages
The R Foundation develops both the base and recommended packages, and follows practices that ensures the accuracy of each. These practices include:
- Proper maintenance of the R source code, and control of releases
- Testing the software and identifying issues for the Core Team to address
- The R Core Team hiring highly qualified individuals
- Validation testing each R release against known data and known results, and resolving all errors prior to release
After careful consideration, the R Validation Hub concludes that there is minimal risk using these core packages for regulatory analysis and reporting. For more information, see Base R and Recommended R Packages.
Contributed Packages
Since R is Open Source, contributed packages can be developed by anyone. Therefore, ensuring the accuracy of each contributed package is necessary.
R Validation Hub focuses on contributed packages on The Comprehensive R Archive Network (CRAN). All packages available on CRAN must pass a series of technical checks including an “R CMD” check. An “R CMD” check ensures that all submitted code’s:
- examples run successfully
- tests pass
- packages are compatible with other packages on CRAN
However, these checks do not guarantee the accuracy of a package and a risk assessment is necessary. The risk assessment proposed by the R Validation Hub includes:
- Package Maintenance
- release rate
- size of code base
- formal bug tracking
- Community Usage and Testing
- average downloads during in the last 12 months
- the amount of code that is tested by a formal testing framework
For more information about Package Maintenance and Community Usage and Testing, see R Packages page.
Tidyverse
According to https://www.tidyverse.org/:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Furthermore, the tidyverse is governed by a set of Design Principles that are used by the tidyverse team for consistency and to write better code.
Since tidyverse is held to such high standards and has a large user community, the R Validation Hub members are discussing if tidyverse can be labelled as minimal risk for regularly analysis and reporting.
Reproducibility of R installations
To ensure that R outputs can be reproducible, create and maintain R installations by using Docker containers and tools such as RStudio Package Manager.
As R versions and package versions change over time, creating and maintaining R installations becomes more complex. As versions are updated, define a process that checks and tests package integrity with dependencies.
Please note that investigation of reproducability is ongoing.
Traceability of R installations
Develop system and process controls to automatically document the R packages and installation dependencies that are used in R analyses.
Please note that investigation of traceability is ongoing.
-
German, D.M. & Adams, Bram & Hassan, Ahmed E.. (2013). The Evolution of the R Software Ecosystem. Proceedings of the Euromicro Conference on Software Maintenance and Reengineering, CSMR. 243-252. 10.1109/CSMR.2013.33. ↩︎