rOpenSci, Statistical Software, and the R Validation Hub

By Mark Padgham | September 21, 2020

Background

rOpenSci is an organization devoted to “transforming science through data, software and reproducibility.” One of rOpenSci’s focal activities is peer review of R packages, which has historically focused on packages covering the data management lifecycle. That focus has excluded software implementing statistical methods, for which standards and review require addressing a different set of challenges. This year we have begun tackling these challenges in order to expand our peer review system to explicitly encompass statistical software, under a project funded by the Alfred P. Sloan Foundation.

Two goals for the project are to develop sets of standards against which statistical R packages can be reviewed, and to develop a suite of tools to support this assessment. Many of these tools are intended to function automatically: to provide overviews of software structure and function, and to diagnose and report errors, warnings, and other diagnostic messages issued during execution of statistical software functions.

These tools relate closely to R Validation Hub projects, including the riskmetric package and the Risk Assessment Application. Both R Validation Hub and rOpenSci aim to automate, as much as possible, the production of reports that can be used to evaluate software. We have distinct aims and scope, however, resulting in a complementary set of tools, which this blog post aims to highlight.

Package Reporting

Our automated tools aim to provide peer reviewers with information that helps them understand the structure and functionality of the R packages they are evaluating, so they can better undertake the parts of a review that cannot be automated. The first of these tools is packgraph, which provides a templated report on function call graphs in an R package.

packgraph provides an overview of package structure and the inter-relationships between package functions, along with an optional interactive visualization of the network of function calls within a package. Function call networks are commonly divided among distinct clusters of locally inter-connected functions, and the resultant visualization uses a different colour to distinguish each cluster. Applying the primary pg_graph() function to the riskmetric package generates the interactive graphical representation shown below.
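A minimal sketch of the call that generates this visualization (assuming the default plotting behaviour of pg_graph(); the plot = FALSE form shown further below suppresses it):

library (packgraph)
# build the function call network and open the interactive visualization
# (interactive plotting is assumed to be the default behaviour)
pg_graph ("/<local>/<path>/<to>/riskmetric")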

Each node of the network is a function, with size scaled by how many times that function is called. Each line reflects a call from one function to another, with thickness scaled by the number of calls between those two functions. The function at the centre of the purple star shape is the core pkg_metric function, with the long tail representing functions for processing errors and warnings. The graph provides an immediate visual representation of overall package structure, revealing in the case of the riskmetric package a large number of effectively independent functions which are not directly called by other functions. Most of these isolated functions implement the various assessment metrics and associated caching procedures, reflecting the modular design of the package: assessments, and the connections between these peripheral isolated functions, are controlled by the user rather than being hard-coded within the package.

Most packages have more clearly defined clusters of interconnections, which this interactive graphical output can help to explore and understand. The pg_report() function also generates a tabular summary of this function call network. By default, pg_report() summarizes only the inter-relationships between exported functions of a package, although setting exported_only = FALSE will yield a summary of inter-relationships between all functions of a package. Here is the summary of the exported functions of the riskmetric package:

library (packgraph)
pkg_source <- "/<local>/<path>/<to>/riskmetric"
g <- pg_graph (pkg_source, plot = FALSE)
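# by default, pg_report() summarizes only exported functions;
# pg_report (g, exported_only = FALSE) would also summarize non-exported ones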
pg_report (g)
== riskmetric ==================================================================

The package has 24 exported functions, and 154 non-exported functions. The
exported functions are structured into the following 3 primary clusters
containing 2, 9 and 2 functions


| cluster|  n|name       | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:----------|----------:|-------------:|-------------:|-----------------:|----------:|
|       1|  1|pkg_ref    |          2|           400|             9|                 0|          1|
|       1|  2|as_pkg_ref |          2|            29|             0|                 0|         NA|


| cluster|  n|name                        | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:---------------------------|----------:|-------------:|-------------:|-----------------:|----------:|
|       2|  1|assessment_error_empty      |          2|            48|             9|                 0|          2|
|       2|  2|assess_has_bug_reports_url  |          2|            76|             0|                 1|         NA|
|       2|  3|assess_has_maintainer       |          2|            45|             5|                 1|         NA|
|       2|  4|assess_has_source_control   |          2|            53|             8|                 1|         NA|
|       2|  5|assess_has_website          |          2|            46|            76|                 1|         NA|
|       2|  6|assess_license              |          2|            42|            13|                 1|         NA|
|       2|  7|assessment_error_as_warning |          3|            62|            37|                 0|         NA|
|       2|  8|assessment_error_throw      |          3|            56|            23|                 0|         NA|
|       2|  9|pkg_metric                  |          3|            87|             0|                 0|         NA|


| cluster|  n|name            | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:---------------|----------:|-------------:|-------------:|-----------------:|----------:|
|       3|  1|all_assessments |          0|            57|             6|                 0|         NA|
|       3|  2|pkg_assess      |          4|           162|             1|                 0|         NA|

There are also 11 isolated functions:


|  n|name                       | loc|
|--:|:--------------------------|---:|
|  1|assess_covr_coverage       |   3|
|  2|assess_downloads_1yr       |   3|
|  3|assess_export_help         |   3|
|  4|assess_has_news            |   3|
|  5|assess_has_vignettes       |   3|
|  6|assess_last_30_bugs_status |   3|
|  7|assess_news_current        |   3|
|  8|coverage                   |   3|
|  9|metric_score               |   3|
| 10|pkg_score                  |   3|
| 11|summarize_scores           |   3|

-- Documentation of non-exported functions -------------------------------------


|value  | doclines| cmtlines|
|:------|--------:|--------:|
|mean   |      4.3|      0.6|
|median |      2.0|      0.0|

The cluster shown in purple in the preceding image has only two exported functions, yet is still identified as the primary cluster in this output because it connects the largest number of internal and exported functions within the package.

Even when called in default mode to report only on exported functions, the pg_report() function concludes with a statistical summary of the documentation of non-exported functions. All functions should of course be documented, and these final numbers reveal that the non-exported functions of the riskmetric package have a median of 2 lines of documentation each, with a median of zero comment lines, which also reflects good, clean coding practice. The output of the packgraph package is intended to be provided at the outset of our review process as an aid to reviewers.

packgraph and its main dependency, the pkgapi package, can be installed from GitHub with:

remotes::install_github("r-lib/pkgapi")
remotes::install_github("ropenscilabs/packgraph")

Package Testing

Package reporting is primarily intended as an aid to reviewers of packages submitted to our peer review system. We are also developing tools to aid package developers, foremost among which is a package for automatic testing of statistical software called autotest. The package implements a form of “mutation testing” (sometimes called “mutation fuzzing”): it mutates the objects passed to a package’s functions, automatically testing the functions’ responses to a variety of potential inputs. This frees authors from needing to develop tests for myriad possible edge cases.

autotest extracts all example code for a package, parses those examples to examine all objects being thrown at the package’s functions, and then mutates those objects to assess what happens. The package will ultimately have a workflow entirely compatible with riskmetric, and so will act as a plug-in extension to that package, with automatic tests themselves being user-controlled and modular.

Current tests include mutations of the value, size, class, and other structural properties of inputs. Some mutations are expected to be acceptable – for example, a documented call to some function myfn (x = TRUE) would also be expected to work with x = FALSE – while others are expected to generate warnings or errors, such as passing x = "a" to that same function. Robust software should accept all appropriate mutations of inputs while rejecting all inappropriate ones. autotest only produces output where these expectations are not met, as illustrated in the sketch below.
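The following hypothetical sketch illustrates those expectations; myfn and its behaviour are invented here purely for illustration, and are not part of autotest:

# hypothetical function accepting a single logical parameter
myfn <- function (x = TRUE) {
    if (!is.logical (x) || length (x) != 1L)
        stop ("x must be a single logical value")
    if (x) "one result" else "another result"
}
myfn (x = TRUE)  # as documented in an example: works
myfn (x = FALSE) # mutated value of the same class: expected to also work
myfn (x = "a")   # mutated class: expected to error, and here it does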

The package is intended as a developer tool, because all packages submitted to our peer review system will be expected to yield clean results when run through autotest. It can be applied by anyone developing packages from the moment they implement their first exported function. The hope is that ongoing usage throughout the development of any statistical (or other) software will enhance its robustness, and reduce the chance of unexpected behaviour in response to inputs which developers may not otherwise have anticipated.

Finally, the autotest package will also feed into our reporting system, with its output included in the reports provided to reviewers. Most importantly, we intend to implement mechanisms enabling users to control which tests are run on any particular package, and obliging those submitting to our system to provide descriptive justifications for why particular tests have been switched off. These textual explanations will then also form part of our reviewer reports, enabling reviewers to understand not only which kinds of tests package developers deem inappropriate for their software, but more importantly why.

Autotesting the riskmetric package

What happens when autotest is applied to the riskmetric package? The main function that does the work is autotest_package(), as demonstrated with the following code:

library (autotest)
system.time (
    x <- autotest_package ("/<local>/<path>/<to>/riskmetric")
    )
- parsing all package examples
v parsed all package examples
   user  system elapsed 
  12.41    2.31   20.83 

As you can see, the function takes a few seconds to run. It returns a tibble, each row of which represents a test expectation which was not fulfilled. The package also implements a summary method for these objects, an edited part of which looks like this:

summary (x) 
autotesting package [riskmetric, v0.1.0.9001] generated 13 rows of output of the following types:
     0 errors
     13 warnings
     0 messages
     0 other diagnostics
That corresponds to NaN messages per documented function (which has examples)

                       fn_name num_errors num_warnings num_messages
1              all_assessments         NA            1           NA
2                   as_pkg_ref         NA            1           NA
3  assessment_error_as_warning         NA            1           NA
4       assessment_error_empty         NA            1           NA
5       assessment_error_throw         NA            1           NA
6                     coverage         NA            1           NA
7                 metric_score         NA            1           NA
8                   pkg_assess         NA            1           NA
9                   pkg_metric         NA            1           NA
10                     pkg_ref         NA            1           NA
11         score_error_default         NA            1           NA
12              score_error_NA         NA            1           NA
13            score_error_zero         NA            1           NA
   num_diagnostics
1               NA
2               NA
3               NA
4               NA
5               NA
6               NA
7               NA
8               NA
9               NA
10              NA
11              NA
12              NA
13              NA

In addition to the values in that table, the output includes 13 functions which have no documented examples: 
    1. all_assessments
    2. as_pkg_ref
    3. assessment_error_as_warning
    4. assessment_error_empty
    5. assessment_error_throw
    6. coverage
    7. metric_score
    8. pkg_assess
    9. pkg_metric
    10. pkg_ref
    11. score_error_default
    12. score_error_NA
    13. score_error_zero

    git hash for package as analysed here:
    [164a2e89acfce535d29d8e8ee95f8e19c85314e3]

The result contained no errors or diagnostic messages, but 13 warnings for functions which have no documented examples. These are considered warnings because the autotest package primarily works by scraping the example code for each function, so functions with no examples cannot be tested. A clean autotest result could thus be achieved for the riskmetric package by providing example code for each of the listed functions (and ensuring that the resultant autotest-ing of those examples generated no additional output).
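As a purely hypothetical sketch, the warning for pkg_ref() might be resolved by adding a roxygen2 @examples section along the following lines (this assumes riskmetric documents its functions with roxygen2 and that pkg_ref() accepts a package name; the exact example code is illustrative only):

#' @examples
#' # construct a package reference for an installed package
#' ref <- pkg_ref ("riskmetric")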

Package Standards and Peer Review

In addition to the automated tools described in the preceding two sections, a large part of the project is devoted to devising standards for statistical software. One challenge we have found in developing standards is how varied and method-specific best practices for statistical software can be. As such, we are using a two-tiered approach: a “general” set of standards applicable to all packages, and specific standards for sub-categories of statistical software. A package may fall within multiple sub-categories, in which case more than one set of these specific standards will apply to it.

We are beginning with 11 statistical sub-categories, based on a practical taxonomy of R packages submitted to statistical journals and conferences. Full details of the categories and standards can be found in the primary “living book” of the project, which describes the current categories of:

  1. Bayesian and Monte Carlo Routines
  2. Dimensionality Reduction, Clustering, and Unsupervised Learning
  3. Machine Learning
  4. Regression and Supervised Learning
  5. Probability Distributions
  6. Wrapper Packages
  7. Networks
  8. Exploratory Data Analysis (EDA) and Summary Statistics
  9. Workflow Support
  10. Spatial Analyses
  11. Time Series Analyses

The tools described above aim to make the task of reviewing packages as easy as possible. The category-specific standards aim to ensure that software accepted as part of our system is of the highest possible quality. One of the primary tasks of reviewers will be to assess software against these standards.

Currently we have initial standards for five of these categories, and have released an initial call for “pilot submissions” within those categories to help us test and improve the standards and the process of peer review. We invite any developers reading this blog who might be interested in submitting a statistical software package for peer review to contact us (Mark Padgham and/or Noam Ross) about a “pilot submission”. Your contribution would help improve the quality of our system, while our assessments and reviews would help improve the quality of your software. We look forward to any contributions that help improve our system for peer review of statistical software, and ultimately the quality of statistical software in R.