September 21, 2020
Background
rOpenSci is an organization devoted to “transforming science through data, software and reproducibility.” One of rOpenSci’s focal activities is peer review of R packages, which has historically focused on packages covering the data-management lifecycle. That focus has excluded software implementing statistical methods, for which standards and review must address a different set of challenges. This year we have begun tackling those challenges so as to expand our peer review system to explicitly encompass statistical software, under a project funded by the Alfred P. Sloan Foundation.
Two goals of the project are to develop sets of standards against which statistical R packages can be reviewed, and to develop a suite of tools to support this assessment. Many of these tools are intended to run automatically, providing overviews of software structure and functionality, and automatically diagnosing and reporting errors, warnings, and other diagnostic messages issued during execution of statistical software functions.
These tools relate closely to R Validation Hub projects, including the riskmetric
package
and the Risk Assessment
Application.
Both R Validation Hub and rOpenSci aim to automate, as
much as possible, the production of reports that can be used to evaluate software.
We have distinct aims and scope, however, resulting in a complementary
set of tools, which this blog post aims to highlight.
Package Reporting
Our automated tools aim to provide peer reviewers with information
that helps them understand the structure and functionality of the R packages they
are evaluating, so they can better undertake the parts of reviews which cannot be
automatically evaluated. The first of these tools is packgraph,
which provides a templated report on function call graphs in an R package.
packgraph provides an overview of
package structure and the inter-relationships between package functions, along with
an optional interactive visualization of the network of function calls within
a package. Function call networks are commonly
divided among distinct clusters of locally inter-connected functions, and the
resultant visualization uses a different colour to visually distinguish each
cluster. Applying the primary pg_graph() function to the riskmetric
package produces the following graphical representation:
Each node of the network is a function, with size scaled by the number of times
that function is called. Each line reflects a call from one function to
another, with thickness scaled by the number of calls between those two
functions. The function at the centre of the purple star shape is the core
pkg_metric
function, with the long tail representing functions for processing
errors and warnings. The graph provides an immediate visual representation of
overall package structure, revealing in the case of the
riskmetric
package a large number of
effectively independent functions which are not directly called by other
functions. Most of these isolated functions represent the various assessment
metrics and associated caching procedures. This reflects the modular
design of the package, in which assessments, and the connections between these
peripheral isolated functions, are controlled by the user rather than being
hard-coded within the package.
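A minimal sketch of generating that interactive view locally, assuming (based on the plot = FALSE call shown later in this post) that pg_graph() opens the interactive visualization when plot is left at its presumed default of TRUE:
library (packgraph)
# plot is presumed to default to TRUE, opening the interactive visualization
g <- pg_graph ("/<local>/<path>/<to>/riskmetric")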
Most packages have more clearly defined clusters of interconnections, which this
interactive graphical output can help to explore and understand. The
pg_report()
function also generates a tabular summary of the function call
network. By default, pg_report()
only summarizes
inter-relationships between the exported functions of a package, although setting
exported_only = FALSE
will yield a summary of inter-relationships between all
functions of a package. Here is the summary for the exported functions of the
riskmetric
package:
library (packgraph)
# path to a local clone of the riskmetric source repository
pkg_source <- "/<local>/<path>/<to>/riskmetric"
g <- pg_graph (pkg_source, plot = FALSE)
pg_report (g)
== riskmetric ==================================================================
The package has 24 exported functions, and 154 non-exported functions. The
exported functions are structured into the following 3 primary clusters
containing 2, 9 and 2 functions
| cluster| n|name | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:----------|----------:|-------------:|-------------:|-----------------:|----------:|
| 1| 1|pkg_ref | 2| 400| 9| 0| 1|
| 1| 2|as_pkg_ref | 2| 29| 0| 0| NA|
| cluster| n|name | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:---------------------------|----------:|-------------:|-------------:|-----------------:|----------:|
| 2| 1|assessment_error_empty | 2| 48| 9| 0| 2|
| 2| 2|assess_has_bug_reports_url | 2| 76| 0| 1| NA|
| 2| 3|assess_has_maintainer | 2| 45| 5| 1| NA|
| 2| 4|assess_has_source_control | 2| 53| 8| 1| NA|
| 2| 5|assess_has_website | 2| 46| 76| 1| NA|
| 2| 6|assess_license | 2| 42| 13| 1| NA|
| 2| 7|assessment_error_as_warning | 3| 62| 37| 0| NA|
| 2| 8|assessment_error_throw | 3| 56| 23| 0| NA|
| 2| 9|pkg_metric | 3| 87| 0| 0| NA|
| cluster| n|name | num_params| num_doc_words| num_doc_lines| num_example_lines| centrality|
|-------:|--:|:---------------|----------:|-------------:|-------------:|-----------------:|----------:|
| 3| 1|all_assessments | 0| 57| 6| 0| NA|
| 3| 2|pkg_assess | 4| 162| 1| 0| NA|
There are also 11 isolated functions:
| n|name | loc|
|--:|:--------------------------|---:|
| 1|assess_covr_coverage | 3|
| 2|assess_downloads_1yr | 3|
| 3|assess_export_help | 3|
| 4|assess_has_news | 3|
| 5|assess_has_vignettes | 3|
| 6|assess_last_30_bugs_status | 3|
| 7|assess_news_current | 3|
| 8|coverage | 3|
| 9|metric_score | 3|
| 10|pkg_score | 3|
| 11|summarize_scores | 3|
-- Documentation of non-exported functions -------------------------------------
|value | doclines| cmtlines|
|:------|--------:|--------:|
|mean | 4.3| 0.6|
|median | 2.0| 0.0|
The primary cluster shown in purple in the preceding image has only two exported functions, yet is still identified as the primary cluster in this output because it connects the largest number of internal and exported functions within the package.
Even when called in its default mode to report only
on exported functions, the pg_report()
function concludes with a statistical
summary of the documentation of non-exported functions. All functions should of
course be documented, and these final numbers reveal that the non-exported
functions of the riskmetric
package
have a median of 2 lines of documentation each, along with a median of no
comment lines, which also reflects good and clean coding practice. The output
of the packgraph
package is intended
to be provided at the outset of our review process as an aid to reviewers.
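As noted above, the equivalent summary covering all of the package’s functions, not just the exported ones, should be obtainable by switching the exported_only argument in the same call, for example:
pg_report (g, exported_only = FALSE)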
packgraph
and its main dependency, the pkgapi
package, can be installed from GitHub with
remotes::install_github("r-lib/pkgapi")
remotes::install_github("ropenscilabs/packgraph")
Package Testing
Package reporting is primarily intended as an aid to reviewers of packages
submitted to our peer review system. We are also developing tools to aid
package developers, foremost among which is a package for automatic testing of
statistical software called autotest.
The package implements a form of “mutation testing” (sometimes called “mutation
fuzzing”), which mutates the
objects passed to the functions of a package, automatically testing
their responses to a variety of potential inputs. This frees authors from needing
to develop tests for myriad possible edge cases.
autotest
extracts all example
code for a package, parses those examples to examine all objects being thrown
at the package’s functions, and then mutates those objects to assess what
happens. The package will ultimately have a workflow entirely compatible with
riskmetric
, and so will act as
a plug-in extension to that package, with automatic tests themselves being
user-controlled and modular.
Current tests include mutations of the value, size, class, and other structural
properties of inputs. Some mutations are expected to be acceptable: a documented
example which calls some function as myfn (x = TRUE), for instance, would also
be expected to work with x = FALSE. Other mutations are expected to
generate warnings or errors, such as passing a value of x = "a"
to that example. Robust software should accept all appropriate mutations
of inputs, while rejecting all inappropriate ones.
autotest
only produces output
where expectations are not met.
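As a rough, purely illustrative sketch of that logic (using an invented function myfn() and hand-rolled mutations, not autotest’s actual machinery), the idea looks something like this:
# Conceptual sketch only: a hypothetical myfn() and manual mutations of its
# single logical argument, not the actual autotest implementation.
myfn <- function (x = TRUE) {
    if (!is.logical (x) || length (x) != 1L) {
        stop ("x must be a single logical value")
    }
    !x
}

mutations <- list (TRUE, FALSE, NA, "a", 1.23, c (TRUE, FALSE))
should_error <- c (FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

for (i in seq_along (mutations)) {
    res <- tryCatch (myfn (x = mutations [[i]]), error = function (e) e)
    errored <- inherits (res, "error")
    if (errored != should_error [i]) {
        # report only where expectations are not met, as autotest does
        message ("unexpected behaviour for input: ",
                 paste (mutations [[i]], collapse = ", "))
    }
}
Run as written, this loop prints nothing, because every mutation behaves as expected; autotest similarly stays silent unless an expectation is violated.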
The package is intended as a developer tool, because all
packages to be submitted to our peer review system will be expected to yield
clean results when run through
autotest.
The package can be
applied by anyone developing a package from the moment they implement
their first exported function. The hope is that ongoing usage of the
package throughout the development of any statistical (or other) software will
enhance its robustness, and reduce the chance of unexpected behaviour in
response to inputs which developers may not otherwise have anticipated.
Finally, the autotest
package
will also form part of our reporting system, with its output included in the
reports provided to reviewers. Most importantly, we intend to
implement mechanisms to enable users to control which tests are run on any
particular package, and to oblige those intending to submit to our system to
provide descriptive justifications of why particular tests may have been
switched off. These textual explanations will then also form part of our
reviewer reports, enabling reviewers to understand not only which kinds of tests
package developers deem inappropriate for their software, but more importantly
why.
Autotesting the riskmetric package
What happens when autotest
is
applied to the riskmetric
package?
The main function that does the work is
autotest_package()
,
as demonstrated with the following code:
library (autotest)
system.time (
x <- autotest_package ("/<local>/<path>/<to>/riskmetric")
)
- parsing all package examples
v parsed all package examples
user system elapsed
12.41 2.31 20.83
As the timing shows, the function takes around 20 seconds to run on this package.
It returns a tibble
object, each row of which
represents a test expectation which was not fulfilled. The package also
implements a summary
method for these objects, an edited part of which looks
like this:
summary (x)
autotesting package [riskmetric, v0.1.0.9001] generated 13 rows of output of the following types:
0 errors
13 warnings
0 messages
0 other diagnostics
That corresponds to NaN messages per documented function (which has examples)
fn_name num_errors num_warnings num_messages
1 all_assessments NA 1 NA
2 as_pkg_ref NA 1 NA
3 assessment_error_as_warning NA 1 NA
4 assessment_error_empty NA 1 NA
5 assessment_error_throw NA 1 NA
6 coverage NA 1 NA
7 metric_score NA 1 NA
8 pkg_assess NA 1 NA
9 pkg_metric NA 1 NA
10 pkg_ref NA 1 NA
11 score_error_default NA 1 NA
12 score_error_NA NA 1 NA
13 score_error_zero NA 1 NA
num_diagnostics
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
11 NA
12 NA
13 NA
In addition to the values in that table, the output includes 13 functions which have no documented examples:
1. all_assessments
2. as_pkg_ref
3. assessment_error_as_warning
4. assessment_error_empty
5. assessment_error_throw
6. coverage
7. metric_score
8. pkg_assess
9. pkg_metric
10. pkg_ref
11. score_error_default
12. score_error_NA
13. score_error_zero
git hash for package as analysed here:
[164a2e89acfce535d29d8e8ee95f8e19c85314e3]
The result contained no errors or diagnostic messages, and 13 warnings for
functions which have no documented examples. These are reported as warnings
because the autotest
package
primarily works by scraping the example code of each function, so functions with
no examples cannot be tested. A clean
autotest
result could thus be
achieved for the riskmetric
package
by providing example code for each of the listed functions (and ensuring that
the resultant autotest-ing of
those examples generated no additional output).
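Providing such an example is typically just a matter of adding a roxygen2 @examples block to the function’s documentation. The following sketch uses an entirely hypothetical function, not an actual riskmetric function, to illustrate the pattern:
#' Score a hypothetical metric
#'
#' @param x A numeric value to be scored.
#' @return The value clamped to lie between 0 and 1.
#' @examples
#' # this example code is what autotest scrapes, parses, and mutates
#' score_hypothetical_metric (0.5)
#' @export
score_hypothetical_metric <- function (x) {
    pmin (1, pmax (0, x))
}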
Package Standards and Peer Review
In addition to the automated tools described in the preceding two sections, a large part of the project is devoted to devising standards for statistical software. One challenge we have found in developing standards is how varied and method-specific best practices for statistical software can be. As such, we are using a two-tiered approach: a “general” set of standards applicable to all packages, and specific standards for sub-categories of statistical software. A package may fall within multiple sub-categories, in which case more than one set of these specific standards will apply to it.
We are beginning with 11 statistical sub-categories, based on a practical taxonomy of R packages submitted to statistical journals and conferences. Full details of the categories and standards can be seen in the primary “living book” of the project, which describes the current categories of:
- Bayesian and Monte Carlo Routines
- Dimensionality Reduction, Clustering, and Unsupervised Learning
- Machine Learning
- Regression and Supervised Learning
- Probability Distributions
- Wrapper Packages
- Networks
- Exploratory Data Analysis (EDA) and Summary Statistics
- Workflow Support
- Spatial Analyses
- Time Series Analyses
The tools described above aim to make the task of reviewing packages as easy as possible. The category-specific standards aim to ensure that software accepted as part of our system is of the highest possible quality. One of the primary tasks of reviewers will be to assess software against these standards.
Currently, we have initial standards for five of these categories, and have released an initial call for “pilot submissions” within those categories to help us test and improve the standards and the process of peer review. We invite any developers reading this blog who might be interested in submitting a statistical software package for peer review to contact us (Mark Padgham mark@ropensci.org and/or Noam Ross ross@ecohealthalliance.org) about a “pilot submission”. Your contribution would help improve the quality of our system, while our assessments and reviews would help improve the quality of your software. We look forward to any contributions to help improve our system for peer review of statistical software, and ultimately for helping to improve the quality of statistical software in R.