Chapter 3 Implementing the Automatic Statistician in R

This chapter focuses on the Automatic Statistician, a project that introduces report-generating systems for data science, incorporating descriptive statistics, automated model construction, and natural-language explanation (Steinruecken et al. 2019). In the following, the key characteristics that make up an Automatic Statistician are highlighted. Then, various implementations of the Automatic Statistician are presented, including an Automatic Statistician for the R language (AutoStatR) that was implemented in the context of this seminar paper. Last, challenges and limitations of the Automatic Statistician are explored with one eye on the experience gained while implementing AutoStatR.

3.1 The Automatic Statistician: A philosophy for automating machine learning

In 2014, a project called “The Automatic Statistician” won a $750,000 Google Focused Research Award (CambridgeUniversity 2014). The project, led by Zoubin Ghahramani, aims to automate the process of data science, reducing the many person-hours and high-value expertise required to select the best combination of models and parameters. Beyond that, the Automatic Statistician produces predictions and human-readable reports from raw datasets while reducing the need for human intervention. It consists of several components, as described by Steinruecken et al. (2019):

  • Basic graphs and statistics: A first overview of the dataset’s features is provided. It can be used to verify that the dataset was loaded correctly.
  • Automated construction of models: A suitable model has to be selected from a fixed or open-ended set of models. This model is then trained on the provided dataset.
  • Explanation of the model: The patterns that have been found are explained to the user, which requires a certain degree of interpretation.
  • Report curator: A software component that turns these results into a human-readable report. The content of the report fully depends on the dataset and the evaluated models. It should give insights into the data to a larger group of people.
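To make the interplay of these components concrete, they can be pictured as a simple pipeline. The following R sketch is purely illustrative; all function names are hypothetical and do not belong to any existing package:

```r
# Illustrative sketch only: these helper functions are hypothetical.
automatic_statistician <- function(data, target) {
  overview <- summarise_data(data)          # basic graphs and statistics
  model    <- construct_model(data, target) # automated construction of models
  notes    <- explain_model(model, data)    # explanation of the model
  write_report(overview, model, notes)      # report curator
}
```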

Over the years, multiple versions of the Automatic Statistician have been built by different people. Each has a slightly different purpose, but all incorporate the philosophy of the Automatic Statistician and intend to automate data science (Steinruecken et al. 2019). The next section gives an overview of the research on the Automatic Statistician so far.

3.2 Examples of Automatic Statisticians

Since the Automatic Statistician was announced, numerous authors have contributed individual work to the project. For example, Lloyd et al. (2014) present an Automatic Statistician for regression which explores an open-ended space of models to produce a natural-language report; it is also discussed by Steinruecken et al. (2019). They make use of Gaussian Processes and their strength in modelling high-level properties of functions (e.g. smoothness, trends, periodicity), which can be used directly for the model explanation. Hwang, Tong, and Choi (2015) proceed similarly, but construct natural-language descriptions of time-series data, again using Gaussian Processes. While most papers focus on an Automatic Statistician for regression, research has been done on classification problems as well: Mrkšić (2014) builds on earlier work on regression with Gaussian Processes and contributes a model search procedure for classification problems.

In summary, a considerable amount of research on the Automatic Statistician already exists. Various authors have focused on different types of problems, all incorporating the ability to generate human-readable reports. The focus so far has mainly been on Gaussian Processes, which provide direct model explanation while constructing the model (Lloyd et al. 2014). Models with this property are called interpretable models (Molnar 2019). Interpretability in this context can be seen as “the degree to which an observer can understand the cause of a decision” (Miller 2019) of a machine learning model. While interpretable models provide an easy way of achieving interpretability, they also suffer in terms of flexibility, as each model yields a different type of interpretation and thereby binds the developer to the selected model type (Molnar 2019). There is, however, an approach that provides more flexibility in model selection: model-agnostic methods. This approach was selected for the Automatic Statistician presented in this paper and is explained in more detail in the next section.

3.3 Model-agnostic methods

Model-agnostic methods address the task of interpreting and explaining any machine learning model, including those that appear as a black box (Molnar 2019). While some machine learning models already incorporate a certain degree of interpretability (such as decision trees), others clearly do not (Molnar 2019). In the latter case, model-agnostic methods can be used to separate the explanation from the machine learning model. These methods are applied to the already optimized model and provide insights into its behaviour. They have the advantage that developers are free to select their machine learning model and do not have to commit to a fixed model (e.g. Gaussian Processes) which might not be suitable for a certain use case (Molnar 2019). Model-agnostic methods leverage the strength and diversity of the full range of machine learning models while still providing a degree of explainability. In the next section, this alternative approach is used in an Automatic Statistician for the R language.

3.4 AutoStatR: An Automatic Statistician for the R language

While Automatic Statisticians so far have been implemented using interpretable models (specifically Gaussian Processes), we introduce an Automatic Statistician built with model-agnostic methods. This R package can use a wide variety of machine learning models and still provide interpretability to the user by exploiting the strength of model-agnostic methods. It follows the philosophy and includes the key components of an Automatic Statistician as presented by Steinruecken et al. (2019), and can be applied to classification problems. It summarizes a data set, re-uses the R version of TPOT (tpotr) to build a machine learning model, explains this model through model-agnostic methods, and outputs an HTML report. The following subsections describe the implementation of the four core components of an Automatic Statistician in AutoStatR.

3.4.0.1 Data set overview

The first component provides an overview of the data set that is loaded into the Automatic Statistician. For this purpose, the package summarytools is used. It provides methods for a quick and simple overview of all features and feature values in the data set and visualizes them in a table format. The following information is provided:

  • The number of the feature, indicating the order in which it appears in the data set
  • The name of the feature and its class
  • An insight into the feature’s values: the frequency, proportions, or number of distinct values
  • A histogram or barplot of the feature’s values
  • The number and proportion of valid and missing values in the feature
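Assuming the data set is available as a data frame, this kind of overview corresponds to summarytools’ dfSummary() function, which produces exactly such a per-feature table (number, name, class, value statistics, a small plot, and valid/missing counts):

```r
library(summarytools)

# Per-feature overview table for the built-in iris data set;
# inside an R Markdown report it can be rendered as an HTML table.
print(dfSummary(iris), method = "render")
```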

3.4.0.2 Machine learning model construction

The second component deals with the search and evaluation of machine learning models. Unlike Steinruecken et al. (2019), who make use of Gaussian Processes, AutoStatR may use any machine learning model. We want to use the model that fits the input data set best rather than restrict the range of selectable models. This is possible because we separate the model search from the model interpretation, as described in the previous section on model-agnostic methods. In theory, AutoStatR can select any interpretable as well as black-box model; in practice, because it uses the R implementation of TPOT, tpotr, which was introduced previously, it is limited to the models included in the TPOT package. Although the approach of model-agnostic explanation differs significantly from the interpretable Gaussian Processes approach, TPOT fulfills three of the four key ingredients of model search and evaluation introduced by Steinruecken et al. (2019):

First, the open-ended language of models is provided by TPOT, because it uses various machine learning operators and models and can construct arbitrary pipelines from them to represent different real-world phenomena. Second, TPOT provides a search procedure, using genetic programming to explore the language of models. Third, TPOT provides a principled method of evaluating models, which trades off model complexity against fit to the data; this is ensured by the NSGA-II selection scheme in TPOT (see chapter 2.1). The fourth ingredient, “automatic explanation”, is not provided by TPOT itself, but by the model-agnostic explanation approach.

3.4.0.3 Model interpretation

After gaining an overview of the data set and constructing a machine learning model, the third component of AutoStatR deals with model explanation and interpretation. Any Automatic Statistician should be able to make the assumptions of the model explicit to the user in an accurate and intelligible way (Steinruecken et al. 2019).

Since the machine learning pipelines generated by TPOT are generally black-box machine learning models, the assumptions of the model can only be made explicit through the use of model-agnostic methods. The book Interpretable Machine Learning (Molnar 2019) provides an overview of such methods with respective implementations in R. From the model-agnostic methods described in the book, three were found to be suitable for use in a report.

Feature importance is selected as the first method for model interpretation because it offers a conceptually easy-to-understand interpretation and a compressed, global insight into the model. As the first method displayed in the AutoStatR report, it offers a good entry point for the user. Feature importance is measured as the increase in the prediction error of the model after the feature’s values have been permuted. Feature importance can therefore also be interpreted as the increase in model error when the information in a feature is destroyed (Molnar 2019).
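As a sketch of how permutation feature importance can be computed with the iml package, which accompanies Molnar (2019), consider the following example; the random forest is only a stand-in for the pipeline that TPOT would construct:

```r
library(iml)
library(randomForest)

# Stand-in model; AutoStatR would use the pipeline found by tpotr instead.
rf <- randomForest(Species ~ ., data = iris)
predictor <- Predictor$new(rf, data = iris[, -5], y = iris$Species)

# Permutation feature importance: increase in classification error ("ce")
# after permuting each feature in turn.
imp <- FeatureImp$new(predictor, loss = "ce")
plot(imp)
```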

As the second method, accumulated local effects (ALE) are selected, because they describe how an individual feature influences the prediction of the model on average. Thus, ALE offer a more detailed insight into the model than feature importance. Compared to similar methods such as partial dependence plots, ALE have the advantage of providing an unbiased estimate of a feature’s influence even when features are correlated (Molnar 2019). In AutoStatR, the two most important features are selected based on the feature importance analysis, and ALE are subsequently computed for these features.
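ALE can likewise be computed with the iml package; the following sketch again uses a stand-in model, and the feature name is only an assumed example:

```r
library(iml)
library(randomForest)

# Stand-in model; AutoStatR would pass the TPOT pipeline instead.
rf <- randomForest(Species ~ ., data = iris)
predictor <- Predictor$new(rf, data = iris[, -5], y = iris$Species)

# ALE: how one feature influences the prediction on average,
# with unbiased estimates even under correlated features.
ale <- FeatureEffect$new(predictor, feature = "Petal.Width", method = "ale")
plot(ale)
```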

Lime is selected as the third method. It provides the most detailed insight into the model, because it looks at individual observations and can therefore be categorized as a local, interpretable surrogate model (Molnar 2019). Lime helps to explain the model’s decision for individual observations. Consequently, Lime is used in AutoStatR to explain the decisions of the model for the provided test data.
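A minimal sketch of this step with the lime package could look as follows; the lda model from MASS is chosen only because lime supports it out of the box, and again stands in for the TPOT pipeline:

```r
library(lime)
library(MASS)

fit <- lda(Species ~ ., data = iris)

# Build an explainer on the training data, then explain which feature
# values drove the prediction for a few individual observations.
explainer   <- lime(iris[, -5], fit)
explanation <- explain(iris[1:3, -5], explainer,
                       n_labels = 1, n_features = 2)
plot_features(explanation)
```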

3.4.0.4 Report curator

The last component of AutoStatR is the report curator, which wraps the results from the previously described components in an HTML report. The report contains the data set overview, the best-fitting model, and the model explanation. It is based on graphs, tables, and natural-language descriptions, which aim to enhance comprehensibility for non-experts.

Technically, the report curator is located in the file /inst/rmarkdown/report.rmd and is called from the method autostatr() in /R/main.R when the user starts AutoStatR. The user provides the data set for training, the input data for which AutoStatR should make predictions, and the target variable. The report is then generated automatically.
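Internally, this amounts to rendering the parameterized R Markdown template. The following sketch shows what such a call could look like; the package name and parameter names are illustrative, not the actual interface of AutoStatR:

```r
# Hypothetical sketch of how autostatr() might invoke the report template.
rmarkdown::render(
  input       = system.file("rmarkdown/report.rmd", package = "autostatr"),
  params      = list(train = train_df, newdata = test_df, target = "Species"),
  output_file = "report.html"
)
```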

One challenge for the report curator is generating natural-language descriptions, especially considering that the provided data set can have an arbitrary size and the target column can have an arbitrary number of levels. How the report curator handles this is demonstrated below, using the description of the first ALE plot as an example.

for (i in seq_along(levels)) {
  level <- levels[i]
  # keep only the ALE rows for the current target class level
  temp.result2 <- subset(temp.result, .class == level)
  ale.desc[i] <- paste0("The class <b>", level, "</b> is, based on the feature <b>",
                        first.feature, "</b>, most likely for values between ",
                        temp.result2[[1, "min"]], " and ", temp.result2[[1, "max"]], ". ")
}

Listing 3 shows how the dynamic description is generated. The variable temp.result contains the result of the ALE analysis for the most important feature. More precisely, temp.result contains, for each target class level, the value range of the most important feature in which that level is most likely. The for loop iterates over these levels and generates a natural-language description for each of them.

<p>For the given classification problem, the most important feature according to the feature importance analysis is <b>`r imp.features[1]`</b>. The ALE plot provides an analysis of how this feature influences the target <b>`r target`</b> with its `r length(levels)` classes. `r paste(ale.desc, collapse = "")`</p>

This generated description, which is saved in ale.desc, is then used, among other variables, in the HTML source code of the report, as can be seen in Listing 4.

3.5 Challenges of the Automatic Statistician

Due to the conceptually ambitious design of Automatic Statisticians, the implementation of AutoStatR faced a variety of challenges, some of which remain open for future research. Steinruecken et al. (2019) already list several design challenges that go along with the implementation of an Automatic Statistician:

  • User interaction: The user should be able to interact with the system and influence the choices it makes. The system then engages in a dialogue with the user to explain the results that were found.
  • Missing and messy data: Some machine learning models struggle with missing values and other defects in the data set. Therefore, automatic data pre-processing is an important requirement.
  • Resource allocation: Resource constraints such as limited computer power or limited time should be handled by the Automatic Statistician.

While TPOT incorporates mechanisms for resource allocation, such as restricting processing time, and also pre-processes the data in its pipelines to a certain degree (e.g. feature selection), user interaction is still missing from AutoStatR. Adequate user interaction requires a technology that can provide a rich user experience, such as a web-based solution; since AutoStatR is an R package, such interaction might be built on top of it. Apart from that, two main challenges observed during the development and testing of AutoStatR remain, which we highlight in the following.

3.5.0.1 Data input

Data sets given to an Automatic Statistician can be of arbitrary size and quality. First of all, they can range from a few features and observations up to thousands of features and millions of rows. Since the output of an Automatic Statistician is a report, this report suffers when the data size gets too large, because the amount of visualized content becomes very complex. The question is whether the report is still a suitable output format for massive data sets. Second, data pre-processing often requires a large amount of manual work, commonly estimated to make up as much as 70% of the effort in data science projects. The problem is that missing or dirty data can be so diverse, ranging from missing values over wrong values to misspelled column names, that automatic pre-processing can be a project of its own. Of course, missing data can be regenerated through automatic data imputation, but bad data quality is much more than missing values, and it should be possible for the user to intervene manually as well.
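To illustrate how limited purely automatic cleaning is, a minimal base-R imputation covers only the simplest case, missing values, and none of the other defects mentioned above:

```r
# Minimal sketch: fill numeric NAs with the column median and
# categorical NAs with the most frequent level.
impute_simple <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[col]] <- x
  }
  df
}

df <- data.frame(a = c(1, NA, 3), b = factor(c("x", "x", NA)))
impute_simple(df)   # a: 1, 2, 3; b: x, x, x
```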

3.5.0.2 Trade-off between model fit and interpretability

The second point we want to mention here relates to the model-agnostic approach used in AutoStatR. We already mentioned that model-agnostic methods are significantly different from using interpretable machine learning models, and there is clearly a trade-off to be made. While interpretable models best explain the inner dependencies of the model but restrict the repertoire of models to select from, the model-agnostic approach can use any machine learning model but yields less sophisticated explanations as the model gets too complex (Molnar 2019).

References

CambridgeUniversity. 2014. “Google Awards $750,000 for The Automatic Statistician.” http://mlg.eng.cam.ac.uk/?p=1578.

Hwang, Yunseong, Anh Tong, and Jaesik Choi. 2015. “The Automatic Statistician: A Relational Perspective.” arXiv:1511.08343 [cs.LG].

Lloyd, James Robert, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. 2014. “Automatic Construction and Natural-Language Description of Nonparametric Regression Models.” In Twenty-Eighth Aaai Conference on Artificial Intelligence.

Miller, Tim. 2019. “Explanation in Artificial Intelligence: Insights from the Social Sciences.” Artif. Intell. 267: 1–38.

Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.

Mrkšić, Nikola. 2014. “Kernel Structure Discovery for Gaussian Process Classification.” Master’s thesis, Computer Laboratory, University of Cambridge.

Steinruecken, Christian, Emma Smith, David Janz, James Lloyd, and Zoubin Ghahramani. 2019. “The Automatic Statistician.” In Automated Machine Learning. Springer.