Chapter 2 Tree-Based Pipeline Optimization With TPOT

2.1 Tree-Based Pipeline Optimization Tool

To automate machine learning, the Tree-Based Pipeline Optimization Tool (TPOT) uses genetic programming (GP) to represent and optimize machine learning pipelines (Olson and Moore 2018).

GP is an evolutionary computation technique used to search for models with maximal fitness. Its distinguishing characteristic is the use of a tree structure as the representation, which allows recombination by exchanging subtrees and mutation by random changes to the tree structure (Eiben, Smith, and others 2003).
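These two variation operators can be illustrated with a minimal Python sketch. The tuple-based tree encoding and all node names below are our own illustration, not TPOT's internal representation:

```python
# A GP individual is a nested tuple (node, *children); terminals are strings.

def get_subtree(tree, path):
    """Return the subtree reached by following child indices in `path`."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i, rest = path[0], path[1:]
    return tree[:i] + (replace_subtree(tree[i], rest, new),) + tree[i + 1:]

def crossover(a, b, path_a, path_b):
    """Recombine two trees by exchanging the subtrees at the given paths."""
    sub_a, sub_b = get_subtree(a, path_a), get_subtree(b, path_b)
    return replace_subtree(a, path_a, sub_b), replace_subtree(b, path_b, sub_a)

def point_mutation(tree, path, new_node):
    """Mutate by overwriting the node label at `path`, keeping its children."""
    sub = get_subtree(tree, path)
    mutated = (new_node,) + sub[1:] if isinstance(sub, tuple) else new_node
    return replace_subtree(tree, path, mutated)

# Two toy pipelines encoded as trees
parent1 = ("LogisticRegression", ("SelectKBest", "data"))
parent2 = ("RandomForest", ("PCA", ("Scaler", "data")))

# Swap the subtrees below each root
child1, child2 = crossover(parent1, parent2, (1,), (1,))
print(child1)  # ('LogisticRegression', ('PCA', ('Scaler', 'data')))

# Replace the root primitive of parent1
mutant = point_mutation(parent1, (), "RandomForest")
print(mutant)  # ('RandomForest', ('SelectKBest', 'data'))
```

In a full GP system the paths would be chosen at random; here they are fixed so the effect of each operator is easy to follow.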

TPOT represents machine learning pipelines by treating the pipeline operators as GP primitives and the data set as GP terminals, which allows for arbitrary and flexible pipeline structures. TPOT organizes the machine learning operators into three categories: feature preprocessing operators, feature selection operators, and supervised classification operators. Fig. 2.1 shows an exemplary pipeline: the data set is manipulated successively by each operator, and the resulting data set is fed into a logistic regression algorithm to solve a classification task (Olson and Moore 2018).


Figure 2.1: Exemplary Tree-based pipeline (Olson and Moore 2018)
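To make this successive manipulation concrete, the following minimal Python sketch evaluates such a tree bottom-up. Operator names and the encoding are illustrative only (TPOT applies scikit-learn operators); the classifier at the root is omitted for brevity:

```python
# Toy operators: each takes a list of feature rows and returns a new one.
# (Illustrative stand-ins, not TPOT's actual scikit-learn operators.)
OPERATORS = {
    "Scale": lambda rows: [[x / 10.0 for x in r] for r in rows],
    "DropFirst": lambda rows: [r[1:] for r in rows],  # crude feature selection
}

def evaluate(tree, data):
    """Recursively apply operators bottom-up; the terminal yields the data set."""
    if tree == "data":
        return data
    op, child = tree
    return OPERATORS[op](evaluate(child, data))

# The pipeline tree: first drop a feature, then scale the remainder.
pipeline = ("Scale", ("DropFirst", "data"))
features = evaluate(pipeline, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(features)  # [[0.2, 0.3], [0.5, 0.6]]
```

The final feature matrix would then be handed to the classifier at the root of the tree.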

The pipeline is optimized with a standard genetic programming algorithm, using classification accuracy as the pipeline's fitness. The GP algorithm generates 100 random pipelines and selects the top 20 according to the NSGA-II selection scheme. During this phase, pipelines are selected to maximize classification accuracy while at the same time minimizing the number of pipeline operators. Each of the top 20 pipelines produces five offspring; 5% of the offspring are recombined via one-point crossover, and 90% of the remaining offspring are randomly altered by a point, insert, or shrink mutation of the tree structure, i.e., of the machine learning pipeline. After 100 generations, the pipeline with the highest classification accuracy is selected from the Pareto front (Olson and Moore 2018).
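The multi-objective selection idea can be illustrated with a toy Pareto-front computation in Python. This is a simplified stand-in for the full NSGA-II scheme TPOT uses via DEAP, and all numbers below are made up:

```python
# Each candidate is a pair (accuracy, n_operators):
# accuracy is to be maximized, the operator count minimized.

def dominates(a, b):
    """True if `a` is at least as good as `b` in both objectives
    and strictly better in at least one."""
    acc_a, ops_a = a
    acc_b, ops_b = b
    return (acc_a >= acc_b and ops_a <= ops_b) and (acc_a > acc_b or ops_a < ops_b)

def pareto_front(population):
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q != p)]

population = [(0.95, 4), (0.95, 2), (0.90, 1), (0.80, 1), (0.97, 6)]
front = pareto_front(population)
print(sorted(front))  # [(0.9, 1), (0.95, 2), (0.97, 6)]
```

Note that (0.95, 4) is dominated by the equally accurate but shorter (0.95, 2), mirroring TPOT's preference for compact pipelines at equal accuracy. NSGA-II additionally ranks dominated candidates into successive fronts and uses crowding distance to preserve diversity, which this sketch omits.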

The authors implemented their approach in the Python package TPOT.1 The package builds on two further Python packages, scikit-learn2 and DEAP3: scikit-learn provides the interface to the machine learning operators, while DEAP performs the optimization with GP. The functionality of the Python package goes beyond what the TPOT authors (Olson and Moore 2018) published in the AutoML book (Hutter, Kotthoff, and Vanschoren 2018), because the package allows the user to manipulate the parameters of the underlying genetic programming algorithm and is constantly being expanded. For example, the publication by Olson and Moore (2018) deals only with classification, while the Python package also offers regression.

2.2 From Theory to Practice: Implementing a Tree-Based Pipeline Optimization Tool

2.2.1 An R-Wrapper for TPOT

2.2.1.1 Approach

Two options for implementing TPOT in R were considered at the outset. The first is an independent implementation of the underlying theory in R, which - possibly inspired by the Python reference implementation - forgoes external dependencies to the extent that the essential algorithms and functions are executed in R. Beyond the obvious advantage that this approach spares users the need to install additional dependencies (programming languages, runtime environments, etc.), there are further benefits: language-specific advantages of R over Python can be better exploited, and maintenance and troubleshooting are simplified. R packages required for implementing TPOT's functionality are - at least partially - available both for pipeline operators and for genetic programming (ecr). However, this also touches on one of the disadvantages of this procedure: although a range of machine learning packages is available for R (caret, mlr), some of the libraries TPOT uses are not provided as R packages, making it difficult to recreate the reference implementation true to the original. The biggest disadvantage of a stand-alone implementation, however, is the resulting separation from the actively developed reference implementation. Since the authors of this seminar paper do not plan to actively maintain the project, the two implementations would immediately begin to drift apart in functional scope.

These considerations lead to the second implementation option for TPOT in R, namely wrapping the existing Python implementation. The advantages are obvious: with the reticulate package, Python applications can be conveniently wrapped in R. Assuming API stability, the R implementation then benefits from the continued development of TPOT in Python, which should significantly reduce the maintenance effort in R. Ultimately, this approach allows the functions and results of the Python implementation to be reproduced in a way that would not have been possible with an R-based custom development; users of the Python variant should notice little to no difference in the results. It was therefore decided to wrap the existing Python package TPOT in a new R package, tpotr. The realization is described in the following.

2.2.1.2 Requirements & Implementation

Due to the chosen procedure, Python must be supported on the target system. In TPOT's installation documentation, the authors recommend the use of the Enterprise Data Science Platform Anaconda.4 Consequently, an installed Anaconda environment is required on the target system of the tpotr package.

The package contains five R files that implement its logic. Following common conventions, functions required to load the package can be found in R/zzz.R. Functions used to install Python libraries are located in R/requirements.R and R/installation.R. When a user loads tpotr, the function install_tpot() is executed in the package's .onLoad(). It checks whether Anaconda is installed on the target system and, if TPOT has not yet been installed, runs the TPOT installation procedure via the reticulate package. This check is repeated each time the package is loaded.

For wrapping TPOT, constructor scripts were first created in Python; they can be found at /inst/python/pipeline_generator.py. This script is used by the R functions in R/TPOT.R to create and execute the corresponding TPOT objects in Python.

Integration with mlr

The R package mlr is one of the more prominent packages for executing machine learning tasks in R (Bischl et al. 2016). Essentially, the same steps are performed for any data set: creating an ML task, determining which learner (i.e., which ML algorithm) should be applied to the problem, and finally training the learner on the data. While mlr comes with numerous learners, the user of the package must decide which learner is best applied to the problem at hand. Obviously, integrating TPOT as an mlr learner can simplify the machine learning process; in accordance with the AutoML concept, this makes machine learning methods more accessible. The provision of an external learner in mlr is described in detail in its documentation5; the corresponding implementation in tpotr can be found in the file R/mlr.R. The following code excerpt shows the simple use of tpotr in mlr using the example of the iris classification task:

task = makeClassifTask(data = iris, target = "Species", id = "iris")
learner = makeLearner(cl = "classif.tpot", population_size = 10, generations = 3, verbosity = 2)
model = train(learner, task)
testdata = iris[sample(nrow(iris), 30), ]  # illustrative hold-out data
predict(model, newdata = testdata)

Due to the structure of mlr, however, the use of tpotr in mlr is not problem-free. In mlr, the prediction type is defined when the learner is created: for a classification problem, the default predict.type = 'response' returns the predicted classes, while predict.type = 'prob' returns the class probabilities. If a learner is selected manually, it is clear in advance whether it can return class probabilities. If TPOT fits a machine learning pipeline, however, the fitted model may not support this prediction type. Since TPOT itself offers no way to exclude such pipelines, the user of the package must take adequate countermeasures, for instance training the model in a loop until a pipeline is fitted that supports predict.type = 'prob':

task = makeClassifTask(data = iris, target = "Species", id = "iris")
learner = makeLearner(cl = "classif.tpot", predict.type = "prob", population_size = 10, generations = 3, verbosity = 2)
model = NULL; pred = NULL
# predict.type = "prob" is not supported by every possible TPOT pipeline.
# To ensure that TPOT returns a pipeline that supports the "prob" property,
# iterate over the training until such a pipeline is found.
while (TRUE) {
  result = try({
    model = train(learner, task)
    pred = predict(model, newdata = testdata)
  }, silent = TRUE)
  if (!inherits(result, "try-error")) {
    break
  }
}

2.2.1.3 Testing & Documentation

The functionality of tpotr is ensured at several levels. By wrapping the Python implementation, the functional tests of the reference implementation take effect first. Within the R package, the wrapping as well as the integration with mlr are tested by means of the testthat package. Finally, with Travis CI - a continuous integration service that is free for open-source projects - the successful deployment on different virtual machines is tested after each change. Specifically, the package is rolled out on a total of 8 virtual machines in different combinations of operating systems and installed Python versions6. The Travis CI tests are performed after each code change in the GitHub repository of the project; the .travis.yml file stored in the R package specifies the test environments.
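Such an 8-job matrix (two operating systems times four Python versions) could be declared roughly as follows. This is a hedged sketch, not the project's actual .travis.yml; the helper script name and the chosen Python versions are hypothetical:

```yaml
# Illustrative Travis CI matrix for an R package wrapping Python.
language: r
cache: packages
os:
  - linux
  - osx
env:                       # Python versions to provide via (Ana)conda
  - PYTHON_VERSION=2.7
  - PYTHON_VERSION=3.5
  - PYTHON_VERSION=3.6
  - PYTHON_VERSION=3.7
before_install:
  - ./ci/install_conda.sh "$PYTHON_VERSION"   # hypothetical helper script
script:
  - R CMD build .
  - R CMD check *tar.gz
```

Travis CI expands the `os` and `env` lists into the Cartesian product of jobs, yielding the eight build environments mentioned above.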

Despite the extensive tests and the different levels at which they are carried out, their successful completion does not guarantee the absence of errors. TPOT uses a number of dependencies maintained by different maintainers, and differing update cycles can lead to problems between the individual packages, as documented in the bug tracker of the reference implementation.

The package is documented at several levels. Code written in R is documented using the roxygen2 package. The reasoning behind and functionality of the package, as well as installation instructions and troubleshooting hints, can be found on the GitHub page7 of the project and in the package vignettes. In addition, the package contains a number of examples that demonstrate its use. A detailed description of the underlying theory and additional information about the R package can be found in this report, which is also published as a GitHub page via R bookdown.

2.2.1.4 Benchmarking

In the context of machine learning, the question of performance inevitably arises. With AutoML this is all the more true, since experimentation (cf. TPOT) is a frequent component of finding and training machine learning models. The authors of the reference implementation have described the performance of their application in various publications (e.g., Olson and Moore 2018). The authors of this report considered comparative performance studies but rejected them for the following reasons: for the benchmarking runs to be comparable, not only the processed data sets would have to be the same, but also the pipeline operators tested within the genetic programming framework. Moreover, the fundamental random influence cannot be eliminated by setting the seed of the random number generator, because the R implementation has no interface to pass its seed to the wrapped Python functions. Thus the test runs would only be comparable in the context of statistical surveys, i.e., the x-fold execution of tpotr and TPOT; such executions were not possible during the seminar due to time constraints. However, the necessity of benchmarks can be questioned with a simple consideration: the R code in tpotr only plays a role when TPOT is being initialized. All other operations (fitting the pipeline, executing predictions, etc.) are executed entirely in Python, and only the respective end result is transmitted back to the R interface. This means that the performance of TPOT is only affected by the cost of the R calls. Although R is a comparatively slow language, it can be assumed that the delay is in the range of a few milliseconds.

References

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “Mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.

Eiben, Agoston E, James E Smith, and others. 2003. Introduction to Evolutionary Computing. Vol. 53. Springer.

Hutter, Frank, Lars Kotthoff, and Joaquin Vanschoren, eds. 2018. Automatic Machine Learning: Methods, Systems, Challenges. Springer.

Olson, Randal S., and Jason H. Moore. 2018. “TPOT: A Tree-Based Pipeline Optimization Tool for Automating Machine Learning.” In Automatic Machine Learning: Methods, Systems, Challenges, edited by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, 163–73. Springer.