Picnic Machine Learning


Machine learning in Scala.


doddle-model

doddle-model is an in-memory machine learning library for Scala, built on top of Breeze.

Caveat emptor! doddle-model is in an early stage of development. Contributions of any kind are much appreciated.

You can chat with us on gitter.

Installation

latest release

Add the dependency to your SBT project definition:

libraryDependencies ++= Seq(
  "io.github.picnicml" %% "doddle-model" % "<latest_version>",
  // optional: native libraries for a significant performance boost
  "org.scalanlp" %% "breeze-natives" % "0.13.2"
)

Note that the latest version is displayed in the Maven Central badge above and that the v prefix should be omitted from the SBT definition.

Getting Started

This is a complete list of code examples. For an example of how to serve a trained doddle-model in a pipeline implemented with Apache Beam, see doddle-beam-example.

1. Feature Preprocessing

2. Metrics

3. Baseline Models

4. Linear Models

5. Model Selection

6. Miscellaneous

7. Use Cases
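As a quick taste of the workflow before diving into the examples above, here is a minimal sketch of fitting and predicting with a linear model. The import paths, the RegressorSyntax import, and the toy data are assumptions based on the library's style; consult the linked code examples for authoritative usage.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
// the doddle-model import paths below are assumptions, not verified API
import io.picnicml.doddlemodel.linear.LinearRegression
import io.picnicml.doddlemodel.syntax.RegressorSyntax._

// hypothetical toy data: 4 examples with 2 features each
val x = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0))
val y = DenseVector(1.0, 2.0, 3.0, 4.0)

val model = LinearRegression()
// estimators are immutable: fit returns a new, trained copy
val trainedModel = model.fit(x, y)
val predictions = trainedModel.predict(x)
```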

Performance

doddle-model is developed with performance in mind.

1. Native Linear Algebra Libraries

Breeze utilizes netlib-java for accessing hardware-optimised linear algebra libraries (note that the breeze-natives dependency needs to be added to the SBT project definition). TL;DR: seeing a log line like

INFO: successfully loaded /var/folders/9h/w52f2svd3jb750h890q1x4j80000gn/T/jniloader3358656786070405996netlib-native_system-osx-x86_64.jnilib

means that native BLAS/LAPACK/ARPACK implementations are being used. For more information, see the Breeze documentation.

2. Memory

If you encounter java.lang.OutOfMemoryError: Java heap space, increase the heap size with the -Xms and -Xmx JVM options, e.g. -Xms8192m -Xmx8192m for an initial and maximum heap size of 8 GB. Note that the maximum heap size for a 32-bit JVM is 4 GB (at least in theory), so make sure to use a 64-bit JVM if more memory is needed. If the error still occurs and you are using hyperparameter search or cross validation, see the next section.
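For example, a launch command with an 8 GB heap might look like this (the jar name is hypothetical):

```shell
# 8 GB initial and maximum heap; requires a 64-bit JVM
java -Xms8192m -Xmx8192m -jar my-doddle-app.jar
```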

3. Parallelism

To limit the number of threads running at one time (and thus memory consumption) during cross validation and hyperparameter search, a FixedThreadPool executor is used. By default, the maximum number of threads is set to the number of the system's cores. Set the -DmaxNumThreads JVM property to change that, e.g. -DmaxNumThreads=16 to allow 16 threads.
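For example, to cap cross validation and hyperparameter search at 16 concurrent threads (the jar name is hypothetical):

```shell
# -DmaxNumThreads overrides the default of one thread per core
java -DmaxNumThreads=16 -jar my-doddle-app.jar
```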

Benchmarks

All experiments were run for multiple iterations with each implementation, with fixed hyperparameters selected so that the models yielded similar test set performance.

1. Linear Regression

| Implementation | RMSE   | Training Time       | Prediction Time     |
|----------------|--------|---------------------|---------------------|
| scikit-learn   | 3.0936 | 0.042s (+/- 0.014s) | 0.002s (+/- 0.002s) |
| doddle-model   | 3.0936 | 0.053s (+/- 0.061s) | 0.002s (+/- 0.004s) |

2. Logistic Regression

| Implementation | Accuracy | Training Time       | Prediction Time     |
|----------------|----------|---------------------|---------------------|
| scikit-learn   | 0.8389   | 2.789s (+/- 0.090s) | 0.005s (+/- 0.006s) |
| doddle-model   | 0.8377   | 3.080s (+/- 0.665s) | 0.025s (+/- 0.025s) |

3. Softmax Classifier

| Implementation | Accuracy | Training Time        | Prediction Time     |
|----------------|----------|----------------------|---------------------|
| scikit-learn   | 0.9234   | 21.243s (+/- 0.303s) | 0.074s (+/- 0.018s) |
| doddle-model   | 0.9223   | 25.749s (+/- 1.813s) | 0.042s (+/- 0.032s) |

Development

build status coverage code quality

Run the tests with sbt test. Regarding code style, the PayPal Scala Style and the Databricks Scala Guide are roughly followed, with a maximum line length of 120 characters.

For a list of typeclasses that together define the estimator API see the typeclasses directory.
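To illustrate the pattern, here is a self-contained sketch of a typeclass-based estimator: the behaviour lives in a typeclass and the model is an immutable case class, so fitting returns a new trained copy. All names here are illustrative, not doddle-model's actual definitions.

```scala
// illustrative typeclass: fit/predict behaviour for some model type A
trait Regressor[A] {
  def fit(model: A, x: Seq[Double], y: Seq[Double]): A
  def predict(model: A, x: Seq[Double]): Seq[Double]
}

// a toy baseline model that predicts the training-set mean
final case class MeanRegressor(mean: Option[Double] = None)

object MeanRegressor {
  implicit val regressor: Regressor[MeanRegressor] = new Regressor[MeanRegressor] {
    override def fit(model: MeanRegressor, x: Seq[Double], y: Seq[Double]): MeanRegressor =
      model.copy(mean = Some(y.sum / y.length))
    override def predict(model: MeanRegressor, x: Seq[Double]): Seq[Double] =
      x.map(_ => model.mean.getOrElse(sys.error("called predict on an unfitted model")))
  }
}

object Demo extends App {
  val ev = implicitly[Regressor[MeanRegressor]]
  // fit returns a new trained copy; the original model is untouched
  val trained = ev.fit(MeanRegressor(), x = Seq(1.0, 2.0), y = Seq(10.0, 20.0))
  println(ev.predict(trained, Seq(3.0, 4.0)))
}
```

Because fitted models are just immutable values, they are safe to share across threads, which is what makes the parallel cross validation described above straightforward.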

Resources