Welcome to learning-curves's documentation!
===========================================

Learning-curves is a Python module that extends `sklearn's learning curve feature`_. It helps you visualize the learning curves of your models. Learning curves give an opportunity to diagnose bias and variance in supervised learning models, but also to visualize how the training set size influences the performance of the models (more information `here`_). Such plots help you answer the following questions:

- Do I have enough data?
- Can I train my model with less data without reducing accuracy?
- Is my training/validation set biased?
- What is the best model for my data?
- What is the best training size for tuning parameters?

Learning-curves will also help you fit the learning curve to extrapolate and find the saturation value of the curve.

Installation
============

::

    $ pip install git+https://github.com/H4dr1en/learning-curves#egg=learning-curves

To create learning curve plots, first import the module with ``import learning_curves as LC``.

Getting started
===============

To get started, you can use the following code:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    lc = LC.LearningCurve()
    lc.get_lc(SGDRegressor(), X, Y)

Output:

|alt text1|

In this example the green curve suggests that adding more data to the training set is not likely to improve the model accuracy. The green curve also shows a saturation near 0.84.

We can easily fit a function to any curve:

::

    lc.plot(predictor="best")

Output:

|alt text2|

Here we used a predefined function, ``pow``, to fit the green curve. The R2 score is very close to 1, meaning that the fit is optimal. We can therefore use this curve to extrapolate the evolution of the accuracy with the training set size. It also tells us how much data we should use to train our model to maximize performance and accuracy: with about 2000 samples, we already reach 99% of the maximal accuracy we can get for this model.

Custom Predictors
=================

Predictors are objects that wrap the fitting of learning curves. You can create a ``Predictor`` like this:

::

    predictor = Predictor("myPredictor", lambda x, a, b: a*x + b, [1, 0])

Here we created a Predictor called "myPredictor" with the function ``y(x) = a*x + b``. Because SciPy's ``optimize.curve_fit`` is called internally, a first guess of the parameters ``a`` and ``b`` is required. Here we gave them the respective initial values 1 and 0.
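Internally, the fit boils down to a SciPy call similar to the following minimal sketch (the ``train_sizes`` and ``scores`` arrays are invented here for illustration and are not produced by the library):

::

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical learning curve points: training sizes and validation scores
    train_sizes = np.array([100, 500, 1000, 5000, 10000])
    scores = np.array([0.60, 0.75, 0.80, 0.83, 0.84])

    # Fit y(x) = a*x + b, starting from the first guess [1, 0]
    params, _ = curve_fit(lambda x, a, b: a*x + b, train_sizes, scores, p0=[1, 0])
    print(params)  # optimized values of a and b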
You can then add the ``Predictor`` to the ``LearningCurve`` object in two different ways:

- Pass the ``Predictor`` to the ``LearningCurve`` constructor:

  ::

      lc = LearningCurve([predictor])

- Register the ``Predictor`` inside the predictors of the ``LearningCurve`` object:

  ::

      lc.predictors.append(predictor)

By default, 5 ``Predictors`` are instantiated:

::

    defaults_predictors = [
        Predictor("pow",     lambda x, a, b, c, d: a - 1 / (x/b - d)**c, [1, 1, 1, 1],
                  lambda x, a, b, c, d: b * (1 / (a-x)**(1/c) + d)),
        Predictor("inv",     lambda x, a, b, d: a / (1 + b/(x-d)), [1, 1, 1],
                  lambda x, a, b, d: b / (a/x - 1) + d),
        Predictor("inv_log", lambda x, a, b, c, d: a - b/np.log(x-d)**c, [1, 1, 1, 1],
                  lambda x, a, b, c, d: np.exp((b / (a-x))**(1/c)) + d),
        Predictor("pow_log", lambda x, a, b, c, d, m, n: a - 1 / (x/b - d)**c + m*np.log(x**n),
                  [1, 1, 1, 1, 1e-2, 1e-2], diverging=True,
                  bounds=([-np.inf, 0, 0, 0, 0, 0], [np.inf, np.inf, np.inf, np.inf, np.inf, np.inf])),
        Predictor("inv_2",   lambda x, a, b, d, e: a / (e + b/(x-d)), [1, 1, 1, 1],
                  lambda x, a, b, d, e: b / (a/x - e) + d)
    ]

Some predictors perform better (their R2 score is closer to 1) than others, depending on the dataset, the model and the value to be predicted.

Find the best Predictor
=======================

To find the ``Predictor`` that best fits your learning curve, call the ``get_predictor`` function:

::

    lc.get_predictor("best")

Output:

::

    (pow [params:[   0.9588563   11.74747659   -0.36232639 -236.46115903]][score:0.9997458683912492])

Plot the Predictors
===================

You can plot any ``Predictor``'s fitted function with the ``plot`` function:

::

    lc.plot(predictor="all")

Output:

|alt text3|

Predictor bounds
================

Each parameter of a ``Predictor`` can be constrained to a fixed interval using bounds:

::

    lc.predictors[0].bounds

Output:

::

    ([-np.inf, 1e-10, -np.inf, -np.inf], [1, np.inf, 0, 1])

For example, the first parameter (the saturation parameter) is constrained to values inside [-inf, 1], because an R2 score cannot be greater than 1.

Average learning curves for better extrapolation
================================================

Multiple learning curves can be averaged to get a more accurate extrapolation, as well as an estimation of the error (the standard deviation of the curve). This can easily be done using the ``LearningCurveCombined`` class:

::

    from sklearn.datasets import make_regression
    from learning_curves import *
    from xgboost import XGBRegressor

    X, Y = make_regression(500, noise=0.5, bias=0.2, n_informative=50)
    model = XGBRegressor(tree_method="hist")

    lc = LearningCurveCombined(10)
    lc.train(model, X, Y, n_splits=10, test_size=.2)
    lc.plot(target=2000, figsize=(8,4))

Output:

|alt text10|

In this example, the ``LearningCurveCombined`` class computes 10 different learning curves and saves them internally. The extrapolation is then calculated by averaging the results of each predictor. To get the score for a particular training size, use the ``target()`` method:

::

    lc.target(5000000)

Output:

::

    (array([0.80658311]), array([0.27924702]))

This gives you the averaged score (0.8) and its standard deviation (0.279). We can verify this by plotting the 10 learning curves that were actually computed:

::

    lc.plot_all(figsize=(12,6), what="valid", std=False, alpha=.1, alpha_fit=.5, target=2000, predictor="best", legend=False)

Output:

|alt text11|
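Conceptually, the combined estimate is simply the mean and spread of the individual extrapolated scores, as in this minimal sketch (the ``predictions`` array is invented for illustration, not produced by the library):

::

    import numpy as np

    # Hypothetical scores extrapolated at the same target size by 10 fitted curves
    predictions = np.array([0.78, 0.82, 0.80, 0.85, 0.79, 0.81, 0.83, 0.77, 0.84, 0.80])

    mean, std = predictions.mean(), predictions.std()
    print(f"extrapolated score: {mean:.3f} +/- {std:.3f}")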
Evaluate extrapolation using MSE validation
===========================================

The goodness of a fit is calculated using the R2 score. Another metric can also be used: the mean squared error (or root mean squared error). This is done by excluding points from the fitting of the curve and using them for validation:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    lc = LC.LearningCurve()
    lc.get_lc(SGDRegressor(), X, Y, predictor="best", validation=0.2)

Output:

|alt text6|

In this plot we can see that 20% of the points have been excluded from the fitting and used to calculate an RMSE (here 2.38e-3). This RMSE is another indicator that we can safely extrapolate this curve and predict the score of the model trained with more data.

Compare learning curves of various models
=========================================

If you have multiple models, you can plot their learning curves on the same figure:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor

    models = []
    models.append(("SGDRegressor", SGDRegressor()))
    models.append(("KNeighborsRegressor", KNeighborsRegressor()))
    models.append(("SVR", SVR()))
    models.append(("RandomForestRegressor", RandomForestRegressor()))

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)

    lcs = []
    for name, model in models:
        lc = LC.LearningCurve(name=name)
        lc.train(model, X, Y)
        lcs.append(lc)

    LC.LearningCurve.compare(lcs, what="valid")

Output:

|alt text5|

Save and load LearningCurve instances
=====================================

Because a ``Predictor`` contains lambda functions, you cannot simply pickle a ``LearningCurve`` instance. One possibility is to only save the data points of the curve inside ``lc.recorder["data"]`` and retrieve them later on, but then the custom predictors are not saved. It is therefore recommended to use the ``save`` and ``load`` methods:

::

    lc.save("path/to/save.pkl")
    lc = LC.LearningCurve.load("path/to/save.pkl")

This internally uses the ``dill`` library to save the ``LearningCurve`` instance with all its ``Predictor``\ s.

Find the best training set size
===============================

``learning-curves`` helps you find the best training set size by extrapolating the best fitted curve:

::

    lc.plot(predictor="all", saturation="best", target=31668)

Output:

|alt text4|

The horizontal red line shows the saturation of the curve. The vertical blue line shows the best accuracy we can get, given a certain ``threshold`` (see below). We can use the ``target`` parameter to extrapolate the curves. To retrieve the value of the best training set size:

::

    lc.threshold(predictor="best", saturation="best")

Output:

::

    (0.9589, 31668, 0.9493)

This tells us that the saturation value (the maximum accuracy we can get from this model without changing any other parameter) is ``0.9589``. This value corresponds to an infinite number of samples in our training set! But with a threshold of ``0.99`` (this parameter can be changed with ``threshold=x``), we can reach an accuracy of ``0.9493`` if our training set contains ``31668`` samples.

Note: the saturation value is always the *second parameter* of the function (the first one after ``x``). Therefore, if you create your own ``Predictor``, place the saturation factor in second position (it is called ``a`` in the predefined ``Predictor``\ s), as in the sketch below.
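As an illustration of this convention, here is a minimal sketch of a custom saturating ``Predictor`` (the exponential form and its first guess are made up for this example, and we assume ``Predictor`` is exposed at the package level as in the earlier examples):

::

    import numpy as np
    import learning_curves as LC

    # `a`, the saturation value, comes right after `x`, as required
    predictor = LC.Predictor("exp_sat", lambda x, a, b, c: a - b * np.exp(-c * x), [1, 1, 1e-3])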
If the function of your custom ``Predictor`` is diverging, then no saturation value can be retrieved. In that case, pass ``diverging=True`` to the constructor of the ``Predictor``. The saturation value will then be calculated considering the ``max_scaling`` parameter of the ``threshold_cust`` function (see the documentation for details). You should set this parameter to the maximum number of samples you can add to your training set.

Compare the models' performances
================================

``learning-curves`` also keeps track of the time elapsed during the computation of the learning curves:

::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    estimator = RandomForestRegressor()

    lc = LC.LearningCurve()
    lc.train(estimator, X, Y)
    lc.plot_time()

Output:

|alt text7|

As with the learning curves, you can easily compare the timings using the ``LearningCurve.compare_time()`` function:

::

    from sklearn.linear_model import SGDRegressor
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor

    models = []
    models.append(("SGDRegressor", SGDRegressor()))
    models.append(("KNeighborsRegressor", KNeighborsRegressor()))
    models.append(("SVR", SVR()))
    models.append(("RandomForestRegressor", RandomForestRegressor()))

    lcs = []
    for name, model in models:
        lc = LC.LearningCurve(name=name)
        lc.train(model, X, Y, verbose=10)
        lcs.append(lc)

    LC.LearningCurve.compare_time(lcs, what="fit")

Output:

|alt text8|

Having the times helps you diagnose which model is likely to scale better with more data:

::

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1, figsize=(12,6))

    for lc in lcs:
        ax.plot(lc.recorder["fit_times_mean"], lc.recorder["test_scores_mean"])
        ax.scatter(lc.recorder["fit_times_mean"], lc.recorder["test_scores_mean"], label=lc.name)

    ax.set_xlabel("Fit time (s)")
    ax.set_ylabel("Accuracy (r2 score)")
    ax.legend()

Output:

|alt text9|

This plot shows that KNeighborsRegressor, although it looked very promising and scalable in the previous plot, does not reach an accuracy as high as SGDRegressor's. SGDRegressor would probably be the best model for making predictions on this dataset.

.. _sklearn's learning curve feature: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
.. _here: https://www.dataquest.io/blog/learning-curves-machine-learning/

.. |alt text1| image:: ../images/learning_curve_doc_get_started.png
.. |alt text2| image:: ../images/learning_curve_simple.png
.. |alt text3| image:: ../images/learning_curve_all.png
.. |alt text4| image:: ../images/learning_curve_fit_sat_all.png
.. |alt text5| image:: ../images/learning_curve_doc_compare.png
.. |alt text6| image:: ../images/learning_curve_doc_valid.png
.. |alt text7| image:: ../images/learning_curve_doc_time.png
.. |alt text8| image:: ../images/learning_curve_doc_time_all.png
.. |alt text9| image:: ../images/learning_curve_doc_diag.png
.. |alt text10| image:: ../images/learning_curve_doc_combined.png
.. |alt text11| image:: ../images/learning_curve_doc_combined_all.png