Welcome to learning-curves's documentation!
===========================================

Learning-curves is a Python module that extends `sklearn's learning curve feature`_. It helps you visualize the learning curves of your models. Learning curves give an opportunity to diagnose bias and variance in supervised learning models, but also to visualize how the training set size influences the performance of the models (more information `here`_). Such plots help you answer the following questions:

- Do I have enough data?
- Can I train my model with less data without reducing accuracy?
- Is my training/validation set biased?
- What is the best model for my data?
- What is the best training size for tuning parameters?

Learning-curves will also help you fit the learning curve to extrapolate and find the saturation value of the curve.

Installation
============

::

    $ pip install git+https://github.com/H4dr1en/learning-curves#egg=learning-curves

To create learning curve plots, first import the module with ``import learning_curves as LC``.

Getting started
===============

To get started, you can use the following code:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    lc = LC.LearningCurve()
    lc.get_lc(SGDRegressor(), X, Y)

Output:

|alt text1|

In this example the green curve suggests that adding more data to the training set is not likely to improve the model accuracy. The green curve also shows a saturation near 0.84.

We can easily fit a function to any curve:

::

    lc.plot(predictor="best")

Output:

|alt text2|

Here we used a predefined function, ``pow``, to fit the green curve. The R2 score is very close to 1, meaning that the fit is optimal. We can therefore use this curve to extrapolate the evolution of the accuracy with the training set size. It also tells us how much data we should use to train our model to maximize performance and accuracy: with about 2000 samples, we already reach 99% of the maximal accuracy we can get for this model.

Custom Predictors
=================

Predictors are objects that wrap the fitting of learning curves. You can create a ``Predictor`` like this:

::

    predictor = Predictor("myPredictor", lambda x, a, b: a*x + b, [1, 0])

Here we created a Predictor called "myPredictor" with the function ``y(x) = a*x + b``. Because SciPy's ``optimize.curve_fit`` is called internally, a first guess of the parameters ``a`` and ``b`` is required. Here we gave them the respective initial values 1 and 0.
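Internally, the fit boils down to a SciPy call similar to the following minimal sketch (the ``train_sizes`` and ``scores`` arrays are invented here for illustration and are not produced by the library):

::

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical learning curve points: training sizes and validation scores
    train_sizes = np.array([100, 500, 1000, 5000, 10000])
    scores = np.array([0.60, 0.75, 0.80, 0.83, 0.84])

    # Fit y(x) = a*x + b, starting from the first guess [1, 0]
    params, _ = curve_fit(lambda x, a, b: a*x + b, train_sizes, scores, p0=[1, 0])
    print(params)  # optimized values of a and b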
You can then add the ``Predictor`` to the ``LearningCurve`` object in two different ways:

- Pass the ``Predictor`` to the ``LearningCurve`` constructor:

  ::

      lc = LearningCurve([predictor])

- Register the ``Predictor`` inside the predictors of the ``LearningCurve`` object:

  ::

      lc.predictors.append(predictor)

By default, 5 ``Predictors`` are instantiated:

::

    defaults_predictors = [
        Predictor("pow",     lambda x, a, b, c, d: a - 1 / (x/b - d)**c, [1, 1, 1, 1],
                  lambda x, a, b, c, d: b * (1 / (a-x)**(1/c) + d)),
        Predictor("inv",     lambda x, a, b, d: a / (1 + b/(x-d)), [1, 1, 1],
                  lambda x, a, b, d: b / (a/x - 1) + d),
        Predictor("inv_log", lambda x, a, b, c, d: a - b/np.log(x-d)**c, [1, 1, 1, 1],
                  lambda x, a, b, c, d: np.exp((b / (a-x))**(1/c)) + d),
        Predictor("pow_log", lambda x, a, b, c, d, m, n: a - 1 / (x/b - d)**c + m*np.log(x**n),
                  [1, 1, 1, 1, 1e-2, 1e-2], diverging=True,
                  bounds=([-np.inf, 0, 0, 0, 0, 0], [np.inf, np.inf, np.inf, np.inf, np.inf, np.inf])),
        Predictor("inv_2",   lambda x, a, b, d, e: a / (e + b/(x-d)), [1, 1, 1, 1],
                  lambda x, a, b, d, e: b / (a/x - e) + d)
    ]

Some predictors perform better (their R2 score is closer to 1) than others, depending on the dataset, the model and the value to be predicted.

Find the best Predictor
=======================

To find the ``Predictor`` that best fits your learning curve, call the ``get_predictor`` function:

::

    lc.get_predictor("best")

Output:

::

    (pow [params:[   0.9588563   11.74747659   -0.36232639 -236.46115903]][score:0.9997458683912492])

Plot the Predictors
===================

You can plot any ``Predictor``'s fitted function with the ``plot`` function:

::

    lc.plot(predictor="all")

Output:

|alt text3|

Predictor bounds
================

Each parameter of a ``Predictor`` can be constrained to a fixed interval using bounds:

::

    lc.predictors[0].bounds

Output:

::

    ([-np.inf, 1e-10, -np.inf, -np.inf], [1, np.inf, 0, 1])

For example, the first parameter (the saturation parameter) is constrained to values inside [-inf, 1], because an R2 score cannot be greater than 1.

Average learning curves for better extrapolation
================================================

Multiple learning curves can be averaged to get a more accurate extrapolation, as well as an estimation of the error (the standard deviation of the curve). This can easily be done using the ``LearningCurveCombined`` class:

::

    from sklearn.datasets import make_regression
    from learning_curves import *
    from xgboost import XGBRegressor

    X, Y = make_regression(500, noise=0.5, bias=0.2, n_informative=50)
    model = XGBRegressor(tree_method="hist")

    lc = LearningCurveCombined(10)
    lc.train(model, X, Y, n_splits=10, test_size=.2)
    lc.plot(target=2000, figsize=(8,4))

Output:

|alt text10|

In this example, the ``LearningCurveCombined`` class computes 10 different learning curves and saves them internally. The extrapolation is then calculated by averaging the results of each predictor. To get the score for a particular training size, use the ``target()`` method:

::

    lc.target(5000000)

Output:

::

    (array([0.80658311]), array([0.27924702]))

This gives you the averaged score (0.8) and its standard deviation (0.279). We can verify this by plotting the 10 learning curves that were actually computed:

::

    lc.plot_all(figsize=(12,6), what="valid", std=False, alpha=.1, alpha_fit=.5, target=2000, predictor="best", legend=False)

Output:

|alt text11|
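Conceptually, the combined estimate is simply the mean and spread of the individual extrapolated scores, as in this minimal sketch (the ``predictions`` array is invented for illustration, not produced by the library):

::

    import numpy as np

    # Hypothetical scores extrapolated at the same target size by 10 fitted curves
    predictions = np.array([0.78, 0.82, 0.80, 0.85, 0.79, 0.81, 0.83, 0.77, 0.84, 0.80])

    mean, std = predictions.mean(), predictions.std()
    print(f"extrapolated score: {mean:.3f} +/- {std:.3f}")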
Evaluate extrapolation using MSE validation
===========================================

The goodness of a fit is calculated using the R2 score. Another metric can also be used: the mean squared error (or root mean squared error). This is done by excluding points from the fitting of the curve and using them for validation:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    lc = LC.LearningCurve()
    lc.get_lc(SGDRegressor(), X, Y, predictor="best", validation=0.2)

Output:

|alt text6|

In this plot we can see that 20% of the points have been excluded from the fitting and used to calculate an RMSE (here 2.38e-3). This RMSE is another indicator that we can safely extrapolate this curve and predict the score of the model trained with more data.

Compare learning curves of various models
=========================================

If you have multiple models, you can plot their learning curves on the same figure:

::

    import learning_curves as LC
    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor

    models = []
    models.append(("SGDRegressor", SGDRegressor()))
    models.append(("KNeighborsRegressor", KNeighborsRegressor()))
    models.append(("SVR", SVR()))
    models.append(("RandomForestRegressor", RandomForestRegressor()))

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)

    lcs = []
    for name, model in models:
        lc = LC.LearningCurve(name=name)
        lc.train(model, X, Y)
        lcs.append(lc)

    LC.LearningCurve.compare(lcs, what="valid")

Output:

|alt text5|

Save and load LearningCurve instances
=====================================

Because a ``Predictor`` contains lambda functions, you cannot simply pickle a ``LearningCurve`` instance. One possibility is to only save the data points of the curve inside ``lc.recorder["data"]`` and retrieve them later on, but then the custom predictors are not saved. It is therefore recommended to use the ``save`` and ``load`` methods:

::

    lc.save("path/to/save.pkl")
    lc = LC.LearningCurve.load("path/to/save.pkl")

This internally uses the ``dill`` library to save the ``LearningCurve`` instance with all its ``Predictor``\ s.

Find the best training set size
===============================

``learning-curves`` helps you find the best training set size by extrapolating the best fitted curve:

::

    lc.plot(predictor="all", saturation="best", target=31668)

Output:

|alt text4|

The horizontal red line shows the saturation of the curve. The vertical blue line shows the best accuracy we can get, given a certain ``threshold`` (see below). We can use the ``target`` parameter to extrapolate the curves. To retrieve the value of the best training set size:

::

    lc.threshold(predictor="best", saturation="best")

Output:

::

    (0.9589, 31668, 0.9493)

This tells us that the saturation value (the maximum accuracy we can get from this model without changing any other parameter) is ``0.9589``. This value corresponds to an infinite number of samples in our training set! But with a threshold of ``0.99`` (this parameter can be changed with ``threshold=x``), we can reach an accuracy of ``0.9493`` if our training set contains ``31668`` samples.

Note: the saturation value is always the *second parameter* of the function (the first one after ``x``). Therefore, if you create your own ``Predictor``, place the saturation factor in second position (it is called ``a`` in the predefined ``Predictor``\ s), as in the sketch below.
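As an illustration of this convention, here is a minimal sketch of a custom saturating ``Predictor`` (the exponential form and its first guess are made up for this example, and we assume ``Predictor`` is exposed at the package level as in the earlier examples):

::

    import numpy as np
    import learning_curves as LC

    # `a`, the saturation value, comes right after `x`, as required
    predictor = LC.Predictor("exp_sat", lambda x, a, b, c: a - b * np.exp(-c * x), [1, 1, 1e-3])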
If the function of your custom ``Predictor`` is diverging, then no saturation value can be retrieved. In that case, pass ``diverging=True`` to the constructor of the ``Predictor``. The saturation value will then be calculated considering the ``max_scaling`` parameter of the ``threshold_cust`` function (see the documentation for details). You should set this parameter to the maximum number of samples you can add to your training set.

Compare the models' performances
================================

``learning-curves`` also keeps track of the time elapsed during the computation of the learning curves:

::

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, Y = make_regression(n_samples=int(1e4), n_features=50, n_informative=25, bias=-92, noise=100)
    estimator = RandomForestRegressor()

    lc = LC.LearningCurve()
    lc.train(estimator, X, Y)
    lc.plot_time()

Output:

|alt text7|

As with the learning curves, you can easily compare the timings using the ``LearningCurve.compare_time()`` function:

::

    from sklearn.linear_model import SGDRegressor
    from sklearn.svm import SVR
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.ensemble import RandomForestRegressor

    models = []
    models.append(("SGDRegressor", SGDRegressor()))
    models.append(("KNeighborsRegressor", KNeighborsRegressor()))
    models.append(("SVR", SVR()))
    models.append(("RandomForestRegressor", RandomForestRegressor()))

    lcs = []
    for name, model in models:
        lc = LC.LearningCurve(name=name)
        lc.train(model, X, Y, verbose=10)
        lcs.append(lc)

    LC.LearningCurve.compare_time(lcs, what="fit")

Output:

|alt text8|

Having the times helps you diagnose which model is likely to scale better with more data:

::

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1, figsize=(12,6))

    for lc in lcs:
        ax.plot(lc.recorder["fit_times_mean"], lc.recorder["test_scores_mean"])
        ax.scatter(lc.recorder["fit_times_mean"], lc.recorder["test_scores_mean"], label=lc.name)

    ax.set_xlabel("Fit time (s)")
    ax.set_ylabel("Accuracy (r2 score)")
    ax.legend()

Output:

|alt text9|

This plot shows that KNeighborsRegressor, although it looked very promising and scalable in the previous plot, does not reach an accuracy as high as SGDRegressor's. SGDRegressor would probably be the best model for making predictions on this dataset.

.. _sklearn's learning curve feature: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
.. _here: https://www.dataquest.io/blog/learning-curves-machine-learning/

.. |alt text1| image:: ../images/learning_curve_doc_get_started.png
.. |alt text2| image:: ../images/learning_curve_simple.png
.. |alt text3| image:: ../images/learning_curve_all.png
.. |alt text4| image:: ../images/learning_curve_fit_sat_all.png
.. |alt text5| image:: ../images/learning_curve_doc_compare.png
.. |alt text6| image:: ../images/learning_curve_doc_valid.png
.. |alt text7| image:: ../images/learning_curve_doc_time.png
.. |alt text8| image:: ../images/learning_curve_doc_time_all.png
.. |alt text9| image:: ../images/learning_curve_doc_diag.png
.. |alt text10| image:: ../images/learning_curve_doc_combined.png
.. |alt text11| image:: ../images/learning_curve_doc_combined_all.png