Documentation¶
learning_curves.learning_curves module¶
-
class
learning_curves.learning_curves.
LearningCurve
(predictors=[], scoring=<function r2_score>, name=None)[source]¶ Bases:
object
-
best_predictor
(predictors='all', prefer_conv_delta=0.002, **kwargs)[source]¶ Find the Predictor having the best fit of a learning curve.
- Parameters
predictors (list(Predictor), "all") – A list of Predictors to consider.
prefer_conv_delta (float) – If the difference between the two best Predictor fit scores is lower than prefer_conv_delta, then the converging Predictor will be preferred (if any).
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The Predictor having the best fit of the learning curve.
- Return type
Predictor
-
static
compare
(lcs, fit=True, figsize=(12, 6), colors=None, what='both', fig=None, **kwargs)[source]¶ Stack learning curves on a single plot (max 10).
- Parameters
lcs (list(LearningCurve)) – List of LearningCurves to stack.
fit (bool) – If True, calls
LearningCurve.fit_all()
on all the learning curve objects.
figsize (tuple) – Dimensions of the figure.
colors (cycle, list) – Cycle of the learning curve colors. A cycler can be created as follows: cycle = cycle(‘color’, [“color1”, “color2”, …])
what ("train", "valid", "both") – curves to show
fig (Matplotlib.figure) – A figure on which the curves will be drawn. If None, a new one is created.
kwargs (dict) – Dictionary of values that will be passed to each
LearningCurve.plot()
method
- Returns
The resulting figure
- Return type
fig (Matplotlib.figure)
-
static
compare_time
(lcs, what='both', figsize=(12, 6), colors=None, **kwargs)[source]¶ Stack times of the computing of the learning curves on a single plot.
- Parameters
lcs (list(LearningCurve)) – List of LearningCurves to stack (max 10).
what (str) – Value in [“both”, “fit”, “score”]. Select the curve to show.
figsize (tuple) – Dimensions of the figure.
colors (cycle, list) – cycle of the learning curves colors. A cycler can be created as follows: cycle = cycle(‘color’, [“color1”, “color2”, …])
kwargs (dict) – Dictionary of values that will be passed to each
LearningCurve.plot_time()
method
- Returns
The resulting axes.
- Return type
ax (Matplotlib.axes)
-
eval_fitted_curve
(validation, **kwargs)[source]¶ Split the data points into two sets, fit the predictors on the first set, and evaluate them using the RMSE on the second set. See
eval_fitted_curve_cust()
- Parameters
validation (float, int) – Percentage or number of samples of the validation set (the highest training sizes will be used for validation)
kwargs (dict) – Parameters passed to
eval_fitted_curve_cust()
- Returns
The Root Mean Squared Error of the validation set against the fitted curve of the Predictor
- Return type
fit_score (float)
-
eval_fitted_curve_cust
(train_sizes_fit, test_scores_mean_fit, test_scores_std_fit, train_sizes_val, test_scores_mean_val, predictor='best', fit=True, metric=<function mean_bias_error>)[source]¶ Compute the error of a fitted curve on a validation set.
- Parameters
train_sizes_fit (array) – List of train sizes used for the fitting of the curve
test_scores_mean_fit (array) – Means computed by the estimator for the train sizes.
test_scores_std_fit (array) – Standard deviations computed by the estimator for the train sizes.
train_sizes_val (array) – List of train sizes used for scoring the fitted curve (the computation of the RMSE).
test_scores_mean_val (array) – Values computed by the estimator for the validation train sizes.
predictor (Predictor, "best") – Predictor to consider
fit (bool) – If True, perform a fit of the Predictors using the test_scores_mean_fit data points.
metric (function) – Function to use for the evaluation of the fit of the validation points.
- Returns
The score of the extrapolation using the validation set
- Return type
fit_score (float)
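The evaluation logic can be sketched with NumPy and SciPy alone; the power-law curve shape and the synthetic data below are illustrative assumptions, not the library's built-in Predictors:

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch of eval_fitted_curve: hold out the largest train sizes, fit a
# curve on the rest, and score the extrapolation with RMSE.
# The power-law shape below is an assumption, not a library built-in.
def pow_law(x, a, b, c):
    return a - b * x ** (-c)

sizes = np.array([10, 30, 100, 300, 1000, 3000], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5         # synthetic accuracies

n_val = 2                                   # validation = the 2 largest sizes
params, _ = curve_fit(pow_law, sizes[:-n_val], scores[:-n_val], maxfev=10000)
pred = pow_law(sizes[-n_val:], *params)     # extrapolated values
rmse = np.sqrt(np.mean((scores[-n_val:] - pred) ** 2))
```

Since the synthetic data follows the fitted shape exactly, the extrapolation error is near zero; real learning curves will show a larger fit_score.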
-
eval_train_sizes
()[source]¶ Compute the difference of scale between the first and last gradients of accuracies of the train_sizes.
If this number is lower than 2, it indicates that the provided training set sizes don’t cover a wide enough range of accuracy values to fit a curve. In that case, look at the generated plot to determine whether you need more points close to the minimum or the maximum training set size.
- Returns
The difference of scale between the first and last gradients of accuracies of the train_sizes
- Return type
train_size_score (float)
Example
get_train_sizes_grads([ 2, 8, …, 2599, 2824]) > 2.7156
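A minimal sketch of this heuristic, assuming "scale" means the base-10 order of magnitude of the finite-difference gradients (the sizes and scores below are synthetic):

```python
import numpy as np

# Synthetic learning curve over geometrically spaced train sizes.
sizes = np.array([2, 8, 32, 128, 512, 2048], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5

# Finite-difference gradients of the scores w.r.t. the train sizes.
grads = np.gradient(scores, sizes)

# Difference of scale (order of magnitude) between first and last gradient.
scale_diff = np.log10(grads[0]) - np.log10(grads[-1])
# A value above ~2 suggests the train sizes span a wide enough range.
```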
-
fit
(P, x, y, **kwargs)[source]¶ Fit a curve with a predictor, compute and save the score of the fit.
- Parameters
P (Predictor) – The Predictor used for the fit.
x (list) – 1D array (list) representing the training sizes
y (list) – 1D array (list) representing the test scores
kwargs (dict) – Parameters that will be forwarded to Scipy curve_fit.
- Returns
The Predictor with the updated params and score. Score will be None if a ValueError exception occurs while computing the score.
- Return type
Predictor
-
fit_all
(**kwargs)[source]¶ Fit a curve with all the predictors using the recorder data and retrieve score if y_pred is finite.
- Parameters
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
an array of predictors with the updated params and score.
- Return type
list
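The fit-and-select mechanics of fit_all() and best_predictor() can be sketched with SciPy directly; both curve shapes below are illustrative stand-ins for the library's Predictors:

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Two illustrative curve shapes standing in for Predictors.
def pow_law(x, a, b, c):
    return a - b * x ** (-c)        # converging toward a as x -> inf

def log_curve(x, a, b):
    return a + b * np.log(x)        # diverging

sizes = np.array([10, 30, 100, 300, 1000, 3000], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5  # synthetic accuracies

# Fit every candidate and keep its r2_score, as fit_all does with the
# registered Predictors.
fit_scores = {}
for f in (pow_law, log_curve):
    params, _ = curve_fit(f, sizes, scores, maxfev=10000)
    fit_scores[f.__name__] = r2_score(scores, f(sizes, *params))

# best_predictor then simply keeps the candidate with the highest score.
best = max(fit_scores, key=fit_scores.get)
```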
-
fit_all_cust
(x, y, predictors, **kwargs)[source]¶ Fit a curve with all the predictors and retrieve score if y_pred is finite.
- Parameters
x (list) – 1D array (list) representing the training sizes
y (list) – 1D array (list) representing the test scores
predictors (list(Predictor)) – The Predictors to fit.
- Returns
an array of predictors with the updated params and score.
- Return type
list
-
get_label
(label)[source]¶ Prefix the label with the name of the LearningCurve instance.
- Parameters
label (str) – label to prefix
- Returns
label prefixed with name, if any.
- Return type
label (str)
-
get_lc
(estimator, X, Y, train_kwargs={}, **kwargs)[source]¶ Compute and plot the learning curve. See
train()
and plot()
functions for parameters.
- Parameters
estimator (Object) – Must implement fit(X, Y) and predict(X) methods.
X (array) – Features to use for prediction
Y (array) – Values to be predicted
train_kwargs (dict) – See
train()
parameters.
kwargs (dict) – See plot() parameters.
-
get_predictor
(pred)[source]¶ Get a
learning_curves.predictor
from the list of the Predictors.
-
plot
(predictor=None, figsize=(12, 6), fig=None, **kwargs)[source]¶ Plot the training and test learning curves of the recorder data, with optionally fitted functions and saturation. See
plot_cust()
- Parameters
predictor (str, list(str), Predictor, list(Predictor)) – The predictor(s) to use for plotting the fitted curve. Can also be “all” and “best”.
figsize (tuple) – Size of the figure (only taken into account if fig is None)
fig (Matplotlib.figure) – A figure on which the learning curve will be drawn. If None, a new one is created.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
A Matplotlib figure of the result.
- Raises
RuntimeError – If the recorder is empty.
-
plot_cust
(train_sizes, train_scores_mean, train_scores_std, test_scores_mean, test_scores_std, predictor=None, what='both', xlim=None, ylim=None, figsize=(12, 6), title=None, saturation=None, target=None, validation=0, close=True, uncertainty=False, fig=None, alpha=1, alpha_fit=1, std=True, **kwargs)[source]¶ Plot any training and test learning curves, with optionally fitted functions and saturation.
- Parameters
train_sizes (list) – Training sizes (x values).
train_scores_mean (list) – Train score means (y values).
train_scores_std (list) – Train score standard deviations.
test_scores_mean (list) – Test score means (y values).
test_scores_std (list) – Test score standard deviations.
predictor (str, list(str), Predictor, list(Predictor)) – The predictor(s) to use for plotting the fitted curve. Can be “all” or “best”.
what ("train", "valid", "both") – learning curves to show
xlim (tuple) – Limits of the x axis of the plot.
ylim (tuple) – Limits of the y axis of the plot.
figsize (tuple) – Size of the figure
title (str) – Title of the figure
saturation (str, list(str), Predictor, list(Predictor)) – Predictor(s) to consider for displaying the saturation on the plot. Can be “all” or “best”.
target (int) – Training size to reach. The training size axis will be extended and the fitted curve extrapolated until reaching this value.
validation (float) – Percentage or number of data points to keep for validation of the curve fitting (they will not be used during the fitting but will be displayed afterwards)
close (bool) – If True, close the figure before returning it. This is useful when many plots are created, because Matplotlib does not close them automatically, potentially leading to warnings. If False, the plot is not closed; this can be desirable in Jupyter notebooks, so that the plot is rendered in the cell output.
uncertainty (bool) – If True, plot the standard deviation of the best fitted curve for the validation data points.
fig (Matplotlib.figure) – A figure on which the learning curve will be drawn. If None, a new one is created.
alpha (float) – Controls transparency of the learning curve
alpha_fit (float) – Controls transparency of the fitted line
std (bool) – Whether to plot standard deviations of points or not.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The resulting figure.
- Return type
fig (Matplotlib.figure)
-
plot_fitted_curve
(ax, P, x, scores=True, best=False, best_ls='-.', alpha=1, **kwargs)[source]¶ Add to figure ax a fitted curve.
- Parameters
ax (Matplotlib.axes) – Axes on which the curve is drawn.
P (Predictor) – Predictor used to compute the curve.
x (array) – 1D array (list) representing the training sizes.
scores (bool) – If True, print the score of each curve fit in the legend.
best (bool) – If True, use a higher zorder to make the curve more visible.
best_ls (Matplotlib line-style) – Line style of the curve whose Predictor is used for computing the saturation accuracy.
alpha (float) – Controls the transparency of the fitted curve
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The updated figure.
- Return type
Matplotlib axes
-
plot_time
(fig=None, what='both', figsize=(12, 6))[source]¶ Plot training sizes against fit/score computing times.
- Parameters
fig (Matplotlib.figure) – A figure on which the curves will be drawn. If None, a new one will be created.
what (str) – Value in [“both”, “fit”, “score”]. Select the curve to show.
figsize (tuple) – Dimensions of the figure (ignored if fig is not None).
- Returns
A Matplotlib figure of the result.
- Return type
fig (Matplotlib.figure)
-
save
(path=None)[source]¶ Save the LearningCurve object to disk as a pickle file, or return it as a string.
It uses the dill library to serialize the instance because the object contains lambda functions, which cannot be pickled otherwise.
- Parameters
path (str) – Path where to save the object. If None, the string representing the object is returned
-
threshold
(P='best', **kwargs)[source]¶ See
threshold_cust()
function. This function calls threshold_cust()
with the recorder data.
- Parameters
P (Predictor, string) – Predictor to use.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
(x_thresh, y_thresh, sat_val, threshold). If P is diverging, the saturation value will be 1.
- Return type
Tuple
- Raises
RuntimeError – If the recorder is empty.
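For a converging Predictor with an inverse function, the threshold computation reduces to evaluating that inverse; a sketch under an assumed curve shape:

```python
# Sketch of the threshold computation for a converging predictor that has
# an inverse function. The curve shape y = a - b * x**-c is an assumption
# used for illustration; its saturation value is the parameter a.
a, b, c = 0.9, 2.0, 0.5

def P(x):
    return a - b * x ** -c

def P_inv(y):
    return (b / (a - y)) ** (1.0 / c)   # solves P(x) = y for x

threshold = 0.99
sat_val = a                             # saturation value of the curve
y_thresh = threshold * sat_val          # accuracy to reach
x_thresh = P_inv(y_thresh)              # train size reaching that accuracy
```

Diverging Predictors without an inverse fall back to the numeric search controlled by resolution (see threshold_cust_approx()).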
-
threshold_cust
(P, x, threshold=0.99, max_scaling=1, resolution=10000.0, strategies={'max_scaling': 1, 'threshold': -0.01}, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a predefined threshold.
P(x) = y, and for x -> inf, y -> saturation value. This method approximates x_thresh such that P(x_thresh) = threshold * saturation value.
- Parameters
P (str, Predictor) – The predictor to use for the calculation of the saturation value.
x (array) – Training set sizes
threshold (float) – In [0.0, 1.0]. Percentage of the saturation value to use for the calculation of the best training set size.
max_scaling (float) – Order of magnitude added to the order of magnitude of the maximum train set size. Generally, a value of 1-2 is enough.
resolution (float) – Only considered for diverging Predictors without inverse function. The higher it is, the more accurate the value of the training set size will be.
strategies (dict) – A dictionary of the values to add / subtract to the other parameters in case a saturation value cannot be found. If a RecursionError is raised, (None, None, sat_val, threshold) will be returned.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
(x_thresh, y_thresh, saturation_accuracy, threshold)
- Return type
Tuple
-
threshold_cust_approx
(P, x, threshold, max_scaling, resolution, strategies, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a predefined threshold for a Predictor having no inverse function. See
threshold_cust()
.
-
threshold_cust_inv
(P, x, threshold, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a desired threshold for a Predictor having an inverse function. See
threshold_cust()
.
-
train
(estimator, X, Y, train_sizes=None, test_size=0.2, n_splits=5, verbose=1, n_jobs=-1, n_samples=20, **kwargs)[source]¶ Compute the learning curve of an estimator over a dataset.
- Parameters
estimator (Object) – Must implement fit(X, Y) and predict(X) methods.
X (array) – Features to use for prediction
Y (array) – Values to be predicted
train_sizes (list) – See the sklearn learning_curve function documentation. If None, np.geomspace will be used with n_samples values (20 by default)
n_splits (int) – Number of random cross-validation splits computed for each train size
verbose (int) – The higher, the more verbose.
n_jobs (int) – See sklearn learning_curve function documentation.
n_samples (int) – If train_sizes is None, the number of train sizes to use for the learning curve.
kwargs (dict) – See sklearn learning_curve function parameters. Invalid parameters raise errors.
- Returns
The resulting object can then be passed to
plot()
function.
- Return type
Dict
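A self-contained sketch of the pipeline this method wraps, using sklearn's learning_curve with geometrically spaced train sizes (the estimator and dataset are arbitrary examples, not prescribed by the library):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, learning_curve

# Geometrically spaced train-size fractions, as train() does by default.
train_sizes = np.geomspace(0.05, 1.0, 20)

X, Y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, Y, train_sizes=train_sizes, cv=cv)

# The means/stds below are what the recorder stores and plot() consumes.
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
```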
-
learning_curves.predictor module¶
-
class
learning_curves.predictor.
Predictor
(name, func, guess, inv=None, diverging=False, bounds=None)[source]¶ Bases:
object
Object representing a function to fit a learning curve (See
learning_curves.LearningCurve
).
-
get_saturation
()[source]¶ Compute the saturation accuracy of the Predictor.
The saturation accuracy is the best accuracy you will get from the model without changing any parameter other than the training set size. If the Predictor is diverging, this value should be disregarded, as it is meaningless.
- Returns
Saturation accuracy of the Predictor.
This value is 1 if the Predictor is diverging without an inverse function; it is the first parameter of the Predictor if it is converging; it is calculated if the Predictor is diverging with an inverse function.
- Return type
float
-
learning_curves.tools module¶
-
learning_curves.tools.
get_absolute_value
(validation, len_vector)[source]¶ Convert validation (a percentage or an absolute number of samples) into an absolute number of samples for a vector of length len_vector.
-
learning_curves.tools.
get_scale
(val, floor=True)[source]¶ Returns the scale of a value.
- Parameters
val (float) – The value whose scale to compute.
floor (bool) – if True, apply np.floor to the result
Examples
get_scale(1.5e-15) > -15
get_scale(1.5e-15, False) > -14.823908740944319
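A plausible implementation consistent with the examples above (an assumption; the actual source may differ):

```python
import numpy as np

def get_scale(val, floor=True):
    # Base-10 order of magnitude of val, optionally floored to an integer.
    scale = np.log10(np.abs(val))
    return np.floor(scale) if floor else scale
```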
-
learning_curves.tools.
get_unique_list
(predictors)[source]¶ Return a list of unique predictors. Two Predictors are equal if they have the same name.
-
learning_curves.tools.
is_strictly_increasing
(L)[source]¶ Returns True if the list contains strictly increasing values.
Examples
is_strictly_increasing([0,1,2,3,4,5]) > True
is_strictly_increasing([0,1,2,2,4,5]) > False
is_strictly_increasing([0,1,2,1,4,5]) > False
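An equivalent implementation consistent with the examples above (a sketch, not necessarily the actual source):

```python
def is_strictly_increasing(L):
    # True when each element is strictly greater than its predecessor.
    return all(a < b for a, b in zip(L, L[1:]))
```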
-
learning_curves.tools.
load
(path='./lc_data.pkl')[source]¶ Load a
learning_curves.LearningCurve
object from disk.
-
learning_curves.tools.
mean_bias_error
(y_trues, y_preds)[source]¶ Computes the Mean Bias Error of two vectors.