Documentation¶
learning_curves.learning_curves module¶
-
class
learning_curves.learning_curves.
LearningCurve
(predictors=[], scoring=<function r2_score>, name=None)[source]¶ Bases:
object
-
best_predictor
(predictors='all', prefer_conv_delta=0.002, **kwargs)[source]¶ Find the Predictor having the best fit of a learning curve.
- Parameters
predictors (list(Predictor), "all") – A list of Predictors to consider.
prefer_conv_delta (float) – If the difference between the two best Predictor fit scores is lower than prefer_conv_delta, then the converging Predictor will be preferred (if any).
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The Predictor having the best fit of the learning curve.
- Return type
Predictor
-
static
compare
(lcs, fit=True, figsize=(12, 6), colors=None, what='both', fig=None, **kwargs)[source]¶ Stack learning curves on a single plot (max 10).
- Parameters
lcs (list(LearningCurve)) – List of LearningCurves to stack.
fit (bool) – If True, calls
LearningCurve.fit_all()
on all the learning curve objects.
figsize (tuple) – Dimensions of the figure.
colors (cycle, list) – Cycle of the learning curve colors. A cycler can be created as follows: cycle = cycle(‘color’, [“color1”, “color2”, …])
what ("train", "valid", "both") – curves to show
fig (Matplotlib.figure) – A figure on which the curves will be drawn. If None, a new one is created.
kwargs (dict) – Dictionary of values that will be passed to each
LearningCurve.plot()
method
- Returns
The resulting figure
- Return type
fig (Matplotlib.figure)
-
static
compare_time
(lcs, what='both', figsize=(12, 6), colors=None, **kwargs)[source]¶ Stack times of the computing of the learning curves on a single plot.
- Parameters
lcs (list(LearningCurve)) – List of LearningCurves to stack (max 10).
what (str) – Value in [“both”, “fit”, “score”]. Select the curve to show.
figsize (tuple) – Dimensions of the figure.
colors (cycle, list) – cycle of the learning curves colors. A cycler can be created as follows: cycle = cycle(‘color’, [“color1”, “color2”, …])
kwargs (dict) – Dictionary of values that will be passed to each
LearningCurve.plot_time()
method
- Returns
The resulting axes.
- Return type
ax (Matplotlib.axes)
-
eval_fitted_curve
(validation, **kwargs)[source]¶ Split the data points into two sets, fit the predictors on the first set, and evaluate them using the RMSE on the second set. See
eval_fitted_curve_cust()
- Parameters
validation (float, int) – Percentage or number of samples of the validation set (the highest training sizes will be used for validation)
kwargs (dict) – Parameters passed to
eval_fitted_curve_cust()
- Returns
The Root Mean Squared Error of the validation set against the fitted curve of the Predictor
- Return type
fit_score (float)
-
eval_fitted_curve_cust
(train_sizes_fit, test_scores_mean_fit, test_scores_std_fit, train_sizes_val, test_scores_mean_val, predictor='best', fit=True, metric=<function mean_bias_error>)[source]¶ Compute the error of a fitted curve on a validation set.
- Parameters
train_sizes_fit (array) – List of train sizes used for the fitting of the curve
test_scores_mean_fit (array) – Means computed by the estimator for the train sizes.
test_scores_std_fit (array) – Standard deviations computed by the estimator for the train sizes.
train_sizes_val (array) – List of train sizes used for scoring the fitted curve (the computation of the RMSE).
test_scores_mean_val (array) – Values computed by the estimator for the validation train sizes.
predictor (Predictor, "best") – Predictor to consider
fit (bool) – If True, perform a fit of the Predictors using the test_scores_mean_fit data points.
metric (function) – Function to use for the evaluation of the fit of the validation points.
- Returns
The score of the extrapolation using the validation set
- Return type
fit_score (float)
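The evaluation logic can be sketched with NumPy and SciPy alone; the power-law curve shape and the synthetic data below are illustrative assumptions, not the library's built-in Predictors:

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch of eval_fitted_curve: hold out the largest train sizes, fit a
# curve on the rest, and score the extrapolation with RMSE.
# The power-law shape below is an assumption, not a library built-in.
def pow_law(x, a, b, c):
    return a - b * x ** (-c)

sizes = np.array([10, 30, 100, 300, 1000, 3000], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5         # synthetic accuracies

n_val = 2                                   # validation = the 2 largest sizes
params, _ = curve_fit(pow_law, sizes[:-n_val], scores[:-n_val], maxfev=10000)
pred = pow_law(sizes[-n_val:], *params)     # extrapolated values
rmse = np.sqrt(np.mean((scores[-n_val:] - pred) ** 2))
```

Since the synthetic data follows the fitted shape exactly, the extrapolation error is near zero; real learning curves will show a larger fit_score.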
-
eval_train_sizes
()[source]¶ Compute the difference of scale between the first and last gradients of accuracies of the train_sizes.
If this number is lower than 2, it indicates that the provided training set sizes don’t cover a wide enough range of accuracy values to fit a curve. In that case, look at the generated plot to determine whether you need more points close to the minimum or the maximum training set size.
- Returns
The difference of scale between the first and last gradients of accuracies of the train_sizes
- Return type
train_size_score (float)
Example
get_train_sizes_grads([ 2, 8, …, 2599, 2824]) > 2.7156
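A minimal sketch of this heuristic, assuming "scale" means the base-10 order of magnitude of the finite-difference gradients (the sizes and scores below are synthetic):

```python
import numpy as np

# Synthetic learning curve over geometrically spaced train sizes.
sizes = np.array([2, 8, 32, 128, 512, 2048], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5

# Finite-difference gradients of the scores w.r.t. the train sizes.
grads = np.gradient(scores, sizes)

# Difference of scale (order of magnitude) between first and last gradient.
scale_diff = np.log10(grads[0]) - np.log10(grads[-1])
# A value above ~2 suggests the train sizes span a wide enough range.
```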
-
fit
(P, x, y, **kwargs)[source]¶ Fit a curve with a predictor, compute and save the score of the fit.
- Parameters
P (Predictor) – The Predictor used for the fit.
x (list) – 1D array (list) representing the training sizes
y (list) – 1D array (list) representing the test scores
kwargs (dict) – Parameters that will be forwarded to Scipy curve_fit.
- Returns
The Predictor with the updated params and score. Score will be None if a ValueError exception occurs while computing the score.
- Return type
Predictor
-
fit_all
(**kwargs)[source]¶ Fit a curve with all the predictors using the recorder data and retrieve score if y_pred is finite.
- Parameters
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
an array of predictors with the updated params and score.
- Return type
list
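The fit-and-select mechanics of fit_all() and best_predictor() can be sketched with SciPy directly; both curve shapes below are illustrative stand-ins for the library's Predictors:

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Two illustrative curve shapes standing in for Predictors.
def pow_law(x, a, b, c):
    return a - b * x ** (-c)        # converging toward a as x -> inf

def log_curve(x, a, b):
    return a + b * np.log(x)        # diverging

sizes = np.array([10, 30, 100, 300, 1000, 3000], dtype=float)
scores = 0.9 - 2.0 * sizes ** -0.5  # synthetic accuracies

# Fit every candidate and keep its r2_score, as fit_all does with the
# registered Predictors.
fit_scores = {}
for f in (pow_law, log_curve):
    params, _ = curve_fit(f, sizes, scores, maxfev=10000)
    fit_scores[f.__name__] = r2_score(scores, f(sizes, *params))

# best_predictor then simply keeps the candidate with the highest score.
best = max(fit_scores, key=fit_scores.get)
```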
-
fit_all_cust
(x, y, predictors, **kwargs)[source]¶ Fit a curve with all the predictors and retrieve score if y_pred is finite.
- Parameters
x (list) – 1D array (list) representing the training sizes
y (list) – 1D array (list) representing the test scores
predictors (list(Predictor)) – The Predictors to fit.
- Returns
an array of predictors with the updated params and score.
- Return type
list
-
get_label
(label)[source]¶ Prefix the label with the name of the LearningCurve instance.
- Parameters
label (str) – label to prefix
- Returns
label prefixed with name, if any.
- Return type
label (str)
-
get_lc
(estimator, X, Y, train_kwargs={}, **kwargs)[source]¶ Compute and plot the learning curve. See
train()
and plot()
functions for parameters.
- Parameters
estimator (Object) – Must implement fit(X, Y) and predict(X) methods.
X (array) – Features to use for prediction
Y (array) – Values to be predicted
train_kwargs (dict) – See
train()
parameters.
kwargs (dict) – See plot() parameters.
-
get_predictor
(pred)[source]¶ Get a
learning_curves.predictor
from the list of the Predictors.
-
plot
(predictor=None, figsize=(12, 6), fig=None, **kwargs)[source]¶ Plot the training and test learning curves of the recorder data, with optionally fitted functions and saturation. See
plot_cust()
- Parameters
predictor (str, list(str), Predictor, list(Predictor)) – The predictor(s) to use for plotting the fitted curve. Can also be “all” and “best”.
figsize (tuple) – Size of the figure (only taken into account if fig is None)
fig (Matplotlib.figure) – A figure on which the learning curve will be drawn. If None, a new one is created.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
A Matplotlib figure of the result.
- Raises
RuntimeError – If the recorder is empty.
-
plot_cust
(train_sizes, train_scores_mean, train_scores_std, test_scores_mean, test_scores_std, predictor=None, what='both', xlim=None, ylim=None, figsize=(12, 6), title=None, saturation=None, target=None, validation=0, close=True, uncertainty=False, fig=None, alpha=1, alpha_fit=1, std=True, **kwargs)[source]¶ Plot any training and test learning curves, with optionally fitted functions and saturation.
- Parameters
train_sizes (list) – Training sizes (x values).
train_scores_mean (list) – Train score means (y values).
train_scores_std (list) – Train score standard deviations.
test_scores_mean (list) – Test score means (y values).
test_scores_std (list) – Test score standard deviations.
predictor (str, list(str), Predictor, list(Predictor)) – The predictor(s) to use for plotting the fitted curve. Can be “all” or “best”.
what ("train", "valid", "both") – learning curves to show
xlim (tuple) – Limits of the x axis of the plot.
ylim (tuple) – Limits of the y axis of the plot.
figsize (tuple) – Size of the figure
title (str) – Title of the figure
saturation (str, list(str), Predictor, list(Predictor)) – Predictor(s) to consider for displaying the saturation on the plot. Can be “all” or “best”.
target (int) – Training size to reach. The training size axis will be extended and the fitted curve extrapolated until reaching this value.
validation (float) – Percentage or number of data points to keep for validation of the curve fitting (they will not be used during the fitting but will be displayed afterwards)
close (bool) – If True, close the figure before returning it. This is useful when many plots are created, because Matplotlib does not close them automatically, potentially leading to warnings. If False, the plot is not closed; this can be desirable in Jupyter notebooks, so that the plot is rendered in the cell output.
uncertainty (bool) – If True, plot the standard deviation of the best fitted curve for the validation data points.
fig (Matplotlib.figure) – A figure on which the learning curve will be drawn. If None, a new one is created.
alpha (float) – Controls transparency of the learning curve
alpha_fit (float) – Controls transparency of the fitted line
std (bool) – Whether to plot standard deviations of points or not.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The resulting figure.
- Return type
fig (Matplotlib.figure)
-
plot_fitted_curve
(ax, P, x, scores=True, best=False, best_ls='-.', alpha=1, **kwargs)[source]¶ Add to figure ax a fitted curve.
- Parameters
ax (Matplotlib.axes) – Axes on which the curve is drawn.
P (Predictor) – Predictor used to compute the curve.
x (array) – 1D array (list) representing the training sizes.
scores (bool) – If True, print the score of each curve fit in the legend.
best (bool) – If True, use a higher zorder to make the curve more visible.
best_ls (Matplotlib line-style) – Line style of the curve whose Predictor is used for computing the saturation accuracy.
alpha (float) – Controls the transparency of the fitted curve
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
The updated figure.
- Return type
Matplotlib axes
-
plot_time
(fig=None, what='both', figsize=(12, 6))[source]¶ Plot training sizes against fit/score computing times.
- Parameters
fig (Matplotlib.figure) – A figure on which the curves will be drawn. If None, a new one will be created.
what (str) – Value in [“both”, “fit”, “score”]. Select the curve to show.
figsize (tuple) – Dimensions of the figure (ignored if fig is not None).
- Returns
A Matplotlib figure of the result.
- Return type
fig (Matplotlib.figure)
-
save
(path=None)[source]¶ Save the LearningCurve object to disk as a pickle file, or return it as a string.
It uses the dill library to serialize the instance because the object contains lambda functions, which cannot be pickled otherwise.
- Parameters
path (str) – Path where to save the object. If None, the string representing the object is returned
-
threshold
(P='best', **kwargs)[source]¶ See
threshold_cust()
function. This function calls threshold_cust()
with the recorder data.
- Parameters
P (Predictor, string) – Predictor to use.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
(x_thresh, y_thresh, sat_val, threshold). If P is diverging, the saturation value will be 1.
- Return type
Tuple
- Raises
RuntimeError – If the recorder is empty.
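For a converging Predictor with an inverse function, the threshold computation reduces to evaluating that inverse; a sketch under an assumed curve shape:

```python
# Sketch of the threshold computation for a converging predictor that has
# an inverse function. The curve shape y = a - b * x**-c is an assumption
# used for illustration; its saturation value is the parameter a.
a, b, c = 0.9, 2.0, 0.5

def P(x):
    return a - b * x ** -c

def P_inv(y):
    return (b / (a - y)) ** (1.0 / c)   # solves P(x) = y for x

threshold = 0.99
sat_val = a                             # saturation value of the curve
y_thresh = threshold * sat_val          # accuracy to reach
x_thresh = P_inv(y_thresh)              # train size reaching that accuracy
```

Diverging Predictors without an inverse fall back to the numeric search controlled by resolution (see threshold_cust_approx()).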
-
threshold_cust
(P, x, threshold=0.99, max_scaling=1, resolution=10000.0, strategies={'max_scaling': 1, 'threshold': -0.01}, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a predefined threshold.
P(x) = y, and for x -> inf, y -> saturation value. This method approximates x_thresh such that P(x_thresh) = threshold * saturation value.
- Parameters
P (str, Predictor) – The predictor to use for the calculation of the saturation value.
x (array) – Training set sizes
threshold (float) – In [0.0, 1.0]. Percentage of the saturation value to use for the calculation of the best training set size.
max_scaling (float) – Order of magnitude added to the order of magnitude of the maximum train set size. Generally, a value of 1-2 is enough.
resolution (float) – Only considered for diverging Predictors without inverse function. The higher it is, the more accurate the value of the training set size will be.
strategies (dict) – A dictionary of the values to add / subtract to the other parameters in case a saturation value cannot be found. If a RecursionError is raised, (None, None, sat_val, threshold) will be returned.
kwargs (dict) – Parameters that will be forwarded to internal functions.
- Returns
(x_thresh, y_thresh, saturation_accuracy, threshold)
- Return type
Tuple
-
threshold_cust_approx
(P, x, threshold, max_scaling, resolution, strategies, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a predefined threshold for a Predictor having no inverse function. See
threshold_cust()
.
-
threshold_cust_inv
(P, x, threshold, **kwargs)[source]¶ Find the training set size providing the highest accuracy up to a desired threshold for a Predictor having an inverse function. See
threshold_cust()
.
-
train
(estimator, X, Y, train_sizes=None, test_size=0.2, n_splits=5, verbose=1, n_jobs=-1, n_samples=20, **kwargs)[source]¶ Compute the learning curve of an estimator over a dataset.
- Parameters
estimator (Object) – Must implement fit(X, Y) and predict(X) methods.
X (array) – Features to use for prediction
Y (array) – Values to be predicted
train_sizes (list) – See the sklearn learning_curve function documentation. If None, np.geomspace will be used with n_samples values (20 by default)
n_splits (int) – Number of random cross-validation splits computed for each train size
verbose (int) – The higher, the more verbose.
n_jobs (int) – See sklearn learning_curve function documentation.
n_samples (int) – If train_sizes is None, the number of train sizes to use for the learning curve.
kwargs (dict) – See sklearn learning_curve function parameters. Invalid parameters raise errors.
- Returns
The resulting object can then be passed to
plot()
function.
- Return type
Dict
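A self-contained sketch of the pipeline this method wraps, using sklearn's learning_curve with geometrically spaced train sizes (the estimator and dataset are arbitrary examples, not prescribed by the library):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, learning_curve

# Geometrically spaced train-size fractions, as train() does by default.
train_sizes = np.geomspace(0.05, 1.0, 20)

X, Y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, Y, train_sizes=train_sizes, cv=cv)

# The means/stds below are what the recorder stores and plot() consumes.
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
```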
-
learning_curves.predictor module¶
-
class
learning_curves.predictor.
Predictor
(name, func, guess, inv=None, diverging=False, bounds=None)[source]¶ Bases:
object
Object representing a function to fit a learning curve (See
learning_curves.LearningCurve
).
-
get_saturation
()[source]¶ Compute the saturation accuracy of the Predictor.
The saturation accuracy is the best accuracy you will get from the model without changing any parameter other than the training set size. If the Predictor is diverging, this value should be disregarded, as it is meaningless.
- Returns
Saturation accuracy of the Predictor.
This value is 1 if the Predictor is diverging without an inverse function; it is the first parameter of the Predictor if it is converging; it is calculated if the Predictor is diverging with an inverse function.
- Return type
float
-
learning_curves.tools module¶
-
learning_curves.tools.
get_absolute_value
(validation, len_vector)[source]¶ Convert validation (a percentage or an absolute number of samples) into an absolute number of samples for a vector of length len_vector.
-
learning_curves.tools.
get_scale
(val, floor=True)[source]¶ Returns the scale of a value.
- Parameters
val (float) – The value whose scale to compute.
floor (bool) – if True, apply np.floor to the result
Examples
get_scale(1.5e-15) > -15
get_scale(1.5e-15, False) > -14.823908740944319
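A plausible implementation consistent with the examples above (an assumption; the actual source may differ):

```python
import numpy as np

def get_scale(val, floor=True):
    # Base-10 order of magnitude of val, optionally floored to an integer.
    scale = np.log10(np.abs(val))
    return np.floor(scale) if floor else scale
```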
-
learning_curves.tools.
get_unique_list
(predictors)[source]¶ Return a list of unique predictors. Two Predictors are equal if they have the same name.
-
learning_curves.tools.
is_strictly_increasing
(L)[source]¶ Returns True if the list contains strictly increasing values.
Examples
is_strictly_increasing([0,1,2,3,4,5]) > True
is_strictly_increasing([0,1,2,2,4,5]) > False
is_strictly_increasing([0,1,2,1,4,5]) > False
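An equivalent implementation consistent with the examples above (a sketch, not necessarily the actual source):

```python
def is_strictly_increasing(L):
    # True when each element is strictly greater than its predecessor.
    return all(a < b for a, b in zip(L, L[1:]))
```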
-
learning_curves.tools.
load
(path='./lc_data.pkl')[source]¶ Load a
learning_curves.LearningCurve
object from disk.
-
learning_curves.tools.
mean_bias_error
(y_trues, y_preds)[source]¶ Computes the Mean Bias Error of two vectors.