Probability Score Class Docs
There is just one class to represent predictions that return probability scores:
BinaryScore: an object that represents predictions matching each observation with a probability score between 0 and 1.
Probability scores can be easily transformed into predictions by setting a threshold: observations with a score at or above it are mapped to “1”, while the remaining ones get a “0”.
BinaryScore
Class to represent probability estimates, that is, predictions that do not directly return fitted values but can be converted into them. It can be viewed as the step before BinaryPrediction.
It computes the AUC score and returns, as arrays, the other metrics that depend on the conversion threshold.
- class easypred.binary_score.BinaryScore
Bases:
object
Class to represent a prediction in terms of probability estimates, thus having each observation paired with a score between 0 and 1 representing the likelihood of being the “positive value”.
- computation_decimals
The number of decimal places to be considered when rounding probability scores to obtain the unique values.
- Type
int
- fitted_scores
The array-like object of length N containing the probability scores.
- Type
np.ndarray | pd.Series
- real_values
The array-like object containing the N real values.
- Type
np.ndarray | pd.Series
- value_positive
The value in the data that corresponds to 1 in the boolean logic. It is generally associated with the idea of “positive” or being in the “treatment” group. Defaults to 1.
- Type
Any
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.18],
...                     value_positive=1)
>>> score.real_values
array([0, 1, 1, 0, 1, 0])
>>> score.fitted_scores
array([0.31, 0.44, 0.24, 0.28, 0.37, 0.18])
>>> score.value_positive
1
>>> score.computation_decimals
3
- __init__(real_values, fitted_scores, value_positive=1)
Create a BinaryScore object to represent a prediction in terms of probability estimates.
- Parameters
real_values (np.ndarray | pd.Series | list | tuple) – The array-like object containing the real values. If not pd.Series or np.array, it will be coerced into np.array.
fitted_scores (np.ndarray | pd.Series | list | tuple) – The array-like object of length N containing the probability scores. It must have the same length as real_values. If not pd.Series or np.array, it will be coerced into np.array.
value_positive (Any) – The value in the data that corresponds to 1 in the boolean logic. It is generally associated with the idea of “positive” or being in the “treatment” group. Defaults to 1.
Examples
>>> from easypred import BinaryScore
>>> BinaryScore([0, 1, 1, 0, 1, 0], [0.31, 0.44, 0.24, 0.28, 0.37, 0.18])
<easypred.binary_score.BinaryScore object at 0x000001E8AD923430>
- property accuracy_scores: numpy.ndarray
Return an array containing the accuracy scores calculated setting the threshold for each unique score value.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.accuracy_scores
array([0.5 , 0.66666667, 0.5 , 0.66666667, 0.83333333, 0.66666667])
Note that the length of the array changes if the number of decimals used in the computation of unique values is lowered to 2. This is because 0.241 and 0.244 establish a unique threshold equal to 0.24.
>>> score.computation_decimals = 2
>>> score.accuracy_scores
array([0.5 , 0.5 , 0.66666667, 0.83333333, 0.66666667])
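The effect of the rounding step can be sketched in plain Python. This is only an illustration written for this page, not easypred's implementation: the helper name `accuracy_by_threshold` and its `decimals` argument are hypothetical stand-ins for the property and for `computation_decimals`.

```python
def accuracy_by_threshold(real, scores, value_positive=1, decimals=3):
    """Accuracy obtained when each unique (rounded) score is used as threshold.

    Hypothetical helper: it mimics the documented behaviour, it is not the
    library's own code.
    """
    # Round first, then deduplicate: scores that round to the same value
    # share a single threshold.
    thresholds = sorted({round(s, decimals) for s in scores})
    actual_positive = [r == value_positive for r in real]
    accuracies = []
    for t in thresholds:
        # A score at or above the threshold is predicted as positive.
        predicted_positive = [s >= t for s in scores]
        hits = sum(p == a for p, a in zip(predicted_positive, actual_positive))
        accuracies.append(hits / len(real))
    return thresholds, accuracies


real = [0, 1, 1, 0, 1, 0]
scores = [0.31, 0.44, 0.244, 0.28, 0.37, 0.241]

thresholds_3, acc_3 = accuracy_by_threshold(real, scores, decimals=3)
thresholds_2, acc_2 = accuracy_by_threshold(real, scores, decimals=2)
print(len(acc_3), len(acc_2))  # 6 5
```

With three decimals every score is its own threshold; with two, 0.241 and 0.244 both become 0.24, so one entry disappears from the array.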
- property auc_score: float
Return the Area Under the Receiver Operating Characteristic Curve (ROC AUC).
It is computed using pairs properties as: (Nc + 0.5 * Nt) / Ntot, where Nc is the number of concordant pairs, Nt is the number of tied pairs and Ntot is the total number of pairs.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.24],
...                     value_positive=1)
>>> score.auc_score
0.7222222222222222
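The value in the example can be checked by hand with a pair-counting sketch. The function below is a hypothetical plain-Python illustration, not the library's code; it counts each tied pair as half a concordant one, which is the convention that reproduces the 0.7222 shown above.

```python
from itertools import product


def auc_from_pairs(real, scores, value_positive=1):
    """ROC AUC from concordant and tied (positive, negative) pairs.

    Hypothetical helper written for illustration only.
    """
    positives = [s for r, s in zip(real, scores) if r == value_positive]
    negatives = [s for r, s in zip(real, scores) if r != value_positive]
    concordant = tied = 0
    # Every positive observation is compared with every negative one.
    for pos, neg in product(positives, negatives):
        if pos > neg:
            concordant += 1  # the positive is ranked above the negative
        elif pos == neg:
            tied += 1  # same score: counted as half a concordant pair
    total = len(positives) * len(negatives)
    return (concordant + 0.5 * tied) / total


auc = auc_from_pairs([0, 1, 1, 0, 1, 0],
                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.24])
print(round(auc, 4))  # 0.7222
```

Here there are 9 pairs in total, 6 concordant and 1 tied, hence (6 + 0.5) / 9.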
- best_threshold(criterion='f1')
Return the threshold to convert scores into values that performs the best given a specified criterion.
- Parameters
criterion (str, optional) – The value to be maximized by the threshold. It defaults to “f1”. The options are:
- “f1”: maximize the f1 score
- “accuracy”: maximize the accuracy score
- Returns
The threshold that maximizes the indicator specified.
- Return type
float
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.best_threshold(criterion="f1")
0.37
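A rough idea of what the search involves: evaluate the criterion at every candidate threshold and keep the one with the highest value. The sketch below does this for the F1 score in plain Python; `f1_at` and `best_threshold_f1` are hypothetical names, and taking the first maximum on ties is an assumption rather than documented behaviour.

```python
def f1_at(real, scores, threshold, value_positive=1):
    """F1 score obtained when scores >= threshold are predicted as positive."""
    tp = fp = fn = 0
    for r, s in zip(real, scores):
        predicted = s >= threshold
        actual = r == value_positive
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score.
    return 2 * tp / denominator if denominator else 0.0


def best_threshold_f1(real, scores, value_positive=1):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(real, scores, t, value_positive))


real = [0, 1, 1, 0, 1, 0]
scores = [0.31, 0.44, 0.244, 0.28, 0.37, 0.241]
best = best_threshold_f1(real, scores)
print(best)  # 0.37
```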
- property c_score: float
Return the Kendall tau-a, computed as the difference between the number of concordant and discordant pairs, divided by the number of combinations of pairs.
- Returns
Kendall tau-a.
- Return type
float
References
https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Tau-a
- describe()
Return a dataframe containing some key information about the prediction.
Examples
>>> real = [0, 1, 1, 0, 1, 0]
>>> fit = [0.31, 0.44, 0.24, 0.28, 0.37, 0.18]
>>> from easypred import BinaryScore
>>> score = BinaryScore(real, fit, value_positive=1)
>>> score.describe()
                        Value
N                    6.000000
Max fitted score     0.440000
AUC score            0.777778
Max accuracy         0.833333
Thresh max accuracy  0.370000
Max F1 score         0.800000
Thresh max F1 score  0.370000
- Return type
pandas.core.frame.DataFrame
- property f1_scores: numpy.ndarray
Return an array containing the f1 scores calculated setting the threshold for each unique score value.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.f1_scores
array([0.66666667, 0.75 , 0.57142857, 0.66666667, 0.8 , 0.5 ])
Note that the length of the array changes if the number of decimals used in the computation of unique values is lowered to 2. This is because 0.241 and 0.244 establish a unique threshold equal to 0.24.
>>> score.computation_decimals = 2
>>> score.f1_scores
array([0.66666667, 0.57142857, 0.66666667, 0.8 , 0.5 ])
- property false_positive_rates: numpy.ndarray
Return an array containing the false positive rates calculated setting the threshold for each unique score value.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.false_positive_rates
array([1. , 0.66666667, 0.66666667, 0.33333333, 0. , 0. ])
Note that the length of the array changes if the number of decimals used in the computation of unique values is lowered to 2. This is because 0.241 and 0.244 establish a unique threshold equal to 0.24.
>>> score.computation_decimals = 2
>>> score.false_positive_rates
array([1. , 0.66666667, 0.33333333, 0. , 0. ])
- property goodmankruskagamma_score: float
Return the Goodman and Kruskal’s gamma, computed as the ratio between the difference and the sum of the number of concordant and discordant pairs.
- Returns
Goodman and Kruskal’s gamma.
- Return type
float
References
- property kendalltau_score: float
Return the Kendall tau-a, computed as the difference between the number of concordant and discordant pairs, divided by the number of combinations of pairs.
- Returns
Kendall tau-a.
- Return type
float
References
https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient#Tau-a
- pairs_count(relative=False)
Return a dataframe containing the count of concordant, discordant, tied and total pairs.
- Parameters
relative (bool, optional) – If True, return the relative share of each type of pair instead of the absolute count. Defaults to False.
- Returns
A dataframe of shape (4, 1) containing in one column the information about concordant, discordant, tied and total pairs.
- Return type
pd.DataFrame
Examples
>>> real = [1, 0, 0, 1, 0]
>>> fit = [0.81, 0.31, 0.81, 0.73, 0.45]
>>> from easypred import BinaryScore
>>> score = BinaryScore(real, fit, value_positive=1)
>>> score.pairs_count()
            Count
Concordant      4
Discordant      1
Tied            1
Total           6
>>> score.pairs_count(relative=True)
            Percentage
Concordant    0.666667
Discordant    0.166667
Tied          0.166667
Total         1.0
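The counts in this example can be reproduced by pairing each positive observation with each negative one and comparing their scores. The following is a minimal plain-Python sketch under that assumption; `count_pairs` is a hypothetical helper that returns a dict rather than a dataframe.

```python
def count_pairs(real, scores, value_positive=1):
    """Count concordant, discordant and tied (positive, negative) pairs.

    Hypothetical helper for illustration; not the library's implementation.
    """
    positives = [s for r, s in zip(real, scores) if r == value_positive]
    negatives = [s for r, s in zip(real, scores) if r != value_positive]
    counts = {"Concordant": 0, "Discordant": 0, "Tied": 0}
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                counts["Concordant"] += 1  # positive scored higher
            elif pos < neg:
                counts["Discordant"] += 1  # negative scored higher
            else:
                counts["Tied"] += 1  # identical scores
    counts["Total"] = len(positives) * len(negatives)
    return counts


counts = count_pairs([1, 0, 0, 1, 0], [0.81, 0.31, 0.81, 0.73, 0.45])
print(counts)
# {'Concordant': 4, 'Discordant': 1, 'Tied': 1, 'Total': 6}

# Relative version: each count divided by the total number of pairs.
relative = {name: n / counts["Total"] for name, n in counts.items()}
```

With 2 positives and 3 negatives there are 6 pairs; the only tie is the positive and the negative that both score 0.81.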
- plot_metric(metric, figsize=(20, 10), show_legend=True, title_size=14, axes_labels_size=12, ax=None, **kwargs)
Plot the variation for one or more metrics given different values for the threshold telling “1s” from “0s”.
- Parameters
metric (Metric function | list[Metric functions]) – A function from easypred.metrics or a list of such functions. It defines which values are to be plotted.
figsize (tuple[int, int], optional) – Tuple of integers specifying the size of the plot. Default is (20, 10).
show_legend (bool, optional) – If True, show the plot’s legend. Defaults to True.
title_size (int, optional) – Font size of the plot title. Default is 14.
axes_labels_size (int, optional) – Font size of the axes labels. Default is 12.
ax (matplotlib Axes, optional) – Axes object to draw the plot onto, otherwise creates new Figure and Axes. Use this option to further customize the plot.
kwargs (key, value mappings) – Other keyword arguments to be passed through to matplotlib.pyplot.hist().
- Returns
Matplotlib Axes object with the plot drawn on it.
- Return type
matplotlib Axes
Examples
With one metric
>>> real = [0, 1, 1, 0, 1, 0]
>>> fit = [0.31, 0.44, 0.73, 0.28, 0.37, 0.18]
>>> from easypred import BinaryScore
>>> score = BinaryScore(real, fit, value_positive=1)
>>> from easypred.metrics import accuracy_score
>>> score.plot_metric(metric=accuracy_score)
<AxesSubplot:title={'center':'accuracy_score given different thresholds'}, xlabel='Threshold', ylabel='Metric value'>
>>> from matplotlib import pyplot as plt
>>> plt.show()
Adding a second metric
>>> from easypred.metrics import f1_score
>>> score.plot_metric(metric=[accuracy_score, f1_score])
<AxesSubplot:title={'center':'accuracy_score & f1_score given different thresholds'}, xlabel='Threshold', ylabel='Metric value'>
>>> plt.show()
- plot_roc_curve(figsize=(20, 10), plot_baseline=True, show_legend=True, title_size=14, axes_labels_size=12, ax=None, **kwargs)
Plot the ROC curve for the score. This curve depicts the True Positive Rate (Recall score) against the False Positive Rate.
- Parameters
figsize (tuple[int, int], optional) – Tuple of integers specifying the size of the plot. Default is (20, 10).
plot_baseline (bool, optional) – If True, a reference straight line with slope 1 is added to the plot, representing the performance of a random classifier. Defaults to True.
title_size (int, optional) – Font size of the plot title. Default is 14.
axes_labels_size (int, optional) – Font size of the axes labels. Default is 12.
ax (matplotlib Axes, optional) – Axes object to draw the plot onto, otherwise creates new Figure and Axes. Use this option to further customize the plot.
kwargs (key, value mappings) – Other keyword arguments to be passed through to matplotlib.pyplot.plot().
show_legend (bool, optional) – If True, show the plot’s legend. Defaults to True.
- Returns
Matplotlib Axes object with the plot drawn on it.
- Return type
matplotlib Axes
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.plot_roc_curve()
<AxesSubplot:title={'center':'ROC Curve'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
>>> from matplotlib import pyplot as plt
>>> plt.show()
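The points that make up such a curve are (false positive rate, true positive rate) pairs, one per threshold. As a rough plain-Python sketch of how they could be derived for the data above (the helper `roc_points` is hypothetical and does no plotting):

```python
def roc_points(real, scores, value_positive=1):
    """Return one (FPR, TPR) pair per unique score used as threshold.

    Hypothetical helper for illustration; thresholds are visited in
    ascending order, so the curve is traced from (1, 1) towards (0, 0).
    """
    n_pos = sum(r == value_positive for r in real)
    n_neg = len(real) - n_pos
    points = []
    for t in sorted(set(scores)):
        # Observations scoring at or above the threshold are predicted positive.
        tp = sum(s >= t and r == value_positive for r, s in zip(real, scores))
        fp = sum(s >= t and r != value_positive for r, s in zip(real, scores))
        points.append((fp / n_neg, tp / n_pos))
    return points


points = roc_points([0, 1, 1, 0, 1, 0],
                    [0.31, 0.44, 0.244, 0.28, 0.37, 0.241])
fprs = [fpr for fpr, _ in points]
tprs = [tpr for _, tpr in points]
print(points[0], points[-1])
```

For this data the two lists agree with the arrays shown under false_positive_rates and recall_scores.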
- plot_score_histogram(figsize=(20, 10), title_size=14, axes_labels_size=12, ax=None, **kwargs)
Plot the histogram of the probability scores.
- Parameters
figsize (tuple[int, int], optional) – Tuple of integers specifying the size of the plot. Default is (20, 10).
title_size (int, optional) – Font size of the plot title. Default is 14.
axes_labels_size (int, optional) – Font size of the axes labels. Default is 12.
ax (matplotlib Axes, optional) – Axes object to draw the plot onto, otherwise creates new Figure and Axes. Use this option to further customize the plot.
kwargs (key, value mappings) – Other keyword arguments to be passed through to matplotlib.pyplot.hist().
- Returns
Matplotlib Axes object with the plot drawn on it.
- Return type
matplotlib Axes
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.plot_score_histogram()
<AxesSubplot:title={'center':'Fitted Scores Distribution'}, xlabel='Fitted Scores', ylabel='Frequency'>
>>> from matplotlib import pyplot as plt
>>> plt.show()
Passing keyword arguments to matplotlib’s hist function:
>>> score.plot_score_histogram(bins=10)
<AxesSubplot:title={'center':'Fitted Scores Distribution'}, xlabel='Fitted Scores', ylabel='Frequency'>
- property recall_scores: numpy.ndarray
Return an array containing the recall scores calculated setting the threshold for each unique score value.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.recall_scores
array([1. , 1. , 0.66666667, 0.66666667, 0.66666667, 0.33333333])
Note that the length of the array changes if the number of decimals used in the computation of unique values is lowered to 2. This is because 0.241 and 0.244 establish a unique threshold equal to 0.24.
>>> score.computation_decimals = 2
>>> score.recall_scores
array([1. , 0.66666667, 0.66666667, 0.66666667, 0.33333333])
- score_to_values(threshold=0.5)
Return an array containing the fitted values derived from the scores on the basis of the provided threshold.
- Parameters
threshold (float, optional) – The minimum value such that the score is translated into value_positive. Any score below the threshold is instead associated with the other value. By default 0.5.
- Returns
The array containing the inferred fitted values. Its type matches fitted_scores’ type.
- Return type
np.ndarray | pd.Series
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.24],
...                     value_positive=1)
>>> score.score_to_values(threshold=0.6)
array([0, 0, 0, 0, 0, 0])
>>> score.score_to_values(threshold=0.31)
array([1, 1, 0, 0, 1, 0])
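The rule described here is a simple comparison: a score at or above the threshold becomes the positive value, anything below becomes the other value. A minimal plain-Python sketch of that rule follows; the function `scores_to_values` and its `value_negative` argument are hypothetical and return a list instead of an array.

```python
def scores_to_values(scores, threshold=0.5, value_positive=1, value_negative=0):
    """Map each score to value_positive if it reaches the threshold.

    Hypothetical helper for illustration; the threshold itself counts
    as positive (score >= threshold).
    """
    return [value_positive if s >= threshold else value_negative
            for s in scores]


scores = [0.31, 0.44, 0.24, 0.28, 0.37, 0.24]
print(scores_to_values(scores, threshold=0.6))   # [0, 0, 0, 0, 0, 0]
print(scores_to_values(scores, threshold=0.31))  # [1, 1, 0, 0, 1, 0]
```

Note that the first observation, whose score equals the threshold of 0.31, is mapped to the positive value.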
- property somersd_score: float
Return the Somers’ D score, computed as the difference between the number of concordant and discordant pairs, divided by the total number of pairs.
Also called: Gini coefficient.
- Returns
Somers’ D score.
- Return type
float
References
https://en.wikipedia.org/wiki/Somers%27_D#Somers’_D_for_binary_dependent_variables
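As a quick numeric check of this definition, the sketch below computes the statistic from pair counts and compares it with the AUC built from the same counts. Both functions are hypothetical helpers; the counts (4 concordant, 1 discordant, 1 tied) are the ones shown in the pairs_count example, and the AUC here assumes tied pairs count as half.

```python
def somers_d(concordant, discordant, tied):
    """(concordant - discordant) / total number of pairs."""
    total = concordant + discordant + tied
    return (concordant - discordant) / total


def auc_from_counts(concordant, discordant, tied):
    """AUC with each tied pair counted as half a concordant pair."""
    total = concordant + discordant + tied
    return (concordant + 0.5 * tied) / total


d = somers_d(4, 1, 1)
auc = auc_from_counts(4, 1, 1)
print(d, auc)  # 0.5 0.75

# Since total = concordant + discordant + tied, D always equals 2 * AUC - 1,
# which is the usual definition of the Gini coefficient of a classifier.
assert abs(d - (2 * auc - 1)) < 1e-12
```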
- to_binary_prediction(threshold=0.5)
Create an instance of BinaryPrediction from the BinaryScore object.
- Parameters
threshold (float | str, optional) –
If float, it is the minimum value such that the score is translated into value_positive. Any score below the threshold is instead associated with the other value. If str, the threshold is automatically set such that it maximizes the metric corresponding to the provided keyword. The available keywords are:
- “f1”: maximize the f1 score
- “accuracy”: maximize the accuracy score
By default 0.5.
- Returns
An object of type BinaryPrediction, a subclass of Prediction specific for predictions with just two outcomes. The class instance is given the special attribute “threshold” that returns the threshold used in the conversion.
- Return type
BinaryPrediction
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.244, 0.28, 0.37, 0.241],
...                     value_positive=1)
>>> score.to_binary_prediction(threshold=0.37)
<easypred.binary_prediction.BinaryPrediction object at 0x000001E8C813FAF0>
- property unique_scores: VectorPdNp
Return the unique values attained by the fitted scores, sorted in ascending order.
- Returns
The array containing the sorted unique values. Its type matches fitted_scores’ type.
- Return type
np.ndarray | pd.Series
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.24],
...                     value_positive=1)
>>> score.unique_scores
array([0.24, 0.28, 0.31, 0.37, 0.44])
- property value_negative: Any
Return the value in the data that is not the positive value.
Examples
>>> from easypred import BinaryScore
>>> score = BinaryScore([0, 1, 1, 0, 1, 0],
...                     [0.31, 0.44, 0.24, 0.28, 0.37, 0.18],
...                     value_positive=1)
>>> score.value_negative
0