API Reference
constrata_core_credit.preprocessing
A module for loading and pre-processing data
- preprocessing.map_categoricals_to_strings(data)
Map categorical values to strings.
The mapping follows a deterministic process in which NaNs, strings and numerical values are mapped to string versions of integers.
A bidirectional dictionary is also returned that describes the mapping between the original and transformed values.
- Parameters
data (pandas Series) – The data to be transformed, of shape (num_samples,).
- Return cat_data
The data transformed to categorical values (string values), of shape (num_samples,). Note: cat_data.name == None.
- Return type
pandas Series
- Return var_to_cat_dict
The bidirectional dictionary of the mapping between the original and transformed variables.
- Return type
bidirectional dictionary
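As an illustration, the behaviour described above can be sketched in a few lines of pandas. This is a hypothetical re-implementation, not the library's actual code; in particular, the first-seen code order and the layout of the returned bidirectional dictionary are assumptions:

```python
import pandas as pd

def map_categoricals_to_strings(data):
    """Deterministically map each unique value (including NaN) to a string integer."""
    mapping = {}

    def code(value):
        key = '__nan__' if pd.isna(value) else value
        if key not in mapping:
            mapping[key] = str(len(mapping))  # first-seen order -> '0', '1', ...
        return mapping[key]

    cat_data = data.map(code)
    cat_data.name = None  # cat_data.name == None, as documented
    # A simple two-way ('bidirectional') mapping between original and transformed values.
    var_to_cat_dict = {'forward': dict(mapping),
                       'inverse': {v: k for k, v in mapping.items()}}
    return cat_data, var_to_cat_dict

s = pd.Series([float('nan'), 'a', 3, 'a'], name='var1')
cat, mapping = map_categoricals_to_strings(s)
print(list(cat))  # ['0', '1', '2', '1']
```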
- preprocessing.split_dataset(data_set, split_fractions, shuffle=True)
Split a dataset into disjoint subsets, for example into training, validation and testing sets.
- Parameters
data_set (pandas dataframe) – The dataset to be split.
split_fractions (float list) – The fractions to allocate to each of the subsets (e.g. [0.7, 0.2, 0.1]).
shuffle (bool) – Whether or not to shuffle the data before splitting. Default: shuffle=True.
- Returns
A tuple of dataframes, one element for each subset resulting from the split.
- Return type
A tuple of dataframes
Example:
>>> data_df = pd.DataFrame([[0, 0.1, 0], [1, 0.2, 0], [2, 0.3, 0], [3, 0.21, 1], [4, 0.25, 0],
...                         [5, 0.35, 1], [6, 0.39, 0], [7, 0.11, 0], [8, 0.15, 0], [9, 0.18, 1]],
...                        columns=['user_id', 'var1', 'targets'])
>>> data_split_fractions = [0.6, 0.2, 0.2]
>>> train_df, validation_df, test_df = preprocessing.split_dataset(data_df, data_split_fractions, shuffle=True)
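The mechanics of such a split can be sketched in plain pandas. This is an illustrative sketch only; the library's shuffling and rounding behaviour may differ:

```python
import numpy as np
import pandas as pd

def split_dataset(data_set, split_fractions, shuffle=True):
    """Split a dataframe row-wise into disjoint subsets according to split_fractions."""
    df = data_set.sample(frac=1, random_state=0) if shuffle else data_set
    # Cumulative row boundaries, e.g. [0.6, 0.2, 0.2] -> [0, 6, 8, 10] for 10 rows.
    bounds = (np.cumsum([0.0] + list(split_fractions)) * len(df)).round().astype(int)
    return tuple(df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]))

df = pd.DataFrame({'x': range(10)})
train, val, test = split_dataset(df, [0.6, 0.2, 0.2], shuffle=False)
print(len(train), len(val), len(test))  # 6 2 2
```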
- preprocessing.to_float(value)
If possible, convert value to float, otherwise return the original.
Example: '-1' is returned as -1.0; 'abc' is returned as 'abc'.
- Parameters
value (string) – The value that needs to be converted.
- Returns
The converted value.
- Return type
string or float
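A minimal sketch of the described behaviour (illustrative, not the library's actual implementation):

```python
def to_float(value):
    """Convert value to float if possible, otherwise return it unchanged."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value

print(to_float('-1'))   # -1.0
print(to_float('abc'))  # abc
```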
constrata_core_credit.rollrates
A module for calculating the change in delinquency states (known as roll rates) of clients
- rollrates.calc_delinquency_state_transition(dpd_df, first_date, last_date, early_mob=12, late_mob=45, min_mob=None, del_breaks=None, remove_delinquent_at_mob0=False)
Calculate the delinquency state transition matrix between an 'early' and a 'late' period defined as:
'early': 0 <= MOB < early_mob
'late': early_mob <= MOB < late_mob
The maximum delinquency is calculated for both periods and classified into discrete states defined by the upper limits in del_breaks.
- Parameters
dpd_df (dataframe) – The delinquency table with columns ['ID', 'DATE', 'DPD']. The date column must be a datetime object.
first_date (str) – Only consider accounts opened on or after this date.
last_date (str) – Only consider accounts opened before this date.
early_mob (int) – The time period (months on book) in which the initial delinquency state is calculated.
late_mob (int) – The time period (months on book) in which the subsequent delinquency state is calculated.
min_mob (int) – The minimum number of months on book for an account to be used.
del_breaks (numeric iterable) – Delinquency breaks. These values define the boundaries between different delinquency buckets (left-inclusive).
remove_delinquent_at_mob0 (bool) – Set True to remove all accounts which had non-zero delinquency at first month on book. Default: remove_delinquent_at_mob0=False.
- Returns
Dataframe of state transition counts.
- Return type
dataframe
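The core idea, taking the maximum DPD per account in each period, bucketing it by del_breaks, and cross-tabulating the two states, can be illustrated on a toy table. This is a simplified sketch that assumes a precomputed MOB (months on book) column, whereas the actual function derives it from DATE:

```python
import numpy as np
import pandas as pd

# Toy delinquency table with months on book (MOB) already computed.
dpd_df = pd.DataFrame({'ID':  [1, 1, 2, 2],
                       'MOB': [0, 13, 0, 13],
                       'DPD': [0, 35, 0, 0]})
early_mob, late_mob = 12, 45
del_breaks = [1, 31, 61, 91]  # left-inclusive bucket boundaries

def max_state(frame):
    """Max DPD per account, bucketed into a discrete delinquency state index."""
    return frame.groupby('ID')['DPD'].max().map(lambda d: int(np.digitize(d, del_breaks)))

early = max_state(dpd_df[dpd_df['MOB'] < early_mob])
late = max_state(dpd_df[(dpd_df['MOB'] >= early_mob) & (dpd_df['MOB'] < late_mob)])
transitions = pd.crosstab(early, late, rownames=['early'], colnames=['late'])
print(transitions)
```

Here account 1 rolls from state 0 ('0 days') to state 2 ('31-60 days'), while account 2 stays in state 0.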
- rollrates.relative_delta_in_months(last, first)
Calculate the number of months from the first to the last date.
- Parameters
last (datetime object) – Last date in the time period considered.
first (datetime object) – First date in the time period considered.
- Returns
The difference between the first and last dates, in months.
- Return type
int
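One common definition of this month delta, counting calendar-month boundaries, can be sketched as follows (illustrative; the library's exact convention may differ):

```python
from datetime import datetime

def relative_delta_in_months(last, first):
    """Number of calendar months from first to last (an integer, not a datetime)."""
    return (last.year - first.year) * 12 + (last.month - first.month)

print(relative_delta_in_months(datetime(2021, 3, 1), datetime(2020, 12, 1)))  # 3
```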
constrata_core_credit.rollrate_widgets
A module containing visual tools (IPython Widgets) for roll rates
- class rollrate_widgets.RollRateWidget(days_past_due_df)
Bases: object
Class for calculating and displaying roll rates.
It requires a dataframe as input with the following required columns:
'ID' (str or int): Any unique user ID.
'DATE' (str): A date sequence at which times the Days Past Due (DPD) are determined, typically at the beginning of the month. The format of the date should be 'year/month/day', for example '2020/04/19'.
'DPD' (int): The number of days past the payment due date.
Given this information the roll rates are calculated for fixed buckets:
'0 days'
'1-30 days'
'31-60 days'
'61-90 days'
'90+ days'
- Parameters
days_past_due_df (pandas dataframe) – A dataframe of the client information.
- update_duration(duration)
Callback for the duration widget.
Update due to a change in the 'Months Included' widget.
- Parameters
duration (int) – Value returned by the Months Included widget, measured in months.
- update_init_mob(init_mob)
Callback for the init_mob widget.
Update due to a change in the initial months on book.
- Parameters
init_mob (str) – Value returned by the Initial MOB widget, in months (e.g. '6 months').
- update_ref_mob(ref_mob)
Callback for the ref_mob widget. Update due to a change in the Subsequent MOB.
- Parameters
ref_mob (str) – Value returned from the Subsequent MOB widget, in months (e.g. '6 months').
- update_year(year)
Callback for the Initial Date widget.
Update due to a change in the initial year.
- Parameters
year (datetime object) – Date returned by the Initial Date widget.
constrata_core_credit.weights_of_evidence
A module containing Weights of Evidence (WOE) tools
- class weights_of_evidence.WOEBinning(var_type, bin_dict=None, protect_bins=False)
Bases: object
A WOE (weights of evidence) binning.
This class provides the following:
Given the values of a single variable, the targets (labels), and the data type of the variable, bins are automatically calculated together with their weights of evidence. 'numerical', 'categorical', 'ordinal', or 'mixed' dtypes are allowed.
- If a variable has only float or int values, use dtype='numerical'.
- In the case of a mix of string and numerical values, use dtype='mixed'. String values are treated as categorical.
- If a variable has only str values, use dtype='categorical'.
- If a variable has ordered str values, use dtype='ordinal'. In this case the order must be provided.
- The class can be initialised with a pre-calculated bin_dict. If bin_dict is not None, all other parameters are ignored.
- If bin_dict does not have WOE values, the fit_woe method has to be called. It calculates the WOE values for the given binning using the training data (values and targets) provided through the fit method.
- Once the bins and the WOE are available, the transform_woe method can be used to transform the features to WOE values.
- Parameters
var_type (dict) – Specifies the dtype of the variable and, if the dtype is 'ordinal', the ordinal sequence.
bin_dict (dict) – Specifies the binning dictionary. Optional.
protect_bins (dict) – A dictionary with the protected bin values.
Examples:
If only the bins are given:
>>> bin_dict = {'categorical': {'keys': []}, 'numerical': {'keys': []}, 'dtype': str, 'ord_order': list}
If both the bins and WOE are given:
>>> bin_dict = {'categorical': {'keys': [], 'woe_vals': [], 'labels': [], 'bad_rate': []},
...             'numerical': {'keys': [], 'woe_vals': [], 'labels': [], 'bad_rate': []},
...             'iv': '',
...             'dtype': str,
...             'ord_order': list}
Examples of var_type:
>>> var_type = {'dtype': 'numerical'}    # For a numerical data type
>>> var_type = {'dtype': 'mixed'}        # For a mixed data type
>>> var_type = {'dtype': 'categorical'}  # For a categorical data type
>>> var_type = {'dtype': 'ordinal', 'ord_order': ['a', 'b', 'c']}  # Providing the order as well
- property bin_dict
Return all the binning information for both the categorical and numerical bins.
- Return bin_dict
The binning dictionary.
- Return type
dict
- property breaks
Return the numerical bin breaks and the categorical bins.
- Return breaks
The break and bin values.
- Return type
dict[str, list]
- property categorical_bins
Return only the categorical binning information.
- Return cat_bins
The categorical binning information.
- Return type
dict
- property categories
Return the categories, i.e. the unique categorical values. Note that the categories can be combined to form the bins.
- Return categories
A list of the categories.
- Return type
list of strings
- property counts
Return the counts for all the bins.
- Return counts
The counts.
- Return type
dict
- fit_autobinning(features, targets, sample_weights=None)
Given the features and targets, the binning and WOE are calculated and the internal state of the object is updated with this information. The existing internal state is overwritten.
- Parameters
features (array-like) – The features.
targets (array-like) – The binary (0 or 1) target values (labels).
sample_weights (pandas Series) – Optional. Individual weights for each sample in features.
- fit_woe(features, targets, sample_weights=None)
Fit the features and targets to the existing bins.
The existing bins are used. Only the WOE values are estimated from the data.
- Parameters
features (array-like) – The features.
targets (array-like) – The binary (0 for 'good' or 1 for 'bad') target values.
sample_weights (pandas Series) – Optional. Individual weights for each sample in features.
- get_woe()
Return the WOE values for all the bins.
- Return woe
The WOE values.
- Return type
dict
- get_woe_dict()
Return the WOE dictionary.
- Return woe_dict
The WOE dictionary.
- Return type
dict
- property labels
Return the labels of both the categorical and numerical bins. Used as the x-axis in the plotting.
- Return labels
The labels.
- Return type
list
- property sample_weights
Return the sample weights.
- Return sample_weights
The sample weights.
- Return type
pandas Series
- set_manual_override(state)
Set whether bin boundaries have been manually overridden.
- Parameters
state (bool) – Set whether bins have been manually modified after being protected. True indicates modification.
- set_protected_bins(categorical_bins=None, numerical_breaks=None)
Set protected bin boundaries.
Sets new bin boundaries and/or categorical bins, and sets their status to protected. This is useful when you want to explicitly enforce particular binning boundaries on a particular variable.
Note: This forces an update of the WOE values, which will use the new bin boundaries.
- Parameters
categorical_bins (list[str]) – The new categorical bins to assign to the WOE binning.
numerical_breaks (list[float]) – The new numerical breaks to assign to the WOE binning.
- transform_woe(features, var_name=None)
Transform features to WOE values.
- Parameters
features (array-like) – The features to be transformed, with shape (num_samples,).
var_name (str) – Name of the variable to be transformed.
- Return woe_features
The WOE-transformed features.
- Return type
array-like
- update_categorical_bins(categorical_bins)
Update the categorical bins in response to a widget change. The internal state of the categorical bins is updated, the WOE for the new categorical bins is calculated, and the internal state is updated.
- Parameters
categorical_bins (list of lists) – A list of the new categorical bins.
Example:
In the code example below, there are 3 categorical values 'a', 'b', 'c'. 'a' is in a bin by itself; 'b' and 'c' are grouped together in the same bin.
>>> categorical_bins = [['a'], ['b', 'c']]
- update_numerical_breaks(breaks)
Update the numerical breaks and recalculate the WOE values.
The breaks are the numerical values separating the numerical bins.
- Parameters
breaks (list[float]) – The new breaks.
- class weights_of_evidence.WOETransformer(woe_dict)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
A WOE (weights of evidence) transformer.
Transforms variable values to WOE values.
If the transformer is instantiated with a WOE dictionary that contains WOE values, the transform method will use these values to do the transform.
If the transformer is instantiated with a dictionary that contains only the binning information (see the example below), then the corresponding WOE values must be trained using the fit() method.
The binning objects also have methods that allow one to obtain the binning information and calculate the WOE given a new instance (features and labels) of the variable using the existing, internal binning information.
This class supports the following methods:
fit(features_df, targets): Fit the given features to the existing binning.
transform(features_df): Transform the given features to their WOE values.
get_info_vals(): Return a dictionary with the information values for each variable.
save_woe_to_json(filename): Save the binning objects to a JSON file named filename.
- Parameters
woe_dict (dict) – Dictionary of WOE values.
Example:
>>> woe_dict = {'var1': {'categorical': {'keys': [['missing'], ['a']],
...                                      'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'numerical': {'keys': [2.0],
...                                    'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'iv': '',
...                      'dtype': {'dtype': 'categorical', 'ord_order': None}},
...             'var2': {'categorical': {'keys': [['0'], ['1'], ['missing'], ['2']],
...                                      'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'numerical': {'keys': [],
...                                    'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'iv': '',
...                      'dtype': {'dtype': 'mixed', 'ord_order': None}},
...             'var3': {'categorical': {'keys': [['0'], ['1'], ['missing'], ['2']],
...                                      'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'numerical': {'keys': [],
...                                    'woe_vals': [], 'labels': [], 'bad_rate': []},
...                      'iv': '',
...                      'dtype': {'dtype': 'ordinal', 'ord_order': ['a', 'b', 'c']}}}
- fit(features_df, targets, sample_weights=None)
Fit the WOE transformer to data.
woe_dict must contain binning information for the features in features_df, or a ValueError is raised.
The binning specified by the internal state is used. The WOE values are calculated for these bins and passed to the internal state of the class.
- Parameters
features_df (dataframe) – The input features, where each column is a variable and each row is a sample.
targets (array-like) – The corresponding (binary) targets with shape (num_samples,). A value of 0 indicates a 'good' sample. A value of 1 indicates a 'bad' sample.
sample_weights (pandas Series) – Optional. Individual weights for each sample in features.
- Raises
ValueError – Raised if woe_dict does not contain binning information for one or more features in features_df.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
- get_info_vals()
Collect the Information Value for each variable.
- Returns
The information value for each feature.
- Return type
dict
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- get_woe_lookup()
Return the WOE lookup dictionary.
- Return woe_lookup
The WOE lookup dictionary.
- Return type
dict
- save_woe_to_json(filename)
Save WOE information to a JSON file. Note that this also includes variable and binning information.
- Parameters
filename (str) – Name of the output file where the dictionary is written to.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- transform(features_df)
Apply the WOE transformation to a dataset. All the variables in the dataset will be transformed.
- Parameters
features_df (pandas dataframe) – The input features with shape (num_samples, num_features).
- Returns
The WOE-transformed features with shape (num_samples, num_features).
- Return type
pandas dataframe
- Raises
ValueError – Raised if woe_dict does not contain binning information for one or more features in features_df.
- weights_of_evidence.autobin(data, labels, vars_type, sample_weights=None)
Perform autobinning for all variables in the data dataframe.
- Parameters
data (pandas dataframe) – A dataframe of the data that needs to be binned.
labels (array-like) – The binary labels, good=0, bad=1.
vars_type (dict) – A dictionary where the keys are the names of the variables and the values specify the dtype: 'categorical', 'ordinal', 'numerical', and 'mixed' are allowed.
sample_weights (pandas Series) – Optional. Individual weights for each sample in features.
- Returns
A dictionary of binning objects.
- Return type
dict of WOEBinning objects
- weights_of_evidence.breaks_to_labels(breaks)
Return a list of bin labels for the numerical bins, given the breaks between bins.
- Parameters
breaks (list) – The numerical values separating the bins, excluding the implicit extreme edges -inf and inf.
- Returns
The bin labels.
- Return type
list of strings
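The relationship between breaks and labels can be sketched as follows. This is illustrative only; the exact label format produced by the library is an assumption:

```python
def breaks_to_labels(breaks):
    """Build one label per bin, adding the implicit -inf and inf edges."""
    edges = [float('-inf')] + list(breaks) + [float('inf')]
    return [f'[{lo}, {hi})' for lo, hi in zip(edges, edges[1:])]

print(breaks_to_labels([0, 10]))  # ['[-inf, 0)', '[0, 10)', '[10, inf)']
```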
- weights_of_evidence.calc_bin_counts(bin_dict, features, targets, sample_weights=None)
Count the number of goods and bads in each of the categorical and numerical bins.
The counts in each bin are weighted with the sample weights.
- Parameters
bin_dict (dict) – Dictionary containing the categorical bin features and the breaks separating the numerical bins.
features (pandas Series) – The feature values to be binned, of shape (num_samples,).
targets (pandas Series) – The corresponding target values, of shape (num_samples,), with integer values of either 0 (for 'good') or 1 (for 'bad').
sample_weights (pandas Series) – Optional. Individual weights for each sample in features.
- Returns
A dictionary of good/bad counts for each bin.
- Return type
dict
Pseudo example:
>>> bin_dict = {'categorical_bins': [], 'categorical_woe': [], 'numerical_breaks': [], 'numerical_woe': []}
>>> features = pd.Series([...])
>>> targets = pd.Series([...])
>>> calc_bin_counts(bin_dict, features, targets)
{'categorical': {'good': [...], 'bad': [...]}, 'numerical': {'good': [...], 'bad': [...]}}
- weights_of_evidence.concat_woe_and_counts(woe_dict, counts)
Concatenate the labels, counts and WOE of the categorical and numerical bins, for the WOE plots.
- Parameters
woe_dict (dict) – The WOE binning dictionary.
counts (dict) – The corresponding raw good/bad counts for each bin.
- Returns
Concatenated labels, WOE values, good counts and bad counts.
- Return type
tuple of lists
- weights_of_evidence.convert_str_to_cat(series)
Map the string values to a categorical-type encoding.
A mix of string values and integer values is allowed. The strings are mapped to different integers, with missing values (NaNs) mapped to '-1'. The integers are cast to strings, allowing string-valued variables to be used directly when convenient.
- Parameters
series (pandas Series) – The string-valued data.
- Return results
A tuple of three elements: a list of the categories; a list of string values, of shape (num_samples,); and a bidirectional dictionary that provides the string <-> categorical mapping.
- Return type
tuple
- weights_of_evidence.counts_to_woe(bin_dict, counts, pseudocount=1e-05)
Convert bin counts to WOE for a specific variable.
- Parameters
bin_dict (dict) – Binning dictionary containing the categorical bin values and the breaks for the numerical bins.
counts (dict) – Corresponding good/bad counts for each bin.
pseudocount (float) – Optional. A small count added to the good and bad counts.
- Returns
WOE lookup dictionary which maps each bin to a WOE value.
- Return type
dict
Example:
>>> # Show the structure of the dictionaries.
>>> bin_dict = {'categorical_bins': list, 'numerical_breaks': list}
>>> counts = {'categorical': {'good': list, 'bad': list}, 'numerical': {'good': list, 'bad': list}}
>>> lookup = {'var1': {'categorical': {'keys': list, 'woe_vals': list, 'labels': list, 'bad_rate': list},
...                    'numerical': {'keys': list, 'woe_vals': list, 'labels': list, 'bad_rate': list},
...                    'iv': float}}
- weights_of_evidence.filter_dict(woe_dict, filter_list)
Filter woe_dict.
- Parameters
woe_dict (dict) – A WOE dictionary.
filter_list (list) – A list of all the features that need to be included.
- Returns
The filtered WOE dictionary.
- Return type
dict
- weights_of_evidence.find_str_cols(dataset)
Identify all the columns in a dataframe that contain string variables.
- Note
Only model variables should be included in dataset, i.e. columns containing things like target values, customer IDs, etc. should be removed.
- Parameters
dataset (pandas dataframe) – The dataframe.
- Return str_cols
A list of the column names containing string variables.
- Return type
list of strings
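A minimal sketch of such a check in plain pandas (illustrative, not the library's actual implementation):

```python
import pandas as pd

def find_str_cols(dataset):
    """Names of the columns that contain at least one string value."""
    return [col for col in dataset.columns
            if dataset[col].map(lambda v: isinstance(v, str)).any()]

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 3], 'c': [0.1, 0.2]})
print(find_str_cols(df))  # ['b']
```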
- weights_of_evidence.get_bin_info_from_autobin_obj(autobinning)
Extract the woe_dict information from the autobin object returned by the function autobin().
- Parameters
autobinning (dict) – A dictionary of binning objects, one object for each feature.
- Returns
A WOE dictionary.
- Return type
dict
- weights_of_evidence.get_woe_values_from_counts(counts, values_type, prior_log_odds, pseudocount)
Convert bin counts into WOE values.
- Parameters
counts (dict) – Corresponding good/bad counts for each bin.
values_type (string) – The type of values ('categorical' or 'numerical').
prior_log_odds – The log of the ratio of the total good/bad counts for the corresponding variable.
pseudocount (float) – The pseudocount to add to the good and bad counts.
- Returns
The WOE values.
- Return type
np.ndarray
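The conventional WOE calculation behind this can be sketched as follows. This is an illustrative re-implementation with a simplified signature (good/bad arrays instead of the counts dict and prior_log_odds); the library's exact arithmetic may differ:

```python
import numpy as np

def woe_from_counts(good, bad, pseudocount=1e-05):
    """WOE per bin: log of (share of goods) over (share of bads), smoothed."""
    good = np.asarray(good, dtype=float) + pseudocount
    bad = np.asarray(bad, dtype=float) + pseudocount
    return np.log((good / good.sum()) / (bad / bad.sum()))

# Bins holding relatively more goods than bads get positive WOE, and vice versa.
woe = woe_from_counts([90, 10], [50, 50])
print(np.round(woe, 3))
```

The pseudocount keeps empty bins from producing a division by zero or a log of zero.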
- weights_of_evidence.read_dict_from_json(filename)
Read a dictionary from a JSON file.
- Parameters
filename (str) – The name of the JSON file. This should include the directory name.
- weights_of_evidence.save_dict_to_json(bin_dict, filename)
Save a dictionary to a JSON file.
- Parameters
bin_dict (dict) – The dictionary to be saved.
filename (str) – The output file name where the dictionary is written to.
- weights_of_evidence.transform_woe(feature_values, woe_lookup, var_name=None)
Transform feature values into WOE values.
The data values of a variable, feature_values, are transformed to WOE values using the WOE lookup dictionary for that variable. In other words, we assign each sample in feature_values the WOE value corresponding to the bin to which it belongs.
Both the bins and the corresponding WOE values associated with each bin are defined in woe_lookup.
Example:
>>> woe_lookup = {'categorical': {'keys': [...], 'woe_vals': [...], 'labels': [...], 'bad_rate': [...]},
...               'numerical': {'keys': [...], 'woe_vals': [...], 'labels': [...], 'bad_rate': [...]},
...               'iv': float,
...               'dtype': str,
...               'ord_order': list}
- Parameters
feature_values (pd.Series) – Pandas Series of input feature values of shape (num_samples,).
woe_lookup (dict) – The WOE lookup dictionary for the variable.
var_name (str) – Name of the variable to be transformed.
- Returns
Array of WOE values of shape (num_samples,).
- Return type
array-like
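For the numerical case, the bin-to-WOE assignment can be sketched with np.digitize. This is a simplified, numerical-only illustration (the actual function also handles categorical bins via the lookup dictionary):

```python
import numpy as np
import pandas as pd

def transform_numeric_woe(feature_values, breaks, woe_vals):
    """Assign each value the WOE of the numerical bin it falls into."""
    # breaks exclude the implicit -inf/inf edges, so digitize yields len(breaks)+1 bins.
    bins = np.digitize(feature_values, breaks)
    return pd.Series(np.asarray(woe_vals)[bins], index=feature_values.index)

values = pd.Series([-5.0, 2.0, 50.0])
out = transform_numeric_woe(values, breaks=[0, 10], woe_vals=[-0.4, 0.1, 0.7])
print(out.tolist())  # [-0.4, 0.1, 0.7]
```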
constrata_core_credit.weights_of_evidence_widgets
A module containing visual tools (IPython Widgets) for binning and Weights of Evidence (WOE)
- class weights_of_evidence_widgets.BinningVisualiser(woe_binning_objects, datasets)
Bases: object
A GUI for visualising and comparing the binning information for different datasets, e.g. train and validation.
- Parameters
woe_binning_objects (dict) – A dictionary of the binning objects. These are the same for all datasets.
datasets (dict) – A dictionary of the dataframes (e.g. train and validation) that will be compared.
- Example
>>> # An example of the ``datasets`` argument.
>>> datasets = {"TRN": {'data': train_data[selected_varnames], 'labels': train_labels},
...             "VLD": {'data': validation_data[selected_varnames], 'labels': validation_labels}}
- calc_metrics()
Calculate all the WOE values and counts needed for the plots.
- Returns
A dictionary of the dataset information, for example, the train and validation datasets.
- Return type
dict
- Example
>>> # The structure of the returned dictionary is the following.
>>> return_dict = {'train': {'bin_labels': [], 'woe': [], 'counts_good': [], 'counts_bad': []},
...                'val': {'bin_labels': [], 'woe': [], 'counts_good': [], 'counts_bad': []}}
- select_variable(variable)
Change the currently selected variable.
- Parameters
variable (str) – Name of the selected variable.
- update_plot()
Update all plots in the GUI, usually when a new variable is selected.
- class weights_of_evidence_widgets.ManualBinningUI(data, targets, bin_dict, predefined_bins=None)
Bases: object
Do manual bin adjustment using a widget.
The data is in the form of a pandas dataframe with the columns consisting of the variables that need to be binned. Fine-tuning the bin boundaries starts with an initial binning that is typically provided by an auto-binning algorithm. This binning should also be provided in the form of a bin dictionary. The dictionary has the structure {key: object}. Here key is the variable name and object is a Python object returned by WOEBinning. The object contains all the information about its variable, including all binning information.
- Parameters
data (dataframe) – A dataframe containing the variables to be binned, of shape (num_samples, num_variables).
targets (array-like) – The labels or target vector, of shape (num_samples,).
bin_dict (dict) – The binning dictionary containing all binning information about the variables.
predefined_bins (dict[str, dict]) – A dictionary that contains variables whose binning boundaries are predefined, and should therefore be protected. Keys must match the variable names in bin_dict, and the corresponding values must be a dict specifying the categorical and numerical breaks where appropriate. An example is shown below.
- Example
>>> # An example of specifying the predefined bins.
>>> predefined_bins = {
...     'a_numerical_var': {'categorical': None, 'numerical': [10, 15, 20]},
...     'a_categorical_var': {'categorical': [['abc'], ['missing']], 'numerical': None},
...     'a_mixed_var': {'categorical': [['abc'], ['missing']], 'numerical': [2, 4, 6]},
... }
>>> manual_binning = ManualBinningUI(..., predefined_bins=predefined_bins)
- adjust_break(**kwargs)
Call-back function called whenever a break is adjusted via one of the break adjuster TextBoxes.
- Parameters
kwargs (dict) – A dictionary of FloatBox values.
- adjust_categorical_bins(change)
Call-back function called whenever the content of the categorical bin input textbox is changed.
- Parameters
change (object) – The change in state of the categorical bins text box.
- property binning
Read the binning object for the currently selected variable.
- property categorical_bins
Return the categorical bins for the currently selected variable.
- property data
Read the data for the currently selected variable.
- get_bins()
Return the current binning for all the variables.
- get_woe_dict()
Return the bin dictionary for all the variables.
- Return woe_dict
The WOE dictionary.
- Return type
dict
- static load_bins_from_json(fname_in)
Load binning information from a JSON file.
- Parameters
fname_in (str) – Filename to load from.
- Returns
The dictionary of binning information.
- Return type
dict
- merge_bins(change)
Merge two bins by removing a break/edge. The call-back is only activated upon a change selected by the widget; therefore change is passed to the call-back by the widget.
- Parameters
change (object) – An object with details regarding the change in bin boundaries as selected through the merge-bin widget.
- property numerical_breaks
Read the breaks for the currently selected variable.
- save_bins_to_json(fname_out)
Save binning information to a JSON file.
- Parameters
fname_out (str) – Filename to write to.
- select_removed_variable(change)
Change the currently selected variable for removal.
- Parameters
change (dict) – A dictionary with details on the change in state of the variable selection dropdown.
- select_variable(change)
Register a change if a new variable is selected for binning.
- Parameters
change (dict) – A dictionary with details on the change in state of the variable selection dropdown.
- split_bins(change)
Split a bin by inserting a break/edge halfway between the edges of the specified bin.
- Parameters
change (dict) – A dictionary with details regarding the change in bin boundaries as selected through the split-bin widget.
- update_break_adjuster_widgets(is_protected=False)
Update the break adjuster widgets to show the current breaks.
- update_dropdowns()
Populate the merge/split drop-downs to show the current breaks.
- update_plot()
Update all the plots.
- property varnames
Return the variable names.
- weights_of_evidence_widgets.add_legend_to_plot(plot, legend_colours, legend_labels)
Add a legend outside of a plot.
- Parameters
plot (Figure widget) – The plot to which a legend must be added.
legend_colours (list) – Colours to be used for the legend icons.
legend_labels (list) – Labels to be used for the legend.
- Returns
Combined plot and legend.
constrata_core_credit.variable_reduction
A module for performing variable selection and reduction
- class variable_reduction.CorrelationMatrixColorCoder(lower_bound, upper_bound)
Bases: object
A class for indicating different levels of correlation between variables (for use with correlation_vif_table).
- Parameters
lower_bound (float) – The lower bound (inclusive) for medium correlation.
upper_bound (float) – The upper bound (exclusive) for medium correlation.
- Raises
TypeError – If either or both of the bounds are not specified.
ValueError – If the upper bound is smaller than the lower bound.
- get_explanation_df()
Get a colour-coded dataframe that indicates the meaning of each colour in the stability table.
- Returns
A colour-coded dataframe.
- Return type
pandas dataframe
- variable_reduction.calculate_gini_indices(train_data_df, train_labels, test_data_df, test_labels)
Calculate the Gini indices for all variables (corresponding to columns in the dataframes) by training a logistic regression model on each variable and evaluating the performance of the model.
- Parameters
train_data_df (pandas dataframe) – The dataframe of the variables to be used for training, of shape (num_samples, num_features).
train_labels (array-like) – The labels, of shape (num_samples,).
test_data_df (pandas dataframe) – The dataframe of the data to be used for testing, of shape (num_samples, num_features).
test_labels (array-like) – The labels, of shape (num_samples,).
- Returns
A pandas DataFrame mapping the variables to their respective Gini indices.
- Return type
DataFrame
- variable_reduction.calculate_vif(dataset_df)
Calculate the Variance Inflation Factors for a set of variables.
- Parameters
dataset_df (pandas dataframe) – Features with shape (num_samples, num_features).
- Returns
Variance Inflation Factors with shape (num_features,).
- Return type
pandas Series
Note
Suppose that the variance inflation factor of a predictor variable is \(5.27\) (\(\sqrt{5.27} \approx 2.3\)). This implies that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
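The VIF definition behind this note can be sketched directly: for each column, regress it on the remaining columns and compute \(VIF_j = 1/(1 - R_j^2)\). This is an illustrative re-implementation, not the library's actual code (in practice a package routine such as statsmodels' VIF is typically used):

```python
import numpy as np

def calculate_vif(X):
    """VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on the other columns plus an intercept.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        r2 = 1.0 - ((y - others @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Orthogonal, mean-zero columns are uncorrelated, so each VIF is 1.
vifs = calculate_vif([[1, 0], [-1, 0], [0, 1], [0, -1]])
print(np.round(vifs, 3))  # [1. 1.]
```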
Cluster and order the variables that are correlated.
Please ensure that data_df does not contain the target variable.
- Parameters
data_df (pandas dataframe) – A dataframe with columns for each WOE-transformed variable (feature).
- Returns
A tuple consisting of:
clusters_dict: a dictionary of the cluster membership.
sorted_df: the sorted correlation dataframe.
sorted_variable_order: a list of the variables in sorted order.
- Return type
tuple
- variable_reduction.create_correlation_vif_table(data_df, correlation_colour_coder)
Given WOE transformed variables, a dataframe containing the variable correlations and VIF is calculated. Ensure that the input dataframe does not contain the target variable.
- Parameters
data_df (pandas dataframe) – A dataframe with columns for each WOE transformed variable (feature).
correlation_colour_coder (CorrelationMatrixColorCoder object) – An object for indicating different levels of correlation.
- Returns
A dataframe containing the variable correlations and VIF.
- Return type
pandas dataframe
- variable_reduction.plot_clustered_correlation_matrix(re_indexed_df, new_variable_order, fig_height=7)
Plot the clustered variable correlation matrix, according to the given variable order.
- Parameters
re_indexed_df (pandas dataframe) – The correlation dataframe, re-indexed according to the clustered variable order.
new_variable_order (list) – The variables in clustered (sorted) order.
fig_height (int) – The height (in inches) of the figure to be plotted.
- variable_reduction.plot_selected_variables_iv(info_values_df)
Plot variable information values (IVs) and show the variables that lie in an acceptable IV range (as indicated by the in_acceptable_range column in info_values_df) in a different colour.
The input dataframe has the following columns:
'variable', 'information_value', 'in_acceptable_range'
- Parameters
info_values_df (pandas dataframe) – The input dataframe.
- variable_reduction.select_best_variables_from_clusters(clusters_dict, variable_ratings_dict)
Select the best (maximum) variable in each cluster from a set of clusters. The variables are selected from clusters_dict according to the measure represented by the values in variable_ratings_dict. This measure could, for example, be the Gini Index or Information Value.
- Parameters
clusters_dict (dict) – A dictionary containing the cluster indices and variable names.
variable_ratings_dict (dict) – A dictionary containing the variable names as keys and the value of their ratings as values.
- Returns
The best variables (one for each cluster).
- Return type
list of strings
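The selection step amounts to taking an arg-max over each cluster. A hypothetical sketch (the dictionaries and their values are illustrative):

```python
# Pick the variable with the highest rating (e.g. Gini or IV) in each cluster.
clusters_dict = {0: ['age', 'income'], 1: ['balance']}
variable_ratings_dict = {'age': 0.42, 'income': 0.31, 'balance': 0.55}
best = [max(variables, key=variable_ratings_dict.get)
        for variables in clusters_dict.values()]
# best == ['age', 'balance']
```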
- variable_reduction.select_variables_on_iv(info_values_dict, min_threshold, max_threshold)
Perform variable selection by filtering out variables based on Information Value thresholds.
- Parameters
info_values_dict (dict) – Dictionary containing the binning information for each variable.
min_threshold (float) – The minimum Information Value threshold, below which variables will be excluded.
max_threshold (float) – The maximum Information Value threshold, above which variables will be excluded.
- Returns
A pandas DataFrame mapping each variable name to its Information Value.
- Return type
pandas DataFrame
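The thresholding itself is a simple band filter on the IVs. A hypothetical sketch (the variable names and IVs are illustrative):

```python
# Keep variables whose Information Value lies within [min_threshold, max_threshold].
info_values = {'age': 0.35, 'income': 0.02, 'balance': 1.2}  # illustrative IVs
min_threshold, max_threshold = 0.1, 1.0
selected = {var: iv for var, iv in info_values.items()
            if min_threshold <= iv <= max_threshold}
# selected == {'age': 0.35}
```

Variables below `min_threshold` carry too little signal; those above `max_threshold` are often suspiciously predictive and worth inspecting.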
- variable_reduction.select_variables_on_vif(dataset, max_vif=5)
Prune a set of variables to remove all variables with a variance inflation factor (VIF) higher than max_vif.
- Parameters
dataset (pandas dataframe) – Features with shape (num_samples, num_features).
max_vif (float) – The maximum VIF allowed for the set of variables that are to be kept.
- Returns
The list of variables.
- Return type
list of strings
- variable_reduction.show_correlation_vif_table(data_df, lower_bound, upper_bound)
Show the colour-coded correlation and VIF table.
- Parameters
data_df (pandas dataframe) – A dataframe with columns for each WOE transformed variable (feature).
lower_bound (float) – The lower bound for medium correlation.
upper_bound (float) – The upper bound for medium correlation.
constrata_core_credit.scorecard
A module for creating scorecards from models
- class scorecard.Scorecard(base_odds=0.1, base_points=500, ptdo=50, **kwargs)
Bases:
sklearn.base.BaseEstimator
A Scorecard model, essentially logistic regression with outputs scaled to produce a credit score.
The keyword arguments can be any of those accepted by sklearn's logistic regression.
- Parameters
base_odds (float) – The odds ratio (bad/good) that will be mapped to the score specified by base_points. Default: 0.1
base_points (float) – The reference score which will correspond to base_odds. Default: 500
ptdo (float) – Points-To-Double-Odds. The increment in score which corresponds to a doubling of the odds. Default: 50
kwargs (**) – These keyword arguments are passed through to sklearn's logistic regression constructor.
- Default
>>> # ``kwargs`` for sklearn logistic regression.
>>> logistic_regression_kwargs = {
...     'C': 1000,
...     'random_state': 0,
...     'max_iter': 1000,
...     'solver': 'lbfgs'
... }
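The three scaling parameters pin down an affine map from log-odds to score. A minimal sketch of the standard points-to-double-odds scaling, under the assumption that the class uses the conventional formula (the helper name `log_odds_to_score` is illustrative; the library's internal scaling may differ in detail):

```python
import numpy as np

def log_odds_to_score(log_odds, base_odds=0.1, base_points=500, ptdo=50):
    """Affine map from log-odds (bad/good) to a score: base_odds maps to
    base_points, and doubling the odds lowers the score by ptdo points."""
    factor = ptdo / np.log(2)                       # points per doubling of odds
    offset = base_points + factor * np.log(base_odds)
    return offset - factor * log_odds
```

With the defaults, odds of 0.1 map to a score of 500, and halving the odds to 0.05 raises the score by one PDO to 550.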
- fit(features, targets, sample_weights=None)
Fit the Scorecard model to a dataset.
- Parameters
features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).
targets (array-like of dtype: int (binary 0 or 1)) – Target labels (binary) of shape (num_samples,).
sample_weights (array-like of dtype: float) – Sample weights of shape (num_samples,).
- Returns
The trained model.
- Return type
sklearn model
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
dict
- predict(features)
Make predictions and convert to scores.
- Parameters
features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).
- Returns
Scores of shape (num_samples,).
- Return type
array-like of dtype: int
- predict_log_odds(features)
Predict the log-odds for a default.
- Parameters
features (array-like) – Weights of Evidence (WOE) features (float) of shape (num_samples, num_features).
- Returns
Log-odds of shape (num_samples,).
- Return type
array-like
- predict_proba(features)
Predict the probability of a default.
- Parameters
features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).
- Returns
Probability of default of shape (num_samples,).
- Return type
array-like of dtype: float
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
estimator instance
- scorecard.create_client_summary(client_data, model_summary, woe_lookup)
Create a summary of the client indicating how the attributes of the features contribute towards the score. The client summary information is extracted from the model summary.
- Parameters
client_data (pandas dataframe) – Original features of the client, of shape (1, num_features).
model_summary (pandas dataframe) – Summary of the point contributions.
woe_lookup (dict) – The WOE lookup dictionary.
- Returns
Summary of the client point contributions.
- Return type
pandas dataframe
- scorecard.create_model_summary(scorecard_model_pipeline)
Create a summary of the model indicating how the attributes of the features contribute towards the score.
- Parameters
scorecard_model_pipeline – Trained model.
- Returns
Summary of the point contributions.
- Return type
pandas dataframe
- scorecard.extract_scorecard_info(scorecard_model_pipeline)
The model parameters (weights and offset) are trained to predict a probability. If a score is needed, the model parameters need to be appropriately transformed. This function first extracts the model parameters and the woe_lookup dictionary from the trained model pipeline. It then transforms the model parameters into the form needed for returning a score.
- Parameters
scorecard_model_pipeline (sklearn Pipeline) – Trained model pipeline.
- Return coef
The model weights modified using the scale factor.
- Return type
list
- Return intercept
The transformed offset.
- Return type
float
- Return woe_lookup
The WOE lookup dictionary.
- Return type
dict
- scorecard.get_vars_from_model_pipeline(scorecard_model_pipeline)
Extract the variables that were used to train the model from the model pipeline.
- Parameters
scorecard_model_pipeline (sklearn Pipeline) – The model pipeline of the trained model.
- Returns
The variable names.
- Return type
list
- scorecard.show_client_summary(client_data, scorecard_model_pipeline)
Show a summary of the client indicating how the attributes of the features contribute towards the score.
- Parameters
scorecard_model_pipeline (sklearn pipeline) – The trained model pipeline.
client_data (pandas dataframe) – Original features of the client, of shape (1, num_features).
- scorecard.show_model_summary(scorecard_model_pipeline, datasets)
Show a summary of the model indicating how the attributes of the features contribute towards the score. Gini values are shown for the different datasets, e.g. train, validation and test sets. The different datasets are in the form of a dictionary of dataframes where each dataframe represents one of the datasets for which the Gini coefficient is required.
- Parameters
scorecard_model_pipeline (sklearn pipeline) – The trained model pipeline.
datasets – The dictionary of the dataframes.
>>> datasets = {'TRN': {'data': train_data_df, 'labels': labels},
...             'VAL': {'data': validation_df, 'labels': labels},
...             'TST': {'data': test_df, 'labels': labels}}
Note
The datasets include the features and labels that are used to calculate the Gini coefficients.
constrata_core_credit.monitoring
A module for monitoring model stability
- class monitoring.StabilityClassesColorCoder(no_shift_upper_bound, serious_shift_lower_bound)
Bases:
object
A class for indicating different levels of model stability (for use with calculate_stability_table and show_stability_table).
- Parameters
no_shift_upper_bound (float) – The upper bound for defining no shift.
serious_shift_lower_bound (float) – The lower bound for defining serious shift.
- Raises
ValueError – If the lower bound is not smaller than the upper bound.
- get_explanation_df()
Get a colour-coded dataframe that indicates the meaning of each colour in the stability table.
- Returns
colour-coded dataframe.
- Return type
pandas dataframe
- monitoring.calculate_scorecard_distributions(score_bin_splits, train_score, test_score, train_target, test_target)
Calculate the distributions over score intervals for a training and test set.
- Parameters
score_bin_splits (array-like) – The list of splits defining the credit score bins.
train_score (pandas dataframe) – The training dataset.
test_score (pandas dataframe) – The testing dataset.
train_target (pandas series) – The training labels.
test_target (pandas series) – The testing labels.
- Returns
The development and recent distributions.
- Return type
pandas DataFrame
- Raises
ValueError – If the list of splits defining the credit score bins contains duplicates.
- monitoring.calculate_stability_table(scorecard_dists_df)
Calculate the model stability. The model stability is defined by the symmetric Kullback–Leibler divergence between the (scorecard) model output distributions for the training data (development_data) and the model output for the testing data (recent_data).
- Parameters
scorecard_dists_df (DataFrame) – A pandas DataFrame containing the score distribution for the training (development) and testing (recent) datasets across a number of bins. scorecard_dists_df is typically the output from the calculate_scorecard_distributions() function.
- Returns
The dataframe containing the stability information.
- Return type
pandas DataFrame
- monitoring.show_stability_table(scorecard_dists_df, no_shift_upper_bound, serious_shift_lower_bound)
Show the colour-coded model output stability table.
- Parameters
scorecard_dists_df (pandas DataFrame) – The score distributions for the training (development) and testing (recent) datasets, typically the output of calculate_scorecard_distributions().
no_shift_upper_bound (float) – The upper bound for defining no shift.
serious_shift_lower_bound (float) – The lower bound for defining serious shift.
constrata_core_credit.monitoring_plots
A module for generating reporting plots
- class monitoring_plots.ValuesToLabels(feature_values, bin_dict)
Bases:
object
Given a pandas series of feature values and bin information, the values in the series are assigned to given labels. The labels can be numerical (e.g. WOE values) or strings (e.g. labels).
This class is instantiated with a pandas series of feature values and bin descriptions.
The values in the series will then be transformed to the given labels (numerical or categorical).
This is therefore a generalisation of the transform_woe() function in the weights-of-evidence module in the sense that transform_woe() only transforms values to WOE values.
- Example
>>> feature_values = pd.Series([1.0, 'a', 4.0, 'b', 'c', 5.0])
>>> bin_dict = {'categorical': {'keys': [['a', 'b'], ['c']]}, 'numerical': {'keys': [2.0]}}
>>> cat_labels = ['0', '1']
>>> num_labels = [-5.0, 5.0]
>>> transf = ValuesToLabels(feature_values, bin_dict)
>>> transf.transform(cat_labels, num_labels)
array(['-5.0', '0', '5.0', '0', '1', '5.0'])
- Parameters
feature_values (pandas series of type str) – Array of input feature values of shape (num_samples,).
bin_dict (dict) – The binning dictionary for the variable.
- transform(cat_labels=None, num_labels=None)
The functor returning the transformed values.
Note
Either numerical or string values are returned, depending on the values in cat_labels and num_labels.
- Parameters
cat_labels (list) – List of the labels that will be assigned to the categorical values.
num_labels (list) – List of the labels that will be assigned to the numerical values.
- Return labeled_feature
The labels assigned to the input values.
- Return type
pandas series of type str or float
- monitoring_plots.calc_bin_count(data, data_description, outcome_column, column_name, status, target)
Calculate the bin counts of the outcome specified by status.
- Parameters
data (pandas dataframe) – A dataframe of shape (n_samples, n_variables).
data_description (str) – The name of the data sample, e.g. 'train'.
outcome_column (str) – The column in data that contains the application outcome.
column_name (str) – The column to get the count for.
status (str) – The application outcome status for which the count is required.
target (str) – The name of the target variable.
- Returns
A dataframe with successful or declined application outcome counts, percentages, and good and bad ratings.
- Return type
pandas dataframe
- monitoring_plots.calc_dataset_counts(data_sets, outcome_column, column_name, status, target)
Create a dataframe of the bin counts of a specified application outcome for different datasets.
Note
The status values are specified in outcome_column. One can calculate the counts for any of these values as specified by status.
- Parameters
data_sets (dict) – A dictionary of the data sets with the structure shown below. Its keys specify the different datasets.
outcome_column (str) – The column in data_sets that contains the application outcome.
column_name (str) – The column name for which the counts are required.
status (str) – The application outcome status, i.e. whether the application was successful or declined.
target (str) – The name of the target variable.
- Return count_df
A dataframe with the counts of the specified application outcome for the different datasets.
- Example
>>> # Structure of the `data_sets` argument.
>>> data_sets = {'development': develop_df, 'rolling_recent': rolling_recent_df}
- monitoring_plots.calc_psi(first_distr, second_distr)
Calculate the Population Stability Index (PSI) for two distributions. The PSI is a symmetric version of the Kullback–Leibler divergence. The order of the two distributions does not matter. Note that np.log10 is used, not np.log.
- Parameters
first_distr (pandas series) – The first distribution.
second_distr (pandas series) – The second distribution.
- Return psi
The PSI value.
- Return type
float
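The symmetric base-10 form described above can be sketched in a few lines (the helper name `psi_sketch` is illustrative):

```python
import numpy as np

def psi_sketch(first_distr, second_distr):
    """Symmetric KL divergence in base 10: sum((p - q) * log10(p / q)).
    Assumes both inputs are strictly positive probability vectors."""
    p = np.asarray(first_distr, dtype=float)
    q = np.asarray(second_distr, dtype=float)
    return float(np.sum((p - q) * np.log10(p / q)))
```

Identical distributions give a PSI of 0, and swapping the arguments leaves the value unchanged, as the docstring above promises.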
- monitoring_plots.calculate_outcome_counts(outcomes)
Calculate the counts for the different outcomes, given a pandas series of the outcomes.
- Parameters
outcomes (pandas series) – A series of outcomes.
- Return outcome_df
A dataframe with the counts and the percentages for the different outcomes.
- Return type
dataframe
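A counts-and-percentages summary of this kind can be sketched directly with pandas (the outcome values and column names here are illustrative, not the function's actual output schema):

```python
import pandas as pd

# Hypothetical sketch: tally each outcome and express it as a percentage.
outcomes = pd.Series(['Declined', 'Successful', 'Successful', 'NotTakenUp'])
counts = outcomes.value_counts()
outcome_df = pd.DataFrame({'count': counts,
                           'percent': 100.0 * counts / counts.sum()})
```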
- monitoring_plots.create_outcome_dataframe(data_sets, outcome_column)
Create a dataframe of the counts of different application outcomes for a number of datasets.
- Parameters
data_sets (dict) – A dictionary where the key is a dataset name and the value the dataset, with the structure given in the example below.
outcome_column (str) – The column in data that contains the application outcome.
- Example
The structure of the data_sets argument:
>>> data_sets = {'train': train_df, 'test': test_df, 'validation': validation_df}
- monitoring_plots.create_successful_summary(data, outcome_column, target)ο
Create a dataframe of a summary of clients with successful application outcomes over the selected time period. The dates are given in the
date
column.The input dataframe
data
should have the following columns:An outcomes column specified in
outcome_column
.A date column called
date
. The outcomes will be calculated separately at these different dates.A target column specified in
target
. The labels of the loans that have defaulted.
The output dataframe gives the number of successful applications per month as well as the fraction per month that subsequently defaulted.
- Parameters
data β A dataframe containing a column named
date
and outcomes and target columns specified by the parameters.outcome_column (str) β The column with the application outcomes.
target (str) β The target column name with the default labels.
- Returns
dataframe with number of successful applicants, total applications and the default rate.
- monitoring_plots.plot_application_outcome(outcome_df, datasets, bar_width=0.13, tick_separation=0.3)
Compare the application outcomes between different datasets. The data that is displayed in the graph is in outcome_df, a dataframe that is created by the function create_outcome_dataframe in the monitoring_plots module.
The outcomes that are compared are given by the index of outcome_df, see the example below.
The names of the different datasets are specified in the datasets argument.
The percentage and the count of each outcome are shown in the graph for each of the different datasets. Each dataset name therefore contributes two columns in outcome_df where the dataset name is suffixed by _percent and _count.
Any number of datasets is allowed. However, only 7 different colour schemes are provided to distinguish between the datasets.
- Parameters
outcome_df (dataframe) – A dataframe with the structure as shown in the example below.
datasets (list) – A list of the names of the data sets that are compared, of type str.
bar_width (float) – Set the width of the bars in the bar plot. Default: 0.13
tick_separation (float) – Set the tick separation on the x-axis. Default: 0.3
- Example
This is an example of outcome_df and datasets for a single dataset, train, and four possible outcomes: 'Declined', 'Successful', 'NotTakenUp', and 'NotSet'.
>>> data = {'train_count': [7724, 1662, 1094, 575], 'train_percent': [70.0, 15.0, 10.0, 5.0]}
>>> index = ['Declined', 'Successful', 'NotTakenUp', 'NotSet']
>>> outcome_df = pd.DataFrame(data, index=index)
>>> datasets = ['train']
Note
The dataset name, 'train', leads to two columns in outcome_df, namely train_count and train_percent, where the prefix refers to the dataset name.
- monitoring_plots.plot_correlation_matrix(data, title, corr_vars)
Create a correlation matrix for variables in a dataset sample. The correlations are typically calculated using the WOE values and this must be added as a suffix to the variable names, i.e. for the age variable the corresponding variable name looks like age_WoE.
The values must all be numerical values (float or int).
- Parameters
data – Dataframe with all successful application outcomes for the two data sets.
title (str) – Title of the graph.
corr_vars (list) – List of all the variables for the matrix.
- monitoring_plots.plot_gini(data_df, variable_list, target)
Create a barplot of the Gini values. The Gini values are calculated using Somers' delta.
- Parameters
data_df – A dataframe to use for the creation of the plot.
variable_list – A list of variables to include in the plot.
target (str) – The target variable.
- Returns
A barplot of the Gini values.
- monitoring_plots.plot_manually_closed_summary(data, dataset_names, bar_width=0.3, tick_separation=0.3)
Create a plot displaying the count and percentages of the reasons for the declined applicants.
- Parameters
data (dataframe) – The dataframe with a column called ManualCloseReason and, for each dataset, columns called f'{dataset_name}_percent' and f'{dataset_name}_count'.
dataset_names (list) – A list of the names of the different datasets, of type str.
bar_width (float) – Set the width of the bars in the bar plot. Default: 0.3
tick_separation (float) – Set the tick separation on the x-axis. Default: 0.3
- monitoring_plots.plot_psi(data_sets, variables, outcome_column, status, target)
Plot PSI values for all variables. The PSI of a variable measures the difference between the distributions of two datasets.
- Parameters
data_sets (dict) – A dictionary of the two datasets, called development and rolling-mature.
variables (list) – A list of variable column names.
outcome_column (str) – The column in data that contains the application outcome.
status (str) – The application outcome status.
target (str) – The name of the target variable.
- monitoring_plots.plot_rolling_statistics(data, dataset_names, column_name, bar_width=0.3, tick_separation=0.2)
Create a plot for comparing the percentage of successful applications and the bad rate between two data sets.
Note
This allows the user to compare the two datasets with respect to values such as WOE, or any other discrete set of variables.
- Parameters
data (dict) – Dictionary of dataframes of the two data sets.
dataset_names (list) – The names of the data sets (of type str).
column_name (str) – The column with the bucket descriptions.
bar_width (float) – Set the width of the bars in the bar plot. Default: 0.3
tick_separation (float) – Set the spacing between the ticks on the x-axis. Default: 0.2
- Example
The bucket counts are compared and for this reason '_buckets' needs to be added as a suffix to the variable name. For a variable called 'YearsInBusiness', the column_name becomes
>>> column_name = 'YearsInBusiness' + '_buckets'
- monitoring_plots.plot_successful_applications(successful_df)
Plot the number of successful applications as a bar plot and the default rate as a line plot on the same axes.
The input dataframe successful_df should have a column called date since the outcomes are plotted for the different dates. The input dataframe should also have columns named Number_of_clients and default_rate that contain the number of applications for each date and the default rate, respectively.
Note
successful_df is created by a call to create_successful_summary().
- Parameters
successful_df – A dataframe with columns named date, Number_of_clients and default_rate.
- monitoring_plots.somers_delta(x_var, y_var)
Compute Somers' Delta, which is the measure of agreement between two ordinal variables. The value ranges from -1 to 1, with -1 indicating disagreement and 1 indicating agreement. For reference, see: Somers' delta.
- Parameters
x_var (array-like) – The independent variable of shape (n_samples,).
y_var (array-like) – The dependent (binary) variable of shape (n_samples,).
- Returns
The calculated Somers' Delta.
- Return type
float
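For a binary dependent variable, Somers' D of the scores with respect to the target reduces to 2·AUC − 1, which is exactly the Gini index used elsewhere in this module. A sketch of that equivalence (the helper name `gini_from_scores` is illustrative):

```python
from sklearn.metrics import roc_auc_score

def gini_from_scores(y_true, scores):
    """For a binary target, Somers' D of scores vs. target equals
    2*AUC - 1, i.e. the Gini index."""
    return 2.0 * roc_auc_score(y_true, scores) - 1.0
```

Perfect agreement gives 1, perfect disagreement gives -1, matching the range described above.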
constrata_core_credit.model_evaluation
A module for evaluating models
- class model_evaluation.HypothesisTester(model=None, random_state=0, solver='lbfgs', max_iter=100)
Bases:
object
A class for testing different models ('hypotheses') on a dataset. The user may supply a model; the default is sklearn's LogisticRegression.
If no model is specified, the user can specify the hyperparameters used to instantiate sklearn's LogisticRegression class.
- Parameters
model (object) – The model to be tested. If None, a Logistic Regression model will be used.
random_state (int) – The random state of the LogisticRegression. Default: random_state=0.
solver (string) – The optimiser used in sklearn's LogisticRegression.
max_iter (int) – The maximum number of iterations for which the model will be trained.
- fit_and_test_model(X_train, y_train, X_test, y_test, plot=False)
Fit the model to the training data and test the trained model on the test data.
- Parameters
X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).
y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).
X_test (pandas dataframe) – The test samples, of shape (num_samples, num_features).
y_test (array-like) – The labels for the corresponding samples, of shape (num_samples,).
plot (bool) – Whether or not to plot the ROC curve.
- Returns
A dictionary containing the ROC data and Gini Index.
- Return type
dict
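The default hypothesis can be exercised by hand as follows. This is a hypothetical usage sketch: the dataset, the `result` dictionary and its keys are illustrative, not the method's documented return schema, and it assumes the default model is sklearn's LogisticRegression with the constructor's documented hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Tiny illustrative dataset: the target flips as the feature grows.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

# The default hypothesis, per the constructor defaults above.
model = LogisticRegression(random_state=0, solver='lbfgs', max_iter=100)
model.fit(X_train, y_train)

auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
result = {'auc': auc, 'gini': 2 * auc - 1}
```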
- fit_model(X_train, y_train)
Fit the model.
- Parameters
X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).
y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).
- test_model(X_train, y_train, plot=False)
Test the model and return the ROC data and Gini Index.
- Parameters
X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).
y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).
plot (bool) – Whether or not to plot the ROC curve.
- Returns
A dictionary containing the ROC data and Gini Index.
- Return type
dict
- Raises
ValueError – If the model has not been trained.
- model_evaluation.get_roc_data(model, x, y)
Get the false positive rate (FPR), true positive rate (TPR) and area under the curve (AUC) for a set of (automatically calculated) thresholds, all of which describe an ROC curve.
- Parameters
model (sklearn model) – The model for which the ROC curve data is calculated.
x (array-like) – Input (feature) data of shape (num_samples, num_features).
y (array-like) – Target labels of shape (num_samples,), corresponding to x.
- Returns
A tuple containing the false positive rate, true positive rate, the thresholds and the area under the curve.
- Return type
tuple of dtype float
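The quantities described here can be reproduced with sklearn's own ROC utilities; the following sketch assumes (but does not know) that the function relies on the same automatic threshold selection as `sklearn.metrics.roc_curve`:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Illustrative scores: one positive is ranked below one negative,
# so the ranking is imperfect and the AUC sits below 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # thresholds chosen automatically
roc_auc = auc(fpr, tpr)
```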
- model_evaluation.get_roc_metrics(model, data, labels)
Calculate the false positive rate (fpr), true positive rate (tpr), area under the curve (auc) and Gini index.
- Parameters
model (sklearn pipeline) – The model that should be evaluated. This object should have a predict function and accept a features dataframe as input.
data (pandas dataframe) – The dataset of features used for evaluating the model, of shape (num_samples, num_features).
labels (pandas dataframe) – The labels for the different samples, of shape (num_samples,).
- Returns
A tuple containing a DataFrame with the false positive and true positive rates, together with the area under the curve and the Gini index.
- Return type
Tuple (DataFrame, float, float)
- model_evaluation.get_roc_metrics_for_datasets(model, datasets, data_key='data', labels_key='labels')
Get the ROC metrics for a number of datasets.
- Parameters
model (Scikit-learn Pipeline) – A scikit-learn pipeline (containing a model) that will be evaluated.
datasets (dict) – A dictionary containing the datasets to be evaluated. Each key should be the name given to the dataset, and each value should be another dictionary containing the model input data under key data_key and the corresponding labels under labels_key.
data_key (str) – The key in which the input data is stored for each dict in datasets. Default value is 'data'.
labels_key (str) – The key in which the corresponding labels of the input data are stored for each dict in datasets. Default value is 'labels'.
- Returns
A DataFrame containing the ROC metrics for each dataset.
- Return type
pandas DataFrame
- Example
>>> datasets = {'TRN': {'data': final_train_data, 'labels': train_labels},
...             'VAL': {'data': final_validation_data, 'labels': validation_labels},
...             'TST': {'data': final_test_data, 'labels': test_labels}}
>>> get_roc_metrics_for_datasets(
...     model=model,
...     datasets=datasets
... )
- model_evaluation.plot_roc(fpr, tpr, auc, model_name=None, fontsize=12)
Plot a Receiver Operating Characteristic (ROC) curve.
- Parameters
fpr (float list) – A list of false positive rates.
tpr (float list) – A list of true positive rates.
auc (float) – The area under the curve.
model_name (str) – The name of the model used to plot the ROC.
fontsize (int) – The font size of the title and legend of the ROC plot.
constrata_core_credit.plot_setup
A module for setting plot options
- plot_setup.set_plot_config(mode)
Set the matplotlib plot configuration to the desired mode to allow plots to display correctly in the associated mode.
- Parameters
mode (str) – The mode to set the plot config to, either 'light' or 'dark'.
- Raises
ValueError – If a mode other than 'light' or 'dark' is specified.
constrata_core_credit.experimental
A module for performing variable selection and reduction
- experimental.variable_reduction.plot_clustered_correlation_with_gini(corr_df, variable_order, ginis_df)
Plot the clustered variable correlation matrix together with Gini values.
- Parameters
corr_df (pandas dataframe) – The feature correlation dataframe.
variable_order (list) – The variables in sorted order.
ginis_df (pandas dataframe) – DataFrame of Gini values for all the variables.
- Return combined_fig
The figure with the correlation and Gini plots.
- Return type
HBox
Two graphics libraries are used in this project, seaborn and bqplot. The latter is designed to work with ipython widgets and is used whenever widgets are required. bqplot is based on a Grammar of Graphics framework and every attribute of the plot is an interactive widget. It is this feature that makes it so useful to integrate with ipython widgets. Moreover, the user has complete control to change any of the attributes of the plot after the fact.
Note that the image object fig produced in this function consists of two children, corr_fig and the bar-plot figure, and their attributes can be reset by the user. This is particularly useful to change the appearance of the figure. It might be interesting to note that this is how we update the figures in the ManualBinning object: only the marks attributes, i.e. the lines and/or bars, are updated; the rest of the figure is left unchanged.
The following is a short introduction to some of the attributes that the user might want to change.
- Example
>>> fig.children                    # List of the two figures that make up the combined figure.
>>> len(fig.children)               # Answer: 2
>>> fig.children[0]                 # The correlation matrix figure should appear.
>>> fig.children[1]                 # The bar plot should appear by itself.
>>> first_fig = fig.children[0]     # The first figure, for closer inspection.
>>> dir(first_fig)                  # Get all the attributes of the figure. You can also use `tab` complete in the notebook.
>>> first_fig.title                 # Shows the current title, 'Correlation'.
>>> first_fig.title = 'new_title'   # See how 'Correlation' is changed to 'new_title' in the figure above.
>>> first_fig.fig_margin            # A dictionary that sets the margin around the figure.
>>>                                 # These values were chosen so that this particular figure displays well.
>>> first_fig.layout.height         # The height of the figure. Try changing it!
>>> first_fig.layout.width          # The width of the figure. Try changing it!
>>> first_fig.axes[0].label         # Returns the current label. Try changing it!
>>> first_fig.axes[0].label_offset  # Returns the label offset. Try changing it!