API Reference

constrata_core_credit.preprocessing

A module for loading and pre-processing data

preprocessing.map_categoricals_to_strings(data)

Map categorical values to strings.

The mapping follows a deterministic process in which NaNs, strings, and numerical values are mapped to string versions of integers.

A bidirectional dictionary is also returned that describes the mapping between the original and transformed values.

Parameters

data (pandas Series) – The data to be transformed, of shape (num_samples,).

Return cat_data

The data transformed to categorical values (string values), of shape (num_samples,). Note: cat_data.name is None.

Return type

pandas Series

Return var_to_cat_dict

The bidirectional dictionary of the mapping between original and transformed variables.

Return type

bidirectional dictionary
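
Example:

A minimal usage sketch; the exact integer codes assigned to each value are illustrative assumptions, since only the mapping process is described above.

>>> import pandas as pd
>>> raw = pd.Series(['low', 'high', float('nan'), 'low'])
>>> cat_data, var_to_cat_dict = preprocessing.map_categoricals_to_strings(raw)
>>> list(cat_data)  # string versions of integers, e.g. ['0', '1', '2', '0']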

preprocessing.split_dataset(data_set, split_fractions, shuffle=True)

Split dataset into disjoint subsets. For example, into training, validation and testing sets.

Parameters
  • data_set (pandas dataframe) – The dataset to be split.

  • split_fractions (float list) – The fractions to allocate to each of the subsets (e.g. [0.7, 0.2, 0.1]).

  • shuffle (bool) – Whether or not to shuffle the data before splitting. Default: shuffle=True.

Returns

A tuple of dataframes, an element for each subset resulting from the split.

Return type

A tuple of dataframes

Example:

>>> data_df = pd.DataFrame([[0, 0.1, 0], [1, 0.2, 0], [2, 0.3, 0], [3, 0.21, 1], [4, 0.25, 0], [5, 0.35, 1],
...                         [6, 0.39, 0], [7, 0.11, 0], [8, 0.15, 0], [9, 0.18, 1]],
...                         columns=['user_id', 'var1', 'targets'])
>>> data_split_fractions = [0.6, 0.2, 0.2]
>>> train_df, validation_df, test_df = preprocessing.split_dataset(data_df, data_split_fractions, shuffle=True)
preprocessing.to_float(value)

If possible, convert value to float, otherwise return the original.

Example:

>>> preprocessing.to_float('-1')
-1.0
>>> preprocessing.to_float('abc')
'abc'
Parameters

value (string) – The value that needs to be converted.

Returns

The converted value.

Return type

string or float

constrata_core_credit.rollrates

A module for calculating the change in delinquency states (known as roll rates) of clients

rollrates.calc_delinquency_state_transition(dpd_df, first_date, last_date, early_mob=12, late_mob=45, min_mob=None, del_breaks=None, remove_delinquent_at_mob0=False)

Calculate the delinquency state transition matrix between an 'early' and a 'late' period defined as:

'early':       0    <= MOB < early_mob
'late':   early_mob <= MOB < late_mob

The maximum delinquency is calculated for both periods and classified into discrete states defined by the upper limits in del_breaks.

Parameters
  • dpd_df (dataframe) – The delinquency table with columns ['ID', 'DATE', 'DPD']. The date column must be a datetime object.

  • first_date (str) – Only consider accounts opened on or after this date.

  • last_date (str) – Only consider accounts opened before this date.

  • early_mob (int) – The time period (months on book) in which the initial delinquency state is calculated.

  • late_mob (int) – The time period (months on book) in which the subsequent delinquency state is calculated.

  • min_mob (int) – The minimum number of months on book for an account to be used.

  • del_breaks (numeric iterable) – Delinquency breaks. These values define the boundaries between different delinquency buckets (left-inclusive).

  • remove_delinquent_at_mob0 (bool) – Set True to remove all accounts which had non-zero delinquency at first month on book. Default: remove_delinquent_at_mob0=False.

Returns

Dataframe of state transition counts.

Return type

dataframe
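
Example:

A usage sketch; the date range and del_breaks values are illustrative assumptions, and dpd_df is assumed to have the ['ID', 'DATE', 'DPD'] columns described above.

>>> import pandas as pd
>>> dpd_df['DATE'] = pd.to_datetime(dpd_df['DATE'])  # the date column must be a datetime
>>> transitions_df = rollrates.calc_delinquency_state_transition(
...     dpd_df, first_date='2018-01-01', last_date='2019-01-01',
...     early_mob=12, late_mob=45, del_breaks=[1, 31, 61, 91])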

rollrates.relative_delta_in_months(last, first)

Calculate the number of months from first to last date.

Parameters
  • last (datetime object) – Last date in the time period considered.

  • first (datetime object) – First date in the time period considered.

Returns

The difference between first and last dates, in months.

Return type

int

constrata_core_credit.rollrate_widgets

A module containing visual tools (IPython Widgets) for roll rates

class rollrate_widgets.RollRateWidget(days_past_due_df)

Bases: object

Class for calculating and displaying roll rates.

It requires a dataframe as input with the following required columns:

'ID' (str or int): Any unique user ID.
'DATE' (str): This is a date sequence at which times the Days Past Due (DPD)
        are determined, typically at the beginning of the month. The format
        of the date should be 'year/month/day', for example '2020/04/19'.
'DPD' (int): The number of days past the payment due date.

Given this information the roll rates are calculated for fixed buckets:

  • '0 days'

  • '1-30 days'

  • '31-60 days'

  • '61-90 days'

  • '90+ days'

Parameters

days_past_due_df (pandas dataframe) – A dataframe of the client information.
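
Example:

A minimal construction sketch with illustrative data; in a Jupyter notebook the instantiated widget is assumed to display itself.

>>> import pandas as pd
>>> days_past_due_df = pd.DataFrame({'ID': [1, 1, 1],
...                                  'DATE': ['2020/01/01', '2020/02/01', '2020/03/01'],
...                                  'DPD': [0, 15, 45]})
>>> widget = rollrate_widgets.RollRateWidget(days_past_due_df)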

update_duration(duration)

Callback for duration widget.

Update due to change in 'Months Included' widget.

Parameters

duration (int) – Value returned by the Months Included widget, measured in months.

update_init_mob(init_mob)

Callback for init_mob widget.

Update due to change in initial months on books.

Parameters

init_mob (str) – Value returned by the Initial MOB widget, in months (e.g. '6 months').

update_ref_mob(ref_mob)

Callback for ref_mob widget. Update due to change in Subsequent MOB.

Parameters

ref_mob (str) – Value returned by the Subsequent MOB widget, in months (e.g. '6 months').

update_year(year)

Callback for Initial Date widget.

Update due to change in initial year.

Parameters

year (datetime object) – Date returned by Initial Date widget.

constrata_core_credit.weights_of_evidence

A module containing Weights of Evidence (WOE) tools

class weights_of_evidence.WOEBinning(var_type, bin_dict=None, protect_bins=False)

Bases: object

A WOE (weights of evidence) binning.

This class provides the following:

  1. Given the values of a single variable, the targets (labels), and the data type of the variable, bins are automatically calculated together with their weights of evidence.

    • 'numerical', 'categorical', 'ordinal', or 'mixed' dtypes are allowed.

    • If a variable has only float or int values, use dtype='numerical'.

    • In the case of a mix of string and numerical values, use dtype='mixed'. String values are treated as categorical.

    • If a variable has only str values, use dtype='categorical'.

    • If a variable has ordered str values, use dtype='ordinal'. In this case the order must be provided.

  2. The class can be initialised with a pre-calculated bin_dict. If bin_dict is not None, all other parameters are ignored.

    • If bin_dict does not have WOE values, then the fit_woe method has to be called. It calculates the WOE values for the given binning using the train data (values and targets) provided through the fit method.

  3. Once the bins and the WOE are available, the transform_woe method can be used to transform the features to WOE values.

Parameters
  • var_type (dict) – Specifies the dtype of the variable and the ordinal sequence, if the dtype is 'ordinal'.

  • bin_dict (dict) – Specifies the binning dictionary. Optional.

  • protect_bins (dict) – A dictionary with the protected bin values.

Examples:

  1. If only the bins are given:

>>> bin_dict = {'categorical': {'keys': []}, 'numerical': {'keys': []}, 'dtype': str, 'ord_order': list}
  2. If both the bins and WOE values are given:

>>> bin_dict = {'categorical': {'keys': [], 'woe_vals': [], 'labels': [], 'bad_rate': []},
...             'numerical': {'keys': [], 'woe_vals': [], 'labels': [], 'bad_rate': []},
...             'iv': '',
...             'dtype': str,
...             'ord_order': list}
  3. Examples of var_type:

>>> var_type = {'dtype': 'numerical'} # For numerical data type
>>> var_type = {'dtype': 'mixed'} # For mixed data type
>>> var_type = {'dtype': 'categorical'} # For categorical data type
>>> var_type = {'dtype': 'ordinal', 'ord_order': ['a', 'b', 'c']} # Providing the order as well
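
  4. A minimal end-to-end sketch of the workflow described above, with illustrative data:

>>> import pandas as pd
>>> values = pd.Series([0.1, 0.5, 2.3, 3.7, 0.9, 4.2])
>>> targets = pd.Series([0, 0, 1, 1, 0, 1])
>>> binning = WOEBinning({'dtype': 'numerical'})
>>> binning.fit_autobinning(values, targets)  # bins and WOE values calculated automatically
>>> woe_features = binning.transform_woe(values)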
property bin_dict

Return all the binning information for both the categorical and numerical bins.

Return bin_dict

The binning dictionary.

Return type

dict

property breaks

Return the numerical bin breaks and the categorical bins.

Return breaks

The break and bin values.

Return type

dict[str,list]

property categorical_bins

Return only the categorical binning information.

Return cat_bins

The categorical binning information.

Return type

dict

property categories

Return the categories, i.e. the unique categorical values. Note that the categories can be combined to form the bins.

Return categories

A list of str of categories.

Return type

list of strings

property counts

Return the counts for all the bins.

Return counts

The counts.

Return type

dict

fit_autobinning(features, targets, sample_weights=None)

Given the features and targets, the binning and WOE are calculated and the internal state of the object is updated with this information. The existing internal state is overwritten.

Parameters
  • features (array-like) – The features.

  • targets (array-like) – The binary (0 or 1) target values (labels).

  • sample_weights (pandas Series) – Optional. Individual weights for each sample in features.

fit_woe(features, targets, sample_weights=None)

Fit the features and targets to the existing bins.

The existing bins are used. Only the WOE values are estimated from the data.

Parameters
  • features (array-like) – The features.

  • targets (array-like) – The binary (0 for "good" or 1 for "bad") target values.

  • sample_weights (pandas Series) – Optional. Individual weights for each sample in features.

get_woe()

Return the WOE values for all the bins.

Return woe

The WOE values.

Return type

dict

get_woe_dict()

Return the WOE dictionary.

Return woe_dict

The WOE dictionary.

Return type

dict

property labels

Return the labels of both the categorical and numerical bins. Used as the x-axis in the plotting.

Return labels

The labels.

Return type

list

property sample_weights

Return the sample weights.

Return sample_weights

The sample weights

Return type

Pandas Series

set_manual_override(state)

Set whether bin boundaries have been manually overridden.

Parameters

state (bool) – Set whether bins have been manually modified after being protected. True indicates modification.

set_protected_bins(categorical_bins=None, numerical_breaks=None)

Set protected bin boundaries.

Sets new bin boundaries and/or categorical bins, and sets their status to protected. This is useful when you want to explicitly enforce particular binning boundaries on a variable.

Note: This forces an update of the WOE values, which will use the new bin boundaries.

Parameters
  • categorical_bins (list[str]) – The new categorical bins to assign to the WOE binning.

  • numerical_breaks (list[float]) – The new numerical breaks to assign to the WOE binning.
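
Example:

A short sketch, assuming an existing WOEBinning object for a numerical variable; the break values are illustrative.

>>> binning.set_protected_bins(numerical_breaks=[10.0, 20.0])  # also forces a WOE update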

transform_woe(features, var_name=None)

Transform features to WOE values.

Parameters
  • features (array-like) – The features to be transformed, with shape (num_samples,).

  • var_name (str) – Name of the variable to be transformed.

Return woe_features

The WOE-transformed features.

Return type

array-like

update_categorical_bins(categorical_bins)

Update the categorical bins in response to a widget change. The internal state of the categorical bins is updated, and the WOE values for the new bins are calculated.

Parameters

categorical_bins (A list of lists) – A list of the new categorical bins.

Example:

In the code example below, there are 3 categorical values 'a', 'b', 'c'. 'a' is in a bin by itself,
'b' and 'c' are grouped together in the same bin.
>>> categorical_bins = [['a'], ['b', 'c']]
update_numerical_breaks(breaks)

Update the numerical breaks and recalculate the WOE values.

The breaks are the numerical values separating the numerical bins.

Parameters

breaks (List[float]) – The new breaks.

class weights_of_evidence.WOETransformer(woe_dict)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A WOE (weights of evidence) transformer.

Transforms variable values to WOE values.

If the transformer is instantiated with a WOE dictionary with WOE values, the transform method will use these values to do the transform.

If the transformer is instantiated with a dictionary that contains only the binning information, see example below, then the corresponding WOE values must be trained using the fit() method.

The binning objects also have methods that allow one to obtain the binning information and calculate the WOE given a new instance (features and labels) of the variable using the existing, internal binning information.

This class supports the following methods:

fit(features_df, targets): # Fit the given features to the existing binning.
transform(features_df): # Transform the given features to their WOE values.
get_info_vals(): # Return a dictionary with the information values for each variable.
save_woe_to_json(filename): # Save the binning objects to a json file named filename.
Parameters

woe_dict (dict) – Dictionary of WOE values.

Example:

>>> woe_dict = {'var1': {'categorical': {'keys': [['missing'], ['a']],
...                                      'woe_vals': [],
...                                     'labels': [],
...                                     'bad_rate': []},
...                      'numerical': {'keys': [2.0],
...                                    'woe_vals': [],
...                                    'labels': [],
...                                    'bad_rate': []},
...                       'iv': '',
...                       'dtype': {'dtype': 'categorical', 'ord_order': None}},
...             'var2': {'categorical': {'keys': [['0'], ['1'], ['missing'], ['2']],
...                                      'woe_vals': [],
...                                      'labels': [],
...                                      'bad_rate': []},
...                      'numerical': {'keys': [],
...                                    'woe_vals': [],
...                                    'labels': [],
...                                    'bad_rate': []},
...                      'iv': '',
...                      'dtype': {'dtype': 'mixed', 'ord_order': None}},
...             'var3': {'categorical': {'keys': [['0'], ['1'], ['missing'], ['2']],
...                                      'woe_vals': [],
...                                      'labels': [],
...                                      'bad_rate': []},
...                      'numerical': {'keys': [],
...                                    'woe_vals': [],
...                                    'labels': [],
...                                    'bad_rate': []},
...                      'iv': '',
...                      'dtype': {'dtype': 'ordinal', 'ord_order': ['a', 'b', 'c']}}}
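
A usage sketch, assuming features_df contains columns for the variables named in woe_dict:

>>> transformer = WOETransformer(woe_dict)
>>> transformer.fit(features_df, targets)        # calculates the WOE values for the existing bins
>>> woe_df = transformer.transform(features_df)
>>> transformer.save_woe_to_json('woe_dict.json')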
fit(features_df, targets, sample_weights=None)

Fit the WOE transformer to data.

woe_dict must contain binning information for the features in features_df, or a ValueError is raised.

The binning specified by the internal state is used. The WOE values are calculated for these bins and passed to the internal state of the class.

Parameters
  • features_df (dataframe) – The input features where each column is a variable, and each row is a sample.

  • targets (array-like) – The corresponding (binary) targets with shape (num_samples,). A value of 0 indicates a "good" sample. A value of 1 indicates a "bad" sample.

  • sample_weights (pandas Series) – Optional. Individual weights for each sample in features.

Raises

ValueError – Raised if woe_dict does not contain binning information for one or more features in features_df.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_info_vals()

Collect the Information Value for each variable.

Returns

The information value for each feature.

Return type

dict

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

get_woe_lookup()

Return the woe lookup dictionary.

Return woe_lookup

The WOE lookup dictionary.

Return type

dict

save_woe_to_json(filename)

Save WOE information to JSON file. Note that this also includes variable and binning information.

Parameters

filename (str) – Name of the output file to which the dictionary is written.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

transform(features_df)

Apply the WOE transformation to a dataset. All the variables in the dataset will be transformed.

Parameters

features_df (pandas dataframe) – The input features with shape (num_samples, num_features).

Returns

The WOE-transformed features with shape (num_samples, num_features).

Return type

pandas dataframe

Raises

ValueError – Raised if woe_dict does not contain binning information for one or more features in features_df.

weights_of_evidence.autobin(data, labels, vars_type, sample_weights=None)

Perform autobinning for all variables in the data dataframe.

Parameters
  • data (Pandas dataframe) – A dataframe of the data that needs to be binned.

  • labels (array-like) – The binary labels good=0, bad=1.

  • vars_type (dict) – A dictionary where keys are the name of the variable, and values specify the dtype: 'categorical', 'ordinal', 'numerical', and 'mixed' are allowed.

  • sample_weights (pandas Series) – Optional. Individual weights for each sample in features.

Returns

A dictionary of binning objects.

Return type

dict of WOEBinning objects
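
Example:

A usage sketch; the variable names are illustrative, and whether the vars_type values are plain dtype strings (as documented above) or var_type-style dictionaries is an assumption.

>>> vars_type = {'age': 'numerical', 'region': 'categorical'}
>>> binning_objects = weights_of_evidence.autobin(data_df, labels, vars_type)
>>> woe_dict = weights_of_evidence.get_bin_info_from_autobin_obj(binning_objects)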

weights_of_evidence.breaks_to_labels(breaks)

Return a list of bin labels for the numerical bins, given the breaks between bins.

Parameters

breaks (list) – The numerical values separating the bins, excluding the implicit extreme edges -inf and inf.

Returns

The bin labels.

Return type

list of strings

weights_of_evidence.calc_bin_counts(bin_dict, features, targets, sample_weights=None)

Count the number of goods and bads in each of the categorical and numerical bins.

The counts in each bin are weighted with the sample weights.

Parameters
  • bin_dict (dict) – Dictionary containing categorical bin features and the breaks separating the numerical bins.

  • features (pandas Series) – The feature values to be binned, of shape (num_samples,).

  • targets (pandas Series) – The corresponding target values, integers of either 0 (for "good") or 1 (for "bad"), of shape (num_samples,).

  • sample_weights (pandas Series) – Optional. Individual weights for each sample in features.

Returns

A dictionary of good/bad counts for each bin.

Return type

dict

Pseudo-example:

>>> bin_dict = {'categorical_bins': [], 'categorical_woe': [], 'numerical_breaks': [], 'numerical_woe': []}
>>> features = pd.Series([...])
>>> targets = pd.Series([...])
>>> calc_bin_counts(bin_dict, features, targets)
    {'categorical': {'good': [...], 'bad': [...]}, 'numerical': {'good': [...], 'bad': [...]}}
weights_of_evidence.concat_woe_and_counts(woe_dict, counts)

Concatenate the labels, counts, and WOE values for the categorical and numerical bins, for use in WOE plots.

Parameters
  • woe_dict (dict) – The WOE binning dictionary.

  • counts (dict) – The corresponding raw good/bad counts for each bin.

Returns

Concatenated labels, WOE values, good counts and bad counts.

Return type

tuple of lists

weights_of_evidence.convert_str_to_cat(series)

Map the string values to a categorical-type encoding.

A mix of string and integer values is allowed. The strings are mapped to distinct integers, with missing values (NaNs) mapped to '-1'. The integers are cast to strings, allowing string-valued variables to be used directly when convenient.

Parameters

series (pandas series) – The string-valued data.

Return results

A tuple with three elements: a list of categories; a list of string values, of shape (num_samples,); and a bidirectional dictionary that provides the mapping string <-> categorical.

Return type

tuple

weights_of_evidence.counts_to_woe(bin_dict, counts, pseudocount=1e-05)

Convert bin counts to WOE for a specific variable.

Parameters
  • bin_dict (dict) – Binning dictionary containing categorical bin values and breaks for numerical bins.

  • counts (dict) – Corresponding good/bad counts for each bin.

  • pseudocount (float) – Optional. A small value added to the good and bad counts so that empty bins do not produce undefined WOE values. Default: 1e-05.

Returns

WOE lookup dictionary which maps each bin to a WOE value.

Return type

dict

Example:

>>> # Show the structure of the dictionaries.
>>> bin_dict = {'categorical_bins': list,  'numerical_breaks': list}
>>> counts = {'categorical': {'good': list, 'bad': list}, 'numerical': {'good': list, 'bad': list}}
>>> lookup = {'var1': {'categorical': {'keys': list, 'woe_vals': list, 'labels': list, 'bad_rate': list},
...                    'numerical': {'keys': list, 'woe_vals': list, 'labels': list, 'bad_rate': list},
...           'iv': float}}
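
For reference, a sketch of the conventional pseudocount-smoothed WOE calculation for a single bin; whether this implementation uses a natural or base-10 logarithm is an assumption, not specified above.

>>> import numpy as np
>>> def woe_for_bin(good, bad, total_good, total_bad, pseudocount=1e-05):
...     """Illustrative WOE: log of the smoothed good rate over the smoothed bad rate."""
...     good_rate = (good + pseudocount) / total_good
...     bad_rate = (bad + pseudocount) / total_bad
...     return np.log(good_rate / bad_rate)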
weights_of_evidence.filter_dict(woe_dict, filter_list)

Filter woe_dict.

Parameters
  • woe_dict (dict) – A WOE dictionary

  • filter_list (list) – A list of all the features that need to be included.

Returns

Filtered WOE dictionary

Return type

dict

weights_of_evidence.find_str_cols(dataset)

Identify all the columns in a dataframe that contain string variables.

Note

Only model variables should be included in dataset, i.e. columns containing things like target values, customer IDs, etc. should be removed first.

Parameters

dataset (pandas dataframe) – The dataframe.

Return str_cols

A list of the column names containing string variables.

Return type

list of strings

weights_of_evidence.get_bin_info_from_autobin_obj(autobinning)

Extract the woe_dict information from the autobin object returned by the function autobin().

Parameters

autobinning (dict) – A dictionary of binning objects, one for each feature.

Returns

A WOE dictionary.

Return type

dict

weights_of_evidence.get_woe_values_from_counts(counts, values_type, prior_log_odds, pseudocount)

Convert bin counts into WOE values.

Parameters
  • counts (dict) – Corresponding good/bad counts for each bin.

  • values_type (string) – The type of values ('categorical' or 'numerical').

  • prior_log_odds – The log of the total good/bad ratio for the corresponding variable.

  • pseudocount (float) – The pseudocount to add to the good and bad counts.

Returns

The WOE values.

Return type

np.ndarray

weights_of_evidence.read_dict_from_json(filename)

Read a dictionary from a JSON file.

Parameters

filename (str) – The name of the JSON file. This should include the directory name.

weights_of_evidence.save_dict_to_json(bin_dict, filename)

Save a dictionary to a JSON file.

Parameters
  • bin_dict (dict) – The dictionary to be saved.

  • filename (str) – The output file name where the dictionary is written to.

weights_of_evidence.transform_woe(feature_values, woe_lookup, var_name=None)

Transform features values into WOE values.

The data values of a variable, feature_values, are transformed to WOE values using the WOE lookup dictionary for that variable. In other words, we assign each sample in feature_values the WOE value corresponding to the bin to which it belongs.

Both the bins and the WOE values associated with each bin are defined in woe_lookup.

Example:

>>> woe_lookup = {'categorical': {'keys': [...], 'woe_vals': [...], 'labels': [...], 'bad_rate': [...]},
...               'numerical': {'keys': [...], 'woe_vals': [...], 'labels': [...], 'bad_rate': [...]},
...               'iv': float,
...               'dtype': str,
...               'ord_order': list}
Parameters
  • feature_values (pd.Series) – Pandas series of input feature values of shape (num_samples,).

  • woe_lookup (dict) – The WOE lookup dictionary for the variable.

  • var_name (str) – Name of the variable to be transformed.

Returns

Array of WOE values of shape (num_samples,).

Return type

array-like

constrata_core_credit.weights_of_evidence_widgets

A module containing visual tools (IPython Widgets) for binning and Weights of Evidence (WOE)

class weights_of_evidence_widgets.BinningVisualiser(woe_binning_objects, datasets)

Bases: object

A GUI for visualising and comparing the binning information for different datasets, e.g. train and validation.

Parameters
  • woe_binning_objects (dict) – A dictionary of the binning objects. These are the same for all datasets.

  • datasets (dict) – A dictionary of the dataframes (e.g. train and validation) that will be compared.

Example

>>> # An example of the ``datasets`` argument.
>>> datasets = {"TRN": {'data': train_data[selected_varnames], 'labels': train_labels},
...             "VLD": {'data': train_data[selected_varnames], 'labels': train_labels}}
calc_metrics()

Calculate all WOE and counts needed for the plots.

Returns

A dictionary of the dataset information, for example, the train and validation datasets.

Return type

dict

Example

>>> # The structure of the returned dictionary is the following.
>>> return_dict = {'train': {'bin_labels': [], 'woe': [], 'counts_good': [], 'counts_bad': []},
...                'val': {'bin_labels': [], 'woe': [], 'counts_good': [], 'counts_bad': []}}
select_variable(variable)

Change currently selected variable.

Parameters

variable (str) – Name of selected variable

update_plot()

Update all plots in the GUI, usually when a new variable is selected.

class weights_of_evidence_widgets.ManualBinningUI(data, targets, bin_dict, predefined_bins=None)

Bases: object

Do manual bin adjustment using a widget.

The data is in the form of a pandas dataframe whose columns are the variables that need to be binned. Fine-tuning the bin boundaries starts with an initial binning that is typically provided by an auto-binning algorithm. This binning should also be provided, in the form of a bin dictionary. The dictionary has the structure {key: object}, where key is the variable name and object is the Python object returned by WOEBinning. The object contains all the information about its variable, including all binning information.

Parameters
  • data (dataframe) – A dataframe containing the variables to be binned, of shape (num_samples, num_variables).

  • targets (array-like) – The labels or target vector, of shape (num_samples,).

  • bin_dict (dict) – The binning dictionary containing all binning information about the variables.

  • predefined_bins (dict[str,dict]) – A dictionary that contains variables and their predefined binning boundaries, which should therefore be protected. Keys must match the variable names in bin_dict, and corresponding values must be a dict specifying the categorical and numerical breaks where appropriate. An example is shown below.

Example

>>> # An example of specifying the predefined bins.
>>> predefined_bins = {
...    'a_numerical_var': {'categorical': None, 'numerical': [10, 15, 20]},
...    'a_categorical_var': {'categorical': [['abc'], ['missing']], 'numerical': None},
...    'a_mixed_var': {'categorical': [['abc'], ['missing']], 'numerical': [2, 4, 6]},
... }
>>> manual_binning = ManualBinningUI(..., predefined_bins=predefined_bins)
adjust_break(**kwargs)

Call-back function called whenever a break is adjusted via one of the break adjuster TextBoxes.

Parameters

kwargs (dict) – A dictionary of FloatBox values.

adjust_categorical_bins(change)

Call-back function called whenever the content of the categorical bin input textbox is changed.

Parameters

change (object) – Change in state of the categorical bins text box

property binning

Read binning object for currently selected variable.

property categorical_bins

Return the categorical bins for currently selected variable.

property data

Read data for currently selected variable.

get_bins()

Return the current binning for all the variables.

get_woe_dict()

Return the bin dictionary for all the variables.

Return woe_dict

The WOE dictionary

Return type

dict

static load_bins_from_json(fname_in)

Load binning information from JSON file.

Parameters

fname_in (str) – Filename to load from

Returns

The dictionary of binning information

Return type

dict

merge_bins(change)

Merge two bins by removing a break/edge. The call-back is only activated upon a change selected by the widget; therefore change is passed to the call-back by the widget.

Parameters

change (object) – An object with details regarding the change in bin boundaries as selected through the merge-bin widget.

property numerical_breaks

Read breaks for currently selected variable.

save_bins_to_json(fname_out)

Save binning information to JSON file.

Parameters

fname_out (str) – Filename to write to

select_removed_variable(change)

Change currently selected variable for removal.

Parameters

change (dict) – A dictionary with details on the change in state of the variable selection dropdown

select_variable(change)

Register a change if a new variable is selected for binning.

Parameters

change (dict) – A dictionary with details on the change in state of the variable selection dropdown

split_bins(change)

Split a bin by inserting a break/edge halfway between the edges of the specified bin.

Parameters

change (dict) – A dictionary with details regarding the change in bin boundaries as selected through the split-bin widget.

update_break_adjuster_widgets(is_protected=False)

Update the break adjuster widgets to show the current breaks.

update_dropdowns()

Populate the merge/split drop-downs to show the current breaks.

update_plot()

Update all the plots.

property varnames

Return the variable names

weights_of_evidence_widgets.add_legend_to_plot(plot, legend_colours, legend_labels)

Add a legend outside of a plot.

Parameters
  • plot (Figure widget) – Plot to which a legend must be added

  • legend_colours (list) – Colours to be used for the legend icons

  • legend_labels (list) – Labels to be used for the legend

Returns

Combined plot and legend

constrata_core_credit.variable_reduction

A module for performing variable selection and reduction

class variable_reduction.CorrelationMatrixColorCoder(lower_bound, upper_bound)

Bases: object

A class for indicating different levels of correlation between variables (for use with correlation_vif_table).

Parameters
  • lower_bound (float) – The lower bound (inclusive) for medium correlation.

  • upper_bound (float) – The upper bound (exclusive) for medium correlation.

Raises
  • TypeError – If either or both of the bounds are not specified.

  • ValueError – If the upper bound is smaller than the lower bound.

get_explanation_df()

Get a colour-coded dataframe that indicates the meaning of each colour in the stability table.

Returns

colour-coded dataframe.

Return type

pandas dataframe

variable_reduction.calculate_gini_indices(train_data_df, train_labels, test_data_df, test_labels)

Calculate the Gini Indices for all variables (corresponding to columns in the dataframes) by training a Logistic Regression model on each variable and evaluating the performance of the model.

Parameters
  • train_data_df (pandas dataframe) – The dataframe of the variables to be used for training, of shape (num_samples, num_features).

  • train_labels (array-like) – The labels, of shape (num_samples,).

  • test_data_df (pandas dataframe) – The dataframe of the data to be used for testing, of shape (num_samples, num_features).

  • test_labels (array-like) – The labels, of shape (num_samples,).

Returns

A pandas DataFrame mapping the variables to their respective Gini Indices.

Return type

DataFrame

variable_reduction.calculate_vif(dataset_df)

Calculate the Variance Inflation Factors for a set of variables.

Parameters

dataset_df (pandas dataframe) – Features with shape (num_samples, num_features).

Returns

Variance Inflation Factors with shape (num_features,).

Return type

pandas series

Note

Example from Wikipedia:

Suppose that the variance inflation factor of a predictor variable were \(5.27\) (\(\sqrt{5.27} \approx 2.3\)). This implies that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

variable_reduction.cluster_correlated_variables(data_df)

Cluster and order the variables that are correlated.

Please ensure that data_df does not contain the target variable.

Parameters

data_df (pandas dataframe) – A dataframe with columns for each WOE transformed variable (feature).

Returns

A tuple (clusters_dict, sorted_df, sorted_variable_order), where clusters_dict is a dictionary of the cluster membership, sorted_df is the sorted correlation dataframe, and sorted_variable_order is a list of the variables in sorted order.

Return type

tuple
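
Example:

A usage sketch, assuming woe_df is a dataframe of WOE-transformed features without the target column:

>>> clusters_dict, sorted_df, sorted_variable_order = variable_reduction.cluster_correlated_variables(woe_df)
>>> variable_reduction.plot_clustered_correlation_matrix(sorted_df, sorted_variable_order)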

variable_reduction.create_correlation_vif_table(data_df, correlation_colour_coder)

Given WOE-transformed variables, calculate a dataframe containing the variable correlations and VIFs. Ensure that the input dataframe does not contain the target variable.

Parameters
  • data_df (pandas dataframe) – A dataframe with columns for each WOE transformed variable (feature).

  • correlation_colour_coder (CorrelationMatrixColorCoder object) – An object for indicating different levels of correlation

Returns

A dataframe containing the variable correlations and VIF.

Return type

pandas dataframe

variable_reduction.plot_clustered_correlation_matrix(re_indexed_df, new_variable_order, fig_height=7)

Plot the clustered variable correlation matrix, according to the given variable order.

Parameters
  • re_indexed_df (pandas dataframe) – The re-indexed (sorted) correlation dataframe, e.g. sorted_df as returned by cluster_correlated_variables().

  • new_variable_order (list) – The variables in sorted order, e.g. sorted_variable_order as returned by cluster_correlated_variables().

  • fig_height (int) – The height (in inches) of the figure to be plotted. Default: 7.

variable_reduction.plot_selected_variables_iv(info_values_df)

Plot variable information values (IVs) and show the variables that lie in an acceptable IV range (as indicated by the in_acceptable_range column in info_values_df) in a different colour.

The input dataframe has the following columns:

'variable'
'information_value'
'in_acceptable_range'
Parameters

info_values_df (pandas dataframe) – The input dataframe.

variable_reduction.select_best_variables_from_clusters(clusters_dict, variable_ratings_dict)

Select the best (highest-rated) variable in each cluster from a set of clusters. The variables are selected from clusters_dict according to the measure represented by the values in variable_ratings_dict. This measure could, for example, be the Gini Index or the Information Value.

Parameters
  • clusters_dict (dict) – A dictionary containing the cluster indices and variable names.

  • variable_ratings_dict (dict) – A dictionary containing the variable names as keys and the value of their ratings as values.

Returns

The best variables (one for each cluster).

Return type

list of strings

variable_reduction.select_variables_on_iv(info_values_dict, min_threshold, max_threshold)

Perform variable selection by filtering out variables based on Information Value thresholds.

Parameters
  • info_values_dict (dict) – Dictionary containing the binning information for each variable.

  • min_threshold (float) – The minimum Information Value threshold, below which variables will be excluded.

  • max_threshold (float) – The maximum Information Value threshold, above which variables will be excluded.

  • plot (bool) – Whether or not to plot the Information Value bar plot.

Returns

A pandas DataFrame mapping each variable name to its Information Value.

Return type

pandas DataFrame

variable_reduction.select_variables_on_vif(dataset, max_vif=5)

Prune a set of variables to remove all variables with a variance inflation factor (VIF) higher than max_vif.

Parameters
  • dataset (pandas dataframe) – Features with shape (num_samples, num_features).

  • max_vif (float) – The maximum VIF allowed for the set of variables to be kept.

Returns

The list of retained variable names.

Return type

list of strings

variable_reduction.show_correlation_vif_table(data_df, lower_bound, upper_bound)

Show the colour-coded correlation and VIF table.

Parameters
  • data_df (pandas dataframe) – A dataframe with columns for each WOE transformed variable (feature).

  • lower_bound (float) – The lower bound for medium correlation.

  • upper_bound (float) – The upper bound for medium correlation.

constrata_core_credit.scorecard

A module for creating scorecards from models

class scorecard.Scorecard(base_odds=0.1, base_points=500, ptdo=50, **kwargs)

Bases: sklearn.base.BaseEstimator

A Scorecard model, essentially logistic regression with outputs scaled to produce a credit score.

The keyword arguments can be any of those accepted by sklearn's logistic regression.

Parameters
  • base_odds (float) – The odds ratio (bad/good) that will be mapped to the score specified by base_points. Default: 0.1

  • base_points (float) – The reference score which will correspond to base_odds. Default: 500

  • ptdo (float) – Points-To-Double-Odds. The increment in score which corresponds to a doubling of the odds. Default: 50

  • **kwargs – These keyword arguments are passed through to sklearn's logistic regression constructor.

Default

>>> # ``kwargs`` for sklearn logistic regression.
>>> logistic_regression_kwargs = {
...     'C': 1000,
...     'random_state': 0,
...     'max_iter': 1000,
...     'solver': 'lbfgs'
... }
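
For reference, a sketch of the standard points-to-double-odds scaling that these parameters suggest; this is an assumption about the implementation, not taken from it.

>>> import numpy as np
>>> base_odds, base_points, ptdo = 0.1, 500, 50
>>> def log_odds_to_score(log_odds):
...     """Map bad/good log-odds to a score under conventional PDO scaling (illustrative)."""
...     factor = ptdo / np.log(2)
...     offset = base_points + factor * np.log(base_odds)
...     return offset - factor * log_odds
>>> log_odds_to_score(np.log(base_odds))      # 500.0: the base odds map to base_points
>>> log_odds_to_score(np.log(2 * base_odds))  # ~450.0: doubling the odds subtracts ptdo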
fit(features, targets, sample_weights=None)

Fit the Scorecard model to a dataset.

Parameters
  • features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).

  • targets (array-like of dtype: int (binary 0 or 1)) – Target labels (binary) of shape (num_samples,).

  • sample_weights (array-like of dtype: float) – Sample weights of shape (num_samples,).

Returns

The trained model.

Return type

sklearn model

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

dict

predict(features)

Make predictions and convert to scores.

Parameters

features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).

Returns

Scores of shape (num_samples,).

Return type

array-like of dtype int

predict_log_odds(features)

Predict the log-odds for a default.

Parameters

features (array-like) – Weights of Evidence (WOE) features (float) of shape (num_samples, num_features).

Returns

Log-odds of shape (num_samples,).

Return type

array-like

predict_proba(features)

Predict the probability for a default.

Parameters

features (array-like of dtype: float) – Weights of Evidence (WOE) features of shape (num_samples, num_features).

Returns

Probability of default of shape (num_samples,).

Return type

array-like of dtype: float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

estimator instance

scorecard.create_client_summary(client_data, model_summary, woe_lookup)

Create a summary of the client indicating how the attributes of the features contribute towards the score. The client summary information is extracted from the model summary.

Parameters
  • client_data (pandas dataframe) – Original features of client of shape (1, num_features).

  • model_summary (pandas dataframe) – Summary of the point contributions.

  • woe_lookup (dict) – The WOE lookup dictionary.

Returns

Summary of the client point contributions.

Return type

pandas dataframe

scorecard.create_model_summary(scorecard_model_pipeline)

Create a summary of the model indicating how the attributes of the features contribute towards the score.

Parameters

scorecard_model_pipeline (sklearn Pipeline) – The trained model pipeline.

Returns

Summary of the point contributions.

Return type

pandas dataframe

scorecard.extract_scorecard_info(scorecard_model_pipeline)

The model parameters (weights and offset) are trained to predict a probability. If a score is needed, the model parameters must be transformed appropriately. This function first extracts the model parameters and the woe_lookup dictionary from the trained model pipeline. It then transforms the model parameters into those needed for returning a score.

Parameters

scorecard_model_pipeline (sklearn Pipeline) – Trained model pipeline.

Return coef

The model weights modified using the scale factor.

Return type

list

Return intercept

The transformed offset.

Return type

float

Return woe_lookup

The woe lookup dictionary.

Return type

dict

scorecard.get_vars_from_model_pipeline(scorecard_model_pipeline)

Extract the variables that were used to train the model from the model pipeline.

Parameters

scorecard_model_pipeline (sklearn Pipeline) – The model pipeline of the trained model.

Returns

The variable names.

Return type

list

scorecard.show_client_summary(client_data, scorecard_model_pipeline)

Show a summary of the client indicating how the attributes of the features contribute towards the score.

Parameters
  • scorecard_model_pipeline (sklearn pipeline) – The trained model pipeline.

  • client_data (pandas dataframe) – Original features of client of shape (1, num_features).

scorecard.show_model_summary(scorecard_model_pipeline, datasets)

Show a summary of the model indicating how the attributes of the features contribute towards the score. Gini values are shown for the different datasets, e.g. train, validation and test sets. The different datasets are in the form of a dictionary of dataframes where each dataframe represents one of the datasets for which the Gini coefficient is required.

Parameters
  • scorecard_model_pipeline (sklearn pipeline) – The trained model pipeline.

  • datasets (dict) – The dictionary of the dataframes.

>>> datasets = {'TRN': {'data': train_data_df, 'labels': labels},
...             'VAL': {'data': validation_df, 'labels': labels},
...             'TST': {'data': test_df, 'labels': labels}}

Note

The datasets include the features and labels that are used to calculate the Gini coefficients.

constrata_core_credit.monitoring

A module for monitoring model stability

class monitoring.StabilityClassesColorCoder(no_shift_upper_bound, serious_shift_lower_bound)

Bases: object

A class for indicating different levels of model stability (for use with calculate_stability_table and show_stability_table).

Parameters
  • no_shift_upper_bound (float) – The upper bound for defining no shift.

  • serious_shift_lower_bound (float) – The lower bound for defining serious shift.

Raises

ValueError – If the lower bound is not smaller than the upper bound.

get_explanation_df()

Get a colour-coded dataframe that indicates the meaning of each colour in the stability table.

Returns

colour-coded dataframe.

Return type

pandas dataframe

monitoring.calculate_scorecard_distributions(score_bin_splits, train_score, test_score, train_target, test_target)

Calculate the distributions over score intervals for a training and test set.

Parameters
  • score_bin_splits (array-like) – The list of splits defining the credit score bins.

  • train_score (pandas dataframe) – The scores for the training dataset.

  • test_score (pandas dataframe) – The scores for the testing dataset.

  • train_target (pandas series) – The training labels.

  • test_target (pandas series) – The testing labels.

Returns

The development and recent distributions.

Return type

pandas DataFrame

Raises

ValueError – If the list of splits defining the credit score bins contains duplicates.

monitoring.calculate_stability_table(scorecard_dists_df)

Calculate the model stability. The model stability is defined by the symmetric Kullback–Leibler divergence between the (scorecard) model output distributions for the training data (development_data) and the model output for the testing data (recent_data).

Parameters

scorecard_dists_df (DataFrame) – A pandas DataFrame containing the score distribution for the training (development) and testing (recent) datasets across a number of bins. scorecard_dists_df is typically the output from the calculate_scorecard_distributions() function.

Returns

The dataframe containing the stability information.

Return type

pandas DataFrame
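
Example:

A usage sketch; the bin splits and the score/target variable names are illustrative assumptions.

>>> dists_df = monitoring.calculate_scorecard_distributions(
...     score_bin_splits=[300, 400, 500, 600, 700],
...     train_score=train_scores_df, test_score=test_scores_df,
...     train_target=train_targets, test_target=test_targets)
>>> stability_df = monitoring.calculate_stability_table(dists_df)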

monitoring.show_stability_table(scorecard_dists_df, no_shift_upper_bound, serious_shift_lower_bound)

Show the colour-coded model output stability table.

Parameters
  • scorecard_dists_df (pandas DataFrame) – A DataFrame containing the model output (score) distributions for the training (development) and testing (recent) datasets, typically the output of calculate_scorecard_distributions().

  • no_shift_upper_bound (float) – The upper bound for defining no shift.

  • serious_shift_lower_bound (float) – The lower bound for defining serious shift.

constrata_core_credit.monitoring_plots

A module for generating reporting plots

class monitoring_plots.ValuesToLabels(feature_values, bin_dict)

Bases: object

Given a pandas series of feature values and bin information, the values in the series are assigned to given labels. The labels can be numerical (e.g. WOE values) or strings (e.g. labels).

This class is instantiated with a pandas series of feature values and bin descriptions.

The values in the series will then be transformed to the given labels (numerical or categorical).

This is therefore a generalisation of the transform_woe() function in the weights_of_evidence module, in the sense that transform_woe() only transforms values to WOE values.

Example

>>> feature_values = pd.Series([1.0, 'a', 4.0, 'b', 'c', 5.0])
>>> bin_dict = {'categorical': {'keys': [['a', 'b'],['c']]}, 'numerical': {'keys': [2.0]}}
>>> cat_labels = ['0', '1']
>>> num_labels = [-5.0, 5.0]
>>> transf = ValuesToLabels(feature_values, bin_dict)
>>> transf.transform(cat_labels, num_labels)
array(['-5.0', '0', '5.0', '0', '1', '5.0'])
Parameters
  • feature_values (pandas series) – Series of input feature values of shape (num_samples,).

  • bin_dict (dict) – The binning dictionary for the variable.

transform(cat_labels=None, num_labels=None)

The functor returning the transformed values.

Note

Either numerical or string values are returned, depending on the values in cat_labels and num_labels

Parameters
  • cat_labels (list) – List of the labels that will be assigned to the categorical values.

  • num_labels (list) – List of the labels that will be assigned to the numerical values.

Return labeled_feature

The labels assigned to the input values.

Return type

pandas series of type str or float

monitoring_plots.calc_bin_count(data, data_description, outcome_column, column_name, status, target)

Calculate the bin counts of the outcome specified by status.

Parameters
  • data (pandas dataframe) – A dataframe of shape (n_samples, n_variables).

  • data_description (str) – The name of the data sample, e.g. β€˜train’.

  • outcome_column (str) – The column in data that contains the application outcome.

  • column_name (str) – The column to get the count for.

  • status (str) – The application outcome status for which the count is required.

  • target (str) – The name of the target variable.

Returns

A dataframe with the successful or declined application outcome counts, the percentages, and the good and bad ratings.

Return type

pandas dataframe

monitoring_plots.calc_dataset_counts(data_sets, outcome_column, column_name, status, target)

Create a dataframe of the bin counts of a specified application outcome for different datasets.

Note

The status values are specified in outcome_column. One can calculate the counts for any of these values as specified by status.

Parameters
  • data_sets (dict) – A dictionary of the data sets with the structure shown below. Its keys specify the different datasets.

  • outcome_column (str) – The column in data_sets that contains the application outcome.

  • column_name (str) – The column name for which the counts are required.

  • status (str) – The application outcome status, i.e. whether the application was successful or declined.

  • target (str) – The name of the target variable.

Return count_df

A dataframe with the counts of the specified application outcome for the different datasets.

Example

>>> # Structure of the `datasets` argument.
>>> data_sets = {'development': develop_df, 'rolling_recent': rolling_recent_df}
monitoring_plots.calc_psi(first_distr, second_distr)

Calculate the Population Stability Index (PSI) for two distributions. The PSI is a symmetric version of the Kullback–Leibler divergence. The order of the two distributions does not matter. Note that np.log10 is used, not np.log.

Parameters
  • first_distr (pandas series) – The first distribution.

  • second_distr (pandas series) – The second distribution.

Return psi

The PSI value

Return type

float
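
Example:

A sketch of the symmetric PSI calculation described above, using np.log10 as noted; the handling of empty bins in the actual implementation is not specified here.

>>> import numpy as np
>>> def psi(first_distr, second_distr):
...     """Symmetric KL divergence in base 10 between two distributions (illustrative)."""
...     p = np.asarray(first_distr, dtype=float)
...     q = np.asarray(second_distr, dtype=float)
...     return float(np.sum((p - q) * np.log10(p / q)))
>>> round(psi([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]), 4)
0.0222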

monitoring_plots.calculate_outcome_counts(outcomes)

Calculate the counts for the different outcomes, given a pandas series of the outcomes.

Parameters

outcomes (pandas series) – A series of outcomes.

Return outcome_df

A dataframe with the counts and the percentages for the different outcomes.

Return type

dataframe

monitoring_plots.create_outcome_dataframe(data_sets, outcome_column)

Create a dataframe of the counts of different application outcomes for a number of datasets.

Parameters
  • data_sets (dict) – A dictionary where the key is a dataset name and the value the dataset, with the structure given in the example below.

  • outcome_column (str) – The column in data that contains the application outcome.

Example

The structure of the data_sets argument:

>>> data_sets = {'train': train_df, 'test':test_df, 'validation': validation_df}
monitoring_plots.create_successful_summary(data, outcome_column, target)

Create a summary dataframe of clients with successful application outcomes over the selected time period. The dates are given in the date column.

The input dataframe data should have the following columns:

  • An outcomes column specified in outcome_column.

  • A date column called date. The outcomes will be calculated separately at these different dates.

  • A target column specified in target. The labels of the loans that have defaulted.

The output dataframe gives the number of successful applications per month as well as the fraction per month that subsequently defaulted.

Parameters
  • data (pandas dataframe) – A dataframe containing a column named date and the outcomes and target columns specified by the parameters.

  • outcome_column (str) – The column with the application outcomes.

  • target (str) – The target column name with the default labels.

Returns

A dataframe with the number of successful applicants, the total applications, and the default rate.

monitoring_plots.plot_application_outcome(outcome_df, datasets, bar_width=0.13, tick_separation=0.3)

Compare the application outcomes between different datasets. The data that is displayed in the graph is in outcome_df, a dataframe that is created by the function create_outcome_dataframe in the monitoring_plots module.

The outcomes that are compared are given by the index of outcome_df, see example below.

The names of the different datasets are specified in the datasets argument.

The percentage and the count of each outcome are shown in the graph for each of the different datasets. Each dataset name therefore contributes two columns in outcome_df where the dataset name is suffixed by _percent and _count.

Any number of datasets is allowed. However, only 7 different color schemes are provided to distinguish between the datasets.

Parameters
  • outcome_df (dataframe) – A dataframe with the structure as shown in the example below.

  • datasets (list) – A list of the names of the data sets that are compared, of type str.

  • bar_width (float) – Set the width of the bars in bar plot. default: 0.13

  • tick_separation (float) – Set the tick separation on the x-axis. default: 0.3

Example

This is an example of outcome_df and datasets for a single dataset, train, and four possible outcomes: 'Declined', 'Successful', 'NotTakenUp', and 'NotSet'.

>>> data = {'train_count': [7724, 1662, 1094, 575], 'train_percent': [70.0, 15.0, 10.0, 5.0]}
>>> index = ['Declined', 'Successful', 'NotTakenUp', 'NotSet']
>>> outcome_df = pd.DataFrame(data, index=index)
>>> datasets = ['train']

Note

The dataset name, 'train', leads to two columns in outcome_df, namely, train_count and train_percent where the prefix refers to the dataset name.

monitoring_plots.plot_correlation_matrix(data, title, corr_vars)

Create a correlation matrix for variables in a dataset sample. The correlations are typically calculated on the WOE values, and the WOE suffix must be appended to the variable names, i.e. for the age variable the corresponding variable name looks like age_WoE.

The values must all be numerical values (float or int).

Parameters
  • data (pandas dataframe) – Dataframe with all successful application outcomes for the two data sets.

  • title (str) – Title of the graph.

  • corr_vars (list) – List of all the variables for the matrix.

monitoring_plots.plot_gini(data_df, variable_list, target)

Create a barplot of the Gini values. The Gini values are calculated using Somers' delta.

Parameters
  • data_df (pandas dataframe) – A dataframe to use for the creation of the plot.

  • variable_list (list) – A list of variables to include in the plot.

  • target (str) – The target variable.

Returns

A barplot of the Gini values.

monitoring_plots.plot_manually_closed_summary(data, dataset_names, bar_width=0.3, tick_separation=0.3)

Create a plot displaying the count and percentages of the reasons for the declined applicants.

Parameters
  • data (dataframe) – The dataframe with a column called ManualCloseReason and for each dataset, it has to have columns called f'{dataset_name}_percent' and f'{dataset_name}_count'.

  • dataset_names (list) – A list of the names of the different datasets, of type str.

  • bar_width (float) – Set the width of the bars in bar plot. default: 0.3

  • tick_separation (float) – Set the tick separation on the x-axis. default: 0.3

monitoring_plots.plot_psi(data_sets, variables, outcome_column, status, target)

Plot PSI values for all variables. The PSI of a variable measures the difference between the distributions of two datasets.

Parameters
  • data_sets (dict) – A dictionary of the two datasets, called development and rolling-mature.

  • variables (list) – A list of variable column names.

  • outcome_column (str) – The column in data that contains the application outcome.

  • status (str) – The application outcome status.

  • target (str) – The name of the target variable.

monitoring_plots.plot_rolling_statistics(data, dataset_names, column_name, bar_width=0.3, tick_separation=0.2)

Create a plot for comparing the percentage of successful applications and the bad rate between two data sets.

Note

This allows the user to compare the two datasets with respect to values such as WOE, or any other discrete set of variables.

Parameters
  • data (dict) – Dictionary of dataframes of the two data sets.

  • dataset_names (list) – The names of the data sets (of type str).

  • column_name (str) – The column with the bucket descriptions.

  • bar_width (float) – Set the width of the bars in bar plot. default: 0.3

  • tick_separation (float) – Set the spacing between the ticks on the x-axis. default: 0.2

Example

The bucket counts are compared, and for this reason '_buckets' needs to be added as a suffix to the variable name. For a variable called 'YearsInBusiness', the column_name becomes

>>> column_name = 'YearsInBusiness' + '_buckets'
monitoring_plots.plot_successful_applications(successful_df)

Plot the number of successful applications as a bar plot and default rate as a line plot on the same axes.

The input dataframe successful_df should have a column called date, since the outcomes are plotted for the different dates. The input dataframe should also have columns named Number_of_clients and default_rate that contain the number of applications for each date and the default rate, respectively.

Note

successful_df is created by a call to create_successful_summary().

Parameters

successful_df (pandas dataframe) – A dataframe with columns named date, Number_of_clients and default_rate.
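
Example

A minimal sketch of the expected input (the values are illustrative); in practice successful_df comes from create_successful_summary():

>>> successful_df = pd.DataFrame({'date': ['2021-01', '2021-02', '2021-03'],
...                               'Number_of_clients': [120, 150, 140],
...                               'default_rate': [0.04, 0.05, 0.045]})
>>> monitoring_plots.plot_successful_applications(successful_df)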

monitoring_plots.somers_delta(x_var, y_var)

Compute Somers’ Delta, a measure of agreement between two ordinal variables. The value ranges from -1 to 1, with -1 indicating perfect disagreement and 1 indicating perfect agreement. For reference, see: Somers’ delta.

Parameters
  • x_var (array-like) – The independent variable of shape (n_samples,).

  • y_var (array-like) – The dependent (binary) variable of shape (n_samples,).

Returns

The calculated Somers’ Delta

Return type

float
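
Example

An illustrative call on hypothetical data:

>>> scores = [0.1, 0.4, 0.35, 0.8, 0.65]   # hypothetical model scores
>>> defaults = [0, 0, 1, 1, 1]             # hypothetical binary outcomes
>>> monitoring_plots.somers_delta(scores, defaults)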

constrata_core_credit.model_evaluation

A module for evaluating models

class model_evaluation.HypothesisTester(model=None, random_state=0, solver='lbfgs', max_iter=100)

Bases: object

A class for testing different models (‘hypotheses’) on a dataset. The user may supply a model; the default is sklearn’s LogisticRegression.

If no model is specified, the user can specify the hyperparameters used to instantiate sklearn’s LogisticRegression class.

Parameters
  • model (object) – The model to be tested. If None, a Logistic Regression model will be used.

  • random_state (int) – The random state of the LogisticRegression, default random_state=0.

  • solver (string) – The optimiser used in sklearn’s LogisticRegression. Default: solver='lbfgs'.

  • max_iter (int) – The maximum number of iterations for which the model will be trained. Default: max_iter=100.

fit_and_test_model(X_train, y_train, X_test, y_test, plot=False)

Fit the model to the training data and test the trained model on the test data.

Parameters
  • X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).

  • y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).

  • X_test (pandas dataframe) – The test samples, of shape (num_samples, num_features).

  • y_test (array-like) – The labels for the corresponding samples, of shape (num_samples,).

  • plot (bool) – Whether or not to plot the ROC curve.

Returns

A dictionary containing the ROC data and Gini Index

Return type

dict

fit_model(X_train, y_train)

Fit the model.

Parameters
  • X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).

  • y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).

test_model(X_train, y_train, plot=False)

Test the model and return the ROC data and Gini Index.

Parameters
  • X_train (pandas dataframe) – The training samples, of shape (num_samples, num_features).

  • y_train (array-like) – The labels for the corresponding samples, of shape (num_samples,).

  • plot (bool) – Whether or not to plot the ROC curve.

Returns

A dictionary containing the ROC data and Gini Index.

Return type

dict

Raises

ValueError – if the model has not been trained.
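
Example

A minimal end-to-end sketch on synthetic data (all data and names here are illustrative; the default LogisticRegression is used):

>>> import numpy as np
>>> import pandas as pd
>>> from constrata_core_credit.model_evaluation import HypothesisTester
>>> rng = np.random.default_rng(0)
>>> X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['var1', 'var2', 'var3'])
>>> y = (X['var1'] + 0.5 * rng.normal(size=200) > 0).astype(int)
>>> tester = HypothesisTester(max_iter=200)
>>> results = tester.fit_and_test_model(X[:150], y[:150], X[150:], y[150:], plot=False)
>>> # results is a dictionary containing the ROC data and Gini Index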

model_evaluation.get_roc_data(model, x, y)

Get the false positive rate (FPR), true positive rate (TPR) and area under the curve (AUC) for a set of (automatically calculated) thresholds, all of which describe an ROC curve.

Parameters
  • model (sklearn model) – The model for which the ROC curve data is calculated.

  • x (array-like) – Input (feature) data of shape (num_samples, num_features).

  • y (array-like) – Target labels of shape (num_samples,), corresponding to x.

Returns

A tuple containing the false positive rate, true positive rate, the thresholds and the area under the curve.

Return type

tuple of dtype float

model_evaluation.get_roc_metrics(model, data, labels)

Calculate the false positive rate (fpr), true positive rate (tpr), area under the curve (auc) and Gini index. The Gini index is related to the AUC via Gini = 2 * AUC - 1.

Parameters
  • model (sklearn pipeline) – The model that should be evaluated. This object should have a predict function and accept a features dataframe as input.

  • data (pandas dataframe) – The dataset of features used for evaluating the model, of shape (num_samples, num_features).

  • labels (pandas dataframe) – The labels for the different samples, of shape (num_samples,).

Returns

A tuple containing a DataFrame with the false positive and true positive rates, the area under the curve, and the Gini index.

Return type

Tuple (DataFrame, float, float)

model_evaluation.get_roc_metrics_for_datasets(model, datasets, data_key='data', labels_key='labels')

Get the ROC metrics for a number of datasets.

Parameters
  • model (Scikit-learn Pipeline) – A scikit-learn pipeline (containing a model) that will be evaluated.

  • datasets (Dict) – A dictionary containing the datasets to be evaluated. Each key should be the name given to the dataset, and each value should be another dictionary containing the model input data under key data_key and the corresponding labels under labels_key.

  • data_key (str) – The key in which the input data is stored for each dict in datasets. Default value is β€œdata”.

  • labels_key (str) – The key in which the corresponding labels of the input data are stored for each dict in datasets. Default value is “labels”.

Returns

A DataFrame containing the ROC metrics for each dataset.

Return type

pandas DataFrame

Example

>>> datasets = {'TRN': {'data': final_train_data, 'labels': train_labels},
...             'VAL': {'data': final_validation_data, 'labels': validation_labels},
...             'TST': {'data': final_test_data, 'labels': test_labels}}
>>> get_roc_metrics_for_datasets(
...     model=model,
...     datasets=datasets
... )

model_evaluation.plot_roc(fpr, tpr, auc, model_name=None, fontsize=12)

Plot a Receiver Operating Characteristic (ROC) curve.

Parameters
  • fpr (float list) – A list of false positive rates.

  • tpr (float list) – A list of true positive rates.

  • auc (float) – The area under the curve.

  • model_name (str) – The name of the model used to plot the ROC.

  • fontsize (int) – The font size of the title and legend of the ROC plot.
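
Example

An illustrative pairing with get_roc_data(); the fitted model and the held-out X_test and y_test are assumed to exist:

>>> fpr, tpr, thresholds, auc = model_evaluation.get_roc_data(model, X_test, y_test)
>>> model_evaluation.plot_roc(fpr, tpr, auc, model_name='Logistic Regression')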

constrata_core_credit.plot_setup

A module for setting plot options

plot_setup.set_plot_config(mode)

Set the matplotlib plot configuration to the desired mode to allow plots to display correctly in the associated mode.

Parameters

mode (str) – The mode to set the plot config to, either ‘light’ or ‘dark’.

Raises

ValueError – if a mode other than β€˜light’ or β€˜dark’ is specified
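
Example

A minimal illustrative call:

>>> from constrata_core_credit import plot_setup
>>> plot_setup.set_plot_config('dark')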

constrata_core_credit.experimental

A module for performing variable selection and reduction

experimental.variable_reduction.plot_clustered_correlation_with_gini(corr_df, variable_order, ginis_df)

Plot the clustered variable correlation matrix together with gini values.

Parameters
  • corr_df (pandas dataframe) – The feature correlation dataframe.

  • variable_order (list) – The variables in sorted (clustered) order.

  • ginis_df (pandas dataframe) – A dataframe of Gini values for all the variables.

Return combined_fig

The figure containing the correlation and Gini plots.

Return type

HBox
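
Example

An illustrative call; corr_df, variable_order and ginis_df are assumed to come from preceding variable-reduction steps:

>>> fig = experimental.variable_reduction.plot_clustered_correlation_with_gini(
...     corr_df, variable_order, ginis_df)
>>> fig  # display the combined HBox figure in a notebook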

Two graphics libraries are used in this project, seaborn and bqplot. The latter is designed to work with ipython widgets and is used whenever widgets are required.

bqplot is based on a Grammar of Graphics framework, and every attribute of the plot is an interactive widget. It is this feature that makes it so useful to integrate with ipython widgets. Moreover, the user has complete control to change any of the attributes of the plot after the fact.

Note that the figure object fig produced by this function consists of two children, the correlation figure corr_fig and the Gini bar plot, whose attributes can be reset by the user. This is particularly useful for changing the appearance of the figure. It might be interesting to note that this is how the figures in the ManualBinning object are updated: only the marks attributes, i.e. the lines and/or bars, are updated; the rest of the figure is left unchanged.

The following is a short introduction to some of the attributes that the user might want to change.

Example

>>> fig.children # list of two figures that make up the combined figure.
>>> len(fig.children) # answer: 2
>>> fig.children[0] # The correlation matrix figure should appear
>>> fig.children[1] # The bar plot should appear by itself
>>> first_fig = fig.children[0] # the first figure for closer inspection
>>> dir(first_fig) # Get all the attributes of the figure. You can also use `tab` complete in the notebook.
>>> first_fig.title # shows the current title, 'Correlation'
>>> first_fig.title = 'new_title' # see how 'Correlation' is changed to 'new_title' in the figure above
>>> first_fig.fig_margin # a dictionary that sets the margin around the figure.
>>> # These values were chosen so that this particular figure displays well
>>> first_fig.layout.height # the height of the figure. Try changing it!
>>> first_fig.layout.width # the width of the figure. Try changing it!
>>> first_fig.axes[0].label # returns the current label. Try changing it!
>>> first_fig.axes[0].label_offset # returns the label offset. Try changing it!