Core Models¶

combo.models.classifier_comb module¶

A collection of methods for combining classifiers

class combo.models.classifier_comb.SimpleClassifierAggregator(base_estimators, method='average', threshold=0.5, weights=None, pre_fitted=False)[source]¶

Bases: combo.models.base.BaseAggregator

A collection of simple classifier combination methods.

Parameters

base_estimators (list or numpy array (n_estimators,)) – A list of base classifiers.
method (str, optional (default='average')) – Combination method: {‘average’, ‘maximization’, ‘majority vote’, ‘median’}. Pass in classifier weights to use the weighted version.
threshold (float in (0, 1), optional (default=0.5)) – Cut-off value to convert scores into binary labels.
weights (numpy array of shape (1, n_classifiers)) – Classifier weights.
pre_fitted (bool, optional (default=False)) – Whether the base classifiers are trained. If True, fit process may be skipped.
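
The combination methods listed for method can be sketched in a few lines of NumPy. This is an illustration only and not combo's implementation: combine_scores is a hypothetical helper, the scores are assumed to be positive-class probabilities for a binary problem, and weights is taken as a 1-D array for simplicity.

```python
import numpy as np

def combine_scores(scores, method="average", weights=None, threshold=0.5):
    """Combine per-classifier positive-class scores into binary labels.

    scores: array of shape (n_samples, n_classifiers), values in [0, 1].
    weights: optional 1-D array of length n_classifiers (assumption).
    """
    scores = np.asarray(scores, dtype=float)
    if method == "average":
        # np.average falls back to the plain mean when weights is None.
        combined = np.average(scores, axis=1, weights=weights)
    elif method == "maximization":
        combined = scores.max(axis=1)
    elif method == "median":
        combined = np.median(scores, axis=1)
    elif method == "majority vote":
        # Each classifier votes 1 if its own score exceeds the threshold.
        votes = (scores > threshold).astype(float)
        combined = np.average(votes, axis=1, weights=weights)
    else:
        raise ValueError(f"unknown method: {method}")
    # The cut-off converts the combined score into a binary label.
    return (combined > threshold).astype(int)

scores = np.array([[0.9, 0.8, 0.2],
                   [0.4, 0.3, 0.9],
                   [0.1, 0.2, 0.3]])
avg_labels = combine_scores(scores, "average")
max_labels = combine_scores(scores, "maximization")
vote_labels = combine_scores(scores, "majority vote")
```

On the second sample the average and the majority vote disagree: one confident classifier lifts the mean above the threshold, while two of the three votes are negative.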

fit(X, y)[source]¶

Fit classifier.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

fit_predict(X, y)[source]¶

Fit estimator and predict on X

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Return probability estimates for the test data X.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: p – The class probabilities of the input samples. Classes are ordered by lexicographic order.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.classifier_dcs module¶

Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks.

class combo.models.classifier_dcs.DCS_LA(base_estimators, local_region_size=30, threshold=None, pre_fitted=None)[source]¶

Bases: combo.models.base.BaseAggregator

Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks. The technique was first proposed by Ho et al. in 1994 [BHHS94] and then extended, under the name DCS Local Accuracy, by Woods et al. in 1997 [BWKB97] to select the most accurate base classifier in a local region. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles which use the majority vote of all base classifiers.

See [BWKB97] for details.

Parameters

base_estimators (list or numpy array (n_estimators,)) – A list of base classifiers.
local_region_size (int, optional (default=30)) – Number of training points to consider in each iteration of the local region generation process (30 by default).
threshold (float in (0, 1), optional (default=None)) – Cut-off value to convert scores into binary labels.
pre_fitted (bool, optional (default=False)) – Whether the base classifiers are trained. If True, fit process may be skipped.
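
The local-accuracy idea can be sketched as follows. This is a simplified reading of the description above and not the DCS_LA implementation: dcs_la_predict is a hypothetical helper, the local region is assumed to be the k nearest training points under Euclidean distance, and the base classifiers' predictions are passed in as precomputed arrays.

```python
import numpy as np

def dcs_la_predict(X_train, y_train, train_preds, x_test, test_preds, k=3):
    """Return the prediction of the locally most accurate classifier.

    train_preds: (n_train, n_classifiers) labels predicted on the training data.
    test_preds:  (n_classifiers,) labels predicted for the single point x_test.
    """
    # Local region: the k training points closest to x_test.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    region = np.argsort(dists)[:k]
    # Accuracy of every classifier restricted to the local region.
    local_acc = (train_preds[region] == y_train[region, None]).mean(axis=0)
    best = int(np.argmax(local_acc))  # ties resolve to the lowest index
    return int(test_preds[best]), best

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
# Two toy classifiers: the first always predicts 0, the second always predicts 1.
train_preds = np.column_stack([np.zeros(6, dtype=int), np.ones(6, dtype=int)])
test_preds = np.array([0, 1])

label_hi, chosen_hi = dcs_la_predict(X_train, y_train, train_preds,
                                     np.array([1.05]), test_preds)
label_lo, chosen_lo = dcs_la_predict(X_train, y_train, train_preds,
                                     np.array([0.05]), test_preds)
```

Neither toy classifier is accurate globally, yet each is perfect in one neighbourhood, so the selection switches between them depending on where the test point falls.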

fit(X, y)[source]¶

Fit classifier.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

fit_predict(X, y)[source]¶

Fit estimator and predict on X

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Return probability estimates for the test data X.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: p – The class probabilities of the input samples. Classes are ordered by lexicographic order.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.classifier_des module¶

Dynamic Ensemble Selection (DES) is an established combination framework for classification tasks.

class combo.models.classifier_des.DES_LA(base_estimators, local_region_size=30, n_selected_clfs=None, use_weights=False, threshold=None, pre_fitted=None)[source]¶

Bases: combo.models.base.BaseAggregator

Dynamic Ensemble Selection (DES) is an established combination framework for classification tasks. The technique is based on Dynamic Classifier Selection (DCS), proposed by Ho et al. in 1994 [BHHS94]. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles which use the majority vote of all base classifiers.

Compared with DCS, DES uses a group of the best classifiers to conduct a second-phase combination, rather than only the single best classifier. The version implemented in this class is DES_LA, which uses local accuracy as the metric for evaluating base classifier performance. predict uses (weighted) majority vote and predict_proba uses (weighted) average.

See [BKSBJ08] for details.

Parameters

base_estimators (list or numpy array (n_estimators,)) – A list of base classifiers.
local_region_size (int, optional (default=30)) – Number of training points to consider in each iteration of the local region generation process (30 by default).
n_selected_clfs (int, optional (default=None)) – Number of selected base classifiers in the second phase combination. If None, it is set to 1/2 * n_base_estimators.
use_weights (bool, optional (default=False)) – If True, use the classifiers’ performance on the local region as their weight.
threshold (float in (0, 1), optional (default=None)) – Cut-off value to convert scores into binary labels.
pre_fitted (bool, optional (default=False)) – Whether the base classifiers are trained. If True, fit process may be skipped.
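
The second-phase combination can be sketched as below. This is a hedged illustration of the n_selected_clfs and use_weights options for a binary problem, not the DES_LA source: des_la_vote is a hypothetical helper and the local accuracies are assumed to be computed beforehand.

```python
import numpy as np

def des_la_vote(local_acc, test_preds, n_selected=None, use_weights=False):
    """Combine the locally best classifiers by (weighted) majority vote.

    local_acc:  (n_classifiers,) accuracy of each classifier in the local region.
    test_preds: (n_classifiers,) binary labels predicted for one test point.
    """
    local_acc = np.asarray(local_acc, dtype=float)
    test_preds = np.asarray(test_preds)
    if n_selected is None:
        # Mirrors the documented default of half the base estimators.
        n_selected = max(1, len(local_acc) // 2)
    # Indices of the classifiers with the highest local accuracy.
    selected = np.argsort(-local_acc, kind="stable")[:n_selected]
    weights = local_acc[selected] if use_weights else np.ones(len(selected))
    if weights.sum() == 0:
        # Fall back to equal weights if every selected classifier scored 0.
        weights = np.ones(len(selected))
    # Weighted share of votes for class 1 among the selected classifiers.
    share = np.sum(weights * test_preds[selected]) / np.sum(weights)
    return int(share > 0.5), selected

local_acc = [0.9, 0.2, 0.3, 0.4]
test_preds = [1, 0, 0, 0]
label_plain, chosen = des_la_vote(local_acc, test_preds, n_selected=3)
label_weighted, _ = des_la_vote(local_acc, test_preds, n_selected=3,
                                use_weights=True)
```

With equal weights the single positive vote is outnumbered; with local accuracy as the weight, the strongest classifier carries the decision.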

fit(X, y)[source]¶

Fit classifier.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

fit_predict(X, y)[source]¶

Fit estimator and predict on X

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Return probability estimates for the test data X.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: p – The class probabilities of the input samples. Classes are ordered by lexicographic order.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.classifier_stacking module¶

Stacking (meta ensembling). See http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/ for more information.

class combo.models.classifier_stacking.Stacking(base_estimators, meta_clf=None, n_folds=2, keep_original=True, use_proba=False, shuffle_data=False, random_state=None, threshold=None, pre_fitted=None)[source]¶

Bases: combo.models.base.BaseAggregator

Meta ensembling, also known as stacking. See http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/ for more information.

Parameters

base_estimators (list or numpy array (n_estimators,)) – A list of base classifiers.
meta_clf (object, optional (default=LogisticRegression)) – The meta classifier to make the final prediction.
n_folds (int, optional (default=2)) – The number of splits of the training sample.
keep_original (bool, optional (default=True)) – If True, keep the original features for training and predicting.
use_proba (bool, optional (default=False)) – If True, use the probability prediction as the new features.
shuffle_data (bool, optional (default=False)) – If True, shuffle the input data.
random_state (int, RandomState or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
threshold (float in (0, 1), optional (default=None)) – Cut-off value to convert scores into binary labels.
pre_fitted (bool, optional (default=False)) – Whether the base classifiers are trained. If True, fit process may be skipped.
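
How n_folds and keep_original interact can be illustrated with a generic out-of-fold stacking sketch. It shows the general technique only, and combo's internals may differ: NearestMean is a hypothetical toy classifier and out_of_fold_features a hypothetical helper, both introduced here for illustration.

```python
import numpy as np

class NearestMean:
    """Hypothetical toy classifier: predicts the class with the closest mean."""

    def __init__(self, feature):
        self.feature = feature  # index of the single feature this model uses

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array(
            [X[y == c, self.feature].mean() for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from each sample to each class mean on the chosen feature.
        d = np.abs(X[:, [self.feature]] - self.means_[None, :])
        return self.classes_[np.argmin(d, axis=1)]

def out_of_fold_features(models, X, y, n_folds=2, keep_original=True):
    """Build meta-features from predictions made on held-out folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    meta = np.zeros((len(X), len(models)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(X)), held_out)
        for j, model in enumerate(models):
            # Each base model only predicts rows it was not trained on.
            fitted = model.fit(X[train], y[train])
            meta[held_out, j] = fitted.predict(X[held_out])
    # Optionally place the original features next to the new ones.
    return np.hstack([X, meta]) if keep_original else meta

X = np.array([[0.0, 3.0], [10.0, 2.0], [1.0, 7.0], [9.0, 8.0]])
y = np.array([0, 1, 0, 1])
models = [NearestMean(0), NearestMean(1)]
meta_only = out_of_fold_features(models, X, y, keep_original=False)
stacked = out_of_fold_features(models, X, y, keep_original=True)
```

A meta classifier would then be trained on stacked (or meta_only) together with y; that step is left out of the sketch.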

fit(X, y)[source]¶

Fit classifier.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

fit_predict(X, y)[source]¶

Fit estimator and predict on X

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Return probability estimates for the test data X.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: p – The class probabilities of the input samples. Classes are ordered by lexicographic order.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.classifier_stacking.split_datasets(X, y, n_folds=3, shuffle_data=False, random_state=None)[source]¶

Utility function to split the data for stacking. The data is split into n_folds folds of roughly equal size.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,)) – The ground truth of the input samples (labels).
n_folds (int, optional (default=3)) – The number of splits of the training sample.
shuffle_data (bool, optional (default=False)) – If True, shuffle the input data.
random_state (RandomState, optional (default=None)) – A random number generator instance to define the state of the random permutations generator.

Returns

X (numpy array of shape (n_samples, n_features)) – The input samples. If shuffle_data, return the shuffled data.
y (numpy array of shape (n_samples,)) – The ground truth of the input samples (labels). If shuffle_data, return the shuffled data.
index_lists (list of list) – The list of indexes of each fold regarding the returned X and y. For instance, index_lists[0] contains the indexes of fold 0.
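
The shape of the three return values can be illustrated with a small stand-in. This is not the library function: split_into_folds and its seed argument are hypothetical, and the use of np.array_split to form the folds is an assumption.

```python
import numpy as np

def split_into_folds(X, y, n_folds=3, shuffle_data=False, seed=None):
    """Return (X, y, index_lists) with folds of roughly equal size."""
    X, y = np.asarray(X), np.asarray(y)
    if shuffle_data:
        # Shuffle X and y with the same permutation so rows stay paired.
        order = np.random.RandomState(seed).permutation(len(X))
        X, y = X[order], y[order]
    # array_split tolerates sizes that do not divide evenly.
    chunks = np.array_split(np.arange(len(X)), n_folds)
    index_lists = [chunk.tolist() for chunk in chunks]
    return X, y, index_lists

X = np.arange(14).reshape(7, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0])
_, _, index_lists = split_into_folds(X, y, n_folds=3)
X_s, y_s, shuffled_lists = split_into_folds(X, y, n_folds=3,
                                            shuffle_data=True, seed=0)
```

The indexes refer to the returned (possibly shuffled) arrays, so X_s[shuffled_lists[0]] selects fold 0 of the shuffled data.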

combo.models.cluster_comb module¶

A collection of combination methods for clustering

class combo.models.cluster_comb.ClustererEnsemble(base_estimators, n_clusters, weights=None, reference_idx=0, pre_fitted=False)[source]¶

Bases: combo.models.base.BaseAggregator

Clusterer Ensemble combines multiple base clustering estimators by alignment. See [BZT06] for details.

Parameters

base_estimators (list or numpy array of shape (n_estimators,)) – A list of base estimators. Estimators must have a labels_ attribute once fitted. Sklearn clustering estimators are recommended.
n_clusters (int) – The number of clusters.
weights (numpy array of shape (n_estimators,)) – Estimator weights. May be used after the alignment.
reference_idx (int in range [0, n_estimators-1], optional (default=0)) – The ith base estimator used as the reference for label alignment.
pre_fitted (bool, optional (default=False)) – Whether the base estimators are trained. If True, fit process may be skipped.

labels_¶

The predicted label of the fitted data.

Type: int

fit(X)[source]¶

Fit estimators.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.

fit_predict(X, y=None)[source]¶

Fit estimator and predict on X. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Cluster labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the cluster labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Cluster labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.cluster_comb.clusterer_ensemble_scores(original_labels, n_estimators, n_clusters, weights=None, return_results=False, reference_idx=0)[source]¶

Function to align the raw clustering results from base estimators. Unlike the ClustererEnsemble class, this function takes in the output from base estimators directly, without training and prediction.

Parameters

original_labels (numpy array of shape (n_samples, n_estimators)) – The raw output from base estimators.
n_estimators (int) – The number of base estimators.
n_clusters (int) – The number of clusters.
weights (numpy array of shape (1, n_estimators)) – Estimator weights.
return_results (bool, optional (default=False)) – If True, also return the aligned label matrix.
reference_idx (int in range [0, n_estimators-1], optional (default=0)) – The ith base estimator used as the reference for label alignment.

Returns

aligned_labels – The aligned label results by using reference_idx estimator as the reference.

Return type

numpy array of shape (n_samples, n_estimators)
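
Label alignment is needed because two clusterings can describe the same partition with different cluster ids. The sketch below aligns one label vector to a reference by greedily matching the ids that share the most samples. It is one plausible alignment strategy, written as an assumption, and is not taken from the combo source; align_to_reference is a hypothetical helper.

```python
import numpy as np

def align_to_reference(labels, reference, n_clusters):
    """Relabel `labels` so its cluster ids agree with `reference` where possible."""
    labels, reference = np.asarray(labels), np.asarray(reference)
    # overlap[i, j] = number of samples with reference id i and current id j.
    overlap = np.zeros((n_clusters, n_clusters))
    np.add.at(overlap, (reference, labels), 1)
    mapping = np.zeros(n_clusters, dtype=int)
    for _ in range(n_clusters):
        # Greedily match the pair of ids that share the most samples.
        i, j = np.unravel_index(np.argmax(overlap), overlap.shape)
        mapping[j] = i
        overlap[i, :] = -1  # retire the matched reference id
        overlap[:, j] = -1  # retire the matched current id
    return mapping[labels]

reference = [0, 0, 1, 1, 2, 2]
permuted = [2, 2, 0, 0, 1, 1]   # same partition, different ids
noisy = [1, 1, 0, 0, 2, 0]      # one sample disagrees with the reference
aligned_permuted = align_to_reference(permuted, reference, n_clusters=3)
aligned_noisy = align_to_reference(noisy, reference, n_clusters=3)
```

Applying the same helper to every column of original_labels, with the column at reference_idx as the reference, would give an aligned label matrix of the documented shape.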

combo.models.cluster_eac module¶

Combining multiple clusterings using evidence accumulation (EAC).

class combo.models.cluster_eac.EAC(base_estimators, n_clusters, linkage_method='single', weights=None, pre_fitted=False)[source]¶

Bases: combo.models.base.BaseAggregator

Evidence accumulation clustering (EAC) first builds a similarity matrix for each base clustering, recording whether each pair of samples is assigned to the same cluster. After the similarity matrices are aggregated, hierarchical clustering is performed on the result. See [BFJ05] for details.

Parameters

base_estimators (list or numpy array of shape (n_estimators,)) – A list of base estimators. Estimators must have a labels_ attribute once fitted. Sklearn clustering estimators are recommended.
n_clusters (int) – The number of clusters.
linkage_method (str, optional (default='single')) – The linkage method to use (single, complete, average, weighted, median, centroid, ward). See https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html for more information.
weights (numpy array of shape (n_estimators,)) – Estimator weights. May be used after the alignment.
pre_fitted (bool, optional (default=False)) – Whether the base estimators are trained. If True, fit process may be skipped.
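
The first stage of evidence accumulation, the aggregated similarity (co-association) matrix, can be sketched in NumPy. This is an illustrative reading of the description above, with unweighted averaging assumed; co_association is a hypothetical helper and the hierarchical clustering stage is not shown.

```python
import numpy as np

def co_association(label_matrix):
    """Fraction of base clusterings that put each pair of samples together.

    label_matrix: (n_samples, n_estimators) cluster ids, one column per clustering.
    """
    label_matrix = np.asarray(label_matrix)
    # same[i, j, m] is True when clustering m gives samples i and j the same id.
    same = label_matrix[:, None, :] == label_matrix[None, :, :]
    # Averaging over clusterings aggregates the per-clustering matrices.
    return same.mean(axis=2)

label_matrix = np.array([[0, 1],
                         [0, 1],
                         [1, 1],
                         [1, 0]])
C = co_association(label_matrix)
```

A hierarchical clustering with the chosen linkage_method would then be run on a distance derived from C, for example 1 - C, and cut into n_clusters groups.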

labels_¶

The predicted label of the fitted data.

Type: int

Z_¶

The linkage matrix encoding the hierarchical clustering. This can be used to plot a dendrogram using scipy.

Type: numpy array

fit(X)[source]¶

Fit estimators.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.

fit_predict(X, y=None)[source]¶

Fit estimator and predict on X. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Cluster labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict the cluster labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Cluster labels for each data sample.
Return type: numpy array of shape (n_samples,)

predict_proba(X)[source]¶

Predict the class labels for the provided data.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: labels – Class labels for each data sample.
Return type: numpy array of shape (n_samples,)

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.detector_comb module¶

A collection of methods for combining detectors

class combo.models.detector_comb.SimpleDetectorAggregator(base_estimators, method='average', contamination=0.1, standardization=True, weights=None, pre_fitted=False)[source]¶

Bases: combo.models.base.BaseAggregator

A collection of simple detector combination methods.

Parameters

base_estimators (list, length must be greater than 1) – Base unsupervised outlier detectors from PyOD. (Note: requires fit and decision_function methods)
method (str, optional (default='average')) – Combination method: {‘average’, ‘maximization’, ‘median’}. Pass in weights of detector for weighted version.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
standardization (bool, optional (default=True)) – If True, perform standardization first to convert prediction score to zero mean and unit variance. See http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
weights (numpy array of shape (1, n_detectors)) – Detector weights.
pre_fitted (bool, optional (default=False)) – Whether the base detectors are trained. If True, fit process may be skipped.
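
Why standardization matters before combining can be seen in a short sketch. It is not the aggregator's code: standardize_and_average is a hypothetical helper, z-scoring is done per detector over the given samples, and weights is assumed to be a 1-D array.

```python
import numpy as np

def standardize_and_average(score_matrix, weights=None):
    """Z-score each detector's scores, then average across detectors.

    score_matrix: (n_samples, n_detectors) raw outlier scores.
    """
    score_matrix = np.asarray(score_matrix, dtype=float)
    mean = score_matrix.mean(axis=0)
    std = score_matrix.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant detectors
    z = (score_matrix - mean) / std
    return np.average(z, axis=1, weights=weights)

# Two detectors that agree on the ranking but use very different scales.
scores = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [10.0, 1000.0]])
combined = standardize_and_average(scores)
raw_mean = scores.mean(axis=1)
```

In raw_mean the second detector dominates purely because of its scale; after standardization both detectors contribute equally to combined.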

decision_scores_¶

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type: numpy array of shape (n_samples,)

threshold_¶

The threshold is based on contamination: it is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged as outliers. The threshold is used to generate binary outlier labels.

Type: float

labels_¶

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type: int, either 0 or 1
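
The relationship between contamination, threshold_ and labels_ can be sketched as follows. The percentile rule and the strict comparison are assumptions made for illustration, and threshold_and_labels is a hypothetical helper, not part of combo.

```python
import numpy as np

def threshold_and_labels(decision_scores, contamination=0.1):
    """Derive a score threshold and binary labels from a contamination rate."""
    decision_scores = np.asarray(decision_scores, dtype=float)
    # Scores above the (1 - contamination) percentile count as outliers.
    threshold = np.percentile(decision_scores, 100 * (1 - contamination))
    labels = (decision_scores > threshold).astype(int)
    return threshold, labels

decision_scores = np.arange(10.0)  # toy scores 0, 1, ..., 9
threshold, labels = threshold_and_labels(decision_scores, contamination=0.1)
```

With ten samples and contamination=0.1, exactly one sample, the one with the highest score, is labelled 1.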

decision_function(X)[source]¶

Predict raw anomaly scores of X using the fitted detector.

The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns: anomaly_scores – The anomaly score of the input samples.
Return type: numpy array of shape (n_samples,)

fit(X, y=None)[source]¶

Fit detector. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels_ – Return the generated labels.

Return type

numpy array of shape (n_samples,)

fit_predict(X, y=None)[source]¶

Fit estimator and predict on X. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict if a particular sample is an outlier or not.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: outlier_labels – For each observation, tells whether or not it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Return type: numpy array of shape (n_samples,)

predict_proba(X, proba_method='linear')[source]¶

Predict the probability of a sample being outlier. Two approaches are possible:

simply use Min-max conversion to linearly transform the outlier scores into the range of [0,1]. The model must be fitted first.
use unifying scores, see [BKKSZ11].

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
proba_method (str, optional (default='linear')) – Probability conversion method. It must be one of ‘linear’ or ‘unify’.

Returns

outlier_labels – The outlier probability of each observation according to the fitted model, ranging in [0,1].

Return type

numpy array of shape (n_samples,)
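
The 'linear' conversion can be sketched as a min-max rescaling against the training scores. This is a minimal illustration under stated assumptions: linear_outlier_probability is a hypothetical helper, the minimum and maximum are taken from the training scores, and out-of-range test scores are clipped.

```python
import numpy as np

def linear_outlier_probability(train_scores, test_scores):
    """Min-max scale test scores into [0, 1] using the training score range."""
    train_scores = np.asarray(train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    lo, hi = train_scores.min(), train_scores.max()
    if hi == lo:
        # Degenerate case: all training scores are equal.
        return np.zeros_like(test_scores)
    prob = (test_scores - lo) / (hi - lo)
    # Test scores outside the training range are clipped to [0, 1].
    return np.clip(prob, 0.0, 1.0)

train_scores = [2.0, 4.0, 6.0, 10.0]
probs = linear_outlier_probability(train_scores, [1.0, 6.0, 12.0])
```

This is why the model must be fitted first: the scaling range comes from the scores seen during fitting.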

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.detector_lscp module¶

Locally Selective Combination of Parallel Outlier Ensembles (LSCP). Implemented on PyOD library (https://github.com/yzhao062/pyod).

class combo.models.detector_lscp.LSCP(base_estimators, local_region_size=30, local_max_features=1.0, n_bins=10, random_state=None, contamination=0.1, pre_fitted=False)[source]¶

Bases: combo.models.base.BaseAggregator

Locally Selective Combination in Parallel Outlier Ensembles

LSCP is an unsupervised parallel outlier detection ensemble which selects competent detectors in the local region of a test instance. This implementation uses an Average of Maximum strategy. First, a heterogeneous list of base detectors is fit to the training data, and then a pseudo ground truth for each training instance is generated by taking the maximum outlier score.

For each test instance:

1) The local region is defined to be the set of nearest training points in randomly sampled feature subspaces which occur more frequently than a defined threshold over multiple iterations.

2) Using the local region, a local pseudo ground truth is defined and the Pearson correlation is calculated between each base detector’s training outlier scores and the pseudo ground truth.

3) A histogram is built out of Pearson correlation scores; detectors in the largest bin are selected as competent base detectors for the given test instance.

4) The average outlier score of the selected competent detectors is taken to be the final score.

See [BZNHL19] for details.
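
Steps 2 to 4 above can be sketched for a single test instance, taking the local region as given. This is a simplified illustration and not the LSCP implementation: lscp_score is a hypothetical helper, and "largest bin" is read here as the most populated histogram bin, which is an assumption.

```python
import numpy as np

def lscp_score(train_scores, test_scores, region, n_bins=3):
    """Average-of-maximum style score for one test instance.

    train_scores: (n_train, n_detectors) training outlier scores.
    test_scores:  (n_detectors,) scores of the test instance.
    region:       indices of the training points in the local region.
    """
    local = np.asarray(train_scores, dtype=float)[region]
    # Step 2: local pseudo ground truth = maximum score across detectors.
    pseudo_truth = local.max(axis=1)
    corr = np.array([np.corrcoef(local[:, j], pseudo_truth)[0, 1]
                     for j in range(local.shape[1])])
    # Step 3: histogram of correlations; keep the most populated bin.
    counts, edges = np.histogram(corr, bins=n_bins)
    bin_of = np.clip(np.searchsorted(edges, corr, side="right") - 1,
                     0, n_bins - 1)
    selected = np.flatnonzero(bin_of == np.argmax(counts))
    # Step 4: average the test scores of the selected detectors.
    return float(np.mean(np.asarray(test_scores)[selected])), selected

train_scores = np.array([[1.0, 1.1, 0.4],
                         [2.0, 2.1, 0.3],
                         [3.0, 2.9, 0.2],
                         [4.0, 4.2, 0.1]])
region = [0, 1, 2, 3]
score, selected = lscp_score(train_scores, [2.0, 4.0, -5.0], region)
```

The third detector ranks the local points in the opposite order to the pseudo ground truth, so it falls into a different bin and is excluded from the final average.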

Parameters

base_estimators (list, length must be greater than 1) – Base unsupervised outlier detectors from PyOD. (Note: requires fit and decision_function methods)
local_region_size (int, optional (default=30)) – Number of training points to consider in each iteration of the local region generation process (30 by default).
local_max_features (float in (0.5, 1.), optional (default=1.0)) – Maximum proportion of number of features to consider when defining the local region (1.0 by default).
n_bins (int, optional (default=10)) – Number of bins to use when selecting the local region.
random_state (RandomState, optional (default=None)) – A random number generator instance to define the state of the random permutations generator.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (0.1 by default).
pre_fitted (bool, optional (default=False)) – Whether the base estimators are trained. If True, fit process may be skipped.

decision_scores_¶

The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores. This value is available once the detector is fitted.

Type: numpy array of shape (n_samples,)

threshold_¶

The threshold is based on contamination: it is set so that the n_samples * contamination most abnormal samples in decision_scores_ are flagged as outliers. The threshold is used to generate binary outlier labels.

Type: float

labels_¶

The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies. It is generated by applying threshold_ on decision_scores_.

Type: int, either 0 or 1

decision_function(X)[source]¶

Predict raw anomaly scores of X using the fitted detector.

The anomaly score of an input sample is computed based on the fitted detector. For consistency, outliers are assigned with higher anomaly scores.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the base estimator.
Returns: anomaly_scores – The anomaly score of the input samples.
Return type: numpy array of shape (n_samples,)

fit(X, y=None)[source]¶

Fit detector. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

fit_predict(X, y=None)[source]¶

Fit estimator and predict on X. y is optional for unsupervised methods.

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (numpy array of shape (n_samples,), optional (default=None)) – The ground truth of the input samples (labels).

Returns

labels – Class labels for each data sample.

Return type

numpy array of shape (n_samples,)

get_params(deep=True)¶

Get parameters for this estimator.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

predict(X)[source]¶

Predict if a particular sample is an outlier or not.

Parameters: X (numpy array of shape (n_samples, n_features)) – The input samples.
Returns: outlier_labels – For each observation, tells whether or not it should be considered as an outlier according to the fitted model. 0 stands for inliers and 1 for outliers.
Return type: numpy array of shape (n_samples,)

predict_proba(X, proba_method='linear')[source]¶

Predict the probability of a sample being an outlier. Two approaches are possible:

1. Use min-max conversion to linearly transform the outlier scores into the range [0, 1]. The model must be fitted first.
2. Use unifying scores; see [BKKSZ11].

Parameters

X (numpy array of shape (n_samples, n_features)) – The input samples.
proba_method (str, optional (default='linear')) – Probability conversion method. It must be one of ‘linear’ or ‘unify’.

Returns

outlier_labels – For each observation, the probability that it is an outlier according to the fitted model, in the range [0, 1].

Return type

numpy array of shape (n_samples,)
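The two conversions can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the function names linear_proba and unify_proba are hypothetical, the training scores are assumed to play the role of decision_scores_, clipping to [0, 1] is an assumption, and the 'unify' variant shown is the Gaussian-scaling form (error function of the standardised score) in the spirit of [BKKSZ11].

```python
import math

import numpy as np


def linear_proba(train_scores, test_scores):
    """Min-max scale test scores using the range of the training scores."""
    lo, hi = train_scores.min(), train_scores.max()
    # Scores outside the training range are clipped to stay in [0, 1].
    return np.clip((test_scores - lo) / (hi - lo), 0.0, 1.0)


def unify_proba(train_scores, test_scores):
    """Gaussian scaling: erf of the standardised score, floored at 0."""
    mu, sigma = train_scores.mean(), train_scores.std()
    z = (test_scores - mu) / (sigma * math.sqrt(2))
    erf = np.vectorize(math.erf)  # element-wise error function
    return np.clip(erf(z), 0.0, 1.0)


train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([0.0, 3.0, 6.0])
p_linear = linear_proba(train, test)
p_unify = unify_proba(train, test)
print(p_linear, p_unify)
```

In the linear variant a score at the training midpoint maps to 0.5, whereas in the Gaussian variant a score at or below the training mean maps to 0, so the two methods are not interchangeable when a fixed probability cut-off is applied.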

set_params(**params)¶

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

See http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html and sklearn/base.py for more information.

Returns: self
Return type: object

combo.models.score_comb module¶

A collection of combination methods for combining raw scores.

combo.models.score_comb.aom(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]¶

Average of Maximum - An ensemble method for combining multiple estimators. See [BAS15] for details.

The estimators are first divided into subgroups, and the maximum score within each subgroup is taken as the subgroup score. The final score is the average of all subgroup scores.

Parameters

scores (numpy array of shape (n_samples, n_estimators)) – The score matrix output by the estimators.
n_buckets (int, optional (default=5)) – The number of subgroups to build.
method (str, optional (default='static')) – One of {‘static’, ‘dynamic’}. If ‘dynamic’, subgroups are built randomly with a dynamic bucket size.
bootstrap_estimators (bool, optional (default=False)) – Whether estimators are drawn with replacement.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns

combined_scores – The combined scores.

Return type

numpy array of shape (n_samples,)
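The max-then-average idea can be sketched in NumPy. This is a simplified illustration, not the function documented above: aom_sketch is a hypothetical name, the shuffle-then-split bucketing is an assumption about what 'static' grouping might look like, and the 'dynamic' method and bootstrap_estimators option are not covered.

```python
import numpy as np


def aom_sketch(scores, n_buckets=5, random_state=None):
    """Hypothetical sketch: average of per-bucket maxima."""
    n_estimators = scores.shape[1]
    if not 1 <= n_buckets <= n_estimators:
        raise ValueError("n_buckets must be between 1 and n_estimators")
    rng = np.random.RandomState(random_state)
    # Assumption: shuffle the estimator indices, then split them into
    # n_buckets groups of (nearly) equal size.
    buckets = np.array_split(rng.permutation(n_estimators), n_buckets)
    # One column per bucket: the maximum score among its estimators.
    bucket_max = np.column_stack(
        [scores[:, idx].max(axis=1) for idx in buckets]
    )
    # Final score: the average of the bucket maxima.
    return bucket_max.mean(axis=1)


scores = np.array([[0.1, 0.9, 0.4, 0.2],
                   [0.7, 0.2, 0.3, 0.6],
                   [0.5, 0.5, 0.8, 0.1]])
combined = aom_sketch(scores, n_buckets=2, random_state=42)
print(combined.shape)
```

Two limiting cases help as a sanity check: with a single bucket the result is the plain row-wise maximum, and with one bucket per estimator it is the plain row-wise average.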

combo.models.score_comb.average(scores, estimator_weights=None)[source]¶

Combination method to merge the scores from multiple estimators by taking the average.

Parameters

scores (numpy array of shape (n_samples, n_estimators)) – Score matrix from multiple estimators on the same samples.
estimator_weights (numpy array of shape (1, n_estimators)) – If specified, a weighted average is used.

Returns

combined_scores – The combined scores.

Return type

numpy array of shape (n_samples, )
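A plain and a weighted average over the estimator axis can be sketched as below. The name average_sketch is hypothetical, and normalising by the sum of the weights is an assumption about how the weighted version might behave.

```python
import numpy as np


def average_sketch(scores, estimator_weights=None):
    """Hypothetical sketch: row-wise (optionally weighted) average."""
    if estimator_weights is None:
        return scores.mean(axis=1)
    # Flatten the (1, n_estimators) weight row so it broadcasts per column.
    w = np.asarray(estimator_weights, dtype=float).reshape(-1)
    # Assumption: weights are normalised by their sum.
    return (scores * w).sum(axis=1) / w.sum()


scores = np.array([[1.0, 3.0],
                   [2.0, 4.0]])
plain = average_sketch(scores)
weighted = average_sketch(scores, estimator_weights=np.array([[3.0, 1.0]]))
print(plain, weighted)
```

Giving the first estimator three times the weight of the second pulls each combined score toward the first column.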

combo.models.score_comb.majority_vote(scores, n_classes=2, weights=None)[source]¶

Combination method to merge the scores from multiple estimators by majority vote.

Parameters

scores (numpy array of shape (n_samples, n_estimators)) – Score matrix from multiple estimators on the same samples.
n_classes (int, optional (default=2)) – The number of classes in the scores matrix.
weights (numpy array of shape (1, n_estimators)) – If specified, a weighted majority vote is used.

Returns

combined_scores – The combined scores.

Return type

numpy array of shape (n_samples, )
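A weighted majority vote can be sketched as follows, assuming the scores matrix holds integer class labels in the range 0 to n_classes - 1. The name majority_vote_sketch is hypothetical, and resolving ties in favour of the lowest class index is an assumption made for this illustration.

```python
import numpy as np


def majority_vote_sketch(labels, n_classes=2, weights=None):
    """Hypothetical sketch: per-sample (optionally weighted) majority vote."""
    labels = np.asarray(labels, dtype=int)
    n_samples, n_estimators = labels.shape
    if weights is None:
        w = np.ones(n_estimators)
    else:
        w = np.asarray(weights, dtype=float).reshape(-1)
    # votes[i, c] is the total weight of estimators predicting class c
    # for sample i.
    votes = np.zeros((n_samples, n_classes))
    for c in range(n_classes):
        votes[:, c] = ((labels == c) * w).sum(axis=1)
    # Assumption: ties go to the lowest class index (argmax behaviour).
    return votes.argmax(axis=1)


labels = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [1, 1, 1]])
unweighted = majority_vote_sketch(labels)
weighted = majority_vote_sketch(labels, weights=np.array([[5.0, 1.0, 1.0]]))
print(unweighted, weighted)
```

With a heavy weight on the first estimator, the first sample flips from class 1 to class 0 because that estimator alone outweighs the other two.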

combo.models.score_comb.maximization(scores)[source]¶

Combination method to merge the scores from multiple estimators by taking the maximum.

Parameters: scores (numpy array of shape (n_samples, n_estimators)) – Score matrix from multiple estimators on the same samples.
Returns: combined_scores – The combined scores.
Return type: numpy array of shape (n_samples, )

combo.models.score_comb.median(scores)[source]¶

Combination method to merge the scores from multiple estimators by taking the median.

Parameters: scores (numpy array of shape (n_samples, n_estimators)) – Score matrix from multiple estimators on the same samples.
Returns: combined_scores – The combined scores.
Return type: numpy array of shape (n_samples, )
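Maximization and median both reduce the score matrix along the estimator axis. The NumPy equivalents below are a sketch of that reduction on a made-up score matrix, not the library's code.

```python
import numpy as np

# Rows are samples, columns are estimators (illustrative values).
scores = np.array([[0.1, 0.9, 0.4],
                   [0.7, 0.2, 0.3]])

# Maximization: the highest score any estimator gave each sample.
max_scores = scores.max(axis=1)

# Median: the middle score per sample, less sensitive to one extreme estimator.
median_scores = np.median(scores, axis=1)

print(max_scores, median_scores)
```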

combo.models.score_comb.moa(scores, n_buckets=5, method='static', bootstrap_estimators=False, random_state=None)[source]¶

Maximization of Average - An ensemble method for combining multiple estimators. See [BAS15] for details.

The estimators are first divided into subgroups, and the average score within each subgroup is taken as the subgroup score. The final score is the maximum of all subgroup scores.

Parameters

scores (numpy array of shape (n_samples, n_estimators)) – The score matrix output by the estimators.
n_buckets (int, optional (default=5)) – The number of subgroups to build.
method (str, optional (default='static')) – One of {‘static’, ‘dynamic’}. If ‘dynamic’, subgroups are built randomly with a dynamic bucket size.
bootstrap_estimators (bool, optional (default=False)) – Whether estimators are drawn with replacement.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns

combined_scores – The combined scores.

Return type

numpy array of shape (n_samples,)
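The average-then-maximum idea can be sketched in NumPy. As a simplified illustration rather than the function documented above, moa_sketch is a hypothetical name, the shuffle-then-split bucketing is an assumption about 'static' grouping, and the 'dynamic' method and bootstrap_estimators option are left out.

```python
import numpy as np


def moa_sketch(scores, n_buckets=5, random_state=None):
    """Hypothetical sketch: maximum of per-bucket averages."""
    n_estimators = scores.shape[1]
    if not 1 <= n_buckets <= n_estimators:
        raise ValueError("n_buckets must be between 1 and n_estimators")
    rng = np.random.RandomState(random_state)
    # Assumption: shuffle the estimator indices, then split them into
    # n_buckets groups of (nearly) equal size.
    buckets = np.array_split(rng.permutation(n_estimators), n_buckets)
    # One column per bucket: the average score among its estimators.
    bucket_mean = np.column_stack(
        [scores[:, idx].mean(axis=1) for idx in buckets]
    )
    # Final score: the largest bucket average.
    return bucket_mean.max(axis=1)


scores = np.array([[0.1, 0.9, 0.4, 0.2],
                   [0.7, 0.2, 0.3, 0.6],
                   [0.5, 0.5, 0.8, 0.1]])
combined = moa_sketch(scores, n_buckets=2, random_state=42)
print(combined.shape)
```

The limiting cases are the mirror image of the average-of-maximum scheme: a single bucket gives the plain row-wise average, and one bucket per estimator gives the plain row-wise maximum.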

Module contents¶

References

BAS15: Charu C. Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24–47, 2015.
BFJ05: Ana L. N. Fred and Anil K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850, 2005.
BHHS94: Tin Kam Ho, Jonathan J. Hull, and Sargur N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 66–75, 1994.
BKSBJ08: Albert H. R. Ko, Robert Sabourin, and Alceu Souza Britto Jr. From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition, 41(5):1718–1731, 2008.
BKKSZ11: Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the 2011 SIAM International Conference on Data Mining, 13–24. SIAM, 2011.
BWKB97: Kevin Woods, W. Philip Kegelmeyer, and Kevin Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.
BZNHL19: Yue Zhao, Zain Nasrullah, Maciej K. Hryniewicki, and Zheng Li. LSCP: locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, 585–593. Calgary, Canada, May 2019. SIAM. URL: https://doi.org/10.1137/1.9781611975673.66, doi:10.1137/1.9781611975673.66.
BZT06: Zhi-Hua Zhou and Wei Tang. Clusterer ensemble. Knowledge-Based Systems, 19(1):77–83, 2006.