General library requirements (Release 1.0):
- Dataframe of features (text values may be one-hot encoded; see the sketch below)
- Class labels in np.ndarray or pd.Series with shape (n,1)
- Binary classification (not multiclass or multilabel)
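For example, a minimal sketch of one-hot encoding text-valued columns ahead of time; raw_df and the column name here are hypothetical:
import pandas as pd
# encode a text-valued column into 0/1 indicator columns (illustrative column name)
features_train = pd.get_dummies(raw_df, columns=['protocol_type'])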
Workflow: Correlation-based feature filtering has four steps: preprocessing, discretization, calculating correlations, and feature reduction.
Here the first two steps are implemented in the Discretizer class,
and the last two steps in the qlcfFilter class.
Both work scikit-learn style (instantiate, fit, transform)
and can be used in a pipeline.
# import the local library:
# add the parent folder (where the lib folder is) to sys.path
import sys
if ".." not in sys.path:
    sys.path.insert(0, '..')
from QLCFF import Discretizer, qlcfFilter
dzdf = Discretizer().fit_transform(features_train, labels_train)
fltrs = ['FDR', 'FWE', 'FCBF-PC']
ffdf = qlcfFilter().fit_transform(dzdf, labels_train,
fltrs, features_train)
Examples are in QLCF_demo.py and QLCF_demo.ipynb
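The same workflow can also be run step by step; a sketch using the names from the example above:
dtzr = Discretizer()                                  # steps 1-2: preprocess and discretize
dtzr.fit(features_train, labels_train)
dzdf = dtzr.transform()
ffltr = qlcfFilter()                                  # steps 3-4: correlations and feature reduction
ffltr.fit(dzdf, labels_train, ['FDR', 'FWE', 'FCBF-PC'])
reduced_df = ffltr.transform(features_train)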
-
dtzr = Discretizer(numjobs=-2, msglvl=5)   # Initialise
- Requires : none
- Optional : joblib Parallel(n_jobs=, verbose=)
-
dtzr.fit(X, y)   # Calls the preprocessor
- Requires : features as pd.DataFrame, labels as array-like
- Optional : none
- X : preprocessor (sketched below)
  - selects only columns with dtype np.number or pandas/numpy boolean
  - normalizes all columns with signed dtypes to positive numbers
  - normalizes all columns with boolean dtypes to zero/one
- y : text labels are converted with sklearn LabelEncoder()
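For illustration, a rough sketch of the preprocessing described above; this is one plausible reading of those steps (simplified to numpy bool columns), not the library's actual code:
import numpy as np
from sklearn.preprocessing import LabelEncoder
prebin = X.select_dtypes(include=[np.number, bool]).copy()   # keep numeric and boolean columns only
for col in prebin.columns:
    if prebin[col].dtype == bool:
        prebin[col] = prebin[col].astype(int)                # booleans -> 0/1
    elif prebin[col].min() < 0:
        prebin[col] = prebin[col] - prebin[col].min()        # shift signed columns to non-negative
y_enc = LabelEncoder().fit_transform(y)                      # text labels -> integer classes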
-
After fit(), the preprocessed dataframe is an attribute:
dtzr.prebin_df.head()
_ = dtzr.transform(mkbins='hgrm', detail=False)
- Returns : discretized df
- Requires : none
- Optional : mkbins= binning strategy, default or one of
    'unif-ten' 'unif-log' 'unif-sqrt'
    'mdlp-ten' 'mdlp-log' 'mdlp-sqrt'
    'chim-ten' 'chim-log' 'chim-sqrt'
- Optional : detail= (boolean) print binning report
-
The default value mkbins='hgrm' applies numpy.histogram(feature, bins='auto'), and repeatedly folds lower bins into the next higher one until there are at most 12 bins for the feature.
Otherwise, the valid values combine an algorithm for calculating the bin edges (cutpoints) with a method for determining the maximum number of bins:
    calculate edges                        number of bins
    unif: uniform [numpy.linspace()]       ten:  always ten [3,4]
    mdlp: MDLP algorithm [1]               sqrt: sqrt(len(feature)) [5]
    chim: ChiMerge algorithm [2]           log:  log10(len(feature)) [3]
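For example, to use one of the non-default strategies and print the binning report (any of the nine values above can be substituted):
dzdf = dtzr.transform(mkbins='unif-log', detail=True)   # uniform edges, log10(len(feature)) bins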
-
-
After transform():
- the processed dataframe is an attribute
  dtzr.binned_df.head()
- the dict of bin edges is an attribute
  dtzr.cutpoints
- note: distribution of values within bins
  numpy.bincount(dtzr.binned_df['num_compromised'].values)
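A small sketch for inspecting the results, assuming the cutpoints dict is keyed by column name as described above:
for col, edges in dtzr.cutpoints.items():
    print(col, edges)                     # bin edges (cutpoints) for each discretized feature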
-
ffltr = qlcfFilter()   # Initialise
- Requires : none
- Optional : none
-
ffltr.fit(X, y, filters, plvl=0.5, minpc=0.035, minsu=0.0025, hipc=0.82, hisu=0.7)
- Requires : discretizer.binned_df, labels as array-like, list of one or more filters
- Optional : varies depending on the filters selected (see below)
-
A list with one or more of
'Floor', 'FDR', 'FWE', 'FCBF-SU', 'FCBF-PC'
The list is processed in order with progressive filtering.

'Floor': filters on the basis that low correlation with the target labels (f2y) means low utility for distinguishing class membership. Keeps features that have correlation > a threshold (the defaults were selected through experimentation).
- Optional :
  minpc: threshold for pearson correlation
  minsu: threshold for symmetric uncertainty

'FDR', 'FWE': sklearn univariate chi-square tests; select features to keep based on an upper bound on the expected false discovery rate. 'FWE' will select more to drop than 'FDR', and lower thresholds will also select more to drop. The 'Floor' filter will select all from either univariate test, and more.
- Optional :
  plvl: chi-square threshold (alpha); standard values are 0.01, 0.05, 0.1

'FCBF-SU', 'FCBF-PC': FCBF-style, filter on feature-to-feature (f2f) correlations. Given a group of features with high cross-correlations, keep the one with the highest f2y correlation as a proxy for the others (the FCBF paper [6] calls this the "dominant feature"). The standard threshold for multicollinearity is > 0.7; the defaults were selected through experimentation.
- Optional :
  hipc: threshold for "high" f2f pearson correlation
  hisu: threshold for "high" f2f symmetric uncertainty

To create layered feature selection filters, apply either 'Floor' or 'FDR'/'FWE' before 'FCBF-SU' and/or 'FCBF-PC', as in the sketch below.
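For instance, a minimal sketch of a layered fit; the threshold values here are only illustrative:
fltrs = ['Floor', 'FCBF-PC']                 # f2y floor filter first, then f2f FCBF-style filter
ffltr = qlcfFilter()
ffltr.fit(dtzr.binned_df, labels_train, fltrs, minpc=0.05, hipc=0.7)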
-
-
After fit():
- the consolidated drop list is an attribute
  ffltr.QLCFFilter
- reporting methods are available:
  ffltr.get_f2y_report(kd='drop')
      prints feature-to-label (f2y) correlations
  fyd = ffltr.get_f2y_dict(kd='drop')
      returns a dict of correlations for each filter
  - Optional : kd = 'keep' or 'drop'
  ffltr.get_f2f_report()
      prints feature-to-feature (f2f) correlations above the threshold
  ffd = ffltr.get_f2f_dict()
      returns a dict of f2f correlations checked by each filter
  - f2f reporting is only available for 'FCBF-SU' or 'FCBF-PC'
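A short usage sketch, reusing the names from the earlier examples:
print(ffltr.QLCFFilter)                      # consolidated list of features to drop
ffltr.get_f2y_report(kd='drop')              # print f2y correlations for the dropped features
ffd = ffltr.get_f2f_dict()                   # f2f correlations; populated only for FCBF-SU / FCBF-PC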
-
reduced_df = ffltr.transform(Xdf)
- Returns : Xdf after applying the consolidated drop list
- Requires : the actual pd.DataFrame to be used with clf.fit_predict()
- Optional : none
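For example, a sketch of reducing train and test features and fitting a classifier; RandomForestClassifier and features_test are assumptions for illustration, any estimator works:
from sklearn.ensemble import RandomForestClassifier
reduced_train = ffltr.transform(features_train)       # apply the consolidated drop list
reduced_test = ffltr.transform(features_test)         # assumes a test dataframe with the same columns
clf = RandomForestClassifier().fit(reduced_train, labels_train)
preds = clf.predict(reduced_test)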
Examples are in QLCF_demo.py and QLCF_demo.ipynb
[1] Fayyad, U. M., and Irani, K. B. (1993). "Multi-interval discretization of continuous-valued attributes for classification learning", Proc. 13th Int. Joint Conference on Artificial Intelligence, pp. 1022-1027
[2] Kerber, R. (1992). "ChiMerge: Discretization of numeric attributes", Proc. 10th National Conference on Artificial Intelligence (AAAI'92), pp. 123-128
[3] Dougherty J., Kohavi, R., and Sahami, M. (1995), “Supervised and unsupervised discretization of continuous features”, Proc. ICML 1995, pp. 194–202
[4] Yang, Y. and Webb, G. I. (2002), “A comparative study of discretization methods for naive-bayes classifiers”, Proc. PKAW 2002, pp. 159-173
[5] Yang, Y. and Webb, G. I. (2001), “Proportional k-interval discretization for naive-bayes classifiers”, in Machine learning: ECML 2001, pp. 564–575
[6] Lei Yu and Huan Liu (2003), "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Proc. 20th ICML 2003, pp. 856-863