Package 'stabm'

Title: Stability Measures for Feature Selection
Description: An implementation of many measures for the assessment of the stability of feature selection. Both simple measures and measures which take into account the similarities between features are available, see Bommert (2020) <doi:10.17877/DE290R-21906>.
Authors: Andrea Bommert [aut, cre], Michel Lang [aut]
Maintainer: Andrea Bommert <[email protected]>
License: LGPL-3
Version: 1.2.2
Built: 2024-08-27 03:21:35 UTC
Source: https://github.com/bommert/stabm

Help Index


stabm: Stability Measures for Feature Selection

Description

An implementation of many measures for the assessment of the stability of feature selection. Both simple measures and measures which take into account the similarities between features are available, see Bommert (2020) doi:10.17877/DE290R-21906.

Author(s)

Maintainer: Andrea Bommert <[email protected]>

Authors:

Michel Lang [aut]

See Also

Useful links:

https://github.com/bommert/stabm


List All Available Stability Measures

Description

Lists all stability measures of package stabm and provides information about them.

Usage

listStabilityMeasures()

Value

data.frame
For each stability measure, the data.frame states its name, whether it is corrected for chance by definition, whether it is adjusted for similar features, its minimal value and its maximal value.

Note

The given minimal values might only be reachable in some scenarios, e.g. if the feature sets have a certain size. The measures which are not corrected for chance by definition can be corrected for chance with correction.for.chance. This however changes the minimal value. For the adjusted stability measures, the minimal value depends on the similarity structure.

Examples

listStabilityMeasures()
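
# The returned overview is a regular data.frame and can be subset as usual,
# e.g. to show only the measures that are corrected for chance by definition
# (a sketch; the column names "Name" and "Corrected" are assumptions):
measures = listStabilityMeasures()
measures[measures$Corrected, "Name"]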

Plot Selected Features

Description

Creates a heatmap of the features which are selected in at least one feature set. The sets are ordered according to average linkage hierarchical clustering based on the Manhattan distance. If sim.mat is given, the features are ordered according to average linkage hierarchical clustering based on 1 - sim.mat. Otherwise, the features are ordered in the same way as the feature sets.

Note that this function needs the packages ggplot2, cowplot and ggdendro installed.

Usage

plotFeatures(features, sim.mat = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

Value

Object of class ggplot.

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
plotFeatures(features = feats)
plotFeatures(features = feats, sim.mat = mat)
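
# As described for the sim.mat argument, feature names can be used instead of
# indices if sim.mat carries matching dimnames (a minimal sketch):
rownames(mat) = colnames(mat) = paste0("X", 1:10)
named.feats = lapply(feats, function(idx) paste0("X", idx))
plotFeatures(features = named.feats, sim.mat = mat)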

Stability Measure Davis

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.
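
For illustration, such feature sets could be obtained by repeatedly subsampling a dataset and applying a simple feature selection method to each subsample. The following sketch uses simulated data and a correlation-based selector; both are illustrative assumptions, not part of the package.

set.seed(1)
data = as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
data$y = data$V1 + data$V2 + rnorm(100)
selected = lapply(1:5, function(i) {
  idx = sample(nrow(data), replace = TRUE)       # draw a bootstrap sample
  cors = abs(cor(data[idx, 1:10], data$y[idx]))  # score features by absolute correlation
  order(cors, decreasing = TRUE)[1:3]            # keep the 3 highest-scoring features
})
stabilityDavis(features = selected, p = 10)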

Usage

stabilityDavis(
  features,
  p,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL,
  penalty = 0
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

penalty

numeric(1)
Penalty parameter, see Details.

Details

The stability measure is defined as (see Notation)

\max \left\{ 0, \frac{1}{|V|} \sum_{j=1}^p \frac{h_j}{m} - \frac{\mathrm{penalty}}{p} \cdot \mathop{\mathrm{median}} \{ |V_1|, \ldots, |V_m| \} \right\}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Davis CA, Gerick F, Hintermair V, Friedel CC, Fundel K, Kuffner R, Zimmer R (2006). “Reliable gene signatures for microarray classification: assessment of stability and performance.” Bioinformatics, 22(19), 2356–2363. doi:10.1093/bioinformatics/btl400.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityDavis(features = feats, p = 10)
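
# Manual evaluation of the formula from the Details section (a sketch for the
# example above, with penalty = 0 so that the second term vanishes); it should
# agree with the value returned by stabilityDavis():
V = unique(unlist(feats))                        # union of all selected features
h = sapply(V, function(f) sum(sapply(feats, function(s) f %in% s)))
max(0, mean(h / length(feats)))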

Stability Measure Dice

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityDice(
  features,
  p = NULL,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{2 |V_i \cap V_j|}{|V_i| + |V_j|}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Dice LR (1945). “Measures of the Amount of Ecologic Association Between Species.” Ecology, 26(3), 297–302. doi:10.2307/1932409.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityDice(features = feats)
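
# Manual evaluation of the pairwise Dice scores from the Details section
# (a sketch for the example above):
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  2 * length(intersect(a, b)) / (length(a) + length(b))
})
mean(scores)                                     # average over all pairs of feature sets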

Stability Measure Hamming

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityHamming(
  features,
  p,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| + |V_i^c \cap V_j^c|}{p}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Dunne K, Cunningham P, Azuaje F (2002). “Solutions to instability problems with sequential wrapper-based approaches to feature selection.” Machine Learning Group, Department of Computer Science, Trinity College, Dublin.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityHamming(features = feats, p = 10)
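
# Manual evaluation of the formula from the Details section (a sketch for the
# example above, with p = 10):
p = 10
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  (length(intersect(a, b)) + length(intersect(setdiff(1:p, a), setdiff(1:p, b)))) / p
})
mean(scores)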

Stability Measure Adjusted Intersection Count

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityIntersectionCount(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered as similar, if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features and numbers of considered datasets (length(features)). For "none", the transformation (score - expected) / (maximum - expected) is not conducted, i.e. only the score is used. This is not recommended.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\sqrt{|V_i| \cdot |V_j|} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \min (C(V_i, V_j), C(V_j, V_i))

and

C(V_k, V_l) = |\{x \in V_k \setminus V_l : \exists y \in V_l \setminus V_k \text{ with } \mathrm{Similarity}(x, y) \geq \mathrm{threshold}\}|.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Bommert A, Rahnenführer J (2020). “Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features.” In Machine Learning, Optimization, and Data Science, 203–214. doi:10.1007/978-3-030-64583-0_19.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityIntersectionCount(features = feats, sim.mat = mat, N = 1000)
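
# The threshold controls which feature pairs count as similar; lowering it may
# change the adjusted stability value (a usage sketch):
stabilityIntersectionCount(features = feats, sim.mat = mat, threshold = 0.8, N = 1000)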

Stability Measure Adjusted Intersection Greedy

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityIntersectionGreedy(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered as similar, if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features and numbers of considered datasets (length(features)). For "none", the transformation (score - expected) / (maximum - expected) is not conducted, i.e. only the score is used. This is not recommended.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\sqrt{|V_i| \cdot |V_j|} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \mathop{\mathrm{GMBM}}(V_i \setminus V_j, V_j \setminus V_i).

\mathop{\mathrm{GMBM}}(V_i \setminus V_j, V_j \setminus V_i) denotes a greedy approximation of \mathop{\mathrm{MBM}}(V_i \setminus V_j, V_j \setminus V_i), see stabilityIntersectionMBM.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Bommert A, Rahnenführer J (2020). “Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features.” In Machine Learning, Optimization, and Data Science, 203–214. doi:10.1007/978-3-030-64583-0_19.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityIntersectionGreedy(features = feats, sim.mat = mat, N = 1000)

Stability Measure Adjusted Intersection MBM

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityIntersectionMBM(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered as similar, if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features and numbers of considered datasets (length(features)). For "none", the transformation (score - expected) / (maximum - expected) is not conducted, i.e. only the score is used. This is not recommended.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\sqrt{|V_i| \cdot |V_j|} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \mathop{\mathrm{MBM}}(V_i \setminus V_j, V_j \setminus V_i).

\mathop{\mathrm{MBM}}(V_i \setminus V_j, V_j \setminus V_i) denotes the size of the maximum bipartite matching based on the graph whose vertices are the features of V_i \setminus V_j on the one side and the features of V_j \setminus V_i on the other side. Vertices x and y are connected if and only if \mathrm{Similarity}(x, y) \geq \mathrm{threshold}. Requires the package igraph.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Bommert A, Rahnenführer J (2020). “Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features.” In Machine Learning, Optimization, and Data Science, 203–214. doi:10.1007/978-3-030-64583-0_19.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityIntersectionMBM(features = feats, sim.mat = mat, N = 1000)
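
# As noted in the Details section, this measure requires the igraph package;
# a guarded call (a sketch):
if (requireNamespace("igraph", quietly = TRUE)) {
  stabilityIntersectionMBM(features = feats, sim.mat = mat, N = 1000)
}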

Stability Measure Adjusted Intersection Mean

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityIntersectionMean(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered as similar, if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features and numbers of considered datasets (length(features)). For "none", the transformation (score - expected) / (maximum - expected) is not conducted, i.e. only the score is used. This is not recommended.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\sqrt{|V_i| \cdot |V_j|} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \min (C(V_i, V_j), C(V_j, V_i)),

C(V_k, V_l) = \sum_{x \in V_k \setminus V_l : |G^{kl}_x| > 0} \frac{1}{|G^{kl}_x|} \sum_{y \in G^{kl}_x} \mathrm{Similarity}(x, y)

and

G^{kl}_x = \{y \in V_l \setminus V_k : \mathrm{Similarity}(x, y) \geq \mathrm{threshold}\}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Bommert A, Rahnenführer J (2020). “Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features.” In Machine Learning, Optimization, and Data Science, 203–214. doi:10.1007/978-3-030-64583-0_19.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityIntersectionMean(features = feats, sim.mat = mat, N = 1000)

Stability Measure Jaccard

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityJaccard(
  features,
  p = NULL,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j|}{|V_i \cup V_j|}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Jaccard P (1901). “Étude comparative de la distribution florale dans une portion des Alpes et du Jura.” Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579. doi:10.5169/SEALS-266450.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityJaccard(features = feats)
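
# Manual evaluation of the pairwise Jaccard scores from the Details section
# (a sketch for the example above):
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  length(intersect(a, b)) / length(union(a, b))
})
mean(scores)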

Stability Measure Kappa

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityKappa(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as the average kappa coefficient between all pairs of feature sets. It can be rewritten as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\frac{|V_i| + |V_j|}{2} - \frac{|V_i| \cdot |V_j|}{p}}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Carletta J (1996). “Assessing Agreement on Classification Tasks: The Kappa Statistic.” Computational Linguistics, 22(2), 249–254.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityKappa(features = feats, p = 10)
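
# Manual evaluation of the pairwise kappa coefficients from the Details section
# (a sketch for the example above, with p = 10):
p = 10
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  expected = length(a) * length(b) / p
  (length(intersect(a, b)) - expected) / ((length(a) + length(b)) / 2 - expected)
})
mean(scores)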

Stability Measure Lustgarten

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityLustgarten(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\min \{|V_i|, |V_j|\} - \max \{0, |V_i| + |V_j| - p\}}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Lustgarten JL, Gopalakrishnan V, Visweswaran S (2009). “Measuring stability of feature selection in biomedical datasets.” In AMIA Annual Symposium Proceedings, volume 2009, 406. American Medical Informatics Association.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityLustgarten(features = feats, p = 10)
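
# Manual evaluation of the formula from the Details section (a sketch for the
# example above, with p = 10):
p = 10
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  expected = length(a) * length(b) / p
  (length(intersect(a, b)) - expected) /
    (min(length(a), length(b)) - max(0, length(a) + length(b) - p))
})
mean(scores)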

Stability Measure Nogueira

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityNogueira(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

1 - \frac{\frac{1}{p} \sum_{j=1}^p \frac{m}{m-1} \frac{h_j}{m} \left(1 - \frac{h_j}{m}\right)}{\frac{q}{mp} \left(1 - \frac{q}{mp}\right)}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Nogueira S, Sechidis K, Brown G (2018). “On the Stability of Feature Selection Algorithms.” Journal of Machine Learning Research, 18(174), 1–54. https://jmlr.org/papers/v18/17-514.html.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityNogueira(features = feats, p = 10)
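
# Manual evaluation of the formula from the Details section (a sketch for the
# example above, with p = 10):
p = 10
m = length(feats)
h = sapply(1:p, function(f) sum(sapply(feats, function(s) f %in% s)))
q = sum(h)
1 - mean(m / (m - 1) * (h / m) * (1 - h / m)) / (q / (m * p) * (1 - q / (m * p)))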

Stability Measure Novovičová

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityNovovicova(
  features,
  p = NULL,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{1}{q \log_2(m)} \sum_{j: X_j \in V} h_j \log_2(h_j).

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Novovičová J, Somol P, Pudil P (2009). “A New Measure of Feature Selection Algorithms' Stability.” In 2009 IEEE International Conference on Data Mining Workshops. doi:10.1109/icdmw.2009.32.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityNovovicova(features = feats)
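
# Manual evaluation of the formula from the Details section (a sketch for the
# example above):
h = table(unlist(feats))          # selection frequencies of the chosen features
q = sum(h)
m = length(feats)
sum(h * log2(h)) / (q * log2(m))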

Stability Measure Ochiai

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityOchiai(
  features,
  p = NULL,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets. Required, if correction.for.chance is set to "estimate" or "exact".

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)).

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j|}{\sqrt{|V_i| \cdot |V_j|}}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Ochiai A (1957). “Zoogeographical Studies on the Soleoid Fishes Found in Japan and its Neighbouring Regions-III.” Nippon Suisan Gakkaishi, 22(9), 531-535. doi:10.2331/suisan.22.531.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityOchiai(features = feats)
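
# Manual evaluation of the pairwise Ochiai scores from the Details section
# (a sketch for the example above):
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  a = feats[[ij[1]]]; b = feats[[ij[2]]]
  length(intersect(a, b)) / sqrt(length(a) * length(b))
})
mean(scores)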

Stability Measure Phi

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityPhi(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.

Details

The stability measure is defined as the average phi coefficient between all pairs of feature sets. It can be rewritten as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\sqrt{|V_i| \left(1 - \frac{|V_i|}{p}\right) \cdot |V_j| \left(1 - \frac{|V_j|}{p}\right)}}.

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Nogueira S, Brown G (2016). “Measuring the Stability of Feature Selection.” In Machine Learning and Knowledge Discovery in Databases, 442–457. Springer International Publishing. doi:10.1007/978-3-319-46227-1_28.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityPhi(features = feats, p = 10)

Stability Measure Sechidis

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilitySechidis(features, sim.mat, threshold = 0.9, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered similar if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.

Details

The stability measure is defined as

1 - \frac{\mathop{\mathrm{trace}}(CS)}{\mathop{\mathrm{trace}}(C \Sigma)}

with (p \times p)-matrices

(S)_{ij} = \frac{m}{m-1}\left(\frac{h_{ij}}{m} - \frac{h_i}{m} \frac{h_j}{m}\right)

and

(\Sigma)_{ii} = \frac{q}{mp} \left(1 - \frac{q}{mp}\right),

(\Sigma)_{ij} = \frac{\frac{1}{m} \sum_{i=1}^{m} |V_i|^2 - \frac{q}{m}}{p^2 - p} - \frac{q^2}{m^2 p^2}, \quad i \neq j.

The matrix C is created from the matrix sim.mat by setting all values of sim.mat that are smaller than threshold to 0. If you want C to be equal to sim.mat, use threshold = 0.
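
As an illustration, the matrices S, Sigma and C can be assembled directly in base R from the example feature sets and similarity matrix. This is only a minimal sketch of the formulas above; it is expected to agree with stabilitySechidis(features = feats, sim.mat = mat) with the default threshold.

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
threshold = 0.9
p = nrow(mat)
m = length(feats)
Z = t(sapply(feats, function(v) as.integer(seq_len(p) %in% v)))  # selection matrix
h = colSums(Z)      # h_j
H = crossprod(Z)    # h_{ij}
q = sum(h)
S = m / (m - 1) * (H / m - outer(h, h) / m^2)
Sigma = matrix((mean(lengths(feats)^2) - q / m) / (p^2 - p) - q^2 / (m^2 * p^2), p, p)
diag(Sigma) = q / (m * p) * (1 - q / (m * p))
C = mat * (mat >= threshold)  # entries below the threshold are set to 0
1 - sum(diag(C %*% S)) / sum(diag(C %*% Sigma))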

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

Note

This stability measure is not corrected for chance. Unlike the other stability measures in this R package that are not corrected for chance, stabilitySechidis cannot be corrected for chance with correction.for.chance. This is because no finite upper bound is currently known for stabilitySechidis, see listStabilityMeasures.

References

Sechidis K, Papangelou K, Nogueira S, Weatherall J, Brown G (2020). “On the Stability of Feature Selection in the Presence of Feature Correlations.” In Machine Learning and Knowledge Discovery in Databases, 327–342. Springer International Publishing. doi:10.1007/978-3-030-46150-8_20.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilitySechidis(features = feats, sim.mat = mat)

Stability Measure Somol

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilitySomol(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{\left(\sum\limits_{j=1}^p \frac{h_j}{q} \frac{h_j - 1}{m-1}\right) - c_{\min}}{c_{\max} - c_{\min}}

with

c_{\min} = \frac{q^2 - p(q - q \bmod p) - (q \bmod p)^2}{p q (m-1)},

c_{\max} = \frac{(q \bmod m)^2 + q(m-1) - (q \bmod m) m}{q(m-1)}.
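
As an illustration, the measure can be computed directly from the selection frequencies h_j in base R. This is only a minimal sketch; it is expected to agree with stabilitySomol(features = feats, p = 10).

feats = list(1:3, 1:4, 1:5)
p = 10
m = length(feats)
h = tabulate(unlist(feats), nbins = p)  # h_j: selection frequency of each feature
q = sum(h)
raw = sum(h / q * (h - 1) / (m - 1))
c.min = (q^2 - p * (q - q %% p) - (q %% p)^2) / (p * q * (m - 1))
c.max = ((q %% m)^2 + q * (m - 1) - (q %% m) * m) / (q * (m - 1))
(raw - c.min) / (c.max - c.min)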

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Somol P, Novovičová J (2010). “Evaluating Stability and Comparing Output of Feature Selectors that Optimize Feature Subset Cardinality.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11), 1921–1939. doi:10.1109/tpami.2010.34.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilitySomol(features = feats, p = 10)

Stability Measure Unadjusted

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityUnadjusted(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\sqrt{|V_i| \cdot |V_j|} - \frac{|V_i| \cdot |V_j|}{p}}.

This is what stabilityIntersectionMBM, stabilityIntersectionGreedy, stabilityIntersectionCount and stabilityIntersectionMean become when there are no similar features.
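
As an illustration, the pairwise scores of this formula can be computed directly in base R. This is only a minimal sketch; it is expected to agree with stabilityUnadjusted(features = feats, p = 10).

feats = list(1:3, 1:4, 1:5)
p = 10
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  Vi = feats[[ij[1]]]; Vj = feats[[ij[2]]]
  ni = length(Vi); nj = length(Vj)
  expected = ni * nj / p
  (length(intersect(Vi, Vj)) - expected) / (sqrt(ni * nj) - expected)
})
mean(scores)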

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Bommert A, Rahnenführer J (2020). “Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features.” In Machine Learning, Optimization, and Data Science, 203–214. doi:10.1007/978-3-030-64583-0_19.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityUnadjusted(features = feats, p = 10)

Stability Measure Wald

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityWald(features, p, impute.na = NULL)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

p

numeric(1)
Total number of features in the datasets.

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.

Details

The stability measure is defined as (see Notation)

\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\min \{|V_i|, |V_j|\} - \frac{|V_i| \cdot |V_j|}{p}}.
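
As an illustration, this measure differs from the unadjusted one only in the denominator; a minimal base R sketch is given below and is expected to agree with stabilityWald(features = feats, p = 10).

feats = list(1:3, 1:4, 1:5)
p = 10
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  Vi = feats[[ij[1]]]; Vj = feats[[ij[2]]]
  ni = length(Vi); nj = length(Vj)
  expected = ni * nj / p
  (length(intersect(Vi, Vj)) - expected) / (min(ni, nj) - expected)
})
mean(scores)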

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Wald R, Khoshgoftaar TM, Napolitano A (2013). “Stability of Filter- and Wrapper-Based Feature Subset Selection.” In 2013 IEEE 25th International Conference on Tools with Artificial Intelligence. doi:10.1109/ictai.2013.63.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
stabilityWald(features = feats, p = 10)

Stability Measure Yu

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityYu(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered similar if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features and numbers of considered datasets (length(features)). For "none", the transformation (score - expected) / (maximum - expected) is not conducted, i.e. only score is used. This is not recommended.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.

Details

Let O_{ij} denote the number of features in V_i that are not shared with V_j but that have a highly similar feature in V_j:

O_{ij} = |\{ x \in (V_i \setminus V_j) : \exists \, y \in (V_j \setminus V_i) \ \mathrm{with} \ \mathrm{Similarity}(x,y) \geq \mathrm{threshold} \}|.

Then the stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\frac{|V_i| + |V_j|}{2} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \frac{O_{ij} + O_{ji}}{2}.

Note that this definition differs slightly from the original one in order to make it suitable for arbitrary datasets and similarity measures and applicable in situations with |V_i| \neq |V_j|.
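
As an illustration, the building blocks O_{ij} and I(V_i, V_j) can be computed for one pair of feature sets. The sets below are chosen purely for illustration so that similar, non-shared features actually occur, and the helper O mirrors the definition above. Estimating E(I(V_i, V_j)) from random feature sets of the same sizes, as correction.for.chance = "estimate" does, is not shown here.

mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
threshold = 0.9
Vi = c(1, 2, 5); Vj = c(2, 3, 6)
O = function(A, B) {
  # features in A \ B with at least one highly similar feature in B \ A
  onlyA = setdiff(A, B); onlyB = setdiff(B, A)
  if (length(onlyA) == 0 || length(onlyB) == 0) return(0)
  sum(apply(mat[onlyA, onlyB, drop = FALSE] >= threshold, 1, any))
}
I.ij = length(intersect(Vi, Vj)) + (O(Vi, Vj) + O(Vj, Vi)) / 2
I.ij  # one shared feature plus two similar, non-shared features counted with weight 1/2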

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Yu L, Han Y, Berens ME (2012). “Stable Gene Selection from Microarray Data via Sample Weighting.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 262–272. doi:10.1109/tcbb.2011.47.

Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z (2009). “Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.” Bioinformatics, 25(13), 1662–1668. doi:10.1093/bioinformatics/btp295.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityYu(features = feats, sim.mat = mat, N = 1000)

Stability Measure Zucknick

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.

Usage

stabilityZucknick(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "none",
  N = 10000,
  impute.na = NULL
)

Arguments

features

list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).

sim.mat

numeric matrix
Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range of [0, 1] where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of features are integerish vectors, then the feature numbering must correspond to the ordering of sim.mat. If the list elements of features are character vectors, then sim.mat must be named and the names of sim.mat must correspond to the entries in features.

threshold

numeric(1)
Threshold for indicating which features are similar and which are not. Two features are considered similar if and only if the corresponding entry of sim.mat is greater than or equal to threshold.

correction.for.chance

character(1)
Should a correction for chance be applied? Correction for chance means that if features are chosen at random, the expected value must be independent of the number of chosen features. To correct for chance, the original score is transformed by (score - expected) / (maximum - expected). For stability measures whose score is the average value of pairwise scores, this transformation is done for all components individually. Options are "none", "estimate" and "exact". For "none", no correction is performed, i.e. the original score is used. For "estimate", N random feature sets of the same sizes as the input feature sets (features) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. Computation is only feasible for very small numbers of features (p) and numbers of considered datasets (length(features)). A small sketch of the "estimate" idea is given after this argument list.

N

numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance is set to "estimate".

impute.na

numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. For example, if some of the feature sets are empty, the respective pairwise comparisons yield NA. Which value should be used to impute these missing values? NULL means no imputation.
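
The following minimal sketch illustrates the "estimate" idea for a single pair of feature sets. For simplicity it uses the plain Jaccard score |V_i \cap V_j| / |V_i \cup V_j| as a stand-in raw score whose maximum is 1; the package applies the same scheme of random feature sets and the transformation (score - expected) / (maximum - expected) to its own pairwise scores.

set.seed(1)
p = 10
Vi = 1:3; Vj = 1:4
jaccard = function(A, B) length(intersect(A, B)) / length(union(A, B))
score = jaccard(Vi, Vj)
N = 1000  # number of random feature sets
expected = mean(replicate(N,
  jaccard(sample(p, length(Vi)), sample(p, length(Vj)))))
(score - expected) / (1 - expected)  # corrected pairwise score, maximum taken as 1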

Details

The stability measure is defined as

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{|V_i \cap V_j| + C(V_i, V_j) + C(V_j, V_i)}{|V_i \cup V_j|}

with

C(V_k, V_l) = \frac{1}{|V_l|} \sum_{(x, y) \in V_k \times (V_l \setminus V_k) \ \mathrm{with} \ \mathrm{Similarity}(x,y) \geq \mathrm{threshold}} \mathrm{Similarity}(x,y).

Note that this definition differs slightly from the original one in order to make it suitable for arbitrary similarity measures.
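
As an illustration, the helper C(V_k, V_l) and the pairwise scores can be computed directly in base R; the helper name mirrors the formula above. This is only a minimal sketch; it is expected to agree with stabilityZucknick(features = feats, sim.mat = mat) with the default threshold and without correction for chance.

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
threshold = 0.9
C = function(Vk, Vl) {
  # sum of similarities >= threshold between Vk and Vl \ Vk, scaled by 1 / |Vl|
  extra = setdiff(Vl, Vk)
  if (length(extra) == 0) return(0)
  s = mat[Vk, extra, drop = FALSE]
  sum(s[s >= threshold]) / length(Vl)
}
pairs = combn(length(feats), 2)
scores = apply(pairs, 2, function(ij) {
  Vi = feats[[ij[1]]]; Vj = feats[[ij[2]]]
  (length(intersect(Vi, Vj)) + C(Vi, Vj) + C(Vj, Vi)) / length(union(Vi, Vj))
})
mean(scores)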

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is a set which contains the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i.

References

Zucknick M, Richardson S, Stronach EA (2008). “Comparing the Characteristics of Gene Expression Profiles Derived by Univariate and Multivariate Classification Methods.” Statistical Applications in Genetics and Molecular Biology, 7(1). doi:10.2202/1544-6115.1307.

Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

See Also

listStabilityMeasures

Examples

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityZucknick(features = feats, sim.mat = mat)