Component Selection

Module containing component selection methods and the ComponentSelect class.

The component selection step occurs after the invariant components have been computed. Several selection methods are already implemented. To call ICS with another method, create a ComponentSelect object, as detailed in the Custom Component Selection tutorial.

class icspylab.comp_select.ComponentSelect(label, components, n_components, component_names, info)[source]

Bases: object

A class to represent a component selection method and its related data.

label

Label of the component selection method.

Type:

str

components

Invariant components selected by the method.

Type:

ndarray

n_components

Number of invariant components selected by the method.

Type:

int

component_names

Names of invariant components selected by the method.

Type:

ndarray

info

Additional information specific to the method.

Type:

dict or None
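
The attributes above can be pictured with a small stand-in. The sketch below does not import icspylab: ComponentSelectSketch and first_k_rule are hypothetical names, plain lists stand in for ndarrays, and what exactly the components attribute stores in the real class is an assumption here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentSelectSketch:
    """Hypothetical stand-in that mirrors the documented attributes."""
    label: str
    components: list        # selected invariant components (an ndarray in icspylab)
    n_components: int
    component_names: list   # names of the selected components
    info: Optional[dict] = None


def first_k_rule(scores, k):
    """Toy custom rule: keep the first k invariant components (columns)."""
    selected = [row[:k] for row in scores]
    names = [f"IC_{j + 1}" for j in range(k)]
    return ComponentSelectSketch(
        label="first_k",
        components=selected,
        n_components=k,
        component_names=names,
        info={"crit": "first_k", "nb_select": k},
    )


# Two observations described by three invariant coordinates (made-up values).
scores = [[0.1, 1.2, -0.3], [0.4, -0.8, 0.5]]
res = first_k_rule(scores, k=2)
print(res.label, res.n_components, res.component_names)
# first_k 2 ['IC_1', 'IC_2']
```

For the real class, the constructor signature documented above takes the same five fields in the same order.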

icspylab.comp_select.dftu(x)[source]

Apply the Double Folding Test of Unimodality (DFTU), a two-step extension of Siffer’s Folding Test of Unimodality (FTU). The null hypothesis states that the underlying distribution is unimodal. Small p-values provide evidence against unimodality.

Parameters:

x (ndarray) – Data

Returns:
  • stat (float) – Test statistic.

  • p_val (float) – Associated p-value.

Return type:

tuple

Details:

The test is based on the statistic:

\[T = \min(\Phi_1, \Phi_2)\]

\(\Phi_1\) and \(\Phi_2\) are obtained from two successive folding steps.

Hypotheses:

\[ \begin{align}\begin{aligned}H_0: T \geq 1 \quad \text{(the distribution is unimodal)}\\H_1: T < 1 \quad \text{(the distribution is not unimodal)}\end{aligned}\end{align} \]

Small values of \(T\) provide evidence against unimodality.

References
  • Becquart, C., Archimbaud, A., Ruiz, A.M., Smida, Z., A Note on the Folding Test of Unimodality: limitations and an improved alternative.

  • Siffer, A., Fouque, P.-A., Termier, A. and Largouët, C. (2018), Are your data gathered?, In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2210–2218. <doi:10.1145/3219819.3219994>.

Example

>>> from sklearn.datasets import load_iris
>>> from icspylab import dftu
>>> iris = load_iris()
>>> X = iris.data
>>> stat, p = dftu(X[:,0])
>>> print(round(stat, 2), round(p, 2))
1.15 1.0
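
To give a feel for what a single folding step measures, here is a minimal sketch of the univariate folding statistic as defined by Siffer et al. (2018), written from that definition as recalled here rather than from the icspylab source. The function name folding_statistic is hypothetical, the second folding step of the DFTU is not reproduced, and no p-value is computed.

```python
def folding_statistic(x):
    """Single-fold statistic for a univariate sample (sketch, not dftu itself)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n

    # Closed-form folding pivot: half the covariance of X and X^2 over Var(X).
    mean_sq = sum(v * v for v in x) / n
    cov_x_x2 = sum((v - mean) * (v * v - mean_sq) for v in x) / n
    pivot = 0.5 * cov_x_x2 / var

    # Fold the sample around the pivot and measure the variance that remains.
    folded = [abs(v - pivot) for v in x]
    folded_mean = sum(folded) / n
    folded_var = sum((f - folded_mean) ** 2 for f in folded) / n

    # The factor (1 + d)^2 = 4 for d = 1 makes a uniform sample score about 1.
    return 4.0 * folded_var / var


two_clusters = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]   # clearly bimodal toy data
one_bump = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]       # single mode at 0

print(folding_statistic(two_clusters) < 1)  # True: folding removes most variance
print(folding_statistic(one_bump) > 1)      # True: folding leaves variance behind
```

Values below 1 point towards multimodality and values above 1 towards unimodality, which matches the direction of the hypotheses stated above.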

icspylab.comp_select.median_crit(kurtosis, W, nb_select=None, **kwargs)[source]

Identifies as interesting the invariant coordinates whose generalized eigenvalues (kurtosis values) are farthest from the median of all generalized eigenvalues.

Parameters:
  • kurtosis (ndarray) – Array of kurtosis values.

  • W (ndarray) – Transformation matrix in which each row contains the coefficients of the linear transformation to the corresponding invariant coordinate.

  • nb_select (int or None, default=None) – Exact number of components to select. If None (default), the number of components to select is the number of variables minus one.

Returns:

Summary of the component selection step

Return type:

dict

References

  • Archimbaud, A., Alfons, A., Nordhausen, K., & Ruiz-Gazen, A. (2023). ICSClust: Tandem clustering with invariant coordinate selection.

  • Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2024). Tandem clustering with invariant coordinate selection. Econometrics and Statistics. doi:10.1016/j.ecosta.2024.03.002.

Example

>>> from sklearn.datasets import load_iris
>>> from icspylab import ICS, median_crit
>>> iris = load_iris()
>>> X = iris.data
>>> ics = ICS(S1="cov", S2="cov4")
>>> X_new = ics.fit_transform(X)
>>> selection_res = median_crit(kurtosis=ics.kurtosis_, W=ics.components_)
>>> print(selection_res.info)
{'crit': 'med', 'nb_select': 3, 'gen_kurtosis': array([1.20739878, 1.0269412 , 0.9292235 , 0.74046722]), 'med_gen_kurtosis': np.float64(0.9780823483964416), 'gen_kurtosis_diff_med': array([0.22931644, 0.04885885, 0.04885885, 0.23761513]), 'component_names': ['IC_4', 'IC_1', 'IC_2']}
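
The criterion itself can be sketched in a few lines of plain Python. This is an illustration of the rule described above, not the icspylab implementation: median_select is a hypothetical helper, and the kurtosis values are copied from the example output.

```python
from statistics import median


def median_select(gen_kurtosis, nb_select):
    """Return the names of the nb_select components farthest from the median."""
    med = median(gen_kurtosis)
    diffs = [abs(k - med) for k in gen_kurtosis]
    # Rank component indices by decreasing distance to the median.
    order = sorted(range(len(diffs)), key=lambda j: diffs[j], reverse=True)
    return [f"IC_{j + 1}" for j in order[:nb_select]]


gen_kurtosis = [1.20739878, 1.0269412, 0.9292235, 0.74046722]
print(median_select(gen_kurtosis, nb_select=2))
# ['IC_4', 'IC_1']
```

With an even number of components, the two middle values are equally distant from the median (0.04885885 for both IC_2 and IC_3 in the output above), so a third pick depends on how ties are broken.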

icspylab.comp_select.normal_crit(X, W, level=0.05, test='agostino', max_select=None, **kwargs)[source]

Identifies invariant coordinates that deviate from normality using univariate normality tests. Only the first and last components are investigated.

SciPy implementations are used. The available tests are: normal, agostino, jarque, anscombe, and shapiro.

Parameters:
  • X (ndarray) – Data to fit the ICS model, where rows are samples and columns are features.

  • W (ndarray) – Transformation matrix in which each row contains the coefficients of the linear transformation to the corresponding invariant coordinate.

  • level (float, default=0.05) – Initial significance level for the tests.

  • test ({'normal', 'agostino', 'jarque', 'anscombe', 'shapiro'}, default='agostino') – Name of the univariate normality test to apply.

  • max_select (int or None, default=None) – Maximum number of components to select.

Returns:

Summary of the component selection step

Return type:

dict

References

  • Archimbaud, A., Alfons, A., Nordhausen, K., & Ruiz-Gazen, A. (2023). ICSClust: Tandem clustering with invariant coordinate selection.

  • Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2024). Tandem clustering with invariant coordinate selection. Econometrics and Statistics. doi:10.1016/j.ecosta.2024.03.002.

Example

>>> from sklearn.datasets import load_iris
>>> from icspylab import ICS, normal_crit
>>> iris = load_iris()
>>> X = iris.data
>>> ics = ICS(S1="cov", S2="cov4")
>>> X_new = ics.fit_transform(X)  # fit first so that ics.components_ is available
>>> selection_res = normal_crit(X=X, W=ics.components_)
>>> print(selection_res.info)
{'crit': 'normal', 'level': 0.05, 'max_select': 3, 'test': 'agostino', 'pvalues': array([0.07492811, 0.19460223, 0.9311222 , 0.00942277]), 'adjusted_levels': [0.05, 0.025], 'component_names': ['IC_4']}
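
One reading of "only the first and last components are investigated" that reproduces the output above is a sequential procedure working inwards from both ends. The sketch below is inferred from the documented output alone and should be treated as an assumption, not as the icspylab implementation; select_from_ends is a hypothetical helper that takes precomputed p-values.

```python
def select_from_ends(pvalues, level=0.05, max_select=None):
    """Sequentially test the outermost remaining components (sketch)."""
    p = len(pvalues)
    if max_select is None:
        max_select = p - 1  # matches 'max_select': 3 for four components
    left, right = 0, p - 1
    selected, adjusted_levels = [], []
    while len(selected) < max_select and left <= right:
        # Assumed adjustment: divide the level by the rank of the current test.
        current_level = level / (len(selected) + 1)
        adjusted_levels.append(current_level)
        # Candidate is whichever end currently has the smaller p-value.
        j = left if pvalues[left] <= pvalues[right] else right
        if pvalues[j] >= current_level:
            break  # no rejection: stop investigating further components
        selected.append(f"IC_{j + 1}")
        if j == left:
            left += 1
        else:
            right -= 1
    return selected, adjusted_levels


pvalues = [0.07492811, 0.19460223, 0.9311222, 0.00942277]
print(select_from_ends(pvalues))
# (['IC_4'], [0.05, 0.025])
```

Here IC_4 is selected at level 0.05, after which IC_1 (p-value 0.075) fails at the adjusted level 0.025 and the procedure stops.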

icspylab.comp_select.unimodal_crit(X, W, level=0.05, max_select=None, **kwargs)[source]

Identifies invariant coordinates that are multimodal using the univariate Double Folding Test of Unimodality (DFTU). Only the first and last components are investigated.

Parameters:
  • X (ndarray) – Data to fit the ICS model, where rows are samples and columns are features.

  • W (ndarray) – Transformation matrix in which each row contains the coefficients of the linear transformation to the corresponding invariant coordinate.

  • level (float, default=0.05) – Initial significance level for the tests.

  • max_select (int or None, default=None) – Maximum number of components to select.

Returns:

Summary of the component selection step

Return type:

dict

Example

>>> from sklearn.datasets import load_iris
>>> from icspylab import ICS, unimodal_crit
>>> iris = load_iris()
>>> X = iris.data
>>> ics = ICS(S1="cov", S2="cov4")
>>> X_new = ics.fit_transform(X)  # fit first so that ics.components_ is available
>>> selection_res = unimodal_crit(X=X, W=ics.components_)
>>> print(selection_res.info)
{'crit': 'unimodal', 'level': 0.05, 'max_select': 3, 'pvalues': array([9.99998344e-01, 9.97549948e-01, 9.99996927e-01, 2.85058691e-12]), 'adjusted_levels': [0.05, 0.025], 'component_names': ['IC_4']}
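
Since the test is univariate, it has to be applied to one invariant coordinate at a time. The sketch below shows how X and W could combine under the row convention described for W; invariant_coordinates is a hypothetical helper, plain lists stand in for ndarrays, and any centering the library may apply is left out.

```python
def invariant_coordinates(X, W):
    """Project each observation onto every row of W (i.e. compute X W^T)."""
    return [
        [sum(w * x for w, x in zip(w_row, x_row)) for w_row in W]
        for x_row in X
    ]


# Three observations, two variables, and a made-up 2 x 2 transformation matrix.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.0], [1.0, -1.0]]

Z = invariant_coordinates(X, W)
first_component = [row[0] for row in Z]    # coordinate given by the first row of W
last_component = [row[-1] for row in Z]    # coordinate given by the last row of W
print(first_component, last_component)
# [1.0, 3.0, 5.0] [-1.0, -1.0, -1.0]
```

Each of these columns is the kind of one-dimensional sample that a univariate test such as dftu takes as input.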