# Clustering

In this example, ICS is used for exploratory data analysis. Using a real-world dataset, it illustrates how ICS can reveal interesting structures in the data - in this case, clusters.

The dataset used here is the same as in the Outlier Detection tutorial, namely the [Forest covertypes](https://scikit-learn.org/stable/datasets/real_world.html#covtype-dataset) dataset. It contains cartographic information about forest patches, where the target variable corresponds to the dominant tree species in each patch. The dataset includes 54 features whose description is available online.

This example focuses on exploratory analysis of the observations belonging to cover type 2, which is the largest class in the dataset.

First, let us load the dataset, separate the features X and the target y, and filter the observations corresponding to cover type 2 (``y=2``).

```python
from sklearn.datasets import fetch_covtype

X, y = fetch_covtype(return_X_y=True, as_frame=True)
print(y.value_counts())

s = (y == 2)
X = X.loc[s]
y = y.loc[s]
print("X shape:", X.shape)
```

```text
Cover_Type
2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: count, dtype: int64
X shape: (283301, 54)
```

As in the Outlier Detection example, we remove variables containing almost only zeros in order to avoid singularity issues.

```python
# Features cleaning
zero_ratio = (X == 0).mean()
cols_to_drop = zero_ratio[zero_ratio > 0.95].index
print("Features to drop (more than 95% of 0 values):\n", cols_to_drop)

X = X.drop(cols_to_drop, axis=1)
print("X shape:", X.shape)
```

```text
Features to drop (more than 95% of 0 values):
 Index(['Wilderness_Area_1', 'Wilderness_Area_3', 'Soil_Type_0', 'Soil_Type_1',
       'Soil_Type_2', 'Soil_Type_3', 'Soil_Type_4', 'Soil_Type_5',
       'Soil_Type_6', 'Soil_Type_7', 'Soil_Type_8', 'Soil_Type_9',
       'Soil_Type_10', 'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14',
       'Soil_Type_15', 'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18',
       'Soil_Type_19', 'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_23',
       'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27',
       'Soil_Type_30', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35',
       'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38', 'Soil_Type_39'],
      dtype='str')
X shape: (283301, 18)
```

Since the dataset contains more than 280,000 observations, we keep only 5% of the data to reduce the computational cost.

```python
# Subsample the data
X_sub = X.sample(frac=0.05, random_state=42)
print("X_sub shape:", X_sub.shape)
```

```text
X_sub shape: (14165, 18)
```

When looking at the original data, no clear structure is visible. We therefore apply dimensionality reduction to better reveal any underlying organization.

```python
from icspylab import plot_ics

plot_ics(
    X_sub,
    col_names=X_sub.columns.tolist(),
    plot_kws={'alpha':0.7}
)
```

```{image} ../_static/clustering_orig.png
:alt: Original data of the clustering example
:width: 700px
:align: center
```

## PCA

We begin by applying PCA for dimensionality reduction and visualizing the resulting principal components. Although all components are retained, only the first six are shown for readability.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize the features so that PCA is not dominated by variable scale
scaler = StandardScaler().set_output(transform="pandas")
scaled_X_sub = scaler.fit_transform(X_sub)

pca = PCA()
X_transformed_pca = pca.fit_transform(scaled_X_sub)

plot_ics(
    X_transformed_pca,
    components="first",
    col_names=[f"PC_{i+1}" for i in range(X_transformed_pca.shape[1])],
    plot_kws={'alpha':0.7}
)
```

```{image} ../_static/clustering_pca_orig.png
:alt: PCA results of the clustering example
:width: 700px
:align: center
```

The first components reveal two clusters. Some structure is also visible on PC_5 and PC_6, each of which isolates a cluster, while other groups seem to overlap within the main bulk. As an illustration, we apply KMeans with ``n_clusters=2``.

```python
from sklearn.cluster import KMeans

kmeans_pca = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X_transformed_pca)

plot_ics(
    X_transformed_pca,
    y=kmeans_pca.labels_,
    components="first",
    col_names=[f"PC_{i+1}" for i in range(X_transformed_pca.shape[1])],
    plot_kws={'alpha':0.7}
)
```

```{image} ../_static/clustering_pca_clust.png
:alt: PCA results of the clustering example with kmeans labels
:width: 700px
:align: center
```

## ICS

We apply the same methodology using ICS in place of PCA. The invariant components are computed and visualized. As in PCA, all components are retained.

```python
from icspylab import ICS

ics = ICS(S1="tcov", S2="cov", center=True)
X_transformed_ics = ics.fit_transform(X_sub)

plot_ics(X_transformed_ics, components="first", plot_kws={'alpha':0.7})
```

```{image} ../_static/clustering_ics_orig.png
:alt: ICS results of the clustering example
:width: 700px
:align: center
```

The IC_2–IC_3 projection reveals a clear clustered structure, with roughly seven visually distinct groups. We therefore apply KMeans with ``n_clusters=7``.

```python
kmeans_ics = KMeans(n_clusters=7, random_state=0, n_init="auto").fit(X_transformed_ics)

plot_ics(
    X_transformed_ics,
    components="first",
    y=kmeans_ics.labels_,
    plot_kws={'alpha':0.7}
)
```

```{image} ../_static/clustering_ics_clust.png
:alt: ICS results of the clustering example with kmeans labels
:width: 700px
:align: center
```

While PCA mainly highlights two broad groups, ICS reveals a richer cluster structure that becomes visible in only a few components. This illustrates how ICS can help uncover meaningful subgroups during exploratory data analysis.
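
To give some intuition for why a few invariant components can carry the cluster structure, here is a minimal NumPy sketch of the general idea behind ICS: whiten the data with a first scatter matrix, then diagonalize a second scatter matrix on the whitened data. This is not the icspylab implementation. It assumes the classical covariance / fourth-moment scatter pair rather than the ``tcov``/``cov`` pair used above, and the helper names `cov4` and `ics_sketch`, as well as the synthetic data, are hypothetical and only serve the illustration.

```python
import numpy as np


def cov4(X):
    """Fourth-moment scatter: covariance weighted by squared Mahalanobis distances."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    # Squared Mahalanobis distance of every observation
    d2 = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(S), Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))


def ics_sketch(X):
    """Illustrative ICS with the cov/cov4 pair (an assumption, not icspylab's code).

    Returns the generalized eigenvalues in decreasing order and the
    corresponding invariant coordinates.
    """
    Xc = X - X.mean(axis=0)
    S1 = np.cov(Xc, rowvar=False)
    # Whiten with the symmetric inverse square root of the first scatter
    w, V = np.linalg.eigh(S1)
    S1_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    Z = Xc @ S1_inv_sqrt
    # Diagonalize the second scatter computed on the whitened data
    evals, evecs = np.linalg.eigh(cov4(Z))
    order = np.argsort(evals)[::-1]
    return evals[order], Z @ evecs[:, order]


# Synthetic data: two balanced clusters hidden along a low-variance direction,
# plus two high-variance Gaussian noise directions, mixed by a random matrix.
rng = np.random.default_rng(0)
n = 4000
labels = rng.integers(0, 2, size=n)
latent = np.column_stack([
    rng.normal(0, 10, n),
    rng.normal(0, 5, n),
    np.where(labels == 1, 2.0, -2.0) + rng.normal(0, 0.5, n),
])
A = rng.normal(size=(3, 3))  # arbitrary mixing of the latent variables
X_demo = latent @ A.T

evals, scores = ics_sketch(X_demo)

# With this scatter pair, balanced clusters are expected on the last component
pred = (scores[:, -1] > 0).astype(int)
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print("Generalized eigenvalues:", np.round(evals, 3))
print("Cluster recovery from the last component:", round(float(acc), 3))
```

In this toy setting the variance is dominated by the noise directions, so a variance-based projection has no particular reason to align with the cluster direction, whereas the component with the smallest generalized eigenvalue does. Which components carry the structure depends on the scatter pair and on the cluster proportions, so in practice all components should be inspected, as done with `plot_ics` above.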