Scatter

Module containing scatter matrix calculations and the Scatter class.

This module provides various functions to compute scatter matrices, which are essential for the ICS algorithm. The implemented scatter matrices include the covariance matrix, weighted covariance matrix, and pairwise one-step M-estimate of scatter. These scatter matrices are encapsulated in the Scatter class, which includes information about the location (mean) and a label describing the type of scatter matrix. If you want to use ICS with other scatter matrices than the ones provided in this module, you would need to create a Scatter object. See the Custom Scatter for a complete example.

Most of the implemented scatters come from the R package ICS.

class icspylab.scatter.Scatter(location, scatter, label)[source]

Bases: object

A class to represent the scatter matrix and its related data.

location

The mean location of the data.

Type:: ndarray

scatter

The scatter matrix.

Type:: ndarray

label

A label describing the scatter matrix.

Type:: str

icspylab.scatter.cov(X, location=True)[source]

Compute the covariance matrix.

Parameters:

X (array-like) – The data matrix.
location (bool, default=True)

Returns:

An object containing the location and scatter matrix.

Return type:

Scatter

Refernce:: Refer to numpy.cov

Example

>>> import numpy as np
>>> from icspylab.scatter import cov
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> cov_X = cov(X)
>>> print(cov_X.scatter)
[[5.6 3.6]
 [3.6 2.4]]

icspylab.scatter.cov4(X, location=True)[source]

Compute a custom weighted covariance matrix (cov4) which internally uses covW with alpha=1 and cf=(1 / (p + 2)).

Parameters:

X (array-like) – The data matrix.
location (bool, default=True)

Returns:

An object containing the location and custom weighted scatter matrix.

Return type:

Scatter

References

Cardoso, J.F. (1989), Source separation using higher order moments, in Proc. IEEE Conf. on Acoustics, Speech and Signal Processing (ICASSP’89), 2109–2112. <doi:10.1109/ICASSP.1989.266878>.
Oja, H., Sirkia, S. and Eriksson, J. (2006), Scatter matrices and independent component analysis, Austrian Journal of Statistics, 35, 175–189.
Nordhausen, K., Oja, H., & Tyler, D. E. (2008). Tools for exploring multivariate data: The package ICS. Journal of Statistical Software, 28, 1-31.

Example

>>> import numpy as np
>>> from icspylab.scatter import cov4
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> cov4_X = cov4(X)
>>> print(cov4_X.scatter)
[[1.94444444 1.25      ]
 [1.25       0.83333333]]

icspylab.scatter.covAxis(X, location=True)[source]

Compute the one-step Tyler shape matrix which internally uses covW with alpha=-1 and cf=p.

Parameters:

X (array-like) – The data matrix.
location (bool, default=True)

Returns:

An object containing the location and scatter matrix.

Return type:

Scatter

References

Critchley , F., Pires, A. and Amado, C. (2006), Principal axis analysis, Technical Report, 06/14, The Open University Milton Keynes.
Tyler, D.E., Critchley, F., Dumbgen, L. and Oja, H. (2009), Invariant co-ordinate selection, Journal of the Royal Statistical Society,Series B, 71, 549–592. <doi:10.1111/j.1467-9868.2009.00706.x>.
Nordhausen, K., Oja, H., & Tyler, D. E. (2008). Tools for exploring multivariate data: The package ICS. Journal of Statistical Software, 28, 1-31.

Example

>>> import numpy as np
>>> from icspylab.scatter import covAxis
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> covAxis_X = covAxis(X)
>>> print(covAxis_X.scatter)
[[5.6 3.6]
 [3.6 2.4]]

icspylab.scatter.covW(X, location=True, alpha=1, cf=1)[source]

Estimates the scatter matrix based on one-step M-estimator using mean and covariance matrix as starting point. For more details, check the R documentation of the package ICS (function covW).

Parameters:

X (array-like) – Data matrix, must be at least bi-variate.
location (bool, default=True)
alpha (float, default=1)
cf (float, default=1)

Returns:

An object containing the location and weighted scatter matrix.

Return type:

Scatter

Details:: It is given for a \(n\) x \(p\) matrix \(X\) by:

\(CovW(X) = (1/n) cf \sum_{i=1}^{n} w(D^2(x_i)) (x_i - \overline{x})^T (x_i - \overline{x})\)

where:

\(n\) is the number of observations,
\(x_i\) is the i-th observation vector,
\(\overline{x}\) is the mean vector of all observations,
\(w(d)= d^{α}\) is a non-negative and continuous weight function applied to the squared Mahalanobis distance \(D^2(x_i)\).
\(cf\) is a consistency factor

References

Tukey, J. W. (1977). Exploratory data analysis (Vol. 2, pp. 131-160). Reading, MA: Addison-wesley.
Archimbaud, A., Drmac, Z., Nordhausen, K., Radojicic, U. and Ruiz-Gazen, A. (2023). SIAM Journal on Mathematics of Data Science (SIMODS), Vol.5(1):97–121. doi:10.1137/22M1498759.
Nordhausen, K., Oja, H., & Tyler, D. E. (2008). Tools for exploring multivariate data: The package ICS. Journal of Statistical Software, 28, 1-31.

Example

>>> import numpy as np
>>> from icspylab.scatter import covW
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> covW_X = covW(X)
>>> print(covW_X.scatter)
[[7.77777778 5.        ]
 [5.         3.33333333]]

icspylab.scatter.mcd(X, support_fraction=0.25, reweighted=False, **kwargs)[source]

Wrapper function around scikit learn’s implementation of the MCD (Minimum Covariance Determinant) estimator, computed using the FastMCD algorithm. The MCD is a robust estimator of location and covariance. It is based on selecting a subset of observations (of given size) whose empirical covariance matrix has minimal determinant. The covariance estimate is then rescaled to ensure consistency. An optional reweighting step can be applied, where observations are weighted according to their Mahalanobis distance from the estimated location. This yields the reweighted MCD estimator. In scikit-learn’s MinCovDet, both raw (non-reweighted) and reweighted estimates are available through the attributes raw_location_, raw_covariance_, location_, and covariance_. By default, this function returns the raw estimates. Fore more information, check out scikit learn’s documentation.

Parameters:

X (array-like) – The data matrix.
support_fraction (float or None, default=0.25) – The proportion of observations to be included in the support of the raw MCD estimate. This corresponds to the support_fraction parameter of sklearn.covariance.MinCovDet. By default, it is set to 0.25, which differs from the default used in sklearn.covariance.MinCovDet. If None, the original MinCovDet default value is used: (n_samples + n_features + 1) / (2 * n_samples).
reweighted (bool, default=False)

Returns:

An object containing the location and scatter matrix.

Return type:

Scatter

References

Refer to sklearn.covariance.MinCovDet
Rousseeuw, P.J. (1984) Least median of squares regression. J. Am Stat Ass, 79:871.
A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS.

Example

>>> import numpy as np
>>> from icspylab.scatter import mcd
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> mcd_X = mcd(X, support_fraction=None)
>>> print(mcd_X.scatter)
[[4.66666667 3.        ]
 [3.         2.        ]]

icspylab.scatter.tM(X, df=1, mu_init=None, V_init=None, eps=1e-06, maxiter=100)[source]

Computes the location and scatter for a multivariate t-distribution for a given degree of freedom. This function implements the third EM algorithm described in Kent et al. (1994). The norm used to define convergence is as in Arslan et al. (1995).

Parameters:

X (array-like) – data matrix, must be at least bi-variate
df (int >= 1, default=1)
distribution. (to the Cauchy)
mu_init (ndarray(p) or None, default=None)
V_init (ndarray(p,p) or None, default=None)
eps (float, default=1e-6)
maxiter (int, default=100)

Returns:

_alg3 output

Return type:

tuple

References

Kent, J.T., Tyler, D.E. and Vardi, Y. (1994), A curious likelihood identity for the multivariate tdistribution, Communications in Statistics, Simulation and Computation, 23, 441–453. <doi:10.1080/03610919408813180>.
Arslan, O., Constable, P.D.L. and Kent, J.T. (1995), Convergence behaviour of the EM algorithm for the multivariate t-distribution, Communications in Statistics, Theory and Methods, 24, 2981–3000. <doi:10.1080/03610929508831664>.
Nordhausen, K., Oja, H., & Tyler, D. E. (2008). Tools for exploring multivariate data: The package ICS. Journal of Statistical Software, 28, 1-31.

Example

>>> import numpy as np
>>> from icspylab.scatter import tM
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> tM_X = tM(X)
>>> print(tM_X.scatter)
[[4.66666667 3.        ]
 [3.         2.        ]]

icspylab.scatter.tcov(X, beta=2)[source]

Computes a pairwise one-step M-estimate of scatter with weights based on pairwise Mahalanobis distances. It can be interpreted as a local covariance matrix. Without the exponential term (or when beta=0), \(T_n\) is proportional to the classical covariance matrix. The weighting scheme increases the importance of close sample pairs and decreases the contribution of pairs that are far apart. Note that this estimator is based on pairwise differences and therefore no location estimate is returned.

Parameters:

X (array-like) – data
beta (int or float > 0, default=2) – positive numeric value specifying the tuning parameter of the tcov estimator. The optimal value depends on the data but it usually is between 1.5 and 2.5.

Returns:

An object containing the location(=None) and scatter matrix.

Return type:

Scatter

Details:: It is given for a \(n\) x \(p\) matrix \(X\) by:

\[T_n(\beta) = \frac{ \sum_{i=1}^{n} \sum_{j=i+1}^{n} \exp\left( -\frac{\beta}{2} \lVert X_i - X_j \rVert_{V_n^{-1}}^2 \right) (X_i - X_j)(X_i - X_j)^\top }{ \sum_{i=1}^{n} \sum_{j=i+1}^{n} \exp\left( -\frac{\beta}{2} \lVert X_i - X_j \rVert_{V_n^{-1}}^2 \right) }\]

where

\[V_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X_n)(X_i - \bar X_n)^\top\]

\[\bar X_n = \frac{1}{n} \sum_{i=1}^{n} X_i\]

\[\lVert x \rVert_M^2 = x^\top M x\]

The computation is optimized with Numba. In the paper, we have \(w(x) = exp(-x/2)\). But since we always call \(w(beta * r^2)\), we instead set \(b = -beta/2\) and use \(w(x) = exp(x)\).

References

Caussinus, H. and Ruiz-Gazen, A. (1993) Projection Pursuit and Generalized Principal Component Analysis. In Morgenthaler, S., Ronchetti, E., Stahel, W.A. (eds.) New Directions in Statistical Data Analysis and Robustness, 35-46. Monte Verita, Proceedings of the Centro Stefano Franciscini Ascona Series. Springer-Verlag.
Caussinus, H. and Ruiz-Gazen, A. (1995) Metrics for Finding Typical Structures by Means of Principal Component Analysis. In Data Science and its Applications, 177-192. Academic Press.

Example

>>> import numpy as np
>>> from icspylab.scatter import tcov
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> tcov_X = tcov(X)
>>> print(tcov_X.scatter)
[[5.03250611 3.2351825 ]
 [3.2351825  2.15678833]]

icspylab.scatter.tcovAxis(X)[source]

Computes a pairwise one-step M-estimate of scatter, similar to tcov, but using the same weights as covAxis.

Parameters:: X (array-like) – data matrix, must be at least bi-variate.
Returns:: An object containing the location(=None) and scatter matrix.
Return type:: Scatter

Details:: It is given for a \(n\) x \(p\) matrix \(X\) by:

\[T_n = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{ (y_i - y_j)(y_i - y_j)^\top }{ \left( (y_i - y_j)^\top S_n^{-1} (y_i - y_j) \right)^2 }\]

Reference:

Tyler, D.E., Critchley, F., Dumbgen, L. and Oja, H. (2009), Invariant co-ordinate selecetion, Journal of the Royal Statistical Society,Series B, 71, 549–592. <doi:10.1111/j.1467-9868.2009.00706.x>.

Example

>>> import numpy as np
>>> from icspylab.scatter import tcovAxis
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> tcovAxis_X = tcovAxis(X)
>>> print(tcovAxis_X.scatter)
[[0.98 0.63]
 [0.63 0.42]]