Principal Component Analysis
Use the Geochem Analysis > Principal Component Analysis option (CHPCOMP GX) to perform a principal component analysis.
Principal Component Analysis dialog options
Principal Component Analysis
Principal component analysis is a collection of mathematical methods designed to reveal mathematical relationships between two or more (often many) variables. Measurements involving many variables are commonly encountered in mineral exploration and geochemistry. For instance, the concentrations of a suite of minerals or elements may be determined for a number of rock samples; in this case the "variables" are the concentrations of each constituent. Consider a collection of samples containing three elements of interest. The relative abundance of the three elements for each sample can be plotted as a unique position in a 3-dimensional plot, with the axes representing the elements. If the points were concentrated about some plane or line, it would be because some inter-relationship or dependence existed between the variables. Principal component analysis determines the significance of these correlations. The first principal component is the "best fit" line through the data; if the data were concentrated along a line, the first principal component would contain most of the "information" about the correlations in the data. The second principal component is determined by fitting the best-fit line through the data after the first principal component’s contribution has been removed, and so on. For N variables, a total of N principal components can be extracted, and the data can be completely reconstructed from the information contained within the components.
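As an illustration of the best-fit idea, the NumPy sketch below (with synthetic data standing in for three element concentrations; this is not the GX's code) finds the principal directions of a data set that lies close to a line, and confirms that the first component carries nearly all of the variance:

```python
import numpy as np

# Hypothetical data: 200 samples of three strongly correlated "element
# concentrations", so the points cluster about a line in 3-D space.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

# Centre the data, then find the principal directions from the
# eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Because the points lie near a line, the first component carries almost
# all of the variance ("information") in the data.
explained = eigvals / eigvals.sum()
print(explained)  # first entry close to 1.0
```

The same decomposition generalises to N variables, where each successive component is the best-fit direction in the residual left by the earlier ones.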
Data Standardisation
Prior to the calculation of the principal components, the data are transformed into a condition amenable to analysis. In the CHPCOMP GX, if the assay channel’s "Logarithmic Distribution" attribute is set to "Yes", the logarithms of the data are taken. The mean is then removed, and finally the data are normalised through division by the standard deviation, giving each variable unit variance.
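The exact internals of the CHPCOMP GX are not reproduced here; the sketch below assumes a base-10 log transform and illustrates the standardisation steps just described (the `logarithmic` flag stands in for the channel's "Logarithmic Distribution" attribute):

```python
import numpy as np

def standardise(values, logarithmic=False):
    """Sketch of the pre-analysis transform: optional log, remove the
    mean, divide by the standard deviation. The GX's exact internals
    may differ in detail (e.g. log base, handling of missing values)."""
    x = np.asarray(values, dtype=float)
    if logarithmic:
        x = np.log10(x)          # assumes strictly positive assay values
    x = x - x.mean()             # remove the mean
    return x / x.std(ddof=1)     # normalise to unit variance

# Log-distributed concentrations spanning three orders of magnitude.
z = standardise([1.0, 10.0, 100.0, 1000.0], logarithmic=True)
print(z.mean(), z.std(ddof=1))   # ~0.0 and 1.0
```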
Eigenvalues
A correlation matrix is produced from the transformed data. An eigenvector decomposition is performed to determine the eigenvectors (which are directionally equivalent to the principal components) and the eigenvalues. The relative significance of each component is indicated by its "eigenvalue". The first principal component has the largest eigenvalue, and succeeding components have progressively smaller eigenvalues as their significance in the data decreases. In our analysis, the sum of the eigenvalues is equal to the number of variables, N, which means that the average eigenvalue is one. Typically, it is those components with eigenvalues exceeding one that are of interest to the analyst. By selecting a limited number of components and re-synthesising the data from those components, the more important interrelationships between the variables can be emphasised and analysed. Discarding the contributions of the lesser components can be viewed as eliminating the "noise" from the data. (In fact it may not be "noise" in the traditional sense, but a combination of natural variability, measurement error and "true" correlation factors that are small enough to be ignored.)
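The eigenvalue properties above can be checked directly. This NumPy sketch (synthetic data, not the GX's code) builds a correlation matrix for four variables, two of which are strongly correlated, and verifies that the eigenvalues sum to N:

```python
import numpy as np

# Hypothetical standardised data: two correlated variables plus two
# nearly independent ones, for 500 samples.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
X = np.column_stack([a,
                     a + 0.1 * rng.normal(size=500),
                     rng.normal(size=500),
                     rng.normal(size=500)])

corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first

# The eigenvalues sum to the number of variables N (= 4 here),
# so the average eigenvalue is one.
print(eigvals.sum())

# Components with eigenvalues above that average are the usual keepers;
# the strong pairwise correlation shows up as one eigenvalue near 2.
keep = eigvals > 1.0
print(eigvals, keep)
```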
Principal Component Loading and Scores
The principal component loadings are the eigenvectors of the correlation matrix, ordered by decreasing eigenvalue and scaled by the square roots of the eigenvalues. They are the "loadings" of the variables on the principal component axes.
For a given variable, the vector sum of the loadings over all the components has a length of one. Generally, the largest loadings occur for the largest components, and the loadings for the last components are generally very small.
The "scores" describe the contribution of each principal component to each data point. The standardised data can be reconstituted from a matrix multiplication of the scores by the transpose of the loadings. Unlike the loadings, these values are not automatically normalised, and it can be useful, for display purposes, to re-scale them from 0 to 100 in order to show more clearly which components are most influential.
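To make the loading and score definitions concrete, the following NumPy sketch (synthetic data; not the GX's implementation) builds the loadings from the correlation matrix, checks that each variable's loadings have unit length across all components, and reconstructs the standardised data from the scores times the transpose of the loadings:

```python
import numpy as np

# Hypothetical data: two anti-correlated variables plus an independent one.
rng = np.random.default_rng(2)
t = rng.normal(size=300)
X = np.column_stack([t + 0.2 * rng.normal(size=300),
                     -t + 0.2 * rng.normal(size=300),
                     rng.normal(size=300)])

# Standardise, then decompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvectors scaled by the square roots of the eigenvalues.
loadings = eigvecs * np.sqrt(eigvals)

# For each variable, the loadings across all components have unit length
# (their squares sum to one).
print((loadings**2).sum(axis=1))

# Scores: projections of the standardised data onto the component axes,
# scaled so that scores @ loadings.T reproduces the standardised data.
scores = Z @ eigvecs / np.sqrt(eigvals)
print(np.allclose(scores @ loadings.T, Z))  # True
```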
Varimax Normalization
The directions of the principal components are constrained by the fact that they are mutually orthogonal. Once the "dimensionality" of the data has been reduced by rejecting the contributions of a number of the lesser principal components, it is often possible to rotate the remaining axes to obtain a better "fit". One scheme which does this is known as "Kaiser’s varimax" scheme. It operates by moving each principal component axis so that the projections of each variable onto that axis lie either near the extremities (a loading value of plus or minus one) or near the origin (a loading value of zero). This sometimes eases the interpretation of the data in terms of the original variables.
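A standard published form of Kaiser's varimax rotation (not necessarily the GX's exact implementation) can be sketched as follows. The example mixes a simple two-factor loading structure by a 45-degree rotation and lets varimax recover an interpretable orientation:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Kaiser's varimax rotation via the usual SVD iteration. Rotates the
    retained loadings so that each variable loads near +/-1 or near 0 on
    each axis, leaving each variable's communality unchanged."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag((Lr**2).sum(axis=0))))
        R = u @ vt
        var_new = s.sum()
        if var_new - var_old < tol:
            break
        var_old = var_new
    return L @ R

# Hypothetical loadings with simple structure, deliberately mixed by a
# 45-degree rotation so no axis lines up with either factor.
simple = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
c = np.sqrt(0.5)
mixed = simple @ np.array([[c, -c], [c, c]])
rotated = varimax(mixed)
print(rotated)  # each row again dominated by a single loading
```

Because the rotation is orthogonal, the squared loadings of each variable still sum to the same communality before and after rotation; only the orientation of the axes changes.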
© 2024 Seequent, The Bentley Subsurface Company