The Long, Complicated, Tedious, and Difficult Route to Principal Components: Part III

Jerome Workman Jr.;

Article

Spectroscopy

May 1, 2008

Spectroscopy

Spectroscopy-05-01-2008

Volume23

Issue 5

The Long, Complicated, Tedious, and Difficult Route to Principal Components: Part III

Author(s):

Jerome Workman Jr.

Howard Mark and Jerome Workman, Jr. continue their discussion of the derivation of the principal component algorithm using elementary algebra.

This column is a continuation of the set we have been working on to explain and derive the equations behind principal components (1,2). As we usually do, when we continue the discussion of a topic through more than one column, we continue the numbering of equations from where we left off.

Jerome Workman, Jr.

We also repeat our note that while our basic approach will avoid the use of matrices in the explanations, we will include the matrix expression corresponding to several of the results we derive. There are several reasons for this, which were given in the previous column. It might not always be obvious that a given matrix expression is the decomposition of the algebraic equations stated. While it might not always be easy to go from the algebra to the matrix notation, it is always possible to confirm the correspondence by performing the specified matrix operations and seeing that the algebraic expression is recovered. For some of the more complex expressions, we present a demonstration of the equivalency in an appendix of the column.

Howard Mark

Our first foray into demonstrating the nature of principal components (2) led to some important, and necessary, conclusions but did not go where we want to end up. The reasons for that were manifold:

We did not take into account the fact that the loadings representing the principal component under consideration are normalized, and in such a way that the variance of the coefficients representing the loadings equals unity.

We did not take into account the multivariate nature of the analysis; the calculation of sums of squares in equation 18 (2) implicitly treated the data as a group of independent univariate quantities, rather than as parts of the multivariate whole.

Because our previous attack was inherently univariate, it did not (and could not) take into account the correlation structure of the data; recalling that having principal components subsume into themselves the correlated pieces of information in the data set is one of the inherent properties of principal component analysis (1).

We had discussed the first point in a previous column (2). Therefore, we will begin by considering the second point (we will introduce the normalization of the loadings at the appropriate point). Let us consider the first row of the data matrix:

[X_1,1X_1,2X_1,3 . . . X_1,m] [26]

The error in estimating this function, as indicated in equation 12, is

[E] = [X_1,1 – S₁L₁X_1,2 – S₁L₂ . . . X_1,m – S₁L_m] [17]

But the squared errors are not the value expressed by equation 18; that equation is the result of treating the variables as independent univariate quantities, as stated earlier. To properly take into account the true squared error of the first row of the estimation of [X], we must use an expression representing the multivariate nature of the data:

E = ((X_1,1 – S₁L₁) + (X_1,2 – S₁L₂) + . . . + (X_1,m – S₁L_m))² [27]

and then the total sum-squared error for the entire data set is

SSE = Σ_i ((X_i,₁ – S_iL₁) + (X_i,₂ – S_iL₂) + . . . + (X_i,m – S_iL_m))² [28]

where the summation over i is the summation over the n samples in the data set represented by [X]. We now expand the square in equation 28:

SSE = Σ_i (X_i,₁ – S_iL₁)² + Σ_i (X_i,₁ – S_iL₁) (X_i,₂ – S_iL₂) + . . . + Σ_i (X_i,₁ – S_iL₁) (X_i,m – S_iL_m)) +

Σ_i (X_i,₁ – S_iL₁) (X_i,₂ – S_iL₂) + Σ_i (X_i,₂ – S_iL₂)² + . . . + Σ_i (X_i,₂ – S_iL₂) (X_i,m – S_iL_m)) +

Σ_i (X_i,₁ – S_iL₁) (X_i,m – S_iL_m) + Σ_i (X_i,₂ – S_iL₂) (X_i,m – S_iL_m) + . . . + Σ_i (X_i,m – S_iL_m)² [29]

Equation 29 contains m² terms; now you can begin to appreciate why previous discussions of the theory of principal components invariably and immediately have switched over to the use of matrix notation to describe what is happening (and this isn't the worst of it; wait until we finally get close to the solution to see REALLY large numbers of terms!). However, we eschew the simplification of notation achieved by that means and plunge doggedly on in the direction we have set out in. To do that, however, we will have to simplify equation 29, in an effort to keep the expressions manageable. To do that without losing generality, we make some changes to equation 29. First, instead of calculating the errors with a view toward minimizing them, we will use equation 10 (1) to go back to the variance accounted for by the principal component loadings, because as we showed previously, minimizing the error variance and maximizing the variance accounted for are equivalent operations. Therefore, by replacing the sums of squares of errors with the sums of squares due to the approximation to the data, instead of minimizing the sum of squares, we will have to maximize it.

Second, we limit the number of variables to three; this is enough to show all the effects we need to consider, and extending the concepts and equations to larger numbers of variables is straightforward conceptually, if impractical to do here. Third, we start by considering the use of only a single principal component to estimate the set of data spectra [X], which now consists of only three wavelengths:

Y_i = L_1,1X_i,₁ + L_2,1X_i,₂ + L_3,1X_i,₃ [30]

(Note: in matrix notation, equation 30 is written as

Equation 30, which is the equivalent for the variance accounted for, of equation 17 from which we obtained the variance of the error, is the starting point for our development. The constant subscript 1 indicates that this is the loading for the first principal component.

With these simplifications, equation 27 becomes

SSA = Σ(L_1,1X_i,₁ + L_2,1X_i,₂ + L_3,1X_i,₃)² [31]

where again, the summation is taken over i, representing summation over all the samples.

Expanding the square in equation 31, we obtain

SSA = Σ(L_1,1² X_i,₁² + L_1,1L_2,1X_i,₁X_i,₂ + L_1,1L_3,1X_i,₁X_i,₃ + L_2,1L_1,1X_i,₂X_i,₁ + L_2,1² X_i,₂² + L_2,1L_3,1X_i,₂X_i,₃ + L_3,1L_1,1X_i,₃X_i,₁ + L_3,1L_2,1X_i,₃X_i,₂ + L_3,1² X_i,₃² ) [32]

where again, all summations are taken over i.

(Note: Equation 32 can be written in matrix notation as [L'][X'][X][L]. Demonstrating this by replacing [X] and [L] with their definitions [see equation 30 and its matrix expansion] is a useful exercise for the reader.)

The next step, therefore, is to take derivatives with respect to each of the L_i and set the derivatives equal to zero:

Substituting equation 32 for SSA in equations 33a–33c, we find the indicated derivatives (remembering that the derivative is taken with respect to the various L_i, and not with respect to X) to be

(Note: the matrix representation of equations 34a, b, and c is 2[X]^T [X][L]^T ). (For a demonstration of this, see appendix B.)

There is a very interesting point about the equations comprising the various parts of equation 34. If we look at the parts of equations 34 that involve only the X's, we see that they have the form and are in essentially the same format as the X matrix of the variance–covariance matrix that we have seen before (3,4). In fact, with the mean spectrum subtracted from each of the individual spectra in the data set as we discussed earlier, the matrix represented by equations 34a–34c is the variance–covariance matrix (we ignore temporarily the question of the multiplication by (n – 1), which we demonstrated previously is not pertinent to the key issues we are concerned with here). Therefore, we will consider that the data matrix has had the mean subtracted, and call equation 34 the variance–covariance matrix.

We begin to see, now, how this all starts to hang together; in the process of trying to maximize the sums of the squares accounted for by the loading, we have generated a mathematical structure that implicitly expresses the correlation structure of the underlying data. As we recall from our introductory column, the fact that the loadings are based upon the underlying correlations within the data set is one of the fundamental characteristics of principal components, and here we see how that property arises, naturally and inevitably, from the mathematics underlying it all.

Again, there is a trivial solution to these equations, with L_1,1 = L_2,1 = L_3,1 = 0. This solution is unsatisfactory, however, because although it corresponds to an extremum of the multivariate space, this extremum is a minimum. While appropriate for finding the values for the L_i that minimize the errors, when we are dealing with the variances accounted for by the loadings, we want the value that maximizes the function.

There are at least three solutions that address this. We can examine one of these, and the other two will then be obvious. From the expression on the right-hand side of equation 34a (for example), we can express one of the L_i, let's say L₁, in terms of the other two:

As long as we retain the relationship between L₁ and the other two L_i, we can vary them as we please and the derivative will remain zero. The variance accounted for by the loading, however, will increase as the values of the L_i increase; therefore, the larger the values of the L_i, the more of the variance the loading accounts for. Then, if we take the limit as L_2,1 and L_3,1 → ∞, we get the maximum possible value for variance accounted for, and we can achieve the goals of satisfying equation 34a while accounting for the maximum amount of variance. We can, of course, make a similar argument based upon solving equations 34b or 34c for one of their terms. Thus, we find that the equations have three solutions, indeed, three sets of solutions.

There are several things wrong with these solutions, however, despite the fact that they certainly provide the value that is the maximum variance. First and foremost, they are not computable. Because they all will be infinite, we cannot find values for the various L_i that actually could be computed, much less that could be used to perform other computations. Second, the variance that would be accounted for by these loadings also would be infinite, and therefore not computable; furthermore, a variance with that magnitude would not satisfy equation 10 (1), because equation 10 was predicated on the use of finite values for the total sum of squares. Finally, none of these would be a least-squares solution, because the error variance would be infinite. Thus, any finite value, such as could be obtained by selecting a value for any pair of the L_i arbitrarily, would provide a smaller error variance.

Presumably, nontrivial solutions exist. If so, they are inherent in the equations that were developed. The problem is finding them. There are no immediate mathematical tools that we can use to determine the values of the L_i. There are some possible computer-based approaches that could be tried in our simplified case, however. For example, a computer might be programmed to try different values for L₁, while keeping L₂ and L₃ constant, and find a value that maximizes SSA. Then we could keep L₁ and L₂ constant and search for a value of L₂ that maximizes SSA, and similarly for L₃. Then, if necessary, we could iterate the procedure in an attempt to increase SSA further.

Despite the fact that procedures like this are used in other areas of mathematical investigations, there are several problems with this approach for solving these equations. First, we have no idea about the nature of the values of L. They might be large or small, positive or negative; we don't even know where to start guessing values. Second, we don't know what value SSA will converge to; therefore, we might find a set of values for the L_i that are not correct because they represent a local maximum for one or more of the L_i, but not the correct, global value. In addition, for a small problem such as we set up, this sort of trial-and-error approach might be satisfactory, but in a real situation, with hundreds or thousands of variables, the computation would become unfeasible.

So we see that, while we are making progress, we still have not found the answer to the original question and must continue to seek.

Thus ends false start number 2. In our next installment, we will finally begin to mount our final (and successful) assault on the problem.

Appendix B

Demonstration that 2[X]^T [X][L]^T is the matrix representation of equation 34

First, expand the matrix symbols into the full matrices they represent: 2[X]^T [X][L]^T

Then, carry out the indicated multiplication operations. We do it step-by-step here, for clarity: 2[X]^T [X][L]^T

where all summations are taken over i.

Howard Mark serves on the Editorial Advisory Board of Spectroscopy and runs a consulting service, Mark Electronics (Suffern, NY). He can be reached via e-mail: hlmark@prodigy.net

Jerome Workman, Jr. serves on the Editorial Advisory Board of Spectroscopy and is with Luminous Medical, Inc. (Carlsbad, CA). He can be reached by e-mail at: jerry.workman@luminousmedical.com