Singular Value Decomposition

By CS Wagner
Last updated November 2011.

Background

Singular value decomposition (SVD) is a bit of a trick from mathematics that takes a matrix of values and changes it into three matrices. For example, matrix M becomes matrices U times S times V:

M = USV

Why is this useful? There are three problems when dealing with large sets of data:

  1. Synonymy: More than one attribute is used to represent the same thing, like "Date of Birth" and "Age".
  2. Polysemy: A single attribute represents more than one value, like "Blood Pressure" (which contains Systolic Blood Pressure and Diastolic Blood Pressure).
  3. Sparsity: Missing data.

The following explains what SVD contains and how it helps handle synonymy, polysemy, and sparsity. It is NOT a description of how to calculate SVD. In fact, SVD is not calculated at all. Instead, I will use what is best referred to as median value decomposition (MVD), but I will use it the same way as SVD to show how SVD works.

The Problem

We are going to start with a matrix. It is actually a description of four users, A, B, C, and D. Each user is described by two attributes, X, and Y. In the real world, you would have hundreds or thousands of users and dozens of attributes. I use four users and two attributes so the math won't take up the wall of an office building and we can graph the results easily. So, here's our users and attributes:

X Y
A 9 6
B 8 20
C 12 6
D 7 18

We can graph these values to get a better idea of what we are using.

Notice that users B and D are very similar. A and C are similar also, just not as similar. You can also use this as an example of what it means to be "similar" when using numeric data. Usually, it means that the angle between the vectors (when graphed) is small.

Our goal is to make three matrices. The first matrix will have four rows, one for each of the users. The second matrix will be a two by two matrix, which we'll call a scale. The third matrix will have two columns, one for each attribute. So, we'll convert a matrix that contains all the information about users and attributes into three matrices. One will contain information about users, one will contain information about attributes. The scale will just be used to adjust the values in the users and attributes matrices so they all fall between -1 and 1 (normalized). It is easier to compare values when they are all in the same range.

The First Pass

We'll be doing this in multiple passes. This is the first one. Each pass will be identical and will get us closer to a good answer to the problem. It is basically just starting with an estimation and refining it until you get the answer.

PLEASE NOTE: The following is not the PROPER calculation of SVD. When properly calculating SVD, it is possible to explain how it is done, but not why. So, this example explains why, but I will fudge a bit on how.

First, we need to make either the four row user table or the two column attribute table. I like to do the easy one first, so we'll start with a two column attribute table. To make it real easy, we'll make it one row. Each value will be the median (center) value of the attribute: (max+min)/2. For X, we have (12+7)/2 = 9.5. For Y, we have (20+6)/2 = 13. We don't have values for the users, so we'll just use a, b, c, and d as placeholders. Also, we'll worry about the scale thing later.

X Y
A 9 6 A a X Y
B 8 20 B b × 9.5 13
C 12 6 C c
D 7 18 D d

We are halfway through this pass. We split the original matrix into two matrices. In the new XY table, we have two values (9.5 and 13). Each one is a sort of description of the original values for X and Y. They do not describe the users in any way. So, we need to describe the users now.

First, let's look at the math. You need to understand matrix multiplication do to this. To get the original value of 9 for A.X, we need to multiply a×9.5. So, a×9.5=9. However, we can also see that a×13=6. There is no valid value for a to satisfy both equations. So, we'll take the median of the two possible values and let a=(9/9.5+6/13)/2=0.704. Similarly, b=(8/9.5+20/13)/2=1.190, c=(12/9.5+6/13)/2=0.862, and d=(7/9.5+18/13)/2=1.061. I rounded these to three digits, but in your computer program you would use the most precise value possible.

X Y
A 9 6 A 0.704 X Y
B 8 20 B 1.190 × 9.5 13
C 12 6 C 0.862
D 7 18 D 1.061

Now, we have an estimation for our original table. One table has four rows, each one containing a value that in some way describes the original users. The other table has two columns, each containing a value we derived from the original attributes. Let's multiply them together and see how well our first pass performed.

X Y
A 0.704 X Y A 6.692 9.158
B 1.190 × 9.5 13 = B 11.308 15.474
C 0.862 C 8.192 11.211
D 1.061 D 10.077 13.790

Looking at the graph (the new vectors are in red), it isn't even close. Look at A. The values for X and Y are pretty much backwards. But, this is a first pass. Remember, we have to refine our estimate to get a better answer.

Refinement

Because our estimate is pretty bad, we are going to repeat the same process. We can't do it on the original table. We'd get the same results. So, we'll subtract our new (completely wrong) table from the original table. That is a measure of how far it is off. Then, we'll use our proces to make an estimate of how far it is off. Adding our estimate of how far it is off to the original estimate should refine our answer. First, the subtraction.

X Y X Y X Y
A 9 6 A 6.692 9.158 A 2.308 -3.158
B 8 20 - B 11.308 15.474 = B -3.308 4.526
C 12 6 C 8.192 11.211 C 3.808 -5.211
D 7 18 D 10.077 13.790 D -3.078 4.211

Side note: Notice that A and C have a positive X and negative Y in the difference table. B and D have a negative X and positive Y. Remember the very early observation about the similarity of A and C? That is why they are similar here.

Now, we will calculate our new XY table by using the average of the max and min values of each attribute just like before.

X Y
A 2.308 -3.158 A a X Y
B -3.308 4.526 B b × 0.25 -0.342
C 3.808 -5.211 C c
D -3.078 4.211 D d

And, once again, a=(2.308/0.25+(-3.158)/(-0.342))/2=9.231. We do the same for b, c, and d.

X Y
A 2.308 -3.158 A 9.321 X Y
B -3.308 4.526 B -13.231 × 0.25 -0.342
C 3.808 -5.211 C 15.231
D -3.078 4.211 D -12.308

Now, a special trick. If you want to multiply two matrices, you multiply row by column, adding each row×column value together. We want to add our new estimation to our original estimation. So, instead of having our original two matrices added to our new two matrices, we can merge them all together and just have two matrices. The first column of our user matrix will be our first estimation. The second column will be our second estimation. Similarly, the first row of our attribute matrix will be our first estimation. The second row will be our second estimation. Then, we can multiply them together and see what we get.

X Y
A 0.704 9.231 X Y A 9 6
B 1.190 -13.231 × 9.5 13 = B 8 20
C 0.862 15.231 0.25 -0.342 C 12 6
D 1.061 -12.308 D 7 18

No way! We got the original values (if you did this on a computer and didn't round values to three digits past the decimal point). Not only is that cool. We also have two values and only two values per user and per attribute that describe the user or attribute. If you want to know what X is like, I can say it is like {9.5, 0.25}. I know, that isn't very helpful, but it is just two values that are independent of the users. Previously, I had to give you four values (one for each user).

There is a minor issue. 9.5 and 0.25 aren't in a similar range. There are values from -13 to 13 here. It would be nicer if everything was bounded by -1 to 1, then comparison of values would be easier.

Scaling

If you have a bunch of values, you can normalize them from -1 to 1 with a rather simple trick. First, get the square root of the sum of the squares of all the values. In our case, look at the first column of our estimated user matrix. It has {0.704, 1.190, 0.862, 1.061}. First, I square them all and get {0.496, 1.417, 0.744, 1.125}. Then, I sum those values and get 3.782. Finally, I take the square root of that and get 1.945. This is my scaling factor for that column.

I get a scaling factor for each column of the user matrix and each row of the attribute matrix. The scaling factor of the second column of the user matrix is 25.372. The top row of the attribute matrix has a scaling factor of 16.101. The bottom row has a scaling factor of 0.424.

To normalize the values, we just divide them by the scaling factor. For example, I divide all the values in the first column of the user matrix by 1.945. That means that I have to multiply them by 1.945 again when I do my calculation of the original data. Doing so looks like:

X Y
A 0.362 0.364 X Y A 9 6
B 0.612 -0.521 × 1.945 0 × 16.101 0 × 0.590 0.807 = B 8 20
C 0.443 0.600 0 25.372 0 0.424 0.590 -0.807 C 12 6
D 0.545 -0.485 D 7 18

Notice that I have to place the first scaling factor in the upper-left cell of the matrix to have it only multiply with the first estimate. I place the second scaling factor in the lower-right cell so it works with the second estimate values. I place zeroes everywhere else just to make it a complete matrix. But, there is no reason to have two scaling tables. I can easily multiply them together and get one scaling table.

X Y
A 0.362 0.364 X Y A 9 6
B 0.612 -0.521 × 31.312 0 × 0.590 0.807 = B 8 20
C 0.443 0.600 0 10.750 0.590 -0.807 C 12 6
D 0.545 -0.485 D 7 18

This is the final work. The left matrix has a row per user. Each row describes the user. Each value is between -1 and 1. The middle table is a scaling factor. If anything, it merely describes the variance in the original values. The right table has a column per attribute. Each column describes the attribute. Each value is between -1 and 1.

Eigenvalues

The example above was chosen because it works with the simple mathematical functions I chose to use. It won't work with just any arbitraty value. Why? When calculating a, b, c, and d from the estimated attribute values, you divide by one of the calculated values. What if you end up trying to divide by zero? The whole thing fails. We need a function that will do basically the same thing, but won't suffer from division by zero.

The proper calculation of SVD uses Eigenvectors. What is an Eigenvector? The best I can suggest is reading the Wikipedia article, which won't help much. In a very elementary case, if you multiply a vector by a square matrix, you will likely change the direction and magnitude of the vector. If the vector is an Eigenvector of the square matrix, multiplication may change the magnitude, but not the direction of the vector (note: changing magnitude to a negative magnitude is allowed, which is identical to pointing the vector in the opposite direction). So, an Eigenvector is kind of a way to describe a matrix.

For SVD, Each column of the left matrix and row of the right matrix is an Eigenvector. When producing the values of the matrix, the row-by-row calculation we performed above is lost. Entire colums or rows are calculated at once. It makes it harder to see that the values of the matrices are independent of one another.

So, what is the true SVD of our example above?

X Y
A 9 6 A 0.302 -0.482 X Y
B 8 20 B 0.661 0.332 × 32.224 0 × 0.507 0.862
C 12 6 C 0.349 -0.747 0 9.779 -0.862 0.507
D 7 18 D 0.592 0.316

As expected, if you multiply that out (without the roundoff to three decimal places shown here), you will get the original values. Just like the example above, the rows of the left matrix describe the users, one row per user. The columns of the right matrix describe the attributes, one column per attribute. But, unlike the example above, you won't have the threat of division by zero using Eigenvectors.

Synonymy

So, how does SVD handle synonymy? Again, I will be using median calculations instead of real SVD to make it clear that the values in the decomposition are independent of one another.

To create synonymy, assume that user B is added to the population twice. So, we have to distinct copies of B in our data (one as B and one as the new user E).

X Y
A 9 6
B 8 20
C 12 6
D 7 18
E 8 20

The first step is creation of the XY matrix. It is done by taking the median of each column of the original data. Repeating 20 as a maximum value in the Y column does not change the median value. If we were using Eigenvectors (as in read SVD calculations), the Eigenvector's relationship to the original user data would not be changed because it is perpendicular to each dimension in the data. Adding a new dimension doesn't change that it is perpendicular to each of the previous dimensions.

Then, we use the median value of X and Y to get two possible values for a, b, c, d, and e. The median of those two values is placed in each row of the first table. The result is:

X Y
A 9 6 A 0.704 X Y
B 8 20 B 1.190 × 9.5 13
C 12 6 C 0.862
D 7 18 D 1.061
E 8 20 E 1.190

Compare this with the original table that didn't have E in it. You can see that the values of X and Y are independent of the user data. Further, each user row is independent of the other user rows. So, the data did not change. The only way to force a change is to add a value that is beyond the range of existing values - which implies that it is not a copy of data that already exists.

If I were to continue calculating the second pass, the end result would be the same. User E is just a copy of user B. The other users and attributes are not changed.

X Y
A 0.704 9.231 X Y A 9 6
B 1.190 -13.231 × 9.5 13 = B 8 20
C 0.862 15.231 0.25 -0.342 C 12 6
D 1.061 -12.308 D 7 18
E 1.190 -13.231 E 8 20

However, the scaling is affected. The added value changes the normalization a bit. The values are little different, but the overall design is the same. B and E remain the same. The relative values of comparing any user to another user (or attribute to another attribute) does not change. The only real change is the scaling values, which changes the scale of the normalized values.

X Y
A 0.309 0.326 X Y A 9 6
B 0.522 -0.462 × 36.712 0 × 0.590 0.807 = B 8 20
C 0.378 0.532 0 12.124 0.590 -0.807 C 12 6
D 0.465 -0.430 D 7 18
E 0.522 -0.462 E 7 18

Polysemy

So, how does SVD handle polysemy? Once again, I will be using median calculations instead of real SVD to make it clear that the values in the decomposition are independent of one another.

This time, user E will be the sum of all values for all other users. This is important because it gives us a value that is radically outside the range of what we were previously using. Also, this is a composite field. It is a conglomeration of data from the other fields.

You can go through all the steps we did before. You will end up with the following SVD:

X Y
A 0.164 0.363 X Y A 9 6
B 0.278 -0.521 × 68.987 0 × 0.590 0.807 = B 8 20
C 0.201 0.600 0 10.760 0.590 -0.807 C 12 6
D 0.248 -0.485 D 7 18
E 0.891 -0.042 E 36 50

Looking at the columns of the user table, you can see that the relative values of A, B, C, and D are the same as they always have been. The scale changes because we have much larger values now. The attribute table hasn't changed because the attributes are independent of the users - adding a user doesn't add an attribute.

Sparsity

Finally, how does SVD handle sparsity? This is where I tend to disagree with the accepted answer. So, first I'll give you the accepted answer.

Suppose we have or original table before we added user E. It looked like:

A 0.362 0.364 X Y
B 0.612 -0.521 × 31.312 0 × 0.590 0.807
C 0.443 0.600 0 10.750 0.590 -0.807
D 0.545 -0.485

Now, assume we have a new user E. We somehow know that if E was added to the user table, it would have the values { 0.5, 0.4 }. We want to know what the X attribute for E would be. Well, we know the scale and we know the attribute values for X are { 0.590, 0.590 }. So, we just multiply the matrices together:

X E.X
E 0.5 0.4 × 31.312 0 × 0.590 = 11.774
0 10.750 0.590

This is great. We can calculate rather accurate guesses for any missing values. Of course, we are expecting that the missing values will not change the scale or attribute values. That is where the problem comes in.

How did we know that E was { 0.5, 0.4 }? We'd have to have X and Y values in E to include it in our SVD calculation. What many people do is place temporary "best guess" values wherever a value is missing. Then, they calculate the SVD. Then, they go back and replace the guesses with a recalculation using the user, scale, and attribute tables. But, consider what would happen if you did that.

First, we place guesses in all missing values. Then, we calculate and SVD that accurately produces the values we guessed. Finally, we use SVD to fill in the values, which turn out to be nearly the same as our guesses. So, we decide that since the magical SVD produces the same values as we guessed, our guesses were good. Really? No. The SVD reproduces our guesses just as it is designed to do. If we change the guesses, the SVD will change and produce our new guesses.

So, it is clear we can't use guesses to get values for user E. Suppose that E was only missing X and had a value for Y. Another way to handle sparsity is to calculate two SVD's. One will include E, but omit X - avoiding the missing X value in E. Then, the other SVD will include X, but omit E. Finally, both SVD's are somehow merged (using whatever trick the implementer feels like using at the time). E gets two values, which are multiplied by the scale and attribute matrix to get the value X for E. But, this is all based on somehow merging two SVD's. Can that really be done? I feel that any way you do it, you will end up introducing more error to the whole thing than any benefit you will get out.

So, all in all, I personally don't feel that SVD is useful at handling sparsity. But, I'm not considered an expert, so my opinion doesn't really matter.

Conclusion

Hopefully you can see that SVD is useful in many ways. Mainly, it allows you to work with large sets of data without having your results skewed by synonymy or polysemy. Further, most people feel it allows you to produce good estimates to hide sparsity. The next step is to learn how to actually calculate SVD. The best way is to simply get an SVD calculator (like this one). Calculation of Eigenvectors is difficult and time consuming. It is better to simply use the work of others and get straight to the benefits of SVD.