The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: So this is a pretty key lecture. This lecture is about principal component analysis, PCA, which is a major tool in understanding a matrix of data. So what is PCA about?

First of all, let me remember the whole point of yesterday's lecture, the singular value decomposition: any matrix A can be broken into r rank-1 pieces, r being the rank of the matrix, and each piece is a sigma times a u times a v transpose. The special thing is that the u's are orthonormal, and also the v's are orthonormal.

So that's the whole matrix. But we have a big matrix, and we want to get the important information out of it, not all the information. People say, in machine learning, that if you've learned all the training data, you haven't really learned anything-- you've just copied it all in. The whole point of neural nets and the process of machine learning is to learn important facts about the data. And here we're at the most basic stage of that.

I claim that the important facts about the matrix are in its largest k singular values-- the largest k pieces. With k equal to 1 we would keep only the largest single piece. But maybe we have space and computing power to handle a hundred pieces, so I would take k equal to 100; the matrix might have rank in the thousands. So I claim that Ak is the best.

Now here's the one theorem for today: Ak, using the first k pieces of the SVD, is the best approximation to A of rank k. So I'll write that down. That really says why the SVD is perfect. OK.
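Here is a minimal NumPy sketch of what Ak means; the 8-by-5 random matrix and k = 2 are arbitrary choices, not from the lecture. It keeps only the first k rank-1 pieces of the SVD.

```python
import numpy as np

A = np.random.randn(8, 5)                          # any m-by-n matrix will do

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the k largest pieces
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # sum of the first k rank-1 pieces

print(np.linalg.matrix_rank(A_k))                  # rank k, here 2
```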
So the statement says that if B, another matrix, has rank k, then the distance from A to B-- the error you make in just using B-- is greater than or equal to the error you make with the best guy: ||A - B|| >= ||A - Ak||.

Now that's a pretty straightforward, beautiful fact. And it goes back to the people who discovered the SVD in the first place. But then a couple of psychologists gave a proof in a later paper, and it's often called the Eckart-Young theorem. There is the theorem. Isn't that straightforward? And the hypothesis is straightforward. That's pretty nice. But of course, we have to think: why is it true?

And to give meaning to the theorem, we have to say what these double bars are. Do you know the right name for this? A double bar around a matrix is called the norm of the matrix. So I have to say something about matrix norms. A norm is a measure of how big the matrix is, and there are many different measures of how large a matrix is. Let me tell you, for today, three possible measures of a matrix.

So, different ways to measure-- I'll call the matrix just A, maybe. But then I'm going to apply the measure to A minus B, and to A minus Ak, and show that the second one is smaller. The norms I'm going to take today will have the special feature that they can be computed from the singular values.

So let me mention the L2 norm. That is the largest singular value, sigma 1. That's an important measure of the size of a general m-by-n matrix A, often called the L2 norm-- and that's where that index 2 goes. Oh-- I should really start with vectors, norms of vectors, and then build up to the norms of matrices.
Let me do norms of vectors over on this side. The L2 norm of a vector-- do we know what that is? That's the regular length of the vector that we all expect: the square root of v1 squared plus ... plus vn squared. It's the length of the hypotenuse in n-dimensional space. That's the L2 norm, because of that 2.

The L1 norm of a vector just adds up the absolute values of those components, without squaring and square-rooting them. Just add them. That's the L1 norm. And you might say, why do we want two norms? And there are more norms; let me just tell you one more. The infinity norm-- and there is a reason for the 1 and the 2 and the infinity-- is the largest of the |vi|.

OK. Have you met norms before? I don't know. These are vector norms, which maybe you have met. Then we're going to have matrix norms, which maybe will be new.

So the L2 norm is the one we usually think of. But the L1 norm has become really, really important, and let me tell you just why; a later section of the notes and a later lecture in this course will develop it. So these are L2, L1, and L infinity. What's special about the L1 norm? Well, it just turned out-- and it was only discovered in our lifetimes-- that when you minimize some function using the L1 norm-- a signal-fitting error, let's say, or whatever you minimize-- the winning vector, the minimizing vector, turns out to be sparse. And what does sparse mean? Sparse means mostly zero components.

Somehow, when I minimize in L2-- which historically goes back to Gauss, the greatest mathematician of all time-- I'm doing least squares, and I find that the minimizer typically has a lot of little nonzero components, because when you square those little ones, they don't hurt much.
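A quick check of those three vector norms in NumPy, on a made-up vector:

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])        # an arbitrary example vector

l2   = np.linalg.norm(v)              # sqrt(9 + 16 + 1), the usual length
l1   = np.linalg.norm(v, 1)           # |3| + |-4| + |1| = 8
linf = np.linalg.norm(v, np.inf)      # largest component in absolute value: 4

print(l2, l1, linf)
```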
But Gauss didn't do L1 minimization. That problem has other names-- basis pursuit-- and it comes into signal processing and sensing. Right. And then it was discovered that if you minimize in that norm-- as we'll see-- you amazingly get a winning vector that is mostly zeros. And the advantage of that is that you can understand what its components are. An answer with many small components-- you have no interpretation for that answer. But for an answer that just has a few nonzero components, you really see what's happening. And then the infinity norm is an important one, too.

OK. Now, what are the properties of a norm? Well, the norm of c times a vector-- just multiplying by 6, or 11, or minus pi, or whatever-- is the size of c times the norm of the vector. Norms have that nice property; they're homogeneous, or whatever word. If you double the vector, you should double the norm-- double the length. That makes sense. And then the important property is the famous triangle inequality: if v and w are two sides of a triangle, and you take the norm of v and add the norm of w-- the two sides-- you get at least the norm of v plus w, the straight side. Yeah. So those are properties that we require, together with the fact that the norm is positive, which I won't write down. But it's important too.

OK. So those are norms, and the same requirements apply to matrix norms. If I double the matrix, I want to double its norm. And of course, that works for the 2-norm. And the triangle inequality for that norm says that the largest singular value of A plus B-- two matrices-- is less than or equal to the largest singular value of A plus the largest singular value of B.
And we won't take class time to check minor, straightforward things like that. So now let me continue with the three norms that I want to tell you about. The 2-norm is a very important one. Then there is another norm, named with an F, after Frobenius-- sorry about that. And what is that norm? That norm looks at all the entries in the matrix-- just as if it were one long vector-- and squares them all, and adds them up. So in a way, it's like the 2-norm for a vector. Shall I put a square root? Maybe I should. It's the square root of the sum of the squares of all the little people in the matrix: a11 squared plus a12 squared, and so on, until you finally get to amn squared. You just treat the matrix like a long vector and take the square root, just like so. That's the Frobenius norm.

And then finally, not so well known, is something that's more like L1. It's called the nuclear norm, and not all the faculty would know about this nuclear norm. It is the sum of the singular values. I guess there are r of them, so that's where we would stop.

OK. So those are three norms. Now why do I pick those three norms? Here's the point: for those three norms, this Eckart-Young statement is true. I could cook up other matrix norms for which it wouldn't work. But for these three highly important norms, the statement holds: the closest rank-k approximation is found from the first k pieces. You see, that's a good thing, because this is what we compute from the SVD. So now we've solved an approximation problem: the best B is Ak. And the point is, we could use any of those norms. Well, somebody finally came up with a proof that does all three norms at once. In the notes, I do the 2-norm case separately from Frobenius.
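All three of these matrix norms can be read off from the singular values; here is a small NumPy check on an arbitrary matrix:

```python
import numpy as np

A = np.random.randn(6, 4)                  # an arbitrary m-by-n matrix
s = np.linalg.svd(A, compute_uv=False)     # singular values, largest first

spectral  = s[0]                           # L2 norm: the largest singular value
frobenius = np.sqrt(np.sum(s**2))          # same as treating A as one long vector
nuclear   = np.sum(s)                      # sum of all the singular values

# NumPy's built-in matrix norms agree.
assert np.isclose(spectral,  np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear,   np.linalg.norm(A, 'nuc'))
```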
And actually, I found, in an MIT thesis-- I was just reading a course 6 PhD thesis, and the author, who is speaking tomorrow, or Friday, in IDSS, Dr. [? Srebro, ?] found a nice new proof for Frobenius. And it's in the notes, as well as an older proof.

OK. You know, as I talk here, I'm not too sure whether it is essential for me to go through the proof, either in the L2 norm-- which takes half a page in the notes-- or in the Frobenius norm, which takes more. I'd rather you see the point. And what is special about these norms of a matrix? These norms depend only on the singular values. Right? The L2 norm, at least, clearly depends only on the singular values-- it's the largest one. The nuclear norm is the sum of them all.

That nuclear norm comes into the Netflix competition, by the way. This was the right norm to win a zillion dollars in the Netflix competition. So what did Netflix do? It ran a math competition. It had movie preferences from many, many Netflix subscribers, who gave their rankings to a bunch of movies. But of course, none of them had seen all the movies. So the matrix of rankings-- rankers in one direction, movies in the other-- is a very big matrix, but it's got missing entries. If the ranker didn't see the movie, he or she isn't ranking it.

So what's the idea about Netflix? They offered something like a million-dollar prize, and a lot of math and computer science people fought for that prize. And over the years, they got to something like 92, 93, 94 percent right. But it turned out that, in the end, you had to use a little psychology of how people voted. So it was partly about human psychology.
But it was also a very large matrix problem with an incomplete matrix, and that matrix had to be completed. You had to figure out what the ranker would have said about, say, The Post, if he hadn't seen it but had ranked several other movies-- All the President's Men, or whatever-- given a ranking to those. And that's a recommender system, of course. That's how you get recommendations from Amazon. They've got a big matrix calculation there, and if you've bought a couple of math books, they're going to tell you about more math books-- more than you want to know. Right. OK.

So anyway, it just turned out that this nuclear norm was the right one to minimize. I can't give you all the details of the Netflix competition, but this turned out to be the right norm for the minimum problem-- not least squares, which would use a different norm, but a best nuclear-norm completion of the matrix.

And now it's being put to much more serious uses, for MRI-- magnetic resonance imaging. When you go in, it's a noisy system, but it gives an excellent picture of what's going on. So I'll just write Netflix here, and then MRIs.

So what's the point about MRIs? If you stay in long enough, you get all the numbers-- there isn't missing data. But, as with a child, you might want to have the child in for just a few minutes, and that's not enough to get a complete picture. You have, again, missing data in your matrix-- in the image from the MRI. So then, of course, you've got to complete that matrix. You have to fill in: what would the MRI have seen in those positions where it didn't look long enough? And again, the nuclear norm is a good one for that.

OK. So there will be a whole section on norms-- maybe it's on Stellar by now.
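As a rough illustration of nuclear-norm-style completion, here is a toy soft-thresholding loop in NumPy. This is only a sketch-- not the algorithm that won the Netflix prize-- and the threshold tau, the iteration count, and the tiny example matrix are all made up.

```python
import numpy as np

def complete(A_obs, mask, tau=0.1, n_iter=500):
    """Fill in missing entries (mask == False) by repeatedly shrinking the
    singular values, which pushes the nuclear norm down."""
    X = np.where(mask, A_obs, 0.0)                  # start with zeros in the holes
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt     # soft-threshold the sigmas
        X = np.where(mask, A_obs, X)                # keep observed entries fixed
    return X

# A rank-1 "ratings" matrix with two entries hidden.
A_true = np.outer([1.0, 2.0, 3.0], [2.0, 1.0, 2.0])
mask = np.ones_like(A_true, dtype=bool)
mask[0, 2] = mask[2, 0] = False

print(complete(A_true, mask))    # the hidden entries come out near the true 2 and 6
```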
OK. So I'm not going to prove it-- let me just say, what does this theorem tell us? I'll just give an example-- maybe I'll start with the example that's in the notes. Suppose k is 2, so I'm looking among all rank-2 matrices. And suppose my matrix is diagonal, with 4, 3, 2, 1 on the diagonal and all the rest zeros. It's a rank-4 matrix, and I can see its singular values-- they're sitting there. Those would be the singular values, and the eigenvalues, and everything, of course.

Now, what would A2 be? What would be the best approximation of rank 2 to that matrix, in whichever of those norms? What would A2 do? Yeah-- it would keep the 4 and the 3. It would pick the two largest pieces. So I'm looking at Ak with k equal to 2, so it has to have rank 2; my matrix has rank 4, and the biggest pieces are those.

OK. So the theorem says that if I had any other rank-2 matrix B, it would be further away from A than this A2. It says that this is the closest. And could you think of a matrix that could possibly be closer, and be rank 2? Rank 2 is the tricky thing. The matrices of rank 2 form a kind of crazy set: if I add a rank-2 matrix to a rank-2 matrix, probably the rank is up to 4. So the rank-2 matrices are all kind of floating around in their own little corners. This A2 looks like the best one. But in the notes I suggest-- well, what about another B? What about this B? For the 4 and the 3, I could get closer-- maybe not exact, but closer-- maybe by taking a block of 3.5s. But I've only got rank 2 to play with, so I'd better make that block a rank-1 piece, and then do the same for the 2 and the 1. So you see what I thought of? I thought, man, maybe that's better-- like on the diagonal, I'm coming closer.
Well, I'm not getting it exactly here. But then I've got one rank-1 piece left to play with, and I'll put, maybe, 1.5s down here. OK. So that's a rank-2 matrix-- two little rank-1s. And on the diagonal, it's better: with the 3.5s I'm only missing by a half, and with the 1.5s I'm missing by a half. So I'm only missing by a half on the diagonal, where A2 was missing by 2 and by 1. So maybe I've found something better. But I had to pay a price: those entries off the diagonal that keep the rank low. And they kill me. So that B will be further away from A. If I computed A minus B and took its norm, I would see it's bigger than the norm of A minus A2.

Yeah. So, you see the point of the theorem? That's really what I'm trying to say: it's not obvious. You may feel, well, it's totally obvious-- pick the 4 and the 3, what else could do it? But it depends on the norm, and so on. So Eckart and Young had to think of a proof, and other people, too.

OK. Now, you could also object that I started with a diagonal matrix here. That's so special. But what I want to say is that the diagonal matrix is not that special, because I could take A-- so let me now call this diagonal matrix D-- or let me call it sigma, to give it another, more appropriate name. So this could be the sigma matrix, and there could be a U on the left of it and a V transpose on the right of it, so that A is U sigma V transpose. So this is my sigma, and this U is like any orthogonal matrix, and this is like any V transpose. Right? I'm just saying: here's a whole lot more matrices. There was just one diagonal matrix, but now I have all these matrices, with U's multiplying on the left and V transposes on the right.
And I ask you this question: what are the singular values of that matrix A? Here the singular values were clear-- 4, 3, 2, and 1. What are the singular values of this matrix A, when I've multiplied by an orthogonal guy on both sides? That's a key question. What are the singular values of that one?

AUDIENCE: 4, 3, 2, 1.

GILBERT STRANG: 4, 3, 2, 1. They didn't change. Why is that? Because this has the SVD form-- orthogonal times diagonal times orthogonal-- and that diagonal contains the singular values. What I'm saying is that my, and our, trivial little example here actually covers all 4-by-4s that have these singular values. My whole problem is orthogonally invariant, a math guy would say: when I multiply by a U, or a V transpose, or both, the problem doesn't change. Norms don't change. Yeah, that's the point-- I realize it now. This is the point: if I multiply the matrix A by an orthogonal matrix Q, it has all the same norms; multiplying doesn't change the norm.

Actually, that was true way back for vectors, with this L2 length. What's the deal about vectors? Suppose I have a vector v, and I've computed its norm, its hypotenuse. And now I look at Q times v in that same 2-norm. What's special about that? I took any vector v, and I know what its length is. Now I multiply by Q. What happens to the length? It doesn't change. An orthogonal matrix-- you could think of it as just rotating the triangle in space; the hypotenuse doesn't change. And we've checked that, because the check is to square it: then you're doing the transpose of Qv times Qv, you simplify it the usual way, and then you have Q transpose Q equal to the identity-- and you're golden.
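Both of those claims are easy to check numerically. Here is a small NumPy sketch using the 4, 3, 2, 1 example; the matrix B below is one reading of the 3.5 / 1.5 construction from the board, and Q is a random orthogonal matrix.

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])
U, s, Vt = np.linalg.svd(A)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]      # best rank 2: diag(4, 3, 0, 0)

B = np.array([[3.5, 3.5, 0.0, 0.0],             # two rank-1 blocks, so rank 2
              [3.5, 3.5, 0.0, 0.0],
              [0.0, 0.0, 1.5, 1.5],
              [0.0, 0.0, 1.5, 1.5]])

for p in (2, 'fro', 'nuc'):                     # the three norms from today
    print(p, np.linalg.norm(A - A2, p), np.linalg.norm(A - B, p))
    # A - A2 gives the smaller error every time, as Eckart-Young promises

Q, _ = np.linalg.qr(np.random.randn(4, 4))      # a random orthogonal matrix
print(np.linalg.svd(Q @ A, compute_uv=False))   # still 4, 3, 2, 1
```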
So the result is that Qv has the same length as v. Let me put it in a sentence now: that norm is not changed by an orthogonal matrix. And these matrix norms are not changed by orthogonal matrices either, because if I multiply the A here-- my U sigma V transpose-- by a Q, then I have QU sigma V transpose. And what is really the underlying point? That QU is an orthogonal matrix, just as good as U. So let me put this down: QA would be QU sigma V transpose. And now I'm asking you: what's the singular value decomposition of QA? I hope you may actually be seeing it. What are the singular values? What's the diagonal matrix? Just look there for it. The diagonal matrix is sigma. What goes on the right of it? The V transpose. And what goes on the left of it? QU. And that's orthogonal times orthogonal-- everybody in this room has to know that if I multiply two orthogonal matrices, the result is again orthogonal. So I can multiply by Q, and it only affects the U part, not the sigma part. And so it doesn't change any of those norms.

OK. So that's fine. That's what I wanted to say about the Eckart-Young theorem-- not proving it, but hopefully giving you an example of what it means: that A2 is the best rank-2 approximation to that matrix. OK.

So that's the key math behind PCA. So now I have to-- want to, not just have to-- tell you about PCA. So what's that about? We have a bunch of data, and we want to see-- so let me take a bunch of data points, say points in the plane. So I have a bunch of data points in the plane. So here's my data: the first vector-- call it v1--
then v2, and so on. These are just two-component guys-- just columns with two components. So I'm just measuring height and age, and I want to find the relationship between height and age. The first row holds the heights in my data, and the second row holds the ages. So I've got, say, a lot of people, and these are the heights, and these are the ages. I've got n points in 2D, and I want to make sense out of that. I want to look for the relationship between height and age-- actually, for a linear relation between height and age.

So first of all, these points are all over the place. So the first step that a statistician takes is to get mean 0-- get the average to be 0. From row 1, the heights, I subtract the average height. So the matrix I'm really going to work on is my matrix A: the data matrix minus the average height in every entry of the first row, and the average age in every entry of the second row. I'm subtracting the mean. (Oh, that was a brilliant notation-- a sub a-- which can't really be a sub a.) You see what subtracting that matrix of means has done? It has made each row of A add to-- now, add to what? If I have a bunch of things, and I've subtracted off their mean, so the mean, or the average, is now 0, then those things add up to--

AUDIENCE: Zero.

GILBERT STRANG: Zero. Right. I've just brought these points into something centered here: this axis is age, and this is height. And let's see-- by subtracting, it is no longer unreasonable to have negative age and negative height. The little kids, when I subtracted off the average age, ended up with a negative age; the older ones ended up still positive. And somehow I've got a whole lot of points, but now their mean is zero.
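Here is a small sketch of that centering step; the heights and ages below are made up, with one column per person.

```python
import numpy as np

A0 = np.array([[160., 172., 181., 148., 169.],   # heights, row 0
               [ 12.,  25.,  40.,   8.,  22.]])  # ages, row 1

means = A0.mean(axis=1, keepdims=True)   # average height and average age
A = A0 - means                           # subtract the mean from each row

print(A.sum(axis=1))                     # each row of A now adds to zero
```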
Do you see that I've centered the data at 0, 0? And what am I looking for here? I'm looking for the best line. That's what I want to find. And that would be a problem in PCA: what's the best linear relation? Because PCA is limited-- PCA isn't all of deep learning by any means. The whole success of deep learning was the final realization, after a bunch of years, that you had to have a nonlinear function in there to model serious data. But here's PCA as a linear business, and I'm looking for the best line.

And you will say, wait a minute-- I know how to find the best line, just use least squares. Gauss did it; it can't be all bad. But PCA-- I was giving a talk in New York when I was just learning about this, and somebody said, what you're doing with PCA has to be the same as least squares-- it's finding the best line. And I knew it wasn't, but I didn't know how to answer that question best. And now, at least, I know better.

So, the best line in least squares-- can I remind you about least squares? Because this is not least squares. In least squares, I have some data points, and I have a best line that goes through them. In least squares, I don't always center the data to mean zero, but I could. But what do you minimize in least squares? If you remember the picture in linear algebra books, you measure the errors-- say three errors-- and it's how much you're wrong at those three points. Those are the three errors: the difference between Ax and b, the b minus Ax that you square. And you add up those three squared errors. And what's different over here? I mean, there are more points, but that's not the point-- that's not the difference.
The difference is that in PCA, you're measuring perpendicular to the line. You're adding up the squares of all these little perpendicular distances and minimizing. So you see it's a different problem, and therefore it has a different answer. And this answer turns out to involve the SVD-- the sigmas. Whereas for the least squares answer, you remember from ordinary linear algebra: when you minimize that sum of squares, you get to an equation-- what equation for the best x? Do you remember?

AUDIENCE: [INAUDIBLE]

GILBERT STRANG: Yeah. What is it now? Everybody should know. And we will actually see it in this course, because we're doing the heart of linear algebra here-- we haven't done it yet, though. And tell me again, what equation do I solve for that problem?

AUDIENCE: A transpose A.

GILBERT STRANG: A transpose A x hat equals A transpose b-- called the normal equations. It's what statistics language calls regression; that's a regression problem. This is a different problem.

OK. But now you see the answer. Well, they both involve A transpose A-- that's sort of interesting, because you have a rectangular matrix A, and then sooner or later, A transpose A is coming. But least squares involves solving a linear system of equations, so it's fast, and we will do it. It's very important-- it's probably the most important application in 18.06. But it's not the same as this one. In 18.06, maybe the last day is PCA. I didn't write out those letters: Principal Component Analysis, PCA-- which statisticians have been doing for a long time.
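For the least squares side, the normal equations look like this in code; the little three-point fit is a made-up example, not one from the lecture.

```python
import numpy as np

# Fit b ~ C + D*t at three points by least squares (vertical errors, squared).
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([1., 2., 2.])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations: A^T A x = A^T b
print(x_hat)                                # the best C and D
```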
We're not doing something brand new here. But-- so how does a statistician think about this problem, or about that data matrix? You have a matrix of data-- 2 rows and many, many columns, so many samples-- and we've made the mean zero. That's the first step a statistician takes, to deal with the mean. What's the next step? What else does a statistician do with data to measure its size? There's another number that goes with the mean, and it's the variance-- the mean and the variance. So somehow we're going to do variances. And it will really be covariances, because we have two sets of data-- heights and ages. We're really going to have a covariance matrix, and it will be 2 by 2. Because it will tell us not only the variance in the heights-- that's the first thing a statistician would think about: some small people, some big people-- and the variance in the ages, but also the link between them. How are the height-age pairs related-- does more age go with more height? And of course it does. That's the whole point here.

So it's this covariance matrix-- or the sample covariance matrix, to give it its full name. So, just touching on statistics for a moment: when we see that word "sample" in the name, what is it telling us? It's telling us that this matrix is computed from the samples, not from a theoretical probability distribution. We might have a proposed distribution-- the height follows the age by some formula-- and that would give us theoretical variances. We're doing sample variances, also called the empirical covariance matrix. Empirical-- that word means: from the information, from the data. So that's what we do. And it is exactly AA transpose. You have to normalize it by the number of data points, N, and then, for some reason best known to statisticians, it's N minus 1. And of course, they've got to be right-- they've been around a long time-- and it should be N minus 1, because somehow one degree of freedom was accounted for when we made the mean 0. So, anyway, no problem; the N minus 1 is not going to affect our computation here.
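In code, that sample covariance is exactly AA transpose over N minus 1, and NumPy's cov uses the same convention. The centered data below is made up, with ages built to follow heights roughly linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(0.0, 10.0, 500)                # made-up heights
ages = 0.8 * heights + rng.normal(0.0, 3.0, 500)    # ages roughly follow heights
A = np.vstack([heights, ages])
A = A - A.mean(axis=1, keepdims=True)               # make each row mean exactly 0

N = A.shape[1]
S = A @ A.T / (N - 1)          # 2-by-2: variances on the diagonal, covariance off it

assert np.allclose(S, np.cov(A))   # np.cov uses the same N - 1 normalization
```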
This is the matrix that tells us what we've got to work with-- the matrix AA transpose. And then-- so we have this problem. I guess we really have a minimum problem. Yeah, what problem are we solving? Our problem was not least squares-- not the same as least squares; similar, but not the same. We want to minimize. So we're looking for that best line, where age equals some number c times the height. Maybe it would have been better to put age here and height there-- no, no, because then there are two unknowns. So I'm looking for c-- looking for the number c. And with just two minutes of class left: what is that number c going to be, when I finally get the problem stated properly and then solve it? I'm going to learn that the best ratio of age to height is sigma 1-- sigma 1, the one that tells us how those two are connected, and the orthogonal-- and what will be the best-- yeah. No, maybe I didn't answer that right-- maybe I didn't get that right, because I'm looking for the vector that points in the right direction. Yeah, I'm sorry. I think the answer is: it's got to be there in the SVD. It's the vector you want-- it's the principal component you want. Let's do that properly on Friday.
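The lecture leaves the derivation for Friday, but here is where that answer lives: the first singular vector of the centered data matrix gives the direction of the best perpendicular-error line, and its slope is the number c. The data is the same made-up sample as in the covariance sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(0.0, 10.0, 500)
ages = 0.8 * heights + rng.normal(0.0, 3.0, 500)
A = np.vstack([heights, ages])
A = A - A.mean(axis=1, keepdims=True)      # centered 2-by-n data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                               # first principal component: a direction
c = u1[1] / u1[0]                          # slope of the best line: age = c * height

print(c)                                   # roughly the 0.8 used to build the data
```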
I hope you see-- because this was a first step away from the highlights of linear algebra to problems solved by linear algebra, and practical problems. And my point is that the SVD solves these.