The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: So this is a pretty key lecture. This lecture is about principal component analysis, PCA, which is a major tool in understanding a matrix of data. So what is PCA about?

First of all, let me remember the whole point of yesterday's lecture, the singular value decomposition: any matrix A can be broken into r rank-1 pieces, r being the rank of the matrix, and each piece is a sigma times a u times a v transpose. The special thing is that the u's are orthonormal, and also the v's are orthonormal.

So that's the whole matrix. But we have a big matrix, and we want to get the important information out of it, not all the information. People say, in machine learning, that if you've learned all the training data, you haven't really learned anything-- you've just copied it all in. The whole point of neural nets and the process of machine learning is to learn important facts about the data. And here we're at the most basic stage of that.

I claim that the important facts about the matrix are in its largest k singular values-- the largest k pieces. With k equal to 1 we would keep only the largest single piece. But maybe we have space and computing power to handle a hundred pieces, so I would take k equal to 100; the matrix might have rank in the thousands. So I claim that Ak is the best.

Now here's the one theorem for today: Ak, using the first k pieces of the SVD, is the best approximation to A of rank k. So I'll write that down. That really says why the SVD is perfect. OK.
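Here is a minimal NumPy sketch of what Ak means; the 8-by-5 random matrix and k = 2 are arbitrary choices, not from the lecture. It keeps only the first k rank-1 pieces of the SVD.

```python
import numpy as np

A = np.random.randn(8, 5)                          # any m-by-n matrix will do

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the k largest pieces
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # sum of the first k rank-1 pieces

print(np.linalg.matrix_rank(A_k))                  # rank k, here 2
```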
So the statement says that if B, another matrix, has rank k, then the distance from A to B-- the error you make in just using B-- is greater than or equal to the error you make with the best guy: ||A - B|| >= ||A - Ak||.

Now that's a pretty straightforward, beautiful fact. And it goes back to the people who discovered the SVD in the first place. But then a couple of psychologists gave a proof in a later paper, and it's often called the Eckart-Young theorem. There is the theorem. Isn't that straightforward? And the hypothesis is straightforward. That's pretty nice. But of course, we have to think: why is it true?

And to give meaning to the theorem, we have to say what these double bars are. Do you know the right name for this? A double bar around a matrix is called the norm of the matrix. So I have to say something about matrix norms. A norm is a measure of how big the matrix is, and there are many different measures of how large a matrix is. Let me tell you, for today, three possible measures of a matrix.

So, different ways to measure-- I'll call the matrix just A, maybe. But then I'm going to apply the measure to A minus B, and to A minus Ak, and show that the second one is smaller. The norms I'm going to take today will have the special feature that they can be computed from the singular values.

So let me mention the L2 norm. That is the largest singular value, sigma 1. That's an important measure of the size of a general m-by-n matrix A, often called the L2 norm-- and that's where that index 2 goes. Oh-- I should really start with vectors, norms of vectors, and then build up to the norms of matrices.
Let me do norms of vectors over on this side. The L2 norm of a vector-- do we know what that is? That's the regular length of the vector that we all expect: the square root of v1 squared plus ... plus vn squared. It's the length of the hypotenuse in n-dimensional space. That's the L2 norm, because of that 2.

The L1 norm of a vector just adds up the absolute values of those components, without squaring and square-rooting them. Just add them. That's the L1 norm. And you might say, why do we want two norms? And there are more norms; let me just tell you one more. The infinity norm-- and there is a reason for the 1 and the 2 and the infinity-- is the largest of the |vi|.

OK. Have you met norms before? I don't know. These are vector norms, which maybe you have met. Then we're going to have matrix norms, which maybe will be new.

So the L2 norm is the one we usually think of. But the L1 norm has become really, really important, and let me tell you just why; a later section of the notes and a later lecture in this course will develop it. So these are L2, L1, and L infinity. What's special about the L1 norm? Well, it just turned out-- and it was only discovered in our lifetimes-- that when you minimize some function using the L1 norm-- a signal-fitting error, let's say, or whatever you minimize-- the winning vector, the minimizing vector, turns out to be sparse. And what does sparse mean? Sparse means mostly zero components.

Somehow, when I minimize in L2-- which historically goes back to Gauss, the greatest mathematician of all time-- I'm doing least squares, and I find that the minimizer typically has a lot of little nonzero components, because when you square those little ones, they don't hurt much.
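A quick check of those three vector norms in NumPy, on a made-up vector:

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])        # an arbitrary example vector

l2   = np.linalg.norm(v)              # sqrt(9 + 16 + 1), the usual length
l1   = np.linalg.norm(v, 1)           # |3| + |-4| + |1| = 8
linf = np.linalg.norm(v, np.inf)      # largest component in absolute value: 4

print(l2, l1, linf)
```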
But Gauss didn't do L1 minimization. That problem has other names-- basis pursuit-- and it comes into signal processing and sensing. Right. And then it was discovered that if you minimize in that norm-- as we'll see-- you amazingly get a winning vector that is mostly zeros. And the advantage of that is that you can understand what its components are. An answer with many small components-- you have no interpretation for that answer. But for an answer that just has a few nonzero components, you really see what's happening. And then the infinity norm is an important one, too.

OK. Now, what are the properties of a norm? Well, the norm of c times a vector-- just multiplying by 6, or 11, or minus pi, or whatever-- is the size of c times the norm of the vector. Norms have that nice property; they're homogeneous, or whatever word. If you double the vector, you should double the norm-- double the length. That makes sense. And then the important property is the famous triangle inequality: if v and w are two sides of a triangle, and you take the norm of v and add the norm of w-- the two sides-- you get at least the norm of v plus w, the straight side. Yeah. So those are properties that we require, together with the fact that the norm is positive, which I won't write down. But it's important too.

OK. So those are norms, and the same requirements apply to matrix norms. If I double the matrix, I want to double its norm. And of course, that works for the 2-norm. And the triangle inequality for that norm says that the largest singular value of A plus B-- two matrices-- is less than or equal to the largest singular value of A plus the largest singular value of B.
And we won't take class time to check minor, straightforward things like that. So now let me continue with the three norms that I want to tell you about. The 2-norm is a very important one. Then there is another norm, named with an F, after Frobenius-- sorry about that. And what is that norm? That norm looks at all the entries in the matrix-- just as if it were one long vector-- and squares them all, and adds them up. So in a way, it's like the 2-norm for a vector. Shall I put a square root? Maybe I should. It's the square root of the sum of the squares of all the little people in the matrix: a11 squared plus a12 squared, and so on, until you finally get to amn squared. You just treat the matrix like a long vector and take the square root, just like so. That's the Frobenius norm.

And then finally, not so well known, is something that's more like L1. It's called the nuclear norm, and not all the faculty would know about this nuclear norm. It is the sum of the singular values. I guess there are r of them, so that's where we would stop.

OK. So those are three norms. Now why do I pick those three norms? Here's the point: for those three norms, this Eckart-Young statement is true. I could cook up other matrix norms for which it wouldn't work. But for these three highly important norms, the statement holds: the closest rank-k approximation is found from the first k pieces. You see, that's a good thing, because this is what we compute from the SVD. So now we've solved an approximation problem: the best B is Ak. And the point is, we could use any of those norms. Well, somebody finally came up with a proof that does all three norms at once. In the notes, I do the 2-norm case separately from Frobenius.
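All three of these matrix norms can be read off from the singular values; here is a small NumPy check on an arbitrary matrix:

```python
import numpy as np

A = np.random.randn(6, 4)                  # an arbitrary m-by-n matrix
s = np.linalg.svd(A, compute_uv=False)     # singular values, largest first

spectral  = s[0]                           # L2 norm: the largest singular value
frobenius = np.sqrt(np.sum(s**2))          # same as treating A as one long vector
nuclear   = np.sum(s)                      # sum of all the singular values

# NumPy's built-in matrix norms agree.
assert np.isclose(spectral,  np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear,   np.linalg.norm(A, 'nuc'))
```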
And actually, I found, in an MIT thesis-- I was just reading a course 6 PhD thesis, and the author, who is speaking tomorrow, or Friday, in IDSS, Dr. [? Srebro, ?] found a nice new proof for Frobenius. And it's in the notes, as well as an older proof.

OK. You know, as I talk here, I'm not too sure whether it is essential for me to go through the proof, either in the L2 norm-- which takes half a page in the notes-- or in the Frobenius norm, which takes more. I'd rather you see the point. And what is special about these norms of a matrix? These norms depend only on the singular values. Right? The L2 norm, at least, clearly depends only on the singular values-- it's the largest one. The nuclear norm is the sum of them all.

That nuclear norm comes into the Netflix competition, by the way. This was the right norm to win a zillion dollars in the Netflix competition. So what did Netflix do? It ran a math competition. It had movie preferences from many, many Netflix subscribers, who gave their rankings to a bunch of movies. But of course, none of them had seen all the movies. So the matrix of rankings-- rankers in one direction, movies in the other-- is a very big matrix, but it's got missing entries. If the ranker didn't see the movie, he or she isn't ranking it.

So what's the idea about Netflix? They offered something like a million-dollar prize, and a lot of math and computer science people fought for that prize. And over the years, they got to something like 92, 93, 94 percent right. But it turned out that, in the end, you had to use a little psychology of how people voted. So it was partly about human psychology.
But it was also a very large matrix problem with an incomplete matrix, and that matrix had to be completed. You had to figure out what the ranker would have said about, say, The Post, if he hadn't seen it but had ranked several other movies-- All the President's Men, or whatever-- given a ranking to those. And that's a recommender system, of course. That's how you get recommendations from Amazon. They've got a big matrix calculation there, and if you've bought a couple of math books, they're going to tell you about more math books-- more than you want to know. Right. OK.

So anyway, it just turned out that this nuclear norm was the right one to minimize. I can't give you all the details of the Netflix competition, but this turned out to be the right norm for the minimum problem-- not least squares, which would use a different norm, but a best nuclear-norm completion of the matrix.

And now it's being put to much more serious uses, for MRI-- magnetic resonance imaging. When you go in, it's a noisy system, but it gives an excellent picture of what's going on. So I'll just write Netflix here, and then MRIs.

So what's the point about MRIs? If you stay in long enough, you get all the numbers-- there isn't missing data. But, as with a child, you might want to have the child in for just a few minutes, and that's not enough to get a complete picture. You have, again, missing data in your matrix-- in the image from the MRI. So then, of course, you've got to complete that matrix. You have to fill in: what would the MRI have seen in those positions where it didn't look long enough? And again, the nuclear norm is a good one for that.

OK. So there will be a whole section on norms-- maybe it's on Stellar by now.
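As a rough illustration of nuclear-norm-style completion, here is a toy soft-thresholding loop in NumPy. This is only a sketch-- not the algorithm that won the Netflix prize-- and the threshold tau, the iteration count, and the tiny example matrix are all made up.

```python
import numpy as np

def complete(A_obs, mask, tau=0.1, n_iter=500):
    """Fill in missing entries (mask == False) by repeatedly shrinking the
    singular values, which pushes the nuclear norm down."""
    X = np.where(mask, A_obs, 0.0)                  # start with zeros in the holes
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt     # soft-threshold the sigmas
        X = np.where(mask, A_obs, X)                # keep observed entries fixed
    return X

# A rank-1 "ratings" matrix with two entries hidden.
A_true = np.outer([1.0, 2.0, 3.0], [2.0, 1.0, 2.0])
mask = np.ones_like(A_true, dtype=bool)
mask[0, 2] = mask[2, 0] = False

print(complete(A_true, mask))    # the hidden entries come out near the true 2 and 6
```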
OK. So I'm not going to prove it-- let me just say, what does this theorem tell us? I'll just give an example-- maybe I'll start with the example that's in the notes. Suppose k is 2, so I'm looking among all rank-2 matrices. And suppose my matrix is diagonal, with 4, 3, 2, 1 on the diagonal and all the rest zeros. It's a rank-4 matrix, and I can see its singular values-- they're sitting there. Those would be the singular values, and the eigenvalues, and everything, of course.

Now, what would A2 be? What would be the best approximation of rank 2 to that matrix, in whichever of those norms? What would A2 do? Yeah-- it would keep the 4 and the 3. It would pick the two largest pieces. So I'm looking at Ak with k equal to 2, so it has to have rank 2; my matrix has rank 4, and the biggest pieces are those.

OK. So the theorem says that if I had any other rank-2 matrix B, it would be further away from A than this A2. It says that this is the closest. And could you think of a matrix that could possibly be closer, and be rank 2? Rank 2 is the tricky thing. The matrices of rank 2 form a kind of crazy set: if I add a rank-2 matrix to a rank-2 matrix, probably the rank is up to 4. So the rank-2 matrices are all kind of floating around in their own little corners. This A2 looks like the best one. But in the notes I suggest-- well, what about another B? What about this B? For the 4 and the 3, I could get closer-- maybe not exact, but closer-- maybe by taking a block of 3.5s. But I've only got rank 2 to play with, so I'd better make that block a rank-1 piece, and then do the same for the 2 and the 1. So you see what I thought of? I thought, man, maybe that's better-- like on the diagonal, I'm coming closer.
Well, I'm not getting it exactly here. But then I've got one rank-1 piece left to play with, and I'll put, maybe, 1.5s down here. OK. So that's a rank-2 matrix-- two little rank-1s. And on the diagonal, it's better: with the 3.5s I'm only missing by a half, and with the 1.5s I'm missing by a half. So I'm only missing by a half on the diagonal, where A2 was missing by 2 and by 1. So maybe I've found something better. But I had to pay a price: those entries off the diagonal that keep the rank low. And they kill me. So that B will be further away from A. If I computed A minus B and took its norm, I would see it's bigger than the norm of A minus A2.

Yeah. So, you see the point of the theorem? That's really what I'm trying to say: it's not obvious. You may feel, well, it's totally obvious-- pick the 4 and the 3, what else could do it? But it depends on the norm, and so on. So Eckart and Young had to think of a proof, and other people, too.

OK. Now, you could also object that I started with a diagonal matrix here. That's so special. But what I want to say is that the diagonal matrix is not that special, because I could take A-- so let me now call this diagonal matrix D-- or let me call it sigma, to give it another, more appropriate name. So this could be the sigma matrix, and there could be a U on the left of it and a V transpose on the right of it, so that A is U sigma V transpose. So this is my sigma, and this U is like any orthogonal matrix, and this is like any V transpose. Right? I'm just saying: here's a whole lot more matrices. There was just one diagonal matrix, but now I have all these matrices, with U's multiplying on the left and V transposes on the right.
And I ask you this question: what are the singular values of that matrix A? Here the singular values were clear-- 4, 3, 2, and 1. What are the singular values of this matrix A, when I've multiplied by an orthogonal guy on both sides? That's a key question. What are the singular values of that one?

AUDIENCE: 4, 3, 2, 1.

GILBERT STRANG: 4, 3, 2, 1. They didn't change. Why is that? Because this has the SVD form-- orthogonal times diagonal times orthogonal-- and that diagonal contains the singular values. What I'm saying is that my, and our, trivial little example here actually covers all 4-by-4s that have these singular values. My whole problem is orthogonally invariant, a math guy would say: when I multiply by a U, or a V transpose, or both, the problem doesn't change. Norms don't change. Yeah, that's the point-- I realize it now. This is the point: if I multiply the matrix A by an orthogonal matrix Q, it has all the same norms; multiplying doesn't change the norm.

Actually, that was true way back for vectors, with this L2 length. What's the deal about vectors? Suppose I have a vector v, and I've computed its norm, its hypotenuse. And now I look at Q times v in that same 2-norm. What's special about that? I took any vector v, and I know what its length is. Now I multiply by Q. What happens to the length? It doesn't change. An orthogonal matrix-- you could think of it as just rotating the triangle in space; the hypotenuse doesn't change. And we've checked that, because the check is to square it: then you're doing the transpose of Qv times Qv, you simplify it the usual way, and then you have Q transpose Q equal to the identity-- and you're golden.
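Both of those claims are easy to check numerically. Here is a small NumPy sketch using the 4, 3, 2, 1 example; the matrix B below is one reading of the 3.5 / 1.5 construction from the board, and Q is a random orthogonal matrix.

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])
U, s, Vt = np.linalg.svd(A)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]      # best rank 2: diag(4, 3, 0, 0)

B = np.array([[3.5, 3.5, 0.0, 0.0],             # two rank-1 blocks, so rank 2
              [3.5, 3.5, 0.0, 0.0],
              [0.0, 0.0, 1.5, 1.5],
              [0.0, 0.0, 1.5, 1.5]])

for p in (2, 'fro', 'nuc'):                     # the three norms from today
    print(p, np.linalg.norm(A - A2, p), np.linalg.norm(A - B, p))
    # A - A2 gives the smaller error every time, as Eckart-Young promises

Q, _ = np.linalg.qr(np.random.randn(4, 4))      # a random orthogonal matrix
print(np.linalg.svd(Q @ A, compute_uv=False))   # still 4, 3, 2, 1
```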
So the result is that Qv has the same length as v. Let me put it in a sentence now: that norm is not changed by an orthogonal matrix. And these matrix norms are not changed by orthogonal matrices either, because if I multiply the A here-- my U sigma V transpose-- by a Q, then I have QU sigma V transpose. And what is really the underlying point? That QU is an orthogonal matrix, just as good as U. So let me put this down: QA would be QU sigma V transpose. And now I'm asking you: what's the singular value decomposition of QA? I hope you may actually be seeing it. What are the singular values? What's the diagonal matrix? Just look there for it. The diagonal matrix is sigma. What goes on the right of it? The V transpose. And what goes on the left of it? QU. And that's orthogonal times orthogonal-- everybody in this room has to know that if I multiply two orthogonal matrices, the result is again orthogonal. So I can multiply by Q, and it only affects the U part, not the sigma part. And so it doesn't change any of those norms.

OK. So that's fine. That's what I wanted to say about the Eckart-Young theorem-- not proving it, but hopefully giving you an example of what it means: that A2 is the best rank-2 approximation to that matrix. OK.

So that's the key math behind PCA. So now I have to-- want to, not just have to-- tell you about PCA. So what's that about? We have a bunch of data, and we want to see-- so let me take a bunch of data points, say points in the plane. So I have a bunch of data points in the plane. So here's my data: the first vector-- call it v1--
then v2, and so on. These are just two-component guys-- just columns with two components. So I'm just measuring height and age, and I want to find the relationship between height and age. The first row holds the heights in my data, and the second row holds the ages. So I've got, say, a lot of people, and these are the heights, and these are the ages. I've got n points in 2D, and I want to make sense out of that. I want to look for the relationship between height and age-- actually, for a linear relation between height and age.

So first of all, these points are all over the place. So the first step that a statistician takes is to get mean 0-- get the average to be 0. From row 1, the heights, I subtract the average height. So the matrix I'm really going to work on is my matrix A: the data matrix minus the average height in every entry of the first row, and the average age in every entry of the second row. I'm subtracting the mean. (Oh, that was a brilliant notation-- a sub a-- which can't really be a sub a.) You see what subtracting that matrix of means has done? It has made each row of A add to-- now, add to what? If I have a bunch of things, and I've subtracted off their mean, so the mean, or the average, is now 0, then those things add up to--

AUDIENCE: Zero.

GILBERT STRANG: Zero. Right. I've just brought these points into something centered here: this axis is age, and this is height. And let's see-- by subtracting, it is no longer unreasonable to have negative age and negative height. The little kids, when I subtracted off the average age, ended up with a negative age; the older ones ended up still positive. And somehow I've got a whole lot of points, but now their mean is zero.
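Here is a small sketch of that centering step; the heights and ages below are made up, with one column per person.

```python
import numpy as np

A0 = np.array([[160., 172., 181., 148., 169.],   # heights, row 0
               [ 12.,  25.,  40.,   8.,  22.]])  # ages, row 1

means = A0.mean(axis=1, keepdims=True)   # average height and average age
A = A0 - means                           # subtract the mean from each row

print(A.sum(axis=1))                     # each row of A now adds to zero
```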
Do you see that I've centered the data at 0, 0? And what am I looking for here? I'm looking for the best line. That's what I want to find. And that would be a problem in PCA: what's the best linear relation? Because PCA is limited-- PCA isn't all of deep learning by any means. The whole success of deep learning was the final realization, after a bunch of years, that you had to have a nonlinear function in there to model serious data. But here's PCA as a linear business, and I'm looking for the best line.

And you will say, wait a minute-- I know how to find the best line, just use least squares. Gauss did it; it can't be all bad. But PCA-- I was giving a talk in New York when I was just learning about this, and somebody said, what you're doing with PCA has to be the same as least squares-- it's finding the best line. And I knew it wasn't, but I didn't know how to answer that question best. And now, at least, I know better.

So, the best line in least squares-- can I remind you about least squares? Because this is not least squares. In least squares, I have some data points, and I have a best line that goes through them. In least squares, I don't always center the data to mean zero, but I could. But what do you minimize in least squares? If you remember the picture in linear algebra books, you measure the errors-- say three errors-- and it's how much you're wrong at those three points. Those are the three errors: the difference between Ax and b, the b minus Ax that you square. And you add up those three squared errors. And what's different over here? I mean, there are more points, but that's not the point-- that's not the difference.
The difference is that in PCA, you're measuring perpendicular to the line. You're adding up the squares of all these little perpendicular distances and minimizing. So you see it's a different problem, and therefore it has a different answer. And this answer turns out to involve the SVD-- the sigmas. Whereas for the least squares answer, you remember from ordinary linear algebra: when you minimize that sum of squares, you get to an equation-- what equation for the best x? Do you remember?

AUDIENCE: [INAUDIBLE]

GILBERT STRANG: Yeah. What is it now? Everybody should know. And we will actually see it in this course, because we're doing the heart of linear algebra here-- we haven't done it yet, though. And tell me again, what equation do I solve for that problem?

AUDIENCE: A transpose A.

GILBERT STRANG: A transpose A x hat equals A transpose b-- called the normal equations. It's what statistics language calls regression; that's a regression problem. This is a different problem.

OK. But now you see the answer. Well, they both involve A transpose A-- that's sort of interesting, because you have a rectangular matrix A, and then sooner or later, A transpose A is coming. But least squares involves solving a linear system of equations, so it's fast, and we will do it. It's very important-- it's probably the most important application in 18.06. But it's not the same as this one. In 18.06, maybe the last day is PCA. I didn't write out those letters: Principal Component Analysis, PCA-- which statisticians have been doing for a long time.
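For the least squares side, the normal equations look like this in code; the little three-point fit is a made-up example, not one from the lecture.

```python
import numpy as np

# Fit b ~ C + D*t at three points by least squares (vertical errors, squared).
A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([1., 2., 2.])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations: A^T A x = A^T b
print(x_hat)                                # the best C and D
```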
We're not doing something brand new here. But-- so how does a statistician think about this problem, or about that data matrix? You have a matrix of data-- 2 rows and many, many columns, so many samples-- and we've made the mean zero. That's the first step a statistician takes, to deal with the mean. What's the next step? What else does a statistician do with data to measure its size? There's another number that goes with the mean, and it's the variance-- the mean and the variance. So somehow we're going to do variances. And it will really be covariances, because we have two sets of data-- heights and ages. We're really going to have a covariance matrix, and it will be 2 by 2. Because it will tell us not only the variance in the heights-- that's the first thing a statistician would think about: some small people, some big people-- and the variance in the ages, but also the link between them. How are the height-age pairs related-- does more age go with more height? And of course it does. That's the whole point here.

So it's this covariance matrix-- or the sample covariance matrix, to give it its full name. So, just touching on statistics for a moment: when we see that word "sample" in the name, what is it telling us? It's telling us that this matrix is computed from the samples, not from a theoretical probability distribution. We might have a proposed distribution-- the height follows the age by some formula-- and that would give us theoretical variances. We're doing sample variances, also called the empirical covariance matrix. Empirical-- that word means: from the information, from the data. So that's what we do. And it is exactly AA transpose. You have to normalize it by the number of data points, N, and then, for some reason best known to statisticians, it's N minus 1. And of course, they've got to be right-- they've been around a long time-- and it should be N minus 1, because somehow one degree of freedom was accounted for when we made the mean 0. So, anyway, no problem; the N minus 1 is not going to affect our computation here.
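In code, that sample covariance is exactly AA transpose over N minus 1, and NumPy's cov uses the same convention. The centered data below is made up, with ages built to follow heights roughly linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(0.0, 10.0, 500)                # made-up heights
ages = 0.8 * heights + rng.normal(0.0, 3.0, 500)    # ages roughly follow heights
A = np.vstack([heights, ages])
A = A - A.mean(axis=1, keepdims=True)               # make each row mean exactly 0

N = A.shape[1]
S = A @ A.T / (N - 1)          # 2-by-2: variances on the diagonal, covariance off it

assert np.allclose(S, np.cov(A))   # np.cov uses the same N - 1 normalization
```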
This is the matrix that tells us what we've got to work with-- the matrix AA transpose. And then-- so we have this problem. I guess we really have a minimum problem. Yeah, what problem are we solving? Our problem was not least squares-- not the same as least squares; similar, but not the same. We want to minimize. So we're looking for that best line, where age equals some number c times the height. Maybe it would have been better to put age here and height there-- no, no, because then there are two unknowns. So I'm looking for c-- looking for the number c. And with just two minutes of class left: what is that number c going to be, when I finally get the problem stated properly and then solve it? I'm going to learn that the best ratio of age to height is sigma 1-- sigma 1, the one that tells us how those two are connected, and the orthogonal-- and what will be the best-- yeah. No, maybe I didn't answer that right-- maybe I didn't get that right, because I'm looking for the vector that points in the right direction. Yeah, I'm sorry. I think the answer is: it's got to be there in the SVD. It's the vector you want-- it's the principal component you want. Let's do that properly on Friday.
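The lecture leaves the derivation for Friday, but here is where that answer lives: the first singular vector of the centered data matrix gives the direction of the best perpendicular-error line, and its slope is the number c. The data is the same made-up sample as in the covariance sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(0.0, 10.0, 500)
ages = 0.8 * heights + rng.normal(0.0, 3.0, 500)
A = np.vstack([heights, ages])
A = A - A.mean(axis=1, keepdims=True)      # centered 2-by-n data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
u1 = U[:, 0]                               # first principal component: a direction
c = u1[1] / u1[0]                          # slope of the best line: age = c * height

print(c)                                   # roughly the 0.8 used to build the data
```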
I hope you see-- because this was a first step away from the highlights of linear algebra to problems solved by linear algebra, and practical problems. And my point is that the SVD solves these.