1
00:00:01,550 --> 00:00:03,920
The following content is
provided under a Creative

2
00:00:03,920 --> 00:00:05,310
Commons license.

3
00:00:05,310 --> 00:00:07,520
Your support will help
MIT OpenCourseWare

4
00:00:07,520 --> 00:00:11,610
continue to offer high quality
educational resources for free.

5
00:00:11,610 --> 00:00:14,180
To make a donation or to
view additional materials

6
00:00:14,180 --> 00:00:16,670
from hundreds of
MIT courses, visit

7
00:00:16,670 --> 00:00:18,540
MIT OpenCourseWare at ocw.mit.edu.

8
00:00:24,170 --> 00:00:29,070
GILBERT STRANG: So I'm going to
talk about the gradient descent

9
00:00:29,070 --> 00:00:32,580
today to get to that
central algorithm

10
00:00:32,580 --> 00:00:38,190
of neural net deep
learning, machine learning,

11
00:00:38,190 --> 00:00:40,530
and optimization in general.

12
00:00:40,530 --> 00:00:43,230
So I'm trying to
minimize a function.

13
00:00:43,230 --> 00:00:50,400
And that's the way you do it if
there are many, many variables,

14
00:00:50,400 --> 00:00:52,890
too many to take
second derivatives,

15
00:00:52,890 --> 00:00:56,880
then we settle for first
derivatives of the function.

16
00:00:56,880 --> 00:00:59,610
So I introduced,
and you've already

17
00:00:59,610 --> 00:01:01,610
met the idea of gradient.

18
00:01:01,610 --> 00:01:04,470
But let me just be sure
to make some comments

19
00:01:04,470 --> 00:01:07,410
about the gradient
and the Hessian

20
00:01:07,410 --> 00:01:15,610
and the role of convexity before
we see the big crucial example.

21
00:01:15,610 --> 00:01:19,425
So I've kind of prepared over
here for this crucial example.

22
00:01:22,010 --> 00:01:26,820
The function is a pure
quadratic, two unknowns, x

23
00:01:26,820 --> 00:01:30,240
and y, pure quadratic.

24
00:01:30,240 --> 00:01:34,620
So every pure quadratic
I can write in terms

25
00:01:34,620 --> 00:01:37,160
of a symmetric matrix S.

26
00:01:37,160 --> 00:01:42,890
And in this case, x1 squared
plus b x2 squared, the symmetric,

27
00:01:42,890 --> 00:01:45,810
the matrix is just 2 by 2.

28
00:01:45,810 --> 00:01:47,040
It's diagonal.

29
00:01:47,040 --> 00:01:52,440
It's got eigenvalues 1 and
b sitting on the diagonal.

30
00:01:52,440 --> 00:01:56,020
I'm thinking of b as
being the smaller one.

31
00:01:56,020 --> 00:02:00,720
So the condition
number, which we'll see,

32
00:02:00,720 --> 00:02:07,230
is all important in the question
of the speed of convergence

33
00:02:07,230 --> 00:02:13,260
is the ratio of the
largest to the smallest.

34
00:02:13,260 --> 00:02:17,310
In this case, the largest
is 1, the smallest is b.

35
00:02:17,310 --> 00:02:19,260
So that's 1 over b.

36
00:02:19,260 --> 00:02:23,370
And when 1 over b
is a big number,

37
00:02:23,370 --> 00:02:26,130
when b is a very small
number, then that's

38
00:02:26,130 --> 00:02:27,090
when we're in trouble.

39
00:02:31,560 --> 00:02:34,380
When the matrix is symmetric,
that condition number

40
00:02:34,380 --> 00:02:37,620
is lambda max over lambda min.

41
00:02:37,620 --> 00:02:40,830
If I had an
unsymmetric matrix, I

42
00:02:40,830 --> 00:02:44,360
would probably use sigma max
over sigma min, of course.

43
00:02:44,360 --> 00:02:48,660
But here, matrices
are symmetric.

44
00:02:48,660 --> 00:02:52,170
We're going to
see something neat

45
00:02:52,170 --> 00:02:58,260
is that we can actually take
the steps of steepest descent,

46
00:02:58,260 --> 00:03:01,440
write down what
each step gives us,

47
00:03:01,440 --> 00:03:05,310
and see how quickly they
converge to the answer.

48
00:03:05,310 --> 00:03:07,220
And what is the answer?

49
00:03:07,220 --> 00:03:11,370
So I haven't put in
any linear term here.

50
00:03:11,370 --> 00:03:14,730
So I just have a bowl
sitting on the origin.

51
00:03:14,730 --> 00:03:18,990
So of course, the minimum
point is x equal 0, y equals 0.

52
00:03:18,990 --> 00:03:26,050
So the minimum point x
star, is 0, 0, of course.

53
00:03:26,050 --> 00:03:29,670
So the question will be how
quickly do we get to that one.

54
00:03:29,670 --> 00:03:33,450
And you will say pretty
small example, not typical.

55
00:03:33,450 --> 00:03:37,080
But the terrific
thing is that we see

56
00:03:37,080 --> 00:03:38,890
everything for this example.

57
00:03:38,890 --> 00:03:43,380
We can see the actual
steps of steepest descent.

58
00:03:43,380 --> 00:03:45,600
We can see how
quickly they converge

59
00:03:45,600 --> 00:03:50,730
to the x star, the
answer, the place

60
00:03:50,730 --> 00:03:52,890
where this thing is a minimum.

61
00:03:52,890 --> 00:04:01,440
And we can begin to think
what to do if it's too slow.

62
00:04:01,440 --> 00:04:06,930
So I'll come to that example
after some general thoughts

63
00:04:06,930 --> 00:04:09,840
about gradients, Hessians.

64
00:04:09,840 --> 00:04:12,300
So what does the
gradient tell us?

65
00:04:12,300 --> 00:04:14,745
So let me just take an
example of the gradient.

66
00:04:17,860 --> 00:04:23,980
Let me take a linear function,
f of x, y equals, say, 2x plus 5y.

67
00:04:26,560 --> 00:04:31,540
I just think we ought to get
totally familiar with these.

68
00:04:31,540 --> 00:04:33,910
We're doing something.

69
00:04:33,910 --> 00:04:38,800
We're jumping into
an important topic.

70
00:04:38,800 --> 00:04:41,440
When I ask you
what's the gradient,

71
00:04:41,440 --> 00:04:43,780
that's a freshman question.

72
00:04:43,780 --> 00:04:48,460
But let's just be sure we know
how to interpret the gradient,

73
00:04:48,460 --> 00:04:51,970
how to compute
it, what it means,

74
00:04:51,970 --> 00:04:54,200
how to see it geometrically.

75
00:04:54,200 --> 00:04:56,650
So what's the gradient
of that function?

76
00:04:56,650 --> 00:04:58,380
It's a function
of two variables.

77
00:04:58,380 --> 00:05:02,110
So the gradient is a
vector with two components.

78
00:05:02,110 --> 00:05:02,980
And they are?

79
00:05:07,540 --> 00:05:09,420
The derivative with
respect to x, which

80
00:05:09,420 --> 00:05:13,320
is 2, and the derivative with
respect to y, which is 5.

81
00:05:13,320 --> 00:05:17,100
So in this case, the
gradient is constant.

82
00:05:17,100 --> 00:05:22,650
And the Hessian, which I
often call H after Hessian,

83
00:05:22,650 --> 00:05:25,800
or del squared F
would tell us we're

84
00:05:25,800 --> 00:05:27,990
taking the second
derivatives, that

85
00:05:27,990 --> 00:05:33,150
will be the second derivatives
obviously 0 in this case.

86
00:05:33,150 --> 00:05:38,230
So what shape is H here?

87
00:05:38,230 --> 00:05:39,730
It's 2 by 2.

88
00:05:39,730 --> 00:05:45,212
Everybody recognizes 2 by
2 is H would have the--

89
00:05:45,212 --> 00:05:49,220
I'll take a second
derivative of that--

90
00:05:49,220 --> 00:05:52,090
sorry, the first derivative
of that with respect to x,

91
00:05:52,090 --> 00:05:54,700
obviously 0, the first
derivative with respect

92
00:05:54,700 --> 00:06:00,620
to y, the first derivative
of that with respect to x y.

93
00:06:00,620 --> 00:06:04,840
Anyway, Hessian 0 for sure.

94
00:06:04,840 --> 00:06:08,080
So let me draw the surface.

95
00:06:08,080 --> 00:06:13,540
So x, y, and the surface, if
I graph F in this direction,

96
00:06:13,540 --> 00:06:16,960
then obviously, I have a plane.

97
00:06:16,960 --> 00:06:20,840
And I'm at a typical point
on the plane let's say.

98
00:06:20,840 --> 00:06:21,910
Yeah, yeah.

99
00:06:21,910 --> 00:06:24,070
So I'm at a point
x, y, I should say.

100
00:06:24,070 --> 00:06:25,690
I'm at a point x, y.

101
00:06:25,690 --> 00:06:28,340
And let me put the
plane through it.

102
00:06:28,340 --> 00:06:30,160
So how do I interpret
the gradient

103
00:06:30,160 --> 00:06:32,235
at that particular point x, y?

104
00:06:35,630 --> 00:06:38,240
What does 2x plus 5y tell me?

105
00:06:38,240 --> 00:06:46,400
Or rather what does grad
F tell me about movement

106
00:06:46,400 --> 00:06:50,510
from that point x, y?

107
00:06:50,510 --> 00:06:52,030
Of course, the
gradient is constant.

108
00:06:52,030 --> 00:06:55,130
So it really didn't matter
what point I'm moving from.

109
00:06:55,130 --> 00:06:57,680
But taking a point here.

110
00:06:57,680 --> 00:07:00,290
So what's the deal if I move?

111
00:07:00,290 --> 00:07:04,010
What's the fastest way
to go up the surface?

112
00:07:04,010 --> 00:07:09,110
If I took the plane that
went through that point x, y,

113
00:07:09,110 --> 00:07:11,620
what's the fastest way
to climb the plane?

114
00:07:11,620 --> 00:07:14,630
What direction goes up fastest?

115
00:07:14,630 --> 00:07:16,230
The gradient direction, right?

116
00:07:16,230 --> 00:07:19,080
The gradient direction
is the way up.

117
00:07:19,080 --> 00:07:22,700
How am I going to put
it in this picture?

118
00:07:22,700 --> 00:07:26,710
I guess I'm thinking
of this plane as--

119
00:07:26,710 --> 00:07:27,530
so what plane?

120
00:07:27,530 --> 00:07:30,230
You could well ask what
plane have I drawn?

121
00:07:30,230 --> 00:07:39,350
Suppose I've drawn the plane
2x plus 5y equals 0 even?

122
00:07:39,350 --> 00:07:41,560
So I'll make it go
through the origin.

123
00:07:41,560 --> 00:07:44,540
And I've taken a typical
point on that plane.

124
00:07:44,540 --> 00:07:48,380
Now if I want to
increase that function,

125
00:07:48,380 --> 00:07:52,700
I go perpendicular to the plane.

126
00:07:52,700 --> 00:07:54,665
If I want to stay level
with the function,

127
00:07:54,665 --> 00:07:58,620
if I wanted to stay at
0, I stay in the plane.

128
00:07:58,620 --> 00:08:00,650
So there are two key directions.

129
00:08:00,650 --> 00:08:01,880
Everybody knows this.

130
00:08:01,880 --> 00:08:03,200
I'm just repeating.

131
00:08:03,200 --> 00:08:08,030
This is the direction
of the gradient of F out

132
00:08:08,030 --> 00:08:10,250
of the plane, steepest upwards.

133
00:08:10,250 --> 00:08:13,190
This is the downwards
direction minus gradient

134
00:08:13,190 --> 00:08:16,940
of F, perpendicular to
the plane downwards.

135
00:08:16,940 --> 00:08:21,800
And that line is in the plane.

136
00:08:21,800 --> 00:08:23,660
That's part of the level set.

137
00:08:23,660 --> 00:08:28,070
2x plus 5y equals 0
would be a level set.

138
00:08:28,070 --> 00:08:32,950
That's my pretty
amateur picture.

139
00:08:32,950 --> 00:08:45,130
Just all I want to remember is
these words level and steepest,

140
00:08:45,130 --> 00:08:49,330
up or down.

141
00:08:49,330 --> 00:08:54,610
Down with a minus sign that
we see in steepest descent.

142
00:08:54,610 --> 00:08:58,980
So we're in steepest descent.

143
00:09:03,020 --> 00:09:08,900
And what's the Hessian
telling me about the surface

144
00:09:08,900 --> 00:09:12,810
if I take the matrix
of second derivatives?

145
00:09:12,810 --> 00:09:14,680
So I have this surface.

146
00:09:14,680 --> 00:09:18,070
So I have a surface
F equal constant.

147
00:09:22,990 --> 00:09:25,620
That's the sort
of level surface.

148
00:09:25,620 --> 00:09:29,530
So if I stay in that surface,
the gradient of F is 0.

149
00:09:29,530 --> 00:09:33,351
Gradient of F is 0 in--

150
00:09:36,960 --> 00:09:39,270
on-- on is a better word--

151
00:09:39,270 --> 00:09:39,900
on the surface.

152
00:09:43,330 --> 00:09:46,220
The gradient of F
points perpendicular.

153
00:09:46,220 --> 00:09:58,100
But what about the Hessian,
the second derivative?

154
00:09:58,100 --> 00:10:03,430
What is that telling
me about that surface

155
00:10:03,430 --> 00:10:07,950
in particular when the Hessian
is 0 or other surfaces?

156
00:10:07,950 --> 00:10:10,395
What does the Hessian
tell me about--

157
00:10:13,370 --> 00:10:16,990
I'm thinking of the Hessian
at a particular point.

158
00:10:16,990 --> 00:10:25,580
So I'm getting 0 for the Hessian
because the surface is flat.

159
00:10:25,580 --> 00:10:34,180
If the surface was
convex upwards from--

160
00:10:34,180 --> 00:10:41,775
if it was a convex or a graph
of F, the Hessian would be--

161
00:10:46,340 --> 00:10:48,810
so I just want to make
that connection now.

162
00:10:48,810 --> 00:10:54,990
What's the connection between
the Hessian and convexity

163
00:10:54,990 --> 00:10:55,590
of the--

164
00:10:55,590 --> 00:11:00,660
the Hessian of the function
and convexity of the function?

165
00:11:00,660 --> 00:11:06,550
So the point is that convexity--

166
00:11:06,550 --> 00:11:10,350
the Hessian tells me whether
or not the surface is convex.

167
00:11:10,350 --> 00:11:11,550
And what is the test?

168
00:11:11,550 --> 00:11:12,600
AUDIENCE: [INAUDIBLE].

169
00:11:12,600 --> 00:11:16,350
GILBERT STRANG: Positive
definite or semi-definite.

170
00:11:16,350 --> 00:11:20,340
I'm just looking for
an excuse to write down

171
00:11:20,340 --> 00:11:26,910
convexity and strong.

172
00:11:26,910 --> 00:11:29,760
Do I say strict or
strong convexity?

173
00:11:29,760 --> 00:11:30,630
I've forgotten.

174
00:11:30,630 --> 00:11:32,150
Strict, I think.

175
00:11:32,150 --> 00:11:33,030
Strictly convex.

176
00:11:38,230 --> 00:11:45,100
So convexity, the Hessian
is positive semi-definite,

177
00:11:45,100 --> 00:11:48,330
or which includes--

178
00:11:48,330 --> 00:11:49,990
I better say that right here--

179
00:11:49,990 --> 00:11:52,074
includes positive definite.

180
00:11:58,380 --> 00:12:00,420
If I'm looking for
a strict convexity,

181
00:12:00,420 --> 00:12:03,220
then I must require
positive definite.

182
00:12:03,220 --> 00:12:05,863
H is positive definite.

183
00:12:09,810 --> 00:12:12,300
Semi-definite won't do.

184
00:12:12,300 --> 00:12:15,300
So semi-definite for convex.

185
00:12:15,300 --> 00:12:18,540
So that in fact,
the linear function

186
00:12:18,540 --> 00:12:22,170
is convex, but not
strictly convex.

187
00:12:22,170 --> 00:12:25,160
Strictly means it
really bends upwards.

188
00:12:25,160 --> 00:12:26,890
The Hessian is
positive definite.

189
00:12:26,890 --> 00:12:31,120
The curvatures are positive.

190
00:12:31,120 --> 00:12:34,290
So this would include
linear functions,

191
00:12:34,290 --> 00:12:37,460
and that would not
include linear functions.

192
00:12:37,460 --> 00:12:40,740
They're not strictly convex.

193
00:12:40,740 --> 00:12:42,510
Good, good, good.

194
00:12:42,510 --> 00:12:46,600
Some examples-- OK, the
number one example, of course,

195
00:12:46,600 --> 00:12:49,410
is the one we're
talking about over here.

196
00:12:49,410 --> 00:12:59,840
So examples f of x equal
1/2 x transpose Sx.

197
00:13:03,020 --> 00:13:05,660
And of course, I could
have linear terms

198
00:13:05,660 --> 00:13:10,310
minus a transpose
x, a linear term.

199
00:13:10,310 --> 00:13:12,770
And I could have a constant.

200
00:13:12,770 --> 00:13:13,270
OK.

201
00:13:18,790 --> 00:13:23,390
So this function
is strictly convex

202
00:13:23,390 --> 00:13:28,130
when S is positive
definite, because H is now

203
00:13:28,130 --> 00:13:33,800
S for that function,
for that function

204
00:13:33,800 --> 00:13:39,170
H. Usually H, the Hessian is
varying from point to point.

205
00:13:39,170 --> 00:13:42,770
The nice thing about a pure
quadratic is it's constant.

206
00:13:42,770 --> 00:13:46,550
It's the same S at all points.

207
00:13:46,550 --> 00:13:49,580
Let me just ask you--

208
00:13:49,580 --> 00:13:53,370
so that's a convex function.

209
00:13:53,370 --> 00:13:56,250
And what's its minimum?

210
00:13:56,250 --> 00:13:57,883
What's the gradient,
first of all?

211
00:13:57,883 --> 00:13:59,050
What's the gradient of that?

212
00:14:03,790 --> 00:14:09,570
I'm asking really
for differentiating

213
00:14:09,570 --> 00:14:14,440
thinking in vector, doing all
n derivatives at once here.

214
00:14:14,440 --> 00:14:19,840
I'm asking for the whole
vector of first derivatives.

215
00:14:19,840 --> 00:14:24,420
Because here I'm giving
you the whole function

216
00:14:24,420 --> 00:14:28,150
with x for vector x.

217
00:14:28,150 --> 00:14:31,210
Of course, we could
take n to be 1.

218
00:14:31,210 --> 00:14:33,760
And then we would
see that if n was 1,

219
00:14:33,760 --> 00:14:39,880
this would just be Sx
squared, half Sx squared.

220
00:14:39,880 --> 00:14:44,170
And the derivative of
a half Sx squared--

221
00:14:44,170 --> 00:14:46,030
let me just put that
over here so we're

222
00:14:46,030 --> 00:14:48,700
sure to get it right--
half of Sx squared.

223
00:14:48,700 --> 00:14:51,490
This is in the n equal 1 case.

224
00:14:51,490 --> 00:14:53,860
And the derivative
is obviously Sx.

225
00:14:53,860 --> 00:14:55,540
And that's what it is here, Sx.

226
00:15:06,490 --> 00:15:10,200
It's obviously
simple, but if you

227
00:15:10,200 --> 00:15:14,190
haven't thought
about that line, it's

228
00:15:14,190 --> 00:15:18,120
asking for all the
first derivatives

229
00:15:18,120 --> 00:15:20,850
of that quadratic function.

230
00:15:20,850 --> 00:15:21,570
Oh!

231
00:15:21,570 --> 00:15:27,940
It's not-- What do I
have to include now here?

232
00:15:27,940 --> 00:15:31,200
That's not right as it stands
for the function that's

233
00:15:31,200 --> 00:15:32,517
written above it.

234
00:15:32,517 --> 00:15:33,600
What's the right gradient?

235
00:15:33,600 --> 00:15:34,517
AUDIENCE: [INAUDIBLE].

236
00:15:34,517 --> 00:15:38,220
GILBERT STRANG: Minus a, thanks.

237
00:15:38,220 --> 00:15:41,440
Because the linear function,
its partial derivatives

238
00:15:41,440 --> 00:15:45,120
are obviously just
the components of a.

239
00:15:45,120 --> 00:15:56,030
And the Hessian H is S,
derivatives of that guy.

240
00:15:56,030 --> 00:15:56,700
OK.

241
00:15:56,700 --> 00:15:57,300
Good.

242
00:15:57,300 --> 00:15:59,550
Good, good, good.

243
00:15:59,550 --> 00:16:02,520
And the minimum value-- we
might as well-- oh yeah!

244
00:16:02,520 --> 00:16:07,820
What's the right words
for a minimum value?

245
00:16:07,820 --> 00:16:09,570
No, I'm sorry.

246
00:16:09,570 --> 00:16:14,430
The right word is
minimum value like f min.

247
00:16:14,430 --> 00:16:17,880
So I want to compute f min.

248
00:16:17,880 --> 00:16:23,930
Well, first I have to figure out
where is that minimum reached?

249
00:16:23,930 --> 00:16:27,140
And what's the answer to that?

250
00:16:27,140 --> 00:16:30,840
We're putting everything on
the board for this simple case.

251
00:16:30,840 --> 00:16:38,990
The minimum of f
of x--

252
00:16:38,990 --> 00:16:42,290
remember, it's x is--
we're in n dimensions--

253
00:16:42,290 --> 00:16:49,910
is at x equal what?

254
00:16:49,910 --> 00:16:52,400
Well, the minimum is
where the gradient is 0.

255
00:16:55,460 --> 00:16:59,381
So what's the minimizing x?

256
00:16:59,381 --> 00:17:01,115
S inverse a, thanks.

257
00:17:08,180 --> 00:17:09,260
Sorry.

258
00:17:09,260 --> 00:17:12,480
That's not right.

259
00:17:12,480 --> 00:17:14,020
It's here that I
meant to write it.

260
00:17:17,099 --> 00:17:20,550
Really, my whole point
for this little moment

261
00:17:20,550 --> 00:17:23,250
is to be sure that
we keep straight what

262
00:17:23,250 --> 00:17:27,780
I mean by the place where
the minimum is reached

263
00:17:27,780 --> 00:17:29,160
and the minimum value.

264
00:17:29,160 --> 00:17:30,600
Those are two different things.

265
00:17:34,330 --> 00:17:36,810
So the minimum is
reached at S inverse

266
00:17:36,810 --> 00:17:40,270
a, because that's obviously
where the gradient is 0.

267
00:17:40,270 --> 00:17:43,073
It's the solution to Sx equal a.

268
00:17:43,073 --> 00:17:48,970
And what I was going to ask
you is what's the right word--

269
00:17:48,970 --> 00:17:56,440
well, sort of word, made up
word-- for this point x star

270
00:17:56,440 --> 00:17:58,760
where the minimum is reached?

271
00:17:58,760 --> 00:18:00,160
So it's not the minimum value.

272
00:18:00,160 --> 00:18:01,720
It's the point
where it's reached.

273
00:18:01,720 --> 00:18:06,057
And that's called-- the
notation for that point is

274
00:18:06,057 --> 00:18:06,991
AUDIENCE: Arg min.

275
00:18:06,991 --> 00:18:10,240
GILBERT STRANG: Arg min, thanks.

276
00:18:10,240 --> 00:18:16,620
Arg min of my function.

277
00:18:16,620 --> 00:18:18,900
And that means the place--

278
00:18:18,900 --> 00:18:24,918
the point where f equals f min.

279
00:18:28,200 --> 00:18:30,600
I haven't said yet what
the minimum value is.

280
00:18:30,600 --> 00:18:31,830
This tells us the point.

281
00:18:31,830 --> 00:18:34,290
And that's usually what
we're interested in.

282
00:18:34,290 --> 00:18:36,540
We're, to tell the
truth, not that

283
00:18:36,540 --> 00:18:40,470
interested in a typical example
and what the minimum value

284
00:18:40,470 --> 00:18:43,740
is as much as where is it?

285
00:18:43,740 --> 00:18:46,590
Where do we reach that thing?

286
00:18:46,590 --> 00:18:50,490
And of course, so this is x min.

287
00:18:50,490 --> 00:19:00,010
This is then arg min
of my function f.

288
00:19:00,010 --> 00:19:00,940
That's the point.

289
00:19:00,940 --> 00:19:04,420
And it happens to
be in this case,

290
00:19:04,420 --> 00:19:06,520
the minimum value is actually 0.

291
00:19:11,470 --> 00:19:15,190
Because there's no linear
term a transpose x.

292
00:19:20,080 --> 00:19:26,270
Why am I talking about arg
min when you've all seen it?

293
00:19:26,270 --> 00:19:28,990
I guess I think that
somebody could just

294
00:19:28,990 --> 00:19:34,750
be reading this stuff,
for example, learning

295
00:19:34,750 --> 00:19:40,740
about neural net, and run
into this expression arg min

296
00:19:40,740 --> 00:19:43,360
and think what's that?

297
00:19:43,360 --> 00:19:47,620
So it's maybe a right
time to say what it is.

298
00:19:47,620 --> 00:19:50,110
It's the point where
the minimum is reached.

299
00:19:52,930 --> 00:19:55,510
Why those words, by the way?

300
00:19:55,510 --> 00:19:57,280
Well, arg isn't much of a word.

301
00:19:57,280 --> 00:20:00,160
It sounds like you're
getting strangled.

302
00:20:00,160 --> 00:20:03,520
But it's sort of short.

303
00:20:03,520 --> 00:20:05,440
I assume it's short.

304
00:20:05,440 --> 00:20:07,300
Nobody ever told me this.

305
00:20:07,300 --> 00:20:10,210
I assume it's
short for argument.

306
00:20:10,210 --> 00:20:15,160
The word argument is a kind of
long word for the value of x.

307
00:20:15,160 --> 00:20:18,850
If I have a function
f of x, f, I

308
00:20:18,850 --> 00:20:23,770
call it function and x is the
argument of that function.

309
00:20:23,770 --> 00:20:27,430
You might more often
see the word variable.

310
00:20:27,430 --> 00:20:31,240
But argument-- and I'm assuming
that's what that refers to,

311
00:20:31,240 --> 00:20:35,430
it's the argument that
minimizes the function.

312
00:20:35,430 --> 00:20:37,180
OK, good.

313
00:20:37,180 --> 00:20:41,090
And here it is, S inverse a.

314
00:20:41,090 --> 00:20:43,180
Now but just by the
way, what is f min?

315
00:20:43,180 --> 00:20:45,730
Do you know the
minimum of a quadratic?

316
00:20:45,730 --> 00:20:49,750
I mean, this is the fundamental
minimization question,

317
00:20:49,750 --> 00:20:52,660
to minimize a quadratic.

318
00:20:52,660 --> 00:20:56,410
Electrical engineering, a
quadratic regulator problem

319
00:20:56,410 --> 00:20:58,280
is the simplest problem there.

320
00:20:58,280 --> 00:20:59,920
There could be constraints.

321
00:20:59,920 --> 00:21:03,070
And we'll see it with
constraints included.

322
00:21:03,070 --> 00:21:06,260
But right now, no
constraints at all.

323
00:21:06,260 --> 00:21:08,560
We're just looking at
the function f of x.

324
00:21:11,480 --> 00:21:15,040
Let me remove the
b, because that just

325
00:21:15,040 --> 00:21:18,130
shifts the function by b.

326
00:21:18,130 --> 00:21:22,710
If I erase that, just
to say it didn't matter.

327
00:21:22,710 --> 00:21:25,000
It's really that function.

328
00:21:25,000 --> 00:21:28,030
So that function
actually goes through 0.

329
00:21:28,030 --> 00:21:32,290
As it is, when x is
0, we obviously get 0.

330
00:21:32,290 --> 00:21:35,950
But it's still on its
way down, so to speak.

331
00:21:35,950 --> 00:21:40,090
It's on its way down to
this point, S inverse a.

332
00:21:40,090 --> 00:21:42,490
That's where it bottoms out.

333
00:21:42,490 --> 00:21:47,060
And when it bottoms out,
what do you get for f?

334
00:21:47,060 --> 00:21:49,660
One thing I know, it's
going to be negative

335
00:21:49,660 --> 00:21:53,620
because it passed through 0,
and it was on its way below 0.

336
00:21:53,620 --> 00:21:57,220
So let's just figure
out what that f min is.

337
00:21:57,220 --> 00:22:00,010
So I have a half.

338
00:22:00,010 --> 00:22:05,560
I'm just going to plug in S
inverse a, the bottom point

339
00:22:05,560 --> 00:22:11,860
into the function, and see
where the surface bottoms out

340
00:22:11,860 --> 00:22:15,700
and at what level
it bottoms out.

341
00:22:15,700 --> 00:22:17,200
So I have a half.

342
00:22:17,200 --> 00:22:23,320
So that's S inverse a is
a transpose S inverse.

343
00:22:23,320 --> 00:22:26,950
S symmetric, so I'll just
write this inverse transpose.

344
00:22:26,950 --> 00:22:33,520
S, S inverse a from
the quadratic term,

345
00:22:33,520 --> 00:22:37,770
minus a transpose.

346
00:22:37,770 --> 00:22:40,030
And x is S inverse a.

347
00:22:40,030 --> 00:22:42,580
Have you done this calculation?

348
00:22:42,580 --> 00:22:46,240
It just doesn't
hurt to repeat it.

349
00:22:46,240 --> 00:22:53,530
So I've plugged in S inverse
a there, there, and there.

350
00:22:53,530 --> 00:22:55,060
OK, what have I got?

351
00:22:55,060 --> 00:22:58,630
Well, S inverse
cancels S. So I have

352
00:22:58,630 --> 00:23:02,310
a half of a transpose
S inverse a minus 1

353
00:23:02,310 --> 00:23:04,150
of a transpose S inverse a.

354
00:23:04,150 --> 00:23:08,350
So I get finally
negative a half.

355
00:23:08,350 --> 00:23:15,850
Half of it minus one of it
of a transpose S inverse a.

356
00:23:15,850 --> 00:23:19,480
Sorry, that's not brilliant
use of the blackboard

357
00:23:19,480 --> 00:23:21,370
to squeeze that in there.

358
00:23:21,370 --> 00:23:26,380
But that's easily repeatable.

359
00:23:26,380 --> 00:23:29,770
OK, good.

360
00:23:29,770 --> 00:23:34,560
So that's what a quadratic bowl,
a perfect quadratic problem

361
00:23:34,560 --> 00:23:40,390
minimizes to. That's
its lowest level.

362
00:23:40,390 --> 00:23:45,390
Ooh, I wanted to mention
one other function,

363
00:23:45,390 --> 00:23:48,480
because I'm going to speak
mostly about quadratics,

364
00:23:48,480 --> 00:23:51,150
but obviously,
the whole point is

365
00:23:51,150 --> 00:23:56,520
that it's the convexity that's
really making things work.

366
00:23:56,520 --> 00:24:07,190
So here, let me just put here,
a remarkable convex function.

367
00:24:11,800 --> 00:24:20,690
And the notes tell what's the
gradient of this function.

368
00:24:20,690 --> 00:24:24,550
They don't actually go
as far as the Hessian.

369
00:24:24,550 --> 00:24:32,780
Proving that this function I'm
going to write down is convex,

370
00:24:32,780 --> 00:24:34,720
it takes a little thinking.

371
00:24:34,720 --> 00:24:37,810
But it's a fantastic function.

372
00:24:37,810 --> 00:24:41,922
You would never
sort of imagine it

373
00:24:41,922 --> 00:24:44,110
if you didn't see it sometime.

374
00:24:44,110 --> 00:24:48,580
So it's going to be a function
of a matrix, a function of--

375
00:24:48,580 --> 00:24:58,630
those are n squared
variables, xij.

376
00:24:58,630 --> 00:25:01,140
So it's a function
of many variables.

377
00:25:01,140 --> 00:25:03,220
And here is this function.

378
00:25:03,220 --> 00:25:07,300
It's you take the
determinant of the matrix.

379
00:25:07,300 --> 00:25:11,010
That's clearly a function of
all the n squared variables.

380
00:25:11,010 --> 00:25:15,810
Then you take the log
of the determinant

381
00:25:15,810 --> 00:25:21,840
and put in a minus sign
because we want convex.

382
00:25:21,840 --> 00:25:24,660
That turns out to be
a convex function.

383
00:25:24,660 --> 00:25:29,250
And even to just check that
for 2 by 2-- well, for 2 by 2

384
00:25:29,250 --> 00:25:32,190
you have four variables,
because it's a 2 by 2 matrix.

385
00:25:32,190 --> 00:25:35,160
We could maybe check it
for a symmetric matrix.

386
00:25:35,160 --> 00:25:37,170
I move it down to
three variables.

387
00:25:37,170 --> 00:25:45,540
But I'd be glad anybody
who's ambitious to see

388
00:25:45,540 --> 00:25:51,450
why that log determinant
is a remarkable function.

389
00:25:51,450 --> 00:25:52,650
And let me see.

390
00:25:56,040 --> 00:26:01,860
So the gradient of that
thing is also amazing.

391
00:26:01,860 --> 00:26:06,120
The gradient of that function--

392
00:26:06,120 --> 00:26:11,610
I'm going to peek so I don't
write the wrong fact here.

393
00:26:15,780 --> 00:26:19,800
So the partial derivatives
of that function

394
00:26:19,800 --> 00:26:23,190
are the entries of--

395
00:26:23,190 --> 00:26:26,220
these are the entries
of a, a inverse.

396
00:26:26,220 --> 00:26:27,960
That's the-- of x inverse.

397
00:26:38,360 --> 00:26:39,880
That's like, wow.

398
00:26:39,880 --> 00:26:42,130
Where did that come from?

399
00:26:42,130 --> 00:26:45,410
It might be minus the
entries, of course.

400
00:26:45,410 --> 00:26:46,930
Yeah, yeah, yeah.

401
00:26:46,930 --> 00:26:53,240
So we've got n
squared function--

402
00:26:53,240 --> 00:26:56,560
what is a typical
entry in x inverse?

403
00:26:56,560 --> 00:27:02,090
What does a typical
x inverse i, j?

404
00:27:02,090 --> 00:27:05,890
Just to remember
that bit of pretty

405
00:27:05,890 --> 00:27:09,910
old fashioned linear
algebra, the entries

406
00:27:09,910 --> 00:27:14,980
of the inverse matrix,
I'm sure to divide by what?

407
00:27:14,980 --> 00:27:17,200
The determinant, that's
the one thing we know.

408
00:27:21,720 --> 00:27:24,270
And that's the reason
we take the log,

409
00:27:24,270 --> 00:27:27,840
because when you take
derivatives of a log,

410
00:27:27,840 --> 00:27:31,680
that will put determinant
of x in the denominator.

411
00:27:31,680 --> 00:27:33,990
And then the numerator
will be the derivatives

412
00:27:33,990 --> 00:27:36,160
of the determinant of x.

413
00:27:36,160 --> 00:27:36,660
Oh!

414
00:27:36,660 --> 00:27:41,640
Can we get any idea what are the
derivatives of the determinant?

415
00:27:41,640 --> 00:27:43,596
Oh my god.

416
00:27:43,596 --> 00:27:46,410
How did I never get into this?

417
00:27:46,410 --> 00:27:50,090
So are you with me so far?

418
00:27:50,090 --> 00:27:54,350
This is going to be
derivatives of determinant,

419
00:27:54,350 --> 00:27:58,020
with respect to all
these variables divided

420
00:27:58,020 --> 00:28:02,130
by the determinant, because
that's what the log achieved.

421
00:28:02,130 --> 00:28:04,560
So when I take the derivative
of the log of something,

422
00:28:04,560 --> 00:28:12,060
the chain rule says take the
derivative of that something

423
00:28:12,060 --> 00:28:15,900
divide by the function
determinant of x.

424
00:28:15,900 --> 00:28:20,710
So what's the derivative of
the determinant of a matrix

425
00:28:20,710 --> 00:28:22,510
with respect to its 1, 1 entry?

426
00:28:22,510 --> 00:28:23,010
Yeah, sure.

427
00:28:23,010 --> 00:28:24,960
This is crazy.

428
00:28:24,960 --> 00:28:26,490
But it's crazy to be doing this.

429
00:28:26,490 --> 00:28:28,000
But it's healthy.

430
00:28:28,000 --> 00:28:28,500
OK.

431
00:28:31,960 --> 00:28:38,111
So I have a matrix X, da,
da, da, x11, x1n,

432
00:28:38,111 --> 00:28:43,400
et cetera, xn1, xnn.

433
00:28:43,400 --> 00:28:45,050
OK.

434
00:28:45,050 --> 00:28:46,440
And what am I looking for?

435
00:28:46,440 --> 00:28:52,160
I'm looking for that for
the derivatives of the--

436
00:28:52,160 --> 00:28:55,630
do I want the derivatives
of the determinant?

437
00:28:55,630 --> 00:28:57,550
Yes.

438
00:28:57,550 --> 00:29:05,470
So what's the derivative of x
of the determinant with respect

439
00:29:05,470 --> 00:29:10,100
to the first entry equals what?

440
00:29:13,780 --> 00:29:15,950
How can I figure out?

441
00:29:15,950 --> 00:29:17,810
So what's this asking me to do?

442
00:29:17,810 --> 00:29:22,790
It's asking me to change x11
by delta x and see what's

443
00:29:22,790 --> 00:29:25,980
the change in the determinant.

444
00:29:25,980 --> 00:29:28,220
That's what derivatives are.

445
00:29:28,220 --> 00:29:31,010
Change x11 a little bit.

446
00:29:31,010 --> 00:29:32,615
How much did the
determinant change?

447
00:29:36,150 --> 00:29:39,060
What has the determinant
of the whole matrix

448
00:29:39,060 --> 00:29:42,850
got to do with x11?

449
00:29:42,850 --> 00:29:47,270
You remember that there is
a formula for determinants.

450
00:29:47,270 --> 00:29:49,160
So I need that fact.

451
00:29:49,160 --> 00:29:55,600
The determinant of x is
x11 times something.

452
00:29:55,600 --> 00:29:58,510
Is that something that
I really want to know?

453
00:29:58,510 --> 00:30:01,870
Plus x12 times
other something plus

454
00:30:01,870 --> 00:30:06,348
say, along the first row
times another something.

455
00:30:09,340 --> 00:30:15,970
What are these
factors that multiply

456
00:30:15,970 --> 00:30:19,790
the x's to give the determinant?

457
00:30:19,790 --> 00:30:22,520
What [INAUDIBLE] a
linear combination

458
00:30:22,520 --> 00:30:27,340
of the first row times certain
factors gives the determinant?

459
00:30:27,340 --> 00:30:30,520
And how do I know that
there will be such factors?

460
00:30:30,520 --> 00:30:33,160
Because the fundamental
property of the determinant

461
00:30:33,160 --> 00:30:39,280
is that it's linear in row 1 if
I don't mess with other rows.

462
00:30:39,280 --> 00:30:43,240
It's a linear function of row 1.

463
00:30:43,240 --> 00:30:46,510
So it has the form x11
times something.

464
00:30:46,510 --> 00:30:48,284
And what is something?

465
00:30:48,284 --> 00:30:49,201
AUDIENCE: [INAUDIBLE].

466
00:30:49,201 --> 00:30:52,300
GILBERT STRANG: The
determinant of this.

467
00:30:52,300 --> 00:30:56,560
So what does x11 multiply
when you compute determinants?

468
00:30:56,560 --> 00:31:00,280
x11 will not multiply
any other guys in its row,

469
00:31:00,280 --> 00:31:02,920
because you're never
multiplying two

470
00:31:02,920 --> 00:31:06,280
x's in the same row
or the same column.

471
00:31:06,280 --> 00:31:10,210
What x11 is
multiplying is all these guys.

472
00:31:10,210 --> 00:31:15,040
And in fact, it turns out
to be the determinant.

473
00:31:15,040 --> 00:31:17,180
And what is this called?

474
00:31:17,180 --> 00:31:22,930
That one smaller determinant
that I get by throwing away

475
00:31:22,930 --> 00:31:24,970
the first row and first column?

476
00:31:24,970 --> 00:31:27,710
It's called a--

477
00:31:27,710 --> 00:31:28,880
Minor is good.

478
00:31:28,880 --> 00:31:30,860
Yes, minor is good.

479
00:31:30,860 --> 00:31:33,650
I was saying there are two
words that can be used,

480
00:31:33,650 --> 00:31:36,890
minor and co-factor.

481
00:31:42,860 --> 00:31:43,560
Yeah.

482
00:31:43,560 --> 00:31:44,740
And what is it?

483
00:31:44,740 --> 00:31:46,050
I mean, how do I compute it?

484
00:31:46,050 --> 00:31:47,367
What is the number?

485
00:31:47,367 --> 00:31:48,075
This is a number.

486
00:31:51,180 --> 00:31:52,110
It's just a number.

487
00:31:56,880 --> 00:32:01,090
Maybe I think of the minor
as this determinant--

488
00:32:01,090 --> 00:32:01,750
Ah!

489
00:32:01,750 --> 00:32:03,480
Let me cancel that.

490
00:32:03,480 --> 00:32:05,820
Maybe I think of the
minor as this smaller

491
00:32:05,820 --> 00:32:08,790
matrix, and the
co-factor, which is

492
00:32:08,790 --> 00:32:10,425
the determinant of the minor.

493
00:32:15,180 --> 00:32:16,890
And there is a plus or minus.

494
00:32:16,890 --> 00:32:20,250
Everything about
determinants, there's

495
00:32:20,250 --> 00:32:23,430
a plus or
minus choice to be made.

496
00:32:23,430 --> 00:32:27,600
And we're not going
to worry about that.

497
00:32:27,600 --> 00:32:33,325
But so anyway, so
it's the co-factor.

498
00:32:33,325 --> 00:32:35,300
Let me call it C11.

499
00:32:37,950 --> 00:32:42,690
And so that's the formula
for a determinant.

500
00:32:42,690 --> 00:32:46,842
That's the co-factor
expansion of a determinant.

501
00:32:54,230 --> 00:32:56,100
OK.

502
00:32:56,100 --> 00:32:59,400
And that will connect
back to this amazing fact

503
00:32:59,400 --> 00:33:02,790
that the gradient is the
entries of x inverse,

504
00:33:02,790 --> 00:33:07,720
because the inverse is the ratio
of co-factor to determinant.

505
00:33:07,720 --> 00:33:15,772
So the 1, 1 entry of x inverse is that
co-factor over the determinant.
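
The argument above can be written out as a short derivation. This is a sketch under the assumption that the entries of the matrix are treated as independent variables; the symbols X, x_{1j} and C_{1j} are notation chosen here for the matrix, its first-row entries and their cofactors. The key point is that no cofactor in the first-row expansion contains x_{11}.

```latex
\[
\det X = x_{11} C_{11} + x_{12} C_{12} + \cdots + x_{1n} C_{1n},
\qquad
\frac{\partial \det X}{\partial x_{11}} = C_{11},
\]
\[
\frac{\partial}{\partial x_{11}} \log \det X
  = \frac{1}{\det X}\,\frac{\partial \det X}{\partial x_{11}}
  = \frac{C_{11}}{\det X}
  = \left(X^{-1}\right)_{11}.
\]
```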

506
00:33:18,670 --> 00:33:20,190
Yeah.

507
00:33:20,190 --> 00:33:22,530
So that's where
this all comes from.
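
As a numerical sanity check of that fact, here is a minimal sketch in plain Python. The 2-by-2 matrix, the helper names `det2`, `inverse2` and `dlogdet_dx11`, and the finite-difference step `h` are all illustrative choices, not anything from the lecture. It compares a finite-difference derivative of log(det x) with respect to x11 against the 1, 1 entry of the inverse.

```python
import math

def det2(X):
    """Determinant of a 2-by-2 matrix given as nested lists."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inverse2(X):
    """Inverse of a 2-by-2 matrix: transposed cofactors divided by the determinant."""
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d],
            [-X[1][0] / d, X[0][0] / d]]

def dlogdet_dx11(X, h=1e-6):
    """Central finite difference of log(det X), changing only the 1, 1 entry."""
    Xp = [[X[0][0] + h, X[0][1]], [X[1][0], X[1][1]]]
    Xm = [[X[0][0] - h, X[0][1]], [X[1][0], X[1][1]]]
    return (math.log(det2(Xp)) - math.log(det2(Xm))) / (2 * h)

X = [[4.0, 1.0], [1.0, 3.0]]   # arbitrary symmetric positive definite example
numeric = dlogdet_dx11(X)      # derivative estimated by perturbing x11
exact = inverse2(X)[0][0]      # cofactor C11 (here x22) over the determinant
print(numeric, exact)
```

The two printed numbers should agree to several digits, which is the statement that the gradient of log(det x) is made of the entries of x inverse, checked here for one entry only.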

508
00:33:22,530 --> 00:33:32,670
Anyway, I'm just mentioning that
as a very interesting example

509
00:33:32,670 --> 00:33:35,820
of a convex function.

510
00:33:35,820 --> 00:33:37,270
OK.

511
00:33:37,270 --> 00:33:37,950
I'll leave that.

512
00:33:37,950 --> 00:33:41,740
That's just for like, education.

513
00:33:41,740 --> 00:33:43,080
OK.

514
00:33:43,080 --> 00:33:48,510
Now I'm ready to go to
work on gradient descent.

515
00:33:48,510 --> 00:33:52,260
So actually, the rest of
this class and Friday's class

516
00:33:52,260 --> 00:33:59,310
about gradient descent are very
fundamental parts of 18.065.

517
00:33:59,310 --> 00:34:01,750
And that will be
one of our examples.

518
00:34:01,750 --> 00:34:06,650
And then the general case here.

519
00:34:06,650 --> 00:34:11,040
So I'm using this.

520
00:34:11,040 --> 00:34:13,670
It would be interesting
to minimize that thing,

521
00:34:13,670 --> 00:34:15,409
but we're not going there.

522
00:34:15,409 --> 00:34:20,480
Let's hide it, so we
don't see it again.

523
00:34:20,480 --> 00:34:23,030
And I'll work with that example.

524
00:34:26,429 --> 00:34:28,610
So here's gradient descent.

525
00:34:37,770 --> 00:34:45,030
xk plus 1 is xk
minus Sk, the step size,

526
00:34:45,030 --> 00:34:47,760
times the gradient of f at xk.

527
00:34:52,922 --> 00:34:56,080
So the only thing
left that requires

528
00:34:56,080 --> 00:35:01,570
us to input some decision making
is the step size, the learning

529
00:35:01,570 --> 00:35:03,100
rate.

530
00:35:03,100 --> 00:35:06,520
We can take it as constant.

531
00:35:06,520 --> 00:35:09,170
If we take too big
a learning rate,

532
00:35:09,170 --> 00:35:12,130
the thing will oscillate
all over the place

533
00:35:12,130 --> 00:35:16,130
and it's a disaster.

534
00:35:16,130 --> 00:35:19,520
If we take too small a
learning rate, too small steps,

535
00:35:19,520 --> 00:35:22,600
what's the matter with that?

536
00:35:22,600 --> 00:35:24,190
Takes too long.

537
00:35:24,190 --> 00:35:26,260
Takes too long.

538
00:35:26,260 --> 00:35:30,400
So the problem is to
get it just right.
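
To make the trade-off concrete, here is a minimal sketch of fixed-step gradient descent on the lecture's quadratic, taken here as f(x, y) = x^2 + b*y^2 with gradient (2x, 2by). The value b = 0.1, the start (1, 1), the three step sizes and the name `gradient_descent` are assumptions picked for illustration.

```python
def gradient_descent(b, step, start=(1.0, 1.0), iterations=50):
    """Fixed-step gradient descent on f(x, y) = x^2 + b*y^2; returns the final f."""
    x, y = start
    for _ in range(iterations):
        gx, gy = 2.0 * x, 2.0 * b * y          # gradient (2x, 2by)
        x, y = x - step * gx, y - step * gy    # x_{k+1} = x_k - s * grad f(x_k)
    return x * x + b * y * y

b = 0.1
f_start = 1.0 + b                               # f at the start (1, 1)
too_big = gradient_descent(b, step=1.1)         # x is multiplied by 1 - 2.2 = -1.2 each step
too_small = gradient_descent(b, step=0.001)     # every step is tiny
moderate = gradient_descent(b, step=0.5)        # x goes to 0 at once, y shrinks by 0.9 per step
print(too_big, too_small, moderate)
```

With the large step the x coordinate flips sign and grows at every iteration, so f blows up. With the tiny step f has barely dropped after 50 iterations. The moderate step does much better, although it is not tuned in any careful way.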

539
00:35:30,400 --> 00:35:32,560
And one way that you
could say get it right

540
00:35:32,560 --> 00:35:37,030
would be to think: optimize.

541
00:35:37,030 --> 00:35:38,920
Choose the optimal Sk.

542
00:35:38,920 --> 00:35:43,450
Of course, that takes longer
than just deciding an Sk

543
00:35:43,450 --> 00:35:46,370
in advance, which
is what people do.

544
00:35:46,370 --> 00:35:51,760
So I'll tell you what people
do on really big problems is

545
00:35:51,760 --> 00:35:53,160
take an Sk--

546
00:35:53,160 --> 00:35:57,520
estimate a suitable Sk, and
then go with it for a while.

547
00:35:57,520 --> 00:36:02,830
And then look back to
see. If it was too big,

548
00:36:02,830 --> 00:36:05,310
they'll see oscillations.

549
00:36:05,310 --> 00:36:09,220
It'll be bouncing
all over the place.

550
00:36:09,220 --> 00:36:13,525
Or of course, an
exact line search--

551
00:36:16,730 --> 00:36:19,090
so you see this
expression often.

552
00:36:19,090 --> 00:36:30,810
The exact line search chooses
Sk to make my function

553
00:36:30,810 --> 00:36:44,020
f at xk plus 1 a minimum on
the line, on the search line,

554
00:36:44,020 --> 00:36:48,235
a minimum in the
search direction.

555
00:36:54,175 --> 00:36:57,940
The search direction is
given by the gradient.

556
00:36:57,940 --> 00:36:59,770
That's the direction
we're moving.

557
00:36:59,770 --> 00:37:02,260
This is the distance
we're moving,

558
00:37:02,260 --> 00:37:05,440
or measure of the
distance we're moving.

559
00:37:05,440 --> 00:37:09,580
And an exact search would
be to go along there.

560
00:37:09,580 --> 00:37:14,110
If I have a convex function,
then as I move along this line,

561
00:37:14,110 --> 00:37:19,350
as I increase Sk, I'll see
the function start down,

562
00:37:19,350 --> 00:37:25,380
because the gradient,
negative gradient means down.

563
00:37:25,380 --> 00:37:28,080
But at some point
it'll turn up again.

564
00:37:28,080 --> 00:37:33,220
And an exact line search would
find that point and stop there.
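
For a pure quadratic the turning point can be found in closed form. The following is a sketch under the assumption f(x) = (1/2) x^T S x with S symmetric positive definite; the names g_k for the gradient and phi for the one-variable function along the search line are notation introduced here.

```latex
\[
\varphi(s) = f(x_k - s\,g_k), \qquad g_k = \nabla f(x_k) = S x_k, \qquad
s_k = \arg\min_{s \ge 0} \varphi(s).
\]
\[
\varphi'(s) = -g_k^{\mathsf T} S\,(x_k - s\,g_k)
            = -g_k^{\mathsf T} g_k + s\, g_k^{\mathsf T} S\, g_k = 0
\quad\Longrightarrow\quad
s_k = \frac{g_k^{\mathsf T} g_k}{g_k^{\mathsf T} S\, g_k}.
\]
```

For a general convex function there is usually no such formula, and the minimum along the line has to be located numerically.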

565
00:37:36,310 --> 00:37:38,860
That doesn't mean we would--

566
00:37:38,860 --> 00:37:40,600
we will see in
this example where

567
00:37:40,600 --> 00:37:46,960
we will do exact line searches
that for a small value of b,

568
00:37:46,960 --> 00:37:51,790
it's extremely slow, that
the condition number controls

569
00:37:51,790 --> 00:37:52,660
the speed.

570
00:37:52,660 --> 00:37:55,330
That's really what
my message will

571
00:37:55,330 --> 00:37:59,050
be just in these last
minutes and next time

572
00:37:59,050 --> 00:38:03,340
the sort of key lecture
on gradient descent.

573
00:38:03,340 --> 00:38:06,670
So an exact line
search would be that.

574
00:38:06,670 --> 00:38:09,070
So what a backtracking
line search--

575
00:38:15,880 --> 00:38:24,670
backtracking would be
take a fixed S like one.

576
00:38:24,670 --> 00:38:32,290
And then be prepared
to come backwards.

577
00:38:32,290 --> 00:38:34,060
Cut back by half.

578
00:38:34,060 --> 00:38:36,250
See what you get at that point.

579
00:38:36,250 --> 00:38:40,180
Cut back by half of that to a
quarter of the original step.

580
00:38:40,180 --> 00:38:41,200
See what that is.

581
00:38:44,650 --> 00:38:48,970
So the full step might
have taken you back

582
00:38:48,970 --> 00:38:52,450
to the upward sweep.

583
00:38:52,450 --> 00:38:55,420
Halfway forward it might
still be on the upward sweep.

584
00:38:55,420 --> 00:39:00,760
Might be too much, but so
backtracking cuts the step size

585
00:39:00,760 --> 00:39:04,840
in pieces and checks until it--

586
00:39:08,440 --> 00:39:13,180
So S0, half of
S0, quarter of S0,

587
00:39:13,180 --> 00:39:18,250
or obviously, a different
parameter, aS0, a squared S0,

588
00:39:18,250 --> 00:39:25,720
and so on until you're
satisfied with that step.
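
A backtracking search along these lines can be sketched as follows. The lecture does not say what "satisfied" means, so this sketch assumes a sufficient-decrease (Armijo) test with constant `c`; the function name `backtracking_step`, the parameters `s0`, `shrink`, `c`, `max_halvings`, and the test problem are all hypothetical choices.

```python
def backtracking_step(f, grad, point, s0=1.0, shrink=0.5, c=0.3, max_halvings=50):
    """Try s0, shrink*s0, shrink^2*s0, ... and return the first acceptable step size."""
    g = grad(point)
    g_norm_sq = sum(gi * gi for gi in g)
    f_here = f(point)
    s = s0
    for _ in range(max_halvings):
        trial = [p - s * gi for p, gi in zip(point, g)]
        # Assumed acceptance rule: f must drop by at least c * s * |grad f|^2.
        if f(trial) <= f_here - c * s * g_norm_sq:
            return s
        s *= shrink                      # cut the step back and try again
    return s

b = 0.1
f = lambda p: p[0] ** 2 + b * p[1] ** 2
grad = lambda p: [2.0 * p[0], 2.0 * b * p[1]]
start = [1.0, 1.0]
s = backtracking_step(f, grad, start)
new_point = [p - s * gi for p, gi in zip(start, grad(start))]
print(s, f(start), f(new_point))
```

In this example the full step s0 = 1 overshoots across the valley and is rejected, and the half step is accepted. The cost is a few extra function evaluations per iteration, with no attempt to find the exact minimum on the line.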

589
00:39:25,720 --> 00:39:28,070
And there are of course,
many, many refinements.

590
00:39:28,070 --> 00:39:31,810
We're talking about
the big algorithm

591
00:39:31,810 --> 00:39:40,260
here that everybody has,
depending on their function,

592
00:39:40,260 --> 00:39:44,250
has different experiences with.

593
00:39:44,250 --> 00:39:46,670
So here's my
fundamental question.

594
00:39:50,580 --> 00:39:53,610
Let's think of an
exact line search.

595
00:39:53,610 --> 00:39:57,700
How much does that
reduce the function?

596
00:39:57,700 --> 00:40:00,400
How much does that
reduce the function?

597
00:40:00,400 --> 00:40:05,380
So that's really what the
bounds that I want are.

598
00:40:05,380 --> 00:40:08,440
How much does that
reduce the function?

599
00:40:08,440 --> 00:40:24,320
And we'll see that the reduction
involves the condition number,

600
00:40:24,320 --> 00:40:32,730
m over M. So why don't I
turn to the example first?

601
00:40:32,730 --> 00:40:37,260
And then where we
know exact answers.

602
00:40:37,260 --> 00:40:39,980
That gives us a
basis for comparison.

603
00:40:39,980 --> 00:40:46,150
And then our math
goal is prove--

604
00:40:46,150 --> 00:40:50,050
get estimated bounds
on the size of f

605
00:40:50,050 --> 00:40:55,330
that match what we see
exactly in that example

606
00:40:55,330 --> 00:40:58,120
where we know everything.

607
00:40:58,120 --> 00:41:01,510
We know the gradient.

608
00:41:01,510 --> 00:41:03,140
We know the Hessian.

609
00:41:03,140 --> 00:41:04,090
It's that matrix.

610
00:41:04,090 --> 00:41:05,650
We know the condition number.

611
00:41:05,650 --> 00:41:08,440
So what happens if
I start at a point

612
00:41:08,440 --> 00:41:15,105
x0 y0 that's on my surface?

613
00:41:19,110 --> 00:41:20,230
Sorry.

614
00:41:20,230 --> 00:41:22,710
What do I want to do here?

615
00:41:22,710 --> 00:41:23,250
Yeah.

616
00:41:23,250 --> 00:41:31,080
I take a point, x0
y0 and I iterate.

617
00:41:34,350 --> 00:41:54,040
So the new x, y at k plus
1 is x, y at k minus the S,

618
00:41:54,040 --> 00:41:56,940
which I can compute
times the gradient of f.

619
00:41:56,940 --> 00:41:58,710
So I'm going to
put in gradient f.

620
00:41:58,710 --> 00:42:00,030
What is the gradient here?

621
00:42:02,790 --> 00:42:05,790
The derivative with
respect to x.

622
00:42:05,790 --> 00:42:11,970
So I have 2xk and 2byk.

623
00:42:16,630 --> 00:42:18,244
And this is the step size.

624
00:42:22,120 --> 00:42:25,450
And for this small
problem where we're

625
00:42:25,450 --> 00:42:27,940
going to get such
a revealing answer,

626
00:42:27,940 --> 00:42:29,860
I'm going to choose
exact line search.

627
00:42:29,860 --> 00:42:31,240
I'm going to choose the best Sk.

628
00:42:34,040 --> 00:42:35,240
And what's the answer?

629
00:42:35,240 --> 00:42:39,500
So I just want to tell you
what the iterations are

630
00:42:39,500 --> 00:42:43,520
for that particular
function starting at x0 y0.

631
00:42:46,080 --> 00:42:51,460
So let me put start x0 y0.

632
00:42:54,810 --> 00:42:56,790
And I haven't done this
calculation myself.

633
00:42:56,790 --> 00:43:01,470
It's taken from the book by
Stephen Boyd and Vandenberghe

634
00:43:01,470 --> 00:43:03,240
called Convex Optimization.

635
00:43:03,240 --> 00:43:06,010
Of course, they weren't the
first to do this either.

636
00:43:06,010 --> 00:43:11,580
But I'm happy to mention that
book Convex Optimization.

637
00:43:11,580 --> 00:43:14,160
And Stephen Boyd will be
on campus this spring

638
00:43:14,160 --> 00:43:18,180
actually, in April
for three lectures.

639
00:43:18,180 --> 00:43:20,010
This is April, maybe.

640
00:43:20,010 --> 00:43:21,010
Yeah, OK.

641
00:43:21,010 --> 00:43:24,400
So it's this month in
two or three weeks.

642
00:43:24,400 --> 00:43:26,470
And I'll tell you about that.

643
00:43:26,470 --> 00:43:34,820
So here are the xk's and the
yk's and the f and the function

644
00:43:34,820 --> 00:43:35,320
values.

645
00:43:40,190 --> 00:43:41,400
So where am I going to start?

646
00:43:44,840 --> 00:43:45,440
Yeah.

647
00:43:45,440 --> 00:43:50,480
So I'm starting from the
point x0, y0 equal b, 1.

648
00:43:50,480 --> 00:43:54,110
Turns out that will make our
formulas very convenient,

649
00:43:54,110 --> 00:43:57,500
x0, y0 equals b, 1.

650
00:43:57,500 --> 00:43:58,340
Good.

651
00:43:58,340 --> 00:44:00,530
So OK.

652
00:44:00,530 --> 00:44:09,260
So xk is b times the key
ratio b minus 1 over b plus 1

653
00:44:09,260 --> 00:44:11,420
to the kth power.

654
00:44:11,420 --> 00:44:15,335
And yk happens to be--

655
00:44:20,270 --> 00:44:24,020
it has this same ratio.

656
00:44:24,020 --> 00:44:29,600
And my function f has
the same ratio too.

657
00:44:29,600 --> 00:44:30,815
This is fk.

658
00:44:30,815 --> 00:44:34,010
It has that same
ratio 1 minus b over 1

659
00:44:34,010 --> 00:44:39,710
plus b to the 2kth times f0.

660
00:44:39,710 --> 00:44:51,160
That's the beautiful
formula that we're

661
00:44:51,160 --> 00:44:54,450
going to take as the
best example possible.

662
00:44:54,450 --> 00:44:55,160
Let's just see.

663
00:44:55,160 --> 00:45:04,800
If k equals 0, I have xk equal
b, yk equal 1, starting at b, 1.

664
00:45:04,800 --> 00:45:09,690
And that tells me the rate
of decrease of the function.

665
00:45:09,690 --> 00:45:11,680
It's this same ratio.
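
These formulas can be checked numerically. The sketch below assumes f(x, y) = x^2 + b*y^2 (gradient 2x, 2by, as on the board) and uses the closed-form exact step for a quadratic, (g.g)/(g.Hg) with H = diag(2, 2b). The name `exact_line_search_path` and the choice b = 0.1 are illustrative.

```python
def exact_line_search_path(b, steps):
    """Steepest descent with exact line search on f(x, y) = x^2 + b*y^2 from (b, 1)."""
    x, y = b, 1.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = 2.0 * x, 2.0 * b * y
        # Exact minimizing step for a quadratic with Hessian diag(2, 2b).
        s = (gx * gx + gy * gy) / (2.0 * gx * gx + 2.0 * b * gy * gy)
        x, y = x - s * gx, y - s * gy
        path.append((x, y))
    return path

b = 0.1
r = (b - 1.0) / (b + 1.0)                       # the key ratio
path = exact_line_search_path(b, steps=10)
formula = [(b * r ** k, (-r) ** k) for k in range(11)]
max_error = max(abs(px - fx) + abs(py - fy)
                for (px, py), (fx, fy) in zip(path, formula))
f_values = [x * x + b * y * y for x, y in path]
print(max_error, f_values[1] / f_values[0], r * r)
```

In this sketch the computed iterates match xk = b r^k and yk = (-r)^k up to rounding. Because f is quadratic in x and y, the function value drops by the square of the ratio at each step.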

666
00:45:11,680 --> 00:45:14,730
So what am I learning
from this example?

667
00:45:14,730 --> 00:45:20,365
What's jumping out is that this
ratio 1 minus b over 1 plus b

668
00:45:20,365 --> 00:45:20,865
is crucial.

669
00:45:25,920 --> 00:45:29,500
If b is near 1,
that ratio is small.

670
00:45:29,500 --> 00:45:32,870
If b is near 1,
that's near 0 over 2.

671
00:45:32,870 --> 00:45:36,070
And I converge quickly,
no problem at all.

672
00:45:36,070 --> 00:45:42,490
But if b is near 0, if my
condition number is bad--

673
00:45:42,490 --> 00:45:51,430
so the bad case, the
hard case is small b.

674
00:45:55,200 --> 00:46:01,300
Of course, when b is small,
that ratio is very near 1.

675
00:46:01,300 --> 00:46:02,590
It's below 1.

676
00:46:02,590 --> 00:46:06,220
The ratio is below 1, so
I'm getting convergence.

677
00:46:06,220 --> 00:46:07,360
I do get convergence.

678
00:46:07,360 --> 00:46:09,460
I do go downhill.

679
00:46:09,460 --> 00:46:13,810
But what happens is I don't
go downhill very far until I'm

680
00:46:13,810 --> 00:46:15,910
headed back uphill again.

681
00:46:15,910 --> 00:46:20,720
So the picture to
draw for this--

682
00:46:20,720 --> 00:46:26,070
let me change that picture
to a picture in the xy

683
00:46:26,070 --> 00:46:29,400
plane of the level sets.

684
00:46:29,400 --> 00:46:33,870
So the picture really to
see is in the xy plane.

685
00:46:33,870 --> 00:46:37,395
The level sets f equal constant.

686
00:46:37,395 --> 00:46:38,940
That's what a level set is.

687
00:46:38,940 --> 00:46:43,570
It's a set of points, x and
y where f has the same value.

688
00:46:43,570 --> 00:46:46,510
And what do those look like?

689
00:46:46,510 --> 00:46:48,000
Oh, let's see.

690
00:46:50,920 --> 00:46:53,680
I think-- what do you think?

691
00:46:53,680 --> 00:46:59,860
What do the level sets look like
for this particular function?

692
00:46:59,860 --> 00:47:04,520
If I look at the curve x
squared plus b y squared equal

693
00:47:04,520 --> 00:47:07,240
a constant, that's
what the level set is.

694
00:47:07,240 --> 00:47:13,620
This is x squared plus by
squared equal a constant.

695
00:47:13,620 --> 00:47:16,402
What kind of a curve is that?

696
00:47:16,402 --> 00:47:17,330
AUDIENCE: [INAUDIBLE].

697
00:47:17,330 --> 00:47:19,470
GILBERT STRANG:
That's an ellipse.

698
00:47:19,470 --> 00:47:21,900
And what's up with that ellipse?

699
00:47:21,900 --> 00:47:24,750
What's the shape of it?

700
00:47:24,750 --> 00:47:27,960
Because there is no
xy term, that ellipse

701
00:47:27,960 --> 00:47:33,180
is like, well lined
up with the axes.

702
00:47:33,180 --> 00:47:37,770
The major axes of the ellipse
are in the x and y directions,

703
00:47:37,770 --> 00:47:42,150
because there is
no cross term here.

704
00:47:42,150 --> 00:47:46,020
We could always have
diagonalized our matrix

705
00:47:46,020 --> 00:47:47,623
if it wasn't diagonal.

706
00:47:47,623 --> 00:47:49,290
And that wouldn't
have changed anything.

707
00:47:49,290 --> 00:47:52,740
So it's just
rotating this space.

708
00:47:52,740 --> 00:47:54,090
And we've done that.

709
00:47:57,570 --> 00:47:59,130
What do the level
sets look like?

710
00:47:59,130 --> 00:48:00,870
They're ellipses.

711
00:48:00,870 --> 00:48:06,690
And suppose b is a small number,
then what's with the ellipses?

712
00:48:06,690 --> 00:48:10,530
If b is small, I
have to go pretty--

713
00:48:10,530 --> 00:48:14,070
I have to take a pretty
large y to match a--

714
00:48:14,070 --> 00:48:15,090
change in x.

715
00:48:15,090 --> 00:48:18,340
I think maybe they're
ellipses of that sort.

716
00:48:18,340 --> 00:48:18,840
Are they?

717
00:48:24,220 --> 00:48:26,780
They're lined up with the axes.

718
00:48:26,780 --> 00:48:30,610
And I hope I'm drawing
in the right direction.

719
00:48:30,610 --> 00:48:33,807
They're long and thin.

720
00:48:33,807 --> 00:48:34,390
Is that right?

721
00:48:34,390 --> 00:48:36,880
Because I would have
to take a pretty big y

722
00:48:36,880 --> 00:48:40,120
to make up for a small b.

723
00:48:40,120 --> 00:48:41,830
OK.

724
00:48:41,830 --> 00:48:44,140
So what happens
when I'm descending?

725
00:48:44,140 --> 00:48:45,910
This is a narrow valley then.

726
00:48:45,910 --> 00:48:52,240
Think of it as a valley
which comes down steeply

727
00:48:52,240 --> 00:48:54,730
in the y direction,
but in the x direction

728
00:48:54,730 --> 00:48:57,560
I'm crossing the valley slow--

729
00:48:57,560 --> 00:49:00,250
Oh, is that right?

730
00:49:00,250 --> 00:49:04,300
So what happens if I
take a point there?

731
00:49:04,300 --> 00:49:06,690
Oh yeah, I remember what to do.

732
00:49:06,690 --> 00:49:10,850
So let's start at that
point on that ellipse.

733
00:49:14,070 --> 00:49:17,490
And those were the level
sets f equal constant.

734
00:49:17,490 --> 00:49:20,980
So what's the first
search direction?

735
00:49:20,980 --> 00:49:23,320
What direction do
I move from x0 y0?

736
00:49:28,510 --> 00:49:31,210
Do I move along the ellipse?

737
00:49:31,210 --> 00:49:35,490
Absolutely not, because along
the ellipse f is constant.

738
00:49:35,490 --> 00:49:39,430
The gradient direction is
perpendicular to the ellipse.

739
00:49:39,430 --> 00:49:42,280
So I move perpendicular
to the ellipse.

740
00:49:42,280 --> 00:49:43,285
And when do I stop?

741
00:49:47,040 --> 00:49:50,930
Pretty soon, because very
soon I'm going back up again.

742
00:50:02,410 --> 00:50:04,120
I haven't practiced
with this curve.

743
00:50:04,120 --> 00:50:08,400
But I know-- and time
is up, thank God.

744
00:50:08,400 --> 00:50:10,780
So what do I know
is going to happen?

745
00:50:10,780 --> 00:50:13,780
And by Friday we'll
make it happen.

746
00:50:13,780 --> 00:50:22,840
So what do we see for the
curve, the track of the--

747
00:50:22,840 --> 00:50:24,776
it's-- say it?

748
00:50:24,776 --> 00:50:25,770
AUDIENCE: Zigzag.

749
00:50:25,770 --> 00:50:28,110
GILBERT STRANG:
It's a zigzag, yeah.

750
00:50:28,110 --> 00:50:31,110
We would like to get here, but
we're not aimed here at all.

751
00:50:31,110 --> 00:50:36,000
So we zig, zig, zig zag,
and very slowly approach

752
00:50:36,000 --> 00:50:36,540
that point.

753
00:50:39,210 --> 00:50:41,910
And how slowly?

754
00:50:41,910 --> 00:50:48,990
With that multiplier, 1
minus b over 1 plus b.

755
00:50:48,990 --> 00:50:51,000
That's what I'm learning
from this example,

756
00:50:51,000 --> 00:50:53,010
that that's a key number.

757
00:50:53,010 --> 00:50:56,760
And then you could ask, well,
what about general examples?

758
00:50:56,760 --> 00:51:01,470
This was one specially chosen
example with an exact solution.

759
00:51:01,470 --> 00:51:04,530
Well, we'll see at the
beginning of next time

760
00:51:04,530 --> 00:51:08,400
that for a convex
function this is typical.

761
00:51:08,400 --> 00:51:14,550
This 1 minus b is the
critical quantity, or 1 over b,

762
00:51:14,550 --> 00:51:17,760
or how small
is b compared to 1.

763
00:51:17,760 --> 00:51:20,110
So that will be the
critical quantity.

764
00:51:20,110 --> 00:51:24,390
And we see it in this ratio
1 minus b over 1 plus b.

765
00:51:24,390 --> 00:51:30,210
So if b is 1/100, this
is 0.99 over 1.01.

766
00:51:30,210 --> 00:51:31,830
It's virtually 1.
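
To get a feel for what "virtually 1" costs, here is a small sketch that counts how many steps the factor (1 - b)/(1 + b) needs before its kth power falls below a tolerance. The function name `steps_to_shrink` and the tolerance 1e-6 are arbitrary illustrative choices.

```python
import math

def steps_to_shrink(b, tolerance=1e-6):
    """Smallest k with |(1 - b)/(1 + b)|**k <= tolerance."""
    ratio = abs(1.0 - b) / (1.0 + b)
    if ratio == 0.0:
        return 1  # b = 1: the level sets are circles and one exact step reaches the minimum
    return math.ceil(math.log(tolerance) / math.log(ratio))

for b in (0.5, 0.1, 0.01):
    print(b, (1.0 - b) / (1.0 + b), steps_to_shrink(b))
```

The step count grows roughly in proportion to 1 over b as b gets small, which is the sense in which the condition number controls the speed.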

767
00:51:31,830 --> 00:51:32,460
OK.

768
00:51:32,460 --> 00:51:36,780
So next time is a
sort of a key lecture

769
00:51:36,780 --> 00:51:43,380
to see what I've just
said, that this controls

770
00:51:43,380 --> 00:51:46,440
the convergence of
steepest descent,

771
00:51:46,440 --> 00:51:51,130
and then to see an
idea that speeds it up.

772
00:51:51,130 --> 00:51:54,660
That idea is called
momentum or heavy ball.

773
00:51:54,660 --> 00:52:02,820
So the physical idea is if you
had a heavy ball right there

774
00:52:02,820 --> 00:52:06,930
and wanted to get it down
the valley toward the bottom,

775
00:52:06,930 --> 00:52:10,650
you wouldn't go perpendicular
to the level sets.

776
00:52:10,650 --> 00:52:11,280
Not at all.

777
00:52:11,280 --> 00:52:13,680
You'd let the momentum
of the ball take over

778
00:52:13,680 --> 00:52:16,990
and let it roll down.

779
00:52:16,990 --> 00:52:21,500
So the idea of momentum is
to model the possibility

780
00:52:21,500 --> 00:52:26,240
of letting that heavy ball
roll instead of directing it

781
00:52:26,240 --> 00:52:30,380
by the steepest
descent at every point.

782
00:52:30,380 --> 00:52:34,280
So there's an extra term in
steepest descent, the momentum

783
00:52:34,280 --> 00:52:36,230
term that accelerates.

784
00:52:36,230 --> 00:52:36,860
OK.

785
00:52:36,860 --> 00:52:39,530
So Friday is the day.

786
00:52:39,530 --> 00:52:40,190
Good.

787
00:52:40,190 --> 00:52:42,130
See you then.