1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:16,670 from hundreds of MIT courses, visit 7 00:00:16,670 --> 00:00:18,540 MIT OpenCourseWare at ocw.mit.edu. 8 00:00:24,170 --> 00:00:29,070 GILBERT STRANG: So I'm going to talk about gradient descent 9 00:00:29,070 --> 00:00:32,580 today to get to that central algorithm 10 00:00:32,580 --> 00:00:38,190 of neural nets, deep learning, machine learning, 11 00:00:38,190 --> 00:00:40,530 and optimization in general. 12 00:00:40,530 --> 00:00:43,230 So I'm trying to minimize a function. 13 00:00:43,230 --> 00:00:50,400 And that's the way you do it if there are many, many variables, 14 00:00:50,400 --> 00:00:52,890 too many to take second derivatives. 15 00:00:52,890 --> 00:00:56,880 Then we settle for first derivatives of the function. 16 00:00:56,880 --> 00:00:59,610 So I introduced, and you've already 17 00:00:59,610 --> 00:01:01,610 met the idea of gradient. 18 00:01:01,610 --> 00:01:04,470 But let me just be sure to make some comments 19 00:01:04,470 --> 00:01:07,410 about the gradient and the Hessian 20 00:01:07,410 --> 00:01:15,610 and the role of convexity before we see the big crucial example. 21 00:01:15,610 --> 00:01:19,425 So I've kind of prepared over here for this crucial example. 22 00:01:22,010 --> 00:01:26,820 The function is a pure quadratic, two unknowns, x 23 00:01:26,820 --> 00:01:30,240 and y, pure quadratic. 24 00:01:30,240 --> 00:01:34,620 So every pure quadratic I can write in terms 25 00:01:34,620 --> 00:01:37,160 of a symmetric matrix S. 
26 00:01:37,160 --> 00:01:42,890 And in this case, x1 squared plus b x2 squared, the symmetric 27 00:01:42,890 --> 00:01:45,810 matrix is just 2 by 2. 28 00:01:45,810 --> 00:01:47,040 It's diagonal. 29 00:01:47,040 --> 00:01:52,440 It's got eigenvalues 1 and b sitting on the diagonal. 30 00:01:52,440 --> 00:01:56,020 I'm thinking of b as being the smaller one. 31 00:01:56,020 --> 00:02:00,720 So the condition number, which we'll see 32 00:02:00,720 --> 00:02:07,230 is all important in the question of the speed of convergence, 33 00:02:07,230 --> 00:02:13,260 is the ratio of the largest to the smallest. 34 00:02:13,260 --> 00:02:17,310 In this case, the largest is 1, the smallest is b. 35 00:02:17,310 --> 00:02:19,260 So that's 1 over b. 36 00:02:19,260 --> 00:02:23,370 And when 1 over b is a big number, 37 00:02:23,370 --> 00:02:26,130 when b is a very small number, then that's 38 00:02:26,130 --> 00:02:27,090 when we're in trouble. 39 00:02:31,560 --> 00:02:34,380 When the matrix is symmetric, that condition number 40 00:02:34,380 --> 00:02:37,620 is lambda max over lambda min. 41 00:02:37,620 --> 00:02:40,830 If I had an unsymmetric matrix, I 42 00:02:40,830 --> 00:02:44,360 would probably use sigma max over sigma min, of course. 43 00:02:44,360 --> 00:02:48,660 But here, matrices are symmetric. 44 00:02:48,660 --> 00:02:52,170 We're going to see something neat, 45 00:02:52,170 --> 00:02:58,260 which is that we can actually take the steps of steepest descent, 46 00:02:58,260 --> 00:03:01,440 write down what each step gives us, 47 00:03:01,440 --> 00:03:05,310 and see how quickly they converge to the answer. 48 00:03:05,310 --> 00:03:07,220 And what is the answer? 49 00:03:07,220 --> 00:03:11,370 So I haven't put in any linear term here. 50 00:03:11,370 --> 00:03:14,730 So I just have a bowl sitting on the origin. 51 00:03:14,730 --> 00:03:18,990 So of course, the minimum point is x equal 0, y equals 0. 
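As a rough illustration of the example being set up here, the sketch below runs steepest descent on the bowl with S = diag(1, b), whose condition number is 1/b. It assumes the function is f(x, y) = 0.5 * (x**2 + b * y**2), with the factor 1/2 taken from the general form 1/2 x transpose S x, and it assumes an exact line search for the step size. The name `steepest_descent` and the starting points are hypothetical, chosen only for this sketch.

```python
def f(b, x, y):
    """Assumed form of the bowl: f(x, y) = 0.5 * (x**2 + b * y**2)."""
    return 0.5 * (x * x + b * y * y)


def steepest_descent(b, x0, y0, steps):
    """Steepest descent on f with an exact line search (an assumption of this sketch)."""
    x, y = x0, y0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = x, b * y                # gradient is S times (x, y), with S = diag(1, b)
        gg = gx * gx + gy * gy           # g^T g
        if gg == 0.0:                    # already at the minimum point (0, 0)
            break
        gSg = gx * gx + b * gy * gy      # g^T S g
        alpha = gg / gSg                 # step that minimizes f along the direction -g
        x, y = x - alpha * gx, y - alpha * gy
        path.append((x, y))
    return path


# Compare a well conditioned bowl (b = 1) with a badly conditioned one (b = 0.01).
for b in (1.0, 0.1, 0.01):
    last = steepest_descent(b, b, 1.0, 20)[-1]
    print(b, f(b, *last))
```

With b = 1 the first step lands exactly on the origin. With small b, starting from (b, 1), each step only shrinks the iterate by the factor (1 - b)/(1 + b), which is close to 1 when the condition number 1/b is large.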
52 00:03:18,990 --> 00:03:26,050 So the minimum point x star, is 0, 0, of course. 53 00:03:26,050 --> 00:03:29,670 So the question will be how quickly do we get to that one. 54 00:03:29,670 --> 00:03:33,450 And you will say pretty small example, not typical. 55 00:03:33,450 --> 00:03:37,080 But the terrific thing is that we see 56 00:03:37,080 --> 00:03:38,890 everything for this example. 57 00:03:38,890 --> 00:03:43,380 We can see the actual steps of steepest descent. 58 00:03:43,380 --> 00:03:45,600 We can see how quickly they converge 59 00:03:45,600 --> 00:03:50,730 to the x star, the answer, the place 60 00:03:50,730 --> 00:03:52,890 where this thing is a minimum. 61 00:03:52,890 --> 00:04:01,440 And we can begin to think what to do if it's too slow. 62 00:04:01,440 --> 00:04:06,930 So I'll come to that example after some general thoughts 63 00:04:06,930 --> 00:04:09,840 about gradients, Hessians. 64 00:04:09,840 --> 00:04:12,300 So what does the gradient tell us? 65 00:04:12,300 --> 00:04:14,745 So let me just take an example of the gradient. 66 00:04:17,860 --> 00:04:23,980 Let me take a linear function, f of xy equals say, 2x plus 5y. 67 00:04:26,560 --> 00:04:31,540 I just think we ought to get totally familiar with these. 68 00:04:31,540 --> 00:04:33,910 We're doing something. 69 00:04:33,910 --> 00:04:38,800 We're jumping into an important topic. 70 00:04:38,800 --> 00:04:41,440 When I ask you what's the gradient, 71 00:04:41,440 --> 00:04:43,780 that's a freshman question. 72 00:04:43,780 --> 00:04:48,460 But let's just be sure we know how to interpret the gradient, 73 00:04:48,460 --> 00:04:51,970 how to compute it, what it means, 74 00:04:51,970 --> 00:04:54,200 how to see it geometrically. 75 00:04:54,200 --> 00:04:56,650 So what's the gradient of that function? 76 00:04:56,650 --> 00:04:58,380 It's a function of two variables. 77 00:04:58,380 --> 00:05:02,110 So the gradient is a vector with two components. 
78 00:05:02,110 --> 00:05:02,980 And they are? 79 00:05:07,540 --> 00:05:09,420 The derivative with respect to x, which 80 00:05:09,420 --> 00:05:13,320 is 2, and the derivative with respect to y, which is 5. 81 00:05:13,320 --> 00:05:17,100 So in this case, the gradient is constant. 82 00:05:17,100 --> 00:05:22,650 And the Hessian, which I often call H after Hessian, 83 00:05:22,650 --> 00:05:25,800 or del squared F, which tells us we're 84 00:05:25,800 --> 00:05:27,990 taking the second derivatives, that 85 00:05:27,990 --> 00:05:33,150 will be the second derivatives, obviously 0 in this case. 86 00:05:33,150 --> 00:05:38,230 So what shape is H here? 87 00:05:38,230 --> 00:05:39,730 It's 2 by 2. 88 00:05:39,730 --> 00:05:45,212 Everybody recognizes it's 2 by 2. H would have the-- 89 00:05:45,212 --> 00:05:49,220 I'll take a second derivative of that-- 90 00:05:49,220 --> 00:05:52,090 sorry, the first derivative of that with respect to x, 91 00:05:52,090 --> 00:05:54,700 obviously 0, the first derivative with respect 92 00:05:54,700 --> 00:06:00,620 to y, the first derivative of that with respect to x y. 93 00:06:00,620 --> 00:06:04,840 Anyway, Hessian 0 for sure. 94 00:06:04,840 --> 00:06:08,080 So let me draw the surface. 95 00:06:08,080 --> 00:06:13,540 So x, y, and the surface, if I graph F in this direction, 96 00:06:13,540 --> 00:06:16,960 then obviously, I have a plane. 97 00:06:16,960 --> 00:06:20,840 And I'm at a typical point on the plane let's say. 98 00:06:20,840 --> 00:06:21,910 Yeah, yeah. 99 00:06:21,910 --> 00:06:24,070 So I'm at a point x, y, I should say. 100 00:06:24,070 --> 00:06:25,690 I'm at a point x, y. 101 00:06:25,690 --> 00:06:28,340 And let me put the plane through it. 102 00:06:28,340 --> 00:06:30,160 So how do I interpret the gradient 103 00:06:30,160 --> 00:06:32,235 at that particular point x, y? 104 00:06:35,630 --> 00:06:38,240 What does 2x plus 5y tell me? 
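For reference, the gradient and Hessian of the linear function f(x, y) = 2x + 5y described above can be written out as:

```latex
f(x,y) = 2x + 5y, \qquad
\nabla f = \begin{bmatrix} \partial f/\partial x \\ \partial f/\partial y \end{bmatrix}
         = \begin{bmatrix} 2 \\ 5 \end{bmatrix}, \qquad
H = \begin{bmatrix}
      \partial^2 f/\partial x^2 & \partial^2 f/\partial x\,\partial y \\
      \partial^2 f/\partial y\,\partial x & \partial^2 f/\partial y^2
    \end{bmatrix}
  = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
```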
105 00:06:38,240 --> 00:06:46,400 Or rather what does grad F tell me about movement 106 00:06:46,400 --> 00:06:50,510 from that point x, y? 107 00:06:50,510 --> 00:06:52,030 Of course, the gradient is constant. 108 00:06:52,030 --> 00:06:55,130 So it really didn't matter what point I'm moving from. 109 00:06:55,130 --> 00:06:57,680 But taking a point here. 110 00:06:57,680 --> 00:07:00,290 So what's the deal if I move? 111 00:07:00,290 --> 00:07:04,010 What's the fastest way to go up the surface? 112 00:07:04,010 --> 00:07:09,110 If I took the plane that went through that point x, y, 113 00:07:09,110 --> 00:07:11,620 what's the fastest way to climb the plane? 114 00:07:11,620 --> 00:07:14,630 What direction goes up fastest? 115 00:07:14,630 --> 00:07:16,230 The gradient direction, right? 116 00:07:16,230 --> 00:07:19,080 The gradient direction is the way up. 117 00:07:19,080 --> 00:07:22,700 How am I going to put it in this picture? 118 00:07:22,700 --> 00:07:26,710 I guess I'm thinking of this plane as-- 119 00:07:26,710 --> 00:07:27,530 so what plane? 120 00:07:27,530 --> 00:07:30,230 You could well ask what plane have I drawn? 121 00:07:30,230 --> 00:07:39,350 Suppose I've drawn the plane 2x plus 5y equals 0 even? 122 00:07:39,350 --> 00:07:41,560 So I'll make it go through the origin. 123 00:07:41,560 --> 00:07:44,540 And I've taken a typical point on that plane. 124 00:07:44,540 --> 00:07:48,380 Now if I want to increase that function, 125 00:07:48,380 --> 00:07:52,700 I go perpendicular to the plane. 126 00:07:52,700 --> 00:07:54,665 If I want to stay level with the function, 127 00:07:54,665 --> 00:07:58,620 if I wanted to stay at 0, I stay in the plane. 128 00:07:58,620 --> 00:08:00,650 So there are two key directions. 129 00:08:00,650 --> 00:08:01,880 Everybody knows this. 130 00:08:01,880 --> 00:08:03,200 I'm just repeating. 
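The two key directions can be checked numerically for f(x, y) = 2x + 5y. This is a small sketch, not part of the lecture: the helper `change` and the starting point (1, 1) are hypothetical, and the level direction (5, -2) is simply one vector perpendicular to the gradient (2, 5).

```python
import math


def f(x, y):
    return 2 * x + 5 * y


grad = (2.0, 5.0)                          # constant gradient of f
norm = math.hypot(grad[0], grad[1])        # length of the gradient, sqrt(29)
up = (grad[0] / norm, grad[1] / norm)      # unit vector along the gradient: steepest up
down = (-up[0], -up[1])                    # unit vector along minus the gradient: steepest down
level = (5.0 / norm, -2.0 / norm)          # unit vector perpendicular to the gradient


def change(direction, x=1.0, y=1.0):
    """Change in f after a unit step from (x, y) along the given direction."""
    return f(x + direction[0], y + direction[1]) - f(x, y)


print(change(up), change(down), change(level))
```

A unit step along the gradient raises f by the length of the gradient, the opposite step lowers it by the same amount, and a step along the level direction leaves f unchanged up to rounding.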
131 00:08:03,200 --> 00:08:08,030 This is the direction of the gradient of F out 132 00:08:08,030 --> 00:08:10,250 of the plane, steepest upwards. 133 00:08:10,250 --> 00:08:13,190 This is the downwards direction minus gradient 134 00:08:13,190 --> 00:08:16,940 of F, perpendicular to the plane downwards. 135 00:08:16,940 --> 00:08:21,800 And that line is in the plane. 136 00:08:21,800 --> 00:08:23,660 That's part of the level set. 137 00:08:23,660 --> 00:08:28,070 2x plus 5y equals 0 would be a level set. 138 00:08:28,070 --> 00:08:32,950 That's my pretty amateur picture. 139 00:08:32,950 --> 00:08:45,130 Just all I want to remember is these words level and steepest, 140 00:08:45,130 --> 00:08:49,330 up or down. 141 00:08:49,330 --> 00:08:54,610 Down with a minus sign that we see in steepest descent. 142 00:08:54,610 --> 00:08:58,980 So we're in steepest descent. 143 00:09:03,020 --> 00:09:08,900 And what's the Hessian telling me about the surface 144 00:09:08,900 --> 00:09:12,810 if I take the matrix of second derivatives? 145 00:09:12,810 --> 00:09:14,680 So I have this surface. 146 00:09:14,680 --> 00:09:18,070 So I have a surface F equal constant. 147 00:09:22,990 --> 00:09:25,620 That's the sort of level surface. 148 00:09:25,620 --> 00:09:29,530 So if I stay in that surface, the gradient of F is 0. 149 00:09:29,530 --> 00:09:33,351 Gradient of F is 0 in-- 150 00:09:36,960 --> 00:09:39,270 on-- on is a better word-- 151 00:09:39,270 --> 00:09:39,900 on the surface. 152 00:09:43,330 --> 00:09:46,220 The gradient of F points perpendicular. 153 00:09:46,220 --> 00:09:58,100 But what about the Hessian, the second derivative? 154 00:09:58,100 --> 00:10:03,430 What is that telling me about that surface 155 00:10:03,430 --> 00:10:07,950 in particular when the Hessian is 0 or other surfaces? 156 00:10:07,950 --> 00:10:10,395 What does the Hessian tell me about-- 157 00:10:13,370 --> 00:10:16,990 I'm thinking of the Hessian at a particular point. 
158 00:10:16,990 --> 00:10:25,580 So I'm getting 0 for the Hessian because the surface is flat. 159 00:10:25,580 --> 00:10:34,180 If the surface was convex upwards from-- 160 00:10:34,180 --> 00:10:41,775 if it was a convex or a graph of F, the Hessian would be-- 161 00:10:46,340 --> 00:10:48,810 so I just want to make that connection now. 162 00:10:48,810 --> 00:10:54,990 What's the connection between the Hessian and convexity 163 00:10:54,990 --> 00:10:55,590 of the-- 164 00:10:55,590 --> 00:11:00,660 the Hessian of the function and convexity of the function? 165 00:11:00,660 --> 00:11:06,550 So the point is that convexity-- 166 00:11:06,550 --> 00:11:10,350 the Hessian tells me whether or not the surface is convex. 167 00:11:10,350 --> 00:11:11,550 And what is the test? 168 00:11:11,550 --> 00:11:12,600 AUDIENCE: [INAUDIBLE]. 169 00:11:12,600 --> 00:11:16,350 GILBERT STRANG: Positive definite or semi definite. 170 00:11:16,350 --> 00:11:20,340 I'm just looking for an excuse to write down 171 00:11:20,340 --> 00:11:26,910 convexity and strong. 172 00:11:26,910 --> 00:11:29,760 Do I say strict or strong convexity? 173 00:11:29,760 --> 00:11:30,630 I've forgotten. 174 00:11:30,630 --> 00:11:32,150 Strict, I think. 175 00:11:32,150 --> 00:11:33,030 Strictly convex. 176 00:11:38,230 --> 00:11:45,100 So convexity, the Hessian is positive semi-definite, 177 00:11:45,100 --> 00:11:48,330 or which includes-- 178 00:11:48,330 --> 00:11:49,990 I better say that right here-- 179 00:11:49,990 --> 00:11:52,074 includes positive definite. 180 00:11:58,380 --> 00:12:00,420 If I'm looking for a strict convexity, 181 00:12:00,420 --> 00:12:03,220 then I must require positive definite. 182 00:12:03,220 --> 00:12:05,863 H is positive definite. 183 00:12:09,810 --> 00:12:12,300 Semi-definite won't do. 184 00:12:12,300 --> 00:12:15,300 So semi-definite for convex. 
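The Hessian test stated here (positive semi-definite for convex, positive definite for strictly convex) can be sketched for a 2 by 2 symmetric Hessian. The function names and the tolerance `tol` below are hypothetical choices for this sketch. As a side note, a positive definite Hessian is a sufficient test for strict convexity; a function such as x to the fourth is strictly convex even though its second derivative is 0 at x = 0.

```python
import math


def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def classify_hessian(a, b, c, tol=1e-12):
    """Classify a constant 2x2 symmetric Hessian by its smallest eigenvalue."""
    smallest, _ = eigenvalues_2x2(a, b, c)
    if smallest > tol:
        return "positive definite"         # strictly convex
    if smallest >= -tol:
        return "positive semidefinite"     # convex, possibly not strictly
    return "not positive semidefinite"     # fails the convexity test


print(classify_hessian(0.0, 0.0, 0.0))     # Hessian of the linear function 2x + 5y
print(classify_hessian(1.0, 0.0, 0.1))     # Hessian diag(1, b) with b = 0.1
```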
185 00:12:15,300 --> 00:12:18,540 So that in fact, the linear function 186 00:12:18,540 --> 00:12:22,170 is convex, but not strictly convex. 187 00:12:22,170 --> 00:12:25,160 Strictly means it really bends upwards. 188 00:12:25,160 --> 00:12:26,890 The Hessian is positive definite. 189 00:12:26,890 --> 00:12:31,120 The curvatures are positive. 190 00:12:31,120 --> 00:12:34,290 So this would include linear functions, 191 00:12:34,290 --> 00:12:37,460 and that would not include linear functions. 192 00:12:37,460 --> 00:12:40,740 They're not strictly convex. 193 00:12:40,740 --> 00:12:42,510 Good, good, good. 194 00:12:42,510 --> 00:12:46,600 Some examples-- OK, the number one example, of course, 195 00:12:46,600 --> 00:12:49,410 is the one we're talking about over here. 196 00:12:49,410 --> 00:12:59,840 So examples f of x equal 1/2 x transpose Sx. 197 00:13:03,020 --> 00:13:05,660 And of course, I could have linear terms 198 00:13:05,660 --> 00:13:10,310 minus a transpose x, a linear term. 199 00:13:10,310 --> 00:13:12,770 And I could have a constant. 200 00:13:12,770 --> 00:13:13,270 OK. 201 00:13:18,790 --> 00:13:23,390 So this function is strictly convex 202 00:13:23,390 --> 00:13:28,130 when S is positive definite, because H is now 203 00:13:28,130 --> 00:13:33,800 S for that function, for that function 204 00:13:33,800 --> 00:13:39,170 H. Usually H, the Hessian is varying from point to point. 205 00:13:39,170 --> 00:13:42,770 The nice thing about a pure quadratic is it's constant. 206 00:13:42,770 --> 00:13:46,550 It's the same S at all points. 207 00:13:46,550 --> 00:13:49,580 Let me just ask you-- 208 00:13:49,580 --> 00:13:53,370 so that's a convex function. 209 00:13:53,370 --> 00:13:56,250 And what's its minimum? 210 00:13:56,250 --> 00:13:57,883 What's the gradient, first of all? 211 00:13:57,883 --> 00:13:59,050 What's the gradient of that? 
212 00:14:03,790 --> 00:14:09,570 I'm asking really for differentiating 213 00:14:09,570 --> 00:14:14,440 thinking in vector, doing all n derivatives at once here. 214 00:14:14,440 --> 00:14:19,840 I'm asking for the whole vector of first derivatives. 215 00:14:19,840 --> 00:14:24,420 Because here I'm giving you the whole function 216 00:14:24,420 --> 00:14:28,150 with x for vector x. 217 00:14:28,150 --> 00:14:31,210 Of course, we could take n to be 1. 218 00:14:31,210 --> 00:14:33,760 And then we would see that if n was 1, 219 00:14:33,760 --> 00:14:39,880 this would just be Sx squared, half Sx squared. 220 00:14:39,880 --> 00:14:44,170 And the derivative of a half Sx squared-- 221 00:14:44,170 --> 00:14:46,030 let me just put that over here so we're 222 00:14:46,030 --> 00:14:48,700 sure to get it right-- half of Sx squared. 223 00:14:48,700 --> 00:14:51,490 This is in the n equal 1 case. 224 00:14:51,490 --> 00:14:53,860 And the derivative is obviously Sx. 225 00:14:53,860 --> 00:14:55,540 And that's what it is here, Sx. 226 00:15:06,490 --> 00:15:10,200 It's obviously simple, but if you 227 00:15:10,200 --> 00:15:14,190 haven't thought about that line, it's 228 00:15:14,190 --> 00:15:18,120 asking for all the first derivatives 229 00:15:18,120 --> 00:15:20,850 of that quadratic function. 230 00:15:20,850 --> 00:15:21,570 Oh! 231 00:15:21,570 --> 00:15:27,940 It's not-- What do I have to include now here? 232 00:15:27,940 --> 00:15:31,200 That's not right as it stands for the function that's 233 00:15:31,200 --> 00:15:32,517 written above it. 234 00:15:32,517 --> 00:15:33,600 What's the right gradient? 235 00:15:33,600 --> 00:15:34,517 AUDIENCE: [INAUDIBLE]. 236 00:15:34,517 --> 00:15:38,220 GILBERT STRANG: Minus a, thanks. 237 00:15:38,220 --> 00:15:41,440 Because the linear function, its partial derivatives 238 00:15:41,440 --> 00:15:45,120 are obviously just the components of a. 
239 00:15:45,120 --> 00:15:56,030 And the Hessian H is S, derivatives of that guy. 240 00:15:56,030 --> 00:15:56,700 OK. 241 00:15:56,700 --> 00:15:57,300 Good. 242 00:15:57,300 --> 00:15:59,550 Good, good, good. 243 00:15:59,550 --> 00:16:02,520 And the minimum value-- we might as well-- oh yeah! 244 00:16:02,520 --> 00:16:07,820 What's the right word for a minimum value? 245 00:16:07,820 --> 00:16:09,570 No, I'm sorry. 246 00:16:09,570 --> 00:16:14,430 The right word is minimum value like f min. 247 00:16:14,430 --> 00:16:17,880 So I want to compute f min. 248 00:16:17,880 --> 00:16:23,930 Well, first I have to figure out where is that minimum reached? 249 00:16:23,930 --> 00:16:27,140 And what's the answer to that? 250 00:16:27,140 --> 00:16:30,840 We're putting everything on the board for this simple case. 251 00:16:30,840 --> 00:16:38,990 The minimum of f of x-- 252 00:16:38,990 --> 00:16:42,290 remember, x is-- we're in n dimensions-- 253 00:16:42,290 --> 00:16:49,910 is at x equal what? 254 00:16:49,910 --> 00:16:52,400 Well, the minimum is where the gradient is 0. 255 00:16:55,460 --> 00:16:59,381 So what's the minimizing x? 256 00:16:59,381 --> 00:17:01,115 S inverse a, thanks. 257 00:17:08,180 --> 00:17:09,260 Sorry. 258 00:17:09,260 --> 00:17:12,480 That's not right. 259 00:17:12,480 --> 00:17:14,020 It's here that I meant to write it. 260 00:17:17,099 --> 00:17:20,550 Really, my whole point for this little moment 261 00:17:20,550 --> 00:17:23,250 is to be sure that we keep straight what 262 00:17:23,250 --> 00:17:27,780 I mean by the place where the minimum is reached 263 00:17:27,780 --> 00:17:29,160 and the minimum value. 264 00:17:29,160 --> 00:17:30,600 Those are two different things. 265 00:17:34,330 --> 00:17:36,810 So the minimum is reached at S inverse 266 00:17:36,810 --> 00:17:40,270 a, because that's obviously where the gradient is 0. 267 00:17:40,270 --> 00:17:43,073 It's the solution to Sx equal a. 
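The board work for the quadratic with a linear term and a constant can be collected in one place. The constant is called b, as it is later in the lecture, and S is taken to be symmetric positive definite so that the inverse exists:

```latex
f(x) = \tfrac{1}{2}\, x^{\mathsf T} S x - a^{\mathsf T} x + b, \qquad
\nabla f(x) = S x - a, \qquad
H = S, \qquad
\nabla f(x^\ast) = 0 \;\Longleftrightarrow\; S x^\ast = a \;\Longleftrightarrow\; x^\ast = S^{-1} a
```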
268 00:17:43,073 --> 00:17:48,970 And what I was going to ask you is what's the right word-- 269 00:17:48,970 --> 00:17:56,440 well, sort of word, made up word-- for this point x star 270 00:17:56,440 --> 00:17:58,760 where the minimum is reached? 271 00:17:58,760 --> 00:18:00,160 So it's not the minimum value. 272 00:18:00,160 --> 00:18:01,720 It's the point where it's reached. 273 00:18:01,720 --> 00:18:06,057 And that's called-- the notation for that point is 274 00:18:06,057 --> 00:18:06,991 AUDIENCE: Arg min. 275 00:18:06,991 --> 00:18:10,240 GILBERT STRANG: Arg min, thanks. 276 00:18:10,240 --> 00:18:16,620 Arg min of my function. 277 00:18:16,620 --> 00:18:18,900 And that means the place-- 278 00:18:18,900 --> 00:18:24,918 the point where f equals f min. 279 00:18:28,200 --> 00:18:30,600 I haven't said yet what the minimum value is. 280 00:18:30,600 --> 00:18:31,830 This tells us the point. 281 00:18:31,830 --> 00:18:34,290 And that's usually what we're interested in. 282 00:18:34,290 --> 00:18:36,540 We're, to tell the truth, not that 283 00:18:36,540 --> 00:18:40,470 interested in a typical example and what the minimum value 284 00:18:40,470 --> 00:18:43,740 is as much as where is it? 285 00:18:43,740 --> 00:18:46,590 Where do we reach that thing? 286 00:18:46,590 --> 00:18:50,490 And of course, so this is x min. 287 00:18:50,490 --> 00:19:00,010 This is then arg min of my function f. 288 00:19:00,010 --> 00:19:00,940 That's the point. 289 00:19:00,940 --> 00:19:04,420 And it happens to be in this case, 290 00:19:04,420 --> 00:19:06,520 the minimum value is actually 0. 291 00:19:11,470 --> 00:19:15,190 Because there's no linear term a transpose x. 292 00:19:20,080 --> 00:19:26,270 Why am I talking about arg min when you've all seen it? 
293 00:19:26,270 --> 00:19:28,990 I guess I think that somebody could just 294 00:19:28,990 --> 00:19:34,750 be reading this stuff, for example, learning 295 00:19:34,750 --> 00:19:40,740 about neural net, and run into this expression arg min 296 00:19:40,740 --> 00:19:43,360 and think what's that? 297 00:19:43,360 --> 00:19:47,620 So it's maybe a right time to say what it is. 298 00:19:47,620 --> 00:19:50,110 It's the point where the minimum is reached. 299 00:19:52,930 --> 00:19:55,510 Why those words, by the way? 300 00:19:55,510 --> 00:19:57,280 Well, arg isn't much of a word. 301 00:19:57,280 --> 00:20:00,160 It sounds like you're getting strangled. 302 00:20:00,160 --> 00:20:03,520 But it's sort of short. 303 00:20:03,520 --> 00:20:05,440 I assume it's short. 304 00:20:05,440 --> 00:20:07,300 Nobody ever told me this. 305 00:20:07,300 --> 00:20:10,210 I assume it's short for argument. 306 00:20:10,210 --> 00:20:15,160 The word argument is a kind of long word for the value of x. 307 00:20:15,160 --> 00:20:18,850 If I have a function f of x, f, I 308 00:20:18,850 --> 00:20:23,770 call it function and x is the argument of that function. 309 00:20:23,770 --> 00:20:27,430 You might more often see the word variable. 310 00:20:27,430 --> 00:20:31,240 But argument-- and I'm assuming that's what that refers to, 311 00:20:31,240 --> 00:20:35,430 it's the argument that minimizes the function. 312 00:20:35,430 --> 00:20:37,180 OK, good. 313 00:20:37,180 --> 00:20:41,090 And here it is, S inverse a. 314 00:20:41,090 --> 00:20:43,180 Now but just by the way, what is f min? 315 00:20:43,180 --> 00:20:45,730 Do you know the minimum of a quadratic? 316 00:20:45,730 --> 00:20:49,750 I mean, this is the fundamental minimization question, 317 00:20:49,750 --> 00:20:52,660 to minimize a quadratic. 318 00:20:52,660 --> 00:20:56,410 Electrical engineering, a quadratic regulator problem 319 00:20:56,410 --> 00:20:58,280 is the simplest problem there. 
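To keep the arg min (a point) apart from the minimum value (a number), here is a small 2 by 2 sketch for f(x) = 1/2 x transpose S x minus a transpose x, with the constant dropped. The matrix `S`, the vector `a`, and the helper names are hypothetical values chosen for illustration, and Cramer's rule is used only because the system is 2 by 2.

```python
def solve_2x2(S, a):
    """Solve S x = a for a 2x2 matrix S by Cramer's rule."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    return ((a[0] * s22 - s12 * a[1]) / det,
            (s11 * a[1] - s21 * a[0]) / det)


def f(S, a, x):
    """f(x) = 1/2 x^T S x - a^T x (no constant term)."""
    Sx = (S[0][0] * x[0] + S[0][1] * x[1],
          S[1][0] * x[0] + S[1][1] * x[1])
    return 0.5 * (x[0] * Sx[0] + x[1] * Sx[1]) - (a[0] * x[0] + a[1] * x[1])


S = ((2.0, 1.0), (1.0, 3.0))    # hypothetical symmetric positive definite matrix
a = (1.0, 2.0)                  # hypothetical linear term

x_star = solve_2x2(S, a)        # arg min: the point where the gradient S x - a is 0
f_min = f(S, a, x_star)         # min: the value of f at that point

print("arg min:", x_star)
print("min value:", f_min)
```

The first printed quantity is a vector with two components and the second is a single number, which is the distinction being drawn between arg min and min.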
320 00:20:58,280 --> 00:20:59,920 There could be constraints. 321 00:20:59,920 --> 00:21:03,070 And we'll see it with constraints included. 322 00:21:03,070 --> 00:21:06,260 But right now, no constraints at all. 323 00:21:06,260 --> 00:21:08,560 We're just looking at the function f of x. 324 00:21:11,480 --> 00:21:15,040 Let me remove the b, because that just 325 00:21:15,040 --> 00:21:18,130 shifts the function by b. 326 00:21:18,130 --> 00:21:22,710 If I erase that, just to say it didn't matter. 327 00:21:22,710 --> 00:21:25,000 It's really that function. 328 00:21:25,000 --> 00:21:28,030 So that function actually goes through 0. 329 00:21:28,030 --> 00:21:32,290 As it is, when x is 0, we obviously get 0. 330 00:21:32,290 --> 00:21:35,950 But it's still on its way down, so to speak. 331 00:21:35,950 --> 00:21:40,090 It's on its way down to this point, S inverse a. 332 00:21:40,090 --> 00:21:42,490 That's where it bottoms out. 333 00:21:42,490 --> 00:21:47,060 And when it bottoms out, what do you get for f? 334 00:21:47,060 --> 00:21:49,660 One thing I know, it's going to be negative 335 00:21:49,660 --> 00:21:53,620 because it passed through 0, and it was on its way below 0. 336 00:21:53,620 --> 00:21:57,220 So let's just figure out what that f min is. 337 00:21:57,220 --> 00:22:00,010 So I have a half. 338 00:22:00,010 --> 00:22:05,560 I'm just going to plug in S inverse a, the bottom point 339 00:22:05,560 --> 00:22:11,860 into the function, and see where the surface bottoms out 340 00:22:11,860 --> 00:22:15,700 and at what level it bottoms out. 341 00:22:15,700 --> 00:22:17,200 So I have a half. 342 00:22:17,200 --> 00:22:23,320 So S inverse a, transposed, is a transpose S inverse. 343 00:22:23,320 --> 00:22:26,950 S is symmetric, so I'll just write S inverse for its transpose. 344 00:22:26,950 --> 00:22:33,520 S, S inverse a from the quadratic term, 345 00:22:33,520 --> 00:22:37,770 minus a transpose. 346 00:22:37,770 --> 00:22:40,030 And x is S inverse a. 
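Written out, the substitution of x = S inverse a into the function (with the constant b removed and S symmetric, so that the transpose of S inverse is S inverse) reads:

```latex
f_{\min} = f(S^{-1}a)
  = \tfrac{1}{2}\,(S^{-1}a)^{\mathsf T} S\,(S^{-1}a) - a^{\mathsf T} S^{-1} a
  = \tfrac{1}{2}\, a^{\mathsf T} S^{-1} a - a^{\mathsf T} S^{-1} a
  = -\tfrac{1}{2}\, a^{\mathsf T} S^{-1} a
```

Since S inverse is positive definite whenever S is, this value is negative for any nonzero a.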
347 00:22:40,030 --> 00:22:42,580 Have you done this calculation? 348 00:22:42,580 --> 00:22:46,240 It just doesn't hurt to repeat it. 349 00:22:46,240 --> 00:22:53,530 So I've plugged in S inverse a there, there, and there. 350 00:22:53,530 --> 00:22:55,060 OK, what have I got? 351 00:22:55,060 --> 00:22:58,630 Well, S inverse cancels S. So I have 352 00:22:58,630 --> 00:23:02,310 a half of a transpose S inverse a minus 1 353 00:23:02,310 --> 00:23:04,150 of a transpose S inverse a. 354 00:23:04,150 --> 00:23:08,350 So I get finally negative a half. 355 00:23:08,350 --> 00:23:15,850 Half of it minus one of it of a transpose S inverse a. 356 00:23:15,850 --> 00:23:19,480 Sorry, that's not brilliant use of the blackboard 357 00:23:19,480 --> 00:23:21,370 to squeeze that in there. 358 00:23:21,370 --> 00:23:26,380 But that's easily repeatable. 359 00:23:26,380 --> 00:23:29,770 OK, good. 360 00:23:29,770 --> 00:23:34,560 So that's what a quadratic bowl, a perfect quadratic problem 361 00:23:34,560 --> 00:23:40,390 minimizes to. That's its lowest level. 362 00:23:40,390 --> 00:23:45,390 Ooh, I wanted to mention one other function, 363 00:23:45,390 --> 00:23:48,480 because I'm going to speak mostly about quadratics, 364 00:23:48,480 --> 00:23:51,150 but obviously, the whole point is 365 00:23:51,150 --> 00:23:56,520 that it's the convexity that's really making things work. 366 00:23:56,520 --> 00:24:07,190 So here, let me just put here, a remarkable convex function. 367 00:24:11,800 --> 00:24:20,690 And the notes tell what's the gradient of this function. 368 00:24:20,690 --> 00:24:24,550 They don't actually go as far as the Hessian. 369 00:24:24,550 --> 00:24:32,780 Proving that this function I'm going to write down is convex, 370 00:24:32,780 --> 00:24:34,720 it takes a little thinking. 371 00:24:34,720 --> 00:24:37,810 But it's a fantastic function. 
372 00:24:37,810 --> 00:24:41,922 You would never sort of imagine it 373 00:24:41,922 --> 00:24:44,110 if you didn't see it sometime. 374 00:24:44,110 --> 00:24:48,580 So it's going to be a function of a matrix, a function of-- 375 00:24:48,580 --> 00:24:58,630 those are n squared variables, x, i, j. 376 00:24:58,630 --> 00:25:01,140 So it's a function of many variables. 377 00:25:01,140 --> 00:25:03,220 And here is this function. 378 00:25:03,220 --> 00:25:07,300 It's you take the determinant of the matrix. 379 00:25:07,300 --> 00:25:11,010 That's clearly a function of all the n squared variables. 380 00:25:11,010 --> 00:25:15,810 Then you take the log of the determinant 381 00:25:15,810 --> 00:25:21,840 and put in a minus sign because we want convex. 382 00:25:21,840 --> 00:25:24,660 That turns out to be a convex function. 383 00:25:24,660 --> 00:25:29,250 And even to just check that for 2 by 2 well, for 2 by 2 384 00:25:29,250 --> 00:25:32,190 you have four variables, because it's a 2 by 2 matrix. 385 00:25:32,190 --> 00:25:35,160 We could maybe check it for a symmetric matrix. 386 00:25:35,160 --> 00:25:37,170 I move it down to three variables. 387 00:25:37,170 --> 00:25:45,540 But I'd be glad anybody who's ambitious to see 388 00:25:45,540 --> 00:25:51,450 why that log determinant is a remarkable function. 389 00:25:51,450 --> 00:25:52,650 And let me see. 390 00:25:56,040 --> 00:26:01,860 So the gradient of that thing is also amazing. 391 00:26:01,860 --> 00:26:06,120 The gradient of that function-- 392 00:26:06,120 --> 00:26:11,610 I'm going to peek so I don't write the wrong fact here. 393 00:26:15,780 --> 00:26:19,800 So the partial derivative of that function 394 00:26:19,800 --> 00:26:23,190 are the entries of-- 395 00:26:23,190 --> 00:26:26,220 these are the entries of a, a inverse. 396 00:26:26,220 --> 00:26:27,960 That's the-- of x inverse. 397 00:26:38,360 --> 00:26:39,880 That's like, wow. 398 00:26:39,880 --> 00:26:42,130 Where did that come from? 
399 00:26:42,130 --> 00:26:45,410 It might be minus the entries, of course. 400 00:26:45,410 --> 00:26:46,930 Yeah, yeah, yeah. 401 00:26:46,930 --> 00:26:53,240 So we've got n squared function-- 402 00:26:53,240 --> 00:26:56,560 what is a typical entry in x inverse? 403 00:26:56,560 --> 00:27:02,090 What is a typical x inverse i, j? 404 00:27:02,090 --> 00:27:05,890 Just to remember that bit of pretty 405 00:27:05,890 --> 00:27:09,910 old fashioned linear algebra, the entries 406 00:27:09,910 --> 00:27:14,980 of the inverse matrix, I'm sure to divide by what? 407 00:27:14,980 --> 00:27:17,200 The determinant, that's the one thing we know. 408 00:27:21,720 --> 00:27:24,270 And that's the reason we take the log, 409 00:27:24,270 --> 00:27:27,840 because when you take derivatives of a log, 410 00:27:27,840 --> 00:27:31,680 that will put determinant of x in the denominator. 411 00:27:31,680 --> 00:27:33,990 And then the numerator will be the derivatives 412 00:27:33,990 --> 00:27:36,160 of the determinant of x. 413 00:27:36,160 --> 00:27:36,660 Oh! 414 00:27:36,660 --> 00:27:41,640 Can we get any idea what are the derivatives of the determinant? 415 00:27:41,640 --> 00:27:43,596 Oh my god. 416 00:27:43,596 --> 00:27:46,410 How did I never get into this? 417 00:27:46,410 --> 00:27:50,090 So are you with me so far? 418 00:27:50,090 --> 00:27:54,350 This is going to be derivatives of determinant, 419 00:27:54,350 --> 00:27:58,020 with respect to all these variables, divided 420 00:27:58,020 --> 00:28:02,130 by the determinant, because that's what the log achieved. 421 00:28:02,130 --> 00:28:04,560 So when I take the derivative of the log of something, 422 00:28:04,560 --> 00:28:12,060 that chain rule says take the derivative of that something 423 00:28:12,060 --> 00:28:15,900 divide by the function determinant of x. 424 00:28:15,900 --> 00:28:20,710 So what's the derivative of the determinant of a matrix 425 00:28:20,710 --> 00:28:22,510 with respect to its 1, 1 entry? 
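The chain rule step described here, applied to the function minus log of the determinant of X and its n squared variables x_ij, can be written as:

```latex
f(X) = -\log(\det X), \qquad
\frac{\partial f}{\partial x_{ij}}
  = -\,\frac{1}{\det X}\;\frac{\partial (\det X)}{\partial x_{ij}}
```

So the determinant lands in the denominator, and what remains is to find the derivative of the determinant with respect to each entry.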
426 00:28:22,510 --> 00:28:23,010 Yeah, sure. 427 00:28:23,010 --> 00:28:24,960 This is crazy. 428 00:28:24,960 --> 00:28:26,490 But it's crazy to be doing this. 429 00:28:26,490 --> 00:28:28,000 But it's healthy. 430 00:28:28,000 --> 00:28:28,500 OK. 431 00:28:31,960 --> 00:28:38,111 So I have a matrix x, da, da, da, x, 1, 1, x, 1, n, 432 00:28:38,111 --> 00:28:43,400 et cetera, x, n, 1, x, n, n. 433 00:28:43,400 --> 00:28:45,050 OK. 434 00:28:45,050 --> 00:28:46,440 And what am I looking for? 435 00:28:46,440 --> 00:28:52,160 I'm looking for the derivatives of the-- 436 00:28:52,160 --> 00:28:55,630 do I want the derivatives of the determinant? 437 00:28:55,630 --> 00:28:57,550 Yes. 438 00:28:57,550 --> 00:29:05,470 So what's the derivative of the determinant of x with respect 439 00:29:05,470 --> 00:29:10,100 to the first entry? It equals what? 440 00:29:13,780 --> 00:29:15,950 How can I figure out? 441 00:29:15,950 --> 00:29:17,810 So what's this asking me to do? 442 00:29:17,810 --> 00:29:22,790 It's asking me to change x, 1, 1 by delta x and see what's 443 00:29:22,790 --> 00:29:25,980 the change in the determinant. 444 00:29:25,980 --> 00:29:28,220 That's what derivatives are. 445 00:29:28,220 --> 00:29:31,010 Change x, 1, 1 a little bit. 446 00:29:31,010 --> 00:29:32,615 How much did the determinant change? 447 00:29:36,150 --> 00:29:39,060 What has the determinant of the whole matrix 448 00:29:39,060 --> 00:29:42,850 got to do with x, 1, 1? 449 00:29:42,850 --> 00:29:47,270 You remember that there is a formula for determinants. 450 00:29:47,270 --> 00:29:49,160 So I need that fact. 451 00:29:49,160 --> 00:29:55,600 The determinant of x is x, 1, 1 times something. 452 00:29:55,600 --> 00:29:58,510 It's that something that I really want to know. 453 00:29:58,510 --> 00:30:01,870 Plus x, 1, 2 times another something, plus, 454 00:30:01,870 --> 00:30:06,348 say, along the first row, times another something. 
455 00:30:09,340 --> 00:30:15,970 What are these factors that multiply 456 00:30:15,970 --> 00:30:19,790 the x's to give the determinant? 457 00:30:19,790 --> 00:30:22,520 What [INAUDIBLE] a linear combination 458 00:30:22,520 --> 00:30:27,340 of the first row times certain factors gives the determinant? 459 00:30:27,340 --> 00:30:30,520 And how do I know that there will be such factors? 460 00:30:30,520 --> 00:30:33,160 Because the fundamental property of the determinant 461 00:30:33,160 --> 00:30:39,280 is that it's linear in row 1 if I don't mess with other rows. 462 00:30:39,280 --> 00:30:43,240 It's a linear function of row 1. 463 00:30:43,240 --> 00:30:46,510 So it has a form x, 1, 1 times something. 464 00:30:46,510 --> 00:30:48,284 And what is that something? 465 00:30:48,284 --> 00:30:49,201 AUDIENCE: [INAUDIBLE]. 466 00:30:49,201 --> 00:30:52,300 GILBERT STRANG: The determinant of this. 467 00:30:52,300 --> 00:30:56,560 So what does x, 1, 1 multiply when you compute determinants? 468 00:30:56,560 --> 00:31:00,280 X, 1, 1 will not multiply any other guys in its row, 469 00:31:00,280 --> 00:31:02,920 because you're never multiplying two 470 00:31:02,920 --> 00:31:06,280 x's in the same row or the same column. 471 00:31:06,280 --> 00:31:10,210 What x, 1, 1 is multiplying is all these guys. 472 00:31:10,210 --> 00:31:15,040 And in fact, it turns out to be this determinant. 473 00:31:15,040 --> 00:31:17,180 And what is this called? 474 00:31:17,180 --> 00:31:22,930 That one smaller determinant that I get by throwing away 475 00:31:22,930 --> 00:31:24,970 the first row and first column? 476 00:31:24,970 --> 00:31:27,710 It's called a-- 477 00:31:27,710 --> 00:31:28,880 Minor is good. 478 00:31:28,880 --> 00:31:30,860 Yes, minor is good. 479 00:31:30,860 --> 00:31:33,650 I was saying there are two words that can be used, 480 00:31:33,650 --> 00:31:36,890 minor and co-factor. 481 00:31:42,860 --> 00:31:43,560 Yeah. 482 00:31:43,560 --> 00:31:44,740 And what is it?
483 00:31:44,740 --> 00:31:46,050 I mean, how do I compute it? 484 00:31:46,050 --> 00:31:47,367 What is the number? 485 00:31:47,367 --> 00:31:48,075 This is a number. 486 00:31:51,180 --> 00:31:52,110 It's just a number. 487 00:31:56,880 --> 00:32:01,090 Maybe I think of the minor as this determinant-- 488 00:32:01,090 --> 00:32:01,750 Ah! 489 00:32:01,750 --> 00:32:03,480 Let me cancel that. 490 00:32:03,480 --> 00:32:05,820 Maybe I think of the minor as this smaller 491 00:32:05,820 --> 00:32:08,790 matrix, and the co-factor is 492 00:32:08,790 --> 00:32:10,425 the determinant of the minor. 493 00:32:15,180 --> 00:32:16,890 And there is a plus or minus. 494 00:32:16,890 --> 00:32:20,250 With everything about determinants, there's 495 00:32:20,250 --> 00:32:23,430 a plus or minus choice to be made. 496 00:32:23,430 --> 00:32:27,600 And we're not going to worry about that. 497 00:32:27,600 --> 00:32:33,325 But so anyway, so it's the co-factor. 498 00:32:33,325 --> 00:32:35,300 Let me call it C, 1, 1. 499 00:32:37,950 --> 00:32:42,690 And so that's the formula for a determinant. 500 00:32:42,690 --> 00:32:46,842 That's the co-factor expansion of a determinant. 501 00:32:54,230 --> 00:32:56,100 OK. 502 00:32:56,100 --> 00:32:59,400 And that will connect back to this amazing fact 503 00:32:59,400 --> 00:33:02,790 that the gradient is the entries of x inverse, 504 00:33:02,790 --> 00:33:07,720 because the inverse is the ratio of co-factor to determinant. 505 00:33:07,720 --> 00:33:15,772 So x inverse 1, 1 is that co-factor over the determinant. 506 00:33:18,670 --> 00:33:20,190 Yeah. 507 00:33:20,190 --> 00:33:22,530 So that's where this all comes from. 508 00:33:22,530 --> 00:33:32,670 Anyway, I'm just mentioning that as a very interesting example 509 00:33:32,670 --> 00:33:35,820 of a convex function. 510 00:33:35,820 --> 00:33:37,270 OK. 511 00:33:37,270 --> 00:33:37,950 I'll leave that.
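The co-factor argument above can be written out as a short derivation. This is only a sketch: it treats the entries of X as independent variables, and the symbols C_{1j} (signed co-factors along the first row) extend the lecture's C, 1, 1 notation as an assumption. The key point is that every co-factor along row 1 is built from the other rows, so none of them contains x_{11}.

```latex
\det X = x_{11} C_{11} + x_{12} C_{12} + \cdots + x_{1n} C_{1n}
\quad\Longrightarrow\quad
\frac{\partial \det X}{\partial x_{11}} = C_{11},
\qquad
\frac{\partial}{\partial x_{11}} \log \det X
  = \frac{1}{\det X}\,\frac{\partial \det X}{\partial x_{11}}
  = \frac{C_{11}}{\det X}
  = \left(X^{-1}\right)_{11}.
```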
512 00:33:37,950 --> 00:33:41,740 That's just for, like, education. 513 00:33:41,740 --> 00:33:43,080 OK. 514 00:33:43,080 --> 00:33:48,510 Now I'm ready to go to work on gradient descent. 515 00:33:48,510 --> 00:33:52,260 So actually, the rest of this class and Friday's class 516 00:33:52,260 --> 00:33:59,310 about gradient descent are very fundamental parts of 18.065. 517 00:33:59,310 --> 00:34:01,750 And that will be one of our examples. 518 00:34:01,750 --> 00:34:06,650 And then the general case here. 519 00:34:06,650 --> 00:34:11,040 So I'm using this. 520 00:34:11,040 --> 00:34:13,670 It would be interesting to minimize that thing, 521 00:34:13,670 --> 00:34:15,409 but we're not going there. 522 00:34:15,409 --> 00:34:20,480 Let's hide it, so we don't see it again. 523 00:34:20,480 --> 00:34:23,030 And I'll work with that example. 524 00:34:26,429 --> 00:34:28,610 So here's gradient descent. 525 00:34:37,770 --> 00:34:45,030 It is xk plus 1 equals xk minus Sk, the step size, 526 00:34:45,030 --> 00:34:47,760 times the gradient of f at xk. 527 00:34:52,922 --> 00:34:56,080 So the only thing left that requires 528 00:34:56,080 --> 00:35:01,570 us to input some decision making is the step size, the learning 529 00:35:01,570 --> 00:35:03,100 rate. 530 00:35:03,100 --> 00:35:06,520 We can take it as constant. 531 00:35:06,520 --> 00:35:09,170 If we take too big a learning rate, 532 00:35:09,170 --> 00:35:12,130 the thing will oscillate all over the place 533 00:35:12,130 --> 00:35:16,130 and it's a disaster. 534 00:35:16,130 --> 00:35:19,520 If we take too small a learning rate, too small steps, 535 00:35:19,520 --> 00:35:22,600 what's the matter with that? 536 00:35:22,600 --> 00:35:24,190 Takes too long. 537 00:35:24,190 --> 00:35:26,260 Takes too long. 538 00:35:26,260 --> 00:35:30,400 So the problem is to get it just right. 539 00:35:30,400 --> 00:35:32,560 And one way that you could say get it right 540 00:35:32,560 --> 00:35:37,030 would be to optimize.
541 00:35:37,030 --> 00:35:38,920 Choose the optimal Sk. 542 00:35:38,920 --> 00:35:43,450 Of course, that takes longer than just deciding an Sk 543 00:35:43,450 --> 00:35:46,370 in advance, which is what people do. 544 00:35:46,370 --> 00:35:51,760 So I'll tell you what people do on really big problems: they 545 00:35:51,760 --> 00:35:53,160 take an Sk-- 546 00:35:53,160 --> 00:35:57,520 estimate a suitable Sk, and then go with it for a while. 547 00:35:57,520 --> 00:36:02,830 And then look back to see. If it was too big, 548 00:36:02,830 --> 00:36:05,310 they'll see oscillations. 549 00:36:05,310 --> 00:36:09,220 It'll be bouncing all over the place. 550 00:36:09,220 --> 00:36:13,525 Or of course, an exact line search-- 551 00:36:16,730 --> 00:36:19,090 so you see this expression often. 552 00:36:19,090 --> 00:36:30,810 The exact line search chooses Sk to make my function 553 00:36:30,810 --> 00:36:44,020 f at xk plus 1 a minimum on the line, on the search line, 554 00:36:44,020 --> 00:36:48,235 a minimum in the search direction. 555 00:36:54,175 --> 00:36:57,940 The search direction is given by the gradient. 556 00:36:57,940 --> 00:36:59,770 That's the direction we're moving. 557 00:36:59,770 --> 00:37:02,260 This is the distance we're moving, 558 00:37:02,260 --> 00:37:05,440 or a measure of the distance we're moving. 559 00:37:05,440 --> 00:37:09,580 And an exact search would be to go along there. 560 00:37:09,580 --> 00:37:14,110 If I have a convex function, then as I move along this line, 561 00:37:14,110 --> 00:37:19,350 as I increase Sk, I'll see the function start down, 562 00:37:19,350 --> 00:37:25,380 because the gradient, negative gradient means down. 563 00:37:25,380 --> 00:37:28,080 But at some point it'll turn up again. 564 00:37:28,080 --> 00:37:33,220 And an exact line search would find that point and stop there.
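The effect of a constant learning rate can be tried on the lecture's quadratic f(x, y) = x^2 + b y^2, whose gradient is (2x, 2by). The following is a minimal sketch, not the lecturer's code: the function name `fixed_step_descent`, the starting point (1, 1), and the three step sizes are all illustrative assumptions.

```python
def fixed_step_descent(b, s, steps, start=(1.0, 1.0)):
    """Run x_{k+1} = x_k - s * grad f(x_k) on f(x, y) = x**2 + b*y**2.

    Returns the list of function values f_0, f_1, ..., f_steps.
    The starting point and step sizes used below are illustrative choices.
    """
    x, y = start
    values = [x * x + b * y * y]
    for _ in range(steps):
        gx, gy = 2.0 * x, 2.0 * b * y   # gradient (2x, 2by)
        x, y = x - s * gx, y - s * gy   # same constant step every iteration
        values.append(x * x + b * y * y)
    return values


b = 0.1
too_big = fixed_step_descent(b, s=1.1, steps=50)      # x flips sign and grows
too_small = fixed_step_descent(b, s=0.001, steps=50)  # barely moves in 50 steps
moderate = fixed_step_descent(b, s=0.4, steps=50)     # steady decrease
```

With a constant step s, each iteration multiplies x by (1 - 2s) and y by (1 - 2bs), which is why a step above 1 makes x oscillate with growing size while a tiny step leaves both coordinates almost unchanged.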
565 00:37:36,310 --> 00:37:38,860 That doesn't mean we would-- 566 00:37:38,860 --> 00:37:40,600 we will see in this example where 567 00:37:40,600 --> 00:37:46,960 we will do exact line searches that for a small value of b, 568 00:37:46,960 --> 00:37:51,790 it's extremely slow, that the condition number controls 569 00:37:51,790 --> 00:37:52,660 the speed. 570 00:37:52,660 --> 00:37:55,330 That's really what my message will 571 00:37:55,330 --> 00:37:59,050 be just in these last minutes and next time 572 00:37:59,050 --> 00:38:03,340 the sort of key lecture on gradient descent. 573 00:38:03,340 --> 00:38:06,670 So an exact line search would be that. 574 00:38:06,670 --> 00:38:09,070 So what a backtracking line search-- 575 00:38:15,880 --> 00:38:24,670 backtracking would be take a fixed S like one. 576 00:38:24,670 --> 00:38:32,290 And then be prepared to come backwards. 577 00:38:32,290 --> 00:38:34,060 Cut back by half. 578 00:38:34,060 --> 00:38:36,250 See what you get at that point. 579 00:38:36,250 --> 00:38:40,180 Cut back by half of that to a quarter of the original step. 580 00:38:40,180 --> 00:38:41,200 See what that is. 581 00:38:44,650 --> 00:38:48,970 So the full step might have taken you back 582 00:38:48,970 --> 00:38:52,450 to the upward sweep. 583 00:38:52,450 --> 00:38:55,420 Halfway forward it might still be on the upward sweep. 584 00:38:55,420 --> 00:39:00,760 Might be too much, but so backtracking cuts the step size 585 00:39:00,760 --> 00:39:04,840 in pieces and checks until it-- 586 00:39:08,440 --> 00:39:13,180 So S0, half of S0, quarter of S0, 587 00:39:13,180 --> 00:39:18,250 or obviously, a different parameter, aS0, a squared S0, 588 00:39:18,250 --> 00:39:25,720 and so on until you're satisfied with that step. 589 00:39:25,720 --> 00:39:28,070 And there are of course, many, many refinements. 
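The backtracking idea just described (try S0, then half of it, then a quarter, and so on) might be sketched as follows. The lecture does not say what "satisfied" means, so this sketch assumes the simplest possible test, accepting the first step that lowers f; in practice a stricter sufficient-decrease condition is commonly used. The names `backtracking_step`, `shrink`, and `max_cuts` are hypothetical.

```python
def backtracking_step(f, grad, point, s0=1.0, shrink=0.5, max_cuts=50):
    """Take one gradient step, cutting the step size until f decreases.

    Assumption: "satisfied" means plain decrease, f(trial) < f(point).
    Returns (new_point, accepted_step); the step is 0.0 if nothing worked.
    """
    f_here = f(point)
    g = grad(point)
    s = s0
    for _ in range(max_cuts):
        trial = tuple(p - s * gi for p, gi in zip(point, g))
        if f(trial) < f_here:
            return trial, s          # satisfied with this step
        s *= shrink                  # S0, a*S0, a^2*S0, ...
    return point, 0.0                # no decrease found; stay put


b = 0.1
f = lambda p: p[0] ** 2 + b * p[1] ** 2
grad = lambda p: (2.0 * p[0], 2.0 * b * p[1])

# Starting deliberately too large (s0 = 4) so that some cuts are needed.
new_point, accepted = backtracking_step(f, grad, (1.0, 1.0), s0=4.0)
```

Here the trial steps 4 and 2 overshoot into the upward sweep and are rejected, and the step of size 1 is the first one that lowers f.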
590 00:39:28,070 --> 00:39:31,810 We're talking about the big algorithm 591 00:39:31,810 --> 00:39:40,260 here that everybody, depending on their function, 592 00:39:40,260 --> 00:39:44,250 has different experiences with. 593 00:39:44,250 --> 00:39:46,670 So here's my fundamental question. 594 00:39:50,580 --> 00:39:53,610 Let's think of an exact line search. 595 00:39:53,610 --> 00:39:57,700 How much does that reduce the function? 596 00:39:57,700 --> 00:40:00,400 How much does that reduce the function? 597 00:40:00,400 --> 00:40:05,380 So that's really what the bounds that I want are. 598 00:40:05,380 --> 00:40:08,440 How much does that reduce the function? 599 00:40:08,440 --> 00:40:24,320 And we'll see that the reduction involves the condition number, 600 00:40:24,320 --> 00:40:32,730 m over M. So why don't I turn to the example first, 601 00:40:32,730 --> 00:40:37,260 where we know exact answers? 602 00:40:37,260 --> 00:40:39,980 That gives us a basis for comparison. 603 00:40:39,980 --> 00:40:46,150 And then our math goal is to prove-- 604 00:40:46,150 --> 00:40:50,050 get bounds on the size of f 605 00:40:50,050 --> 00:40:55,330 that match what we see exactly in that example 606 00:40:55,330 --> 00:40:58,120 where we know everything. 607 00:40:58,120 --> 00:41:01,510 We know the gradient. 608 00:41:01,510 --> 00:41:03,140 We know the Hessian. 609 00:41:03,140 --> 00:41:04,090 It's that matrix. 610 00:41:04,090 --> 00:41:05,650 We know the condition number. 611 00:41:05,650 --> 00:41:08,440 So what happens if I start at a point 612 00:41:08,440 --> 00:41:15,105 x0, y0 that's on my surface? 613 00:41:19,110 --> 00:41:20,230 Sorry. 614 00:41:20,230 --> 00:41:22,710 What do I want to do here? 615 00:41:22,710 --> 00:41:23,250 Yeah. 616 00:41:23,250 --> 00:41:31,080 I take a point, x0, y0, and I iterate.
617 00:41:34,350 --> 00:41:54,040 So the new x, y at k plus 1 is x, y at k minus the S, 618 00:41:54,040 --> 00:41:56,940 which I can compute, times the gradient of f. 619 00:41:56,940 --> 00:41:58,710 So I'm going to put in gradient f. 620 00:41:58,710 --> 00:42:00,030 What is the gradient here? 621 00:42:02,790 --> 00:42:05,790 The derivative with respect to x. 622 00:42:05,790 --> 00:42:11,970 So I have 2xk and 2b yk. 623 00:42:16,630 --> 00:42:18,244 And this is the step size. 624 00:42:22,120 --> 00:42:25,450 And for this small problem where we're 625 00:42:25,450 --> 00:42:27,940 going to get such a revealing answer, 626 00:42:27,940 --> 00:42:29,860 I'm going to choose exact line search. 627 00:42:29,860 --> 00:42:31,240 I'm going to choose the best Sk. 628 00:42:34,040 --> 00:42:35,240 And what's the answer? 629 00:42:35,240 --> 00:42:39,500 So I just want to tell you what the iterations are 630 00:42:39,500 --> 00:42:43,520 for that particular function starting at x0, y0. 631 00:42:46,080 --> 00:42:51,460 So let me put start x0, y0. 632 00:42:54,810 --> 00:42:56,790 And I haven't done this calculation myself. 633 00:42:56,790 --> 00:43:01,470 It's taken from the book by Stephen Boyd and Vandenberghe 634 00:43:01,470 --> 00:43:03,240 called Convex Optimization. 635 00:43:03,240 --> 00:43:06,010 Of course, they weren't the first to do this either. 636 00:43:06,010 --> 00:43:11,580 But I'm happy to mention that book Convex Optimization. 637 00:43:11,580 --> 00:43:14,160 And Stephen Boyd will be on campus this spring 638 00:43:14,160 --> 00:43:18,180 actually, in April for three lectures. 639 00:43:18,180 --> 00:43:20,010 This is April, maybe. 640 00:43:20,010 --> 00:43:21,010 Yeah, OK. 641 00:43:21,010 --> 00:43:24,400 So it's this month in two or three weeks. 642 00:43:24,400 --> 00:43:26,470 And I'll tell you about that. 643 00:43:26,470 --> 00:43:34,820 So here are the xk's and the yk's and the f, the function 644 00:43:34,820 --> 00:43:35,320 values.
645 00:43:40,190 --> 00:43:41,400 So where am I going to start? 646 00:43:44,840 --> 00:43:45,440 Yeah. 647 00:43:45,440 --> 00:43:50,480 So I'm starting from the point x0, y0 equal to b, 1. 648 00:43:50,480 --> 00:43:54,110 Turns out that will make our formulas very convenient, 649 00:43:54,110 --> 00:43:57,500 x0, y0 equals b, 1. 650 00:43:57,500 --> 00:43:58,340 Good. 651 00:43:58,340 --> 00:44:00,530 So OK. 652 00:44:00,530 --> 00:44:09,260 So xk is b times the key ratio b minus 1 over b plus 1 653 00:44:09,260 --> 00:44:11,420 to the kth power. 654 00:44:11,420 --> 00:44:15,335 And yk happens to be-- 655 00:44:20,270 --> 00:44:24,020 it has this same ratio. 656 00:44:24,020 --> 00:44:29,600 And my function f has the same ratio too. 657 00:44:29,600 --> 00:44:30,815 This is fk. 658 00:44:30,815 --> 00:44:34,010 It has that same ratio 1 minus b over 1 659 00:44:34,010 --> 00:44:39,710 plus b, to the power 2k, times f0. 660 00:44:39,710 --> 00:44:51,160 That's the beautiful formula that we're 661 00:44:51,160 --> 00:44:54,450 going to take as the best example possible. 662 00:44:54,450 --> 00:44:55,160 Let's just see. 663 00:44:55,160 --> 00:45:04,800 If k equals 0, I have xk equal b, yk equal 1, so I'm starting at b, 1. 664 00:45:04,800 --> 00:45:09,690 And that tells me the rate of decrease of the function. 665 00:45:09,690 --> 00:45:11,680 It's this same ratio. 666 00:45:11,680 --> 00:45:14,730 So what am I learning from this example? 667 00:45:14,730 --> 00:45:20,365 What's jumping out is that this ratio 1 minus b over 1 plus b 668 00:45:20,365 --> 00:45:20,865 is crucial. 669 00:45:25,920 --> 00:45:29,500 If b is near 1, that ratio is small. 670 00:45:29,500 --> 00:45:32,870 If b is near 1, that's near 0 over 2. 671 00:45:32,870 --> 00:45:36,070 And I converge quickly, no problem at all. 672 00:45:36,070 --> 00:45:42,490 But if b is near 0, if my condition number is bad-- 673 00:45:42,490 --> 00:45:51,430 so the bad case, the hard case is small b.
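The closed-form iterates can be checked numerically. The sketch below runs steepest descent with an exact line search on f(x, y) = x^2 + b y^2 from the starting point (b, 1) and compares each iterate with the formulas x_k = b((b - 1)/(b + 1))^k and y_k = ((1 - b)/(1 + b))^k. The step formula s = (g . g) / (2 g . A g) with A = diag(1, b) is derived here for this particular quadratic and is not stated in the lecture; the function names are hypothetical. Because f is quadratic in x and y, the computed function values shrink by the square of the ratio at each step.

```python
def exact_line_search_descent(b, steps):
    """Steepest descent with exact line search on f(x, y) = x**2 + b*y**2.

    Starts at (b, 1) as in the lecture. Returns a list of (x, y, f) tuples.
    """
    x, y = b, 1.0
    history = [(x, y, x * x + b * y * y)]
    for _ in range(steps):
        gx, gy = 2.0 * x, 2.0 * b * y            # gradient (2x, 2by)
        denom = 2.0 * (gx * gx + b * gy * gy)    # 2 * g . A g, A = diag(1, b)
        if denom == 0.0:                         # already at the minimum
            break
        s = (gx * gx + gy * gy) / denom          # minimizes f along -gradient
        x, y = x - s * gx, y - s * gy
        history.append((x, y, x * x + b * y * y))
    return history


def closed_form(b, k):
    """Iterates quoted in the lecture for the start (b, 1)."""
    r = (1.0 - b) / (1.0 + b)
    return b * (-r) ** k, r ** k                 # (b-1)/(b+1) equals -r


b = 0.1
history = exact_line_search_descent(b, steps=5)
ratio = (1.0 - b) / (1.0 + b)
```

The alternating sign in x_k, coming from (b - 1)/(b + 1) being negative, is the algebraic trace of the iterates jumping from one side of the valley to the other.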
674 00:45:55,200 --> 00:46:01,300 Of course, when b is small, that ratio is very near 1. 675 00:46:01,300 --> 00:46:02,590 It's below 1. 676 00:46:02,590 --> 00:46:06,220 The ratio is below 1, so I'm getting convergence. 677 00:46:06,220 --> 00:46:07,360 I do get convergence. 678 00:46:07,360 --> 00:46:09,460 I do go downhill. 679 00:46:09,460 --> 00:46:13,810 But what happens is I don't go downhill very far until I'm 680 00:46:13,810 --> 00:46:15,910 headed back uphill again. 681 00:46:15,910 --> 00:46:20,720 So the picture to draw for this-- 682 00:46:20,720 --> 00:46:26,070 let me change that picture to a picture in the xy 683 00:46:26,070 --> 00:46:29,400 plane of the level sets. 684 00:46:29,400 --> 00:46:33,870 So the picture really to see is in the xy plane. 685 00:46:33,870 --> 00:46:37,395 The level sets f equal constant. 686 00:46:37,395 --> 00:46:38,940 That's what a level set is. 687 00:46:38,940 --> 00:46:43,570 It's a set of points, x and y where f has the same value. 688 00:46:43,570 --> 00:46:46,510 And what do those look like? 689 00:46:46,510 --> 00:46:48,000 Oh, let's see. 690 00:46:50,920 --> 00:46:53,680 I think-- what do you think? 691 00:46:53,680 --> 00:46:59,860 What do the level sets look like for this particular function? 692 00:46:59,860 --> 00:47:04,520 If I look at the curve x squared plus b y squared equal 693 00:47:04,520 --> 00:47:07,240 a constant, that's what the level set is. 694 00:47:07,240 --> 00:47:13,620 This is x squared plus by squared equal a constant. 695 00:47:13,620 --> 00:47:16,402 What kind of a curve is that? 696 00:47:16,402 --> 00:47:17,330 AUDIENCE: [INAUDIBLE]. 697 00:47:17,330 --> 00:47:19,470 GILBERT STRANG: That's an ellipse. 698 00:47:19,470 --> 00:47:21,900 And what's up with that ellipse? 699 00:47:21,900 --> 00:47:24,750 What's the shape of it? 700 00:47:24,750 --> 00:47:27,960 Because there is no xy term, that ellipse 701 00:47:27,960 --> 00:47:33,180 is like, well lined up with the axes. 
702 00:47:33,180 --> 00:47:37,770 The major axes of the ellipse are in the x and y directions, 703 00:47:37,770 --> 00:47:42,150 because there is no cross term here. 704 00:47:42,150 --> 00:47:46,020 We could always have diagonalized our matrix 705 00:47:46,020 --> 00:47:47,623 if it wasn't diagonal. 706 00:47:47,623 --> 00:47:49,290 And that wouldn't have changed anything. 707 00:47:49,290 --> 00:47:52,740 So it's just rotating this space. 708 00:47:52,740 --> 00:47:54,090 And we've done that. 709 00:47:57,570 --> 00:47:59,130 What do the level sets look like? 710 00:47:59,130 --> 00:48:00,870 They're ellipses. 711 00:48:00,870 --> 00:48:06,690 And suppose b is a small number. Then what's with the ellipses? 712 00:48:06,690 --> 00:48:10,530 If b is small, I have to go pretty-- 713 00:48:10,530 --> 00:48:14,070 I have to take a pretty large y to match a-- 714 00:48:14,070 --> 00:48:15,090 a change in x. 715 00:48:15,090 --> 00:48:18,340 I think maybe they're ellipses of that sort. 716 00:48:18,340 --> 00:48:18,840 Are they? 717 00:48:24,220 --> 00:48:26,780 They're lined up with the axes. 718 00:48:26,780 --> 00:48:30,610 And I hope I'm drawing in the right direction. 719 00:48:30,610 --> 00:48:33,807 They're long and thin. 720 00:48:33,807 --> 00:48:34,390 Is that right? 721 00:48:34,390 --> 00:48:36,880 Because I would have to take a pretty big y 722 00:48:36,880 --> 00:48:40,120 to make up for a small b. 723 00:48:40,120 --> 00:48:41,830 OK. 724 00:48:41,830 --> 00:48:44,140 So what happens when I'm descending? 725 00:48:44,140 --> 00:48:45,910 This is a narrow valley then. 726 00:48:45,910 --> 00:48:52,240 Think of it as a valley which comes down steeply 727 00:48:52,240 --> 00:48:54,730 in the y direction, but in the x direction 728 00:48:54,730 --> 00:48:57,560 I'm crossing the valley slow-- 729 00:48:57,560 --> 00:49:00,250 Oh, is that right? 730 00:49:00,250 --> 00:49:04,300 So what happens if I take a point there?
731 00:49:04,300 --> 00:49:06,690 Oh yeah, I remember what to do. 732 00:49:06,690 --> 00:49:10,850 So let's start at that point on that ellipse. 733 00:49:14,070 --> 00:49:17,490 And those were the level sets, f equal constant. 734 00:49:17,490 --> 00:49:20,980 So what's the first search direction? 735 00:49:20,980 --> 00:49:23,320 What direction do I move from x0, y0? 736 00:49:28,510 --> 00:49:31,210 Do I move along the ellipse? 737 00:49:31,210 --> 00:49:35,490 Absolutely not, because along the ellipse f is constant. 738 00:49:35,490 --> 00:49:39,430 The gradient direction is perpendicular to the ellipse. 739 00:49:39,430 --> 00:49:42,280 So I move perpendicular to the ellipse. 740 00:49:42,280 --> 00:49:43,285 And when do I stop? 741 00:49:47,040 --> 00:49:50,930 Pretty soon, because very soon I'm going back up again. 742 00:50:02,410 --> 00:50:04,120 I haven't practiced with this curve. 743 00:50:04,120 --> 00:50:08,400 But I know-- and time is up, thank God. 744 00:50:08,400 --> 00:50:10,780 So what do I know is going to happen? 745 00:50:10,780 --> 00:50:13,780 And by Friday we'll make it happen. 746 00:50:13,780 --> 00:50:22,840 So what do we see for the curve, the track of the-- 747 00:50:22,840 --> 00:50:24,776 can you say it? 748 00:50:24,776 --> 00:50:25,770 AUDIENCE: Zigzag. 749 00:50:25,770 --> 00:50:28,110 GILBERT STRANG: It's a zigzag, yeah. 750 00:50:28,110 --> 00:50:31,110 We would like to get here, but we're not aimed here at all. 751 00:50:31,110 --> 00:50:36,000 So we zig, zag, zig, zag, and very slowly approach 752 00:50:36,000 --> 00:50:36,540 that point. 753 00:50:39,210 --> 00:50:41,910 And how slowly? 754 00:50:41,910 --> 00:50:48,990 With that multiplier, 1 minus b over 1 plus b. 755 00:50:48,990 --> 00:50:51,000 That's what I'm learning from this example, 756 00:50:51,000 --> 00:50:53,010 that that's a key number. 757 00:50:53,010 --> 00:50:56,760 And then you could ask, well, what about general examples?
758 00:50:56,760 --> 00:51:01,470 This was one specially chosen example with an exact solution. 759 00:51:01,470 --> 00:51:04,530 Well, we'll see at the beginning of next time 760 00:51:04,530 --> 00:51:08,400 that for a convex function this is typical. 761 00:51:08,400 --> 00:51:14,550 This 1 minus b is the critical quantity, or 1 over b, 762 00:51:14,550 --> 00:51:17,760 or how small is b compared to 1. 763 00:51:17,760 --> 00:51:20,110 So that will be the critical quantity. 764 00:51:20,110 --> 00:51:24,390 And we see it in this ratio 1 minus b over 1 plus b. 765 00:51:24,390 --> 00:51:30,210 So if b is 1/100, this is 0.99 over 1.01. 766 00:51:30,210 --> 00:51:31,830 It's virtually 1. 767 00:51:31,830 --> 00:51:32,460 OK. 768 00:51:32,460 --> 00:51:36,780 So next time is sort of a key lecture 769 00:51:36,780 --> 00:51:43,380 to see what I've just said, that this controls 770 00:51:43,380 --> 00:51:46,440 the convergence of steepest descent, 771 00:51:46,440 --> 00:51:51,130 and then to see an idea that speeds it up. 772 00:51:51,130 --> 00:51:54,660 That idea is called momentum or heavy ball. 773 00:51:54,660 --> 00:52:02,820 So the physical idea is if you had a heavy ball right there 774 00:52:02,820 --> 00:52:06,930 and wanted to get it down the valley toward the bottom, 775 00:52:06,930 --> 00:52:10,650 you wouldn't go perpendicular to the level sets. 776 00:52:10,650 --> 00:52:11,280 Not at all. 777 00:52:11,280 --> 00:52:13,680 You'd let the momentum of the ball take over 778 00:52:13,680 --> 00:52:16,990 and let it roll down. 779 00:52:16,990 --> 00:52:21,500 So the idea of momentum is to model the possibility 780 00:52:21,500 --> 00:52:26,240 of letting that heavy ball roll instead of directing it 781 00:52:26,240 --> 00:52:30,380 by the steepest descent at every point. 782 00:52:30,380 --> 00:52:34,280 So there's an extra term in steepest descent, the momentum 783 00:52:34,280 --> 00:52:36,230 term that accelerates.
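To get a feel for how the ratio (1 - b)/(1 + b) translates into work, here is a small arithmetic sketch. It assumes 0 < b < 1 and asks how many steps are needed before the ratio raised to the kth power (the factor by which x_k and y_k have shrunk in the example) falls below a chosen target. The helper names and the target of one millionth are illustrative assumptions, not values from the lecture.

```python
import math


def contraction_ratio(b):
    """The key ratio (1 - b) / (1 + b), assuming 0 < b < 1."""
    return (1.0 - b) / (1.0 + b)


def steps_needed(b, target):
    """Smallest k with contraction_ratio(b) ** k <= target (0 < target < 1)."""
    r = contraction_ratio(b)
    return math.ceil(math.log(target) / math.log(r))


# Compare a well-conditioned case with the lecture's b = 1/100.
for b in (0.5, 0.1, 0.01):
    r = contraction_ratio(b)
    k = steps_needed(b, 1e-6)
    print(f"b = {b}: ratio = {r:.4f}, steps to shrink by 1e-6: {k}")
```

As b shrinks, the condition number 1 over b grows and the step count grows roughly in proportion to it, which is the slowdown the zigzag picture shows.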
784 00:52:36,230 --> 00:52:36,860 OK. 785 00:52:36,860 --> 00:52:39,530 So Friday is the day. 786 00:52:39,530 --> 00:52:40,190 Good. 787 00:52:40,190 --> 00:52:42,130 See you then.