Description
This lecture is about the central structure of deep neural networks, which are a major force in machine learning. The aim is to construct a function F that learns the training data and can then be applied to the test data.
Summary
The net has layers of nodes. Layer zero is the data.
We choose a matrix of "weights" from each layer to the next.
Nonlinear step at each layer! Negative values become zero!
We know correct class for the training data.
Weights optimized to (usually) output that correct class.
Related section in textbook: VII.1
Instructor: Prof. Gilbert Strang
Lecture 26: Structure of Neural Nets
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
GILBERT STRANG: OK, so this is an important day, and Friday was an important day. I hope you enjoyed Professor Sra's terrific lecture as much as I did. You probably saw me taking notes like mad for the section that's now to be written about stochastic gradient descent.
And he promised a theorem, if you remember, and there wasn't time. And so he was going to send it to me or still is going to send it to me. I'll report, I haven't got it yet, but I'll bring it to class, waiting to see, hopefully. And that will give us a chance to review stochastic gradient descent, the central algorithm of deep learning.
And then this today is about the central structure of deep neural nets. And some of you will know already how they're connected, what the function F, the learning function-- you could call it the learning function-- that's constructed. The whole system is aiming at constructing this function F which learns the training data and then applying it to the test data. And the miracle is that it does so well in practice. That's what has transformed deep learning into such an important application.
Chapter 7 has been up for months on the math.mit.edu/learningfromdata site, and I'll add it to Stellar. Of course, that's where you'll be looking for it. OK, and then the second, the back propagation, the way to compute the gradient, I'll probably reach that idea today. And you'll see it's the chain rule, but how is it organized.
OK, so what's the structure? What's the plan for deep neural nets? Good. Starting here, so what we have is training data. So we have vectors, x1 to x-- what should I use for the number of samples that we have in the training data? Well, let's say D for Data. OK.
And each vector-- those are called feature vectors. So each one, each x, has, like, m features. So maybe my notation isn't so hot here. I have a whole lot of vectors. Let me not use the subscript for those right away.
So vectors, feature vectors, and each vector has got maybe shall we say m features? Like, if we were measuring height and age and weight and so on, those would be features. The job of the neural network is to create-- and we're going to classify. Maybe we're going to classify men and women or boys and girls. So let's make it a classification problem, just binary.
So the classification problem is-- what shall we say? Minus 1 or 1, or 0 or 1, or boy or girl, or cat or dog, or truck or car, or anyway, just two classes. So I'm just going to do two-class classification. We know which class the training data is in. For each vector x, we know the right answer.
So we want to create a function that gives the right answer, and then we'll use that function on other data. So let me write that down. Create a function F of x that gets-- well, gets most of the classes correct. In other words, F of x should be negative when the classification is minus 1, and F of x should be positive when the classification is plus 1.
And as we know, we don't necessarily have to get every x, every sample, right. That may be over-fitting. If there's some sample that's just truly weird, by getting that right we're going to be looking for truly weird data in the test set, and that's not a good idea. We're trying to discover the rule that covers almost all cases but not every crazy, weird case. OK? So that's our job, to create a function F of x that is correct on almost all of the training data. Yeah.
So before I draw the picture of the network, let me just remember to mention the site Playground-- playground.tensorflow.org. I don't know if you've looked at that, so I'm going to ask you. How many of you know that site or have met with it? Just a few, OK. OK, so it's not a very sophisticated site. It's got only four examples, four examples.
So one example is a whole lot of points that are blue, B for Blue, inside a bunch of points from another set that are O for Orange. OK, so those are the two classes, orange and blue. So the feature vector here is just the x-y coordinates of the points. And our job is to find a function that's positive on these points and negative on those points.
So there is a simple model problem, and I recommend it-- well, only partly if you're an expert in deep learning. This is for children but, morally, I certainly learned from playing in this playground. So you set the step size. Do you set it, or does it set it? I guess you can change it. I don't think I've changed it.
What else do you set? Oh, you set the nonlinear activation-- the nonlinear activation function. And let me just go over here and say what function people now mostly use.
The activation function is called ReLU-- pronounced different ways. I don't know how we got into that crazy name. So the function ReLU of x is the maximum-- the larger-- of 0 and x.
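A minimal NumPy sketch of that activation, applied componentwise to a vector; the sample values are only illustrative.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied componentwise."""
    return np.maximum(0, x)

# Negative values become zero; positive values pass through unchanged.
print(relu(np.array([-2.0, -0.5, 0.0, 3.0])))   # [0. 0. 0. 3.]
```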
The point is, it's not linear, and the point is that if we didn't allow nonlinearity in here somewhere, we couldn't even solve this playground problem. Because if our classifiers were all linear classifiers, like support vector machines, I couldn't separate the blue from the orange with a plane. It's got to somehow create some nonlinear function which maybe the function is trying to be-- a good function would be a function of r and theta maybe, maybe r minus 5.
So maybe the distance out to that. Let's suppose that distance is 5. Then, r minus 5 will be negative on the blues, because r is small. And r minus 5 will be positive on the oranges, because r is bigger. And therefore, we will have the right signs, less than 0 or greater than 0, and it'll classify this data, this training data. Yeah.
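A minimal sketch of that hand-built classifier, assuming blue points clustered inside radius 5 and orange points on a ring outside it (the radii and point counts are illustrative, not taken from the Playground site): the function F = r - 5 separates the two classes by sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the circle example: blue points near the origin,
# orange points on a ring of radius about 8 (both choices illustrative).
blue = rng.normal(scale=2.0, size=(200, 2))
theta = rng.uniform(0, 2 * np.pi, size=200)
orange = np.column_stack([8 * np.cos(theta), 8 * np.sin(theta)])

def F(points):
    """F(x, y) = r - 5: negative inside the circle of radius 5, positive outside."""
    r = np.hypot(points[:, 0], points[:, 1])
    return r - 5

print("blue classified negative:", (F(blue) < 0).mean())
print("orange classified positive:", (F(orange) > 0).mean())
```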
So it has to do that. This is not a hard one to do. There are four examples, as I say, two are trivial. It finds a good function. Well yeah, I've forgotten, they're so trivial, they shouldn't be mentioned, and then this is the medium test.
And then the hard test is when you have a sort of spiral of oranges, and inside, you have a spiral of blues. That was cooked up by a fiend. So the system is trying to find a function that's positive on one spiral and negative on the other spiral, and that takes quite a bit of time, many, many epochs.
I learned what an epoch is. Did you know what an epoch is? I didn't know whether it was just a fancy word for counting the steps in gradient descent. But it counts the steps, all right, but one epoch is the number of steps that matches the size of the training data. So if you have a million samples-- where with ordinary gradient descent you would be using all million in every step-- you'd have a million-sample computation per step.
Of course, stochastic gradient descent just does a mini-batch of 1 or 32 or something, but anyway. So you have to do enough mini-batches so that the total number you've covered is the equivalent of one full run through the training data, and that was an interesting point. Did you pick up that point? That in stochastic gradient descent, you could either do a mini-batch and then put those samples back in the soup-- so with replacement. Or you could just put your data in some order, from one to a zillion.
So here's a first x and then more and more x's, and then you just randomize the order. So you'd have to randomize the order for stochastic gradient descent to be reasonable, and then take a mini-batch and a mini-batch and a mini-batch and a mini-batch. And when you get to the bottom, you've finished one epoch. And then you'd probably randomize again, maybe, if you wanted to live right. And go through the mini-batches again, and probably do it 1,000 times or more.
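A small sketch of that bookkeeping, assuming the without-replacement version (shuffle the order, walk through it mini-batch by mini-batch, re-shuffle each epoch); the gradient step itself is left out, and the sizes are illustrative.

```python
import numpy as np

def minibatch_epochs(num_samples, batch_size, num_epochs, seed=0):
    """Yield (epoch, indices) pairs; one epoch = one full pass through the data."""
    rng = np.random.default_rng(seed)
    for epoch in range(num_epochs):
        order = rng.permutation(num_samples)        # re-randomize the order each epoch
        for start in range(0, num_samples, batch_size):
            yield epoch, order[start:start + batch_size]

# One epoch over 10 samples with mini-batches of size 3 (sizes are illustrative):
for epoch, batch in minibatch_epochs(10, 3, num_epochs=1):
    print(epoch, batch)    # each sample index appears exactly once per epoch
```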
Anyway, so I haven't said yet what you do, what this F of x is like, but you can sort of see it on the screen. Because as it creates this function F, it kind of plots it. And what you see on the screen is the 0 set for that function. So perfect would be for it to go through 0-- if I had another color. Oh, I do have another color. Look, this is the first time the whole semester blue is up here.
OK, so if the function was positive there-- in this part, on the blues-- and negative outside that region for the oranges, that would be just what we want. Right? That would be what this little Playground site is creating. And on the screen, you'll see it. You'll see this curve, where it crosses 0.
So that curve, where it crosses 0, is supposed to separate the two sets. One set is positive, one set is negative, where 0 is in between. And the point is, it's not a straight line, because we've got this nonlinear function. This is nonlinear, and it allows us to have functions like r minus 5. And so at 5, that's where the function would be 0, and you'll see that on the screen.
You might just go to playground.tensorflow.org. Of course, TensorFlow is a big system. This is the child's department, but I thought it was pretty good. And then on this site, you decide how many layers there will be, how many neurons in each layer. So you create the structure that I'm about to draw.
And you won't be able to solve this problem-- to find a function F that learns that data-- without a number of layers and a number of neurons. If you don't give it enough, you'll see it struggling. The 0 set tries to follow this, but it gives up at some point. This one doesn't take too many layers, and for the two trivial examples, just a few neurons do the job.
OK. So now, that's a little comment on one website. If you know other websites that I should know and should call attention to, could you send me an email? I'm just not aware of everything that's out there.
Or if you know a good Convolutional Neural Net, CNN, that is available to practice on, where you could give it the training set. That's what I'm talking about here. I'd be glad to know, because I just don't know all that I should.
OK. So what does the function look like? Well, as I say, linear isn't going to do it, but linear is a very important part of it, of this function F. So the function F really has the form-- well, so we start here with a vector of one, two, three, four, five components-- m is five. This is the vector x, five components. OK, so let me erase that now.
OK, so then we have layer 1 with some number of points. Let's say, n1 is 6 neurons, and let me make this simple. I'll just have that one layer, and then I'll have the output. This will be the output layer, and it's just going to be one number.
So I'm going to have a matrix, A1, that takes me from this. A1 will be 6 by 5, because I want 6 outputs and 5 inputs. 6 by 5 matrix, so I have 30 weights to choose there. And so the y that comes out-- y1-- will be A1 times x0. So x0 is the feature vector with 5 components.
So that's a purely linear thing, but we also want an offset, an offset vector-- call it b1. Then this y that's coming out, y1 equals A1 x0 plus b1, has 6 components. The A1 is 6 by 5, the x0 was 5 by 1, and then of course, this b1 is 6 by 1. So these are the weights. Yeah, I'll call them all weights, weights to compute.
So these are connected. The usual picture is to show all these connections. I'll just put in some of them. So in here, we have 30 plus 6 parameters, 36 parameters, and then I'm going to close this. It's going to be a very shallow thing, so that will be just 1 by 6. Yeah. OK.
Right, so we're just getting one output. So that's just a single number at this final point, but of course, the whole idea of deep neural nets is that you have many layers. So 36 more realistically is in the tens of thousands, and you have it multiple times. And the idea seems to be that you can separate what layer one learns about the data from what layer two learns about the data.
Layer one-- this A1, apparently by just looking after the computation-- learns some basic facts about the data. The next one, A2, which would go in here, would learn more detail, and then A3 would learn more details. So we would have a number of layers, and it's that construction that has made neural nets successful. But I haven't finished, because right now, it's only linear. Right now, I just have, I'll call it A2, in here.
Right now, I would just have a matrix multiplication-- apply A1 and then apply A2-- but in between there is a one-by-one action on each component by this function. So that function acts on each number to give that number back again or to give 0. So in there is ReLU. In here comes ReLU on each component-- 6 copies of ReLU acting on each of those 6 numbers. Right? So really x1 comes from y1 by applying ReLU to it.
Then, that gives the x. So here are the y's from the linear part, and here are the x's. That's y1, a vector from the linear part-- really an affine map. Linear plus constant, that's affine. And then the next step is, component by component, we apply this function, and we get x1, and then do it again and again and again.
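A minimal sketch of this forward pass with the blackboard sizes-- 5 features, 6 neurons, 1 output; the random A1, b1, A2 are placeholders standing in for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Blackboard sizes: 5 features in x0, 6 neurons in layer 1, 1 output.
A1 = rng.standard_normal((6, 5))   # 30 weights from layer 0 to layer 1
b1 = rng.standard_normal(6)        # 6 offsets, so 36 parameters in this layer
A2 = rng.standard_normal((1, 6))   # 1 by 6 for the single output

def relu(v):
    return np.maximum(0, v)

def F(x0):
    y1 = A1 @ x0 + b1      # affine step: y1 has 6 components
    x1 = relu(y1)          # nonlinear step, applied componentwise
    return A2 @ x1         # one output number (its sign gives the class)

x0 = rng.standard_normal(5)        # a placeholder feature vector
print(F(x0))
```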
So do you see the function? How do I describe now the function F of x? So the learning function which depends on the weights, on the A's and b's. So I start with an x, I apply A1 to it. Yeah, let me do this.
This is the function F of x. F of x is going to be F3, let's say, of F2 of F1 of x, one, two, three, parentheses, right? OK, so it's a chain, you could say. F is a-- what's the right word for a chain of functions, if I take a function of a function? The reason I use the word chain is that the chain rule gives the derivative.
So a function of a function of a function, that's called composition, composing functions. So this is a composition. I don't know if there's a standard symbol for starting with F1 and doing some composition and some more composition. And now what are those separate F's? So F1 of a vector would be ReLU of A1 x0 plus b1-- it includes the ReLU part, the nonlinear part.
So two parts, you do the linear or affine map on your feature vector, and then component by component you apply that nonlinear function. And it took some years before that nonlinear function became a big favorite. People imagined that it was better, it was important, to have a smooth function. So the original functions were sigmoids, like S curves, but of course, it turned out that experiments showed that this worked even better.
Yeah, so that would be F1, and then F2 would have the same form, and F3 would have the same form. So maybe this had 36 weights, and the next one would have another number and the next another number. You get quite complicated functions by composition, like e to the sine of x, or e to the sine of the logarithm of x, or things like that. Pure math has asked, what functions can you get? Try to think of them all.
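To make the chain concrete, here is a sketch of F(x) = F3(F2(F1(x))) for any list of layers, with F_k(v) = ReLU(A_k v + b_k); the ReLU is skipped after the last layer so the output can take either sign (a common convention, assumed here), and the layer sizes and random weights are placeholders.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def F(layers, x):
    """Compose F_L(...F_2(F_1(x))...) with F_k(v) = ReLU(A_k v + b_k).
    The final layer is left linear so the output can be negative or positive."""
    for k, (A, b) in enumerate(layers):
        x = A @ x + b
        if k < len(layers) - 1:
            x = relu(x)
    return x

# Placeholder sizes 5 -> 6 -> 6 -> 1 with random (untrained) weights.
rng = np.random.default_rng(1)
sizes = [5, 6, 6, 1]
layers = [(rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

print(F(layers, rng.standard_normal(5)))
```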
Now, what kind of functions do we have here? What can I say about F of x as a function, as a math person? What kind of a function is it? So it's created out of matrices and vectors, out of a linear or affine map, followed by a nonlinear, by that particular nonlinear function. So what kind of a function is it?
Well, I've written those words down up here, and F of x is going to be a continuous piecewise linear function. Because every step is continuous, that's a continuous function. Linear functions are continuous functions, so we're taking a composition of continuous functions, so it's continuous. And it's piecewise linear, because part of it is linear and part of it-- the ReLU part-- is piecewise linear, and composing those gives piecewise linear. So this is some continuous piecewise linear function of x, x in m dimensions. OK.
So one little math question which I think helps to understand, to like to swallow the idea of a chain, of the kind of chain we have here, of linear followed by ReLU. So here's my question. This is the question I'm going to ask. And by the way, back propagation is certainly going to come Wednesday rather than today. That's a major topic in itself.
So let me keep going with this function. Could you get any function whatsoever this way? Well, no, you only get continuous piecewise linear functions. It's an interesting case.
Let me just ask you. One of the exercises says, if I took two continuous piecewise linear functions-- well, the next 20 minutes are an attempt to give us a picture of the graph of a continuous piecewise linear function of, say, two variables. So I have m equal to 2, and I draw its graph.
OK, help me to draw this graph. So this would be a graph of F of x1, x2, and it's going to be continuous and piecewise linear. So what does its graph look like? That's the question. What does the graph of a piecewise linear function look like?
Well, it's got flat pieces in between the changes-- I do say piecewise, and that means it's got different pieces. But within a piece, it's linear, and the pieces fit with each other, because it's continuous. So I visualize, well, it's like origami. This is the theory of origami almost. So right, origami, you take a flat thing, and you fold it along straight folds.
So what's different from origami? Maybe not much. Well, maybe origami allows more than we allow here, or origami would allow you to fold it up and over. So origami would give you a multi-valued thing, because it's got a top and a bottom and other folds. This is just going out to infinity in flat pieces, and the question will be, how many pieces?
So let me ask you that question. How many pieces do I have? Do you see what I mean by a piece? So I'm thinking of a graph that has these flat pieces, and they're connected along straight edges. And those straight edges come from the ReLU operation.
Well, that's got two pieces. Actually, we could do it in 1D. In 1D, we could count the number of pieces pretty easily. So what would a piecewise linear function look like? Let me put it over here on the side and erase it soon. OK.
So here's m equal 1, a continuous, piecewise, linear F. I'll just draw its graph. So OK, so it's got straight pieces, straight pieces like so. Yeah, you've got the idea. It's a broken line type. Sometimes, people say broken line, but I'm never sure that's a good description of this.
Piecewise linear, continuous-- so it's continuous because the pieces meet, and it's piecewise linear, obviously. OK, so that's the kind of picture I have for a function of one variable. Now, my question-- as an aid to try to visualize this function in 2D-- is to see if we can count the pieces. Yes. So that's in the notes. I found it in a paper by five authors for a meeting. So actually, the whole world of neural nets runs on the conferences every couple of years that everybody prepares for and submits more than one paper to.
So it's kind of a piecewise linear conference, and those are the big conferences. OK. So this is the back propagation section, and I want to look at the-- OK. So this is a paper by Kleinberg and four others. Kleinberg, he's a computer science guy at Cornell. He was a PhD from here in math, and he's a very cool and significant person, not so much on neural networks as just this whole part of computer science. Right.
So anyway, they and other people too have asked this same problem. Suppose I'm in two variables. So what are you imagining now for the surface, the graph of F of x and y? It has these lines, fold lines, right? I'm thinking it has fold lines.
So I can start with a complete plane, and I fold it along one line. So now, it's like ReLU. It's one half plane there going into a different half plane there. Everybody with it? And now, I take that function, that surface which just has two parts, and I put in another fold.
OK, how many parts have I got now? I think four, am I right? Four parts, yes, because this will be different from this, because it was folded along that line. So these will be four different pieces. They have the same value at the center there, and they match along the lines.
So the number of flat pieces is four for this. So that's with two folds, and now I just want to ask you, with N folds how many pieces are there? Can I get up to three folds? So I'm going to look for the number of pieces as a function of the number of folds.
So let me just use a notation, maybe r. r is the number of flat pieces, and m is the dimension of x. In my picture, it's two, and N is the number of folds.
So let me say it again. I'm taking a plane. I'll fold that plane-- because the dimension was two-- I'll fold it N times. How many pieces? How many flat pieces?
This would be a central step in understanding how close the function-- what freedom you have in the function F. For example, can you approximate any continuous function by one of these functions F by taking enough folds? Seems like the answer should be yes, and it is yes. For pure math, that's one question. Is this class of functions universal?
So the universality theorem would be to say that any function-- sine x, whatever-- could be approximated as close as you like by one of these guys with enough folds. And over here, we're kind of making it more numerical. We're going to count the number of pieces just to see how quickly do they grow.
So what happens here? So I have four pieces. Right now, I have N equal to 2. m is 2 here in this picture. And I'm trying to draw this surface-- in here I've put in 2 folds.
Did I take N? Yeah, two folds, and now I'm going to go up to three folds. OK. So let me fold it along that line. How many pieces have I got now?
Let's see, can I count those pieces? Is it seven? So what is a formula? What if I do another fold? Yeah, let's pretend we do another fold. Yeah?
AUDIENCE: [INAUDIBLE]
GILBERT STRANG: Uh, yeah. Well, maybe that's going to be it. It's a kind of nice question, because it asks you to visualize this thing. OK. So what happened? How many of those lines will be-- if I put in a fourth line-- how many? Yeah, how many new pieces do I create?
That's kind of the question, and I'm assuming that fourth line doesn't go through any of these points. It's sort of in general position. So I put it in a fourth line, da-da-da-da, there it is. OK, so what happened here? How many new ones did it create?
How many new ones did it create? Let me make that one green, because I'm distinguishing that's the guy that's added after the original. We had seven. We had seven pieces, and now we've got more. Was it seven?
It was, wasn't it? One, two, three, four, five, six, seven, but now how many pieces have I got? Or how many pieces did this new line create? We want to build it up, use a recursion. How many pieces did this new-- well, this new line created one new piece there. Right? One new piece there, one new piece there, one new piece there, so there are four new pieces. OK.
Yes, so there's some formula that's going to tell us that, and now what would the next one create? Well, now I have one, two, three, four lines. So now, I'm going to put through a fifth line, and that will create a whole bunch of pieces. I'm losing the thread of this argument, but you're onto it. Right? Yeah, so any suggestions? Yeah.
AUDIENCE: Yeah, I think you add essentially the number of lines that you have each time you add a line at most.
GILBERT STRANG: OK. Yes. That's right. So there is a recursion formula that I want to know, and I learned it from Kleinberg's paper. And then we have an addition to do, so the recursion will tell me how much it goes up with each new function, and then we have to add. OK. So the recursion formula, let me write that down.
So this is r of N and m that I'd like to find a formula for. It's the number of flat pieces with an m dimensional surface-- well, we're taking m to be 2-- and N folds. So N equal 1, 2, 3. Let's write down the numbers we know.
With one fold, how many pieces? Two, good, so far so good. With one fold, there were two pieces. So this is the count r, and then with two folds, how many? Oh, we've gone past that point. So can we get back to just those two? Was it four?
AUDIENCE: Yes.
GILBERT STRANG: OK, thanks. Now, when I put in that third fold, how many did I have without the green line yet? Seven, was it seven? And when the fourth one went in, that green one, how many have I got in this picture? So the question is how many new ones did I create, I guess. So that line got chopped into that piece, that piece, that piece, that piece, four pieces for the new line.
Four pieces for the new line, and then each of those pieces like added a flat bit. Because that piece from here to here separated these two which were previously just one piece, one flat piece. I folded on that line. I folded on this. I folded there. I think it went up by 4 to 11.
So now, we just have to guess a formula that matches those numbers, and then of course, we really should guess it for any m and any N. And I'll write down the formula that they found. It involves binomial numbers. Everything in the world involves binomial numbers, because they satisfy every identity you could think of.
So here's their formula. r with N folds, and we're in m dimensions. So we've really in our thinking had m equal to 2, but we should grow up and get m to be five dimensional. So we have a five dimensional-- let's not think about that. OK.
So it turns out it's a sum of binomial numbers-- N choose 0, N choose 1, up to N choose m. So for m equals 2, which is my picture, it's N choose 0 plus N choose 1 plus N choose 2, and what are these? What does that N choose 2 mean, for example? That's a binomial number. I don't know if you're keen on binomial numbers. Some people, their whole lives go into binomial numbers.
So it's something like-- is it N factorial divided by N minus 2 factorial and 2 factorial? I think that's what that number means. That's the binomial number. So at this point, I'm hoping to get the answer seven, I think. I'm in m equal to-- I've gone up to 2.
Yeah, so I think I've obviously allowed for three cuts, and the r, when we had just three, was 7. So now I'm taking N to be 3, and I'm hoping the answer is 7. So I add these three things. So what is 3, the binomial number 3 with 2? I've forgotten how to say that.
I'm ashamed to admit. 3 choose 2, thanks. I knew there was a good way. So what is 3 choose 2? Well, put in 3, and 2 is in there already, so that'd be 6 over 1 times 2. This would be 3. Would that be 3? And what is 3 choose 1?
AUDIENCE: 3.
GILBERT STRANG: How do you know that? You're probably right. 3, I think, yeah. Oh yeah, probably a theorem that if the two numbers add to 3, the binomial numbers are equal. Yeah, so I'm doing N equals 3 here.
OK. So yeah, I agree. That's 3, and what about N choose 0? There you have to live with 0 factorial, but 0 factorial is by no means 0. So what is 0 factorial? 1, yeah.
I remember when I was an undergraduate having a bet on that. I won, but he didn't pay off. Yeah, so 3 choose 0 is 3 factorial over 3 factorial times 0 factorial. So it's 6 over 6 times 1.
So it's 1. Yeah, 1 and 3 and 3 make 7. So that proves the formula. Well, it doesn't quite prove the formula, but the way to prove it is by an induction.
If you like this stuff, it's the recursion that you use induction on-- which is just what we did now, what we did here. Here comes in line number 4, and it cuts through, and then we just counted the 4 new pieces there. So yeah, so let me just tell you the recursion for r of N and m.
The number we're looking for, r of N and m, is the number that we had with one less cut-- that's the previous count of flat pieces, r of N minus 1 and m-- plus the number of pieces that the new cut was divided into, which was 4 here, and that's r of N minus 1 and m minus 1. So r of N, m equals r of N minus 1, m plus r of N minus 1, m minus 1. Yeah, and I won't go further.
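A quick numerical check that the binomial formula and the recursion agree, reproducing the counts 2, 4, 7, 11 from the board for m = 2.

```python
from math import comb
from functools import lru_cache

def r_formula(N, m):
    """r(N, m) = C(N,0) + C(N,1) + ... + C(N,m): flat pieces after N folds in m dimensions."""
    return sum(comb(N, i) for i in range(m + 1))

@lru_cache(maxsize=None)
def r_recursion(N, m):
    """Recursion from the lecture: r(N, m) = r(N-1, m) + r(N-1, m-1)."""
    if N == 0 or m == 0:
        return 1                     # no folds, or a 0-dimensional input: one piece
    return r_recursion(N - 1, m) + r_recursion(N - 1, m - 1)

for N in range(1, 5):
    print(N, r_formula(N, 2), r_recursion(N, 2))    # prints 2, 4, 7, 11 for m = 2
```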
Time's up, but that rule for recursion is proved in Section 7.1, taken from the paper by Kleinberg and others. Yeah. So OK, I think this is-- I don't know what you feel. For me, this gave me a better feeling that I was understanding what kind of functions we have here. And so then the question is-- with this family of functions, we want to choose the weights, the A's and b's, to match the training data. So we have a problem of minimizing the total loss, and we have a gradient descent problem.
So we have to find the gradient, so that's Wednesday's job. Wednesday's job is to find the gradient of F, and that's back propagation. Good. Thank you very much. 7.1 is done.
Problems for Lecture 26
From textbook Section VII.1
4. Explain with words or show with graphs why each of these statements about Continuous Piecewise Linear functions (CPL functions) is true:
\(\boldsymbol{M} \hspace{4pt}\) The maximum \(M(x,y)\) of two CPL functions \(F_1(x,y)\) and \(F_2(x, y)\) is CPL.
\(\boldsymbol{S} \hspace{6pt}\) The sum \(S(x,y)\) of two CPL functions \(F_1(x,y)\) and \(F_2(x,y)\) is CPL.
\(\boldsymbol{C} \hspace{6pt}\) If the one-variable functions \(y=F_1(x)\) and \(z=F_2(y)\) are CPL, so is the
\(\hspace{16pt}\)composition \(C(x)=z=F_2(F_1(x))\)
Problem 7 uses the blue ball, orange ring example on playground.tensorflow.org with one hidden layer and activation by ReLU (not Tanh). When learning succeeds, a white polygon separates blue from orange in the figure that follows.
7. Does learning succeed for \(N=4\)? What is the count \(r(N, 2)\) of flat pieces in \(F(\boldsymbol{x})\)? The white polygon shows where flat pieces in the graph of \(F(\boldsymbol{x})\) change sign as they go through the base plane \(z=0\). How many sides in the polygon?