1 00:00:16,313 --> 00:00:16,980 MICHALE FEE: OK. 2 00:00:16,980 --> 00:00:18,730 All right, let's go ahead and get started. 3 00:00:18,730 --> 00:00:21,690 OK, so we're going to continue talking 4 00:00:21,690 --> 00:00:25,380 about the topic of neural networks. 5 00:00:25,380 --> 00:00:28,620 Last time, we introduced a new framework 6 00:00:28,620 --> 00:00:33,330 for thinking about neural network interactions, 7 00:00:33,330 --> 00:00:37,740 using a rate model to describe the interactions of neurons 8 00:00:37,740 --> 00:00:41,370 and develop a mathematical framework for how 9 00:00:41,370 --> 00:00:43,080 to combine collections of neurons 10 00:00:43,080 --> 00:00:46,300 to study their behavior. 11 00:00:46,300 --> 00:00:50,760 So, last time, we introduced the notion of a perceptron 12 00:00:50,760 --> 00:00:54,180 as a way of building a neural network that 13 00:00:54,180 --> 00:00:57,510 can classify its inputs. 14 00:00:57,510 --> 00:01:03,300 And we started talking about the notion of a perceptron learning 15 00:01:03,300 --> 00:01:06,480 rule, and we're going to flesh that idea out 16 00:01:06,480 --> 00:01:08,580 in more detail today. 17 00:01:08,580 --> 00:01:12,450 We're going to then talk about the idea of using networks 18 00:01:12,450 --> 00:01:17,250 to perform logic with neurons. 19 00:01:17,250 --> 00:01:19,800 We're going to talk about the idea of linear separability 20 00:01:19,800 --> 00:01:21,480 and invariance. 21 00:01:21,480 --> 00:01:24,240 Then we're going to introduce more complex 22 00:01:24,240 --> 00:01:26,190 feed-forward networks, where instead 23 00:01:26,190 --> 00:01:28,260 of having a single output neuron, 24 00:01:28,260 --> 00:01:32,040 we have multiple output neurons. 25 00:01:32,040 --> 00:01:37,800 Then we're going to turn to a more fully developed view 26 00:01:37,800 --> 00:01:40,890 of the math that we use to describe neural networks, 27 00:01:40,890 --> 00:01:45,450 and matrix operations become extremely important 28 00:01:45,450 --> 00:01:50,330 in neural network theory. 29 00:01:50,330 --> 00:01:51,960 And then, finally, we're going to turn 30 00:01:51,960 --> 00:01:55,110 to some of the kinds of transformations that 31 00:01:55,110 --> 00:01:58,290 are performed by matrix multiplication 32 00:01:58,290 --> 00:02:03,080 and by the kinds of-- by feed-forward neural networks. 33 00:02:03,080 --> 00:02:08,160 OK, so we've been considering a kind of neural network called 34 00:02:08,160 --> 00:02:12,065 a rate model that uses firing rates rather than spike trains. 35 00:02:12,065 --> 00:02:13,440 So we introduced the idea that we 36 00:02:13,440 --> 00:02:16,560 have an output neuron with firing rate 37 00:02:16,560 --> 00:02:19,830 v that receives input from an input neuron that 38 00:02:19,830 --> 00:02:21,530 has firing rate u. 39 00:02:21,530 --> 00:02:24,270 The input neuron synapses onto the output neuron 40 00:02:24,270 --> 00:02:26,490 with a synapse of weight w. 41 00:02:26,490 --> 00:02:29,010 And we described how we can think 42 00:02:29,010 --> 00:02:34,110 of the input neuron producing a synaptic input into the output 43 00:02:34,110 --> 00:02:39,600 neuron that has a magnitude of the firing 44 00:02:39,600 --> 00:02:42,350 rate times the strength of the synaptic connection. 45 00:02:42,350 --> 00:02:48,550 So the input to the output neuron here is w times u. 
46 00:02:48,550 --> 00:02:53,050 And then we talked about how we can convert that input current, 47 00:02:53,050 --> 00:02:55,330 let's say, into our output neuron 48 00:02:55,330 --> 00:02:59,380 into a firing rate of the output neuron through some function 49 00:02:59,380 --> 00:03:05,050 f, which is what's called the F-I curve of the neuron that 50 00:03:05,050 --> 00:03:08,920 relates the input to the firing rate of the neuron. 51 00:03:08,920 --> 00:03:11,260 And we talked about several different kinds 52 00:03:11,260 --> 00:03:15,850 of F-I firing rate versus input functions that can be useful. 53 00:03:15,850 --> 00:03:20,950 We then extended our network from a single input neuron 54 00:03:20,950 --> 00:03:22,960 synapsing onto a single output neuron 55 00:03:22,960 --> 00:03:26,290 by having multiple input neurons. 56 00:03:26,290 --> 00:03:29,680 Again, the output neuron has a firing rate, 57 00:03:29,680 --> 00:03:34,090 and our input neurons have a vector of firing rates now-- 58 00:03:34,090 --> 00:03:37,800 u1, u2, u3, u4, and so on-- 59 00:03:37,800 --> 00:03:42,940 that we can combine together into a vector, u. 60 00:03:42,940 --> 00:03:47,180 Each one of those input neurons has a synaptic strength w 61 00:03:47,180 --> 00:03:48,470 onto our output neuron. 62 00:03:48,470 --> 00:03:51,580 So we have a vector of synaptic strengths. 63 00:03:51,580 --> 00:03:56,590 And now we can write down the input current to our output 64 00:03:56,590 --> 00:04:00,100 neuron as a sum of the contributions from each 65 00:04:00,100 --> 00:04:07,150 of those input neurons-- so w1 u1 plus w2 u2 plus w3 u3, 66 00:04:07,150 --> 00:04:08,980 and so on. 67 00:04:08,980 --> 00:04:12,100 So we can now write the input current 68 00:04:12,100 --> 00:04:16,810 to our output neuron as a sum of contributions 69 00:04:16,810 --> 00:04:18,850 that we can then write as a dot product-- 70 00:04:18,850 --> 00:04:21,540 w dot u. 71 00:04:21,540 --> 00:04:22,930 OK, any questions about that? 72 00:04:27,480 --> 00:04:30,570 And so, in general, we have the firing rate of our output 73 00:04:30,570 --> 00:04:32,970 neuron is just this F-I function, 74 00:04:32,970 --> 00:04:37,500 this input-output function of our output neuron acting 75 00:04:37,500 --> 00:04:41,023 on the total input, which is w dot u. 76 00:04:41,023 --> 00:04:42,690 And then we talked about different kinds 77 00:04:42,690 --> 00:04:46,770 of functions that are useful computationally 78 00:04:46,770 --> 00:04:47,820 for this function f. 79 00:04:47,820 --> 00:04:51,060 So in the context of the integrate and fire neuron, 80 00:04:51,060 --> 00:04:56,440 we talked about F-I curves that are zero below some threshold 81 00:04:56,440 --> 00:05:01,350 and then are linear above that threshold current. 82 00:05:01,350 --> 00:05:05,640 We talked last time about a binary threshold neuron 83 00:05:05,640 --> 00:05:08,140 that has zero firing rate below some threshold 84 00:05:08,140 --> 00:05:12,390 and then steps up abruptly to a constant output firing rate 85 00:05:12,390 --> 00:05:14,280 of one. 86 00:05:14,280 --> 00:05:16,560 And then we also introduced, last time, the notion 87 00:05:16,560 --> 00:05:19,050 of a linear neuron, whose firing rate is 88 00:05:19,050 --> 00:05:21,600 just proportional to the input current 89 00:05:21,600 --> 00:05:24,300 and has positive and negative firing rates. 
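To make that concrete, here is a minimal sketch (in Python/NumPy; the language, function names, and numbers are illustrative choices, not from the lecture) of the three F-I curves just described and of computing the output rate v = F(w · u) for a neuron with several inputs.

```python
import numpy as np

# Three F-I curves of the kind described above (illustrative implementations).
def threshold_linear(I, theta=1.0):
    # Zero below the threshold, linear above it.
    return np.maximum(I - theta, 0.0)

def binary_threshold(I, theta=1.0):
    # Zero below the threshold, steps abruptly to a firing rate of one above it.
    return (I > theta).astype(float)

def linear(I):
    # Firing rate proportional to input; can be negative (a useful simplification).
    return I

# One output neuron receiving a vector of input firing rates u through
# a vector of synaptic weights w: the total input current is the dot product.
u = np.array([2.0, 0.5, 1.0, 0.0])   # input firing rates u1..u4
w = np.array([0.5, 1.0, -0.3, 0.2])  # synaptic weights onto the output neuron

I_total = np.dot(w, u)               # input current = w . u  (here 1.2)
print(binary_threshold(I_total))     # output firing rate v = F(w . u) -> 1.0
```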
90 00:05:24,300 --> 00:05:26,450 And we talked about the idea that although it's 91 00:05:26,450 --> 00:05:28,860 biophysically implausible to have neurons 92 00:05:28,860 --> 00:05:31,650 that have negative firing rates, that this 93 00:05:31,650 --> 00:05:35,040 is a particularly useful simplification of neurons. 94 00:05:35,040 --> 00:05:39,990 Because we can just use linear algebra 95 00:05:39,990 --> 00:05:44,440 to describe the properties of networks of linear neurons. 96 00:05:44,440 --> 00:05:46,980 And we can do some really interesting things 97 00:05:46,980 --> 00:05:52,270 with that kind of mathematical simplification. 98 00:05:52,270 --> 00:05:54,870 We're going to get to some of that today. 99 00:05:54,870 --> 00:05:57,420 And that allows you to really build 100 00:05:57,420 --> 00:06:02,750 an intuition for what neural networks can do. 101 00:06:02,750 --> 00:06:08,570 OK, so let's come back to what a perceptron is and introduce 102 00:06:08,570 --> 00:06:11,820 this perceptron learning rule. 103 00:06:11,820 --> 00:06:14,690 So we talked about the idea that a perceptron carries out 104 00:06:14,690 --> 00:06:17,510 a classification of its inputs that 105 00:06:17,510 --> 00:06:18,860 represent different features. 106 00:06:18,860 --> 00:06:22,580 So we talked about classifying animals into dogs and non-dogs 107 00:06:22,580 --> 00:06:27,120 based on two features of animals. 108 00:06:27,120 --> 00:06:30,110 We talked about the fact that you 109 00:06:30,110 --> 00:06:34,160 can't make that classification between dogs and non-dogs 110 00:06:34,160 --> 00:06:36,350 just on the basis of one of those features, 111 00:06:36,350 --> 00:06:40,580 because these two categories overlap in this feature 112 00:06:40,580 --> 00:06:42,060 and in this feature. 113 00:06:42,060 --> 00:06:44,960 And so in order to properly separate those categories, 114 00:06:44,960 --> 00:06:47,780 you need a decision boundary that's 115 00:06:47,780 --> 00:06:52,280 actually a combination of those two features. 116 00:06:52,280 --> 00:06:54,290 And we talked about how you can implement 117 00:06:54,290 --> 00:06:57,790 that using a simple network, called 118 00:06:57,790 --> 00:07:02,570 a perceptron, that has an output neuron and two input neurons. 119 00:07:02,570 --> 00:07:06,320 Each one of those input neurons represents the magnitude 120 00:07:06,320 --> 00:07:10,070 of those two different features for each object 121 00:07:10,070 --> 00:07:13,220 that you're trying to classify. 122 00:07:13,220 --> 00:07:19,580 So u1 here and u2 are the dimensions on which we're 123 00:07:19,580 --> 00:07:24,100 performing this classification. 124 00:07:24,100 --> 00:07:28,840 And so we talked about the fact that that decision boundary 125 00:07:28,840 --> 00:07:31,990 between those two classifications 126 00:07:31,990 --> 00:07:35,470 is determined by this weight vector w. 127 00:07:35,470 --> 00:07:37,810 And then we used a binary threshold neuron 128 00:07:37,810 --> 00:07:39,700 for making the actual decision. 129 00:07:39,700 --> 00:07:42,370 Binary threshold neurons are great for making decisions, 130 00:07:42,370 --> 00:07:46,540 because unlike a linear neuron-- so a linear neuron just 131 00:07:46,540 --> 00:07:48,850 responds more if its input is larger, 132 00:07:48,850 --> 00:07:51,940 and it responds less if its input is smaller. 
133 00:07:51,940 --> 00:07:57,220 Binary threshold neurons have a very clear threshold 134 00:07:57,220 --> 00:07:59,380 below which the neuron doesn't spike 135 00:07:59,380 --> 00:08:01,480 and above which the neuron does spike. 136 00:08:01,480 --> 00:08:04,300 So, in this case, this network, this output neuron here, 137 00:08:04,300 --> 00:08:07,420 will fire, will have a firing rate of one, 138 00:08:07,420 --> 00:08:11,530 for any input that's on this side of the decision boundary 139 00:08:11,530 --> 00:08:13,510 and will have a firing rate of zero 140 00:08:13,510 --> 00:08:16,940 for any input that's on this side of the decision boundary, 141 00:08:16,940 --> 00:08:19,570 OK? 142 00:08:19,570 --> 00:08:24,560 All right, so we talked about how we can, in two dimensions, 143 00:08:24,560 --> 00:08:28,940 just write down a decision boundary that will separate, 144 00:08:28,940 --> 00:08:32,870 let's say, green objects from red objects. 145 00:08:32,870 --> 00:08:36,409 So you can see that if you sat down 146 00:08:36,409 --> 00:08:39,770 and you looked at this drawing of green dots and red dots, 147 00:08:39,770 --> 00:08:43,309 that it would be very simple to just look at that picture 148 00:08:43,309 --> 00:08:46,010 and see that if you put a decision boundary right 149 00:08:46,010 --> 00:08:49,910 there, that you would be able to separate the green dots 150 00:08:49,910 --> 00:08:51,350 from the red dots. 151 00:08:51,350 --> 00:08:54,470 How would you actually calculate the weight vector 152 00:08:54,470 --> 00:08:57,030 that that corresponds to in a perceptron? 153 00:08:57,030 --> 00:08:59,100 Well, it's very simple. 154 00:08:59,100 --> 00:09:02,300 You can just look at where that decision boundary crosses 155 00:09:02,300 --> 00:09:04,220 the axes-- 156 00:09:04,220 --> 00:09:07,190 so you can see here, that decision boundary crosses 157 00:09:07,190 --> 00:09:13,080 the u1 axis at point A, crosses the u2 axis at, I should say, 158 00:09:13,080 --> 00:09:17,840 a value of B. And then we can use those numbers to actually 159 00:09:17,840 --> 00:09:19,100 calculate the w. 160 00:09:19,100 --> 00:09:21,950 So, remember, u is the input space. 161 00:09:21,950 --> 00:09:24,230 w is a weight vector that we're trying 162 00:09:24,230 --> 00:09:27,020 to calculate in order to place the decision 163 00:09:27,020 --> 00:09:28,070 boundary at that point. 164 00:09:28,070 --> 00:09:32,380 Is that clear what we're trying to do here? 165 00:09:32,380 --> 00:09:35,220 OK, so we can calculate that weight vector. 166 00:09:35,220 --> 00:09:37,710 We assume that theta is just some number. 167 00:09:37,710 --> 00:09:39,840 Let's just call it one. 168 00:09:39,840 --> 00:09:44,760 We have an equation for a line-- w dot u equals theta. 169 00:09:44,760 --> 00:09:47,910 That's the equation for that decision boundary. 170 00:09:47,910 --> 00:09:52,080 We have two knowns, the two points on the decision boundary 171 00:09:52,080 --> 00:09:53,960 that we can just read off by eye. 172 00:09:53,960 --> 00:09:58,020 And we have two unknowns-- the synaptic weights, w1 and w2. 173 00:09:58,020 --> 00:10:00,510 And so we have two equations-- 174 00:10:00,510 --> 00:10:06,020 ua dot w equals theta, ub dot w equals theta. 175 00:10:06,020 --> 00:10:08,400 And we can just solve for w1 and w2, 176 00:10:08,400 --> 00:10:10,470 and that's what you get, OK? 177 00:10:10,470 --> 00:10:13,560 So the weight vector that gives you that decision boundary 178 00:10:13,560 --> 00:10:17,040 is 1 over a and 1 over b, OK? 
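As a sketch of that little calculation (with theta set to one, as in the lecture; the particular crossing points a and b below are made up): the boundary w · u = theta crosses the u1 axis at (a, 0) and the u2 axis at (0, b), which gives the two equations w1 a = theta and w2 b = theta, and hence w1 = 1/a and w2 = 1/b.

```python
import numpy as np

theta = 1.0       # threshold; the lecture just calls it "some number", here one
a, b = 2.0, 4.0   # hypothetical axis crossings of the decision boundary

# Two equations, two unknowns: w . (a, 0) = theta and w . (0, b) = theta.
A = np.array([[a, 0.0],
              [0.0, b]])
w = np.linalg.solve(A, np.array([theta, theta]))
print(w)   # -> [0.5, 0.25], i.e. [1/a, 1/b]

# Check: both crossing points lie exactly on the boundary w . u = theta.
print(np.dot(w, [a, 0.0]), np.dot(w, [0.0, b]))   # -> 1.0 1.0
```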
179 00:10:17,040 --> 00:10:18,480 Those are the two weights. 180 00:10:18,480 --> 00:10:21,700 Any questions about that? 181 00:10:21,700 --> 00:10:23,460 OK. 182 00:10:23,460 --> 00:10:27,630 So in two dimensions, that's very easy to do, right? 183 00:10:27,630 --> 00:10:31,350 You can just look at that cloud of points, 184 00:10:31,350 --> 00:10:34,590 decide where to draw a line that best separates the two 185 00:10:34,590 --> 00:10:37,230 categories that you're interested in separating. 186 00:10:37,230 --> 00:10:40,870 But in higher dimensions, that's really hard. 187 00:10:40,870 --> 00:10:44,250 So in high dimensions, for example, 188 00:10:44,250 --> 00:10:47,720 we're trying to separate images, for example. 189 00:10:47,720 --> 00:10:49,980 So we can have a bunch of images of dogs, 190 00:10:49,980 --> 00:10:51,870 a bunch of images of cats. 191 00:10:51,870 --> 00:10:54,030 Each pixel in that image corresponds 192 00:10:54,030 --> 00:10:56,910 to a different input to our classification unit. 193 00:10:56,910 --> 00:11:00,960 And now how do you decide what all of those weights 194 00:11:00,960 --> 00:11:03,180 should be from all of those different pixels 195 00:11:03,180 --> 00:11:08,760 onto our output neuron that separates images of one class 196 00:11:08,760 --> 00:11:10,720 from images of another class? 197 00:11:10,720 --> 00:11:14,640 So there's just no way to do that by eye in high dimensions. 198 00:11:14,640 --> 00:11:17,460 So you need an algorithm that helps 199 00:11:17,460 --> 00:11:20,130 you choose that set of weights that allows you 200 00:11:20,130 --> 00:11:22,840 to separate different classes-- 201 00:11:22,840 --> 00:11:25,740 you know, a bunch of images of one class from a bunch 202 00:11:25,740 --> 00:11:28,500 of images of another class. 203 00:11:28,500 --> 00:11:33,540 And so we're going to introduce a method called 204 00:11:33,540 --> 00:11:40,710 the perceptron learning rule that is a category of learning 205 00:11:40,710 --> 00:11:47,910 rules called supervised learning rules that allow you to take 206 00:11:47,910 --> 00:11:51,660 a bunch of objects that you know-- so if you 207 00:11:51,660 --> 00:11:53,160 have a bunch of pictures of dogs, 208 00:11:53,160 --> 00:11:54,385 you know that they're dogs. 209 00:11:54,385 --> 00:11:57,010 If you have a bunch of pictures of cats, you know they're cats. 210 00:11:57,010 --> 00:11:58,920 So you label those images. 211 00:11:58,920 --> 00:12:03,780 You feed those inputs, those images, into your network, 212 00:12:03,780 --> 00:12:06,870 and you tell the network what the answer was. 213 00:12:06,870 --> 00:12:09,420 And through an iterative process, 214 00:12:09,420 --> 00:12:13,410 it finds all of the weights that optimally separate those two 215 00:12:13,410 --> 00:12:14,740 different categories. 216 00:12:14,740 --> 00:12:16,800 So that's called the perceptron learning rule. 217 00:12:16,800 --> 00:12:19,240 So let me just set up how that actually works. 218 00:12:19,240 --> 00:12:22,690 So you have a bunch of observations of the input. 
219 00:12:22,690 --> 00:12:25,960 So in this case, I'm drawing these in two dimensions, 220 00:12:25,960 --> 00:12:28,560 but you should think about each one of these dots as being, 221 00:12:28,560 --> 00:12:32,520 let's say, an image of a dog in very high dimensions, 222 00:12:32,520 --> 00:12:37,920 where instead of just u1 and u2, you have u1 through u1000, 223 00:12:37,920 --> 00:12:41,280 where each one of those is the value of a different pixel 224 00:12:41,280 --> 00:12:44,190 in your image. 225 00:12:44,190 --> 00:12:46,170 So you have a bunch of images. 226 00:12:46,170 --> 00:12:50,220 Each one of those corresponds to an image of a dog. 227 00:12:50,220 --> 00:12:53,610 Each one of those corresponds to an image of a cat. 228 00:12:53,610 --> 00:12:56,280 And we have a whole bunch of different observations 229 00:12:56,280 --> 00:12:59,610 or images of those different categories. 230 00:12:59,610 --> 00:13:00,720 Any questions about that? 231 00:13:03,800 --> 00:13:06,840 All right, so we have n of those observations. 232 00:13:06,840 --> 00:13:08,880 And for each one of those observations, 233 00:13:08,880 --> 00:13:12,735 we say that the input is equal to one 234 00:13:12,735 --> 00:13:15,930 of those observations for one iteration of this learning 235 00:13:15,930 --> 00:13:17,410 process, OK? 236 00:13:17,410 --> 00:13:19,860 And so with each observation, we're 237 00:13:19,860 --> 00:13:21,810 told whether this input corresponds 238 00:13:21,810 --> 00:13:25,740 to one category or another, so a dog or a non-dog. 239 00:13:25,740 --> 00:13:27,960 And our output, we're asking-- 240 00:13:27,960 --> 00:13:30,240 we want to choose this set of weights 241 00:13:30,240 --> 00:13:32,640 such that the output of our network 242 00:13:32,640 --> 00:13:37,680 is equal to some known value. 243 00:13:37,680 --> 00:13:43,410 So t sub i, where if it's a dog, then the answer is one for yes. 244 00:13:43,410 --> 00:13:48,450 If it's a non-dog, the answer is zero for no, that's not a dog. 245 00:13:48,450 --> 00:13:52,050 And we have n of those answers. 246 00:13:52,050 --> 00:13:56,760 We have n images and labels that tell us what category 247 00:13:56,760 --> 00:13:59,400 that image belongs to. 248 00:13:59,400 --> 00:14:01,380 So for all of these, t equals one. 249 00:14:01,380 --> 00:14:03,300 For all of these, t equals zero. 250 00:14:03,300 --> 00:14:05,400 And we want to find a set of weights 251 00:14:05,400 --> 00:14:10,020 such that when we take the dot product of that weight vector 252 00:14:10,020 --> 00:14:17,970 with each one of those observations minus theta 253 00:14:17,970 --> 00:14:23,340 that we get an answer that is equal to t 254 00:14:23,340 --> 00:14:25,830 for each observation. 255 00:14:25,830 --> 00:14:28,360 Does that make sense? 256 00:14:28,360 --> 00:14:31,240 So how do we do that? 257 00:14:31,240 --> 00:14:37,240 All right, so each observation, we have two things-- 258 00:14:37,240 --> 00:14:41,450 the input and the desired output. 259 00:14:41,450 --> 00:14:43,150 And that gives us information that we 260 00:14:43,150 --> 00:14:45,920 can use to construct this weight vector. 261 00:14:45,920 --> 00:14:48,110 So, again, that's called supervised learning. 
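A minimal sketch of that supervised setup (everything here, from the toy numbers to the function names, is invented for illustration): n labeled observations u_i with targets t_i of one or zero, and a check of whether a candidate weight vector and threshold reproduce every label.

```python
import numpy as np

# Toy labeled data set: each row is one observation (here 2-D, but it could
# just as well be thousands of pixel values per image).
U = np.array([[2.0, 2.5],   # "dog" examples, label t = 1
              [2.5, 2.0],
              [0.5, 0.8],   # "non-dog" examples, label t = 0
              [0.8, 0.3]])
t = np.array([1, 1, 0, 0])

def classify(w, u, theta=1.0):
    # Binary threshold output neuron: 1 if w . u exceeds theta, otherwise 0.
    return int(np.dot(w, u) > theta)

w_candidate = np.array([0.4, 0.4])
outputs = np.array([classify(w_candidate, u) for u in U])
print(outputs, (outputs == t).all())   # does this w reproduce every label t_i?
```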
262 00:14:48,110 --> 00:14:52,300 And we're going to use an update rule, or a learning rule, 263 00:14:52,300 --> 00:14:54,490 that allows us to change the weight 264 00:14:54,490 --> 00:14:58,180 vector as a result of each estimate, 265 00:14:58,180 --> 00:15:01,030 depending on whether we got the answer right or not. 266 00:15:01,030 --> 00:15:02,370 So how do we do this? 267 00:15:02,370 --> 00:15:03,912 What we're going to do is we're going 268 00:15:03,912 --> 00:15:08,110 to start with a random set of weights, w1 and w2, OK? 269 00:15:08,110 --> 00:15:11,580 And we're going to put in an input. 270 00:15:11,580 --> 00:15:13,255 So there's a space of inputs. 271 00:15:13,255 --> 00:15:15,130 We're going to start with some random weight, 272 00:15:15,130 --> 00:15:18,230 and I started with some random vector in this direction. 273 00:15:18,230 --> 00:15:21,920 You can see that that gives you a classification boundary here. 274 00:15:21,920 --> 00:15:24,340 And you can see that that classification boundary is not 275 00:15:24,340 --> 00:15:27,290 very good for separating the green dots from the red dots. 276 00:15:27,290 --> 00:15:27,790 Why? 277 00:15:27,790 --> 00:15:31,060 Because it will assign a one to everything 278 00:15:31,060 --> 00:15:33,580 on this side of that decision boundary and a zero 279 00:15:33,580 --> 00:15:35,103 to everything on that side. 280 00:15:35,103 --> 00:15:36,520 But you can see that that does not 281 00:15:36,520 --> 00:15:39,250 correspond to the assignment of green and red 282 00:15:39,250 --> 00:15:41,200 to each of those dots, OK? 283 00:15:41,200 --> 00:15:47,523 So how do we update that w in order to get the right answer? 284 00:15:47,523 --> 00:15:48,940 So what we're going to do is we're 285 00:15:48,940 --> 00:15:53,710 going to put in one of these inputs on each iteration 286 00:15:53,710 --> 00:15:57,520 and ask whether the network got the answer right or not. 287 00:15:57,520 --> 00:16:02,610 So we're going to put in one of those inputs. 288 00:16:02,610 --> 00:16:05,140 So let's pick that input right there. 289 00:16:05,140 --> 00:16:07,190 We're going to put that into our network. 290 00:16:07,190 --> 00:16:09,730 And we see that the answer we get from the network 291 00:16:09,730 --> 00:16:14,770 is one, because it's on the positive side of the decision 292 00:16:14,770 --> 00:16:15,560 boundary. 293 00:16:15,560 --> 00:16:19,060 And so one was the right answer in this case. 294 00:16:19,060 --> 00:16:19,840 So what do we do? 295 00:16:19,840 --> 00:16:20,890 We don't do anything. 296 00:16:20,890 --> 00:16:25,270 We say the change in weight is going to be zero if we already 297 00:16:25,270 --> 00:16:26,940 get the right answer. 298 00:16:26,940 --> 00:16:29,560 So if we got lucky and our initial weight vector 299 00:16:29,560 --> 00:16:32,260 was in the right direction, so our perceptron 300 00:16:32,260 --> 00:16:34,398 already classified the answer, then 301 00:16:34,398 --> 00:16:36,190 the weight vector is never going to change, 302 00:16:36,190 --> 00:16:39,400 because it was already the right answer. 303 00:16:39,400 --> 00:16:41,690 OK, so let's put it in another input-- 304 00:16:41,690 --> 00:16:42,580 a red input. 305 00:16:42,580 --> 00:16:45,970 You can see that the correct answer is a zero. 306 00:16:45,970 --> 00:16:47,950 The network gave us a zero, because it's 307 00:16:47,950 --> 00:16:53,380 on the negative side of the weight vector of the decision 308 00:16:53,380 --> 00:16:54,380 boundary. 
309 00:16:54,380 --> 00:16:56,530 And so, again, delta w is zero. 310 00:16:56,530 --> 00:16:58,780 But let's put in another input now such 311 00:16:58,780 --> 00:17:01,420 that we get the wrong answer. 312 00:17:01,420 --> 00:17:03,580 So let's put in this input right here. 313 00:17:03,580 --> 00:17:06,339 So you can see that the answer here, the correct answer 314 00:17:06,339 --> 00:17:12,339 is one, but the network is going to give us a zero. 315 00:17:12,339 --> 00:17:16,470 So what do we do to update that weight vector? 316 00:17:16,470 --> 00:17:19,329 So if the output is not equal to the correct answer, 317 00:17:19,329 --> 00:17:20,150 then we're wrong. 318 00:17:20,150 --> 00:17:22,000 So now we update w. 319 00:17:22,000 --> 00:17:26,140 And the perceptron learning rule is very simple. 320 00:17:26,140 --> 00:17:30,770 We introduce a change in w that looks like this. 321 00:17:30,770 --> 00:17:35,620 It's a little change, so eta is a learning rate. 322 00:17:35,620 --> 00:17:39,250 It's generally going to be smaller than one. 323 00:17:39,250 --> 00:17:43,510 So we're going to put in a small change in w that's 324 00:17:43,510 --> 00:17:47,440 in the direction of the input that was wrong 325 00:17:47,440 --> 00:17:51,580 if the correct answer is a one. 326 00:17:51,580 --> 00:17:53,800 We're going to make a small change 327 00:17:53,800 --> 00:17:57,910 to w in the opposite direction of that input 328 00:17:57,910 --> 00:18:00,940 if the correct answer was zero. 329 00:18:00,940 --> 00:18:02,120 Does that make sense? 330 00:18:02,120 --> 00:18:06,430 So we're going to change w in a way that 331 00:18:06,430 --> 00:18:11,930 depends on what the input was and what 332 00:18:11,930 --> 00:18:13,550 the correct answer was. 333 00:18:16,970 --> 00:18:18,200 So let's walk through this. 334 00:18:18,200 --> 00:18:21,200 So we put it in an input here. 335 00:18:21,200 --> 00:18:25,130 The correct answer is a one, and we got the answer wrong. 336 00:18:25,130 --> 00:18:28,400 The network gave us a zero, but the correct answer is a one. 337 00:18:28,400 --> 00:18:31,880 So we're in this region here. 338 00:18:31,880 --> 00:18:35,090 The answer was incorrect, so we're going to update w. 339 00:18:35,090 --> 00:18:38,300 The correct answer was a one, so we're going to change delta-- 340 00:18:38,300 --> 00:18:42,760 we're going to change w in the direction of that input. 341 00:18:42,760 --> 00:18:43,760 So that input is there. 342 00:18:43,760 --> 00:18:50,530 So we're going to add a little bit to w in this direction. 343 00:18:50,530 --> 00:18:53,970 So if we add that little bit of vector to the w, 344 00:18:53,970 --> 00:18:58,280 it's going to move the w vector in this direction, right? 345 00:18:58,280 --> 00:18:59,590 So let's do that. 346 00:18:59,590 --> 00:19:02,160 So there's our new w. 347 00:19:02,160 --> 00:19:05,310 Our new w is the old w plus delta w, 348 00:19:05,310 --> 00:19:10,200 which is in the direction of this incorrectly 349 00:19:10,200 --> 00:19:11,880 classified input. 350 00:19:11,880 --> 00:19:16,470 So there's our new decision boundary, all right? 351 00:19:16,470 --> 00:19:18,340 And let's put in another input-- 352 00:19:18,340 --> 00:19:20,490 let's say this one right here. 353 00:19:20,490 --> 00:19:23,610 You can see that this input is also incorrectly classified, 354 00:19:23,610 --> 00:19:25,530 because the correct answer is a zero. 355 00:19:25,530 --> 00:19:28,170 It's a red dot. 
356 00:19:28,170 --> 00:19:30,800 But it's on the positive side 357 00:19:30,800 --> 00:19:32,310 of the decision boundary. 358 00:19:32,310 --> 00:19:34,980 So the network classifies it as a one. 359 00:19:34,980 --> 00:19:35,480 OK, good. 360 00:19:35,480 --> 00:19:39,050 So the network classified it as a one and the correct answer 361 00:19:39,050 --> 00:19:40,580 was a zero, so we were wrong. 362 00:19:40,580 --> 00:19:42,650 So we're going to update w, and we're 363 00:19:42,650 --> 00:19:47,060 going to update it in the opposite direction of the input 364 00:19:47,060 --> 00:19:49,880 if the correct answer was zero, which is the case. 365 00:19:49,880 --> 00:19:53,360 So we're going to update w. 366 00:19:53,360 --> 00:19:56,000 And that's the input xi. 367 00:19:56,000 --> 00:19:59,310 Minus xi is in this direction. 368 00:19:59,310 --> 00:20:02,540 So we're going to update w in that direction. 369 00:20:02,540 --> 00:20:06,530 So we're going to add those two vectors to get our new w. 370 00:20:06,530 --> 00:20:09,430 And when we do that, that's what we get. 371 00:20:09,430 --> 00:20:10,730 There's our new w. 372 00:20:10,730 --> 00:20:12,360 There's our new decision boundary. 373 00:20:12,360 --> 00:20:15,200 And you can see that that decision boundary is now 374 00:20:15,200 --> 00:20:22,160 perfectly oriented to separate the red and the green dots. 375 00:20:22,160 --> 00:20:26,060 So that's Rosenblatt's perceptron learning rule. 376 00:20:26,060 --> 00:20:27,156 Yes, Rebecca? 377 00:20:27,156 --> 00:20:29,100 AUDIENCE: How do you change the learning rate? 378 00:20:29,100 --> 00:20:30,308 Because what if it's too big? 379 00:20:30,308 --> 00:20:33,067 You'll sort of get not helpful [INAUDIBLE].. 380 00:20:33,067 --> 00:20:34,400 MICHALE FEE: Yeah, that's right. 381 00:20:34,400 --> 00:20:36,080 So if the learning rate were too big, 382 00:20:36,080 --> 00:20:38,460 you could see this first correction. 383 00:20:38,460 --> 00:20:41,930 So let's say that we corrected w but made a correction that 384 00:20:41,930 --> 00:20:44,160 was too far in this direction. 385 00:20:44,160 --> 00:20:48,350 So now the new w would point up here. 386 00:20:48,350 --> 00:20:50,640 And that would give us, again, the wrong answer. 387 00:20:50,640 --> 00:20:53,180 What happens, generally, is that if your learning 388 00:20:53,180 --> 00:20:59,810 rate is too high, then your weight vector bounces around. 389 00:20:59,810 --> 00:21:01,790 It oscillates around. 390 00:21:01,790 --> 00:21:04,130 So it'll jump too far this way, and then 391 00:21:04,130 --> 00:21:06,530 it'll get an error over here, and it'll 392 00:21:06,530 --> 00:21:07,670 jump too far that way. 393 00:21:07,670 --> 00:21:09,337 And then you'll get an error over there, 394 00:21:09,337 --> 00:21:11,330 and it'll just keep bouncing back and forth. 395 00:21:11,330 --> 00:21:13,460 So you generally choose learning rates 396 00:21:13,460 --> 00:21:16,190 that-- the process of choosing learning rates 397 00:21:16,190 --> 00:21:18,500 can be a little tricky. Basically, 398 00:21:18,500 --> 00:21:21,920 the answer is to start small and increase it until it breaks. 399 00:21:26,780 --> 00:21:28,210 OK, any questions about that? 
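Putting the whole procedure together, here is a minimal sketch of the perceptron learning rule (the toy data, the learning rate of 0.1, and the number of passes are my own choices for illustration): if the network's answer matches the label, leave w alone; if the label was one, nudge w toward that input; if the label was zero, nudge w away from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: "green" dots (label 1) in one cloud, "red" dots (label 0) in another.
U = np.vstack([rng.normal([2.5, 2.5], 0.2, size=(20, 2)),
               rng.normal([0.5, 0.5], 0.2, size=(20, 2))])
t = np.array([1] * 20 + [0] * 20)

theta = 1.0
eta = 0.1                       # learning rate: too large and w bounces around
w = rng.normal(size=2)          # start with a random weight vector

for sweep in range(100):        # repeated passes through the observations
    errors = 0
    for u_i, t_i in zip(U, t):
        v = int(np.dot(w, u_i) > theta)        # the network's answer
        if v != t_i:                           # wrong answer: update w
            # move toward the input if the label was 1, away from it if it was 0
            w = w + (eta * u_i if t_i == 1 else -eta * u_i)
            errors += 1
    if errors == 0:             # every observation classified correctly
        break

print(w, sweep)
```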
400 00:21:31,500 --> 00:21:36,430 So you can see it's a very simple algorithm that 401 00:21:36,430 --> 00:21:40,750 provides a way of changing w that is guaranteed to converge 402 00:21:40,750 --> 00:21:45,400 toward the best answer in separating these two 403 00:21:45,400 --> 00:21:46,360 classes of inputs. 404 00:21:52,270 --> 00:21:55,780 All right, so let's go a little bit further 405 00:21:55,780 --> 00:21:59,770 into single layer binary networks 406 00:21:59,770 --> 00:22:02,350 and see what they can do. 407 00:22:02,350 --> 00:22:06,100 So these kinds of networks are very good for actually 408 00:22:06,100 --> 00:22:08,090 implementing logic operations. 409 00:22:08,090 --> 00:22:10,990 So you can see that-- let's say that we have a perceptron that 410 00:22:10,990 --> 00:22:12,110 looks like this. 411 00:22:12,110 --> 00:22:17,210 Let's give it a threshold of 0.5 and give it 412 00:22:17,210 --> 00:22:20,870 a weight vector that's 1 and 1. 413 00:22:20,870 --> 00:22:24,710 So you can see that this perceptron 414 00:22:24,710 --> 00:22:26,740 gives an answer of zero. 415 00:22:26,740 --> 00:22:29,000 The output neuron has zero firing rate 416 00:22:29,000 --> 00:22:32,320 for an input that's zero. 417 00:22:32,320 --> 00:22:38,010 But any input that's on the other side of the decision 418 00:22:38,010 --> 00:22:41,640 boundary produces an output firing rate of one. 419 00:22:41,640 --> 00:22:50,250 What that means is that if the input a, or u1, is a 1, 420 00:22:50,250 --> 00:22:54,330 0, then the output neuron will fire. 421 00:22:54,330 --> 00:22:57,720 If the input is 0, 1, the output neuron will fire. 422 00:22:57,720 --> 00:23:01,200 And if the input is 1, 1, the output neuron will fire. 423 00:23:01,200 --> 00:23:07,610 So, basically, any input above some threshold 424 00:23:07,610 --> 00:23:09,320 will make the output neuron fire. 425 00:23:09,320 --> 00:23:13,600 So this perceptron implements an OR gate. 426 00:23:13,600 --> 00:23:18,080 If it's input a or input b, the output neuron 427 00:23:18,080 --> 00:23:22,330 spikes, as long as those inputs are above some threshold value. 428 00:23:22,330 --> 00:23:25,280 So that's very much like a logical OR gate. 429 00:23:28,130 --> 00:23:30,200 Now let's see if we can implement an AND gate. 430 00:23:30,200 --> 00:23:33,340 So it turns out that implementing an AND gate 431 00:23:33,340 --> 00:23:35,380 is almost exactly like an OR gate. 432 00:23:35,380 --> 00:23:40,420 We just need-- what would we change about this network 433 00:23:40,420 --> 00:23:42,182 to implement an AND gate? 434 00:23:42,182 --> 00:23:43,600 AUDIENCE: A larger [INAUDIBLE]. 435 00:23:43,600 --> 00:23:44,642 MICHALE FEE: What's that? 436 00:23:44,642 --> 00:23:45,760 AUDIENCE: A larger theta? 437 00:23:45,760 --> 00:23:47,290 MICHALE FEE: Yeah, a larger theta. 438 00:23:47,290 --> 00:23:52,670 So all we have to do is move this line up to here. 439 00:23:52,670 --> 00:23:55,250 And now one of those inputs is not 440 00:23:55,250 --> 00:23:57,830 enough to make the output neuron fire. 441 00:23:57,830 --> 00:24:00,620 The other input is not enough to make the output neuron fire. 442 00:24:00,620 --> 00:24:02,510 Only when you have both. 443 00:24:02,510 --> 00:24:04,520 So that implements an AND gate. 444 00:24:04,520 --> 00:24:09,075 We just increase the threshold a little bit. 445 00:24:09,075 --> 00:24:09,950 Does that make sense? 446 00:24:09,950 --> 00:24:12,890 So we just increase the threshold here to 1.5. 
447 00:24:12,890 --> 00:24:17,870 And now when either input is on at a value of one, 448 00:24:17,870 --> 00:24:20,840 that's not enough to make the output neuron fire. 449 00:24:20,840 --> 00:24:22,670 If this input's on, it's not enough. 450 00:24:22,670 --> 00:24:25,790 If that input is on, it's not enough. 451 00:24:25,790 --> 00:24:29,270 Only when both inputs are on do you get enough input 452 00:24:29,270 --> 00:24:33,010 to this output neuron to make it have a non-zero firing rate, 453 00:24:33,010 --> 00:24:37,190 to get it above threshold. 454 00:24:37,190 --> 00:24:42,080 Now, there's another very common logic operation that cannot be 455 00:24:42,080 --> 00:24:47,010 solved by a simple perceptron. 456 00:24:47,010 --> 00:24:51,680 That's called an exclusive OR, where 457 00:24:51,680 --> 00:24:55,100 this neuron, this network, we want 458 00:24:55,100 --> 00:25:05,890 it to fire only if input a is on or input b is on, but not both. 459 00:25:05,890 --> 00:25:08,830 Why is it that that can't be solved 460 00:25:08,830 --> 00:25:12,010 by the kind of perceptron that we've been describing? 461 00:25:12,010 --> 00:25:14,830 Anybody have some intuition about that? 462 00:25:20,022 --> 00:25:23,240 AUDIENCE: I mean, it's obviously [INAUDIBLE] separable. 463 00:25:23,240 --> 00:25:24,680 MICHALE FEE: Yeah, that's right. 464 00:25:24,680 --> 00:25:27,320 The keyword there is separable. 465 00:25:27,320 --> 00:25:33,210 If you look at this set of dots, there's no single line, 466 00:25:33,210 --> 00:25:38,060 there's no single boundary that separates all the red dots 467 00:25:38,060 --> 00:25:40,940 from all of the green dots, OK? 468 00:25:40,940 --> 00:25:44,380 And so that set of inputs is called non-separable. 469 00:25:44,380 --> 00:25:52,700 And sets of inputs that are not separable cannot be classified 470 00:25:52,700 --> 00:25:58,160 correctly by a simple perceptron of the type we've been talking 471 00:25:58,160 --> 00:25:59,340 about. 472 00:25:59,340 --> 00:26:00,840 So how do you solve that problem? 473 00:26:00,840 --> 00:26:06,132 So this is a set of inputs that's non-separable. 474 00:26:06,132 --> 00:26:08,090 You can see that you can solve this problem now 475 00:26:08,090 --> 00:26:11,310 if you have two separate perceptrons. 476 00:26:11,310 --> 00:26:12,420 So watch this. 477 00:26:12,420 --> 00:26:15,410 We can build one perceptron that fires, 478 00:26:15,410 --> 00:26:21,590 that has a positive output when this input is on. 479 00:26:21,590 --> 00:26:24,170 We can have a separate perceptron that is active 480 00:26:24,170 --> 00:26:29,300 when that input is on. 481 00:26:29,300 --> 00:26:32,270 And then what would we do? 482 00:26:32,270 --> 00:26:34,040 If we had one neuron that's active 483 00:26:34,040 --> 00:26:35,990 when this input is on and another neuron 484 00:26:35,990 --> 00:26:37,760 that's active when that input is on? 485 00:26:40,610 --> 00:26:43,260 We would OR them together, that's right. 486 00:26:43,260 --> 00:26:47,010 So this is what's known as a multi-layer perceptron. 487 00:26:47,010 --> 00:26:50,040 We have two inputs, one that represents activity 488 00:26:50,040 --> 00:26:53,980 in a, another that represents activity in b. 489 00:26:53,980 --> 00:26:57,840 And we have one neuron in what's called 490 00:26:57,840 --> 00:27:00,840 the intermediate layer of our perceptron 491 00:27:00,840 --> 00:27:04,930 that has a weight vector of 1 minus 1. 
492 00:27:04,930 --> 00:27:09,270 What that means is this neuron will be active if input a is 493 00:27:09,270 --> 00:27:14,750 on but not input b. 494 00:27:14,750 --> 00:27:16,880 This one will be active. 495 00:27:16,880 --> 00:27:20,576 This neuron has a different weight vector-- minus 1, 1. 496 00:27:20,576 --> 00:27:27,770 This neuron will be active if input b is on but not input a. 497 00:27:30,512 --> 00:27:34,120 And the output neuron implements an OR operation 498 00:27:34,120 --> 00:27:39,010 that will be active when this intermediate neuron is on 499 00:27:39,010 --> 00:27:42,820 or that intermediate neuron is on, OK? 500 00:27:42,820 --> 00:27:47,220 And so that network altogether implements this exclusive OR 501 00:27:47,220 --> 00:27:48,550 function. 502 00:27:48,550 --> 00:27:50,030 Does that make sense? 503 00:27:50,030 --> 00:27:51,120 Any questions about that? 504 00:27:56,690 --> 00:27:59,030 So this problem of separability is 505 00:27:59,030 --> 00:28:05,820 extremely important in classifying inputs in general. 506 00:28:05,820 --> 00:28:11,420 So if you think about classifying an image, 507 00:28:11,420 --> 00:28:14,840 like a number or a letter, you can 508 00:28:14,840 --> 00:28:21,430 see that in high-dimensional space, images 509 00:28:21,430 --> 00:28:28,590 that are all threes, let's say, are all 510 00:28:28,590 --> 00:28:30,030 very similar to each other. 511 00:28:30,030 --> 00:28:34,000 But they're actually not separable in this linear space. 512 00:28:34,000 --> 00:28:36,900 And that's because in the high dimensional space 513 00:28:36,900 --> 00:28:40,920 they exist on what's called a manifold 514 00:28:40,920 --> 00:28:43,930 in this high-dimensional space, OK? 515 00:28:43,930 --> 00:28:48,180 They're like all lined up on some sheet, OK? 516 00:28:48,180 --> 00:28:51,540 So this is an example of rotations, 517 00:28:51,540 --> 00:28:54,930 and you can see that all these different threes kind of sit 518 00:28:54,930 --> 00:28:59,160 along a manifold in this high-dimensional space that 519 00:28:59,160 --> 00:29:01,605 are separate from all the other numbers. 520 00:29:06,280 --> 00:29:08,310 So all those numbers exist on what's 521 00:29:08,310 --> 00:29:13,110 called an invariant transformation, OK? 522 00:29:13,110 --> 00:29:16,600 Now, how would we separate those images 523 00:29:16,600 --> 00:29:22,060 of threes from all the other numbers or letters? 524 00:29:22,060 --> 00:29:23,570 How would we do that? 525 00:29:23,570 --> 00:29:30,035 Well, we could imagine building a multi-layer perceptron that-- 526 00:29:30,035 --> 00:29:31,410 so here, I'm showing that there's 527 00:29:31,410 --> 00:29:35,040 no single line that separates the threes on this manifold 528 00:29:35,040 --> 00:29:38,130 from all the other digits over here. 529 00:29:38,130 --> 00:29:40,650 We can solve that problem by implementing 530 00:29:40,650 --> 00:29:45,090 a multi-layer perceptron where one of those perceptrons 531 00:29:45,090 --> 00:29:49,140 detects these objects, another perceptron detects 532 00:29:49,140 --> 00:29:53,400 those objects, and then we can OR those all together. 533 00:29:53,400 --> 00:29:58,380 So that's a kind of network that can now 534 00:29:58,380 --> 00:30:03,990 detect all of these threes and separate them from non-threes. 535 00:30:03,990 --> 00:30:06,240 Does that make sense? 
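Here is a minimal sketch of the logic gates described above, using binary threshold units (the OR and AND weights and thresholds are the ones given in the lecture; the intermediate-layer thresholds of 0.5 for the exclusive-OR network are my own assumption).

```python
import numpy as np

def unit(w, theta):
    # A binary threshold neuron: output 1 when w . u exceeds theta, else 0.
    return lambda u: int(np.dot(w, np.asarray(u, dtype=float)) > theta)

OR  = unit([1, 1], 0.5)    # either input alone pushes the neuron over threshold
AND = unit([1, 1], 1.5)    # only both inputs together cross the threshold

def XOR(u):
    # Multi-layer perceptron: one intermediate unit detects "a but not b",
    # the other detects "b but not a", and the output neuron ORs them together.
    h1 = unit([1, -1], 0.5)(u)
    h2 = unit([-1, 1], 0.5)(u)
    return OR([h1, h2])

for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, OR(u), AND(u), XOR(u))
```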
536 00:30:06,240 --> 00:30:10,520 So we can think of objects that we recognize, like this three 537 00:30:10,520 --> 00:30:12,980 that we recognize, even though it has different-- 538 00:30:12,980 --> 00:30:15,110 we can recognize it with different rotations 539 00:30:15,110 --> 00:30:20,730 or transformations or scale changes. 540 00:30:20,730 --> 00:30:23,750 You can also think of the problem of separating images 541 00:30:23,750 --> 00:30:28,250 of dogs and cats as also solving this problem, 542 00:30:28,250 --> 00:30:32,450 that the space of dogs, of dog images, 543 00:30:32,450 --> 00:30:36,680 somehow lives on a manifold in the high dimensional space 544 00:30:36,680 --> 00:30:39,260 of inputs that we can distinguish 545 00:30:39,260 --> 00:30:43,070 from the set of images of cats that's 546 00:30:43,070 --> 00:30:48,570 some other manifold in this high-dimensional space. 547 00:30:48,570 --> 00:30:53,790 So it turns out that you need more than just a single layer 548 00:30:53,790 --> 00:30:54,450 perceptron. 549 00:30:54,450 --> 00:30:57,900 You need more than just a two-layer perceptron. 550 00:30:57,900 --> 00:30:59,820 In general, the kinds of networks 551 00:30:59,820 --> 00:31:02,790 that are good for separating different kinds of images, 552 00:31:02,790 --> 00:31:06,240 like dogs and cats and cars and houses and faces, 553 00:31:06,240 --> 00:31:07,890 look more like this. 554 00:31:07,890 --> 00:31:11,250 So this is work from Jim DiCarlo's lab, 555 00:31:11,250 --> 00:31:16,770 where they found evidence that networks in the brain that do 556 00:31:16,770 --> 00:31:18,720 image classification-- for example, 557 00:31:18,720 --> 00:31:21,520 in the visual pathway-- 558 00:31:21,520 --> 00:31:25,690 look a lot like very deep neural networks, where 559 00:31:25,690 --> 00:31:31,420 you have the retina on the left side here sending inputs 560 00:31:31,420 --> 00:31:33,395 to another layer in the thalamus, 561 00:31:33,395 --> 00:31:40,300 sending inputs to v1, to v2, to v4, and so on, up to IT. 562 00:31:40,300 --> 00:31:43,480 And that we can think of this as being, 563 00:31:43,480 --> 00:31:48,100 essentially, many stacked layers of perceptrons 564 00:31:48,100 --> 00:31:52,150 that sort of unravel these manifolds 565 00:31:52,150 --> 00:31:54,550 in this high-dimensional space to allow 566 00:31:54,550 --> 00:31:59,380 neurons here at the very end to separate dogs 567 00:31:59,380 --> 00:32:02,065 from cats from buildings from faces. 568 00:32:04,720 --> 00:32:06,640 And there are learning rules that 569 00:32:06,640 --> 00:32:09,310 can be used to train networks like this 570 00:32:09,310 --> 00:32:14,440 by putting in a bunch of different images of people 571 00:32:14,440 --> 00:32:16,150 and other different categories that you 572 00:32:16,150 --> 00:32:17,650 might want to separate. 573 00:32:17,650 --> 00:32:19,720 And then each one of those images 574 00:32:19,720 --> 00:32:23,230 has a label, just like our perceptron learning rule. 575 00:32:23,230 --> 00:32:27,010 And we can use the image and the correct label-- 576 00:32:27,010 --> 00:32:32,640 face or dog-- and train that network 577 00:32:32,640 --> 00:32:38,560 by projecting that information into these intermediate layers 578 00:32:38,560 --> 00:32:41,380 to train that network to properly classify 579 00:32:41,380 --> 00:32:43,390 those different stimuli, OK? 
580 00:32:43,390 --> 00:32:47,770 This is, basically, the kind of technology 581 00:32:47,770 --> 00:32:51,830 that's currently being used to train-- 582 00:32:51,830 --> 00:32:53,470 this is being used in AI. 583 00:32:53,470 --> 00:32:57,880 It's being used to train driverless cars. 584 00:32:57,880 --> 00:33:02,350 All kinds of technological advances 585 00:33:02,350 --> 00:33:06,018 are based on this kind of technology here. 586 00:33:06,018 --> 00:33:07,060 Any questions about that? 587 00:33:07,060 --> 00:33:08,054 Aditi? 588 00:33:08,054 --> 00:33:10,540 AUDIENCE: So in actual neurons, I 589 00:33:10,540 --> 00:33:12,550 assume it's not linear, right? 590 00:33:12,550 --> 00:33:14,230 MICHALE FEE: Yes. 591 00:33:14,230 --> 00:33:17,560 These are all nonlinear neurons. 592 00:33:17,560 --> 00:33:19,960 They're more like these binary threshold units 593 00:33:19,960 --> 00:33:21,628 than they are like linear neurons. 594 00:33:21,628 --> 00:33:22,170 That's right. 595 00:33:22,170 --> 00:33:25,795 AUDIENCE: But then do you there's, like-- 596 00:33:25,795 --> 00:33:28,372 because right now, I imagine that models we make 597 00:33:28,372 --> 00:33:30,482 have to have way more perceptron units. 598 00:33:30,482 --> 00:33:31,190 MICHALE FEE: Yes. 599 00:33:31,190 --> 00:33:34,475 AUDIENCE: We use our simplified [INAUDIBLE].. 600 00:33:34,475 --> 00:33:35,850 But then our brain is sometimes-- 601 00:33:35,850 --> 00:33:38,610 I mean, it's at, like, a much faster level, 602 00:33:38,610 --> 00:33:41,090 like way faster, right? 603 00:33:41,090 --> 00:33:46,000 So you think it'd be like-- if we examine what functions 604 00:33:46,000 --> 00:33:50,320 neurons might be using, in a way that would let us reduce 605 00:33:50,320 --> 00:33:51,760 the number of units needed? 606 00:33:51,760 --> 00:33:53,584 Because right now, for example, [INAUDIBLE] 607 00:33:53,584 --> 00:33:55,380 be a bunch of lines. 608 00:33:55,380 --> 00:33:58,690 But maybe in the brain, there's some other function it's using, 609 00:33:58,690 --> 00:34:00,340 which is smoother. 610 00:34:00,340 --> 00:34:02,580 MICHALE FEE: Yeah. 611 00:34:02,580 --> 00:34:04,330 OK, so let me just make sure I understand. 612 00:34:04,330 --> 00:34:07,540 You're not talking about the F-I curve of the neurons? 613 00:34:07,540 --> 00:34:09,540 Is that correct? 614 00:34:09,540 --> 00:34:12,100 You're talking about the way that you figure out 615 00:34:12,100 --> 00:34:13,514 these weights. 616 00:34:13,514 --> 00:34:14,889 Is that what you're asking about? 617 00:34:14,889 --> 00:34:15,880 AUDIENCE: No. 618 00:34:15,880 --> 00:34:20,034 I'm asking if we use a more accurate F-I curve, 619 00:34:20,034 --> 00:34:21,657 we'll need less units. 620 00:34:21,657 --> 00:34:23,449 MICHALE FEE: OK, so that's a good question. 621 00:34:23,449 --> 00:34:26,230 I don't actually know the answer to the question 622 00:34:26,230 --> 00:34:29,350 of how the specific choice of F-I curve 623 00:34:29,350 --> 00:34:31,659 affects the performance of this. 624 00:34:31,659 --> 00:34:35,380 The big problem that people are trying to figure out 625 00:34:35,380 --> 00:34:39,489 in terms of how these are trained 626 00:34:39,489 --> 00:34:42,250 is the challenge that in order to train these networks, 627 00:34:42,250 --> 00:34:47,420 you actually need thousands and thousands, maybe millions, 628 00:34:47,420 --> 00:34:54,139 of examples of different objects here and the answer here. 
629 00:34:54,139 --> 00:34:56,510 So you have to put in many thousands 630 00:34:56,510 --> 00:35:00,620 of example images and the answer in order 631 00:35:00,620 --> 00:35:02,540 to train these networks. 632 00:35:02,540 --> 00:35:06,080 And that's not the way people actually learn. 633 00:35:06,080 --> 00:35:09,530 We don't walk around the world when we're one-year-old 634 00:35:09,530 --> 00:35:12,550 and our mother saying, dog, cat, person, house. 635 00:35:12,550 --> 00:35:16,130 You know, it would be... in order to give a person as many 636 00:35:16,130 --> 00:35:19,070 labeled examples as you need to give these networks, 637 00:35:19,070 --> 00:35:23,270 you would just be doing nothing, but your parents would be 638 00:35:23,270 --> 00:35:27,770 pointing things out to you and telling you one-word answers 639 00:35:27,770 --> 00:35:28,970 of what those are. 640 00:35:28,970 --> 00:35:32,300 Instead, what happens is we just observe the world 641 00:35:32,300 --> 00:35:34,970 and figure out kind of categories 642 00:35:34,970 --> 00:35:38,030 based on other sorts of learning rules that are unsupervised. 643 00:35:38,030 --> 00:35:40,610 We figure out, oh, that's a kind of thing, and then mom says, 644 00:35:40,610 --> 00:35:42,140 that's a dog. 645 00:35:42,140 --> 00:35:45,110 And then we know that that category is a dog. 646 00:35:45,110 --> 00:35:47,510 And we sometimes make mistakes, right? 647 00:35:47,510 --> 00:35:52,820 Like a kid might look at a bear and say, dog. 648 00:35:52,820 --> 00:35:55,840 And then dad says, no, no, that's not a dog, son. 649 00:35:59,930 --> 00:36:04,610 So the learning by which people train their networks 650 00:36:04,610 --> 00:36:06,560 to do classification of inputs is 651 00:36:06,560 --> 00:36:10,020 quite different from the way these deep neural networks 652 00:36:10,020 --> 00:36:10,520 work. 653 00:36:10,520 --> 00:36:15,340 And that's a very important and active area of research. 654 00:36:15,340 --> 00:36:15,840 Yes? 655 00:36:15,840 --> 00:36:19,330 AUDIENCE: Is the fact that [INAUDIBLE] use unsupervised 656 00:36:19,330 --> 00:36:22,690 learning, as well, to train a computer 657 00:36:22,690 --> 00:36:25,970 to recognize an image of a turtle as a gun, 658 00:36:25,970 --> 00:36:28,040 but humans can't do that [INAUDIBLE].. 659 00:36:28,040 --> 00:36:29,737 MICHALE FEE: Recognize a turtle if what? 660 00:36:29,737 --> 00:36:32,112 AUDIENCE: Like I saw this thing where it was like at MIT, 661 00:36:32,112 --> 00:36:33,910 they used an AI. 662 00:36:33,910 --> 00:36:35,810 They manipulated pixels in images 663 00:36:35,810 --> 00:36:38,128 and convinced the computer that it was something 664 00:36:38,128 --> 00:36:39,170 that it was not actually. 665 00:36:39,170 --> 00:36:40,160 MICHALE FEE: I see. 666 00:36:40,160 --> 00:36:40,430 Yeah. 667 00:36:40,430 --> 00:36:41,885 AUDIENCE: So like you would see a picture of a turtle, 668 00:36:41,885 --> 00:36:43,510 but the computer would get that picture 669 00:36:43,510 --> 00:36:45,200 and say it was, like, a machine gun. 670 00:36:45,200 --> 00:36:47,660 MICHALE FEE: Just by manipulating a few pixels 671 00:36:47,660 --> 00:36:49,397 and kind of screwing with its mind. 672 00:36:49,397 --> 00:36:49,980 AUDIENCE: Yes. 673 00:36:49,980 --> 00:36:50,990 So it's [INAUDIBLE]. 674 00:36:54,350 --> 00:36:55,160 MICHALE FEE: Yeah. 675 00:36:55,160 --> 00:36:57,722 Well, people can be tricked by different things. 676 00:37:01,700 --> 00:37:05,490 The answer is, yes, it's related to that. 
677 00:37:05,490 --> 00:37:08,090 The problem is after you do this training, 678 00:37:08,090 --> 00:37:09,890 we actually don't really understand 679 00:37:09,890 --> 00:37:14,090 what's going on in the guts of this network. 680 00:37:14,090 --> 00:37:16,640 It's very hard to look at the inside of this network 681 00:37:16,640 --> 00:37:22,090 after it's trained and understand what it's doing. 682 00:37:22,090 --> 00:37:25,180 And so we don't know the answer why 683 00:37:25,180 --> 00:37:28,570 it is that you can fool one of these networks 684 00:37:28,570 --> 00:37:30,550 by changing a few pixels. 685 00:37:30,550 --> 00:37:33,385 Something goes wrong in here, and we don't know what it is. 686 00:37:33,385 --> 00:37:35,920 It may very well have to do with the way it's trained, 687 00:37:35,920 --> 00:37:41,830 rather than building categories in an unsupervised way, which 688 00:37:41,830 --> 00:37:43,940 could be much more generalizable. 689 00:37:43,940 --> 00:37:46,048 So good question. 690 00:37:46,048 --> 00:37:47,340 I don't really know the answer. 691 00:37:50,330 --> 00:37:50,830 Yes? 692 00:37:50,830 --> 00:37:52,372 AUDIENCE: Sorry, can you explain what 693 00:37:52,372 --> 00:37:56,280 you mean [INAUDIBLE] the neural network needs an answer? 694 00:37:56,280 --> 00:38:00,310 They're not categorized and then tell the user dogs? 695 00:38:00,310 --> 00:38:02,420 MICHALE FEE: Yeah, so no, in order 696 00:38:02,420 --> 00:38:05,390 to train one of these networks, you have to give it a data set, 697 00:38:05,390 --> 00:38:07,640 a labeled data set. 698 00:38:07,640 --> 00:38:11,270 So a set of images that already has the answer 699 00:38:11,270 --> 00:38:15,252 that was labeled by a person. 700 00:38:15,252 --> 00:38:16,710 AUDIENCE: So you can't just give it 701 00:38:16,710 --> 00:38:19,046 a set of photos of puppies and snakes 702 00:38:19,046 --> 00:38:21,320 and it'll categorize them into two groups? 703 00:38:21,320 --> 00:38:23,195 MICHALE FEE: No, nobody knows how to do that. 704 00:38:25,890 --> 00:38:31,220 People are working on that, but it's not known yet. 705 00:38:31,220 --> 00:38:32,010 Yes, Jasmine? 706 00:38:34,640 --> 00:38:41,080 AUDIENCE: [INAUDIBLE] but I see [INAUDIBLE] I 707 00:38:41,080 --> 00:38:44,310 can't separate them and like adding an additional feature 708 00:38:44,310 --> 00:38:47,874 to raise it to a higher dimensional space, where 709 00:38:47,874 --> 00:38:50,203 it's separable? 710 00:38:50,203 --> 00:38:52,120 MICHALE FEE: Sorry, I didn't quite understand. 711 00:38:52,120 --> 00:38:53,806 Can you say it again? 712 00:38:53,806 --> 00:38:56,221 AUDIENCE: I think I remember reading somewhere 713 00:38:56,221 --> 00:39:02,182 about how when the scenes are nonlinearly separable-- 714 00:39:02,182 --> 00:39:02,890 MICHALE FEE: Yes. 715 00:39:02,890 --> 00:39:05,720 AUDIENCE: --you can add in another feature to [INAUDIBLE].. 716 00:39:05,720 --> 00:39:06,720 MICHALE FEE: Yeah, yeah. 717 00:39:06,720 --> 00:39:09,090 So let me show you an example of that. 718 00:39:09,090 --> 00:39:11,850 So coming back to the exclusive OR. 719 00:39:11,850 --> 00:39:14,130 So one thing that you can do, you 720 00:39:14,130 --> 00:39:18,570 can see that the reason this is linearly inseparable-- it's not 721 00:39:18,570 --> 00:39:20,970 linearly separable-- is because all these points are 722 00:39:20,970 --> 00:39:23,040 in a plane. 723 00:39:23,040 --> 00:39:26,620 So there's no line that separates them. 
724 00:39:26,620 --> 00:39:29,250 But one way, one sort of trick you can do, 725 00:39:29,250 --> 00:39:30,980 is to add noise to this. 726 00:39:30,980 --> 00:39:33,930 So that now, some of these points move. 727 00:39:33,930 --> 00:39:36,040 You can add another dimension. 728 00:39:36,040 --> 00:39:38,440 So now let's say that we add noise, 729 00:39:38,440 --> 00:39:41,790 and we just, by chance, happen to move the green dots this way 730 00:39:41,790 --> 00:39:44,610 and the red dots, well, that way. 731 00:39:44,610 --> 00:39:47,400 And now there's a plane that will separate the red dots 732 00:39:47,400 --> 00:39:49,260 from the green dots. 733 00:39:49,260 --> 00:39:55,170 So that's advanced beyond the scope of what 734 00:39:55,170 --> 00:39:56,320 we're talking about here. 735 00:39:56,320 --> 00:39:57,870 But yes, there are tricks that you 736 00:39:57,870 --> 00:40:02,070 can play to get around this exclusive OR 737 00:40:02,070 --> 00:40:06,570 problem, this linear separability problem, OK? 738 00:40:06,570 --> 00:40:08,940 All right, great question. 739 00:40:08,940 --> 00:40:12,660 All right, let's push on. 740 00:40:12,660 --> 00:40:18,000 So let's talk about more general two-layer 741 00:40:18,000 --> 00:40:20,730 feed-forward networks. 742 00:40:20,730 --> 00:40:25,800 So this is referred to as a two-layer network-- an input 743 00:40:25,800 --> 00:40:28,240 layer and an output layer. 744 00:40:28,240 --> 00:40:31,070 And in this case, we had a single input neuron 745 00:40:31,070 --> 00:40:32,690 and a single output neuron. 746 00:40:32,690 --> 00:40:36,780 We generalized that to having multiple input neurons and one 747 00:40:36,780 --> 00:40:37,470 output neuron. 748 00:40:37,470 --> 00:40:39,450 We saw that we can write down the input current 749 00:40:39,450 --> 00:40:43,500 to this output neuron as w, the vector of weights, 750 00:40:43,500 --> 00:40:46,080 dotted into the vector of input firing rates 751 00:40:46,080 --> 00:40:49,310 to give us an expression for the firing rate of the output 752 00:40:49,310 --> 00:40:50,310 neuron. 753 00:40:50,310 --> 00:40:52,080 And now we can generalize that further 754 00:40:52,080 --> 00:40:54,520 to the case of multiple output neurons. 755 00:40:54,520 --> 00:40:57,420 So we have multiple input neurons, multiple output 756 00:40:57,420 --> 00:40:59,040 neurons. 757 00:40:59,040 --> 00:41:00,510 You can see that we have a vector 758 00:41:00,510 --> 00:41:02,910 of firing rates of the input neurons 759 00:41:02,910 --> 00:41:07,100 and a vector of firing rates of the output neurons. 760 00:41:07,100 --> 00:41:10,043 So we used to just have one of these output neurons, 761 00:41:10,043 --> 00:41:11,710 and now we've got a whole bunch of them. 762 00:41:11,710 --> 00:41:14,520 And so we have to write down a vector of fire rates 763 00:41:14,520 --> 00:41:16,210 in the output layer. 764 00:41:16,210 --> 00:41:19,560 And now we can write down the firing rate of our output 765 00:41:19,560 --> 00:41:20,590 neurons as follows. 766 00:41:20,590 --> 00:41:22,410 So the firing rate of this neuron 767 00:41:22,410 --> 00:41:28,170 here is going to be a dot product of the vector 768 00:41:28,170 --> 00:41:31,110 of weights onto it. 769 00:41:31,110 --> 00:41:33,060 So the firing rate of output neuron one 770 00:41:33,060 --> 00:41:39,180 is the vector of weights onto that first output neuron dotted 771 00:41:39,180 --> 00:41:43,200 into the vector of input firing rates. 
772 00:41:43,200 --> 00:41:46,380 And the same for the next output neuron. 773 00:41:46,380 --> 00:41:47,940 The firing rate of output neuron two 774 00:41:47,940 --> 00:41:52,350 is the dot product of the weights onto that output neuron two 775 00:41:52,350 --> 00:41:56,040 with the vector of input firing rates. 776 00:41:56,040 --> 00:41:57,900 Same for neuron three. 777 00:41:57,900 --> 00:42:00,500 And we can write that down as follows. 778 00:42:00,500 --> 00:42:03,600 So the a-th output-- the firing rate 779 00:42:03,600 --> 00:42:06,150 of the a-th output neuron is the weight vector 780 00:42:06,150 --> 00:42:09,390 onto the a-th output neuron dotted into the input firing 781 00:42:09,390 --> 00:42:10,530 rate vector, OK? 782 00:42:10,530 --> 00:42:12,690 And we can write that down as follows, 783 00:42:12,690 --> 00:42:15,810 where we've now introduced a new thing here, 784 00:42:15,810 --> 00:42:20,780 which is a matrix of weights. 785 00:42:20,780 --> 00:42:23,300 So it's called the weight matrix. 786 00:42:23,300 --> 00:42:26,600 And it essentially is a matrix of all 787 00:42:26,600 --> 00:42:32,900 of these synaptic weights, from the input layer onto the output 788 00:42:32,900 --> 00:42:33,540 layer. 789 00:42:33,540 --> 00:42:36,830 And now if we had linear neurons, 790 00:42:36,830 --> 00:42:40,900 we can write down the firing rates of the output neurons. 791 00:42:40,900 --> 00:42:45,560 The firing rate vector of the output neurons 792 00:42:45,560 --> 00:42:52,610 is just this weight matrix times the vector of input firing rates. 793 00:42:52,610 --> 00:42:56,240 So now, we've rewritten this problem 794 00:42:56,240 --> 00:42:59,870 of finding the vector of output firing rates 795 00:42:59,870 --> 00:43:02,650 as a matrix multiplication. 796 00:43:02,650 --> 00:43:05,490 And we're going to spend some time talking about what 797 00:43:05,490 --> 00:43:09,030 that means and what that does. 798 00:43:09,030 --> 00:43:12,590 So our feed-forward network implements a matrix 799 00:43:12,590 --> 00:43:13,970 multiplication. 800 00:43:13,970 --> 00:43:16,790 All right, so let's take a closer look at what 801 00:43:16,790 --> 00:43:20,780 this weight matrix looks like. 802 00:43:20,780 --> 00:43:26,340 So we have a weight matrix w sub a comma b that looks like this. 803 00:43:26,340 --> 00:43:29,360 So we have four input neurons and four output neurons. 804 00:43:29,360 --> 00:43:34,670 We have a weight for each input neuron onto each output neuron. 805 00:43:34,670 --> 00:43:40,280 The columns here correspond to different input neurons. 806 00:43:40,280 --> 00:43:42,900 The rows correspond to different output neurons. 807 00:43:42,900 --> 00:43:46,550 Remember, for a matrix, the elements 808 00:43:46,550 --> 00:43:54,713 are listed as w sub a, b, where a is the output neuron 809 00:43:54,713 --> 00:43:55,630 and b is the input neuron. 810 00:43:55,630 --> 00:44:01,760 So it's w postsynaptic, presynaptic-- post, pre. 811 00:44:01,760 --> 00:44:04,010 Rows, columns. 812 00:44:04,010 --> 00:44:07,400 So the rows are the different output neurons. 813 00:44:07,400 --> 00:44:09,485 The columns are the different input neurons. 814 00:44:12,210 --> 00:44:15,980 So it can be a little tricky to remember. 815 00:44:15,980 --> 00:44:21,030 I just remember that it's rows-- 816 00:44:21,030 --> 00:44:23,890 a matrix is labeled by rows and columns. 817 00:44:23,890 --> 00:44:28,000 And weight matrices are postsynaptic, presynaptic-- 818 00:44:28,000 --> 00:44:28,660 post, pre.
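For those following along in MATLAB, here is a minimal sketch of this step; the weight values, input rates, and threshold are made up for illustration. Each row of the weight matrix W holds the weights onto one output neuron (post, pre), so for linear neurons the vector of output firing rates is just W times u, and for threshold-linear neurons you apply the F-I function to that product.

```matlab
% Hypothetical weight matrix for 4 input and 4 output neurons.
% Row a holds the weights w(a,b) onto output neuron a (post, pre).
W = [0.5 0.1 0.0 0.2;
     0.0 0.8 0.3 0.0;
     0.1 0.0 0.6 0.4;
     0.2 0.2 0.0 0.7];

u = [1; 0; 2; 1];            % column vector of input firing rates

v_linear = W * u;            % linear neurons: output rates = W*u

theta = 0.5;                 % made-up threshold for a threshold-linear F-I curve
v = max(W * u - theta, 0);   % rectified (threshold-linear) output rates
```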
819 00:44:31,370 --> 00:44:35,160 AUDIENCE: [INAUDIBLE] comment of [INAUDIBLE]?? 820 00:44:35,160 --> 00:44:37,410 MICHALE FEE: I think that's standard. 821 00:44:37,410 --> 00:44:41,050 I'm pretty sure that's very standard. 822 00:44:41,050 --> 00:44:43,880 If you find any exceptions let me know. 823 00:44:43,880 --> 00:44:49,710 OK, we can think of each row of this matrix 824 00:44:49,710 --> 00:44:53,510 as being the vector of weights onto one output neuron. 825 00:44:56,890 --> 00:45:01,960 That row is a vector of weights onto that output neuron-- 826 00:45:01,960 --> 00:45:05,123 that row, that output neuron; that row, that output neuron. 827 00:45:05,123 --> 00:45:06,040 Does that makes sense? 828 00:45:09,590 --> 00:45:13,350 All right, so let's flesh out this matrix multiplication. 829 00:45:13,350 --> 00:45:15,838 The vector of output firing rates, 830 00:45:15,838 --> 00:45:17,880 we're going to write it as a column vector, where 831 00:45:17,880 --> 00:45:20,670 the first number is this firing rate. 832 00:45:20,670 --> 00:45:22,440 That number is that firing rate. 833 00:45:22,440 --> 00:45:25,560 That number represents that firing rate, OK? 834 00:45:25,560 --> 00:45:27,390 That's equal to this weight matrix 835 00:45:27,390 --> 00:45:31,850 times the vector of input firing rates, 836 00:45:31,850 --> 00:45:36,040 again, written as a column vector. 837 00:45:36,040 --> 00:45:40,320 And in order to calculate the firing rate of the first output 838 00:45:40,320 --> 00:45:44,610 neuron, we take the dot product of the first row of the weight 839 00:45:44,610 --> 00:45:53,020 matrix and the column vector of input firing rates. 840 00:45:53,020 --> 00:45:59,070 And that gives us this first firing rate, OK? 841 00:45:59,070 --> 00:46:00,630 To get the second firing rate, we 842 00:46:00,630 --> 00:46:03,870 take the dot product of the second row of weights 843 00:46:03,870 --> 00:46:06,570 with the vector of firing rates, and that gives us 844 00:46:06,570 --> 00:46:10,050 this second firing rate. 845 00:46:10,050 --> 00:46:11,310 Any questions about that? 846 00:46:11,310 --> 00:46:16,740 Just a brief reminder of matrix multiplication. 847 00:46:16,740 --> 00:46:19,281 All right, no questions? 848 00:46:19,281 --> 00:46:26,910 All right, so let's take a step back and go quickly 849 00:46:26,910 --> 00:46:30,300 through some basic matrix algebra. 850 00:46:30,300 --> 00:46:32,670 I know most of you have probably seen this, 851 00:46:32,670 --> 00:46:35,970 but many haven't, so we're just going to go through it. 852 00:46:35,970 --> 00:46:40,110 All right, so just as vectors are-- 853 00:46:40,110 --> 00:46:42,570 you can think of them as a collection of numbers 854 00:46:42,570 --> 00:46:44,190 that you write down. 855 00:46:44,190 --> 00:46:47,970 So let's say that you are making a measurement of two 856 00:46:47,970 --> 00:46:48,850 different things-- 857 00:46:48,850 --> 00:46:52,740 let's say temperature and humidity. 858 00:46:52,740 --> 00:46:55,980 So you can write down a vector that represents those two 859 00:46:55,980 --> 00:46:57,160 quantities. 860 00:46:57,160 --> 00:47:00,550 So matrices you can think of as collections of vectors. 861 00:47:00,550 --> 00:47:03,870 So let's say we take those two measurements 862 00:47:03,870 --> 00:47:05,980 at different times, at three different times. 
863 00:47:05,980 --> 00:47:11,910 So now we have a vector one, a vector two, and a vector three 864 00:47:11,910 --> 00:47:14,760 that measure those two quantities at three 865 00:47:14,760 --> 00:47:16,620 different times, all right? 866 00:47:16,620 --> 00:47:19,350 So we can now write all of those measurements 867 00:47:19,350 --> 00:47:22,860 down as a matrix, where we collect 868 00:47:22,860 --> 00:47:27,900 each one of those vectors as a column in our matrix, 869 00:47:27,900 --> 00:47:28,900 like that. 870 00:47:28,900 --> 00:47:32,070 Any questions about that? 871 00:47:32,070 --> 00:47:37,170 And there's a bit of MATLAB code that calculates this matrix 872 00:47:37,170 --> 00:47:40,180 by writing three different column vectors 873 00:47:40,180 --> 00:47:42,030 and then concatenating them into a matrix. 874 00:47:45,130 --> 00:47:47,930 All right, and you can see that in this matrix, 875 00:47:47,930 --> 00:47:52,070 the columns are just the original vectors, 876 00:47:52,070 --> 00:47:53,990 and the rows are-- 877 00:47:53,990 --> 00:47:56,480 you can think of those as a time series 878 00:47:56,480 --> 00:47:59,010 of our first measurement, let's say temperature. 879 00:47:59,010 --> 00:48:01,610 So that's temperature as a function of time. 880 00:48:01,610 --> 00:48:08,005 This is temperature and humidity at one time. 881 00:48:08,005 --> 00:48:08,880 Does that make sense? 882 00:48:11,480 --> 00:48:14,180 All right, so, again, we can write down this matrix. 883 00:48:14,180 --> 00:48:16,370 Remember, this is the first measurement 884 00:48:16,370 --> 00:48:18,980 at time two, the first measurement at time three. 885 00:48:18,980 --> 00:48:21,650 We have two rows and three columns. 886 00:48:21,650 --> 00:48:23,270 We can also write down what's known 887 00:48:23,270 --> 00:48:27,080 as the transpose of a matrix that just flips the rows 888 00:48:27,080 --> 00:48:27,660 and columns. 889 00:48:27,660 --> 00:48:30,290 So we can write the transpose, which is 890 00:48:30,290 --> 00:48:33,860 indicated by this superscript capital T. 891 00:48:33,860 --> 00:48:36,140 And here, we're just flipping the rows and columns. 892 00:48:36,140 --> 00:48:41,510 So the first row of this matrix becomes the first column 893 00:48:41,510 --> 00:48:43,220 of the transposed matrix. 894 00:48:43,220 --> 00:48:47,450 So we have three rows and two columns. 895 00:48:47,450 --> 00:48:49,140 A symmetric matrix-- 896 00:48:49,140 --> 00:48:50,940 I'm just defining some terms now. 897 00:48:50,940 --> 00:48:54,360 A symmetric matrix is a matrix where 898 00:48:54,360 --> 00:48:58,650 the off-diagonal elements-- so let me just define, 899 00:48:58,650 --> 00:49:01,800 that's the diagonal, the matrix diagonal. 900 00:49:01,800 --> 00:49:04,650 And a symmetric matrix has the property 901 00:49:04,650 --> 00:49:08,130 that the off-diagonal elements are mirror images of each other across 902 00:49:08,130 --> 00:49:11,040 the diagonal-- element a, b equals element b, a. And a symmetric matrix has the property 903 00:49:11,040 --> 00:49:14,970 that the transpose of that matrix is equal to the matrix, 904 00:49:14,970 --> 00:49:15,600 OK? 905 00:49:15,600 --> 00:49:18,930 That is only possible, of course, 906 00:49:18,930 --> 00:49:23,017 if the matrix has the same number of rows and columns, 907 00:49:23,017 --> 00:49:24,600 if it's what's called a square matrix. 908 00:49:28,990 --> 00:49:31,030 Let me just remind you, in general 909 00:49:31,030 --> 00:49:33,290 about matrix multiplication. 910 00:49:33,290 --> 00:49:36,820 We can write down the product of two matrices.
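Here is a sketch of the bit of MATLAB code the slide refers to, with made-up temperature and humidity values: three column vectors concatenated into a 2-by-3 data matrix, its transpose, and a quick check of what it means for a square matrix to be symmetric.

```matlab
% Measurements (temperature; humidity) at three times -- numbers are made up.
x1 = [20; 65];
x2 = [22; 60];
x3 = [25; 55];

X  = [x1 x2 x3];          % 2-by-3 matrix: each column is one measurement vector
Xt = X';                  % 3-by-2 transpose: rows and columns flipped

% A square matrix A is symmetric when A(a,b) = A(b,a), i.e. A' == A.
A = [1 2; 2 5];
isequal(A', A)            % returns true (logical 1)
```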
911 00:49:36,820 --> 00:49:40,090 And we do that multiplication by taking the dot product 912 00:49:40,090 --> 00:49:44,590 of each row in the first matrix with each column 913 00:49:44,590 --> 00:49:46,000 in the second matrix. 914 00:49:46,000 --> 00:49:49,930 So here's the product of matrix A and matrix B. 915 00:49:49,930 --> 00:49:52,660 So there's the product. 916 00:49:52,660 --> 00:49:56,020 If this matrix, if matrix A, is an m by k-- 917 00:49:56,020 --> 00:49:59,090 m rows by k columns-- 918 00:49:59,090 --> 00:50:05,090 and matrix B has k rows by n columns, 919 00:50:05,090 --> 00:50:09,020 then the product of those two matrices 920 00:50:09,020 --> 00:50:14,180 will be m by n-- m rows by n columns. 921 00:50:14,180 --> 00:50:17,000 And you can see that in order for matrix multiplication 922 00:50:17,000 --> 00:50:23,510 to work, the number of columns of the first matrix 923 00:50:23,510 --> 00:50:25,970 must equal the number of rows in the second matrix. 924 00:50:25,970 --> 00:50:30,890 You can see that this k has to be the same for both matrices. 925 00:50:30,890 --> 00:50:34,120 Does that make sense? 926 00:50:34,120 --> 00:50:37,300 So, again, in order to compute this element right here, 927 00:50:37,300 --> 00:50:40,675 we take the dot product of the first row of A 928 00:50:40,675 --> 00:50:46,450 and the first column of B. That's just 1 times 4, is 4. 929 00:50:46,450 --> 00:50:49,370 Plus negative 2 times 7 is minus 14. 930 00:50:49,370 --> 00:50:51,490 Plus 0 times minus 1 is 0. 931 00:50:51,490 --> 00:50:53,800 Add those up and you get minus 10. 932 00:50:53,800 --> 00:50:55,090 So you get this number. 933 00:50:55,090 --> 00:50:57,040 Then you take the dot product 934 00:50:57,040 --> 00:50:58,990 of this row with this column, and so on. 935 00:51:02,710 --> 00:51:06,310 Notice, A times B is not equal to B times A. 936 00:51:06,310 --> 00:51:11,470 In fact, in the case of rectangular matrices, matrices that aren't 937 00:51:11,470 --> 00:51:15,160 square, you often can't even do 938 00:51:15,160 --> 00:51:18,760 this multiplication in a different order. 939 00:51:18,760 --> 00:51:22,420 Mathematically, it doesn't make sense. 940 00:51:22,420 --> 00:51:27,100 So let's say that we have a matrix of vectors, 941 00:51:27,100 --> 00:51:29,020 and we want to take the dot product 942 00:51:29,020 --> 00:51:35,420 of each one of those vectors x with some other vector v. So 943 00:51:35,420 --> 00:51:36,720 let's just write that down. 944 00:51:36,720 --> 00:51:40,410 The way to do that is to say the answer here, 945 00:51:40,410 --> 00:51:44,130 the dot product of each one of those column vectors 946 00:51:44,130 --> 00:51:46,730 in our matrix with this other vector 947 00:51:46,730 --> 00:51:49,580 v, we do by taking the transpose of v, 948 00:51:49,580 --> 00:51:53,100 which takes a column vector and turns it into a row vector. 949 00:51:53,100 --> 00:51:56,660 And we can now multiply that by our data matrix x 950 00:51:56,660 --> 00:52:01,700 by taking the dot product of v with that column of x. 951 00:52:01,700 --> 00:52:05,100 And that gives us a matrix. 952 00:52:05,100 --> 00:52:09,750 So this matrix here, that vector is a one by two matrix. 953 00:52:09,750 --> 00:52:11,450 This is a two by three matrix. 954 00:52:11,450 --> 00:52:16,010 The product of those is a one by three matrix. 955 00:52:16,010 --> 00:52:18,790 Any questions about that? 956 00:52:18,790 --> 00:52:19,480 OK. 957 00:52:19,480 --> 00:52:21,860 We can do this a different way.
958 00:52:21,860 --> 00:52:25,420 Notice that the result of this multiplication 959 00:52:25,420 --> 00:52:27,578 here is a row vector, y. 960 00:52:27,578 --> 00:52:28,870 We can do this a different way. 961 00:52:28,870 --> 00:52:30,740 We can take dot product. 962 00:52:30,740 --> 00:52:35,350 We can also compute this as y equals x transpose v. 963 00:52:35,350 --> 00:52:37,360 So here, we've taken the transpose of the data 964 00:52:37,360 --> 00:52:40,790 matrix times this column vector v. 965 00:52:40,790 --> 00:52:43,850 And again, we take the dot product of this, 966 00:52:43,850 --> 00:52:45,650 this with this, and that with that. 967 00:52:45,650 --> 00:52:47,860 And now we get a column vector that 968 00:52:47,860 --> 00:52:50,650 has the same entries that we had over here. 969 00:52:53,980 --> 00:52:57,440 All right, so I'm just showing you different ways 970 00:52:57,440 --> 00:53:00,920 that you can manipulate a vector in a matrix 971 00:53:00,920 --> 00:53:08,120 to compute the dot product of elements of vectors 972 00:53:08,120 --> 00:53:11,870 within a data matrix and other vectors 973 00:53:11,870 --> 00:53:13,490 that you're interested in. 974 00:53:16,720 --> 00:53:19,160 All right, identity matrix. 975 00:53:19,160 --> 00:53:21,580 So when you're multiplying numbers together, 976 00:53:21,580 --> 00:53:24,370 the number one has the special property 977 00:53:24,370 --> 00:53:27,910 that you can multiply any real number by one 978 00:53:27,910 --> 00:53:29,320 and get the same number back. 979 00:53:33,930 --> 00:53:39,030 You have the same kind of element in matrices. 980 00:53:39,030 --> 00:53:42,530 So is there a matrix that when multiplied by A gives you A? 981 00:53:42,530 --> 00:53:43,530 And the answer is yes. 982 00:53:43,530 --> 00:53:45,640 It's called the identity matrix. 983 00:53:45,640 --> 00:53:49,230 So it's given by the symbol I, usually. 984 00:53:49,230 --> 00:53:54,540 A times I equals A. What does that matrix look like? 985 00:53:54,540 --> 00:53:56,950 Again, the identity matrix looks like this. 986 00:53:56,950 --> 00:54:01,320 It's a square matrix that has ones along the diagonal 987 00:54:01,320 --> 00:54:02,970 and zero everywhere else. 988 00:54:05,580 --> 00:54:09,180 So you can see here that if you take an arbitrary vector x, 989 00:54:09,180 --> 00:54:12,900 multiplied by the identity matrix, 990 00:54:12,900 --> 00:54:18,630 you can see that this product is x1, x2 dotted into 1, 991 00:54:18,630 --> 00:54:21,030 0, which gives you x1. 992 00:54:21,030 --> 00:54:25,230 x1, x2 dotted into 0, 1, gives you x2. 993 00:54:25,230 --> 00:54:29,560 And so the answer looks like that, which is just x. 994 00:54:29,560 --> 00:54:32,450 So the identity matrix times an arbitrary vector x 995 00:54:32,450 --> 00:54:35,420 gives you x back. 996 00:54:35,420 --> 00:54:40,560 Another very useful application of linear algebra, 997 00:54:40,560 --> 00:54:43,720 linear algebra tools, is to solve systems of equations. 998 00:54:43,720 --> 00:54:46,240 So let me show you what that looks like. 999 00:54:46,240 --> 00:54:52,230 So let's say we want to solve a simple equation, ax equals c. 1000 00:54:52,230 --> 00:54:54,720 So, in this case, how do you solve for x? 1001 00:54:54,720 --> 00:54:57,600 Well, you're just going to divide both sides by a, right? 1002 00:54:57,600 --> 00:54:59,640 So if you divide both sides by a, 1003 00:54:59,640 --> 00:55:04,020 you get that x equals 1 over a times c. 
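As a concrete MATLAB sketch of these two equivalent manipulations (the numbers are arbitrary): multiplying v transpose by the data matrix gives a row vector of dot products, and multiplying the transposed data matrix by v gives the same numbers as a column vector. The identity-matrix check at the end mirrors the point just made.

```matlab
X = [1 3 5;
     2 4 6];          % data matrix: three column vectors in R^2
v = [10; 1];          % vector to dot with each column of X

y_row = v' * X;       % 1-by-3 row vector of dot products: [12 34 56]
y_col = X' * v;       % 3-by-1 column vector with the same entries

I = eye(2);           % 2-by-2 identity matrix
isequal(I * v, v)     % returns true: I times any vector gives that vector back
```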
1004 00:55:04,020 --> 00:55:07,980 So it turns out that there is a matrix equivalent 1005 00:55:07,980 --> 00:55:11,800 of that, that allows you to solve systems of equations. 1006 00:55:11,800 --> 00:55:14,610 So if you have a pair of equations-- 1007 00:55:14,610 --> 00:55:18,570 x minus 2y equals 3 and 3x plus y equals 5-- 1008 00:55:18,570 --> 00:55:21,360 you can write this down as a matrix equation, 1009 00:55:21,360 --> 00:55:23,910 where you have a matrix 1, minus 2, 1010 00:55:23,910 --> 00:55:26,960 3, 1, which correspond to the coefficients of x and y 1011 00:55:26,960 --> 00:55:28,500 in these equations. 1012 00:55:28,500 --> 00:55:36,120 Times a vector x, y is equal to another vector, 3, 5. 1013 00:55:36,120 --> 00:55:40,570 So you can write this down as Ax equals c-- 1014 00:55:40,570 --> 00:55:42,420 that's kind of nice-- 1015 00:55:42,420 --> 00:55:46,440 where this matrix A is given by these coefficients 1016 00:55:46,440 --> 00:55:49,650 and this vector c is given by these terms 1017 00:55:49,650 --> 00:55:53,620 on this side of the equation, on the right side of the equation. 1018 00:55:53,620 --> 00:55:55,990 Now, how do we solve this? 1019 00:55:55,990 --> 00:56:02,510 Well, can we just divide both sides of that matrix equation, 1020 00:56:02,510 --> 00:56:04,670 that vector equation, by A? 1021 00:56:04,670 --> 00:56:08,450 So division is not really defined for matrices, 1022 00:56:08,450 --> 00:56:10,460 but we can use another trick. 1023 00:56:10,460 --> 00:56:12,800 We can multiply both sides of this equation 1024 00:56:12,800 --> 00:56:17,590 by something that makes the A go away. 1025 00:56:17,590 --> 00:56:22,760 And so that magical thing is called the inverse of A. 1026 00:56:22,760 --> 00:56:24,890 So we take the inverse of matrix A, 1027 00:56:24,890 --> 00:56:28,420 denoted by A with this superscript minus 1. 1028 00:56:28,420 --> 00:56:31,890 And that's the standard notation for identifying the inverse. 1029 00:56:31,890 --> 00:56:34,220 It has the property that A inverse times 1030 00:56:34,220 --> 00:56:37,840 A equals the identity matrix. 1031 00:56:37,840 --> 00:56:39,780 So you can sort of think of A inverse 1032 00:56:39,780 --> 00:56:45,090 as the identity matrix divided by A. Anyway, don't really 1033 00:56:45,090 --> 00:56:47,580 think of it like that. 1034 00:56:47,580 --> 00:56:51,270 So to solve this system of equations Ax equals c, 1035 00:56:51,270 --> 00:56:56,420 we multiply both sides by that A inverse matrix. 1036 00:56:56,420 --> 00:56:58,130 And so that looks like this-- 1037 00:56:58,130 --> 00:57:03,240 A inverse A times x equals A inverse c. 1038 00:57:03,240 --> 00:57:05,790 A inverse A is just what? 1039 00:57:05,790 --> 00:57:10,920 The identity matrix times x equals A inverse c. 1040 00:57:10,920 --> 00:57:14,100 And we just saw before that the identity matrix times x 1041 00:57:14,100 --> 00:57:15,930 is just x. 1042 00:57:15,930 --> 00:57:18,240 All right, so there's the solution 1043 00:57:18,240 --> 00:57:24,140 to this system of equations. 1044 00:57:24,140 --> 00:57:25,640 All right, any questions about that? 1045 00:57:30,220 --> 00:57:33,000 So how do you find the inverse of a matrix? 1046 00:57:33,000 --> 00:57:34,650 What is this A inverse? 1047 00:57:34,650 --> 00:57:37,900 How do you get it in real life? 1048 00:57:37,900 --> 00:57:40,590 So in real life, what you usually do is 1049 00:57:40,590 --> 00:57:44,250 you would just use the matrix inverse function in MATLAB.
1050 00:57:44,250 --> 00:57:47,520 Because for any matrices other than a two-by-two, 1051 00:57:47,520 --> 00:57:50,160 it's really annoying to get a matrix inverse. 1052 00:57:50,160 --> 00:57:52,800 But for a two-by-two matrix, it's actually pretty easy. 1053 00:57:52,800 --> 00:57:56,340 You can almost just get the answer by looking at the matrix 1054 00:57:56,340 --> 00:57:58,110 and writing down the inverse. 1055 00:57:58,110 --> 00:57:59,530 It looks like this. 1056 00:57:59,530 --> 00:58:03,360 The inverse of a two-by-two square matrix is just given 1057 00:58:03,360 --> 00:58:06,970 by a slight reordering of the coefficients, 1058 00:58:06,970 --> 00:58:09,600 of the entries of that matrix, divided by what's called 1059 00:58:09,600 --> 00:58:14,100 the determinant of A. So what you do is you flip-- 1060 00:58:14,100 --> 00:58:18,090 in a two-by-two matrix, you flip the a and the d, 1061 00:58:18,090 --> 00:58:24,990 and then you multiply the off-diagonal elements, b and c, by minus 1. 1062 00:58:24,990 --> 00:58:26,640 Now, what is this determinant? 1063 00:58:26,640 --> 00:58:33,060 The determinant is given by a times d minus b times c. 1064 00:58:33,060 --> 00:58:35,530 And you can prove that that actually 1065 00:58:35,530 --> 00:58:39,940 is the inverse, because if we take this and multiply it by A, 1066 00:58:39,940 --> 00:58:43,450 what you find when you multiply that out is that that's just 1067 00:58:43,450 --> 00:58:48,370 equal to the identity matrix. 1068 00:58:48,370 --> 00:58:52,060 So a matrix has an inverse if and only 1069 00:58:52,060 --> 00:58:55,360 if the determinant is not equal to zero. 1070 00:58:55,360 --> 00:58:57,220 If the determinant is equal to zero, 1071 00:58:57,220 --> 00:58:59,260 you can see that this thing blows up, 1072 00:58:59,260 --> 00:59:02,250 and there's no inverse. 1073 00:59:02,250 --> 00:59:04,510 We're going to spend a little bit of time 1074 00:59:04,510 --> 00:59:07,630 later talking about what that means when a matrix has 1075 00:59:07,630 --> 00:59:11,110 an inverse and what the determinant actually 1076 00:59:11,110 --> 00:59:18,920 corresponds to in a matrix multiplication context. 1077 00:59:18,920 --> 00:59:20,870 If the determinant is equal to zero, 1078 00:59:20,870 --> 00:59:24,260 we say that that matrix is singular. 1079 00:59:24,260 --> 00:59:27,710 And in that case, you can't actually find an inverse, 1080 00:59:27,710 --> 00:59:32,240 and you can't solve this equation right here, 1081 00:59:32,240 --> 00:59:33,950 this system of equations. 1082 00:59:38,720 --> 00:59:42,600 All right, so let's actually go through this example. 1083 00:59:42,600 --> 00:59:45,530 So here's our equation, Ax equals c. 1084 00:59:45,530 --> 00:59:47,780 We're going to use the same matrix we had before 1085 00:59:47,780 --> 00:59:50,210 and the same c. 1086 00:59:50,210 --> 00:59:52,910 The determinant is just the product 1087 00:59:52,910 --> 00:59:56,420 of those minus the product of those, so 1 minus negative 6. 1088 00:59:56,420 --> 00:59:58,550 So the determinant is 7. 1089 00:59:58,550 --> 01:00:01,410 So there is an inverse of this matrix. 1090 01:00:01,410 --> 01:00:03,810 And we can just write that down as follows. 1091 01:00:03,810 --> 01:00:05,990 Again, we've flipped the two diagonal entries and multiplied 1092 01:00:05,990 --> 01:00:07,850 the off-diagonal entries by minus 1. 1093 01:00:07,850 --> 01:00:13,550 So we can solve for x just by taking that inverse times c, 1094 01:00:13,550 --> 01:00:15,920 A inverse times c.
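Here is the same worked example as a MATLAB sketch. In practice you would let MATLAB compute the inverse, or better, use the backslash operator, rather than applying the two-by-two formula by hand.

```matlab
A = [1 -2;
     3  1];        % coefficients of x and y in the two equations
c = [3; 5];        % right-hand side

d = det(A);        % 1*1 - (-2)*3 = 7, nonzero, so the inverse exists

Ainv = inv(A);     % (1/7) * [1 2; -3 1]
x = Ainv * c;      % [13/7; -4/7]

x2 = A \ c;        % same solution, computed without forming inv(A) explicitly
```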
1095 01:00:15,920 --> 01:00:17,840 And if you multiply that out, you 1096 01:00:17,840 --> 01:00:19,418 see that there's the answer, x. 1097 01:00:19,418 --> 01:00:20,210 It's just a vector. 1098 01:00:24,680 --> 01:00:26,110 That's it. 1099 01:00:26,110 --> 01:00:31,400 That's how you solve a system of equations, all right? 1100 01:00:31,400 --> 01:00:33,970 Any questions about that? 1101 01:00:33,970 --> 01:00:43,590 So this process of solving systems of equations 1102 01:00:43,590 --> 01:00:49,250 and using matrices and their inverses 1103 01:00:49,250 --> 01:00:53,840 to solve systems of equations is a very important concept 1104 01:00:53,840 --> 01:00:55,820 that we're going to use over and over again. 1105 01:00:58,910 --> 01:01:01,040 All right, let's turn to the topic 1106 01:01:01,040 --> 01:01:03,660 of matrix transformations. 1107 01:01:03,660 --> 01:01:06,710 All right, so you can see from this problem of solving 1108 01:01:06,710 --> 01:01:12,100 this system of equations that that matrix A transformed 1109 01:01:12,100 --> 01:01:15,050 a vector x into a vector c, OK? 1110 01:01:15,050 --> 01:01:21,290 So we have this vector x, which was the vector 13/7, minus 4/7. 1111 01:01:21,290 --> 01:01:26,940 When we multiplied that by A, we got another vector, c. 1112 01:01:30,730 --> 01:01:34,960 And the matrix A inverse transforms this vector 1113 01:01:34,960 --> 01:01:38,320 c back into vector x, right? 1114 01:01:38,320 --> 01:01:44,170 So we can take that vector c, multiply it by A inverse, 1115 01:01:44,170 --> 01:01:46,420 and get back to x. 1116 01:01:46,420 --> 01:01:49,340 Does that make sense? 1117 01:01:49,340 --> 01:01:56,480 So, in general, a matrix A maps a set 1118 01:01:56,480 --> 01:01:59,630 of vectors in this whole space. 1119 01:01:59,630 --> 01:02:01,730 So if you have a two-by-two matrix, 1120 01:02:01,730 --> 01:02:08,620 it maps a set of vectors in R2 onto a different set 1121 01:02:08,620 --> 01:02:10,540 of vectors in R2. 1122 01:02:10,540 --> 01:02:12,820 So you can take any vector here-- 1123 01:02:12,820 --> 01:02:16,360 a vector from the origin into here-- 1124 01:02:16,360 --> 01:02:18,460 multiply that vector by A, and it gives you 1125 01:02:18,460 --> 01:02:20,800 a different vector. 1126 01:02:20,800 --> 01:02:23,220 And if you multiply that other vector by A inverse, 1127 01:02:23,220 --> 01:02:27,990 you go back to the original vector. 1128 01:02:27,990 --> 01:02:31,860 So this matrix A implements some kind 1129 01:02:31,860 --> 01:02:36,560 of transformation on this space of vectors 1130 01:02:36,560 --> 01:02:42,120 into a different space of vectors, OK? 1131 01:02:42,120 --> 01:02:46,120 And you can only do this inverse if the determinant of A 1132 01:02:46,120 --> 01:02:47,250 is not equal to zero. 1133 01:02:51,060 --> 01:02:55,260 So I just want to show you what different kinds of matrix 1134 01:02:55,260 --> 01:02:56,560 transformations look like. 1135 01:03:00,980 --> 01:03:04,810 So let's start with the simplest matrix transformation-- 1136 01:03:04,810 --> 01:03:06,260 the identity matrix. 1137 01:03:06,260 --> 01:03:09,130 So if we take a vector x, multiply it 1138 01:03:09,130 --> 01:03:12,710 by the identity matrix, you get another vector y, 1139 01:03:12,710 --> 01:03:15,350 which is equal to x.
1140 01:03:15,350 --> 01:03:18,650 So what we're going to do is we're going to kind of riff off 1141 01:03:18,650 --> 01:03:21,980 of a theme here, and we're going to take 1142 01:03:21,980 --> 01:03:26,400 slight perturbations of the identity matrix 1143 01:03:26,400 --> 01:03:30,990 and see what that new matrix does to a set of input vectors, 1144 01:03:30,990 --> 01:03:31,490 OK? 1145 01:03:31,490 --> 01:03:33,407 So let me show you how we're going to do that. 1146 01:03:33,407 --> 01:03:37,050 We're going to take the identity matrix 1, 0, 0, 1. 1147 01:03:37,050 --> 01:03:39,020 And we're going to add a little perturbation 1148 01:03:39,020 --> 01:03:40,085 to the diagonal elements. 1149 01:03:43,900 --> 01:03:47,700 And we're going to see what that does to a set of input vectors. 1150 01:03:47,700 --> 01:03:49,810 So let me show you what we're doing here. 1151 01:03:49,810 --> 01:03:51,540 We have each one of these red dots. 1152 01:03:51,540 --> 01:03:58,410 So what I did was I generated a bunch of random numbers 1153 01:03:58,410 --> 01:03:59,430 in a 2D space. 1154 01:03:59,430 --> 01:04:01,230 So this is a 2D space. 1155 01:04:01,230 --> 01:04:03,330 And I just randomly selected a bunch 1156 01:04:03,330 --> 01:04:07,320 of numbers, a bunch of points on that plane. 1157 01:04:07,320 --> 01:04:11,140 And each one of those is an input vector x. 1158 01:04:11,140 --> 01:04:13,360 And then I multiplied that vector 1159 01:04:13,360 --> 01:04:18,100 times this slightly perturbed identity matrix. 1160 01:04:22,030 --> 01:04:24,270 And then I get a bunch of output vectors y. 1161 01:04:24,270 --> 01:04:26,850 Input vectors x are the red dots. 1162 01:04:26,850 --> 01:04:31,800 The output vectors y are the other end of this blue line. 1163 01:04:31,800 --> 01:04:32,860 Does that make sense? 1164 01:04:32,860 --> 01:04:39,600 So for every vector x, multiplying it by this matrix 1165 01:04:39,600 --> 01:04:43,630 gives me another vector that's over here. 1166 01:04:43,630 --> 01:04:44,930 Does that make sense? 1167 01:04:44,930 --> 01:04:49,150 So you can see that what this matrix does 1168 01:04:49,150 --> 01:04:52,600 is it takes this space, this cloud of points, 1169 01:04:52,600 --> 01:04:56,900 and stretches them equally in all directions. 1170 01:04:56,900 --> 01:05:00,760 So it takes any vector and just makes it longer, 1171 01:05:00,760 --> 01:05:02,200 stretches it out. 1172 01:05:02,200 --> 01:05:04,240 No matter which direction it's pointing, 1173 01:05:04,240 --> 01:05:06,210 it just makes that vector slightly longer. 1174 01:05:09,510 --> 01:05:11,070 And here's that little bit of code 1175 01:05:11,070 --> 01:05:17,670 that I used to generate those vectors-- a sketch along those lines appears below. 1176 01:05:17,670 --> 01:05:19,310 OK, so let's take another example. 1177 01:05:19,310 --> 01:05:21,640 Let's say that we take the identity matrix 1178 01:05:21,640 --> 01:05:26,020 and we just add a little perturbation to one element 1179 01:05:26,020 --> 01:05:29,290 of the identity matrix, OK? 1180 01:05:29,290 --> 01:05:30,580 So what does that do? 1181 01:05:30,580 --> 01:05:37,400 It stretches the vectors out in the x direction, 1182 01:05:37,400 --> 01:05:40,540 but it doesn't do anything to the y direction. 1183 01:05:40,540 --> 01:05:45,200 So for a vector with a component in the x direction, 1184 01:05:45,200 --> 01:05:51,250 the x component gets increased by a factor of 1 plus delta.
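The code on the slide is not reproduced in the transcript, so here is a sketch along the same lines; details like the number of points and the size of delta are guesses. It generates random 2-D input vectors, multiplies each one by a perturbed identity matrix, and draws a line from each input point to its transformed output.

```matlab
npts = 50;                     % number of random input vectors (a guess)
X = randn(2, npts);            % input vectors as the columns of a 2-by-npts matrix

delta = 0.3;                   % size of the perturbation (a guess)
A = (1 + delta) * eye(2);      % perturbed identity: stretch equally in all directions
% A = [1+delta 0; 0 1];        % stretch along x only
% A = [1 0; 0 1+delta];        % stretch along y only

Y = A * X;                     % output vectors

figure; hold on;
plot(X(1,:), X(2,:), 'r.');                       % input points (red dots)
plot([X(1,:); Y(1,:)], [X(2,:); Y(2,:)], 'b-');   % blue line from each input to its output
axis equal;
```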
1185 01:05:51,250 --> 01:05:55,390 The components of each of these vectors in the y direction 1186 01:05:55,390 --> 01:05:57,720 don't change, all right? 1187 01:05:57,720 --> 01:05:59,670 So we're going to take this cloud of points, 1188 01:05:59,670 --> 01:06:02,610 and we're going to stretch it in the x direction. 1189 01:06:02,610 --> 01:06:05,540 What about this matrix here? 1190 01:06:05,540 --> 01:06:06,843 What's that going to do? 1191 01:06:06,843 --> 01:06:08,510 AUDIENCE: Stretch it in the y direction. 1192 01:06:08,510 --> 01:06:09,260 MICHALE FEE: Good. 1193 01:06:09,260 --> 01:06:12,346 It's going to stretch it out in the y direction. 1194 01:06:12,346 --> 01:06:13,392 Good. 1195 01:06:13,392 --> 01:06:14,350 So that's kind of cute. 1196 01:06:19,000 --> 01:06:22,480 And you can see that this earlier matrix that we looked 1197 01:06:22,480 --> 01:06:27,560 at right here stretches in the x direction 1198 01:06:27,560 --> 01:06:29,270 and stretches in the y direction. 1199 01:06:29,270 --> 01:06:32,960 And that's why that cloud of vectors 1200 01:06:32,960 --> 01:06:35,750 just stretched out equally in all directions. 1201 01:06:40,340 --> 01:06:42,580 How about this? 1202 01:06:42,580 --> 01:06:44,404 What is that going to do? 1203 01:06:44,404 --> 01:06:46,864 AUDIENCE: It would stretch in the x direction and compress 1204 01:06:46,864 --> 01:06:47,850 in the y direction. 1205 01:06:47,850 --> 01:06:49,410 MICHALE FEE: Right. 1206 01:06:49,410 --> 01:06:52,500 This perturbation here is making this component, 1207 01:06:52,500 --> 01:06:54,990 the x component, larger. 1208 01:06:54,990 --> 01:06:58,860 This perturbation here-- and delta here is small. 1209 01:06:58,860 --> 01:07:00,100 It's less than one. 1210 01:07:00,100 --> 01:07:03,930 Here, it's making the y component smaller. 1211 01:07:03,930 --> 01:07:06,600 And so what that looks like is the y component of each one 1212 01:07:06,600 --> 01:07:08,530 of these vectors gets smaller. 1213 01:07:08,530 --> 01:07:10,740 The x component gets larger. 1214 01:07:10,740 --> 01:07:13,830 And so we're squeezing in one direction 1215 01:07:13,830 --> 01:07:17,740 and stretching in the other direction. 1216 01:07:17,740 --> 01:07:22,040 Imagine we took a block of sponge 1217 01:07:22,040 --> 01:07:23,735 and we grabbed it and stretched it out, 1218 01:07:23,735 --> 01:07:25,235 and it gets skinny in this direction 1219 01:07:25,235 --> 01:07:28,700 and stretches out in that direction. 1220 01:07:28,700 --> 01:07:30,050 All right, that's kind of cool. 1221 01:07:32,750 --> 01:07:36,060 What is this going to do? 1222 01:07:36,060 --> 01:07:38,910 Here, I'm not making a small perturbation of this, 1223 01:07:38,910 --> 01:07:42,880 but I'm flipping the sign of one of those. 1224 01:07:42,880 --> 01:07:43,870 What happens there? 1225 01:07:43,870 --> 01:07:44,970 What is that going to do? 1226 01:07:48,470 --> 01:07:50,400 AUDIENCE: [INAUDIBLE] 1227 01:07:50,400 --> 01:07:51,330 MICHALE FEE: Good. 1228 01:07:51,330 --> 01:07:54,240 What do we call that? 1229 01:07:54,240 --> 01:07:57,190 There's a term for it. 1230 01:07:57,190 --> 01:08:02,400 What do you-- yeah, it's called a mirror reflection. 1231 01:08:02,400 --> 01:08:07,340 So every point that's on this side of the origin 1232 01:08:07,340 --> 01:08:10,370 gets reflected over to this side of the origin. 1233 01:08:10,370 --> 01:08:12,020 And every point that's over here-- 1234 01:08:12,020 --> 01:08:13,490 sorry, of this axis.
1235 01:08:13,490 --> 01:08:15,980 Every point that's on this side of the y-axis 1236 01:08:15,980 --> 01:08:19,740 gets reflected over to this side. 1237 01:08:19,740 --> 01:08:23,410 So that's called a mirror reflection. 1238 01:08:23,410 --> 01:08:24,518 What is this? 1239 01:08:24,518 --> 01:08:25,560 What is that going to do? 1240 01:08:35,430 --> 01:08:35,930 Abiba? 1241 01:08:35,930 --> 01:08:38,567 AUDIENCE: Reflect it [INAUDIBLE].. 1242 01:08:38,567 --> 01:08:39,359 MICHALE FEE: Right. 1243 01:08:39,359 --> 01:08:43,399 It's going to reflect it through the origin, like this. 1244 01:08:43,399 --> 01:08:46,229 So every point that's over here, on one side of the origin, 1245 01:08:46,229 --> 01:08:50,270 is going to reflect through to the other side. 1246 01:08:50,270 --> 01:08:52,450 That's pretty neat. 1247 01:08:52,450 --> 01:08:54,660 Inversion of the origin. 1248 01:08:54,660 --> 01:08:56,870 OK? 1249 01:08:56,870 --> 01:08:59,460 So we have symmetric perturbations 1250 01:08:59,460 --> 01:09:04,300 in the x and y components of the identity matrix. 1251 01:09:04,300 --> 01:09:10,200 We have a stretch transformation that stretches along one axis, 1252 01:09:10,200 --> 01:09:12,149 but not the other. 1253 01:09:12,149 --> 01:09:17,130 Stretch around the other axis, the y-axis, but not the x-axis. 1254 01:09:17,130 --> 01:09:21,120 Stretch along x and compression along y. 1255 01:09:21,120 --> 01:09:24,990 Mirror reflection through the y-axis. 1256 01:09:24,990 --> 01:09:27,870 Inversion through the origin. 1257 01:09:27,870 --> 01:09:31,740 These are examples of diagonal matrices, OK? 1258 01:09:31,740 --> 01:09:34,180 So the only thing we've done so far-- 1259 01:09:34,180 --> 01:09:36,779 we've gotten all these really cool transformations, 1260 01:09:36,779 --> 01:09:38,970 but the only thing we've done so far 1261 01:09:38,970 --> 01:09:40,905 are change these two diagonal elements. 1262 01:09:43,779 --> 01:09:46,510 So there's a lot more crazy stuff 1263 01:09:46,510 --> 01:09:51,310 to happen if we start messing with the other components. 1264 01:09:51,310 --> 01:09:55,540 Oh, and I should mention that we can invert 1265 01:09:55,540 --> 01:10:01,060 any one of these transformations that we just did by finding 1266 01:10:01,060 --> 01:10:03,020 the inverse of this matrix. 1267 01:10:03,020 --> 01:10:06,805 The inverse of a diagonal matrix is very simple to calculate. 1268 01:10:06,805 --> 01:10:10,015 It's just one over those diagonal elements. 1269 01:10:13,470 --> 01:10:14,580 All right, how about this? 1270 01:10:17,868 --> 01:10:18,910 What is that going to do? 1271 01:10:18,910 --> 01:10:19,802 Anybody? 1272 01:10:28,970 --> 01:10:30,980 When you take a vector and you multiply it 1273 01:10:30,980 --> 01:10:33,290 by that, what's going to happen? 1274 01:10:33,290 --> 01:10:36,800 This part is going to give you the original vector back. 1275 01:10:36,800 --> 01:10:41,330 This part is going to take a little bit of the y component 1276 01:10:41,330 --> 01:10:45,630 and add it to the x component. 1277 01:10:45,630 --> 01:10:47,450 So what does that do? 1278 01:10:47,450 --> 01:10:50,340 That produces what's known as a shear. 1279 01:10:50,340 --> 01:10:53,340 So points up here, we're going to take 1280 01:10:53,340 --> 01:10:57,700 a little bit of the y component and add it to the x component. 1281 01:10:57,700 --> 01:11:00,300 So if something has a big y component, 1282 01:11:00,300 --> 01:11:04,242 it's going to be shifted in x. 
1283 01:11:04,242 --> 01:11:06,710 If something has a negative y component, 1284 01:11:06,710 --> 01:11:08,670 it's going to shift this way in x. 1285 01:11:08,670 --> 01:11:10,440 If something has a positive y component, 1286 01:11:10,440 --> 01:11:12,500 it's going to shift this way an x. 1287 01:11:12,500 --> 01:11:16,050 And it's going to produce what's called a shear. 1288 01:11:16,050 --> 01:11:20,100 So we're pushing these points this way, 1289 01:11:20,100 --> 01:11:21,630 pushing those points this way. 1290 01:11:25,230 --> 01:11:29,760 Shear is very important in things like the flow of liquid. 1291 01:11:29,760 --> 01:11:32,700 So when you have liquid flowing over a surface, 1292 01:11:32,700 --> 01:11:37,620 you have forces, frictional forces to the liquid down here 1293 01:11:37,620 --> 01:11:39,750 that prevent it from moving. 1294 01:11:39,750 --> 01:11:42,550 Liquid up here moves more quickly, 1295 01:11:42,550 --> 01:11:48,250 and it produces a shear in the pattern of velocity profiles. 1296 01:11:48,250 --> 01:11:50,560 OK, that's pretty cool. 1297 01:11:50,560 --> 01:11:52,150 What about this? 1298 01:11:56,520 --> 01:11:58,750 It's going to just produce a shear 1299 01:11:58,750 --> 01:12:00,380 along the other direction. 1300 01:12:00,380 --> 01:12:01,300 That's right. 1301 01:12:01,300 --> 01:12:03,250 So now components that have a-- 1302 01:12:03,250 --> 01:12:07,960 vectors that have a large x component acquire 1303 01:12:07,960 --> 01:12:10,900 a negative projection in y. 1304 01:12:17,160 --> 01:12:19,920 OK, what does this look like? 1305 01:12:19,920 --> 01:12:20,800 It's pretty cool. 1306 01:12:30,600 --> 01:12:36,680 We're going to get some shear in this direction, 1307 01:12:36,680 --> 01:12:39,860 get some shear in this direction. 1308 01:12:39,860 --> 01:12:42,137 What's it going to do? 1309 01:12:42,137 --> 01:12:46,630 AUDIENCE: [INAUDIBLE] 1310 01:12:46,630 --> 01:12:47,420 MICHALE FEE: Good. 1311 01:12:47,420 --> 01:12:48,950 Good guess. 1312 01:12:48,950 --> 01:12:52,840 That's exactly right, produces a rotation. 1313 01:12:52,840 --> 01:12:55,290 Not exactly a rotation, but very close. 1314 01:13:01,470 --> 01:13:04,980 So that's how you actually produce a rotation. 1315 01:13:04,980 --> 01:13:10,000 So notice, for small angles theta, these are close to one, 1316 01:13:10,000 --> 01:13:13,140 so it's close to an identity matrix. 1317 01:13:13,140 --> 01:13:17,090 These are close to zero, but this is negative 1318 01:13:17,090 --> 01:13:20,970 and this is positive, or the other way around. 1319 01:13:20,970 --> 01:13:27,560 So if we have diagonals close to one and the off-diagonals one 1320 01:13:27,560 --> 01:13:31,640 positive and one negative, then that produces a rotation. 1321 01:13:31,640 --> 01:13:33,760 That, formally, is a rotation matrix. 1322 01:13:33,760 --> 01:13:34,560 Yes? 1323 01:13:34,560 --> 01:13:36,580 AUDIENCE: On the previous slide, is there 1324 01:13:36,580 --> 01:13:39,858 a reason you chose to represent the delta on the x-axis as 1325 01:13:39,858 --> 01:13:40,860 negative? 1326 01:13:40,860 --> 01:13:41,780 MICHALE FEE: No. 1327 01:13:41,780 --> 01:13:42,720 It goes either way. 1328 01:13:42,720 --> 01:13:45,600 So if you have a rotation angle that's positive, 1329 01:13:45,600 --> 01:13:48,590 then this is negative and this is positive. 1330 01:13:48,590 --> 01:13:50,840 If your rotation angle is the other sign, 1331 01:13:50,840 --> 01:13:55,520 then this is positive and this is negative. 
1332 01:13:55,520 --> 01:14:00,260 So, for example, if we want to produce a 45-degree rotation, 1333 01:14:00,260 --> 01:14:04,820 then we have 1, 1, minus 1, 1. 1334 01:14:04,820 --> 01:14:07,040 And of course, all of those entries have a factor 1335 01:14:07,040 --> 01:14:10,003 of 1 over the square root of 2 in them. 1336 01:14:10,003 --> 01:14:11,170 And so that looks like this. 1337 01:14:11,170 --> 01:14:14,180 So if you have, let's say, theta equals 10 degrees, 1338 01:14:14,180 --> 01:14:17,960 we can produce a 10-degree rotation of all the vectors. 1339 01:14:17,960 --> 01:14:20,180 If theta is 25 degrees, you can see 1340 01:14:20,180 --> 01:14:23,220 that the rotation is further. 1341 01:14:23,220 --> 01:14:25,560 Theta 45, that's this case right here. 1342 01:14:25,560 --> 01:14:28,560 You can see that you get a 45-degree rotation of all 1343 01:14:28,560 --> 01:14:31,440 of those vectors around the origin. 1344 01:14:31,440 --> 01:14:37,850 And if theta is 90 degrees, you can see that, OK? 1345 01:14:37,850 --> 01:14:38,660 Pretty cool, right? 1346 01:14:42,700 --> 01:14:46,880 OK, what is the inverse of this rotation matrix? 1347 01:14:46,880 --> 01:14:50,620 So if we have a rotation-- oh, and I just 1348 01:14:50,620 --> 01:14:53,140 want to point out one more thing. 1349 01:14:53,140 --> 01:14:55,780 In this formulation of the rotation matrix, 1350 01:14:55,780 --> 01:15:00,970 positive angles correspond to rotating counterclockwise. 1351 01:15:03,560 --> 01:15:07,640 Negative angles correspond to rotation 1352 01:15:07,640 --> 01:15:09,920 in the clockwise direction, OK? 1353 01:15:09,920 --> 01:15:11,660 So there's a big hint. 1354 01:15:11,660 --> 01:15:17,230 What is the inverse of our rotation matrix? 1355 01:15:17,230 --> 01:15:22,940 If we have a rotation of 10 degrees this way, 1356 01:15:22,940 --> 01:15:24,960 what is the inverse of that? 1357 01:15:24,960 --> 01:15:26,738 AUDIENCE: [INAUDIBLE] 1358 01:15:26,738 --> 01:15:27,530 MICHALE FEE: Right. 1359 01:15:27,530 --> 01:15:28,910 AUDIENCE: [INAUDIBLE] 1360 01:15:28,910 --> 01:15:30,290 MICHALE FEE: That's right. 1361 01:15:30,290 --> 01:15:35,870 Remember, matrix multiplication implements a transformation. 1362 01:15:35,870 --> 01:15:38,450 The inverse of that transformation 1363 01:15:38,450 --> 01:15:41,420 just takes you back where you were. 1364 01:15:41,420 --> 01:15:44,810 So if you have a rotation matrix that implements 1365 01:15:44,810 --> 01:15:47,750 a 20-degree rotation in the plus direction, 1366 01:15:47,750 --> 01:15:51,710 then the inverse of that is a 20-degree rotation 1367 01:15:51,710 --> 01:15:53,120 in the minus direction. 1368 01:15:53,120 --> 01:15:55,000 So the inverse of this matrix you 1369 01:15:55,000 --> 01:15:58,830 can get just by putting in a minus sign into the theta. 1370 01:15:58,830 --> 01:16:01,580 And you can see that cosine of minus theta 1371 01:16:01,580 --> 01:16:03,200 is just cosine of theta. 1372 01:16:03,200 --> 01:16:06,500 But sine of minus theta is negative sine of theta. 1373 01:16:09,770 --> 01:16:13,100 So the inverse of this matrix is just this. 1374 01:16:13,100 --> 01:16:15,450 You change the sign of those off-diagonal sine terms, 1375 01:16:15,450 --> 01:16:19,620 which just makes the shear go in the opposite direction, right? 1376 01:16:23,400 --> 01:16:26,680 OK, so a rotation by angle plus theta 1377 01:16:26,680 --> 01:16:29,590 followed by a rotation of angle minus theta 1378 01:16:29,590 --> 01:16:31,300 puts everything back where it was.
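Here is a short MATLAB check of these rotation-matrix facts, using the standard convention in which positive theta rotates counterclockwise (the 25-degree angle is just an example): the inverse of a rotation by theta is the rotation by minus theta, which is the same thing as the transpose.

```matlab
th = 25 * pi/180;                 % rotation angle in radians
R = [cos(th) -sin(th);
     sin(th)  cos(th)];           % rotates vectors counterclockwise by th

Rneg = [cos(-th) -sin(-th);
        sin(-th)  cos(-th)];      % rotation by -th

norm(inv(R) - Rneg)               % ~0: the inverse is the rotation the other way
norm(R' * R - eye(2))             % ~0: the transpose is also the inverse
```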
1379 01:16:31,300 --> 01:16:37,590 So rotation matrix phi of minus theta times phi of theta 1380 01:16:37,590 --> 01:16:39,370 is equal to the identity matrix. 1381 01:16:39,370 --> 01:16:41,790 So those two are inverses of each other. 1382 01:16:44,410 --> 01:16:47,860 And the inverse of a-- notice that the inverse 1383 01:16:47,860 --> 01:16:51,850 of this rotation matrix is also just the transpose 1384 01:16:51,850 --> 01:16:52,930 of the rotation matrix. 1385 01:16:56,550 --> 01:16:58,190 All right, so what you can see is 1386 01:16:58,190 --> 01:17:03,170 that these different cool transformations 1387 01:17:03,170 --> 01:17:07,490 that these matrix multiplications can do 1388 01:17:07,490 --> 01:17:11,870 are just examples of what our feed-forward network can do. 1389 01:17:11,870 --> 01:17:13,460 Because the feed-forward network 1390 01:17:13,460 --> 01:17:16,380 just implements matrix multiplication. 1391 01:17:16,380 --> 01:17:18,950 So this feed-forward network takes 1392 01:17:18,950 --> 01:17:21,890 a set of vectors, a set of input vectors, 1393 01:17:21,890 --> 01:17:26,060 and transforms them into a set of output vectors, all right? 1394 01:17:26,060 --> 01:17:29,510 And you can understand what that transformation does just 1395 01:17:29,510 --> 01:17:32,060 by understanding the different kinds of transformations 1396 01:17:32,060 --> 01:17:37,550 you can get from matrix multiplication. 1397 01:17:37,550 --> 01:17:40,390 All right, we'll continue next time.