1 00:00:16,313 --> 00:00:16,980 MICHALE FEE: OK. 2 00:00:16,980 --> 00:00:18,730 All right, let's go ahead and get started. 3 00:00:18,730 --> 00:00:21,690 OK, so we're going to continue talking 4 00:00:21,690 --> 00:00:25,380 about the topic of neural networks. 5 00:00:25,380 --> 00:00:28,620 Last time, we introduced a new framework 6 00:00:28,620 --> 00:00:33,330 for thinking about neural network interactions, 7 00:00:33,330 --> 00:00:37,740 using a rate model to describe the interactions of neurons 8 00:00:37,740 --> 00:00:41,370 and develop a mathematical framework for how 9 00:00:41,370 --> 00:00:43,080 to combine collections of neurons 10 00:00:43,080 --> 00:00:46,300 to study their behavior. 11 00:00:46,300 --> 00:00:50,760 So, last time, we introduced the notion of a perceptron 12 00:00:50,760 --> 00:00:54,180 as a way of building a neural network that 13 00:00:54,180 --> 00:00:57,510 can classify its inputs. 14 00:00:57,510 --> 00:01:03,300 And we started talking about the notion of a perceptron learning 15 00:01:03,300 --> 00:01:06,480 rule, and we're going to flesh that idea out 16 00:01:06,480 --> 00:01:08,580 in more detail today. 17 00:01:08,580 --> 00:01:12,450 We're going to then talk about the idea of using networks 18 00:01:12,450 --> 00:01:17,250 to perform logic with neurons. 19 00:01:17,250 --> 00:01:19,800 We're going to talk about the idea of linear separability 20 00:01:19,800 --> 00:01:21,480 and invariance. 21 00:01:21,480 --> 00:01:24,240 Then we're going to introduce more complex 22 00:01:24,240 --> 00:01:26,190 feed-forward networks, where instead 23 00:01:26,190 --> 00:01:28,260 of having a single output neuron, 24 00:01:28,260 --> 00:01:32,040 we have multiple output neurons. 25 00:01:32,040 --> 00:01:37,800 Then we're going to turn to a more fully developed view 26 00:01:37,800 --> 00:01:40,890 of the math that we use to describe neural networks, 27 00:01:40,890 --> 00:01:45,450 and matrix operations become extremely important 28 00:01:45,450 --> 00:01:50,330 in neural network theory. 29 00:01:50,330 --> 00:01:51,960 And then, finally, we're going to turn 30 00:01:51,960 --> 00:01:55,110 to some of the kinds of transformations that 31 00:01:55,110 --> 00:01:58,290 are performed by matrix multiplication 32 00:01:58,290 --> 00:02:03,080 and by the kinds of-- by feed-forward neural networks. 33 00:02:03,080 --> 00:02:08,160 OK, so we've been considering a kind of neural network called 34 00:02:08,160 --> 00:02:12,065 a rate model that uses firing rates rather than spike trains. 35 00:02:12,065 --> 00:02:13,440 So we introduced the idea that we 36 00:02:13,440 --> 00:02:16,560 have an output neuron with firing rate 37 00:02:16,560 --> 00:02:19,830 v that receives input from an input neuron that 38 00:02:19,830 --> 00:02:21,530 has firing rate u. 39 00:02:21,530 --> 00:02:24,270 The input neuron synapses onto the output neuron 40 00:02:24,270 --> 00:02:26,490 with a synapse of weight w. 41 00:02:26,490 --> 00:02:29,010 And we described how we can think 42 00:02:29,010 --> 00:02:34,110 of the input neuron producing a synaptic input into the output 43 00:02:34,110 --> 00:02:39,600 neuron that has a magnitude of the firing 44 00:02:39,600 --> 00:02:42,350 rate times the strength of the synaptic connection. 45 00:02:42,350 --> 00:02:48,550 So the input to the output neuron here is w times u. 
46 00:02:48,550 --> 00:02:53,050 And then we talked about how we can convert that input current, 47 00:02:53,050 --> 00:02:55,330 let's say, into our output neuron 48 00:02:55,330 --> 00:02:59,380 into a firing rate of the output neuron through some function 49 00:02:59,380 --> 00:03:05,050 f, which is what's called the F-I curve of the neuron that 50 00:03:05,050 --> 00:03:08,920 relates the input to the firing rate of the neuron. 51 00:03:08,920 --> 00:03:11,260 And we talked about several different kinds 52 00:03:11,260 --> 00:03:15,850 of F-I firing rate versus input functions that can be useful. 53 00:03:15,850 --> 00:03:20,950 We then extended our network from a single input neuron 54 00:03:20,950 --> 00:03:22,960 synapsing onto a single output neuron 55 00:03:22,960 --> 00:03:26,290 by having multiple input neurons. 56 00:03:26,290 --> 00:03:29,680 Again, the output neuron has a firing rate, 57 00:03:29,680 --> 00:03:34,090 and our input neurons have a vector of firing rates now-- 58 00:03:34,090 --> 00:03:37,800 u1, u2, u3, u4, and so on-- 59 00:03:37,800 --> 00:03:42,940 that we can combine together into a vector, u. 60 00:03:42,940 --> 00:03:47,180 Each one of those input neurons has a synaptic strength w 61 00:03:47,180 --> 00:03:48,470 onto our output neuron. 62 00:03:48,470 --> 00:03:51,580 So we have a vector of synaptic strengths. 63 00:03:51,580 --> 00:03:56,590 And now we can write down the input current to our output 64 00:03:56,590 --> 00:04:00,100 neuron as a sum of the contributions from each 65 00:04:00,100 --> 00:04:07,150 of those input neurons-- so w1 u1 plus w2 u2 plus w3 u3, 66 00:04:07,150 --> 00:04:08,980 and so on. 67 00:04:08,980 --> 00:04:12,100 So we can now write the input current 68 00:04:12,100 --> 00:04:16,810 to our output neuron as a sum of contributions 69 00:04:16,810 --> 00:04:18,850 that we can then write as a dot product-- 70 00:04:18,850 --> 00:04:21,540 w dot u. 71 00:04:21,540 --> 00:04:22,930 OK, any questions about that? 72 00:04:27,480 --> 00:04:30,570 And so, in general, we have the firing rate of our output 73 00:04:30,570 --> 00:04:32,970 neuron is just this F-I function, 74 00:04:32,970 --> 00:04:37,500 this input-output function of our output neuron acting 75 00:04:37,500 --> 00:04:41,023 on the total input, which is w dot u. 76 00:04:41,023 --> 00:04:42,690 And then we talked about different kinds 77 00:04:42,690 --> 00:04:46,770 of functions that are useful computationally 78 00:04:46,770 --> 00:04:47,820 for this function f. 79 00:04:47,820 --> 00:04:51,060 So in the context of the integrate and fire neuron, 80 00:04:51,060 --> 00:04:56,440 we talked about F-I curves that are zero below some threshold 81 00:04:56,440 --> 00:05:01,350 and then are linear above that threshold current. 82 00:05:01,350 --> 00:05:05,640 We talked last time about a binary threshold neuron 83 00:05:05,640 --> 00:05:08,140 that has zero firing rate below some threshold 84 00:05:08,140 --> 00:05:12,390 and then steps up abruptly to a constant output firing rate 85 00:05:12,390 --> 00:05:14,280 of one. 86 00:05:14,280 --> 00:05:16,560 And then we also introduced, last time, the notion 87 00:05:16,560 --> 00:05:19,050 of a linear neuron, whose firing rate is 88 00:05:19,050 --> 00:05:21,600 just proportional to the input current 89 00:05:21,600 --> 00:05:24,300 and has positive and negative firing rates. 
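To make that concrete, here is a minimal sketch (in Python/NumPy; the language, function names, and numbers are illustrative choices, not from the lecture) of the three F-I curves just described and of computing the output rate v = F(w · u) for a neuron with several inputs.

```python
import numpy as np

# Three F-I curves of the kind described above (illustrative implementations).
def threshold_linear(I, theta=1.0):
    # Zero below the threshold, linear above it.
    return np.maximum(I - theta, 0.0)

def binary_threshold(I, theta=1.0):
    # Zero below the threshold, steps abruptly to a firing rate of one above it.
    return (I > theta).astype(float)

def linear(I):
    # Firing rate proportional to input; can be negative (a useful simplification).
    return I

# One output neuron receiving a vector of input firing rates u through
# a vector of synaptic weights w: the total input current is the dot product.
u = np.array([2.0, 0.5, 1.0, 0.0])   # input firing rates u1..u4
w = np.array([0.5, 1.0, -0.3, 0.2])  # synaptic weights onto the output neuron

I_total = np.dot(w, u)               # input current = w . u  (here 1.2)
print(binary_threshold(I_total))     # output firing rate v = F(w . u) -> 1.0
```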
90 00:05:24,300 --> 00:05:26,450 And we talked about the idea that although it's 91 00:05:26,450 --> 00:05:28,860 biophysically implausible to have neurons 92 00:05:28,860 --> 00:05:31,650 that have negative firing rates, that this 93 00:05:31,650 --> 00:05:35,040 is a particularly useful simplification of neurons. 94 00:05:35,040 --> 00:05:39,990 Because we can just use linear algebra 95 00:05:39,990 --> 00:05:44,440 to describe the properties of networks of linear neurons. 96 00:05:44,440 --> 00:05:46,980 And we can do some really interesting things 97 00:05:46,980 --> 00:05:52,270 with that kind of mathematical simplification. 98 00:05:52,270 --> 00:05:54,870 We're going to get to some of that today. 99 00:05:54,870 --> 00:05:57,420 And that allows you to really build 100 00:05:57,420 --> 00:06:02,750 an intuition for what neural networks can do. 101 00:06:02,750 --> 00:06:08,570 OK, so let's come back to what a perceptron is and introduce 102 00:06:08,570 --> 00:06:11,820 this perceptron learning rule. 103 00:06:11,820 --> 00:06:14,690 So we talked about the idea that a perceptron carries out 104 00:06:14,690 --> 00:06:17,510 a classification of its inputs that 105 00:06:17,510 --> 00:06:18,860 represent different features. 106 00:06:18,860 --> 00:06:22,580 So we talked about classifying animals into dogs and non-dogs 107 00:06:22,580 --> 00:06:27,120 based on two features of animals. 108 00:06:27,120 --> 00:06:30,110 We talked about the fact that you 109 00:06:30,110 --> 00:06:34,160 can't make that classification between dogs and non-dogs 110 00:06:34,160 --> 00:06:36,350 just on the basis of one of those features, 111 00:06:36,350 --> 00:06:40,580 because these two categories overlap in this feature 112 00:06:40,580 --> 00:06:42,060 and in this feature. 113 00:06:42,060 --> 00:06:44,960 And so in order to properly separate those categories, 114 00:06:44,960 --> 00:06:47,780 you need a decision boundary that's 115 00:06:47,780 --> 00:06:52,280 actually a combination of those two features. 116 00:06:52,280 --> 00:06:54,290 And we talked about how you can implement 117 00:06:54,290 --> 00:06:57,790 that using a simple network, called 118 00:06:57,790 --> 00:07:02,570 a perceptron, that has an output neuron and two input neurons. 119 00:07:02,570 --> 00:07:06,320 Each one of those input neurons represents the magnitude 120 00:07:06,320 --> 00:07:10,070 of those two different features for each object 121 00:07:10,070 --> 00:07:13,220 that you're trying to classify. 122 00:07:13,220 --> 00:07:19,580 So u1 here and u2 are the dimensions on which we're 123 00:07:19,580 --> 00:07:24,100 performing this classification. 124 00:07:24,100 --> 00:07:28,840 And so we talked about the fact that that decision boundary 125 00:07:28,840 --> 00:07:31,990 between those two classifications 126 00:07:31,990 --> 00:07:35,470 is determined by this weight vector w. 127 00:07:35,470 --> 00:07:37,810 And then we used a binary threshold neuron 128 00:07:37,810 --> 00:07:39,700 for making the actual decision. 129 00:07:39,700 --> 00:07:42,370 Binary threshold neurons are great for making decisions, 130 00:07:42,370 --> 00:07:46,540 because unlike a linear neuron-- so a linear neuron just 131 00:07:46,540 --> 00:07:48,850 responds more if its input is larger, 132 00:07:48,850 --> 00:07:51,940 and it responds less if its input is smaller. 
133 00:07:51,940 --> 00:07:57,220 Binary threshold neurons have a very clear threshold 134 00:07:57,220 --> 00:07:59,380 below which the neuron doesn't spike 135 00:07:59,380 --> 00:08:01,480 and above which the neuron does spike. 136 00:08:01,480 --> 00:08:04,300 So, in this case, this network, this output neuron here, 137 00:08:04,300 --> 00:08:07,420 will fire, will have a firing rate of one, 138 00:08:07,420 --> 00:08:11,530 for any input that's on this side of the decision boundary 139 00:08:11,530 --> 00:08:13,510 and will have a firing rate of zero 140 00:08:13,510 --> 00:08:16,940 for any input that's on this side of the decision boundary, 141 00:08:16,940 --> 00:08:19,570 OK? 142 00:08:19,570 --> 00:08:24,560 All right, so we talked about how we can, in two dimensions, 143 00:08:24,560 --> 00:08:28,940 just write down a decision boundary that will separate, 144 00:08:28,940 --> 00:08:32,870 let's say, green objects from red objects. 145 00:08:32,870 --> 00:08:36,409 So you can see that if you sat down 146 00:08:36,409 --> 00:08:39,770 and you looked at this drawing of green dots and red dots, 147 00:08:39,770 --> 00:08:43,309 that it would be very simple to just look at that picture 148 00:08:43,309 --> 00:08:46,010 and see that if you put a decision boundary right 149 00:08:46,010 --> 00:08:49,910 there, that you would be able to separate the green dots 150 00:08:49,910 --> 00:08:51,350 from the red dots. 151 00:08:51,350 --> 00:08:54,470 How would you actually calculate the weight vector 152 00:08:54,470 --> 00:08:57,030 that that corresponds to in a perceptron? 153 00:08:57,030 --> 00:08:59,100 Well, it's very simple. 154 00:08:59,100 --> 00:09:02,300 You can just look at where that decision boundary crosses 155 00:09:02,300 --> 00:09:04,220 the axes-- 156 00:09:04,220 --> 00:09:07,190 so you can see here, that decision boundary crosses 157 00:09:07,190 --> 00:09:13,080 the u1 axis at point A, crosses the u2 axis at, I should say, 158 00:09:13,080 --> 00:09:17,840 a value of B. And then we can use those numbers to actually 159 00:09:17,840 --> 00:09:19,100 calculate the w. 160 00:09:19,100 --> 00:09:21,950 So, remember, u is the input space. 161 00:09:21,950 --> 00:09:24,230 w is a weight vector that we're trying 162 00:09:24,230 --> 00:09:27,020 to calculate in order to place the decision 163 00:09:27,020 --> 00:09:28,070 boundary at that point. 164 00:09:28,070 --> 00:09:32,380 Is that clear what we're trying to do here? 165 00:09:32,380 --> 00:09:35,220 OK, so we can calculate that weight vector. 166 00:09:35,220 --> 00:09:37,710 We assume that theta is just some number. 167 00:09:37,710 --> 00:09:39,840 Let's just call it one. 168 00:09:39,840 --> 00:09:44,760 We have an equation for a line-- w dot u equals theta. 169 00:09:44,760 --> 00:09:47,910 That's the equation for that decision boundary. 170 00:09:47,910 --> 00:09:52,080 We have two knowns, the two points on the decision boundary 171 00:09:52,080 --> 00:09:53,960 that we can just read off by eye. 172 00:09:53,960 --> 00:09:58,020 And we have two unknowns-- the synaptic weights, w1 and w2. 173 00:09:58,020 --> 00:10:00,510 And so we have two equations-- 174 00:10:00,510 --> 00:10:06,020 ua dot w equals theta, ub dot w equals theta. 175 00:10:06,020 --> 00:10:08,400 And we can just solve for w1 and w2, 176 00:10:08,400 --> 00:10:10,470 and that's what you get, OK? 177 00:10:10,470 --> 00:10:13,560 So the weight vector that gives you that decision boundary 178 00:10:13,560 --> 00:10:17,040 is 1 over a and 1 over b, OK? 
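As a sketch of that little calculation (with theta set to one, as in the lecture; the particular crossing points a and b below are made up): the boundary w · u = theta crosses the u1 axis at (a, 0) and the u2 axis at (0, b), which gives the two equations w1 a = theta and w2 b = theta, and hence w1 = 1/a and w2 = 1/b.

```python
import numpy as np

theta = 1.0       # threshold; the lecture just calls it "some number", here one
a, b = 2.0, 4.0   # hypothetical axis crossings of the decision boundary

# Two equations, two unknowns: w . (a, 0) = theta and w . (0, b) = theta.
A = np.array([[a, 0.0],
              [0.0, b]])
w = np.linalg.solve(A, np.array([theta, theta]))
print(w)   # -> [0.5, 0.25], i.e. [1/a, 1/b]

# Check: both crossing points lie exactly on the boundary w . u = theta.
print(np.dot(w, [a, 0.0]), np.dot(w, [0.0, b]))   # -> 1.0 1.0
```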
179 00:10:17,040 --> 00:10:18,480 Those are the two weights. 180 00:10:18,480 --> 00:10:21,700 Any questions about that? 181 00:10:21,700 --> 00:10:23,460 OK. 182 00:10:23,460 --> 00:10:27,630 So in two dimensions, that's very easy to do, right? 183 00:10:27,630 --> 00:10:31,350 You can just look at that cloud of points, 184 00:10:31,350 --> 00:10:34,590 decide where to draw a line that best separates the two 185 00:10:34,590 --> 00:10:37,230 categories that you're interested in separating. 186 00:10:37,230 --> 00:10:40,870 But in higher dimensions, that's really hard. 187 00:10:40,870 --> 00:10:44,250 So in high dimensions, for example, 188 00:10:44,250 --> 00:10:47,720 we're trying to separate images, for example. 189 00:10:47,720 --> 00:10:49,980 So we can have a bunch of images of dogs, 190 00:10:49,980 --> 00:10:51,870 a bunch of images of cats. 191 00:10:51,870 --> 00:10:54,030 Each pixel in that image corresponds 192 00:10:54,030 --> 00:10:56,910 to a different input to our classification unit. 193 00:10:56,910 --> 00:11:00,960 And now how do you decide what all of those weights 194 00:11:00,960 --> 00:11:03,180 should be from all of those different pixels 195 00:11:03,180 --> 00:11:08,760 onto our output neuron that separates images of one class 196 00:11:08,760 --> 00:11:10,720 from images of another class? 197 00:11:10,720 --> 00:11:14,640 So there's just no way to do that by eye in high dimensions. 198 00:11:14,640 --> 00:11:17,460 So you need an algorithm that helps 199 00:11:17,460 --> 00:11:20,130 you choose that set of weights that allows you 200 00:11:20,130 --> 00:11:22,840 to separate different classes-- 201 00:11:22,840 --> 00:11:25,740 you know, a bunch of images of one class from a bunch 202 00:11:25,740 --> 00:11:28,500 of images of another class. 203 00:11:28,500 --> 00:11:33,540 And so we're going to introduce a method called 204 00:11:33,540 --> 00:11:40,710 the perceptron learning rule that is a category of learning 205 00:11:40,710 --> 00:11:47,910 rules called supervised learning rules that allow you to take 206 00:11:47,910 --> 00:11:51,660 a bunch of objects that you know-- so if you 207 00:11:51,660 --> 00:11:53,160 have a bunch of pictures of dogs, 208 00:11:53,160 --> 00:11:54,385 you know that they're dogs. 209 00:11:54,385 --> 00:11:57,010 If you have a bunch of pictures of cats, you know they're cats. 210 00:11:57,010 --> 00:11:58,920 So you label those images. 211 00:11:58,920 --> 00:12:03,780 You feed those inputs, those images, into your network, 212 00:12:03,780 --> 00:12:06,870 and you tell the network what the answer was. 213 00:12:06,870 --> 00:12:09,420 And through an iterative process, 214 00:12:09,420 --> 00:12:13,410 it finds all of the weights that optimally separate those two 215 00:12:13,410 --> 00:12:14,740 different categories. 216 00:12:14,740 --> 00:12:16,800 So that's called the perceptron learning rule. 217 00:12:16,800 --> 00:12:19,240 So let me just set up how that actually works. 218 00:12:19,240 --> 00:12:22,690 So you have a bunch of observations of the input. 
219 00:12:22,690 --> 00:12:25,960 So in this case, I'm drawing these in two dimensions, 220 00:12:25,960 --> 00:12:28,560 but you should think about each one of these dots as being, 221 00:12:28,560 --> 00:12:32,520 let's say, an image of a dog in very high dimensions, 222 00:12:32,520 --> 00:12:37,920 where instead of just u1 and u2, you have u1 through u1000, 223 00:12:37,920 --> 00:12:41,280 where each one of those is the value of a different pixel 224 00:12:41,280 --> 00:12:44,190 in your image. 225 00:12:44,190 --> 00:12:46,170 So you have a bunch of images. 226 00:12:46,170 --> 00:12:50,220 Each one of those corresponds to an image of a dog. 227 00:12:50,220 --> 00:12:53,610 Each one of those corresponds to an image of a cat. 228 00:12:53,610 --> 00:12:56,280 And we have a whole bunch of different observations 229 00:12:56,280 --> 00:12:59,610 or images of those different categories. 230 00:12:59,610 --> 00:13:00,720 Any questions about that? 231 00:13:03,800 --> 00:13:06,840 All right, so we have n of those observations. 232 00:13:06,840 --> 00:13:08,880 And for each one of those observations, 233 00:13:08,880 --> 00:13:12,735 we say that the input is equal to one 234 00:13:12,735 --> 00:13:15,930 of those observations for one iteration of this learning 235 00:13:15,930 --> 00:13:17,410 process, OK? 236 00:13:17,410 --> 00:13:19,860 And so with each observation, we're 237 00:13:19,860 --> 00:13:21,810 told whether this input corresponds 238 00:13:21,810 --> 00:13:25,740 to one category or another, so a dog or a non-dog. 239 00:13:25,740 --> 00:13:27,960 And our output, we're asking-- 240 00:13:27,960 --> 00:13:30,240 we want to choose this set of weights 241 00:13:30,240 --> 00:13:32,640 such that the output of our network 242 00:13:32,640 --> 00:13:37,680 is equal to some known value. 243 00:13:37,680 --> 00:13:43,410 So t sub i, where if it's a dog, then the answer is one for yes. 244 00:13:43,410 --> 00:13:48,450 If it's a non-dog, the answer is zero for no, that's not a dog. 245 00:13:48,450 --> 00:13:52,050 And we have n of those answers. 246 00:13:52,050 --> 00:13:56,760 We have n images and labels that tell us what category 247 00:13:56,760 --> 00:13:59,400 that image belongs to. 248 00:13:59,400 --> 00:14:01,380 So for all of these, t equals one. 249 00:14:01,380 --> 00:14:03,300 For all of these, t equals zero. 250 00:14:03,300 --> 00:14:05,400 And we want to find a set of weights 251 00:14:05,400 --> 00:14:10,020 such that when we take the dot product of that weight vector 252 00:14:10,020 --> 00:14:17,970 with each one of those observations minus theta 253 00:14:17,970 --> 00:14:23,340 that we get an answer that is equal to t 254 00:14:23,340 --> 00:14:25,830 for each observation. 255 00:14:25,830 --> 00:14:28,360 Does that make sense? 256 00:14:28,360 --> 00:14:31,240 So how do we do that? 257 00:14:31,240 --> 00:14:37,240 All right, so each observation, we have two things-- 258 00:14:37,240 --> 00:14:41,450 the input and the desired output. 259 00:14:41,450 --> 00:14:43,150 And that gives us information that we 260 00:14:43,150 --> 00:14:45,920 can use to construct this weight vector. 261 00:14:45,920 --> 00:14:48,110 So, again, that's called supervised learning. 
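A minimal sketch of that supervised setup (everything here, from the toy numbers to the function names, is invented for illustration): n labeled observations u_i with targets t_i of one or zero, and a check of whether a candidate weight vector and threshold reproduce every label.

```python
import numpy as np

# Toy labeled data set: each row is one observation (here 2-D, but it could
# just as well be thousands of pixel values per image).
U = np.array([[2.0, 2.5],   # "dog" examples, label t = 1
              [2.5, 2.0],
              [0.5, 0.8],   # "non-dog" examples, label t = 0
              [0.8, 0.3]])
t = np.array([1, 1, 0, 0])

def classify(w, u, theta=1.0):
    # Binary threshold output neuron: 1 if w . u exceeds theta, otherwise 0.
    return int(np.dot(w, u) > theta)

w_candidate = np.array([0.4, 0.4])
outputs = np.array([classify(w_candidate, u) for u in U])
print(outputs, (outputs == t).all())   # does this w reproduce every label t_i?
```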
262 00:14:48,110 --> 00:14:52,300 And we're going to use an update rule, or a learning rule, 263 00:14:52,300 --> 00:14:54,490 that allows us to change the weight 264 00:14:54,490 --> 00:14:58,180 vector as a result of each estimate, 265 00:14:58,180 --> 00:15:01,030 depending on whether we got the answer right or not. 266 00:15:01,030 --> 00:15:02,370 So how do we do this? 267 00:15:02,370 --> 00:15:03,912 What we're going to do is we're going 268 00:15:03,912 --> 00:15:08,110 to start with a random set of weights, w1 and w2, OK? 269 00:15:08,110 --> 00:15:11,580 And we're going to put in an input. 270 00:15:11,580 --> 00:15:13,255 So there's a space of inputs. 271 00:15:13,255 --> 00:15:15,130 We're going to start with some random weight, 272 00:15:15,130 --> 00:15:18,230 and I started with some random vector in this direction. 273 00:15:18,230 --> 00:15:21,920 You can see that that gives you a classification boundary here. 274 00:15:21,920 --> 00:15:24,340 And you can see that that classification boundary is not 275 00:15:24,340 --> 00:15:27,290 very good for separating the green dots from the red dots. 276 00:15:27,290 --> 00:15:27,790 Why? 277 00:15:27,790 --> 00:15:31,060 Because it will assign a one to everything 278 00:15:31,060 --> 00:15:33,580 on this side of that decision boundary and a zero 279 00:15:33,580 --> 00:15:35,103 to everything on that side. 280 00:15:35,103 --> 00:15:36,520 But you can see that that does not 281 00:15:36,520 --> 00:15:39,250 correspond to the assignment of green and red 282 00:15:39,250 --> 00:15:41,200 to each of those dots, OK? 283 00:15:41,200 --> 00:15:47,523 So how do we update that w in order to get the right answer? 284 00:15:47,523 --> 00:15:48,940 So what we're going to do is we're 285 00:15:48,940 --> 00:15:53,710 going to put in one of these inputs on each iteration 286 00:15:53,710 --> 00:15:57,520 and ask whether the network got the answer right or not. 287 00:15:57,520 --> 00:16:02,610 So we're going to put in one of those inputs. 288 00:16:02,610 --> 00:16:05,140 So let's pick that input right there. 289 00:16:05,140 --> 00:16:07,190 We're going to put that into our network. 290 00:16:07,190 --> 00:16:09,730 And we see that the answer we get from the network 291 00:16:09,730 --> 00:16:14,770 is one, because it's on the positive side of the decision 292 00:16:14,770 --> 00:16:15,560 boundary. 293 00:16:15,560 --> 00:16:19,060 And so one was the right answer in this case. 294 00:16:19,060 --> 00:16:19,840 So what do we do? 295 00:16:19,840 --> 00:16:20,890 We don't do anything. 296 00:16:20,890 --> 00:16:25,270 We say the change in weight is going to be zero if we already 297 00:16:25,270 --> 00:16:26,940 get the right answer. 298 00:16:26,940 --> 00:16:29,560 So if we got lucky and our initial weight vector 299 00:16:29,560 --> 00:16:32,260 was in the right direction, so our perceptron 300 00:16:32,260 --> 00:16:34,398 already classified the answer, then 301 00:16:34,398 --> 00:16:36,190 the weight vector is never going to change, 302 00:16:36,190 --> 00:16:39,400 because it was already the right answer. 303 00:16:39,400 --> 00:16:41,690 OK, so let's put it in another input-- 304 00:16:41,690 --> 00:16:42,580 a red input. 305 00:16:42,580 --> 00:16:45,970 You can see that the correct answer is a zero. 306 00:16:45,970 --> 00:16:47,950 The network gave us a zero, because it's 307 00:16:47,950 --> 00:16:53,380 on the negative side of the weight vector of the decision 308 00:16:53,380 --> 00:16:54,380 boundary. 
309 00:16:54,380 --> 00:16:56,530 And so, again, delta w is zero. 310 00:16:56,530 --> 00:16:58,780 But let's put in another input now such 311 00:16:58,780 --> 00:17:01,420 that we get the wrong answer. 312 00:17:01,420 --> 00:17:03,580 So let's put in this input right here. 313 00:17:03,580 --> 00:17:06,339 So you can see that the answer here, the correct answer 314 00:17:06,339 --> 00:17:12,339 is one, but the network is going to give us a zero. 315 00:17:12,339 --> 00:17:16,470 So what do we do to update that weight vector? 316 00:17:16,470 --> 00:17:19,329 So if the output is not equal to the correct answer, 317 00:17:19,329 --> 00:17:20,150 then we're wrong. 318 00:17:20,150 --> 00:17:22,000 So now we update w. 319 00:17:22,000 --> 00:17:26,140 And the perceptron learning rule is very simple. 320 00:17:26,140 --> 00:17:30,770 We introduce a change in w that looks like this. 321 00:17:30,770 --> 00:17:35,620 It's a little change, so eta is a learning rate. 322 00:17:35,620 --> 00:17:39,250 It's generally going to be smaller than one. 323 00:17:39,250 --> 00:17:43,510 So we're going to put in a small change in w that's 324 00:17:43,510 --> 00:17:47,440 in the direction of the input that was wrong 325 00:17:47,440 --> 00:17:51,580 if the correct answer is a one. 326 00:17:51,580 --> 00:17:53,800 We're going to make a small change 327 00:17:53,800 --> 00:17:57,910 to w in the opposite direction of that input 328 00:17:57,910 --> 00:18:00,940 if the correct answer was zero. 329 00:18:00,940 --> 00:18:02,120 Does that make sense? 330 00:18:02,120 --> 00:18:06,430 So we're going to change w in a way that 331 00:18:06,430 --> 00:18:11,930 depends on what the input was and what 332 00:18:11,930 --> 00:18:13,550 the correct answer was. 333 00:18:16,970 --> 00:18:18,200 So let's walk through this. 334 00:18:18,200 --> 00:18:21,200 So we put it in an input here. 335 00:18:21,200 --> 00:18:25,130 The correct answer is a one, and we got the answer wrong. 336 00:18:25,130 --> 00:18:28,400 The network gave us a zero, but the correct answer is a one. 337 00:18:28,400 --> 00:18:31,880 So we're in this region here. 338 00:18:31,880 --> 00:18:35,090 The answer was incorrect, so we're going to update w. 339 00:18:35,090 --> 00:18:38,300 The correct answer was a one, so we're going to change delta-- 340 00:18:38,300 --> 00:18:42,760 we're going to change w in the direction of that input. 341 00:18:42,760 --> 00:18:43,760 So that input is there. 342 00:18:43,760 --> 00:18:50,530 So we're going to add a little bit to w in this direction. 343 00:18:50,530 --> 00:18:53,970 So if we add that little bit of vector to the w, 344 00:18:53,970 --> 00:18:58,280 it's going to move the w vector in this direction, right? 345 00:18:58,280 --> 00:18:59,590 So let's do that. 346 00:18:59,590 --> 00:19:02,160 So there's our new w. 347 00:19:02,160 --> 00:19:05,310 Our new w is the old w plus delta w, 348 00:19:05,310 --> 00:19:10,200 which is in the direction of this incorrectly 349 00:19:10,200 --> 00:19:11,880 classified input. 350 00:19:11,880 --> 00:19:16,470 So there's our new decision boundary, all right? 351 00:19:16,470 --> 00:19:18,340 And let's put in another input-- 352 00:19:18,340 --> 00:19:20,490 let's say this one right here. 353 00:19:20,490 --> 00:19:23,610 You can see that this input is also incorrectly classified, 354 00:19:23,610 --> 00:19:25,530 because the correct answer is a zero. 355 00:19:25,530 --> 00:19:28,170 It's a red dot. 
356 00:19:28,170 --> 00:19:30,800 But it's on the positive side 357 00:19:30,800 --> 00:19:32,310 of the decision boundary. 358 00:19:32,310 --> 00:19:34,980 So the network classifies it as a one. 359 00:19:34,980 --> 00:19:35,480 OK, good. 360 00:19:35,480 --> 00:19:39,050 So the network classified it as a one and the correct answer 361 00:19:39,050 --> 00:19:40,580 was a zero, so we were wrong. 362 00:19:40,580 --> 00:19:42,650 So we're going to update w, and we're 363 00:19:42,650 --> 00:19:47,060 going to update it in the opposite direction of the input 364 00:19:47,060 --> 00:19:49,880 if the correct answer was zero, which is the case. 365 00:19:49,880 --> 00:19:53,360 So we're going to update w. 366 00:19:53,360 --> 00:19:56,000 And that's the input xi. 367 00:19:56,000 --> 00:19:59,310 Minus xi is in this direction. 368 00:19:59,310 --> 00:20:02,540 So we're going to update w in that direction. 369 00:20:02,540 --> 00:20:06,530 So we're going to add those two vectors to get our new w. 370 00:20:06,530 --> 00:20:09,430 And when we do that, that's what we get. 371 00:20:09,430 --> 00:20:10,730 There's our new w. 372 00:20:10,730 --> 00:20:12,360 There's our new decision boundary. 373 00:20:12,360 --> 00:20:15,200 And you can see that that decision boundary is now 374 00:20:15,200 --> 00:20:22,160 perfectly oriented to separate the red and the green dots. 375 00:20:22,160 --> 00:20:26,060 So that's Rosenblatt's perceptron learning rule. 376 00:20:26,060 --> 00:20:27,156 Yes, Rebecca? 377 00:20:27,156 --> 00:20:29,100 AUDIENCE: How do you change the learning rate? 378 00:20:29,100 --> 00:20:30,308 Because what if it's too big? 379 00:20:30,308 --> 00:20:33,067 You'll sort of get not helpful [INAUDIBLE].. 380 00:20:33,067 --> 00:20:34,400 MICHALE FEE: Yeah, that's right. 381 00:20:34,400 --> 00:20:36,080 So if the learning rate were too big, 382 00:20:36,080 --> 00:20:38,460 you could see this first correction. 383 00:20:38,460 --> 00:20:41,930 So let's say that we corrected w but made a correction that 384 00:20:41,930 --> 00:20:44,160 was too far in this direction. 385 00:20:44,160 --> 00:20:48,350 So now the new w would point up here. 386 00:20:48,350 --> 00:20:50,640 And that would give us, again, the wrong answer. 387 00:20:50,640 --> 00:20:53,180 What happens, generally, is that if your learning 388 00:20:53,180 --> 00:20:59,810 rate is too high, then your weight vector bounces around. 389 00:20:59,810 --> 00:21:01,790 It oscillates around. 390 00:21:01,790 --> 00:21:04,130 So it'll jump too far this way, and then 391 00:21:04,130 --> 00:21:06,530 it'll get an error over here, and it'll 392 00:21:06,530 --> 00:21:07,670 jump too far that way. 393 00:21:07,670 --> 00:21:09,337 And then you'll get an error over there, 394 00:21:09,337 --> 00:21:11,330 and it'll just keep bouncing back and forth. 395 00:21:11,330 --> 00:21:13,460 So you generally choose learning rates 396 00:21:13,460 --> 00:21:16,190 that-- the process of choosing learning rates 397 00:21:16,190 --> 00:21:18,500 can be a little tricky. Basically, 398 00:21:18,500 --> 00:21:21,920 the answer is to start small and increase it until it breaks. 399 00:21:26,780 --> 00:21:28,210 OK, any questions about that? 
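Putting the whole procedure together, here is a minimal sketch of the perceptron learning rule (the toy data, the learning rate of 0.1, and the number of passes are my own choices for illustration): if the network's answer matches the label, leave w alone; if the label was one, nudge w toward that input; if the label was zero, nudge w away from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data: "green" dots (label 1) in one cloud, "red" dots (label 0) in another.
U = np.vstack([rng.normal([2.5, 2.5], 0.2, size=(20, 2)),
               rng.normal([0.5, 0.5], 0.2, size=(20, 2))])
t = np.array([1] * 20 + [0] * 20)

theta = 1.0
eta = 0.1                       # learning rate: too large and w bounces around
w = rng.normal(size=2)          # start with a random weight vector

for sweep in range(100):        # repeated passes through the observations
    errors = 0
    for u_i, t_i in zip(U, t):
        v = int(np.dot(w, u_i) > theta)        # the network's answer
        if v != t_i:                           # wrong answer: update w
            # move toward the input if the label was 1, away from it if it was 0
            w = w + (eta * u_i if t_i == 1 else -eta * u_i)
            errors += 1
    if errors == 0:             # every observation classified correctly
        break

print(w, sweep)
```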
400 00:21:31,500 --> 00:21:36,430 So you can see it's a very simple algorithm that 401 00:21:36,430 --> 00:21:40,750 provides a way of changing w that is guaranteed to converge 402 00:21:40,750 --> 00:21:45,400 toward the best answer in separating these two 403 00:21:45,400 --> 00:21:46,360 classes of inputs. 404 00:21:52,270 --> 00:21:55,780 All right, so let's go a little bit further 405 00:21:55,780 --> 00:21:59,770 into single layer binary networks 406 00:21:59,770 --> 00:22:02,350 and see what they can do. 407 00:22:02,350 --> 00:22:06,100 So these kinds of networks are very good for actually 408 00:22:06,100 --> 00:22:08,090 implementing logic operations. 409 00:22:08,090 --> 00:22:10,990 So you can see that-- let's say that we have a perceptron that 410 00:22:10,990 --> 00:22:12,110 looks like this. 411 00:22:12,110 --> 00:22:17,210 Let's give it a threshold of 0.5 and give it 412 00:22:17,210 --> 00:22:20,870 a weight vector that's 1 and 1. 413 00:22:20,870 --> 00:22:24,710 So you can see that this perceptron 414 00:22:24,710 --> 00:22:26,740 gives an answer of zero. 415 00:22:26,740 --> 00:22:29,000 The output neuron has zero firing rate 416 00:22:29,000 --> 00:22:32,320 for an input that's zero. 417 00:22:32,320 --> 00:22:38,010 But any input that's on the other side of the decision 418 00:22:38,010 --> 00:22:41,640 boundary produces an output firing rate of one. 419 00:22:41,640 --> 00:22:50,250 What that means is that if the input a, or u1, is a 1, 420 00:22:50,250 --> 00:22:54,330 0, then the output neuron will fire. 421 00:22:54,330 --> 00:22:57,720 If the input is 0, 1, the output neuron will fire. 422 00:22:57,720 --> 00:23:01,200 And if the input is 1, 1, the output neuron will fire. 423 00:23:01,200 --> 00:23:07,610 So, basically, any input above some threshold 424 00:23:07,610 --> 00:23:09,320 will make the output neuron fire. 425 00:23:09,320 --> 00:23:13,600 So this perceptron implements an OR gate. 426 00:23:13,600 --> 00:23:18,080 If it's input a or input b, the output neuron 427 00:23:18,080 --> 00:23:22,330 spikes, as long as those inputs are above some threshold value. 428 00:23:22,330 --> 00:23:25,280 So that's very much like a logical OR gate. 429 00:23:28,130 --> 00:23:30,200 Now let's see if we can implement an AND gate. 430 00:23:30,200 --> 00:23:33,340 So it turns out that implementing an AND gate 431 00:23:33,340 --> 00:23:35,380 is almost exactly like an OR gate. 432 00:23:35,380 --> 00:23:40,420 We just need-- what would we change about this network 433 00:23:40,420 --> 00:23:42,182 to implement an AND gate? 434 00:23:42,182 --> 00:23:43,600 AUDIENCE: A larger [INAUDIBLE]. 435 00:23:43,600 --> 00:23:44,642 MICHALE FEE: What's that? 436 00:23:44,642 --> 00:23:45,760 AUDIENCE: A larger theta? 437 00:23:45,760 --> 00:23:47,290 MICHALE FEE: Yeah, a larger theta. 438 00:23:47,290 --> 00:23:52,670 So all we have to do is move this line up to here. 439 00:23:52,670 --> 00:23:55,250 And now one of those inputs is not 440 00:23:55,250 --> 00:23:57,830 enough to make the output neuron fire. 441 00:23:57,830 --> 00:24:00,620 The other input is not enough to make the output neuron fire. 442 00:24:00,620 --> 00:24:02,510 Only when you have both. 443 00:24:02,510 --> 00:24:04,520 So that implements an AND gate. 444 00:24:04,520 --> 00:24:09,075 We just increase the threshold a little bit. 445 00:24:09,075 --> 00:24:09,950 Does that make sense? 446 00:24:09,950 --> 00:24:12,890 So we just increase the threshold here to 1.5. 
447 00:24:12,890 --> 00:24:17,870 And now when either input is on at a value of one, 448 00:24:17,870 --> 00:24:20,840 that's not enough to make the output neuron fire. 449 00:24:20,840 --> 00:24:22,670 If this input's on, it's not enough. 450 00:24:22,670 --> 00:24:25,790 If that input is on, it's not enough. 451 00:24:25,790 --> 00:24:29,270 Only when both inputs are on do you get enough input 452 00:24:29,270 --> 00:24:33,010 to this output neuron to make it have a non-zero firing rate, 453 00:24:33,010 --> 00:24:37,190 to get it above threshold. 454 00:24:37,190 --> 00:24:42,080 Now, there's another very common logic operation that cannot be 455 00:24:42,080 --> 00:24:47,010 solved by a simple perceptron. 456 00:24:47,010 --> 00:24:51,680 That's called an exclusive OR, where 457 00:24:51,680 --> 00:24:55,100 this neuron, this network, we want 458 00:24:55,100 --> 00:25:05,890 it to fire only if input a is on or input b is on, but not both. 459 00:25:05,890 --> 00:25:08,830 Why is it that that can't be solved 460 00:25:08,830 --> 00:25:12,010 by the kind of perceptron that we've been describing? 461 00:25:12,010 --> 00:25:14,830 Anybody have some intuition about that? 462 00:25:20,022 --> 00:25:23,240 AUDIENCE: I mean, it's obviously [INAUDIBLE] separable. 463 00:25:23,240 --> 00:25:24,680 MICHALE FEE: Yeah, that's right. 464 00:25:24,680 --> 00:25:27,320 The keyword there is separable. 465 00:25:27,320 --> 00:25:33,210 If you look at this set of dots, there's no single line, 466 00:25:33,210 --> 00:25:38,060 there's no single boundary that separates all the red dots 467 00:25:38,060 --> 00:25:40,940 from all of the green dots, OK? 468 00:25:40,940 --> 00:25:44,380 And so that set of inputs is called non-separable. 469 00:25:44,380 --> 00:25:52,700 And sets of inputs that are not separable cannot be classified 470 00:25:52,700 --> 00:25:58,160 correctly by a simple perceptron of the type we've been talking 471 00:25:58,160 --> 00:25:59,340 about. 472 00:25:59,340 --> 00:26:00,840 So how do you solve that problem? 473 00:26:00,840 --> 00:26:06,132 So this is a set of inputs that's non-separable. 474 00:26:06,132 --> 00:26:08,090 You can see that you can solve this problem now 475 00:26:08,090 --> 00:26:11,310 if you have two separate perceptrons. 476 00:26:11,310 --> 00:26:12,420 So watch this. 477 00:26:12,420 --> 00:26:15,410 We can build one perceptron that fires, 478 00:26:15,410 --> 00:26:21,590 that has a positive output when this input is on. 479 00:26:21,590 --> 00:26:24,170 We can have a separate perceptron that is active 480 00:26:24,170 --> 00:26:29,300 when that input is on. 481 00:26:29,300 --> 00:26:32,270 And then what would we do? 482 00:26:32,270 --> 00:26:34,040 If we had one neuron that's active 483 00:26:34,040 --> 00:26:35,990 when this input is on and another neuron 484 00:26:35,990 --> 00:26:37,760 that's active when that input is on? 485 00:26:40,610 --> 00:26:43,260 We would OR them together, that's right. 486 00:26:43,260 --> 00:26:47,010 So this is what's known as a multi-layer perceptron. 487 00:26:47,010 --> 00:26:50,040 We have two inputs, one that represents activity 488 00:26:50,040 --> 00:26:53,980 in a, another that represents activity in b. 489 00:26:53,980 --> 00:26:57,840 And we have one neuron in what's called 490 00:26:57,840 --> 00:27:00,840 the intermediate layer of our perceptron 491 00:27:00,840 --> 00:27:04,930 that has a weight vector of 1 minus 1. 
492 00:27:04,930 --> 00:27:09,270 What that means is this neuron will be active if input a is 493 00:27:09,270 --> 00:27:14,750 on but not input b. 494 00:27:14,750 --> 00:27:16,880 This one will be active. 495 00:27:16,880 --> 00:27:20,576 This neuron has a different weight vector-- minus 1, 1. 496 00:27:20,576 --> 00:27:27,770 This neuron will be active if input b is on but not input a. 497 00:27:30,512 --> 00:27:34,120 And the output neuron implements an OR operation 498 00:27:34,120 --> 00:27:39,010 that will be active when this intermediate neuron is on 499 00:27:39,010 --> 00:27:42,820 or that intermediate neuron is on, OK? 500 00:27:42,820 --> 00:27:47,220 And so that network altogether implements this exclusive OR 501 00:27:47,220 --> 00:27:48,550 function. 502 00:27:48,550 --> 00:27:50,030 Does that make sense? 503 00:27:50,030 --> 00:27:51,120 Any questions about that? 504 00:27:56,690 --> 00:27:59,030 So this problem of separability is 505 00:27:59,030 --> 00:28:05,820 extremely important in classifying inputs in general. 506 00:28:05,820 --> 00:28:11,420 So if you think about classifying an image, 507 00:28:11,420 --> 00:28:14,840 like a number or a letter, you can 508 00:28:14,840 --> 00:28:21,430 see that in high-dimensional space, images 509 00:28:21,430 --> 00:28:28,590 that are all threes, let's say, are all 510 00:28:28,590 --> 00:28:30,030 very similar to each other. 511 00:28:30,030 --> 00:28:34,000 But they're actually not separable in this linear space. 512 00:28:34,000 --> 00:28:36,900 And that's because in the high dimensional space 513 00:28:36,900 --> 00:28:40,920 they exist on what's called a manifold 514 00:28:40,920 --> 00:28:43,930 in this high-dimensional space, OK? 515 00:28:43,930 --> 00:28:48,180 They're like all lined up on some sheet, OK? 516 00:28:48,180 --> 00:28:51,540 So this is an example of rotations, 517 00:28:51,540 --> 00:28:54,930 and you can see that all these different threes kind of sit 518 00:28:54,930 --> 00:28:59,160 along a manifold in this high-dimensional space that 519 00:28:59,160 --> 00:29:01,605 are separate from all the other numbers. 520 00:29:06,280 --> 00:29:08,310 So all those numbers exist on what's 521 00:29:08,310 --> 00:29:13,110 called an invariant transformation, OK? 522 00:29:13,110 --> 00:29:16,600 Now, how would we separate those images 523 00:29:16,600 --> 00:29:22,060 of threes from all the other numbers or letters? 524 00:29:22,060 --> 00:29:23,570 How would we do that? 525 00:29:23,570 --> 00:29:30,035 Well, we could imagine building a multi-layer perceptron that-- 526 00:29:30,035 --> 00:29:31,410 so here, I'm showing that there's 527 00:29:31,410 --> 00:29:35,040 no single line that separates the threes on this manifold 528 00:29:35,040 --> 00:29:38,130 from all the other digits over here. 529 00:29:38,130 --> 00:29:40,650 We can solve that problem by implementing 530 00:29:40,650 --> 00:29:45,090 a multi-layer perceptron where one of those perceptrons 531 00:29:45,090 --> 00:29:49,140 detects these objects, another perceptron detects 532 00:29:49,140 --> 00:29:53,400 those objects, and then we can OR those all together. 533 00:29:53,400 --> 00:29:58,380 So that's a kind of network that can now 534 00:29:58,380 --> 00:30:03,990 detect all of these threes and separate them from non-threes. 535 00:30:03,990 --> 00:30:06,240 Does that make sense? 
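Here is a minimal sketch of the logic gates described above, using binary threshold units (the OR and AND weights and thresholds are the ones given in the lecture; the intermediate-layer thresholds of 0.5 for the exclusive-OR network are my own assumption).

```python
import numpy as np

def unit(w, theta):
    # A binary threshold neuron: output 1 when w . u exceeds theta, else 0.
    return lambda u: int(np.dot(w, np.asarray(u, dtype=float)) > theta)

OR  = unit([1, 1], 0.5)    # either input alone pushes the neuron over threshold
AND = unit([1, 1], 1.5)    # only both inputs together cross the threshold

def XOR(u):
    # Multi-layer perceptron: one intermediate unit detects "a but not b",
    # the other detects "b but not a", and the output neuron ORs them together.
    h1 = unit([1, -1], 0.5)(u)
    h2 = unit([-1, 1], 0.5)(u)
    return OR([h1, h2])

for u in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(u, OR(u), AND(u), XOR(u))
```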
536 00:30:06,240 --> 00:30:10,520 So we can think of objects that we recognize, like this three 537 00:30:10,520 --> 00:30:12,980 that we recognize, even though it has different-- 538 00:30:12,980 --> 00:30:15,110 we can recognize it with different rotations 539 00:30:15,110 --> 00:30:20,730 or transformations or scale changes. 540 00:30:20,730 --> 00:30:23,750 You can also think of the problem of separating images 541 00:30:23,750 --> 00:30:28,250 of dogs and cats as also solving this problem, 542 00:30:28,250 --> 00:30:32,450 that the space of dogs, of dog images, 543 00:30:32,450 --> 00:30:36,680 somehow lives on a manifold in the high dimensional space 544 00:30:36,680 --> 00:30:39,260 of inputs that we can distinguish 545 00:30:39,260 --> 00:30:43,070 from the set of images of cats that's 546 00:30:43,070 --> 00:30:48,570 some other manifold in this high-dimensional space. 547 00:30:48,570 --> 00:30:53,790 So it turns out that you need more than just a single layer 548 00:30:53,790 --> 00:30:54,450 perceptron. 549 00:30:54,450 --> 00:30:57,900 You need more than just a two-layer perceptron. 550 00:30:57,900 --> 00:30:59,820 In general, the kinds of networks 551 00:30:59,820 --> 00:31:02,790 that are good for separating different kinds of images, 552 00:31:02,790 --> 00:31:06,240 like dogs and cats and cars and houses and faces, 553 00:31:06,240 --> 00:31:07,890 look more like this. 554 00:31:07,890 --> 00:31:11,250 So this is work from Jim DiCarlo's lab, 555 00:31:11,250 --> 00:31:16,770 where they found evidence that networks in the brain that do 556 00:31:16,770 --> 00:31:18,720 image classification-- for example, 557 00:31:18,720 --> 00:31:21,520 in the visual pathway-- 558 00:31:21,520 --> 00:31:25,690 look a lot like very deep neural networks, where 559 00:31:25,690 --> 00:31:31,420 you have the retina on the left side here sending inputs 560 00:31:31,420 --> 00:31:33,395 to another layer in the thalamus, 561 00:31:33,395 --> 00:31:40,300 sending inputs to v1, to v2, to v4, and so on, up to IT. 562 00:31:40,300 --> 00:31:43,480 And that we can think of this as being, 563 00:31:43,480 --> 00:31:48,100 essentially, many stacked layers of perceptrons 564 00:31:48,100 --> 00:31:52,150 that sort of unravel these manifolds 565 00:31:52,150 --> 00:31:54,550 in this high-dimensional space to allow 566 00:31:54,550 --> 00:31:59,380 neurons here at the very end to separate dogs 567 00:31:59,380 --> 00:32:02,065 from cats from buildings from faces. 568 00:32:04,720 --> 00:32:06,640 And there are learning rules that 569 00:32:06,640 --> 00:32:09,310 can be used to train networks like this 570 00:32:09,310 --> 00:32:14,440 by putting in a bunch of different images of people 571 00:32:14,440 --> 00:32:16,150 and other different categories that you 572 00:32:16,150 --> 00:32:17,650 might want to separate. 573 00:32:17,650 --> 00:32:19,720 And then each one of those images 574 00:32:19,720 --> 00:32:23,230 has a label, just like our perceptron learning rule. 575 00:32:23,230 --> 00:32:27,010 And we can use the image and the correct label-- 576 00:32:27,010 --> 00:32:32,640 face or dog-- and train that network 577 00:32:32,640 --> 00:32:38,560 by projecting that information into these intermediate layers 578 00:32:38,560 --> 00:32:41,380 to train that network to properly classify 579 00:32:41,380 --> 00:32:43,390 those different stimuli, OK? 
580 00:32:43,390 --> 00:32:47,770 This is, basically, the kind of technology 581 00:32:47,770 --> 00:32:51,830 that's currently being used to train-- 582 00:32:51,830 --> 00:32:53,470 this is being used in AI. 583 00:32:53,470 --> 00:32:57,880 It's being used to train driverless cars. 584 00:32:57,880 --> 00:33:02,350 All kinds of technological advances 585 00:33:02,350 --> 00:33:06,018 are based on this kind of technology here. 586 00:33:06,018 --> 00:33:07,060 Any questions about that? 587 00:33:07,060 --> 00:33:08,054 Aditi? 588 00:33:08,054 --> 00:33:10,540 AUDIENCE: So in actual neurons, I 589 00:33:10,540 --> 00:33:12,550 assume it's not linear, right? 590 00:33:12,550 --> 00:33:14,230 MICHALE FEE: Yes. 591 00:33:14,230 --> 00:33:17,560 These are all nonlinear neurons. 592 00:33:17,560 --> 00:33:19,960 They're more like these binary threshold units 593 00:33:19,960 --> 00:33:21,628 than they are like linear neurons. 594 00:33:21,628 --> 00:33:22,170 That's right. 595 00:33:22,170 --> 00:33:25,795 AUDIENCE: But then do you there's, like-- 596 00:33:25,795 --> 00:33:28,372 because right now, I imagine that models we make 597 00:33:28,372 --> 00:33:30,482 have to have way more perceptron units. 598 00:33:30,482 --> 00:33:31,190 MICHALE FEE: Yes. 599 00:33:31,190 --> 00:33:34,475 AUDIENCE: We use our simplified [INAUDIBLE].. 600 00:33:34,475 --> 00:33:35,850 But then our brain is sometimes-- 601 00:33:35,850 --> 00:33:38,610 I mean, it's at, like, a much faster level, 602 00:33:38,610 --> 00:33:41,090 like way faster, right? 603 00:33:41,090 --> 00:33:46,000 So you think it'd be like-- if we examine what functions 604 00:33:46,000 --> 00:33:50,320 neurons might be using, in a way that would let us reduce 605 00:33:50,320 --> 00:33:51,760 the number of units needed? 606 00:33:51,760 --> 00:33:53,584 Because right now, for example, [INAUDIBLE] 607 00:33:53,584 --> 00:33:55,380 be a bunch of lines. 608 00:33:55,380 --> 00:33:58,690 But maybe in the brain, there's some other function it's using, 609 00:33:58,690 --> 00:34:00,340 which is smoother. 610 00:34:00,340 --> 00:34:02,580 MICHALE FEE: Yeah. 611 00:34:02,580 --> 00:34:04,330 OK, so let me just make sure I understand. 612 00:34:04,330 --> 00:34:07,540 You're not talking about the F-I curve of the neurons? 613 00:34:07,540 --> 00:34:09,540 Is that correct? 614 00:34:09,540 --> 00:34:12,100 You're talking about the way that you figure out 615 00:34:12,100 --> 00:34:13,514 these weights. 616 00:34:13,514 --> 00:34:14,889 Is that what you're asking about? 617 00:34:14,889 --> 00:34:15,880 AUDIENCE: No. 618 00:34:15,880 --> 00:34:20,034 I'm asking if we use a more accurate F-I curve, 619 00:34:20,034 --> 00:34:21,657 we'll need less units. 620 00:34:21,657 --> 00:34:23,449 MICHALE FEE: OK, so that's a good question. 621 00:34:23,449 --> 00:34:26,230 I don't actually know the answer to the question 622 00:34:26,230 --> 00:34:29,350 of how the specific choice of F-I curve 623 00:34:29,350 --> 00:34:31,659 affects the performance of this. 624 00:34:31,659 --> 00:34:35,380 The big problem that people are trying to figure out 625 00:34:35,380 --> 00:34:39,489 in terms of how these are trained 626 00:34:39,489 --> 00:34:42,250 is the challenge that in order to train these networks, 627 00:34:42,250 --> 00:34:47,420 you actually need thousands and thousands, maybe millions, 628 00:34:47,420 --> 00:34:54,139 of examples of different objects here and the answer here. 
629 00:34:54,139 --> 00:34:56,510 So you have to put in many thousands 630 00:34:56,510 --> 00:35:00,620 of example images and the answer in order 631 00:35:00,620 --> 00:35:02,540 to train these networks. 632 00:35:02,540 --> 00:35:06,080 And that's not the way people actually learn. 633 00:35:06,080 --> 00:35:09,530 We don't walk around the world when we're one-year-old 634 00:35:09,530 --> 00:35:12,550 and our mother saying, dog, cat, person, house. 635 00:35:12,550 --> 00:35:16,130 You know, it would be... in order to give a person as many 636 00:35:16,130 --> 00:35:19,070 labeled examples as you need to give these networks, 637 00:35:19,070 --> 00:35:23,270 you would just be doing nothing, but your parents would be 638 00:35:23,270 --> 00:35:27,770 pointing things out to you and telling you one-word answers 639 00:35:27,770 --> 00:35:28,970 of what those are. 640 00:35:28,970 --> 00:35:32,300 Instead, what happens is we just observe the world 641 00:35:32,300 --> 00:35:34,970 and figure out kind of categories 642 00:35:34,970 --> 00:35:38,030 based on other sorts of learning rules that are unsupervised. 643 00:35:38,030 --> 00:35:40,610 We figure out, oh, that's a kind of thing, and then mom says, 644 00:35:40,610 --> 00:35:42,140 that's a dog. 645 00:35:42,140 --> 00:35:45,110 And then we know that that category is a dog. 646 00:35:45,110 --> 00:35:47,510 And we sometimes make mistakes, right? 647 00:35:47,510 --> 00:35:52,820 Like a kid might look at a bear and say, dog. 648 00:35:52,820 --> 00:35:55,840 And then dad says, no, no, that's not a dog, son. 649 00:35:59,930 --> 00:36:04,610 So the learning by which people train their networks 650 00:36:04,610 --> 00:36:06,560 to do classification of inputs is 651 00:36:06,560 --> 00:36:10,020 quite different from the way these deep neural networks 652 00:36:10,020 --> 00:36:10,520 work. 653 00:36:10,520 --> 00:36:15,340 And that's a very important and active area of research. 654 00:36:15,340 --> 00:36:15,840 Yes? 655 00:36:15,840 --> 00:36:19,330 AUDIENCE: Is the fact that [INAUDIBLE] use unsupervised 656 00:36:19,330 --> 00:36:22,690 learning, as well, to train a computer 657 00:36:22,690 --> 00:36:25,970 to recognize an image of a turtle as a gun, 658 00:36:25,970 --> 00:36:28,040 but humans can't do that [INAUDIBLE].. 659 00:36:28,040 --> 00:36:29,737 MICHALE FEE: Recognize a turtle if what? 660 00:36:29,737 --> 00:36:32,112 AUDIENCE: Like I saw this thing where it was like at MIT, 661 00:36:32,112 --> 00:36:33,910 they used an AI. 662 00:36:33,910 --> 00:36:35,810 They manipulated pixels in images 663 00:36:35,810 --> 00:36:38,128 and convinced the computer that it was something 664 00:36:38,128 --> 00:36:39,170 that it was not actually. 665 00:36:39,170 --> 00:36:40,160 MICHALE FEE: I see. 666 00:36:40,160 --> 00:36:40,430 Yeah. 667 00:36:40,430 --> 00:36:41,885 AUDIENCE: So like you would see a picture of a turtle, 668 00:36:41,885 --> 00:36:43,510 but the computer would get that picture 669 00:36:43,510 --> 00:36:45,200 and say it was, like, a machine gun. 670 00:36:45,200 --> 00:36:47,660 MICHALE FEE: Just by manipulating a few pixels 671 00:36:47,660 --> 00:36:49,397 and kind of screwing with its mind. 672 00:36:49,397 --> 00:36:49,980 AUDIENCE: Yes. 673 00:36:49,980 --> 00:36:50,990 So it's [INAUDIBLE]. 674 00:36:54,350 --> 00:36:55,160 MICHALE FEE: Yeah. 675 00:36:55,160 --> 00:36:57,722 Well, people can be tricked by different things. 676 00:37:01,700 --> 00:37:05,490 The answer is, yes, it's related to that. 
677 00:37:05,490 --> 00:37:08,090 The problem is after you do this training, 678 00:37:08,090 --> 00:37:09,890 we actually don't really understand 679 00:37:09,890 --> 00:37:14,090 what's going on in the guts of this network. 680 00:37:14,090 --> 00:37:16,640 It's very hard to look at the inside of this network 681 00:37:16,640 --> 00:37:22,090 after it's trained and understand what it's doing. 682 00:37:22,090 --> 00:37:25,180 And so we don't know the answer why 683 00:37:25,180 --> 00:37:28,570 it is that you can fool one of these networks 684 00:37:28,570 --> 00:37:30,550 by changing a few pixels. 685 00:37:30,550 --> 00:37:33,385 Something goes wrong in here, and we don't know what it is. 686 00:37:33,385 --> 00:37:35,920 It may very well have to do with the way it's trained, 687 00:37:35,920 --> 00:37:41,830 rather than building categories in an unsupervised way, which 688 00:37:41,830 --> 00:37:43,940 could be much more generalizable. 689 00:37:43,940 --> 00:37:46,048 So good question. 690 00:37:46,048 --> 00:37:47,340 I don't really know the answer. 691 00:37:50,330 --> 00:37:50,830 Yes? 692 00:37:50,830 --> 00:37:52,372 AUDIENCE: Sorry, can you explain what 693 00:37:52,372 --> 00:37:56,280 you mean [INAUDIBLE] the neural network needs an answer? 694 00:37:56,280 --> 00:38:00,310 They're not categorized and then tell the user dogs? 695 00:38:00,310 --> 00:38:02,420 MICHALE FEE: Yeah, so no, in order 696 00:38:02,420 --> 00:38:05,390 to train one of these networks, you have to give it a data set, 697 00:38:05,390 --> 00:38:07,640 a labeled data set. 698 00:38:07,640 --> 00:38:11,270 So a set of images that already has the answer 699 00:38:11,270 --> 00:38:15,252 that was labeled by a person. 700 00:38:15,252 --> 00:38:16,710 AUDIENCE: So you can't just give it 701 00:38:16,710 --> 00:38:19,046 a set of photos of puppies and snakes 702 00:38:19,046 --> 00:38:21,320 and it'll categorize them into two groups? 703 00:38:21,320 --> 00:38:23,195 MICHALE FEE: No, nobody knows how to do that. 704 00:38:25,890 --> 00:38:31,220 People are working on that, but it's not known yet. 705 00:38:31,220 --> 00:38:32,010 Yes, Jasmine? 706 00:38:34,640 --> 00:38:41,080 AUDIENCE: [INAUDIBLE] but I see [INAUDIBLE] I 707 00:38:41,080 --> 00:38:44,310 can't separate them and like adding an additional feature 708 00:38:44,310 --> 00:38:47,874 to raise it to a higher dimensional space, where 709 00:38:47,874 --> 00:38:50,203 it's separable? 710 00:38:50,203 --> 00:38:52,120 MICHALE FEE: Sorry, I didn't quite understand. 711 00:38:52,120 --> 00:38:53,806 Can you say it again? 712 00:38:53,806 --> 00:38:56,221 AUDIENCE: I think I remember reading somewhere 713 00:38:56,221 --> 00:39:02,182 about how when the scenes are nonlinearly separable-- 714 00:39:02,182 --> 00:39:02,890 MICHALE FEE: Yes. 715 00:39:02,890 --> 00:39:05,720 AUDIENCE: --you can add in another feature to [INAUDIBLE].. 716 00:39:05,720 --> 00:39:06,720 MICHALE FEE: Yeah, yeah. 717 00:39:06,720 --> 00:39:09,090 So let me show you an example of that. 718 00:39:09,090 --> 00:39:11,850 So coming back to the exclusive OR. 719 00:39:11,850 --> 00:39:14,130 So one thing that you can do, you 720 00:39:14,130 --> 00:39:18,570 can see that the reason this is linearly inseparable-- it's not 721 00:39:18,570 --> 00:39:20,970 linearly separable-- is because all these points are 722 00:39:20,970 --> 00:39:23,040 in a plane. 723 00:39:23,040 --> 00:39:26,620 So there's no line that separates them. 
724 00:39:26,620 --> 00:39:29,250 But one way, one sort of trick you can do, 725 00:39:29,250 --> 00:39:30,980 is to add noise to this. 726 00:39:30,980 --> 00:39:33,930 So that now, some of these points move. 727 00:39:33,930 --> 00:39:36,040 You can add another dimension. 728 00:39:36,040 --> 00:39:38,440 So now let's say that we add noise, 729 00:39:38,440 --> 00:39:41,790 and we just, by chance, happen to move the green dots this way 730 00:39:41,790 --> 00:39:44,610 and the red dots, well, that way. 731 00:39:44,610 --> 00:39:47,400 And now there's a plane that will separate the red dots 732 00:39:47,400 --> 00:39:49,260 from the green dots. 733 00:39:49,260 --> 00:39:55,170 So that's advanced beyond the scope of what 734 00:39:55,170 --> 00:39:56,320 we're talking about here. 735 00:39:56,320 --> 00:39:57,870 But yes, there are tricks that you 736 00:39:57,870 --> 00:40:02,070 can play to get around this exclusive OR 737 00:40:02,070 --> 00:40:06,570 problem, this linear separability problem, OK? 738 00:40:06,570 --> 00:40:08,940 All right, great question. 739 00:40:08,940 --> 00:40:12,660 All right, let's push on. 740 00:40:12,660 --> 00:40:18,000 So let's talk about more general two-layer 741 00:40:18,000 --> 00:40:20,730 feed-forward networks. 742 00:40:20,730 --> 00:40:25,800 So this is referred to as a two-layer network-- an input 743 00:40:25,800 --> 00:40:28,240 layer and an output layer. 744 00:40:28,240 --> 00:40:31,070 And in this case, we had a single input neuron 745 00:40:31,070 --> 00:40:32,690 and a single output neuron. 746 00:40:32,690 --> 00:40:36,780 We generalized that to having multiple input neurons and one 747 00:40:36,780 --> 00:40:37,470 output neuron. 748 00:40:37,470 --> 00:40:39,450 We saw that we can write down the input current 749 00:40:39,450 --> 00:40:43,500 to this output neuron as w, the vector of weights, 750 00:40:43,500 --> 00:40:46,080 dotted into the vector of input firing rates 751 00:40:46,080 --> 00:40:49,310 to give us an expression for the firing rate of the output 752 00:40:49,310 --> 00:40:50,310 neuron. 753 00:40:50,310 --> 00:40:52,080 And now we can generalize that further 754 00:40:52,080 --> 00:40:54,520 to the case of multiple output neurons. 755 00:40:54,520 --> 00:40:57,420 So we have multiple input neurons, multiple output 756 00:40:57,420 --> 00:40:59,040 neurons. 757 00:40:59,040 --> 00:41:00,510 You can see that we have a vector 758 00:41:00,510 --> 00:41:02,910 of firing rates of the input neurons 759 00:41:02,910 --> 00:41:07,100 and a vector of firing rates of the output neurons. 760 00:41:07,100 --> 00:41:10,043 So we used to just have one of these output neurons, 761 00:41:10,043 --> 00:41:11,710 and now we've got a whole bunch of them. 762 00:41:11,710 --> 00:41:14,520 And so we have to write down a vector of fire rates 763 00:41:14,520 --> 00:41:16,210 in the output layer. 764 00:41:16,210 --> 00:41:19,560 And now we can write down the firing rate of our output 765 00:41:19,560 --> 00:41:20,590 neurons as follows. 766 00:41:20,590 --> 00:41:22,410 So the firing rate of this neuron 767 00:41:22,410 --> 00:41:28,170 here is going to be a dot product of the vector 768 00:41:28,170 --> 00:41:31,110 of weights onto it. 769 00:41:31,110 --> 00:41:33,060 So the firing rate of output neuron one 770 00:41:33,060 --> 00:41:39,180 is the vector of weights onto that first output neuron dotted 771 00:41:39,180 --> 00:41:43,200 into the vector of input firing rates. 
772 00:41:43,200 --> 00:41:46,380 And the same for the next output neuron. 773 00:41:46,380 --> 00:41:47,940 The firing rate of output neuron two 774 00:41:47,940 --> 00:41:52,350 is the dot product of the weights onto that output neuron two 775 00:41:52,350 --> 00:41:56,040 with the vector of input firing rates. 776 00:41:56,040 --> 00:41:57,900 Same for neuron three. 777 00:41:57,900 --> 00:42:00,500 And we can write that down as follows. 778 00:42:00,500 --> 00:42:03,600 So the a-th output-- the firing rate 779 00:42:03,600 --> 00:42:06,150 of the a-th output neuron is the weight vector 780 00:42:06,150 --> 00:42:09,390 onto the a-th output neuron dotted into the input firing 781 00:42:09,390 --> 00:42:10,530 rate vector, OK? 782 00:42:10,530 --> 00:42:12,690 And we can write that down as follows, 783 00:42:12,690 --> 00:42:15,810 where we've now introduced a new thing here, 784 00:42:15,810 --> 00:42:20,780 which is a matrix of weights. 785 00:42:20,780 --> 00:42:23,300 So it's called the weight matrix. 786 00:42:23,300 --> 00:42:26,600 And it essentially is a matrix of all 787 00:42:26,600 --> 00:42:32,900 of these synaptic weights, from the input layer onto the output 788 00:42:32,900 --> 00:42:33,540 layer. 789 00:42:33,540 --> 00:42:36,830 And now if we had linear neurons, 790 00:42:36,830 --> 00:42:40,900 we can write down the firing rates of the output neurons. 791 00:42:40,900 --> 00:42:45,560 The firing rate vector of the output neurons 792 00:42:45,560 --> 00:42:52,610 is just this weight matrix times the vector of input firing rates. 793 00:42:52,610 --> 00:42:56,240 So now, we've rewritten this problem 794 00:42:56,240 --> 00:42:59,870 of finding the vector of output firing rates 795 00:42:59,870 --> 00:43:02,650 as a matrix multiplication. 796 00:43:02,650 --> 00:43:05,490 And we're going to spend some time talking about what 797 00:43:05,490 --> 00:43:09,030 that means and what that does. 798 00:43:09,030 --> 00:43:12,590 So our feed-forward network implements a matrix 799 00:43:12,590 --> 00:43:13,970 multiplication. 800 00:43:13,970 --> 00:43:16,790 All right, so let's take a closer look at what 801 00:43:16,790 --> 00:43:20,780 this weight matrix looks like. 802 00:43:20,780 --> 00:43:26,340 So we have a weight matrix w sub a comma b that looks like this. 803 00:43:26,340 --> 00:43:29,360 So we have four input neurons and four output neurons. 804 00:43:29,360 --> 00:43:34,670 We have a weight for each input neuron onto each output neuron. 805 00:43:34,670 --> 00:43:40,280 The columns here correspond to different input neurons. 806 00:43:40,280 --> 00:43:42,900 The rows correspond to different output neurons. 807 00:43:42,900 --> 00:43:46,550 Remember, for a matrix, the elements 808 00:43:46,550 --> 00:43:54,713 are listed as w sub a, b, where a is the output neuron 809 00:43:54,713 --> 00:43:55,630 and b is the input neuron. 810 00:43:55,630 --> 00:44:01,760 So it's w postsynaptic, presynaptic-- post, pre. 811 00:44:01,760 --> 00:44:04,010 Rows, columns. 812 00:44:04,010 --> 00:44:07,400 So the rows are the different output neurons. 813 00:44:07,400 --> 00:44:09,485 The columns are the different input neurons. 814 00:44:12,210 --> 00:44:15,980 So it can be a little tricky to remember. 815 00:44:15,980 --> 00:44:21,030 I just remember that it's rows-- 816 00:44:21,030 --> 00:44:23,890 a matrix is labeled by rows and columns. 817 00:44:23,890 --> 00:44:28,000 And weight matrices are postsynaptic, presynaptic-- 818 00:44:28,000 --> 00:44:28,660 post, pre.
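For those following along in MATLAB, here is a minimal sketch of this step; the weight values, input rates, and threshold are made up for illustration. Each row of the weight matrix W holds the weights onto one output neuron (post, pre), so for linear neurons the vector of output firing rates is just W times u, and for threshold-linear neurons you apply the F-I function to that product.

```matlab
% Hypothetical weight matrix for 4 input and 4 output neurons.
% Row a holds the weights w(a,b) onto output neuron a (post, pre).
W = [0.5 0.1 0.0 0.2;
     0.0 0.8 0.3 0.0;
     0.1 0.0 0.6 0.4;
     0.2 0.2 0.0 0.7];

u = [1; 0; 2; 1];            % column vector of input firing rates

v_linear = W * u;            % linear neurons: output rates = W*u

theta = 0.5;                 % made-up threshold for a threshold-linear F-I curve
v = max(W * u - theta, 0);   % rectified (threshold-linear) output rates
```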
819 00:44:31,370 --> 00:44:35,160 AUDIENCE: [INAUDIBLE] comment of [INAUDIBLE]?? 820 00:44:35,160 --> 00:44:37,410 MICHALE FEE: I think that's standard. 821 00:44:37,410 --> 00:44:41,050 I'm pretty sure that's very standard. 822 00:44:41,050 --> 00:44:43,880 If you find any exceptions let me know. 823 00:44:43,880 --> 00:44:49,710 OK, we can think of each row of this matrix 824 00:44:49,710 --> 00:44:53,510 as being the vector of weights onto one output neuron. 825 00:44:56,890 --> 00:45:01,960 That row is a vector of weights onto that output neuron-- 826 00:45:01,960 --> 00:45:05,123 that row, that output neuron; that row, that output neuron. 827 00:45:05,123 --> 00:45:06,040 Does that makes sense? 828 00:45:09,590 --> 00:45:13,350 All right, so let's flesh out this matrix multiplication. 829 00:45:13,350 --> 00:45:15,838 The vector of output firing rates, 830 00:45:15,838 --> 00:45:17,880 we're going to write it as a column vector, where 831 00:45:17,880 --> 00:45:20,670 the first number is this firing rate. 832 00:45:20,670 --> 00:45:22,440 That number is that firing rate. 833 00:45:22,440 --> 00:45:25,560 That number represents that firing rate, OK? 834 00:45:25,560 --> 00:45:27,390 That's equal to this weight matrix 835 00:45:27,390 --> 00:45:31,850 times the vector of input firing rates, 836 00:45:31,850 --> 00:45:36,040 again, written as a column vector. 837 00:45:36,040 --> 00:45:40,320 And in order to calculate the firing rate of the first output 838 00:45:40,320 --> 00:45:44,610 neuron, we take the dot product of the first row of the weight 839 00:45:44,610 --> 00:45:53,020 matrix and the column vector of input firing rates. 840 00:45:53,020 --> 00:45:59,070 And that gives us this first firing rate, OK? 841 00:45:59,070 --> 00:46:00,630 To get the second firing rate, we 842 00:46:00,630 --> 00:46:03,870 take the dot product of the second row of weights 843 00:46:03,870 --> 00:46:06,570 with the vector of firing rates, and that gives us 844 00:46:06,570 --> 00:46:10,050 this second firing rate. 845 00:46:10,050 --> 00:46:11,310 Any questions about that? 846 00:46:11,310 --> 00:46:16,740 Just a brief reminder of matrix multiplication. 847 00:46:16,740 --> 00:46:19,281 All right, no questions? 848 00:46:19,281 --> 00:46:26,910 All right, so let's take a step back and go quickly 849 00:46:26,910 --> 00:46:30,300 through some basic matrix algebra. 850 00:46:30,300 --> 00:46:32,670 I know most of you have probably seen this, 851 00:46:32,670 --> 00:46:35,970 but many haven't, so we're just going to go through it. 852 00:46:35,970 --> 00:46:40,110 All right, so just as vectors are-- 853 00:46:40,110 --> 00:46:42,570 you can think of them as a collection of numbers 854 00:46:42,570 --> 00:46:44,190 that you write down. 855 00:46:44,190 --> 00:46:47,970 So let's say that you are making a measurement of two 856 00:46:47,970 --> 00:46:48,850 different things-- 857 00:46:48,850 --> 00:46:52,740 let's say temperature and humidity. 858 00:46:52,740 --> 00:46:55,980 So you can write down a vector that represents those two 859 00:46:55,980 --> 00:46:57,160 quantities. 860 00:46:57,160 --> 00:47:00,550 So matrices you can think of as collections of vectors. 861 00:47:00,550 --> 00:47:03,870 So let's say we take those two measurements 862 00:47:03,870 --> 00:47:05,980 at different times, at three different times. 
863 00:47:05,980 --> 00:47:11,910 So now we have a vector one, a vector two, and a vector three 864 00:47:11,910 --> 00:47:14,760 that measure those two quantities at three 865 00:47:14,760 --> 00:47:16,620 different times, all right? 866 00:47:16,620 --> 00:47:19,350 So we can now write all of those measurements 867 00:47:19,350 --> 00:47:22,860 down as a matrix, where we collect 868 00:47:22,860 --> 00:47:27,900 each one of those vectors as a column in our matrix, 869 00:47:27,900 --> 00:47:28,900 like that. 870 00:47:28,900 --> 00:47:32,070 Any questions about that? 871 00:47:32,070 --> 00:47:37,170 And there's a bit of MATLAB code that calculates this matrix 872 00:47:37,170 --> 00:47:40,180 by writing three different column vectors 873 00:47:40,180 --> 00:47:42,030 and then concatenating them into a matrix. 874 00:47:45,130 --> 00:47:47,930 All right, and you can see that in this matrix, 875 00:47:47,930 --> 00:47:52,070 the columns are just the original vectors, 876 00:47:52,070 --> 00:47:53,990 and the rows are-- 877 00:47:53,990 --> 00:47:56,480 you can think of those as a time series 878 00:47:56,480 --> 00:47:59,010 of our first measurement, let's say temperature. 879 00:47:59,010 --> 00:48:01,610 So that's temperature as a function of time. 880 00:48:01,610 --> 00:48:08,005 This is temperature and humidity at one time. 881 00:48:08,005 --> 00:48:08,880 Does that make sense? 882 00:48:11,480 --> 00:48:14,180 All right, so, again, we can write down this matrix. 883 00:48:14,180 --> 00:48:16,370 Remember, this is the first measurement 884 00:48:16,370 --> 00:48:18,980 at time two, the first measurement at time three. 885 00:48:18,980 --> 00:48:21,650 We have two rows and three columns. 886 00:48:21,650 --> 00:48:23,270 We can also write down what's known 887 00:48:23,270 --> 00:48:27,080 as the transpose of a matrix that just flips the rows 888 00:48:27,080 --> 00:48:27,660 and columns. 889 00:48:27,660 --> 00:48:30,290 So we can write the transpose, which is 890 00:48:30,290 --> 00:48:33,860 indicated by this superscript capital T. 891 00:48:33,860 --> 00:48:36,140 And here, we're just flipping the rows and columns. 892 00:48:36,140 --> 00:48:41,510 So the first row of this matrix becomes the first column 893 00:48:41,510 --> 00:48:43,220 of the transposed matrix. 894 00:48:43,220 --> 00:48:47,450 So we have three rows and two columns. 895 00:48:47,450 --> 00:48:49,140 A symmetric matrix-- 896 00:48:49,140 --> 00:48:50,940 I'm just defining some terms now. 897 00:48:50,940 --> 00:48:54,360 A symmetric matrix is a matrix where 898 00:48:54,360 --> 00:48:58,650 the off-diagonal elements-- so let me just define, 899 00:48:58,650 --> 00:49:01,800 that's the diagonal, the matrix diagonal. 900 00:49:01,800 --> 00:49:04,650 And a symmetric matrix has the property 901 00:49:04,650 --> 00:49:08,130 that the off-diagonal elements are mirror images of each other across 902 00:49:08,130 --> 00:49:11,040 the diagonal-- element a, b equals element b, a. And a symmetric matrix has the property 903 00:49:11,040 --> 00:49:14,970 that the transpose of that matrix is equal to the matrix, 904 00:49:14,970 --> 00:49:15,600 OK? 905 00:49:15,600 --> 00:49:18,930 That is only possible, of course, 906 00:49:18,930 --> 00:49:23,017 if the matrix has the same number of rows and columns, 907 00:49:23,017 --> 00:49:24,600 if it's what's called a square matrix. 908 00:49:28,990 --> 00:49:31,030 Let me just remind you, in general 909 00:49:31,030 --> 00:49:33,290 about matrix multiplication. 910 00:49:33,290 --> 00:49:36,820 We can write down the product of two matrices.
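Here is a sketch of the bit of MATLAB code the slide refers to, with made-up temperature and humidity values: three column vectors concatenated into a 2-by-3 data matrix, its transpose, and a quick check of what it means for a square matrix to be symmetric.

```matlab
% Measurements (temperature; humidity) at three times -- numbers are made up.
x1 = [20; 65];
x2 = [22; 60];
x3 = [25; 55];

X  = [x1 x2 x3];          % 2-by-3 matrix: each column is one measurement vector
Xt = X';                  % 3-by-2 transpose: rows and columns flipped

% A square matrix A is symmetric when A(a,b) = A(b,a), i.e. A' == A.
A = [1 2; 2 5];
isequal(A', A)            % returns true (logical 1)
```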
911 00:49:36,820 --> 00:49:40,090 And we do that multiplication by taking the dot product 912 00:49:40,090 --> 00:49:44,590 of each row in the first matrix with each column 913 00:49:44,590 --> 00:49:46,000 in the second matrix. 914 00:49:46,000 --> 00:49:49,930 So here's the product of matrix A and matrix B. 915 00:49:49,930 --> 00:49:52,660 So there's the product. 916 00:49:52,660 --> 00:49:56,020 If this matrix, if matrix A, is an m by k-- 917 00:49:56,020 --> 00:49:59,090 m rows by k columns-- 918 00:49:59,090 --> 00:50:05,090 and matrix B has k rows by n columns, 919 00:50:05,090 --> 00:50:09,020 then the product of those two matrices 920 00:50:09,020 --> 00:50:14,180 will be m by n-- m rows by n columns. 921 00:50:14,180 --> 00:50:17,000 And you can see that in order for matrix multiplication 922 00:50:17,000 --> 00:50:23,510 to work, the number of columns of the first matrix 923 00:50:23,510 --> 00:50:25,970 must equal the number of rows in the second matrix. 924 00:50:25,970 --> 00:50:30,890 You can see that this k has to be the same for both matrices. 925 00:50:30,890 --> 00:50:34,120 Does that make sense? 926 00:50:34,120 --> 00:50:37,300 So, again, in order to compute this element right here, 927 00:50:37,300 --> 00:50:40,675 we take the dot product of the first row of A 928 00:50:40,675 --> 00:50:46,450 and the first column of B. That's just 1 times 4, is 4. 929 00:50:46,450 --> 00:50:49,370 Plus negative 2 times 7 is minus 14. 930 00:50:49,370 --> 00:50:51,490 Plus 0 times minus 1 is 0. 931 00:50:51,490 --> 00:50:53,800 Add those up and you get minus 10. 932 00:50:53,800 --> 00:50:55,090 So you get this number. 933 00:50:55,090 --> 00:50:57,040 Then you take the dot product 934 00:50:57,040 --> 00:50:58,990 of this row with this column, and so on. 935 00:51:02,710 --> 00:51:06,310 Notice, A times B is not equal to B times A. 936 00:51:06,310 --> 00:51:11,470 In fact, in the case of rectangular matrices, matrices that aren't 937 00:51:11,470 --> 00:51:15,160 square, you often can't even do 938 00:51:15,160 --> 00:51:18,760 this multiplication in a different order. 939 00:51:18,760 --> 00:51:22,420 Mathematically, it doesn't make sense. 940 00:51:22,420 --> 00:51:27,100 So let's say that we have a matrix of vectors, 941 00:51:27,100 --> 00:51:29,020 and we want to take the dot product 942 00:51:29,020 --> 00:51:35,420 of each one of those vectors x with some other vector v. So 943 00:51:35,420 --> 00:51:36,720 let's just write that down. 944 00:51:36,720 --> 00:51:40,410 The way to do that is to say the answer here, 945 00:51:40,410 --> 00:51:44,130 the dot product of each one of those column vectors 946 00:51:44,130 --> 00:51:46,730 in our matrix with this other vector 947 00:51:46,730 --> 00:51:49,580 v, we do by taking the transpose of v, 948 00:51:49,580 --> 00:51:53,100 which takes a column vector and turns it into a row vector. 949 00:51:53,100 --> 00:51:56,660 And we can now multiply that by our data matrix x 950 00:51:56,660 --> 00:52:01,700 by taking the dot product of v with that column of x. 951 00:52:01,700 --> 00:52:05,100 And that gives us a matrix. 952 00:52:05,100 --> 00:52:09,750 So this matrix here, that vector is a one by two matrix. 953 00:52:09,750 --> 00:52:11,450 This is a two by three matrix. 954 00:52:11,450 --> 00:52:16,010 The product of those is a one by three matrix. 955 00:52:16,010 --> 00:52:18,790 Any questions about that? 956 00:52:18,790 --> 00:52:19,480 OK. 957 00:52:19,480 --> 00:52:21,860 We can do this a different way.
958 00:52:21,860 --> 00:52:25,420 Notice that the result of this multiplication 959 00:52:25,420 --> 00:52:27,578 here is a row vector, y. 960 00:52:27,578 --> 00:52:28,870 We can do this a different way. 961 00:52:28,870 --> 00:52:30,740 We can take dot product. 962 00:52:30,740 --> 00:52:35,350 We can also compute this as y equals x transpose v. 963 00:52:35,350 --> 00:52:37,360 So here, we've taken the transpose of the data 964 00:52:37,360 --> 00:52:40,790 matrix times this column vector v. 965 00:52:40,790 --> 00:52:43,850 And again, we take the dot product of this, 966 00:52:43,850 --> 00:52:45,650 this with this, and that with that. 967 00:52:45,650 --> 00:52:47,860 And now we get a column vector that 968 00:52:47,860 --> 00:52:50,650 has the same entries that we had over here. 969 00:52:53,980 --> 00:52:57,440 All right, so I'm just showing you different ways 970 00:52:57,440 --> 00:53:00,920 that you can manipulate a vector in a matrix 971 00:53:00,920 --> 00:53:08,120 to compute the dot product of elements of vectors 972 00:53:08,120 --> 00:53:11,870 within a data matrix and other vectors 973 00:53:11,870 --> 00:53:13,490 that you're interested in. 974 00:53:16,720 --> 00:53:19,160 All right, identity matrix. 975 00:53:19,160 --> 00:53:21,580 So when you're multiplying numbers together, 976 00:53:21,580 --> 00:53:24,370 the number one has the special property 977 00:53:24,370 --> 00:53:27,910 that you can multiply any real number by one 978 00:53:27,910 --> 00:53:29,320 and get the same number back. 979 00:53:33,930 --> 00:53:39,030 You have the same kind of element in matrices. 980 00:53:39,030 --> 00:53:42,530 So is there a matrix that when multiplied by A gives you A? 981 00:53:42,530 --> 00:53:43,530 And the answer is yes. 982 00:53:43,530 --> 00:53:45,640 It's called the identity matrix. 983 00:53:45,640 --> 00:53:49,230 So it's given by the symbol I, usually. 984 00:53:49,230 --> 00:53:54,540 A times I equals A. What does that matrix look like? 985 00:53:54,540 --> 00:53:56,950 Again, the identity matrix looks like this. 986 00:53:56,950 --> 00:54:01,320 It's a square matrix that has ones along the diagonal 987 00:54:01,320 --> 00:54:02,970 and zero everywhere else. 988 00:54:05,580 --> 00:54:09,180 So you can see here that if you take an arbitrary vector x, 989 00:54:09,180 --> 00:54:12,900 multiplied by the identity matrix, 990 00:54:12,900 --> 00:54:18,630 you can see that this product is x1, x2 dotted into 1, 991 00:54:18,630 --> 00:54:21,030 0, which gives you x1. 992 00:54:21,030 --> 00:54:25,230 x1, x2 dotted into 0, 1, gives you x2. 993 00:54:25,230 --> 00:54:29,560 And so the answer looks like that, which is just x. 994 00:54:29,560 --> 00:54:32,450 So the identity matrix times an arbitrary vector x 995 00:54:32,450 --> 00:54:35,420 gives you x back. 996 00:54:35,420 --> 00:54:40,560 Another very useful application of linear algebra, 997 00:54:40,560 --> 00:54:43,720 linear algebra tools, is to solve systems of equations. 998 00:54:43,720 --> 00:54:46,240 So let me show you what that looks like. 999 00:54:46,240 --> 00:54:52,230 So let's say we want to solve a simple equation, ax equals c. 1000 00:54:52,230 --> 00:54:54,720 So, in this case, how do you solve for x? 1001 00:54:54,720 --> 00:54:57,600 Well, you're just going to divide both sides by a, right? 1002 00:54:57,600 --> 00:54:59,640 So if you divide both sides by a, 1003 00:54:59,640 --> 00:55:04,020 you get that x equals 1 over a times c. 
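As a concrete MATLAB sketch of these two equivalent manipulations (the numbers are arbitrary): multiplying v transpose by the data matrix gives a row vector of dot products, and multiplying the transposed data matrix by v gives the same numbers as a column vector. The identity-matrix check at the end mirrors the point just made.

```matlab
X = [1 3 5;
     2 4 6];          % data matrix: three column vectors in R^2
v = [10; 1];          % vector to dot with each column of X

y_row = v' * X;       % 1-by-3 row vector of dot products: [12 34 56]
y_col = X' * v;       % 3-by-1 column vector with the same entries

I = eye(2);           % 2-by-2 identity matrix
isequal(I * v, v)     % returns true: I times any vector gives that vector back
```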
1004 00:55:04,020 --> 00:55:07,980 So it turns out that there is a matrix equivalent 1005 00:55:07,980 --> 00:55:11,800 of that, that allows you to solve systems of equations. 1006 00:55:11,800 --> 00:55:14,610 So if you have a pair of equations-- 1007 00:55:14,610 --> 00:55:18,570 x minus 2y equals 3 and 3x plus y equals 5-- 1008 00:55:18,570 --> 00:55:21,360 you can write this down as a matrix equation, 1009 00:55:21,360 --> 00:55:23,910 where you have a matrix 1, minus 2, 1010 00:55:23,910 --> 00:55:26,960 3, 1, which correspond to the coefficients of x and y 1011 00:55:26,960 --> 00:55:28,500 in these equations. 1012 00:55:28,500 --> 00:55:36,120 Times a vector x, y is equal to another vector, 3, 5. 1013 00:55:36,120 --> 00:55:40,570 So you can write this down as Ax equals c-- 1014 00:55:40,570 --> 00:55:42,420 that's kind of nice-- 1015 00:55:42,420 --> 00:55:46,440 where this matrix A is given by these coefficients 1016 00:55:46,440 --> 00:55:49,650 and this vector c is given by these terms 1017 00:55:49,650 --> 00:55:53,620 on this side of the equation, on the right side of the equation. 1018 00:55:53,620 --> 00:55:55,990 Now, how do we solve this? 1019 00:55:55,990 --> 00:56:02,510 Well, can we just divide both sides of that matrix equation, 1020 00:56:02,510 --> 00:56:04,670 that vector equation, by A? 1021 00:56:04,670 --> 00:56:08,450 So division is not really defined for matrices, 1022 00:56:08,450 --> 00:56:10,460 but we can use another trick. 1023 00:56:10,460 --> 00:56:12,800 We can multiply both sides of this equation 1024 00:56:12,800 --> 00:56:17,590 by something that makes the A go away. 1025 00:56:17,590 --> 00:56:22,760 And so that magical thing is called the inverse of A. 1026 00:56:22,760 --> 00:56:24,890 So we take the inverse of matrix A, 1027 00:56:24,890 --> 00:56:28,420 denoted by A with this superscript minus 1. 1028 00:56:28,420 --> 00:56:31,890 And that's the standard notation for identifying the inverse. 1029 00:56:31,890 --> 00:56:34,220 It has the property that A inverse times 1030 00:56:34,220 --> 00:56:37,840 A equals the identity matrix. 1031 00:56:37,840 --> 00:56:39,780 So you can sort of think of A inverse 1032 00:56:39,780 --> 00:56:45,090 as the identity matrix divided by A. Anyway, don't really 1033 00:56:45,090 --> 00:56:47,580 think of it like that. 1034 00:56:47,580 --> 00:56:51,270 So to solve this system of equations Ax equals c, 1035 00:56:51,270 --> 00:56:56,420 we multiply both sides by that A inverse matrix. 1036 00:56:56,420 --> 00:56:58,130 And so that looks like this-- 1037 00:56:58,130 --> 00:57:03,240 A inverse A times x equals A inverse c. 1038 00:57:03,240 --> 00:57:05,790 A inverse A is just what? 1039 00:57:05,790 --> 00:57:10,920 The identity matrix times x equals A inverse c. 1040 00:57:10,920 --> 00:57:14,100 And we just saw before that the identity matrix times x 1041 00:57:14,100 --> 00:57:15,930 is just x. 1042 00:57:15,930 --> 00:57:18,240 All right, so there's the solution 1043 00:57:18,240 --> 00:57:24,140 to this system of equations. 1044 00:57:24,140 --> 00:57:25,640 All right, any questions about that? 1045 00:57:30,220 --> 00:57:33,000 So how do you find the inverse of a matrix? 1046 00:57:33,000 --> 00:57:34,650 What is this A inverse? 1047 00:57:34,650 --> 00:57:37,900 How do you get it in real life? 1048 00:57:37,900 --> 00:57:40,590 So in real life, what you usually do is 1049 00:57:40,590 --> 00:57:44,250 you would just use the matrix inverse function in MATLAB.
1050 00:57:44,250 --> 00:57:47,520 Because for any matrices other than a two-by-two, 1051 00:57:47,520 --> 00:57:50,160 it's really annoying to get a matrix inverse. 1052 00:57:50,160 --> 00:57:52,800 But for a two-by-two matrix, it's actually pretty easy. 1053 00:57:52,800 --> 00:57:56,340 You can almost just get the answer by looking at the matrix 1054 00:57:56,340 --> 00:57:58,110 and writing down the inverse. 1055 00:57:58,110 --> 00:57:59,530 It looks like this. 1056 00:57:59,530 --> 00:58:03,360 The inverse of a two-by-two square matrix is just given 1057 00:58:03,360 --> 00:58:06,970 by a slight reordering of the coefficients, 1058 00:58:06,970 --> 00:58:09,600 of the entries of that matrix, divided by what's called 1059 00:58:09,600 --> 00:58:14,100 the determinant of A. So what you do is you flip-- 1060 00:58:14,100 --> 00:58:18,090 in a two-by-two matrix, you flip the a and the d, 1061 00:58:18,090 --> 00:58:24,990 and then you multiply the off-diagonal elements, b and c, by minus 1. 1062 00:58:24,990 --> 00:58:26,640 Now, what is this determinant? 1063 00:58:26,640 --> 00:58:33,060 The determinant is given by a times d minus b times c. 1064 00:58:33,060 --> 00:58:35,530 And you can prove that that actually 1065 00:58:35,530 --> 00:58:39,940 is the inverse, because if we take this and multiply it by A, 1066 00:58:39,940 --> 00:58:43,450 what you find when you multiply that out is that that's just 1067 00:58:43,450 --> 00:58:48,370 equal to the identity matrix. 1068 00:58:48,370 --> 00:58:52,060 So a matrix has an inverse if and only 1069 00:58:52,060 --> 00:58:55,360 if the determinant is not equal to zero. 1070 00:58:55,360 --> 00:58:57,220 If the determinant is equal to zero, 1071 00:58:57,220 --> 00:58:59,260 you can see that this thing blows up, 1072 00:58:59,260 --> 00:59:02,250 and there's no inverse. 1073 00:59:02,250 --> 00:59:04,510 We're going to spend a little bit of time 1074 00:59:04,510 --> 00:59:07,630 later talking about what that means when a matrix has 1075 00:59:07,630 --> 00:59:11,110 an inverse and what the determinant actually 1076 00:59:11,110 --> 00:59:18,920 corresponds to in a matrix multiplication context. 1077 00:59:18,920 --> 00:59:20,870 If the determinant is equal to zero, 1078 00:59:20,870 --> 00:59:24,260 we say that that matrix is singular. 1079 00:59:24,260 --> 00:59:27,710 And in that case, you can't actually find an inverse, 1080 00:59:27,710 --> 00:59:32,240 and you can't solve this equation right here, 1081 00:59:32,240 --> 00:59:33,950 this system of equations. 1082 00:59:38,720 --> 00:59:42,600 All right, so let's actually go through this example. 1083 00:59:42,600 --> 00:59:45,530 So here's our equation, Ax equals c. 1084 00:59:45,530 --> 00:59:47,780 We're going to use the same matrix we had before 1085 00:59:47,780 --> 00:59:50,210 and the same c. 1086 00:59:50,210 --> 00:59:52,910 The determinant is just the product 1087 00:59:52,910 --> 00:59:56,420 of those minus the product of those, so 1 minus negative 6. 1088 00:59:56,420 --> 00:59:58,550 So the determinant is 7. 1089 00:59:58,550 --> 01:00:01,410 So there is an inverse of this matrix. 1090 01:00:01,410 --> 01:00:03,810 And we can just write that down as follows. 1091 01:00:03,810 --> 01:00:05,990 Again, we've flipped the two diagonal entries and multiplied 1092 01:00:05,990 --> 01:00:07,850 the off-diagonal entries by minus 1. 1093 01:00:07,850 --> 01:00:13,550 So we can solve for x just by taking that inverse times c, 1094 01:00:13,550 --> 01:00:15,920 A inverse times c.
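Here is the same worked example as a MATLAB sketch. In practice you would let MATLAB compute the inverse, or better, use the backslash operator, rather than applying the two-by-two formula by hand.

```matlab
A = [1 -2;
     3  1];        % coefficients of x and y in the two equations
c = [3; 5];        % right-hand side

d = det(A);        % 1*1 - (-2)*3 = 7, nonzero, so the inverse exists

Ainv = inv(A);     % (1/7) * [1 2; -3 1]
x = Ainv * c;      % [13/7; -4/7]

x2 = A \ c;        % same solution, computed without forming inv(A) explicitly
```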
1095 01:00:15,920 --> 01:00:17,840 And if you multiply that out, you 1096 01:00:17,840 --> 01:00:19,418 see that there's the answer, x. 1097 01:00:19,418 --> 01:00:20,210 It's just a vector. 1098 01:00:24,680 --> 01:00:26,110 That's it. 1099 01:00:26,110 --> 01:00:31,400 That's how you solve a system of equations, all right? 1100 01:00:31,400 --> 01:00:33,970 Any questions about that? 1101 01:00:33,970 --> 01:00:43,590 So this process of solving systems of equations 1102 01:00:43,590 --> 01:00:49,250 and using matrices and their inverses 1103 01:00:49,250 --> 01:00:53,840 to solve systems of equations is a very important concept 1104 01:00:53,840 --> 01:00:55,820 that we're going to use over and over again. 1105 01:00:58,910 --> 01:01:01,040 All right, let's turn to the topic 1106 01:01:01,040 --> 01:01:03,660 of matrix transformations. 1107 01:01:03,660 --> 01:01:06,710 All right, so you can see from this problem of solving 1108 01:01:06,710 --> 01:01:12,100 this system of equations that that matrix A transformed 1109 01:01:12,100 --> 01:01:15,050 a vector x into a vector c, OK? 1110 01:01:15,050 --> 01:01:21,290 So we have this vector x, which was the vector 13/7, minus 4/7. 1111 01:01:21,290 --> 01:01:26,940 When we multiplied that by A, we got another vector, c. 1112 01:01:30,730 --> 01:01:34,960 And the matrix A inverse transforms this vector 1113 01:01:34,960 --> 01:01:38,320 c back into vector x, right? 1114 01:01:38,320 --> 01:01:44,170 So we can take that vector c, multiply it by A inverse, 1115 01:01:44,170 --> 01:01:46,420 and get back to x. 1116 01:01:46,420 --> 01:01:49,340 Does that make sense? 1117 01:01:49,340 --> 01:01:56,480 So, in general, a matrix A maps a set 1118 01:01:56,480 --> 01:01:59,630 of vectors in this whole space. 1119 01:01:59,630 --> 01:02:01,730 So if you have a two-by-two matrix, 1120 01:02:01,730 --> 01:02:08,620 it maps a set of vectors in R2 onto a different set 1121 01:02:08,620 --> 01:02:10,540 of vectors in R2. 1122 01:02:10,540 --> 01:02:12,820 So you can take any vector here-- 1123 01:02:12,820 --> 01:02:16,360 a vector from the origin into here-- 1124 01:02:16,360 --> 01:02:18,460 multiply that vector by A, and it gives you 1125 01:02:18,460 --> 01:02:20,800 a different vector. 1126 01:02:20,800 --> 01:02:23,220 And if you multiply that other vector by A inverse, 1127 01:02:23,220 --> 01:02:27,990 you go back to the original vector. 1128 01:02:27,990 --> 01:02:31,860 So this matrix A implements some kind 1129 01:02:31,860 --> 01:02:36,560 of transformation on this space of vectors 1130 01:02:36,560 --> 01:02:42,120 into a different space of vectors, OK? 1131 01:02:42,120 --> 01:02:46,120 And you can only do this inverse if the determinant of A 1132 01:02:46,120 --> 01:02:47,250 is not equal to zero. 1133 01:02:51,060 --> 01:02:55,260 So I just want to show you what different kinds of matrix 1134 01:02:55,260 --> 01:02:56,560 transformations look like. 1135 01:03:00,980 --> 01:03:04,810 So let's start with the simplest matrix transformation-- 1136 01:03:04,810 --> 01:03:06,260 the identity matrix. 1137 01:03:06,260 --> 01:03:09,130 So if we take a vector x, multiply it 1138 01:03:09,130 --> 01:03:12,710 by the identity matrix, you get another vector y, 1139 01:03:12,710 --> 01:03:15,350 which is equal to x.
1140 01:03:15,350 --> 01:03:18,650 So what we're going to do is we're going to kind of riff off 1141 01:03:18,650 --> 01:03:21,980 of a theme here, and we're going to take 1142 01:03:21,980 --> 01:03:26,400 slight perturbations of the identity matrix 1143 01:03:26,400 --> 01:03:30,990 and see what that new matrix does to a set of input vectors, 1144 01:03:30,990 --> 01:03:31,490 OK? 1145 01:03:31,490 --> 01:03:33,407 So let me show you how we're going to do that. 1146 01:03:33,407 --> 01:03:37,050 We're going to take the identity matrix 1, 0, 0, 1. 1147 01:03:37,050 --> 01:03:39,020 And we're going to add a little perturbation 1148 01:03:39,020 --> 01:03:40,085 to the diagonal elements. 1149 01:03:43,900 --> 01:03:47,700 And we're going to see what that does to a set of input vectors. 1150 01:03:47,700 --> 01:03:49,810 So let me show you what we're doing here. 1151 01:03:49,810 --> 01:03:51,540 We have each one of these red dots. 1152 01:03:51,540 --> 01:03:58,410 So what I did was I generated a bunch of random numbers 1153 01:03:58,410 --> 01:03:59,430 in a 2D space. 1154 01:03:59,430 --> 01:04:01,230 So this is a 2D space. 1155 01:04:01,230 --> 01:04:03,330 And I just randomly selected a bunch 1156 01:04:03,330 --> 01:04:07,320 of numbers, a bunch of points on that plane. 1157 01:04:07,320 --> 01:04:11,140 And each one of those is an input vector x. 1158 01:04:11,140 --> 01:04:13,360 And then I multiplied that vector 1159 01:04:13,360 --> 01:04:18,100 times this slightly perturbed identity matrix. 1160 01:04:22,030 --> 01:04:24,270 And then I get a bunch of output vectors y. 1161 01:04:24,270 --> 01:04:26,850 Input vectors x are the red dots. 1162 01:04:26,850 --> 01:04:31,800 The output vectors y are the other end of this blue line. 1163 01:04:31,800 --> 01:04:32,860 Does that make sense? 1164 01:04:32,860 --> 01:04:39,600 So for every vector x, multiplying it by this matrix 1165 01:04:39,600 --> 01:04:43,630 gives me another vector that's over here. 1166 01:04:43,630 --> 01:04:44,930 Does that make sense? 1167 01:04:44,930 --> 01:04:49,150 So you can see that what this matrix does 1168 01:04:49,150 --> 01:04:52,600 is it takes this space, this cloud of points, 1169 01:04:52,600 --> 01:04:56,900 and stretches them equally in all directions. 1170 01:04:56,900 --> 01:05:00,760 So it takes any vector and just makes it longer, 1171 01:05:00,760 --> 01:05:02,200 stretches it out. 1172 01:05:02,200 --> 01:05:04,240 No matter which direction it's pointing, 1173 01:05:04,240 --> 01:05:06,210 it just makes that vector slightly longer. 1174 01:05:09,510 --> 01:05:11,070 And here's that little bit of code 1175 01:05:11,070 --> 01:05:17,670 that I used to generate those vectors-- a sketch along those lines appears below. 1176 01:05:17,670 --> 01:05:19,310 OK, so let's take another example. 1177 01:05:19,310 --> 01:05:21,640 Let's say that we take the identity matrix 1178 01:05:21,640 --> 01:05:26,020 and we just add a little perturbation to one element 1179 01:05:26,020 --> 01:05:29,290 of the identity matrix, OK? 1180 01:05:29,290 --> 01:05:30,580 So what does that do? 1181 01:05:30,580 --> 01:05:37,400 It stretches the vectors out in the x direction, 1182 01:05:37,400 --> 01:05:40,540 but it doesn't do anything to the y direction. 1183 01:05:40,540 --> 01:05:45,200 So for a vector with a component in the x direction, 1184 01:05:45,200 --> 01:05:51,250 the x component gets increased by a factor of 1 plus delta.
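The code on the slide is not reproduced in the transcript, so here is a sketch along the same lines; details like the number of points and the size of delta are guesses. It generates random 2-D input vectors, multiplies each one by a perturbed identity matrix, and draws a line from each input point to its transformed output.

```matlab
npts = 50;                     % number of random input vectors (a guess)
X = randn(2, npts);            % input vectors as the columns of a 2-by-npts matrix

delta = 0.3;                   % size of the perturbation (a guess)
A = (1 + delta) * eye(2);      % perturbed identity: stretch equally in all directions
% A = [1+delta 0; 0 1];        % stretch along x only
% A = [1 0; 0 1+delta];        % stretch along y only

Y = A * X;                     % output vectors

figure; hold on;
plot(X(1,:), X(2,:), 'r.');                       % input points (red dots)
plot([X(1,:); Y(1,:)], [X(2,:); Y(2,:)], 'b-');   % blue line from each input to its output
axis equal;
```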
1185 01:05:51,250 --> 01:05:55,390 The components of each of these vectors in the y direction 1186 01:05:55,390 --> 01:05:57,720 don't change, all right? 1187 01:05:57,720 --> 01:05:59,670 So we're going to take this cloud of points, 1188 01:05:59,670 --> 01:06:02,610 and we're going to stretch it in the x direction. 1189 01:06:02,610 --> 01:06:05,540 What about this matrix here? 1190 01:06:05,540 --> 01:06:06,843 What's that going to do? 1191 01:06:06,843 --> 01:06:08,510 AUDIENCE: Stretch it in the y direction. 1192 01:06:08,510 --> 01:06:09,260 MICHALE FEE: Good. 1193 01:06:09,260 --> 01:06:12,346 It's going to stretch it out in the y direction. 1194 01:06:12,346 --> 01:06:13,392 Good. 1195 01:06:13,392 --> 01:06:14,350 So that's kind of cute. 1196 01:06:19,000 --> 01:06:22,480 And you can see that this earlier matrix that we looked 1197 01:06:22,480 --> 01:06:27,560 at right here stretches in the x direction 1198 01:06:27,560 --> 01:06:29,270 and stretches in the y direction. 1199 01:06:29,270 --> 01:06:32,960 And that's why that cloud of vectors 1200 01:06:32,960 --> 01:06:35,750 just stretched out equally in all directions. 1201 01:06:40,340 --> 01:06:42,580 How about this? 1202 01:06:42,580 --> 01:06:44,404 What is that going to do? 1203 01:06:44,404 --> 01:06:46,864 AUDIENCE: It would stretch in the x direction and compress 1204 01:06:46,864 --> 01:06:47,850 in the y direction. 1205 01:06:47,850 --> 01:06:49,410 MICHALE FEE: Right. 1206 01:06:49,410 --> 01:06:52,500 This perturbation here is making this component, 1207 01:06:52,500 --> 01:06:54,990 the x component, larger. 1208 01:06:54,990 --> 01:06:58,860 This perturbation here-- and delta here is small. 1209 01:06:58,860 --> 01:07:00,100 It's less than one. 1210 01:07:00,100 --> 01:07:03,930 Here, it's making the y component smaller. 1211 01:07:03,930 --> 01:07:06,600 And so what that looks like is the y component of each one 1212 01:07:06,600 --> 01:07:08,530 of these vectors gets smaller. 1213 01:07:08,530 --> 01:07:10,740 The x component gets larger. 1214 01:07:10,740 --> 01:07:13,830 And so we're squeezing in one direction 1215 01:07:13,830 --> 01:07:17,740 and stretching in the other direction. 1216 01:07:17,740 --> 01:07:22,040 Imagine we took a block of sponge 1217 01:07:22,040 --> 01:07:23,735 and we grabbed it and stretched it out, 1218 01:07:23,735 --> 01:07:25,235 and it gets skinny in this direction 1219 01:07:25,235 --> 01:07:28,700 and stretches out in that direction. 1220 01:07:28,700 --> 01:07:30,050 All right, that's kind of cool. 1221 01:07:32,750 --> 01:07:36,060 What is this going to do? 1222 01:07:36,060 --> 01:07:38,910 Here, I'm not making a small perturbation of this, 1223 01:07:38,910 --> 01:07:42,880 but I'm flipping the sign of one of those. 1224 01:07:42,880 --> 01:07:43,870 What happens there? 1225 01:07:43,870 --> 01:07:44,970 What is that going to do? 1226 01:07:48,470 --> 01:07:50,400 AUDIENCE: [INAUDIBLE] 1227 01:07:50,400 --> 01:07:51,330 MICHALE FEE: Good. 1228 01:07:51,330 --> 01:07:54,240 What do we call that? 1229 01:07:54,240 --> 01:07:57,190 There's a term for it. 1230 01:07:57,190 --> 01:08:02,400 What do you-- yeah, it's called a mirror reflection. 1231 01:08:02,400 --> 01:08:07,340 So every point that's on this side of the origin 1232 01:08:07,340 --> 01:08:10,370 gets reflected over to this side of the origin. 1233 01:08:10,370 --> 01:08:12,020 And every point that's over here-- 1234 01:08:12,020 --> 01:08:13,490 sorry, of this axis.
1235 01:08:13,490 --> 01:08:15,980 Every point that's on this side of the y-axis 1236 01:08:15,980 --> 01:08:19,740 gets reflected over to this side. 1237 01:08:19,740 --> 01:08:23,410 So that's called a mirror reflection. 1238 01:08:23,410 --> 01:08:24,518 What is this? 1239 01:08:24,518 --> 01:08:25,560 What is that going to do? 1240 01:08:35,430 --> 01:08:35,930 Abiba? 1241 01:08:35,930 --> 01:08:38,567 AUDIENCE: Reflect it [INAUDIBLE].. 1242 01:08:38,567 --> 01:08:39,359 MICHALE FEE: Right. 1243 01:08:39,359 --> 01:08:43,399 It's going to reflect it through the origin, like this. 1244 01:08:43,399 --> 01:08:46,229 So every point that's over here, on one side of the origin, 1245 01:08:46,229 --> 01:08:50,270 is going to reflect through to the other side. 1246 01:08:50,270 --> 01:08:52,450 That's pretty neat. 1247 01:08:52,450 --> 01:08:54,660 Inversion of the origin. 1248 01:08:54,660 --> 01:08:56,870 OK? 1249 01:08:56,870 --> 01:08:59,460 So we have symmetric perturbations 1250 01:08:59,460 --> 01:09:04,300 in the x and y components of the identity matrix. 1251 01:09:04,300 --> 01:09:10,200 We have a stretch transformation that stretches along one axis, 1252 01:09:10,200 --> 01:09:12,149 but not the other. 1253 01:09:12,149 --> 01:09:17,130 Stretch around the other axis, the y-axis, but not the x-axis. 1254 01:09:17,130 --> 01:09:21,120 Stretch along x and compression along y. 1255 01:09:21,120 --> 01:09:24,990 Mirror reflection through the y-axis. 1256 01:09:24,990 --> 01:09:27,870 Inversion through the origin. 1257 01:09:27,870 --> 01:09:31,740 These are examples of diagonal matrices, OK? 1258 01:09:31,740 --> 01:09:34,180 So the only thing we've done so far-- 1259 01:09:34,180 --> 01:09:36,779 we've gotten all these really cool transformations, 1260 01:09:36,779 --> 01:09:38,970 but the only thing we've done so far 1261 01:09:38,970 --> 01:09:40,905 are change these two diagonal elements. 1262 01:09:43,779 --> 01:09:46,510 So there's a lot more crazy stuff 1263 01:09:46,510 --> 01:09:51,310 to happen if we start messing with the other components. 1264 01:09:51,310 --> 01:09:55,540 Oh, and I should mention that we can invert 1265 01:09:55,540 --> 01:10:01,060 any one of these transformations that we just did by finding 1266 01:10:01,060 --> 01:10:03,020 the inverse of this matrix. 1267 01:10:03,020 --> 01:10:06,805 The inverse of a diagonal matrix is very simple to calculate. 1268 01:10:06,805 --> 01:10:10,015 It's just one over those diagonal elements. 1269 01:10:13,470 --> 01:10:14,580 All right, how about this? 1270 01:10:17,868 --> 01:10:18,910 What is that going to do? 1271 01:10:18,910 --> 01:10:19,802 Anybody? 1272 01:10:28,970 --> 01:10:30,980 When you take a vector and you multiply it 1273 01:10:30,980 --> 01:10:33,290 by that, what's going to happen? 1274 01:10:33,290 --> 01:10:36,800 This part is going to give you the original vector back. 1275 01:10:36,800 --> 01:10:41,330 This part is going to take a little bit of the y component 1276 01:10:41,330 --> 01:10:45,630 and add it to the x component. 1277 01:10:45,630 --> 01:10:47,450 So what does that do? 1278 01:10:47,450 --> 01:10:50,340 That produces what's known as a shear. 1279 01:10:50,340 --> 01:10:53,340 So points up here, we're going to take 1280 01:10:53,340 --> 01:10:57,700 a little bit of the y component and add it to the x component. 1281 01:10:57,700 --> 01:11:00,300 So if something has a big y component, 1282 01:11:00,300 --> 01:11:04,242 it's going to be shifted in x. 
1283 01:11:04,242 --> 01:11:06,710 If something has a negative y component, 1284 01:11:06,710 --> 01:11:08,670 it's going to shift this way in x. 1285 01:11:08,670 --> 01:11:10,440 If something has a positive y component, 1286 01:11:10,440 --> 01:11:12,500 it's going to shift this way an x. 1287 01:11:12,500 --> 01:11:16,050 And it's going to produce what's called a shear. 1288 01:11:16,050 --> 01:11:20,100 So we're pushing these points this way, 1289 01:11:20,100 --> 01:11:21,630 pushing those points this way. 1290 01:11:25,230 --> 01:11:29,760 Shear is very important in things like the flow of liquid. 1291 01:11:29,760 --> 01:11:32,700 So when you have liquid flowing over a surface, 1292 01:11:32,700 --> 01:11:37,620 you have forces, frictional forces to the liquid down here 1293 01:11:37,620 --> 01:11:39,750 that prevent it from moving. 1294 01:11:39,750 --> 01:11:42,550 Liquid up here moves more quickly, 1295 01:11:42,550 --> 01:11:48,250 and it produces a shear in the pattern of velocity profiles. 1296 01:11:48,250 --> 01:11:50,560 OK, that's pretty cool. 1297 01:11:50,560 --> 01:11:52,150 What about this? 1298 01:11:56,520 --> 01:11:58,750 It's going to just produce a shear 1299 01:11:58,750 --> 01:12:00,380 along the other direction. 1300 01:12:00,380 --> 01:12:01,300 That's right. 1301 01:12:01,300 --> 01:12:03,250 So now components that have a-- 1302 01:12:03,250 --> 01:12:07,960 vectors that have a large x component acquire 1303 01:12:07,960 --> 01:12:10,900 a negative projection in y. 1304 01:12:17,160 --> 01:12:19,920 OK, what does this look like? 1305 01:12:19,920 --> 01:12:20,800 It's pretty cool. 1306 01:12:30,600 --> 01:12:36,680 We're going to get some shear in this direction, 1307 01:12:36,680 --> 01:12:39,860 get some shear in this direction. 1308 01:12:39,860 --> 01:12:42,137 What's it going to do? 1309 01:12:42,137 --> 01:12:46,630 AUDIENCE: [INAUDIBLE] 1310 01:12:46,630 --> 01:12:47,420 MICHALE FEE: Good. 1311 01:12:47,420 --> 01:12:48,950 Good guess. 1312 01:12:48,950 --> 01:12:52,840 That's exactly right, produces a rotation. 1313 01:12:52,840 --> 01:12:55,290 Not exactly a rotation, but very close. 1314 01:13:01,470 --> 01:13:04,980 So that's how you actually produce a rotation. 1315 01:13:04,980 --> 01:13:10,000 So notice, for small angles theta, these are close to one, 1316 01:13:10,000 --> 01:13:13,140 so it's close to an identity matrix. 1317 01:13:13,140 --> 01:13:17,090 These are close to zero, but this is negative 1318 01:13:17,090 --> 01:13:20,970 and this is positive, or the other way around. 1319 01:13:20,970 --> 01:13:27,560 So if we have diagonals close to one and the off-diagonals one 1320 01:13:27,560 --> 01:13:31,640 positive and one negative, then that produces a rotation. 1321 01:13:31,640 --> 01:13:33,760 That, formally, is a rotation matrix. 1322 01:13:33,760 --> 01:13:34,560 Yes? 1323 01:13:34,560 --> 01:13:36,580 AUDIENCE: On the previous slide, is there 1324 01:13:36,580 --> 01:13:39,858 a reason you chose to represent the delta on the x-axis as 1325 01:13:39,858 --> 01:13:40,860 negative? 1326 01:13:40,860 --> 01:13:41,780 MICHALE FEE: No. 1327 01:13:41,780 --> 01:13:42,720 It goes either way. 1328 01:13:42,720 --> 01:13:45,600 So if you have a rotation angle that's positive, 1329 01:13:45,600 --> 01:13:48,590 then this is negative and this is positive. 1330 01:13:48,590 --> 01:13:50,840 If your rotation angle is the other sign, 1331 01:13:50,840 --> 01:13:55,520 then this is positive and this is negative. 
1332 01:13:55,520 --> 01:14:00,260 So, for example, if we want to produce a 45-degree rotation, 1333 01:14:00,260 --> 01:14:04,820 then we have 1, 1, minus 1, 1. 1334 01:14:04,820 --> 01:14:07,040 And of course, all of those entries have a factor 1335 01:14:07,040 --> 01:14:10,003 of 1 over the square root of 2 in them. 1336 01:14:10,003 --> 01:14:11,170 And so that looks like this. 1337 01:14:11,170 --> 01:14:14,180 So if you have, let's say, theta equals 10 degrees, 1338 01:14:14,180 --> 01:14:17,960 we can produce a 10-degree rotation of all the vectors. 1339 01:14:17,960 --> 01:14:20,180 If theta is 25 degrees, you can see 1340 01:14:20,180 --> 01:14:23,220 that the rotation is further. 1341 01:14:23,220 --> 01:14:25,560 Theta 45, that's this case right here. 1342 01:14:25,560 --> 01:14:28,560 You can see that you get a 45-degree rotation of all 1343 01:14:28,560 --> 01:14:31,440 of those vectors around the origin. 1344 01:14:31,440 --> 01:14:37,850 And if theta is 90 degrees, you can see that, OK? 1345 01:14:37,850 --> 01:14:38,660 Pretty cool, right? 1346 01:14:42,700 --> 01:14:46,880 OK, what is the inverse of this rotation matrix? 1347 01:14:46,880 --> 01:14:50,620 So if we have a rotation-- oh, and I just 1348 01:14:50,620 --> 01:14:53,140 want to point out one more thing. 1349 01:14:53,140 --> 01:14:55,780 In this formulation of the rotation matrix, 1350 01:14:55,780 --> 01:15:00,970 positive angles correspond to rotating counterclockwise. 1351 01:15:03,560 --> 01:15:07,640 Negative angles correspond to rotation 1352 01:15:07,640 --> 01:15:09,920 in the clockwise direction, OK? 1353 01:15:09,920 --> 01:15:11,660 So there's a big hint. 1354 01:15:11,660 --> 01:15:17,230 What is the inverse of our rotation matrix? 1355 01:15:17,230 --> 01:15:22,940 If we have a rotation of 10 degrees this way, 1356 01:15:22,940 --> 01:15:24,960 what is the inverse of that? 1357 01:15:24,960 --> 01:15:26,738 AUDIENCE: [INAUDIBLE] 1358 01:15:26,738 --> 01:15:27,530 MICHALE FEE: Right. 1359 01:15:27,530 --> 01:15:28,910 AUDIENCE: [INAUDIBLE] 1360 01:15:28,910 --> 01:15:30,290 MICHALE FEE: That's right. 1361 01:15:30,290 --> 01:15:35,870 Remember, matrix multiplication implements a transformation. 1362 01:15:35,870 --> 01:15:38,450 The inverse of that transformation 1363 01:15:38,450 --> 01:15:41,420 just takes you back where you were. 1364 01:15:41,420 --> 01:15:44,810 So if you have a rotation matrix that implements 1365 01:15:44,810 --> 01:15:47,750 a 20-degree rotation in the plus direction, 1366 01:15:47,750 --> 01:15:51,710 then the inverse of that is a 20-degree rotation 1367 01:15:51,710 --> 01:15:53,120 in the minus direction. 1368 01:15:53,120 --> 01:15:55,000 So the inverse of this matrix you 1369 01:15:55,000 --> 01:15:58,830 can get just by putting in a minus sign into the theta. 1370 01:15:58,830 --> 01:16:01,580 And you can see that cosine of minus theta 1371 01:16:01,580 --> 01:16:03,200 is just cosine of theta. 1372 01:16:03,200 --> 01:16:06,500 But sine of minus theta is negative sine of theta. 1373 01:16:09,770 --> 01:16:13,100 So the inverse of this matrix is just this. 1374 01:16:13,100 --> 01:16:15,450 You change the sign of those off-diagonal sine terms, 1375 01:16:15,450 --> 01:16:19,620 which just makes the shear go in the opposite direction, right? 1376 01:16:23,400 --> 01:16:26,680 OK, so a rotation by angle plus theta 1377 01:16:26,680 --> 01:16:29,590 followed by a rotation of angle minus theta 1378 01:16:29,590 --> 01:16:31,300 puts everything back where it was.
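Here is a short MATLAB check of these rotation-matrix facts, using the standard convention in which positive theta rotates counterclockwise (the 25-degree angle is just an example): the inverse of a rotation by theta is the rotation by minus theta, which is the same thing as the transpose.

```matlab
th = 25 * pi/180;                 % rotation angle in radians
R = [cos(th) -sin(th);
     sin(th)  cos(th)];           % rotates vectors counterclockwise by th

Rneg = [cos(-th) -sin(-th);
        sin(-th)  cos(-th)];      % rotation by -th

norm(inv(R) - Rneg)               % ~0: the inverse is the rotation the other way
norm(R' * R - eye(2))             % ~0: the transpose is also the inverse
```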
1379 01:16:31,300 --> 01:16:37,590 So rotation matrix phi of minus theta times phi of theta 1380 01:16:37,590 --> 01:16:39,370 is equal to the identity matrix. 1381 01:16:39,370 --> 01:16:41,790 So those two are inverses of each other. 1382 01:16:44,410 --> 01:16:47,860 And the inverse of a-- notice that the inverse 1383 01:16:47,860 --> 01:16:51,850 of this rotation matrix is also just the transpose 1384 01:16:51,850 --> 01:16:52,930 of the rotation matrix. 1385 01:16:56,550 --> 01:16:58,190 All right, so what you can see is 1386 01:16:58,190 --> 01:17:03,170 that these different cool transformations 1387 01:17:03,170 --> 01:17:07,490 that these matrix multiplications can do 1388 01:17:07,490 --> 01:17:11,870 are just examples of what our feed-forward network can do. 1389 01:17:11,870 --> 01:17:13,460 Because the feed-forward network 1390 01:17:13,460 --> 01:17:16,380 just implements matrix multiplication. 1391 01:17:16,380 --> 01:17:18,950 So this feed-forward network takes 1392 01:17:18,950 --> 01:17:21,890 a set of vectors, a set of input vectors, 1393 01:17:21,890 --> 01:17:26,060 and transforms them into a set of output vectors, all right? 1394 01:17:26,060 --> 01:17:29,510 And you can understand what that transformation does just 1395 01:17:29,510 --> 01:17:32,060 by understanding the different kinds of transformations 1396 01:17:32,060 --> 01:17:37,550 you can get from matrix multiplication. 1397 01:17:37,550 --> 01:17:40,390 All right, we'll continue next time.