1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:21,632 --> 00:00:22,590 JULIAN SHUN: All right. 9 00:00:22,590 --> 00:00:25,920 So we've talked a little bit about caching before, 10 00:00:25,920 --> 00:00:30,330 but today we're going to talk in much more detail about caching 11 00:00:30,330 --> 00:00:34,680 and how to design cache-efficient algorithms. 12 00:00:34,680 --> 00:00:38,070 So first, let's look at the caching hardware 13 00:00:38,070 --> 00:00:41,830 on modern machines today. 14 00:00:41,830 --> 00:00:43,710 So here's what the cache hierarchy looks 15 00:00:43,710 --> 00:00:46,140 like for a multicore chip. 16 00:00:46,140 --> 00:00:49,310 We have a whole bunch of processors. 17 00:00:49,310 --> 00:00:53,040 They all have their own private L1 caches 18 00:00:53,040 --> 00:00:56,220 for both data and instructions. 19 00:00:56,220 --> 00:00:58,050 They also have a private L2 cache. 20 00:00:58,050 --> 00:01:01,480 And then they share a last level cache, or L3 cache, 21 00:01:01,480 --> 00:01:05,129 which is also called LLC. 22 00:01:05,129 --> 00:01:07,080 They're all connected to a memory controller 23 00:01:07,080 --> 00:01:09,480 that can access DRAM. 24 00:01:09,480 --> 00:01:12,810 And then, oftentimes, you'll have multiple chips 25 00:01:12,810 --> 00:01:16,710 on the same server, and these chips 26 00:01:16,710 --> 00:01:18,880 would be connected through a network. 
27 00:01:18,880 --> 00:01:20,910 So here we have a bunch of multicore chips 28 00:01:20,910 --> 00:01:24,130 that are connected together. 29 00:01:24,130 --> 00:01:27,300 So we can see that there are different levels of memory 30 00:01:27,300 --> 00:01:30,160 here. 31 00:01:30,160 --> 00:01:32,520 And the sizes of each one of these levels of memory 32 00:01:32,520 --> 00:01:33,750 are different. 33 00:01:33,750 --> 00:01:36,690 So the sizes tend to go up as you move up 34 00:01:36,690 --> 00:01:39,480 the memory hierarchy. 35 00:01:39,480 --> 00:01:44,970 The L1 caches tend to be about 32 kilobytes. 36 00:01:44,970 --> 00:01:47,327 In fact, these are the specifications for the machines 37 00:01:47,327 --> 00:01:48,660 that you're using in this class. 38 00:01:48,660 --> 00:01:51,660 So 32 kilobytes for both the L1 data cache 39 00:01:51,660 --> 00:01:54,540 and the L1 instruction cache. 40 00:01:54,540 --> 00:01:57,580 256 kilobytes for the L2 cache. 41 00:01:57,580 --> 00:02:01,200 So the L2 cache tends to be about 8 to 10 times 42 00:02:01,200 --> 00:02:03,570 larger than the L1 cache. 43 00:02:03,570 --> 00:02:06,790 And then the last level cache, the size is 30 megabytes. 44 00:02:06,790 --> 00:02:10,610 So this is typically on the order of tens of megabytes. 45 00:02:10,610 --> 00:02:14,250 And then DRAM is on the order of gigabytes. 46 00:02:14,250 --> 00:02:18,320 So here we have 128 gigabytes of DRAM. 47 00:02:18,320 --> 00:02:21,480 And nowadays, you can actually get machines 48 00:02:21,480 --> 00:02:25,440 that have terabytes of DRAM. 49 00:02:25,440 --> 00:02:29,880 So the associativity tends to go up as you move up 50 00:02:29,880 --> 00:02:30,780 the cache hierarchy. 51 00:02:30,780 --> 00:02:32,970 And I'll talk more about associativity 52 00:02:32,970 --> 00:02:34,980 on the next couple of slides. 53 00:02:34,980 --> 00:02:37,800 The time to access the memory also tends to go up. 
54 00:02:37,800 --> 00:02:39,870 So the latency tends to go up as you move up 55 00:02:39,870 --> 00:02:41,020 the memory hierarchy. 56 00:02:41,020 --> 00:02:44,490 So the L1 caches are the quickest to access, 57 00:02:44,490 --> 00:02:48,270 about two nanoseconds, just rough numbers. 58 00:02:48,270 --> 00:02:50,380 The L2 cache is a little bit slower-- 59 00:02:50,380 --> 00:02:52,810 so say four nanoseconds. 60 00:02:52,810 --> 00:02:55,410 Last level cache, maybe six nanoseconds. 61 00:02:55,410 --> 00:02:57,240 And then when you have to go to DRAM, 62 00:02:57,240 --> 00:03:00,930 it's about an order of magnitude slower-- so 50 nanoseconds 63 00:03:00,930 --> 00:03:03,280 in this example. 64 00:03:03,280 --> 00:03:09,420 And the reason why the memories further down in the cache 65 00:03:09,420 --> 00:03:11,070 hierarchy are faster is because they're 66 00:03:11,070 --> 00:03:14,650 using more expensive materials to manufacture these things. 67 00:03:14,650 --> 00:03:18,120 But since they tend to be more expensive, we can't fit as much 68 00:03:18,120 --> 00:03:19,720 of that on the machines. 69 00:03:19,720 --> 00:03:22,620 So that's why the faster memories are smaller 70 00:03:22,620 --> 00:03:24,690 than the slower memories. 71 00:03:24,690 --> 00:03:26,880 But if we're able to take advantage of locality 72 00:03:26,880 --> 00:03:31,167 in our programs, then we can make use of the fast memory 73 00:03:31,167 --> 00:03:32,000 as much as possible. 74 00:03:32,000 --> 00:03:36,730 And we'll talk about ways to do that in this lecture today. 75 00:03:36,730 --> 00:03:39,000 There's also the latency across the network, which 76 00:03:39,000 --> 00:03:42,660 tends to be cheaper than going to main memory 77 00:03:42,660 --> 00:03:47,475 but slower than doing a last level cache access. 
78 00:03:50,520 --> 00:03:52,410 And there's a lot of work in trying 79 00:03:52,410 --> 00:03:55,770 to get the cache coherence protocols right, as we 80 00:03:55,770 --> 00:03:56,860 mentioned before. 81 00:03:56,860 --> 00:03:59,730 So since these processors all have private caches, 82 00:03:59,730 --> 00:04:01,200 we need to make sure that they all 83 00:04:01,200 --> 00:04:03,510 see a consistent view of memory when 84 00:04:03,510 --> 00:04:05,670 they're trying to access the same memory 85 00:04:05,670 --> 00:04:08,290 addresses in parallel. 86 00:04:08,290 --> 00:04:11,340 So we talked about the MSI cache protocol before. 87 00:04:11,340 --> 00:04:13,500 And there are many other protocols out there, 88 00:04:13,500 --> 00:04:16,510 and you can read more about these things online. 89 00:04:16,510 --> 00:04:18,730 But these are very hard to get right, 90 00:04:18,730 --> 00:04:20,700 and there's a lot of verification involved 91 00:04:20,700 --> 00:04:23,110 in trying to prove that the cache coherence protocols are 92 00:04:23,110 --> 00:04:23,610 correct. 93 00:04:27,490 --> 00:04:29,050 So any questions so far? 94 00:04:33,600 --> 00:04:34,100 OK. 95 00:04:34,100 --> 00:04:38,210 So let's talk about the associativity of a cache. 96 00:04:38,210 --> 00:04:41,690 So here I'm showing you a fully associative cache. 97 00:04:41,690 --> 00:04:43,700 And in a fully associative cache, 98 00:04:43,700 --> 00:04:47,060 a cache block can reside anywhere in the cache. 99 00:04:47,060 --> 00:04:50,760 And a basic unit of movement here is a cache block. 100 00:04:50,760 --> 00:04:53,750 In this example, the cache block size is 4 bytes, 101 00:04:53,750 --> 00:04:57,050 but on the machines that we're using for this class, 102 00:04:57,050 --> 00:05:00,110 the cache block size is 64 bytes. 103 00:05:00,110 --> 00:05:04,470 But for this example, I'm going to use a four byte cache line. 104 00:05:04,470 --> 00:05:07,160 So each row here corresponds to one cache line. 
105 00:05:07,160 --> 00:05:10,310 And a fully associative cache means that each line here 106 00:05:10,310 --> 00:05:13,225 can go anywhere in the cache. 107 00:05:13,225 --> 00:05:14,600 And then here we're also assuming 108 00:05:14,600 --> 00:05:17,420 a cache size of 32 bytes. 109 00:05:17,420 --> 00:05:19,450 So, in total, it can store eight cache lines, 110 00:05:19,450 --> 00:05:21,245 since each cache line is 4 bytes. 111 00:05:24,970 --> 00:05:28,840 So to find a block in a fully associative cache, 112 00:05:28,840 --> 00:05:30,970 you have to actually search the entire cache, 113 00:05:30,970 --> 00:05:35,740 because a cache line can appear anywhere in the cache. 114 00:05:35,740 --> 00:05:38,860 And there's a tag associated with each of these cache lines 115 00:05:38,860 --> 00:05:42,610 here that basically specifies which 116 00:05:42,610 --> 00:05:45,670 of the memory addresses in virtual memory space 117 00:05:45,670 --> 00:05:47,740 it corresponds to. 118 00:05:47,740 --> 00:05:49,440 So for the fully associative cache, 119 00:05:49,440 --> 00:05:51,940 we're actually going to use most of the bits of that address 120 00:05:51,940 --> 00:05:53,160 as a tag. 121 00:05:53,160 --> 00:05:54,910 We don't actually need the two lower order 122 00:05:54,910 --> 00:05:56,980 bits, because the things are being 123 00:05:56,980 --> 00:06:00,010 moved at the granularity of cache lines, which 124 00:06:00,010 --> 00:06:00,670 are four bytes. 125 00:06:00,670 --> 00:06:03,190 So the two lower order bits are always going to be the same, 126 00:06:03,190 --> 00:06:05,560 but we're just going to use the rest of the bits 127 00:06:05,560 --> 00:06:07,070 to store the tag. 128 00:06:07,070 --> 00:06:09,640 So if our address space is 64 bits, 129 00:06:09,640 --> 00:06:12,550 then we're going to use 62 bits to store the tag in a fully 130 00:06:12,550 --> 00:06:14,800 associative caching scheme. 
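As a quick check on the tag arithmetic above, here is a minimal sketch using the lecture's toy numbers (4-byte lines, 64-bit addresses):

```python
w = 64  # bits in a virtual address
B = 4   # bytes per cache line in this toy example

# The low log2(B) bits select a byte within the line; in a fully
# associative cache, all of the remaining bits form the tag.
offset_bits = (B - 1).bit_length()  # log2(4) = 2
tag_bits = w - offset_bits          # 64 - 2 = 62
```

With the 64-byte lines on the class machines, the same arithmetic would give a 58-bit tag instead.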
131 00:06:14,800 --> 00:06:18,010 And when a cache becomes full, a block 132 00:06:18,010 --> 00:06:22,000 has to be evicted to make room for a new block. 133 00:06:22,000 --> 00:06:24,790 And there are various ways that you can 134 00:06:24,790 --> 00:06:26,660 decide how to evict a block. 135 00:06:26,660 --> 00:06:29,260 So this is known as the replacement policy. 136 00:06:29,260 --> 00:06:32,820 One common replacement policy is LRU, or Least Recently Used. 137 00:06:32,820 --> 00:06:34,720 So you basically kick the thing out that 138 00:06:34,720 --> 00:06:39,020 has been used the farthest in the past. 139 00:06:39,020 --> 00:06:41,980 There are other schemes, such as second chance and clock 140 00:06:41,980 --> 00:06:44,020 replacement, but we're not going to talk 141 00:06:44,020 --> 00:06:47,080 too much about the different replacement schemes today. 142 00:06:47,080 --> 00:06:50,780 But you can feel free to read about these things online. 143 00:06:53,470 --> 00:06:55,450 So what's a disadvantage of this scheme? 144 00:07:05,170 --> 00:07:05,670 Yes? 145 00:07:05,670 --> 00:07:07,270 AUDIENCE: It's slow. 146 00:07:07,270 --> 00:07:08,020 JULIAN SHUN: Yeah. 147 00:07:08,020 --> 00:07:08,830 Why is it slow? 148 00:07:08,830 --> 00:07:12,440 AUDIENCE: Because you have to go all the way [INAUDIBLE]. 149 00:07:12,440 --> 00:07:13,190 JULIAN SHUN: Yeah. 150 00:07:13,190 --> 00:07:15,590 So the disadvantage is that searching 151 00:07:15,590 --> 00:07:18,200 for a cache line in the cache can be pretty slow, because you 152 00:07:18,200 --> 00:07:21,380 have to search the entire cache in the worst case, 153 00:07:21,380 --> 00:07:25,370 since a cache block can reside anywhere in the cache. 154 00:07:25,370 --> 00:07:28,010 So even though the search can go on in parallel in hardware, 155 00:07:28,010 --> 00:07:31,340 it's still expensive in terms of power and performance 156 00:07:31,340 --> 00:07:35,030 to have to search most of the cache every time. 
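The LRU policy described above can be sketched in a few lines. This is a toy software model of a fully associative cache, not how hardware implements it (real hardware typically uses cheaper approximations of LRU):

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully associative cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> cached line (order = recency)

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting LRU if full)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False
```

For example, with a capacity of two lines, accessing tags a, b, a, c evicts b, so a later access to b misses again.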
157 00:07:35,030 --> 00:07:37,580 So let's look at another extreme. 158 00:07:37,580 --> 00:07:40,010 This is a direct mapped cache. 159 00:07:40,010 --> 00:07:42,860 So in a direct mapped cache, each cache block 160 00:07:42,860 --> 00:07:45,690 can only go in one place in the cache. 161 00:07:45,690 --> 00:07:48,890 So I've color-coded these cache blocks here. 162 00:07:48,890 --> 00:07:53,990 So the red blocks can only go in the first row of this cache, 163 00:07:53,990 --> 00:07:57,030 the orange ones can only go in the second row, and so on. 164 00:08:00,380 --> 00:08:06,110 And the position which a cache block can go into 165 00:08:06,110 --> 00:08:09,140 is known as that cache block's set. 166 00:08:09,140 --> 00:08:11,240 So the set determines the location 167 00:08:11,240 --> 00:08:14,480 in the cache for each particular block. 168 00:08:14,480 --> 00:08:19,927 So let's look at how the virtual memory address is divided up, 169 00:08:19,927 --> 00:08:21,510 and which of the bits we're going 170 00:08:21,510 --> 00:08:24,380 to use to figure out where a cache block should 171 00:08:24,380 --> 00:08:25,610 go in the cache. 172 00:08:25,610 --> 00:08:29,450 So we have the offset, we have the set, 173 00:08:29,450 --> 00:08:31,820 and then the tag fields. 174 00:08:31,820 --> 00:08:35,179 The offset just tells us which position 175 00:08:35,179 --> 00:08:37,669 we want to access within a cache block. 176 00:08:37,669 --> 00:08:40,010 So since a cache block has B bytes, 177 00:08:40,010 --> 00:08:43,850 we only need log base 2 of B bits as the offset. 178 00:08:43,850 --> 00:08:45,350 And the reason why we have the offset 179 00:08:45,350 --> 00:08:47,383 is because we're not always accessing something 180 00:08:47,383 --> 00:08:48,800 at the beginning of a cache block. 181 00:08:48,800 --> 00:08:50,800 We might want to access something in the middle. 
182 00:08:50,800 --> 00:08:52,190 And that's why we need the offset 183 00:08:52,190 --> 00:08:54,755 to specify where in the cache block we want to access. 184 00:08:57,730 --> 00:08:59,530 Then there's a set field. 185 00:08:59,530 --> 00:09:05,020 And the set field is going to determine which position 186 00:09:05,020 --> 00:09:08,110 in the cache that cache block can go into. 187 00:09:08,110 --> 00:09:12,790 So there are eight possible positions for each cache block. 188 00:09:12,790 --> 00:09:16,240 And therefore, we only need log base 2 of 8 bits-- 189 00:09:16,240 --> 00:09:19,120 so three bits for the set in this example. 190 00:09:19,120 --> 00:09:23,200 And more generally, it's going to be log base 2 of M over B. 191 00:09:23,200 --> 00:09:25,815 And here, M over B is 8. 192 00:09:25,815 --> 00:09:27,940 And then, finally, we're going to use the remaining 193 00:09:27,940 --> 00:09:29,030 bits as a tag. 194 00:09:29,030 --> 00:09:32,800 So w minus log base 2 of M bits for the tag. 195 00:09:32,800 --> 00:09:36,250 And that gets stored along with the cache block in the cache. 196 00:09:36,250 --> 00:09:39,070 And that's going to uniquely identify 197 00:09:39,070 --> 00:09:44,560 which of the memory blocks the cache block corresponds to 198 00:09:44,560 --> 00:09:47,430 in virtual memory. 199 00:09:47,430 --> 00:09:53,110 And you can verify that the sum of all these quantities 200 00:09:53,110 --> 00:09:55,190 here sums to w bits. 201 00:09:55,190 --> 00:09:58,120 So in total, we have a w bit address space. 202 00:09:58,120 --> 00:09:59,863 And the sum of those three things is w. 203 00:10:03,034 --> 00:10:06,880 So what's the advantage and disadvantage of this scheme? 204 00:10:16,880 --> 00:10:19,760 So first, what's a good thing about this scheme compared 205 00:10:19,760 --> 00:10:21,990 to the previous scheme that we saw? 206 00:10:21,990 --> 00:10:22,490 Yes? 207 00:10:22,490 --> 00:10:23,250 AUDIENCE: Faster. 
208 00:10:23,250 --> 00:10:24,000 JULIAN SHUN: Yeah. 209 00:10:24,000 --> 00:10:26,240 It's fast because you only have to check one place. 210 00:10:26,240 --> 00:10:27,620 Because each cache block can only 211 00:10:27,620 --> 00:10:30,410 go in one place in the cache, and that's the only place 212 00:10:30,410 --> 00:10:32,810 you have to check when you try to do a lookup. 213 00:10:32,810 --> 00:10:34,740 If the cache block is there, then you find it. 214 00:10:34,740 --> 00:10:38,750 If it's not, then you know it's not in the cache. 215 00:10:38,750 --> 00:10:42,020 What's the downside to this scheme? 216 00:10:42,020 --> 00:10:42,520 Yeah? 217 00:10:42,520 --> 00:10:44,437 AUDIENCE: You only end up putting the red ones 218 00:10:44,437 --> 00:10:47,350 into the cache and you have mostly every [INAUDIBLE], which 219 00:10:47,350 --> 00:10:48,520 is totally [INAUDIBLE]. 220 00:10:48,520 --> 00:10:49,270 JULIAN SHUN: Yeah. 221 00:10:49,270 --> 00:10:50,440 So good answer. 222 00:10:50,440 --> 00:10:54,630 So the downside is that you might, for example, just 223 00:10:54,630 --> 00:10:58,740 be accessing the red cache blocks 224 00:10:58,740 --> 00:11:01,260 and then not accessing any of the other cache blocks. 225 00:11:01,260 --> 00:11:04,140 They'll all get mapped to the same location in the cache, 226 00:11:04,140 --> 00:11:06,240 and then they'll keep evicting each other, 227 00:11:06,240 --> 00:11:09,150 even though there's a lot of empty space in the cache. 228 00:11:09,150 --> 00:11:11,130 And this is known as a conflict miss. 229 00:11:11,130 --> 00:11:15,000 And these can be very bad for performance 230 00:11:15,000 --> 00:11:16,760 and very hard to debug. 
231 00:11:16,760 --> 00:11:19,140 So that's one downside of a direct-mapped 232 00:11:19,140 --> 00:11:22,950 cache: you can get these conflict misses where you have 233 00:11:22,950 --> 00:11:25,050 to evict things from the cache even though there's 234 00:11:25,050 --> 00:11:26,205 empty space in the cache. 235 00:11:29,720 --> 00:11:32,330 So as we said, finding a block is very fast. 236 00:11:32,330 --> 00:11:35,810 Only a single location in the cache has to be searched. 237 00:11:35,810 --> 00:11:38,390 But you might suffer from conflict 238 00:11:38,390 --> 00:11:40,620 misses if you keep accessing things in the same set 239 00:11:40,620 --> 00:11:45,140 repeatedly without accessing the things in the other sets. 240 00:11:45,140 --> 00:11:46,220 So any questions? 241 00:11:53,030 --> 00:11:53,530 OK. 242 00:11:53,530 --> 00:11:58,870 So these are sort of the two extremes for cache design. 243 00:11:58,870 --> 00:12:01,060 There's actually a hybrid solution 244 00:12:01,060 --> 00:12:03,872 called set associative cache. 245 00:12:03,872 --> 00:12:07,180 And in a set associative cache, you still have sets, 246 00:12:07,180 --> 00:12:11,200 but each of the sets contains more than one line now. 247 00:12:11,200 --> 00:12:14,970 So all the red blocks still map to the red set, 248 00:12:14,970 --> 00:12:16,990 but there's actually two possible locations 249 00:12:16,990 --> 00:12:20,020 for the red blocks now. 250 00:12:20,020 --> 00:12:24,730 So in this case, this is known as a two-way associative cache, 251 00:12:24,730 --> 00:12:28,870 since there are two possible locations inside each set. 252 00:12:28,870 --> 00:12:33,670 And again, a cache block's set determines k possible cache 253 00:12:33,670 --> 00:12:35,140 locations for that block. 254 00:12:35,140 --> 00:12:38,440 So within a set it's fully associative, 255 00:12:38,440 --> 00:12:42,040 but each block can only go in one of the sets. 
256 00:12:44,590 --> 00:12:48,190 So let's look again at how the bits 257 00:12:48,190 --> 00:12:50,680 in the address are divided up. 258 00:12:50,680 --> 00:12:53,770 So we still have the tag, set, and offset fields. 259 00:12:53,770 --> 00:12:58,180 The offset field is still log base 2 of B bits. 260 00:12:58,180 --> 00:13:04,510 The set field is going to take log base 2 of M over kB bits. 261 00:13:04,510 --> 00:13:07,320 So the number of sets we have is M over kB. 262 00:13:07,320 --> 00:13:11,080 So we need log base 2 of that number 263 00:13:11,080 --> 00:13:14,230 to represent the set of a block. 264 00:13:14,230 --> 00:13:17,590 And then, finally, we use the remaining bits as a tag, 265 00:13:17,590 --> 00:13:22,730 so it's going to be w minus log base 2 of M over k. 266 00:13:22,730 --> 00:13:25,900 And now, to find a block in the cache, 267 00:13:25,900 --> 00:13:30,400 only k locations of its set must be searched. 268 00:13:30,400 --> 00:13:33,970 So you basically find which set the cache block maps to, 269 00:13:33,970 --> 00:13:36,130 and then you check all k locations 270 00:13:36,130 --> 00:13:41,320 within that set to see if that cache block is there. 271 00:13:41,320 --> 00:13:44,358 And whenever you try 272 00:13:44,358 --> 00:13:46,900 to put something in the cache because it's not there, 273 00:13:46,900 --> 00:13:48,067 you have to evict something. 274 00:13:48,067 --> 00:13:51,010 And you evict something from the same set as the block 275 00:13:51,010 --> 00:13:53,780 that you're placing into the cache. 276 00:13:53,780 --> 00:13:56,410 So for this example, I showed a two-way associative cache. 277 00:13:56,410 --> 00:13:59,200 But in practice, the associativity is usually bigger-- 278 00:13:59,200 --> 00:14:04,090 say eight-way, 16-way, or sometimes 20-way. 
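The field widths just described can be written as a small function of the cache size M, block size B, and associativity k (all assumed to be powers of two). This is a sketch of the arithmetic, not anything the hardware runs:

```python
import math

def field_widths(M, B, k, w=64):
    """Bit widths of (tag, set, offset) for a k-way set-associative
    cache with M bytes total, B-byte blocks, and w-bit addresses."""
    offset_bits = int(math.log2(B))
    set_bits = int(math.log2(M // (k * B)))  # number of sets is M/(kB)
    tag_bits = w - set_bits - offset_bits    # equals w - log2(M/k)
    return tag_bits, set_bits, offset_bits
```

With k = 1 this reduces to the direct-mapped split, and with k = M/B (a single set) the set field disappears and the tag takes all remaining bits, matching the 62-bit tag of the fully associative 32-byte toy cache.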
279 00:14:04,090 --> 00:14:09,490 And as you keep increasing the associativity, 280 00:14:09,490 --> 00:14:13,130 it's going to look more and more like a fully associative cache. 281 00:14:13,130 --> 00:14:15,460 And if you have a one-way associative cache, 282 00:14:15,460 --> 00:14:17,050 then that's just a direct-mapped cache. 283 00:14:17,050 --> 00:14:21,310 So this is sort of a hybrid in between a fully associative 284 00:14:21,310 --> 00:14:24,325 cache and a direct-mapped cache. 285 00:14:27,620 --> 00:14:30,650 So any questions on set-associative caches? 286 00:14:38,310 --> 00:14:38,810 OK. 287 00:14:38,810 --> 00:14:43,340 So let's go over a taxonomy of different types of cache 288 00:14:43,340 --> 00:14:45,510 misses that you can incur. 289 00:14:45,510 --> 00:14:48,620 So the first type of cache miss is called a cold miss. 290 00:14:48,620 --> 00:14:50,150 And this is the cache miss that you 291 00:14:50,150 --> 00:14:53,705 have to incur the first time you access a cache block. 292 00:14:53,705 --> 00:14:55,580 And if you need to access this piece of data, 293 00:14:55,580 --> 00:14:58,220 there's no way to get around getting a cold miss for this. 294 00:14:58,220 --> 00:15:01,225 Because your cache starts out not having this block, 295 00:15:01,225 --> 00:15:02,600 and the first time you access it, 296 00:15:02,600 --> 00:15:06,960 you have to bring it into cache. 297 00:15:06,960 --> 00:15:09,860 Then there are capacity misses. 298 00:15:09,860 --> 00:15:12,660 So capacity misses are cache misses 299 00:15:12,660 --> 00:15:14,690 you get because the cache is full 300 00:15:14,690 --> 00:15:16,590 and it can't fit all of the cache blocks 301 00:15:16,590 --> 00:15:18,870 that you want to access. 302 00:15:18,870 --> 00:15:21,540 So you get a capacity miss when the previous cache 303 00:15:21,540 --> 00:15:23,970 copy would have been evicted even with a fully 304 00:15:23,970 --> 00:15:24,870 associative scheme. 
305 00:15:24,870 --> 00:15:28,260 So even if all of the possible locations in your cache 306 00:15:28,260 --> 00:15:31,230 could be used for a particular cache line, 307 00:15:31,230 --> 00:15:33,750 that cache line still has to be evicted because there's not 308 00:15:33,750 --> 00:15:34,420 enough space. 309 00:15:34,420 --> 00:15:37,530 So that's what's called a capacity miss. 310 00:15:37,530 --> 00:15:41,010 And you can deal with capacity misses 311 00:15:41,010 --> 00:15:44,610 by introducing more locality into your code, both spatial 312 00:15:44,610 --> 00:15:46,440 and temporal locality. 313 00:15:46,440 --> 00:15:48,690 And we'll look at ways to reduce the capacity 314 00:15:48,690 --> 00:15:51,420 misses of algorithms later on in this lecture. 315 00:15:53,930 --> 00:15:55,830 Then there are conflict misses. 316 00:15:55,830 --> 00:16:00,000 And conflict misses happen in set-associative caches 317 00:16:00,000 --> 00:16:06,420 when you have too many blocks from the same set wanting 318 00:16:06,420 --> 00:16:08,640 to go into the cache. 319 00:16:08,640 --> 00:16:10,770 And some of these have to be evicted, 320 00:16:10,770 --> 00:16:14,130 because the set can't fit all of the blocks. 321 00:16:14,130 --> 00:16:15,720 And these blocks wouldn't have been 322 00:16:15,720 --> 00:16:18,540 evicted if you had a fully associative scheme, so these 323 00:16:18,540 --> 00:16:21,750 are what's called conflict misses. 324 00:16:21,750 --> 00:16:25,800 For example, if you have 16 things in a set 325 00:16:25,800 --> 00:16:29,820 and you keep accessing 17 things that all belong in the set, 326 00:16:29,820 --> 00:16:32,310 something's going to get kicked out 327 00:16:32,310 --> 00:16:35,340 every time you want to access something. 328 00:16:35,340 --> 00:16:38,280 And these cache evictions might not 329 00:16:38,280 --> 00:16:41,115 have happened if you had a fully associative cache. 330 00:16:44,600 --> 00:16:46,460 And then, finally, there are sharing misses. 
331 00:16:46,460 --> 00:16:50,810 So sharing misses only happen in a parallel context. 332 00:16:50,810 --> 00:16:52,940 And we talked a little bit about true sharing 333 00:16:52,940 --> 00:16:56,300 and false sharing misses in prior lectures. 334 00:16:56,300 --> 00:16:59,270 So let's just review this briefly. 335 00:16:59,270 --> 00:17:03,860 So a sharing miss can happen if multiple processors are 336 00:17:03,860 --> 00:17:06,619 accessing the same cache line and at least one of them 337 00:17:06,619 --> 00:17:08,869 is writing to that cache line. 338 00:17:08,869 --> 00:17:10,460 If all of the processors are just 339 00:17:10,460 --> 00:17:13,010 reading from the cache line, then the cache coherence 340 00:17:13,010 --> 00:17:16,250 protocol knows how to make it work so that you don't get 341 00:17:16,250 --> 00:17:16,880 misses. 342 00:17:16,880 --> 00:17:19,670 They can all access the same cache line at the same time 343 00:17:19,670 --> 00:17:22,099 if nobody's modifying it. 344 00:17:22,099 --> 00:17:24,290 But if at least one processor is modifying it, 345 00:17:24,290 --> 00:17:26,359 you could get either true sharing misses 346 00:17:26,359 --> 00:17:28,250 or false sharing misses. 347 00:17:28,250 --> 00:17:31,580 So a true sharing miss is when two processors are 348 00:17:31,580 --> 00:17:36,590 accessing the same data on the same cache line. 349 00:17:36,590 --> 00:17:38,750 And as you recall from a previous lecture, 350 00:17:38,750 --> 00:17:41,150 if one of the two processors is writing to this cache 351 00:17:41,150 --> 00:17:43,640 line, whenever it does a write it 352 00:17:43,640 --> 00:17:46,370 needs to acquire the cache line in exclusive mode 353 00:17:46,370 --> 00:17:51,710 and then invalidate that cache line in all other caches. 
354 00:17:51,710 --> 00:17:54,020 So then when another processor 355 00:17:54,020 --> 00:17:55,820 tries to access the same memory location, 356 00:17:55,820 --> 00:17:58,130 it has to bring it back into its own cache, 357 00:17:58,130 --> 00:18:02,260 and then you get a cache miss there. 358 00:18:02,260 --> 00:18:04,430 A false sharing miss happens if two processors 359 00:18:04,430 --> 00:18:07,070 are accessing different data that just happen to reside 360 00:18:07,070 --> 00:18:08,870 on the same cache line. 361 00:18:08,870 --> 00:18:10,670 Because the basic unit of movement 362 00:18:10,670 --> 00:18:13,580 is a cache line in the architecture. 363 00:18:13,580 --> 00:18:15,860 So even if you're accessing different things, 364 00:18:15,860 --> 00:18:17,480 if they are on the same cache line, 365 00:18:17,480 --> 00:18:20,810 you're still going to get a sharing miss. 366 00:18:20,810 --> 00:18:22,940 And false sharing is pretty hard to deal with, 367 00:18:22,940 --> 00:18:26,030 because, in general, you don't know what data 368 00:18:26,030 --> 00:18:28,282 gets placed on what cache line. 369 00:18:28,282 --> 00:18:29,990 There are certain heuristics you can use. 370 00:18:29,990 --> 00:18:32,510 For example, if you're mallocing a big memory region, 371 00:18:32,510 --> 00:18:35,430 you know that that memory region is contiguous, 372 00:18:35,430 --> 00:18:37,670 so you can space your accesses far enough apart 373 00:18:37,670 --> 00:18:40,310 by different processors so they don't touch the same cache 374 00:18:40,310 --> 00:18:41,110 line. 375 00:18:41,110 --> 00:18:43,910 But if you're just declaring local variables on the stack, 376 00:18:43,910 --> 00:18:45,710 you don't know where the compiler 377 00:18:45,710 --> 00:18:50,810 is going to decide to place these variables 378 00:18:50,810 --> 00:18:54,480 in the virtual memory address space. 
379 00:18:54,480 --> 00:18:57,050 So these are four different types of cache 380 00:18:57,050 --> 00:19:00,150 misses that you should know about. 381 00:19:00,150 --> 00:19:02,690 And there's many models out there 382 00:19:02,690 --> 00:19:05,840 for analyzing the cache performance of algorithms. 383 00:19:05,840 --> 00:19:08,720 And some of the models ignore some of these different types 384 00:19:08,720 --> 00:19:10,640 of cache misses. 385 00:19:10,640 --> 00:19:13,940 So just be aware of this when you're looking at algorithm 386 00:19:13,940 --> 00:19:16,010 analysis, because not all of the models 387 00:19:16,010 --> 00:19:18,120 will capture all of these different types of cache 388 00:19:18,120 --> 00:19:18,620 misses. 389 00:19:22,830 --> 00:19:27,540 So let's look at a bad case for conflict misses. 390 00:19:27,540 --> 00:19:33,270 So here I want to access a submatrix within a larger 391 00:19:33,270 --> 00:19:34,440 matrix. 392 00:19:34,440 --> 00:19:39,540 And recall that matrices are stored in row-major order. 393 00:19:39,540 --> 00:19:44,850 And let's say our matrix is 4,096 columns by 4,096 rows 394 00:19:44,850 --> 00:19:47,670 and it stores doubles. 395 00:19:47,670 --> 00:19:50,190 So therefore, each row here is going 396 00:19:50,190 --> 00:19:55,140 to contain 2 to the 15th bytes, because 4,096 397 00:19:55,140 --> 00:19:58,800 is 2 to the 12th, and we have doubles, 398 00:19:58,800 --> 00:20:00,110 which take eight bytes. 399 00:20:00,110 --> 00:20:03,390 So 2 to the 12th times 2 to the 3rd, which is 2 to the 15th. 400 00:20:06,750 --> 00:20:11,280 We're going to assume the word width is 64, which is standard. 401 00:20:11,280 --> 00:20:15,060 We're going to assume that we have a cache size of 32k. 402 00:20:15,060 --> 00:20:19,710 And the cache block size is 64, which, again, is standard. 403 00:20:19,710 --> 00:20:22,125 And let's say we have a four-way associative cache. 
404 00:20:26,520 --> 00:20:31,860 So let's look at how the bits are divided up. 405 00:20:31,860 --> 00:20:36,270 So again we have this offset, which 406 00:20:36,270 --> 00:20:38,867 takes log base 2 of B bits. 407 00:20:38,867 --> 00:20:41,325 So how many bits do we have for the offset in this example? 408 00:20:48,300 --> 00:20:48,800 Right. 409 00:20:48,800 --> 00:20:50,030 So we have 6 bits. 410 00:20:50,030 --> 00:20:53,930 So it's just log base 2 of 64. 411 00:20:53,930 --> 00:20:56,180 What about for the set? 412 00:20:56,180 --> 00:20:59,030 How many bits do we have for that? 413 00:20:59,030 --> 00:21:00,350 7. 414 00:21:00,350 --> 00:21:02,280 Who said 7? 415 00:21:02,280 --> 00:21:02,780 Yeah. 416 00:21:02,780 --> 00:21:04,220 So it is 7. 417 00:21:04,220 --> 00:21:10,130 So M is 32k, which is 2 to the 15th. 418 00:21:10,130 --> 00:21:17,310 And then k is 2 to the 2nd, and B is 2 to the 6th. 419 00:21:17,310 --> 00:21:21,050 So it's 2 to the 15th divided by 2 to the 8th, which is 2 to the 7th. 420 00:21:21,050 --> 00:21:23,930 And log base 2 of that is 7. 421 00:21:23,930 --> 00:21:25,940 And finally, what about the tag field? 422 00:21:29,660 --> 00:21:31,990 AUDIENCE: 51. 423 00:21:31,990 --> 00:21:33,100 JULIAN SHUN: 51. 424 00:21:33,100 --> 00:21:33,730 Why is that? 425 00:21:33,730 --> 00:21:36,330 AUDIENCE: 64 minus 13. 426 00:21:36,330 --> 00:21:37,080 JULIAN SHUN: Yeah. 427 00:21:37,080 --> 00:21:43,880 So it's just 64 minus 7 minus 6, which is 51. 428 00:21:43,880 --> 00:21:44,380 OK. 429 00:21:44,380 --> 00:21:47,890 So let's say that we want to access a submatrix 430 00:21:47,890 --> 00:21:49,710 within this larger matrix. 431 00:21:49,710 --> 00:21:52,660 Let's say we want to access a 32 by 32 submatrix. 432 00:21:52,660 --> 00:21:57,220 And this is pretty common in matrix algorithms, where 433 00:21:57,220 --> 00:21:59,810 you want to access submatrices, especially in divide 434 00:21:59,810 --> 00:22:01,591 and conquer algorithms. 
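Plugging the machine parameters from this example into the field arithmetic confirms the numbers worked out above (6 offset bits, 7 set bits, a 51-bit tag, and 2 to the 15th bytes per row):

```python
import math

M = 32 * 1024        # cache size: 32 KB = 2^15 bytes
B = 64               # block size: 2^6 bytes
k = 4                # four-way set associative
w = 64               # address width in bits

offset_bits = int(math.log2(B))          # log2(64) = 6
set_bits = int(math.log2(M // (k * B)))  # log2(2^15 / 2^8) = 7
tag_bits = w - set_bits - offset_bits    # 64 - 7 - 6 = 51

# Each matrix row is 4,096 doubles of 8 bytes each: 2^15 bytes.
row_bytes = 4096 * 8
```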
435 00:22:04,240 --> 00:22:09,850 And let's say we want to access a column of this submatrix A. 436 00:22:09,850 --> 00:22:13,180 So the addresses of the elements that we're going to access 437 00:22:13,180 --> 00:22:14,050 are as follows-- 438 00:22:14,050 --> 00:22:17,290 so let's say the first element in the column 439 00:22:17,290 --> 00:22:19,600 is stored at address x. 440 00:22:19,600 --> 00:22:21,280 Then the second element in the column 441 00:22:21,280 --> 00:22:24,640 is going to be stored at address x plus 2 to the 15th, 442 00:22:24,640 --> 00:22:27,910 because each row has 2 to the 15th bytes, 443 00:22:27,910 --> 00:22:29,650 and we're skipping over an entire row 444 00:22:29,650 --> 00:22:34,490 here to get to the element in the next row of the submatrix. 445 00:22:34,490 --> 00:22:36,460 So we're going to add 2 to the 15th. 446 00:22:36,460 --> 00:22:38,020 And then to get the third element, 447 00:22:38,020 --> 00:22:40,660 we're going to add 2 times 2 to the 15th. 448 00:22:40,660 --> 00:22:43,420 And so on, until we get to the last element, 449 00:22:43,420 --> 00:22:48,490 which is x plus 31 times 2 to the 15th. 450 00:22:48,490 --> 00:22:50,350 So which fields of the address are 451 00:22:50,350 --> 00:22:54,850 changing as we go through one column of this submatrix? 452 00:23:05,586 --> 00:23:09,002 AUDIENCE: You're just adding multiple [INAUDIBLE] tag 453 00:23:09,002 --> 00:23:10,000 the [INAUDIBLE]. 454 00:23:10,000 --> 00:23:10,750 JULIAN SHUN: Yeah. 455 00:23:10,750 --> 00:23:13,490 So it's just going to be the tag that's changing. 456 00:23:13,490 --> 00:23:17,360 The set and the offset are going to stay the same, because we're 457 00:23:17,360 --> 00:23:22,190 just using the lower 13 bits to store the set and the offset. 458 00:23:22,190 --> 00:23:24,890 And therefore, when we increment by 2 to the 15th, 459 00:23:24,890 --> 00:23:28,920 we're not going to touch the set and the offset.
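One quick way to check this claim is to mask off the low 13 bits (7 set bits plus 6 offset bits) and observe that adding any multiple of 2 to the 15th leaves them unchanged. A minimal sketch, using the bit layout from the example:

```c
#include <assert.h>

/* In this example the set and offset fields together occupy the
 * low 13 bits of the address.  Adding multiples of 2^15 (the row
 * length in bytes) never changes those bits, so every element of
 * the column maps to the same cache set. */
static long set_and_offset(long addr) { return addr & ((1L << 13) - 1); }
```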
460 00:23:28,920 --> 00:23:32,060 So all of these addresses fall into the same set. 461 00:23:32,060 --> 00:23:35,640 And this is a problem, because our cache 462 00:23:35,640 --> 00:23:37,160 is only four-way associative. 463 00:23:37,160 --> 00:23:42,860 So we can only fit four cache lines in each set. 464 00:23:42,860 --> 00:23:45,860 And here, we're accessing 32 of these things. 465 00:23:45,860 --> 00:23:50,510 So by the time we get to the next column of A, 466 00:23:50,510 --> 00:23:53,280 all the things that we accessed in the current column of A 467 00:23:53,280 --> 00:23:56,360 are going to be evicted from cache already. 468 00:23:56,360 --> 00:23:58,970 And this is known as a conflict miss, 469 00:23:58,970 --> 00:24:01,850 because if you had a fully associative cache, 470 00:24:01,850 --> 00:24:04,730 this might not have happened, because you could actually 471 00:24:04,730 --> 00:24:09,940 use any location in the cache to store these cache blocks. 472 00:24:09,940 --> 00:24:13,720 So does anybody have any questions on why 473 00:24:13,720 --> 00:24:15,060 we get conflict misses here? 474 00:24:22,860 --> 00:24:27,110 So anybody have any ideas on how to fix this? 475 00:24:27,110 --> 00:24:29,300 So what can I do to make it so that I'm not 476 00:24:29,300 --> 00:24:32,990 incrementing by exactly 2 to the 15th every time? 477 00:24:39,696 --> 00:24:40,654 Yeah. 478 00:24:40,654 --> 00:24:43,050 AUDIENCE: So pad the matrix? 479 00:24:43,050 --> 00:24:44,020 JULIAN SHUN: Yeah. 480 00:24:44,020 --> 00:24:46,270 So one solution is to pad the matrix. 481 00:24:46,270 --> 00:24:49,060 You can add some constant amount of space 482 00:24:49,060 --> 00:24:50,920 to the end of the matrix. 483 00:24:50,920 --> 00:24:53,320 So each row is going to be longer than 2 484 00:24:53,320 --> 00:24:54,550 to the 15th bytes. 485 00:24:54,550 --> 00:24:57,400 So maybe you add some small constant like 17. 486 00:24:57,400 --> 00:25:00,130 So add 17 bytes to the end of each row.
487 00:25:00,130 --> 00:25:04,090 And now, when you access a column of this submatrix, 488 00:25:04,090 --> 00:25:07,000 you're not just incrementing by 2 to the 15th, 489 00:25:07,000 --> 00:25:10,570 you're also adding some small integer. 490 00:25:10,570 --> 00:25:14,535 And that's going to cause the set and the offset fields 491 00:25:14,535 --> 00:25:15,910 to change as well, and you're not 492 00:25:15,910 --> 00:25:18,640 going to get as many conflict misses. 493 00:25:18,640 --> 00:25:22,610 So that's one way to solve the problem. 494 00:25:22,610 --> 00:25:25,570 It turns out that if you're doing a matrix multiplication 495 00:25:25,570 --> 00:25:27,910 algorithm, that's a cubic work algorithm, 496 00:25:27,910 --> 00:25:31,630 and you can basically afford to copy the submatrix 497 00:25:31,630 --> 00:25:34,270 into a temporary 32 by 32 matrix, 498 00:25:34,270 --> 00:25:36,580 do all the operations on the temporary matrix, 499 00:25:36,580 --> 00:25:39,760 and then copy it back out to the original matrix. 500 00:25:39,760 --> 00:25:42,610 The copying only takes quadratic work 501 00:25:42,610 --> 00:25:45,160 to do across the whole algorithm. 502 00:25:45,160 --> 00:25:48,070 And since the whole algorithm takes cubic work, 503 00:25:48,070 --> 00:25:50,620 the quadratic work is a lower order term. 504 00:25:50,620 --> 00:25:54,790 So you can use temporary space to make sure that you 505 00:25:54,790 --> 00:25:56,050 don't get conflict misses. 506 00:25:58,560 --> 00:25:59,490 Any questions? 507 00:26:06,030 --> 00:26:09,340 So this was conflict misses. 508 00:26:09,340 --> 00:26:10,900 So conflict misses are important. 509 00:26:10,900 --> 00:26:13,180 But usually, we're going to be first concerned 510 00:26:13,180 --> 00:26:15,820 about getting good spatial and temporal locality, 511 00:26:15,820 --> 00:26:19,240 because those are usually the higher order 512 00:26:19,240 --> 00:26:21,070 factors in the performance of a program. 
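Before moving on, the padding idea above can be sanity-checked numerically. The following is a hypothetical sketch (functions are my own) that counts how many distinct cache sets a strided column walk touches, using the 7 set bits and 64-byte blocks from the running example:

```c
#include <assert.h>

/* Bits 6..12 of the address form the 7-bit set index in this example. */
static int set_index(long addr) { return (int)((addr >> 6) & 0x7f); }

/* Count the distinct sets touched when accessing n elements spaced
 * `stride` bytes apart, starting at byte address `base`. */
static int distinct_sets(long base, long stride, int n) {
    int seen[128] = {0}, count = 0;
    for (int i = 0; i < n; i++) {
        int s = set_index(base + (long)i * stride);
        if (!seen[s]) { seen[s] = 1; count++; }
    }
    return count;
}
```

With the unpadded stride of 2 to the 15th, all 32 column accesses land in a single set; padding each row by 17 bytes spreads them over several sets, which is enough for a four-way associative cache to keep the whole column resident.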
513 00:26:21,070 --> 00:26:24,250 And once we get good spatial and temporal locality 514 00:26:24,250 --> 00:26:25,840 in our program, we can then start 515 00:26:25,840 --> 00:26:28,720 worrying about conflict misses, for example, 516 00:26:28,720 --> 00:26:32,860 by using temporary space or padding our data 517 00:26:32,860 --> 00:26:35,650 by some small constants so that we don't 518 00:26:35,650 --> 00:26:37,210 have as many conflict misses. 519 00:26:41,120 --> 00:26:43,170 So now, I want to talk about a model 520 00:26:43,170 --> 00:26:45,270 that we can use to analyze the cache 521 00:26:45,270 --> 00:26:46,530 performance of algorithms. 522 00:26:46,530 --> 00:26:51,010 And this is called the ideal-cache model. 523 00:26:51,010 --> 00:26:57,030 So in this model, we have a two-level cache hierarchy. 524 00:26:57,030 --> 00:27:01,440 So we have the cache and then main memory. 525 00:27:01,440 --> 00:27:05,205 The cache is of size M, and the cache line size 526 00:27:05,205 --> 00:27:06,750 is B bytes. 527 00:27:06,750 --> 00:27:10,245 And therefore, we can fit M over B cache lines inside our cache. 528 00:27:13,020 --> 00:27:15,930 This model assumes that the cache is fully associative, 529 00:27:15,930 --> 00:27:18,920 so any cache block can go anywhere in the cache. 530 00:27:18,920 --> 00:27:23,070 And it also assumes an optimal omniscient replacement policy. 531 00:27:23,070 --> 00:27:25,140 So this means that when we want to evict a cache 532 00:27:25,140 --> 00:27:26,600 block from the cache, we're going 533 00:27:26,600 --> 00:27:28,410 to pick the thing to evict that gives us 534 00:27:28,410 --> 00:27:30,060 the best performance overall-- 535 00:27:30,060 --> 00:27:31,830 the one that gives us the lowest number of cache 536 00:27:31,830 --> 00:27:34,210 misses throughout our entire algorithm.
537 00:27:34,210 --> 00:27:36,960 So we're assuming that we know the sequence of memory requests 538 00:27:36,960 --> 00:27:38,858 throughout the entire algorithm. 539 00:27:38,858 --> 00:27:41,400 And that's why it's called the omniscient replacement 540 00:27:41,400 --> 00:27:41,900 policy. 541 00:27:45,370 --> 00:27:49,000 And if something is in cache, you can operate on it for free. 542 00:27:49,000 --> 00:27:51,040 And if something is in main memory, 543 00:27:51,040 --> 00:27:52,810 you have to bring it into cache and then 544 00:27:52,810 --> 00:27:54,070 you incur a cache miss. 545 00:27:56,990 --> 00:27:59,880 So there are two performance measures that we care about-- 546 00:27:59,880 --> 00:28:01,890 first, we care about the ordinary work, 547 00:28:01,890 --> 00:28:04,830 which is just the ordinary running time of a program. 548 00:28:04,830 --> 00:28:07,740 So this is the same as before when 549 00:28:07,740 --> 00:28:09,360 we were analyzing algorithms. 550 00:28:09,360 --> 00:28:11,160 It's just the total number of operations 551 00:28:11,160 --> 00:28:13,690 that the program does. 552 00:28:13,690 --> 00:28:15,420 And the number of cache misses is 553 00:28:15,420 --> 00:28:17,190 going to be the number of lines we 554 00:28:17,190 --> 00:28:21,893 have to transfer between the main memory and the cache. 555 00:28:21,893 --> 00:28:23,310 So the number of cache misses just 556 00:28:23,310 --> 00:28:24,930 counts the number of cache transfers, 557 00:28:24,930 --> 00:28:27,570 whereas the work counts all the operations that you 558 00:28:27,570 --> 00:28:29,227 have to do in the algorithm. 559 00:28:32,640 --> 00:28:35,490 So ideally, we would like to come up 560 00:28:35,490 --> 00:28:38,970 with algorithms that have a low number of cache misses 561 00:28:38,970 --> 00:28:42,540 without increasing the work from the traditional standard 562 00:28:42,540 --> 00:28:44,550 algorithm.
563 00:28:44,550 --> 00:28:47,060 Sometimes we can do that, sometimes we can't do that. 564 00:28:47,060 --> 00:28:49,470 And then there's a trade-off between the work 565 00:28:49,470 --> 00:28:51,210 and the number of cache misses. 566 00:28:51,210 --> 00:28:53,850 And it's a trade-off that you have 567 00:28:53,850 --> 00:28:56,910 to decide whether it's worthwhile as a performance 568 00:28:56,910 --> 00:28:57,960 engineer. 569 00:28:57,960 --> 00:28:59,790 Today, we're going to look at an algorithm 570 00:28:59,790 --> 00:29:01,915 where you can actually reduce the number of cache 571 00:29:01,915 --> 00:29:03,780 misses without increasing the work. 572 00:29:03,780 --> 00:29:06,090 So you basically get the best of both worlds. 573 00:29:08,880 --> 00:29:11,430 So any questions on this ideal cache model? 574 00:29:19,430 --> 00:29:23,810 So this model is just used for analyzing algorithms. 575 00:29:23,810 --> 00:29:27,530 You can't actually buy one of these caches at the store. 576 00:29:27,530 --> 00:29:31,760 So this is a very ideal cache, and they don't exist. 577 00:29:31,760 --> 00:29:35,000 But it turns out that this optimal omniscient replacement 578 00:29:35,000 --> 00:29:38,580 policy has nice theoretical properties. 579 00:29:38,580 --> 00:29:43,970 And this is a very important lemma that was proved in 1985. 580 00:29:43,970 --> 00:29:46,720 It's called the LRU lemma. 581 00:29:46,720 --> 00:29:48,770 It was proved by Sleator and Tarjan. 582 00:29:48,770 --> 00:29:51,950 And the lemma says, suppose that an algorithm incurs 583 00:29:51,950 --> 00:29:56,540 Q cache misses on an ideal cache of size M. Then, 584 00:29:56,540 --> 00:30:01,280 on a fully associative cache of size 2M that uses the LRU, 585 00:30:01,280 --> 00:30:04,760 or Least Recently Used, replacement policy, 586 00:30:04,760 --> 00:30:08,900 it incurs at most 2Q cache misses.
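To see the LRU policy itself in action, here is a small, hypothetical miss-counting sketch of a fully associative LRU cache (the function and constants are my own, for illustration; it demonstrates the replacement policy, not the Sleator-Tarjan proof):

```c
#include <assert.h>

#define MAXLINES 64

/* Count the misses a fully associative LRU cache incurs on a trace of
 * byte addresses.  The cache holds `nlines` lines of B = 64 bytes each.
 * A linear scan per access is fine for illustration. */
static long lru_misses(const long *trace, int n, int nlines) {
    const long B = 64;
    long line[MAXLINES];   /* block id cached in each line (-1 = empty) */
    long last[MAXLINES];   /* time of most recent use */
    long misses = 0;
    for (int i = 0; i < nlines; i++) { line[i] = -1; last[i] = -1; }
    for (int t = 0; t < n; t++) {
        long blk = trace[t] / B;
        int slot = -1;
        for (int i = 0; i < nlines; i++)
            if (line[i] == blk) { slot = i; break; }
        if (slot < 0) {             /* miss: evict the least recently used */
            misses++;
            slot = 0;
            for (int i = 1; i < nlines; i++)
                if (last[i] < last[slot]) slot = i;
            line[slot] = blk;
        }
        last[slot] = t;             /* refresh recency on hit or fill */
    }
    return misses;
}
```

The same trace can miss or hit depending on the cache size, which is exactly the kind of gap the lemma's factor-of-2 size increase absorbs.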
587 00:30:08,900 --> 00:30:12,980 So what this says is if I can bound the number of cache 588 00:30:12,980 --> 00:30:16,700 misses for an algorithm on the ideal cache, 589 00:30:16,700 --> 00:30:19,820 then if I take a fully associative cache that's twice 590 00:30:19,820 --> 00:30:23,220 the size and use the LRU replacement policy, 591 00:30:23,220 --> 00:30:25,280 which is a pretty practical policy, 592 00:30:25,280 --> 00:30:26,900 then the algorithm is going to incur, 593 00:30:26,900 --> 00:30:31,160 at most, twice the number of cache misses. 594 00:30:31,160 --> 00:30:33,890 And the implication of this lemma 595 00:30:33,890 --> 00:30:36,590 is that for asymptotic analyses, you 596 00:30:36,590 --> 00:30:40,040 can assume either the optimal replacement policy or the LRU 597 00:30:40,040 --> 00:30:41,930 replacement policy as convenient. 598 00:30:41,930 --> 00:30:46,010 Because the numbers of cache misses 599 00:30:46,010 --> 00:30:50,270 are just going to be within a constant factor of each other. 600 00:30:50,270 --> 00:30:52,610 So this is a very important lemma. 601 00:30:52,610 --> 00:30:54,650 It basically makes 602 00:30:54,650 --> 00:31:00,306 it much easier for us to analyze the cache misses of algorithms. 603 00:31:03,780 --> 00:31:06,240 And here's a software engineering principle 604 00:31:06,240 --> 00:31:08,770 that I want to point out. 605 00:31:08,770 --> 00:31:13,480 So first, when you're trying to get good performance, 606 00:31:13,480 --> 00:31:16,540 you should come up with a theoretically good algorithm 607 00:31:16,540 --> 00:31:20,670 that has good bounds on the work and the cache complexity. 608 00:31:20,670 --> 00:31:23,130 And then after you come up with an algorithm that's 609 00:31:23,130 --> 00:31:26,040 theoretically good, then you start engineering 610 00:31:26,040 --> 00:31:27,150 for detailed performance.
611 00:31:27,150 --> 00:31:30,630 You start worrying about the details, such as real-world 612 00:31:30,630 --> 00:31:34,770 caches not being fully associative, and, for example, 613 00:31:34,770 --> 00:31:37,080 loads and stores having different costs with respect 614 00:31:37,080 --> 00:31:39,090 to bandwidth and latency. 615 00:31:39,090 --> 00:31:41,340 But coming up with a theoretically good algorithm 616 00:31:41,340 --> 00:31:43,980 is the first order bit to getting good performance. 617 00:31:48,840 --> 00:31:49,812 Questions? 618 00:31:58,090 --> 00:32:00,550 So let's start analyzing the number of cache 619 00:32:00,550 --> 00:32:02,320 misses in a program. 620 00:32:02,320 --> 00:32:04,090 So here's a lemma. 621 00:32:04,090 --> 00:32:07,990 The lemma says, suppose that a program reads a set of r data 622 00:32:07,990 --> 00:32:13,480 segments, where the i-th segment consists of s sub i bytes. 623 00:32:13,480 --> 00:32:17,110 And suppose that the sum of the sizes of all the segments 624 00:32:17,110 --> 00:32:22,360 is equal to N. And we're going to assume that N is less than M 625 00:32:22,360 --> 00:32:23,120 over 3. 626 00:32:23,120 --> 00:32:26,260 So the sum of the sizes of the segments 627 00:32:26,260 --> 00:32:30,100 is less than the cache size divided by 3. 628 00:32:30,100 --> 00:32:32,320 We're also going to assume that N over r 629 00:32:32,320 --> 00:32:34,870 is greater than or equal to B. So recall 630 00:32:34,870 --> 00:32:38,650 that r is the number of data segments we have, 631 00:32:38,650 --> 00:32:41,090 and N is the total size of the segments. 632 00:32:41,090 --> 00:32:46,080 So what does N over r mean, semantically? 633 00:32:46,080 --> 00:32:46,580 Yes. 634 00:32:46,580 --> 00:32:47,950 AUDIENCE: Average [INAUDIBLE]. 635 00:32:47,950 --> 00:32:48,700 JULIAN SHUN: Yeah. 636 00:32:48,700 --> 00:32:53,390 So N over r is just the average size of a segment.
637 00:32:53,390 --> 00:32:56,390 And here we're saying that the average size of a segment 638 00:32:56,390 --> 00:33:01,790 is at least B-- so at least the size of a cache line. 639 00:33:01,790 --> 00:33:04,830 So if these two assumptions hold, then all of the segments 640 00:33:04,830 --> 00:33:07,590 are going to fit into cache, and the number of cache 641 00:33:07,590 --> 00:33:13,590 misses to read them all is, at most, 3 times N over B. 642 00:33:13,590 --> 00:33:20,490 So if you had just a single array of size N, 643 00:33:20,490 --> 00:33:21,990 then the number of cache misses you 644 00:33:21,990 --> 00:33:24,180 would need to read that array into cache 645 00:33:24,180 --> 00:33:25,920 is going to be N over B. And this 646 00:33:25,920 --> 00:33:29,280 is saying that, even if our data is divided 647 00:33:29,280 --> 00:33:32,040 into a bunch of segments, as long as the average length 648 00:33:32,040 --> 00:33:35,580 of the segments is large enough, then the number of cache misses 649 00:33:35,580 --> 00:33:41,550 is just a constant factor worse than reading a single array. 650 00:33:41,550 --> 00:33:44,160 So let's try to prove this cache miss lemma. 651 00:33:48,000 --> 00:33:50,220 So here's the proof. 652 00:33:50,220 --> 00:33:52,290 A single segment s sub i is going 653 00:33:52,290 --> 00:33:58,350 to incur at most s sub i over B plus 2 cache misses. 654 00:33:58,350 --> 00:34:01,800 So does anyone want to tell me where the s sub i over B plus 2 655 00:34:01,800 --> 00:34:02,370 comes from? 656 00:34:09,540 --> 00:34:13,170 So let's say this is a segment that we're analyzing, 657 00:34:13,170 --> 00:34:16,320 and this is how it's aligned in virtual memory. 658 00:34:21,900 --> 00:34:22,400 Yes? 659 00:34:22,400 --> 00:34:25,310 AUDIENCE: How many blocks it could overlap, worst case. 660 00:34:25,310 --> 00:34:26,060 JULIAN SHUN: Yeah.
661 00:34:26,060 --> 00:34:29,870 So s sub i over B plus 2 is the number of blocks that could 662 00:34:29,870 --> 00:34:32,610 overlap within the worst case. 663 00:34:32,610 --> 00:34:36,949 So you need s sub i over B cache misses just 664 00:34:36,949 --> 00:34:39,949 to load those s sub i bytes. 665 00:34:39,949 --> 00:34:43,400 But then the beginning and the end of that segment 666 00:34:43,400 --> 00:34:47,360 might not be perfectly aligned with a cache line boundary. 667 00:34:47,360 --> 00:34:49,670 And therefore, you could waste, at most, one block 668 00:34:49,670 --> 00:34:51,320 on each side of the segment. 669 00:34:51,320 --> 00:34:55,310 So that's where the plus 2 comes from. 670 00:34:55,310 --> 00:34:57,560 So to get the total number of cache 671 00:34:57,560 --> 00:35:03,170 misses, we just have to sum this quantity from i equals 1 to r. 672 00:35:03,170 --> 00:35:06,620 So if I sum s sub i over B from i equals 1 to r, 673 00:35:06,620 --> 00:35:08,810 I just get N over B, by definition. 674 00:35:08,810 --> 00:35:12,640 And then I sum 2 from i equals 1 to r. 675 00:35:12,640 --> 00:35:14,840 So that just gives me 2r. 676 00:35:14,840 --> 00:35:17,180 Now, I'm going to multiply the top and the bottom 677 00:35:17,180 --> 00:35:21,080 with the second term by B. So 2r B over B now. 678 00:35:21,080 --> 00:35:24,200 And then that's less than or equal to N over B 679 00:35:24,200 --> 00:35:29,730 plus 2N over B. So where did I get this inequality here? 680 00:35:29,730 --> 00:35:32,420 Why do I know that 2r B is less than or equal to 2N? 681 00:35:35,500 --> 00:35:36,000 Yes? 682 00:35:36,000 --> 00:35:38,760 AUDIENCE: You know that the N is greater than or equal to B r. 683 00:35:38,760 --> 00:35:38,940 JULIAN SHUN: Yeah. 684 00:35:38,940 --> 00:35:41,250 So you know that N is greater than or equal to B 685 00:35:41,250 --> 00:35:43,380 r by this assumption up here. 686 00:35:43,380 --> 00:35:46,830 So therefore, r B is less than or equal to N. 
687 00:35:46,830 --> 00:35:51,450 And then, N over B plus 2N over B just sums up to 3N over B. 688 00:35:51,450 --> 00:35:55,335 So in the worst case, we're going to incur 3N over B cache 689 00:35:55,335 --> 00:35:55,835 misses. 690 00:36:00,800 --> 00:36:03,340 So any questions on this cache miss lemma? 691 00:36:07,620 --> 00:36:11,520 So the important thing to remember here is that if you 692 00:36:11,520 --> 00:36:14,070 have a whole bunch of data segments and the average length 693 00:36:14,070 --> 00:36:15,780 of your segments is large enough-- 694 00:36:15,780 --> 00:36:18,540 bigger than a cache block size-- 695 00:36:18,540 --> 00:36:21,690 then you can access all of these segments just 696 00:36:21,690 --> 00:36:24,360 like a single array. 697 00:36:24,360 --> 00:36:25,980 It only increases the number of cache 698 00:36:25,980 --> 00:36:27,810 misses by a constant factor. 699 00:36:27,810 --> 00:36:29,892 And if you're doing an asymptotic analysis, 700 00:36:29,892 --> 00:36:30,850 then it doesn't matter. 701 00:36:30,850 --> 00:36:33,360 So we're going to be using this cache miss lemma later 702 00:36:33,360 --> 00:36:35,160 on when we analyze algorithms. 703 00:36:40,720 --> 00:36:44,200 So another assumption that we're going to need 704 00:36:44,200 --> 00:36:46,840 is called the tall cache assumption. 705 00:36:46,840 --> 00:36:49,450 And the tall cache assumption basically 706 00:36:49,450 --> 00:36:52,390 says that the cache is taller than it is wide. 707 00:36:52,390 --> 00:36:55,750 So it says that B squared is less than c M 708 00:36:55,750 --> 00:36:58,750 for some sufficiently small constant c less than 709 00:36:58,750 --> 00:37:02,050 or equal to 1. 710 00:37:02,050 --> 00:37:05,830 So in other words, it says that the number of cache lines 711 00:37:05,830 --> 00:37:13,660 M over B you have is going to be bigger than B. 712 00:37:13,660 --> 00:37:16,330 And this tall cache assumption is usually 713 00:37:16,330 --> 00:37:17,650 satisfied in practice.
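Going back to the cache miss lemma for a moment, the "s sub i over B plus 2" bound per segment is easy to check directly: a segment of s bytes starting at an arbitrary address touches at most that many B-byte blocks. A minimal sketch (the function name is my own):

```c
#include <assert.h>

/* Number of B-byte cache blocks the byte range [addr, addr + size)
 * touches: the block holding the first byte through the block
 * holding the last byte, inclusive. */
static long blocks_touched(long addr, long size, long B) {
    return (addr + size - 1) / B - addr / B + 1;
}
```

Perfect alignment gives exactly size over B blocks; any misalignment wastes at most one block at each end of the segment, which is where the plus 2 in the lemma comes from.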
714 00:37:17,650 --> 00:37:22,090 So here are the cache line sizes and the cache 715 00:37:22,090 --> 00:37:24,460 sizes on the machines that we're using. 716 00:37:24,460 --> 00:37:28,990 So the cache line size is 64 bytes, and the L1 cache size 717 00:37:28,990 --> 00:37:31,390 is 32 kilobytes. 718 00:37:31,390 --> 00:37:36,400 So 64 bytes squared, that's 2 to the 12th. 719 00:37:36,400 --> 00:37:39,420 And 32 kilobytes is 2 to the 15th bytes. 720 00:37:39,420 --> 00:37:41,510 So 2 to the 12th is less than 2 to the 15th, 721 00:37:41,510 --> 00:37:44,530 so it satisfies the tall cache assumption. 722 00:37:44,530 --> 00:37:46,540 And as we go up the memory hierarchy, 723 00:37:46,540 --> 00:37:49,990 the cache size increases, but the cache line length 724 00:37:49,990 --> 00:37:51,080 stays the same. 725 00:37:51,080 --> 00:37:53,230 So the caches become even taller 726 00:37:53,230 --> 00:37:57,160 as we move up the memory hierarchy. 727 00:37:57,160 --> 00:38:00,468 So let's see why this tall cache assumption is 728 00:38:00,468 --> 00:38:01,260 going to be useful. 729 00:38:04,550 --> 00:38:06,300 To see that, we're going to look at what's 730 00:38:06,300 --> 00:38:07,770 wrong with a short cache. 731 00:38:07,770 --> 00:38:11,580 So in a short cache, our lines are going to be very wide, 732 00:38:11,580 --> 00:38:14,190 and they're wider than the number of lines 733 00:38:14,190 --> 00:38:18,200 that we can have in our cache. 734 00:38:18,200 --> 00:38:19,950 And let's say we're working with an n 735 00:38:19,950 --> 00:38:24,120 by n submatrix stored in row-major order. 736 00:38:24,120 --> 00:38:27,810 If you have a short cache, then even if n squared 737 00:38:27,810 --> 00:38:29,700 is less than c M, meaning that you 738 00:38:29,700 --> 00:38:33,540 can fit all the bytes of the submatrix in cache, 739 00:38:33,540 --> 00:38:37,620 you might still not be able to fit it into a short cache.
740 00:38:37,620 --> 00:38:40,650 And this picture sort of illustrates this. 741 00:38:40,650 --> 00:38:43,050 So we have n rows here. 742 00:38:43,050 --> 00:38:46,290 But we can only fit M over B of the rows in the cache, 743 00:38:46,290 --> 00:38:48,960 because the cache lines are so long, 744 00:38:48,960 --> 00:38:51,045 and we're actually wasting a lot of space 745 00:38:51,045 --> 00:38:52,170 on each of the cache lines. 746 00:38:52,170 --> 00:38:54,570 We're only using a very small fraction of each cache line 747 00:38:54,570 --> 00:38:58,690 to store the row of this submatrix. 748 00:38:58,690 --> 00:39:00,960 If this were the entire matrix, then 749 00:39:00,960 --> 00:39:05,250 it would actually be OK, because consecutive rows 750 00:39:05,250 --> 00:39:08,850 are going to be placed together consecutively in memory. 751 00:39:08,850 --> 00:39:10,740 But if this is a submatrix, then we 752 00:39:10,740 --> 00:39:14,070 can't be guaranteed that the next row is going to be placed 753 00:39:14,070 --> 00:39:17,220 right after the current row. 754 00:39:17,220 --> 00:39:19,290 And oftentimes, we have to deal with submatrices 755 00:39:19,290 --> 00:39:22,110 when we're doing recursive matrix algorithms. 756 00:39:25,330 --> 00:39:27,760 So this is what's wrong with short caches. 757 00:39:27,760 --> 00:39:32,340 And that's why we want to assume the tall cache assumption. 758 00:39:32,340 --> 00:39:34,210 And we can assume that, because it's usually 759 00:39:34,210 --> 00:39:35,185 satisfied in practice. 760 00:39:37,945 --> 00:39:40,080 The TLB actually tends to be short. 761 00:39:40,080 --> 00:39:42,550 It only has a couple of entries, so it might not satisfy 762 00:39:42,550 --> 00:39:44,020 the tall cache assumption. 763 00:39:44,020 --> 00:39:50,060 But all of the other caches will satisfy this assumption. 764 00:39:50,060 --> 00:39:51,100 Any questions? 765 00:39:54,630 --> 00:39:56,797 OK.
766 00:39:56,797 --> 00:39:58,880 So here's another lemma that's going to be useful. 767 00:39:58,880 --> 00:40:03,220 This is called the submatrix caching lemma. 768 00:40:03,220 --> 00:40:06,310 So suppose that we have an n by n matrix A, 769 00:40:06,310 --> 00:40:08,650 and it's read into a tall cache that 770 00:40:08,650 --> 00:40:13,190 satisfies B squared less than c M for some constant c less than 771 00:40:13,190 --> 00:40:15,580 or equal to 1. 772 00:40:15,580 --> 00:40:19,840 And suppose that n squared is less than M over 3, 773 00:40:19,840 --> 00:40:24,280 but it's greater than or equal to c M. Then 774 00:40:24,280 --> 00:40:27,580 A is going to fit into cache, and the number of cache 775 00:40:27,580 --> 00:40:31,600 misses required to read all of A's elements into cache is, 776 00:40:31,600 --> 00:40:38,470 at most, 3n squared over B. 777 00:40:38,470 --> 00:40:42,900 So let's see why this is true. 778 00:40:42,900 --> 00:40:45,120 So we're going to let big N denote 779 00:40:45,120 --> 00:40:48,930 the total number of bytes that we need to access. 780 00:40:48,930 --> 00:40:50,940 So big N is going to be equal to n squared. 781 00:40:53,800 --> 00:40:56,550 And we're going to use the cache miss lemma, which 782 00:40:56,550 --> 00:40:59,160 says that if the average length of our segments 783 00:40:59,160 --> 00:41:02,310 is large enough, then we can read all of the segments 784 00:41:02,310 --> 00:41:05,770 in just like it were a single contiguous array. 785 00:41:05,770 --> 00:41:09,930 So the segments here are going to be the rows of the submatrix. 786 00:41:09,930 --> 00:41:13,230 So r, the number of segments, 787 00:41:13,230 --> 00:41:16,470 is going to be little n. 788 00:41:16,470 --> 00:41:18,040 And the segment length is also going 789 00:41:18,040 --> 00:41:21,660 to be little n, since we're working with a square submatrix 790 00:41:21,660 --> 00:41:24,090 here.
791 00:41:24,090 --> 00:41:30,120 And then we also have the cache block size B is less than 792 00:41:30,120 --> 00:41:32,310 or equal to n. 793 00:41:32,310 --> 00:41:36,090 And that's equal to big N over r. 794 00:41:36,090 --> 00:41:39,750 And where do we get this property that B is less than 795 00:41:39,750 --> 00:41:42,600 or equal to n? 796 00:41:42,600 --> 00:41:46,110 So I made some assumptions up here, 797 00:41:46,110 --> 00:41:50,070 where I can use to infer that B is less than or equal to n. 798 00:41:50,070 --> 00:41:53,150 Does anybody see where? 799 00:41:53,150 --> 00:41:53,922 Yeah. 800 00:41:53,922 --> 00:41:55,850 AUDIENCE: So B squared is less than c M, 801 00:41:55,850 --> 00:41:57,300 and c M is [INAUDIBLE] 802 00:41:57,300 --> 00:41:58,050 JULIAN SHUN: Yeah. 803 00:41:58,050 --> 00:42:00,435 So I know that B squared is less than c 804 00:42:00,435 --> 00:42:02,820 M. C M is less than or equal to n squared. 805 00:42:02,820 --> 00:42:05,250 So therefore, B squared is less than n squared, 806 00:42:05,250 --> 00:42:09,360 and B is less than n. 807 00:42:09,360 --> 00:42:15,060 So now, I also have that N is less than M 808 00:42:15,060 --> 00:42:18,450 over 3, just by assumption. 809 00:42:18,450 --> 00:42:20,810 And therefore, I can use the cache miss lemma. 810 00:42:20,810 --> 00:42:23,700 So the cache miss lemma tells me that I only 811 00:42:23,700 --> 00:42:26,610 need a total of 3n squared over B cache 812 00:42:26,610 --> 00:42:28,120 misses to read this whole thing in. 813 00:42:32,780 --> 00:42:35,150 Any questions on the submatrix caching lemma? 814 00:42:48,980 --> 00:42:53,198 So now, let's analyze matrix multiplication. 815 00:42:53,198 --> 00:42:55,490 How many of you have seen matrix multiplication before? 816 00:42:59,250 --> 00:43:00,130 So a couple of you. 
817 00:43:03,340 --> 00:43:07,150 So here's what the code looks like 818 00:43:07,150 --> 00:43:11,260 for the standard cubic work matrix multiplication 819 00:43:11,260 --> 00:43:12,980 algorithm. 820 00:43:12,980 --> 00:43:15,430 So we have two input matrices, A and B, 821 00:43:15,430 --> 00:43:18,610 and we're going to store the result in C. 822 00:43:18,610 --> 00:43:22,930 And the height and the width of our matrix is n. 823 00:43:22,930 --> 00:43:25,798 We're just going to deal with square matrices here, 824 00:43:25,798 --> 00:43:27,340 but what I'm going to talk about also 825 00:43:27,340 --> 00:43:30,770 extends to non-square matrices. 826 00:43:30,770 --> 00:43:33,450 And then we just have three loops here. 827 00:43:33,450 --> 00:43:37,600 We're going to loop through i from 0 to n minus 1, j from 0 828 00:43:37,600 --> 00:43:40,540 to n minus 1, and k from 0 to n minus 1. 829 00:43:40,540 --> 00:43:43,225 And then we're going to let C of i n plus j 830 00:43:43,225 --> 00:43:48,280 be incremented by A of i n plus k times B of k n plus j. 831 00:43:48,280 --> 00:43:53,200 So that's just the standard code for matrix multiply. 832 00:43:53,200 --> 00:43:57,105 So what's the work of this algorithm? 833 00:43:57,105 --> 00:44:02,140 It should be review for all of you. 834 00:44:02,140 --> 00:44:02,740 n cubed. 835 00:44:05,790 --> 00:44:08,850 So now, let's analyze the number of cache 836 00:44:08,850 --> 00:44:11,400 misses this algorithm is going to incur. 837 00:44:11,400 --> 00:44:13,680 And again, we're going to assume that the matrix is 838 00:44:13,680 --> 00:44:16,770 in row-major order, and we satisfy the tall cache 839 00:44:16,770 --> 00:44:17,760 assumption.
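Written out as actual C, the triple loop just described looks like this (row-major storage, with C accumulating the product):

```c
#include <assert.h>

/* Standard cubic-work matrix multiply from the lecture: C += A * B,
 * where A, B, and C are n x n matrices of doubles stored in
 * row-major order. */
static void matmul(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Note the access patterns the analysis below depends on: A is read along a row (unit stride), while B is read down a column (stride of n doubles).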
840 00:44:20,640 --> 00:44:23,100 We're also going to analyze the number of cache 841 00:44:23,100 --> 00:44:25,723 misses in matrix B, because it turns out 842 00:44:25,723 --> 00:44:27,390 that the number of cache misses incurred 843 00:44:27,390 --> 00:44:29,850 by matrix B is going to dominate the number of cache 844 00:44:29,850 --> 00:44:31,470 misses overall. 845 00:44:31,470 --> 00:44:33,720 And there are three cases we need to consider. 846 00:44:33,720 --> 00:44:37,110 The first case is when n is greater than c M 847 00:44:37,110 --> 00:44:39,570 over B for some constant c. 848 00:44:42,890 --> 00:44:44,900 And we're going to analyze matrix B, as I said. 849 00:44:44,900 --> 00:44:48,650 And we're also going to assume LRU, because we can. 850 00:44:48,650 --> 00:44:50,300 If you recall, the LRU lemma says 851 00:44:50,300 --> 00:44:52,390 that whatever we analyze using the LRU 852 00:44:52,390 --> 00:44:55,160 is just going to be within a constant factor of what we analyze 853 00:44:55,160 --> 00:44:56,420 using the ideal cache. 854 00:45:01,220 --> 00:45:07,460 So to do this matrix multiplication, 855 00:45:07,460 --> 00:45:10,940 I'm going to go through one row of A and one column of B 856 00:45:10,940 --> 00:45:12,740 and do the dot product there. 857 00:45:12,740 --> 00:45:17,460 This is what happens in the innermost loop. 858 00:45:17,460 --> 00:45:19,010 And how many cache misses am I going 859 00:45:19,010 --> 00:45:24,110 to incur when I go down one column of B here? 860 00:45:24,110 --> 00:45:29,120 So here, I have the case where n is greater than M over B. 861 00:45:29,120 --> 00:45:38,430 So I can't fit one block from each row into the cache. 862 00:45:38,430 --> 00:45:40,490 So how many cache misses do I have the first time 863 00:45:40,490 --> 00:45:41,840 I go down a column of B? 864 00:45:44,440 --> 00:45:45,990 So how many rows of B do I have? 865 00:45:48,820 --> 00:45:49,700 n.
866 00:45:49,700 --> 00:45:54,850 Yeah, and how many cache misses do I need for each row? 867 00:45:54,850 --> 00:45:55,350 One. 868 00:45:55,350 --> 00:45:58,590 So in total, I'm going to need n cache misses 869 00:45:58,590 --> 00:46:02,280 for the first column of B. 870 00:46:02,280 --> 00:46:04,020 What about the second column of B? 871 00:46:08,980 --> 00:46:12,090 So recall that I'm assuming the LRU replacement policy here. 872 00:46:12,090 --> 00:46:13,590 So when the cache is full, I'm going 873 00:46:13,590 --> 00:46:17,030 to evict the thing that was least recently used-- 874 00:46:17,030 --> 00:46:18,610 used the furthest in the past. 875 00:46:26,932 --> 00:46:28,140 Sorry, could you repeat that? 876 00:46:28,140 --> 00:46:29,080 AUDIENCE: [INAUDIBLE]. 877 00:46:29,080 --> 00:46:29,830 JULIAN SHUN: Yeah. 878 00:46:29,830 --> 00:46:30,997 So it's still going to be n. 879 00:46:30,997 --> 00:46:33,462 Why is that? 880 00:46:33,462 --> 00:46:38,350 AUDIENCE: Because there are [INAUDIBLE] integer. 881 00:46:38,350 --> 00:46:39,822 JULIAN SHUN: Yeah. 882 00:46:39,822 --> 00:46:41,280 It's still going to be n, because I 883 00:46:41,280 --> 00:46:45,030 can't fit one cache block from each row into my cache. 884 00:46:45,030 --> 00:46:48,630 And by the time I get back to the top of my matrix B, 885 00:46:48,630 --> 00:46:52,130 the top block has already been evicted from the cache, 886 00:46:52,130 --> 00:46:53,410 and I have to load it back in. 887 00:46:53,410 --> 00:46:56,070 And this is the same for every other block that I access. 888 00:46:56,070 --> 00:46:58,680 So I'm, again, going to need n cache misses 889 00:46:58,680 --> 00:47:01,200 for the second column of B. And this 890 00:47:01,200 --> 00:47:05,400 is going to be the same for all the columns of B. 891 00:47:05,400 --> 00:47:09,790 And then I have to do this again for the second row of A. 
892 00:47:09,790 --> 00:47:13,120 So in total, I'm going to need theta of n 893 00:47:13,120 --> 00:47:15,730 cubed number of cache misses. 894 00:47:15,730 --> 00:47:21,710 And this is one cache miss per entry that I access in B. 895 00:47:21,710 --> 00:47:25,420 And this is not very good, because the total work was also 896 00:47:25,420 --> 00:47:26,270 theta of n cubed. 897 00:47:26,270 --> 00:47:29,170 So I'm not gaining anything from having any locality 898 00:47:29,170 --> 00:47:32,900 in this algorithm here. 899 00:47:32,900 --> 00:47:36,440 So any questions on this analysis? 900 00:47:36,440 --> 00:47:39,410 So this is just case 1. 901 00:47:39,410 --> 00:47:41,580 Let's look at case 2. 902 00:47:41,580 --> 00:47:46,130 So in this case, n is less than c M over B. 903 00:47:46,130 --> 00:47:50,270 So I can fit one block from each row of B into cache. 904 00:47:50,270 --> 00:47:55,370 And then n is also greater than another constant, c prime times 905 00:47:55,370 --> 00:48:00,080 square root of M, so I can't fit the whole matrix into cache. 906 00:48:00,080 --> 00:48:02,600 And again, let's analyze the number of cache 907 00:48:02,600 --> 00:48:07,432 misses incurred by accessing B, assuming LRU. 908 00:48:07,432 --> 00:48:08,890 So how many cache misses am I going 909 00:48:08,890 --> 00:48:12,882 to incur for the first column of B? 910 00:48:12,882 --> 00:48:13,382 AUDIENCE: n. 911 00:48:13,382 --> 00:48:14,007 JULIAN SHUN: n. 912 00:48:14,007 --> 00:48:15,530 So that's the same as before. 913 00:48:15,530 --> 00:48:18,470 What about the second column of B? 914 00:48:18,470 --> 00:48:24,260 So by the time I get to the beginning of the matrix here, 915 00:48:24,260 --> 00:48:26,690 is the top block going to be in cache? 916 00:48:29,940 --> 00:48:33,330 So who thinks the block is still going to be in cache when 917 00:48:33,330 --> 00:48:35,410 I get back to the beginning? 918 00:48:35,410 --> 00:48:35,910 Yeah. 
919 00:48:35,910 --> 00:48:37,320 So a couple of people. 920 00:48:37,320 --> 00:48:39,000 Who thinks it's going to be out of cache? 921 00:48:42,550 --> 00:48:46,660 So it turns out it is going to be in cache, because I 922 00:48:46,660 --> 00:48:50,710 can fit one block for every row of B into my cache 923 00:48:50,710 --> 00:48:53,980 since I have n less than c M over B. 924 00:48:53,980 --> 00:48:58,668 So therefore, when I get to the beginning of the second column, 925 00:48:58,668 --> 00:49:01,210 that block is still going to be in cache, because I loaded it 926 00:49:01,210 --> 00:49:03,050 in when I was accessing the first column. 927 00:49:03,050 --> 00:49:04,800 So I'm not going to incur any cache misses 928 00:49:04,800 --> 00:49:07,450 for the second column. 929 00:49:07,450 --> 00:49:14,230 And, in general, if I can fit B columns or some constant 930 00:49:14,230 --> 00:49:19,540 times B columns into cache, then I 931 00:49:19,540 --> 00:49:23,830 can reduce the number of cache misses I have by a factor of B. 932 00:49:23,830 --> 00:49:26,365 So I only need to incur a cache miss the first time I 933 00:49:26,365 --> 00:49:29,190 access a block and not for all the subsequent accesses. 934 00:49:33,250 --> 00:49:37,740 And the same is true for the second row of A. 935 00:49:37,740 --> 00:49:40,500 And since I have n rows of A, I'm 936 00:49:40,500 --> 00:49:44,850 going to have n times theta of n squared over B cache misses. 937 00:49:44,850 --> 00:49:46,530 For each row of A, I'm going to incur 938 00:49:46,530 --> 00:49:49,260 n squared over B cache misses. 939 00:49:49,260 --> 00:49:52,750 So the overall number of cache misses is n cubed over B. 940 00:49:52,750 --> 00:49:55,110 And this is because inside matrix B 941 00:49:55,110 --> 00:49:56,850 I can exploit spatial locality. 942 00:49:56,850 --> 00:50:00,000 Once I load in a block, I can reuse it the next time 943 00:50:00,000 --> 00:50:02,280 I traverse down a column that's nearby. 
944 00:50:06,780 --> 00:50:08,400 Any questions on this analysis? 945 00:50:16,640 --> 00:50:18,530 So let's look at the third case. 946 00:50:18,530 --> 00:50:23,120 And here, n is less than c prime times square root of M. 947 00:50:23,120 --> 00:50:27,810 So this means that the entire matrix fits into cache. 948 00:50:27,810 --> 00:50:30,350 So let's analyze the number of cache misses for matrix B 949 00:50:30,350 --> 00:50:32,150 again, assuming LRU. 950 00:50:32,150 --> 00:50:34,100 So how many cache misses do I have now? 951 00:50:36,950 --> 00:50:39,300 So let's count the total number of cache 952 00:50:39,300 --> 00:50:50,750 misses I have for every time I go through a row of A. Yes. 953 00:50:50,750 --> 00:50:53,540 AUDIENCE: Is it just n for the first column? 954 00:50:56,030 --> 00:50:56,780 JULIAN SHUN: Yeah. 955 00:50:56,780 --> 00:51:00,110 So for the first column, it's going to be n. 956 00:51:00,110 --> 00:51:04,000 What about the second column? 957 00:51:04,000 --> 00:51:05,950 AUDIENCE: [INAUDIBLE] the second [INAUDIBLE].. 958 00:51:05,950 --> 00:51:07,420 JULIAN SHUN: Right. 959 00:51:07,420 --> 00:51:11,042 So basically, for the first row of A, 960 00:51:11,042 --> 00:51:13,000 the analysis is going to be the same as before. 961 00:51:13,000 --> 00:51:16,870 I need n squared over B cache misses to load B into the cache. 962 00:51:16,870 --> 00:51:18,750 What about the second row of A? 963 00:51:18,750 --> 00:51:20,500 How many cache misses am I going to incur? 964 00:51:27,262 --> 00:51:30,230 AUDIENCE: [INAUDIBLE]. 965 00:51:30,230 --> 00:51:30,980 JULIAN SHUN: Yeah. 966 00:51:30,980 --> 00:51:32,420 So for the second row of A, I'm not 967 00:51:32,420 --> 00:51:33,770 going to incur any cache misses. 968 00:51:33,770 --> 00:51:36,173 Because once I load B into cache, 969 00:51:36,173 --> 00:51:37,340 it's going to stay in cache. 
970 00:51:37,340 --> 00:51:39,470 Because the entire matrix can fit in cache, 971 00:51:39,470 --> 00:51:44,870 given that I assumed n is less than c prime times square root of M. 972 00:51:44,870 --> 00:51:46,340 So total number of cache misses I 973 00:51:46,340 --> 00:51:50,900 need for matrix B is theta of n squared over B since everything 974 00:51:50,900 --> 00:51:51,660 fits in cache. 975 00:51:51,660 --> 00:51:54,770 And I just apply the submatrix caching lemma from before. 976 00:51:58,100 --> 00:52:00,290 Overall, this is not a very good algorithm. 977 00:52:00,290 --> 00:52:02,360 Because as you recall, in case 1 I 978 00:52:02,360 --> 00:52:06,410 needed a cubic number of cache misses. 979 00:52:09,200 --> 00:52:12,980 What happens if I swap the order of the inner two loops? 980 00:52:12,980 --> 00:52:16,850 So recall that this was one of the optimizations in lecture 1, 981 00:52:16,850 --> 00:52:19,910 when Charles was talking about matrix multiplication 982 00:52:19,910 --> 00:52:22,250 and how to speed it up. 983 00:52:22,250 --> 00:52:26,450 So if I swapped the order of the two inner loops, 984 00:52:26,450 --> 00:52:31,190 then, for every iteration, what I'm doing 985 00:52:31,190 --> 00:52:35,450 is I'm actually going over a row of C and a row of B, 986 00:52:35,450 --> 00:52:40,520 and A stays fixed inside the innermost iteration. 987 00:52:40,520 --> 00:52:42,950 So now, when I analyze the number of cache 988 00:52:42,950 --> 00:52:45,920 misses of matrix B, assuming LRU, 989 00:52:45,920 --> 00:52:47,840 I'm going to benefit from spatial locality, 990 00:52:47,840 --> 00:52:49,970 since I'm going row by row and the matrix is 991 00:52:49,970 --> 00:52:53,030 stored in row-major order. 992 00:52:53,030 --> 00:52:55,700 So across all of the rows, I'm just 993 00:52:55,700 --> 00:53:00,380 going to require theta of n squared over B cache misses. 994 00:53:00,380 --> 00:53:04,142 And I have to do this n times for the outer loop. 
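The loop interchange being analyzed here, with j moved to the innermost position so that the inner loop scans along a row of B and a row of C, might be sketched as follows (same assumptions as the earlier sketch: my own function name, double elements, C zero-initialized by the caller; hoisting A's element into a local is an optional touch):

```c
#include <stddef.h>

// The same multiply with the two inner loops swapped (i, k, j order).
// The innermost loop now walks along a row of B and a row of C in
// row-major order, so consecutive accesses fall in the same cache block,
// while the element of A stays fixed.
void matmul_ikj(const double *A, const double *B, double *C, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t k = 0; k < n; ++k) {
      double a = A[i * n + k];  // A is fixed inside the innermost loop
      for (size_t j = 0; j < n; ++j)
        C[i * n + j] += a * B[k * n + j];
    }
}
```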
995 00:53:04,142 --> 00:53:05,600 So in total, I'm going to get theta 996 00:53:05,600 --> 00:53:08,450 of n cubed over B cache misses. 997 00:53:08,450 --> 00:53:10,700 So if you swap the order of the inner two loops, 998 00:53:10,700 --> 00:53:13,697 this significantly improves the locality of your algorithm, 999 00:53:13,697 --> 00:53:15,530 and you can benefit from spatial locality. 1000 00:53:15,530 --> 00:53:18,500 That's why we saw a significant performance improvement 1001 00:53:18,500 --> 00:53:23,750 in the first lecture when we swapped the order of the loops. 1002 00:53:23,750 --> 00:53:24,560 Any questions? 1003 00:53:31,280 --> 00:53:34,210 So does anybody think we can do better than n 1004 00:53:34,210 --> 00:53:36,140 cubed over B cache misses? 1005 00:53:36,140 --> 00:53:39,440 Or do you think that it's the best you can do? 1006 00:53:39,440 --> 00:53:41,510 So how many people think you can do better? 1007 00:53:46,010 --> 00:53:46,510 Yeah. 1008 00:53:46,510 --> 00:53:49,480 And how many people think this is the best you can do? 1009 00:53:53,780 --> 00:53:55,970 And how many people don't care? 1010 00:54:00,660 --> 00:54:03,960 So it turns out you can do better. 1011 00:54:03,960 --> 00:54:06,060 And we're going to do better by using 1012 00:54:06,060 --> 00:54:09,870 an optimization called tiling. 1013 00:54:09,870 --> 00:54:12,210 So how this is going to work is instead 1014 00:54:12,210 --> 00:54:13,910 of just having three for loops, I'm 1015 00:54:13,910 --> 00:54:15,570 going to have six for loops. 1016 00:54:15,570 --> 00:54:19,220 And I'm going to loop over tiles. 1017 00:54:19,220 --> 00:54:22,070 So I've got a loop over s by s submatrices. 1018 00:54:22,070 --> 00:54:24,110 And within each submatrix, I'm going 1019 00:54:24,110 --> 00:54:27,050 to do all of the computation I need for that submatrix 1020 00:54:27,050 --> 00:54:30,270 before moving on to the next submatrix. 
1021 00:54:30,270 --> 00:54:32,840 So the three innermost loops are going 1022 00:54:32,840 --> 00:54:36,710 to loop inside a submatrix, and the three outermost loops 1023 00:54:36,710 --> 00:54:39,110 are going to loop within the larger matrix, 1024 00:54:39,110 --> 00:54:42,710 one submatrix at a time. 1025 00:54:42,710 --> 00:54:45,330 So let's analyze the work of this algorithm. 1026 00:54:48,150 --> 00:54:54,380 So the work that we need to do for a submatrix of size 1027 00:54:54,380 --> 00:54:58,610 s by s is going to be s cubed, since that's just a bound 1028 00:54:58,610 --> 00:55:00,950 for matrix multiplication. 1029 00:55:00,950 --> 00:55:04,160 And then the number of times I have to operate on submatrices 1030 00:55:04,160 --> 00:55:07,590 is going to be n over s cubed. 1031 00:55:07,590 --> 00:55:11,210 And you can see this if you just consider each submatrix to be 1032 00:55:11,210 --> 00:55:13,820 a single element, and then using the same cubic work 1033 00:55:13,820 --> 00:55:18,740 analysis on the smaller matrix. 1034 00:55:18,740 --> 00:55:22,710 So the work is n over s cubed times s cubed, 1035 00:55:22,710 --> 00:55:24,620 which is equal to theta of n cubed. 1036 00:55:24,620 --> 00:55:27,800 So the work of this tiled matrix multiply is the same 1037 00:55:27,800 --> 00:55:31,820 as the version that didn't do tiling. 1038 00:55:31,820 --> 00:55:34,040 And now, let's analyze the number of cache misses. 1039 00:55:38,390 --> 00:55:42,020 So we're going to tune s so that the submatrices just 1040 00:55:42,020 --> 00:55:43,100 fit into cache. 1041 00:55:43,100 --> 00:55:46,250 So we're going to set s to be equal to theta 1042 00:55:46,250 --> 00:55:53,990 of square root of M. We actually need to make this 1/3 1043 00:55:53,990 --> 00:55:55,760 square root of M, because we need to fit 1044 00:55:55,760 --> 00:55:57,800 three submatrices in the cache. 
1045 00:55:57,800 --> 00:55:59,780 But it's going to be some constant times square 1046 00:55:59,780 --> 00:56:02,780 root of M. 1047 00:56:02,780 --> 00:56:07,190 The submatrix caching lemma implies that for each submatrix 1048 00:56:07,190 --> 00:56:10,550 we're going to need s squared over B misses to load it in. 1049 00:56:10,550 --> 00:56:13,850 And once we load it into cache, it fits entirely into cache, 1050 00:56:13,850 --> 00:56:16,430 so we can do all of our computations within cache 1051 00:56:16,430 --> 00:56:18,230 and not incur any more cache misses. 1052 00:56:21,530 --> 00:56:23,540 So therefore, the total number of cache 1053 00:56:23,540 --> 00:56:26,027 misses we're going to incur is 1054 00:56:26,027 --> 00:56:27,860 going to be the number of subproblems, which 1055 00:56:27,860 --> 00:56:30,860 is n over s cubed, times the number of cache 1056 00:56:30,860 --> 00:56:35,210 misses per subproblem, which is s squared over B. 1057 00:56:35,210 --> 00:56:37,530 And if you multiply this out, you're 1058 00:56:37,530 --> 00:56:43,070 going to get n cubed over B times square root of M. 1059 00:56:43,070 --> 00:56:45,500 So here, I plugged in square root of M for s. 1060 00:56:48,440 --> 00:56:49,940 And this is a pretty cool result, 1061 00:56:49,940 --> 00:56:51,950 because it says that you can actually do better 1062 00:56:51,950 --> 00:56:53,540 than the n cubed over B bound. 1063 00:56:53,540 --> 00:56:58,520 You can improve this bound by a factor of square root of M. 1064 00:56:58,520 --> 00:57:00,950 And in practice, square root of M 1065 00:57:00,950 --> 00:57:04,230 is actually not insignificant. 1066 00:57:04,230 --> 00:57:07,250 So, for example, if you're looking at the last level 1067 00:57:07,250 --> 00:57:10,290 cache, the size of that is on the order of megabytes. 1068 00:57:10,290 --> 00:57:12,080 So square root of M is going to be 1069 00:57:12,080 --> 00:57:13,340 on the order of thousands. 
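The six-loop tiled version being analyzed might be sketched as follows. This is a sketch under the simplifying assumption that the tile size s divides n evenly; s is the parameter you would tune so that three s-by-s tiles fit in cache at once.

```c
#include <stddef.h>

// Tiled matrix multiply: the outer three loops walk over s x s
// submatrices (tiles), and the inner three loops do all of the
// work for one triple of tiles before moving on. Assumes s
// divides n, and that the caller zero-initializes C. s would be
// tuned so that three s x s tiles fit in the cache together.
void matmul_tiled(const double *A, const double *B, double *C,
                  size_t n, size_t s) {
  for (size_t ih = 0; ih < n; ih += s)
    for (size_t jh = 0; jh < n; jh += s)
      for (size_t kh = 0; kh < n; kh += s)
        for (size_t i = ih; i < ih + s; ++i)
          for (size_t j = jh; j < jh + s; ++j)
            for (size_t k = kh; k < kh + s; ++k)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```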
1070 00:57:13,340 --> 00:57:15,710 So this significantly improves the performance 1071 00:57:15,710 --> 00:57:18,110 of the matrix multiplication code 1072 00:57:18,110 --> 00:57:20,750 if you tune s so that the submatrices just 1073 00:57:20,750 --> 00:57:23,540 fit in the cache. 1074 00:57:23,540 --> 00:57:26,180 It turns out that this bound is optimal. 1075 00:57:26,180 --> 00:57:30,590 It was shown in 1981. 1076 00:57:30,590 --> 00:57:32,760 So for cubic work matrix multiplication, 1077 00:57:32,760 --> 00:57:33,950 this is the best you can do. 1078 00:57:33,950 --> 00:57:35,960 If you use another matrix multiply algorithm, 1079 00:57:35,960 --> 00:57:40,380 like Strassen's algorithm, you can do better. 1080 00:57:40,380 --> 00:57:42,230 So I want you to remember this bound. 1081 00:57:42,230 --> 00:57:44,910 It's a very important bound to know. 1082 00:57:44,910 --> 00:57:48,050 It says that for matrix multiplication 1083 00:57:48,050 --> 00:57:51,440 you can benefit both from spatial locality as well 1084 00:57:51,440 --> 00:57:53,160 as temporal locality. 1085 00:57:53,160 --> 00:57:58,820 So I get spatial locality from the B term in the denominator. 1086 00:57:58,820 --> 00:58:00,500 And then the square root of M term 1087 00:58:00,500 --> 00:58:02,510 comes from temporal locality, since I'm 1088 00:58:02,510 --> 00:58:04,730 doing all of the work inside a submatrix 1089 00:58:04,730 --> 00:58:07,310 before I evict that submatrix from cache. 1090 00:58:10,190 --> 00:58:13,250 Any questions on this analysis? 1091 00:58:13,250 --> 00:58:15,640 So what's one issue with this algorithm here? 1092 00:58:19,920 --> 00:58:20,697 Yes. 1093 00:58:20,697 --> 00:58:23,030 AUDIENCE: It's not portable, like different architecture 1094 00:58:23,030 --> 00:58:24,120 [INAUDIBLE]. 1095 00:58:24,120 --> 00:58:24,870 JULIAN SHUN: Yeah. 1096 00:58:24,870 --> 00:58:27,930 So the problem here is I have to tune s 1097 00:58:27,930 --> 00:58:30,910 for my particular machine. 
1098 00:58:30,910 --> 00:58:32,670 And I call this a voodoo parameter. 1099 00:58:32,670 --> 00:58:36,420 It's sort of like a magic number I put into my program 1100 00:58:36,420 --> 00:58:39,900 so that it fits in the cache on the particular machine I'm 1101 00:58:39,900 --> 00:58:40,920 running on. 1102 00:58:40,920 --> 00:58:42,630 And this makes the code not portable, 1103 00:58:42,630 --> 00:58:46,200 because if I try to run this code on another machine, 1104 00:58:46,200 --> 00:58:49,480 the cache sizes might be different there, 1105 00:58:49,480 --> 00:58:51,450 and then I won't get the same performance 1106 00:58:51,450 --> 00:58:53,130 as I did on my machine. 1107 00:58:55,710 --> 00:58:57,840 And this is also an issue even if you're running it 1108 00:58:57,840 --> 00:58:59,423 on the same machine, because you might 1109 00:58:59,423 --> 00:59:01,620 have other programs running at the same time 1110 00:59:01,620 --> 00:59:03,330 and using up part of the cache. 1111 00:59:03,330 --> 00:59:06,540 So you don't actually know how much of the cache 1112 00:59:06,540 --> 00:59:10,020 your program actually gets to use in a multiprogramming 1113 00:59:10,020 --> 00:59:11,036 environment. 1114 00:59:14,610 --> 00:59:17,280 And then this was also just for one level of cache. 1115 00:59:17,280 --> 00:59:20,550 If we want to optimize for two levels of caches, 1116 00:59:20,550 --> 00:59:23,910 we're going to have two voodoo parameters, s and t. 1117 00:59:23,910 --> 00:59:27,370 We're going to have submatrices and sub-submatrices. 1118 00:59:27,370 --> 00:59:29,970 And then we have to tune both of these parameters 1119 00:59:29,970 --> 00:59:32,310 to get the best performance on our machine. 1120 00:59:32,310 --> 00:59:34,410 And multi-dimensional tuning optimization 1121 00:59:34,410 --> 00:59:36,790 can't be done simply with binary search. 
1122 00:59:36,790 --> 00:59:38,790 So if you're just tuning for one level of cache, 1123 00:59:38,790 --> 00:59:41,220 you can do a binary search on the parameter s, 1124 00:59:41,220 --> 00:59:43,470 but here you can't do binary search. 1125 00:59:43,470 --> 00:59:47,910 So it's much more expensive to optimize here. 1126 00:59:47,910 --> 00:59:51,180 And the code becomes a little bit messier. 1127 00:59:51,180 --> 00:59:55,580 You have nine for loops instead of six. 1128 00:59:55,580 --> 00:59:59,330 And how many levels of caches do we have on the machines 1129 00:59:59,330 --> 01:00:00,870 that we're using today? 1130 01:00:00,870 --> 01:00:01,630 AUDIENCE: Three. 1131 01:00:01,630 --> 01:00:02,810 JULIAN SHUN: Three. 1132 01:00:02,810 --> 01:00:06,920 So for a three-level cache, you have three voodoo parameters. 1133 01:00:06,920 --> 01:00:08,510 You have 12 nested for loops. 1134 01:00:08,510 --> 01:00:11,480 This code becomes very ugly. 1135 01:00:11,480 --> 01:00:13,310 And you have to tune these parameters 1136 01:00:13,310 --> 01:00:15,300 for your particular machine. 1137 01:00:15,300 --> 01:00:17,870 And this makes the code not very portable, 1138 01:00:17,870 --> 01:00:19,970 as one student pointed out. 1139 01:00:19,970 --> 01:00:21,650 And in a multiprogramming environment, 1140 01:00:21,650 --> 01:00:23,990 you don't actually know the effective cache size 1141 01:00:23,990 --> 01:00:25,490 that your program has access to. 1142 01:00:25,490 --> 01:00:28,073 Because other jobs are running at the same time, and therefore 1143 01:00:28,073 --> 01:00:30,948 it's very easy to mistune the parameters. 1144 01:00:30,948 --> 01:00:31,740 Was there a question? 1145 01:00:31,740 --> 01:00:33,130 No? 1146 01:00:33,130 --> 01:00:35,310 So any questions? 1147 01:00:35,310 --> 01:00:35,810 Yeah. 1148 01:00:35,810 --> 01:00:37,563 AUDIENCE: Is there a way to programmatically get 1149 01:00:37,563 --> 01:00:38,850 the size of the cache? 
1150 01:00:38,850 --> 01:00:40,120 [INAUDIBLE] 1151 01:00:40,120 --> 01:00:40,870 JULIAN SHUN: Yeah. 1152 01:00:40,870 --> 01:00:43,610 So you can auto-tune your program 1153 01:00:43,610 --> 01:00:47,090 so that it's optimized for the cache sizes 1154 01:00:47,090 --> 01:00:48,283 of your particular machine. 1155 01:00:48,283 --> 01:00:49,658 AUDIENCE: [INAUDIBLE] instruction 1156 01:00:49,658 --> 01:00:52,640 to get the size of the cache [INAUDIBLE].. 1157 01:00:52,640 --> 01:00:56,473 JULIAN SHUN: Instruction to get the size of your cache. 1158 01:00:56,473 --> 01:00:57,390 I'm not actually sure. 1159 01:00:57,390 --> 01:00:57,890 Do you know? 1160 01:00:57,890 --> 01:00:59,172 AUDIENCE: [INAUDIBLE] in-- 1161 01:00:59,172 --> 01:01:00,534 AUDIENCE: [INAUDIBLE]. 1162 01:01:00,534 --> 01:01:02,410 AUDIENCE: Yeah, in the proc-- 1163 01:01:07,595 --> 01:01:09,400 JULIAN SHUN: Yeah, proc cpuinfo. 1164 01:01:09,400 --> 01:01:10,980 AUDIENCE: Yeah. proc cpuinfo or something like that. 1165 01:01:10,980 --> 01:01:11,730 JULIAN SHUN: Yeah. 1166 01:01:11,730 --> 01:01:14,260 So you can probably get that as well. 1167 01:01:14,260 --> 01:01:16,367 AUDIENCE: And I think if you google, 1168 01:01:16,367 --> 01:01:17,950 I think you'll find it pretty quickly. 1169 01:01:17,950 --> 01:01:18,300 JULIAN SHUN: Yeah. 1170 01:01:18,300 --> 01:01:18,925 AUDIENCE: Yeah. 1171 01:01:23,400 --> 01:01:25,710 JULIAN SHUN: But even if you do that, and you're 1172 01:01:25,710 --> 01:01:27,960 running this program when other jobs are running, 1173 01:01:27,960 --> 01:01:30,570 you don't actually know how much cache your program has access 1174 01:01:30,570 --> 01:01:30,780 to. 1175 01:01:30,780 --> 01:01:31,280 Yes? 1176 01:01:31,280 --> 01:01:34,140 AUDIENCE: Is cache architecture and stuff like that 1177 01:01:34,140 --> 01:01:37,110 optimized around matrix problems? 1178 01:01:37,110 --> 01:01:38,355 JULIAN SHUN: No. 1179 01:01:38,355 --> 01:01:41,370 They're actually general purpose. 
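As a concrete follow-up to the question above: on Linux with glibc, sysconf can report the cache sizes directly. These _SC_LEVEL* names are a glibc extension rather than standard POSIX, so treat this as a sketch for that environment; reading /proc/cpuinfo or running lscpu, as mentioned in the discussion, are alternatives.

```c
#include <unistd.h>

// Query the cache size in bytes for a given level using glibc's
// sysconf extensions. Returns -1 when the value is unavailable or
// the level is unknown. Not portable beyond Linux/glibc.
long cache_size_bytes(int level) {
  switch (level) {
    case 1: return sysconf(_SC_LEVEL1_DCACHE_SIZE);
    case 2: return sysconf(_SC_LEVEL2_CACHE_SIZE);
    case 3: return sysconf(_SC_LEVEL3_CACHE_SIZE);
    default: return -1;
  }
}
```

Note that, as pointed out in the lecture, even a correctly queried size only bounds what your program can use; other jobs sharing the machine shrink the effective cache.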
1180 01:01:41,370 --> 01:01:43,320 Today, we're just looking at matrix multiply, 1181 01:01:43,320 --> 01:01:46,290 but on Thursday's lecture we'll actually 1182 01:01:46,290 --> 01:01:47,880 be looking at many other problems 1183 01:01:47,880 --> 01:01:50,848 and how to optimize them for the cache hierarchy. 1184 01:01:56,180 --> 01:01:57,312 Other questions? 1185 01:02:01,790 --> 01:02:06,500 So this was a good algorithm in terms of cache performance, 1186 01:02:06,500 --> 01:02:07,935 but it wasn't very portable. 1187 01:02:07,935 --> 01:02:09,310 So let's see if we can do better. 1188 01:02:09,310 --> 01:02:12,050 Let's see if we can come up with a simpler design 1189 01:02:12,050 --> 01:02:15,390 where we still get pretty good cache performance. 1190 01:02:15,390 --> 01:02:19,250 So we're going to turn to divide and conquer. 1191 01:02:19,250 --> 01:02:21,770 We're going to look at the recursive matrix multiplication 1192 01:02:21,770 --> 01:02:24,750 algorithm that we saw before. 1193 01:02:24,750 --> 01:02:26,750 Again, we're going to deal with square matrices, 1194 01:02:26,750 --> 01:02:30,330 but the results generalize to non-square matrices. 1195 01:02:30,330 --> 01:02:33,800 So how this works is we're going to split 1196 01:02:33,800 --> 01:02:37,340 our [INAUDIBLE] matrices into four submatrices or four 1197 01:02:37,340 --> 01:02:38,990 quadrants. 1198 01:02:38,990 --> 01:02:41,220 And then for each quadrant of the output matrix, 1199 01:02:41,220 --> 01:02:45,110 it's just going to be the sum of two matrix multiplies on n 1200 01:02:45,110 --> 01:02:46,700 over 2 by n over 2 matrices. 1201 01:02:46,700 --> 01:02:51,260 So C 1 1 is going to be A 1 1 times B 1 1, 1202 01:02:51,260 --> 01:02:54,530 plus A 1 2 times B 2 1. 1203 01:02:54,530 --> 01:02:56,900 And then we're going to do this recursively. 
1204 01:02:56,900 --> 01:03:00,140 So at every level of recursion we're 1205 01:03:00,140 --> 01:03:04,070 going to get eight multiply-adds of n over 2 1206 01:03:04,070 --> 01:03:07,580 by n over 2 matrices. 1207 01:03:07,580 --> 01:03:10,440 Here's what the recursive code looks like. 1208 01:03:10,440 --> 01:03:14,660 You can see that we have eight recursive calls here. 1209 01:03:14,660 --> 01:03:17,060 The base case here is of size 1. 1210 01:03:17,060 --> 01:03:19,760 In practice, you want to coarsen the base case to overcome 1211 01:03:19,760 --> 01:03:20,930 function call overheads. 1212 01:03:23,690 --> 01:03:27,480 Let's also look at what these values here correspond to. 1213 01:03:27,480 --> 01:03:31,890 So I've color coded these so that they correspond 1214 01:03:31,890 --> 01:03:33,570 to particular elements in the submatrix 1215 01:03:33,570 --> 01:03:36,330 that I'm looking at on the right. 1216 01:03:36,330 --> 01:03:39,060 So these values here correspond to the index 1217 01:03:39,060 --> 01:03:41,700 of the first element in each of my quadrants. 1218 01:03:41,700 --> 01:03:43,920 So the first element in my first quadrant 1219 01:03:43,920 --> 01:03:47,250 is just going to have an offset of 0. 1220 01:03:47,250 --> 01:03:50,370 And then the first element of my second quadrant, 1221 01:03:50,370 --> 01:03:51,870 that's going to be on the same row 1222 01:03:51,870 --> 01:03:54,120 as the first element in my first quadrant. 1223 01:03:54,120 --> 01:04:02,790 So I just need to add the width of my quadrant, which 1224 01:04:02,790 --> 01:04:04,410 is n over 2. 1225 01:04:04,410 --> 01:04:09,480 And then to get the first element in quadrant 2 1, 1226 01:04:09,480 --> 01:04:12,850 I'm going to jump over n over 2 rows. 1227 01:04:12,850 --> 01:04:16,140 And each row has length row size, 1228 01:04:16,140 --> 01:04:18,930 so it's just going to be n over 2 times row size. 
1229 01:04:18,930 --> 01:04:23,400 And then to get the first element in quadrant 2 2, 1230 01:04:23,400 --> 01:04:27,810 it's just the first element in quadrant 2 1 plus n over 2. 1231 01:04:27,810 --> 01:04:30,450 So that's n over 2 times the quantity row size plus 1. 1232 01:04:34,540 --> 01:04:38,390 So let's analyze the work of this algorithm. 1233 01:04:38,390 --> 01:04:41,930 So what's the recurrence for this algorithm-- 1234 01:04:41,930 --> 01:04:44,750 for the work of this algorithm? 1235 01:04:44,750 --> 01:04:46,300 So how many subproblems do we have? 1236 01:04:46,300 --> 01:04:47,078 AUDIENCE: Eight. 1237 01:04:47,078 --> 01:04:47,870 JULIAN SHUN: Eight. 1238 01:04:47,870 --> 01:04:53,840 And what's the size of each subproblem? n over 2. 1239 01:04:53,840 --> 01:04:57,800 And how much work are we doing to set up the recursive calls? 1240 01:05:00,887 --> 01:05:03,250 A constant amount of work. 1241 01:05:03,250 --> 01:05:06,580 So the recurrence is W of n is equal to 8 W 1242 01:05:06,580 --> 01:05:09,280 of n over 2 plus theta of 1. 1243 01:05:09,280 --> 01:05:12,560 And what does that solve to? 1244 01:05:12,560 --> 01:05:13,440 n cubed. 1245 01:05:13,440 --> 01:05:16,500 So it's one of the three cases of the master theorem. 1246 01:05:20,850 --> 01:05:24,360 We're actually going to analyze this in more detail 1247 01:05:24,360 --> 01:05:25,920 by drawing out the recursion tree. 1248 01:05:25,920 --> 01:05:29,190 And this is going to give us more intuition about why 1249 01:05:29,190 --> 01:05:32,540 the master theorem is true. 
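Putting the quadrant offsets just described together, the recursive multiply might be sketched as follows. This is a sketch with my own function and variable names: n is the side of the current submatrix, row_size is the row length of the full matrix, n is assumed to be a power of 2, and the size-1 base case would be coarsened in practice, as noted above.

```c
#include <stddef.h>

// Recursive (divide-and-conquer) matrix multiply, C += A * B.
// Each n x n submatrix splits into four quadrants; quadrant 1 1
// starts at offset 0, quadrant 1 2 at n/2, quadrant 2 1 at
// (n/2)*row_size, and quadrant 2 2 at (n/2)*(row_size + 1).
// Each output quadrant is the sum of two recursive multiplies.
void matmul_rec(const double *A, const double *B, double *C,
                size_t n, size_t row_size) {
  if (n == 1) {               // base case; coarsen in practice
    C[0] += A[0] * B[0];
    return;
  }
  size_t h = n / 2;           // assumes n is a power of 2
  size_t q21 = h * row_size;  // offset of quadrant 2 1
  matmul_rec(A,           B,           C,           h, row_size); // C11 += A11*B11
  matmul_rec(A + h,       B + q21,     C,           h, row_size); // C11 += A12*B21
  matmul_rec(A,           B + h,       C + h,       h, row_size); // C12 += A11*B12
  matmul_rec(A + h,       B + q21 + h, C + h,       h, row_size); // C12 += A12*B22
  matmul_rec(A + q21,     B,           C + q21,     h, row_size); // C21 += A21*B11
  matmul_rec(A + q21 + h, B + q21,     C + q21,     h, row_size); // C21 += A22*B21
  matmul_rec(A + q21,     B + h,       C + q21 + h, h, row_size); // C22 += A21*B12
  matmul_rec(A + q21 + h, B + q21 + h, C + q21 + h, h, row_size); // C22 += A22*B22
}
```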
1256 01:05:45,950 --> 01:05:47,570 Here, I'm just labeling this with one. 1257 01:05:47,570 --> 01:05:48,820 So I'm ignoring the constants. 1258 01:05:48,820 --> 01:05:52,670 But it's not going to matter for asymptotic analysis. 1259 01:05:52,670 --> 01:05:54,560 And then I'm going to branch again 1260 01:05:54,560 --> 01:05:58,250 into eight subproblems of size n over 4. 1261 01:05:58,250 --> 01:06:01,790 And eventually, I'm going to get down to the leaves. 1262 01:06:01,790 --> 01:06:06,342 And how many levels do I have until I get to the leaves? 1263 01:06:11,510 --> 01:06:12,010 Yes? 1264 01:06:12,010 --> 01:06:12,750 AUDIENCE: Log n. 1265 01:06:12,750 --> 01:06:13,500 JULIAN SHUN: Yeah. 1266 01:06:13,500 --> 01:06:17,790 So log n-- what's the base of the log? 1267 01:06:17,790 --> 01:06:18,290 Yeah. 1268 01:06:18,290 --> 01:06:21,000 So it's log base 2 of n, because I'm dividing my problem 1269 01:06:21,000 --> 01:06:22,470 size by 2 every time. 1270 01:06:24,942 --> 01:06:26,400 And therefore, the number of leaves 1271 01:06:26,400 --> 01:06:28,950 I have is going to be 8 to the log base 2 of n, 1272 01:06:28,950 --> 01:06:31,500 because I'm branching it eight ways every time. 1273 01:06:31,500 --> 01:06:35,400 8 to the log base 2 of n is the same as n to the log base 1274 01:06:35,400 --> 01:06:37,230 2 of 8, which is n cubed. 1275 01:06:40,660 --> 01:06:44,740 The amount of work I'm doing at the top level is constant. 1276 01:06:44,740 --> 01:06:47,530 So I'm just going to say 1 here. 1277 01:06:47,530 --> 01:06:52,450 At the next level, it's eight times, then 64. 1278 01:06:52,450 --> 01:06:54,210 And then when I get to the leaves, 1279 01:06:54,210 --> 01:06:55,900 it's going to be theta of n cubed, 1280 01:06:55,900 --> 01:06:58,330 since I have m cubed leaves, and they're all 1281 01:06:58,330 --> 01:07:01,090 doing constant work. 
1282 01:07:01,090 --> 01:07:04,060 And the work is geometrically increasing as I go down 1283 01:07:04,060 --> 01:07:05,020 the recursion tree. 1284 01:07:05,020 --> 01:07:07,780 So the overall work is just dominated by the work 1285 01:07:07,780 --> 01:07:09,850 I need to do at the leaves. 1286 01:07:09,850 --> 01:07:13,780 So the overall work is just going to be theta of n cubed. 1287 01:07:13,780 --> 01:07:15,430 And this is the same as the looping 1288 01:07:15,430 --> 01:07:18,100 versions of matrix multiply-- 1289 01:07:18,100 --> 01:07:20,410 they're all cubic work. 1290 01:07:20,410 --> 01:07:22,990 Now, let's analyze the number of cache misses of this divide 1291 01:07:22,990 --> 01:07:26,260 and conquer algorithm. 1292 01:07:26,260 --> 01:07:29,540 So now, my recurrence is going to be different. 1293 01:07:29,540 --> 01:07:34,400 My base case now is when the submatrix fits in the cache-- 1294 01:07:34,400 --> 01:07:38,200 so when n squared is less than c M. And when that's true, 1295 01:07:38,200 --> 01:07:40,690 I just need to load that submatrix into cache, 1296 01:07:40,690 --> 01:07:43,300 and then I don't incur any more cache misses. 1297 01:07:43,300 --> 01:07:45,390 So I need theta of n squared over B cache 1298 01:07:45,390 --> 01:07:49,840 misses when n squared is less than c M for some sufficiently 1299 01:07:49,840 --> 01:07:52,360 small constant c, less than or equal to 1. 1300 01:07:52,360 --> 01:07:56,680 And then, otherwise, I recurse into 8 subproblems of size n 1301 01:07:56,680 --> 01:07:57,460 over 2. 1302 01:07:57,460 --> 01:07:59,290 And then I add theta of 1, because I'm 1303 01:07:59,290 --> 01:08:03,740 doing a constant amount of work to set up the recursive calls. 1304 01:08:03,740 --> 01:08:06,700 And I get this theta of n squared over B term 1305 01:08:06,700 --> 01:08:08,935 from the submatrix caching lemma. 
1306 01:08:08,935 --> 01:08:12,430 It says I can just load the entire matrix into cache 1307 01:08:12,430 --> 01:08:15,020 with this many cache misses. 1308 01:08:15,020 --> 01:08:18,359 So the difference between the cache analysis here 1309 01:08:18,359 --> 01:08:20,859 and the work analysis before is that I have a different base 1310 01:08:20,859 --> 01:08:22,510 case. 1311 01:08:22,510 --> 01:08:24,460 And I think in all of the algorithms 1312 01:08:24,460 --> 01:08:26,979 that you've seen before, the base case was always 1313 01:08:26,979 --> 01:08:27,970 of a constant size. 1314 01:08:27,970 --> 01:08:29,800 But here, we're working with a base case 1315 01:08:29,800 --> 01:08:31,350 that's not of a constant size. 1316 01:08:34,359 --> 01:08:36,790 So let's try to analyze this using the recursion tree 1317 01:08:36,790 --> 01:08:38,390 approach. 1318 01:08:38,390 --> 01:08:42,260 So at the top level, I have a problem of size n 1319 01:08:42,260 --> 01:08:44,649 that I'm going to branch into eight problems of size n 1320 01:08:44,649 --> 01:08:45,160 over 2. 1321 01:08:45,160 --> 01:08:48,170 And then I'm also going to incur a constant number of cache 1322 01:08:48,170 --> 01:08:48,670 misses. 1323 01:08:48,670 --> 01:08:51,580 I'm just going to say 1 here. 1324 01:08:51,580 --> 01:08:54,850 Then I'm going to branch again. 1325 01:08:54,850 --> 01:08:58,210 And then, eventually, I'm going to get to the base case 1326 01:08:58,210 --> 01:09:01,840 where n squared is less than c M. 1327 01:09:01,840 --> 01:09:05,649 And when n squared is less than c M, then the number of cache 1328 01:09:05,649 --> 01:09:07,300 misses that I'm going to incur is going 1329 01:09:07,300 --> 01:09:12,460 to be theta of c M over B. So I can just plug in c M here 1330 01:09:12,460 --> 01:09:15,790 for n squared. 1331 01:09:15,790 --> 01:09:17,830 And the number of levels of recursion 1332 01:09:17,830 --> 01:09:22,340 I have in this recursion tree is no longer just log base 2 of n.
1333 01:09:22,340 --> 01:09:27,370 I'm going to have log base 2 of n minus log base 2 1334 01:09:27,370 --> 01:09:31,149 of square root of c M number of levels, which 1335 01:09:31,149 --> 01:09:33,850 is the same as log base 2 of n minus 1/2 times 1336 01:09:33,850 --> 01:09:40,390 log base 2 of c M. And then, the number of leaves I get 1337 01:09:40,390 --> 01:09:44,710 is going to be 8 to this number of levels here. 1338 01:09:44,710 --> 01:09:50,680 So it's 8 to log base 2 of n minus 1/2 of log base 2 of c M. 1339 01:09:50,680 --> 01:09:56,400 And this is equal to theta of n cubed over M to the 3/2. 1340 01:09:56,400 --> 01:10:00,580 So the n cubed comes from the 8 to the log base 2 of n term. 1341 01:10:00,580 --> 01:10:07,450 And then if I do 8 to the negative 1/2 of log base 2 1342 01:10:07,450 --> 01:10:12,520 of c M, that's just going to give me M to the 3/2 1343 01:10:12,520 --> 01:10:13,480 in the denominator. 1344 01:10:16,210 --> 01:10:19,160 So any questions on how I computed the number of levels 1345 01:10:19,160 --> 01:10:20,627 of this recursion tree here? 1346 01:10:29,400 --> 01:10:32,110 So I'm basically dividing my problem size by 2 1347 01:10:32,110 --> 01:10:35,410 until I get to a problem size that fits in the cache. 1348 01:10:35,410 --> 01:10:40,180 So that means n is less than square root of c M. 1349 01:10:40,180 --> 01:10:42,310 So therefore, I can subtract that many levels 1350 01:10:42,310 --> 01:10:43,556 from my recursion tree. 1351 01:10:46,248 --> 01:10:47,790 And then to get the number of leaves, 1352 01:10:47,790 --> 01:10:49,320 since I'm branching eight ways, I 1353 01:10:49,320 --> 01:10:52,630 just do 8 to the power of the number of levels I have. 1354 01:10:52,630 --> 01:10:54,713 And then that gives me the total number of leaves. 1355 01:10:58,580 --> 01:11:00,320 So now, let's analyze the number of cache 1356 01:11:00,320 --> 01:11:03,440 misses I need at each level of this recursion tree.
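The level and leaf counts above can be sanity-checked numerically. Here's a small C sketch using made-up concrete values, not numbers from the lecture: n = 1024 and cM = 64, so the square root of cM is 8, the tree has log2(1024) - (1/2)log2(64) = 10 - 3 = 7 levels, and the leaf count 8^7 should match n cubed over (cM) to the 3/2.

```c
#include <assert.h>
#include <stdint.h>

/* Helpers for checking the recursion-tree arithmetic with exact
 * integer math (both n and cM are assumed powers of 2 here). */
static uint64_t ipow(uint64_t base, unsigned exp) {
    uint64_t r = 1;
    while (exp--) r *= base;   /* repeated multiplication */
    return r;
}

static unsigned ilog2(uint64_t x) {  /* x assumed a power of 2 */
    unsigned l = 0;
    while (x >>= 1) l++;
    return l;
}
```

With n = 1024 and cM = 64, the number of levels is ilog2(1024) - ilog2(64)/2 = 7, and ipow(8, 7) equals ipow(1024, 3) / 512, since (cM) to the 3/2 is 8 cubed = 512.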
1357 01:11:03,440 --> 01:11:05,630 At the top level, I have a constant number 1358 01:11:05,630 --> 01:11:06,710 of cache misses-- 1359 01:11:06,710 --> 01:11:08,240 let's just say 1. 1360 01:11:08,240 --> 01:11:12,530 At the next level, I have 8, then 64. 1361 01:11:12,530 --> 01:11:14,540 And then at the leaves, I'm going 1362 01:11:14,540 --> 01:11:18,050 to have theta of n cubed over B times square root of M cache 1363 01:11:18,050 --> 01:11:18,960 misses. 1364 01:11:18,960 --> 01:11:21,620 And I got this quantity just by multiplying 1365 01:11:21,620 --> 01:11:23,660 the number of leaves by the number 1366 01:11:23,660 --> 01:11:25,040 of cache misses per leaf. 1367 01:11:25,040 --> 01:11:28,730 So the number of leaves is n cubed over M to the 3/2. 1368 01:11:28,730 --> 01:11:32,150 The cache misses per leaf is theta of c M over B. 1369 01:11:32,150 --> 01:11:35,640 So I lose one factor of M in the denominator. 1370 01:11:35,640 --> 01:11:37,940 I'm left with the square root of M at the bottom. 1371 01:11:37,940 --> 01:11:41,450 And then I also divide by the block size B. 1372 01:11:41,450 --> 01:11:45,110 So overall, I get n cubed over B times square root of M cache 1373 01:11:45,110 --> 01:11:46,070 misses. 1374 01:11:46,070 --> 01:11:48,440 And again, this is a geometric series. 1375 01:11:48,440 --> 01:11:50,690 And the number of cache misses at the leaves 1376 01:11:50,690 --> 01:11:53,372 dominates all of the other levels. 1377 01:11:53,372 --> 01:11:54,830 So the total number of cache misses 1378 01:11:54,830 --> 01:11:57,980 I have is going to be theta of n cubed 1379 01:11:57,980 --> 01:12:00,896 over B times square root of M. 1380 01:12:00,896 --> 01:12:04,630 And notice that I'm getting the same number of cache 1381 01:12:04,630 --> 01:12:07,330 misses as I did with the tiling version of the code. 1382 01:12:07,330 --> 01:12:09,710 But here, I don't actually have to tune my code 1383 01:12:09,710 --> 01:12:12,510 for the particular cache size.
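As a concrete illustration, here is a minimal serial sketch of the divide-and-conquer multiply being analyzed. This is not the course's actual code: it assumes row-major n-by-n matrices with n a power of 2, accumulates into C, and uses a single-element base case, which a practical version would coarsen for performance (coarsening only changes constants, not the asymptotic cache bound).

```c
#include <assert.h>

/* Divide-and-conquer matrix multiply sketch: recurses into the 8
 * subproblems of size n/2 analyzed above. (ci, cj), (ai, aj), and
 * (bi, bj) are the top-left corners of the current submatrices of
 * C, A, and B inside the full n-by-n row-major arrays. */
static void matmul_rec(double *C, const double *A, const double *B,
                       int n,                       /* full dimension */
                       int ci, int cj, int ai, int aj, int bi, int bj,
                       int size) {                  /* subproblem size */
    if (size == 1) {
        C[ci * n + cj] += A[ai * n + aj] * B[bi * n + bj];
        return;
    }
    int h = size / 2;
    for (int i = 0; i < 2; i++)      /* the 8 subproblems of size h: */
        for (int j = 0; j < 2; j++)  /* C_ij += A_ik * B_kj          */
            for (int k = 0; k < 2; k++)
                matmul_rec(C, A, B, n,
                           ci + i * h, cj + j * h,
                           ai + i * h, aj + k * h,
                           bi + k * h, bj + j * h, h);
}
```

Note that nothing in the recursion consults a cache size, which is exactly why the same code adapts to whatever M and B a machine happens to have.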
1384 01:12:12,510 --> 01:12:14,958 So what cache sizes does this code work for? 1385 01:12:22,130 --> 01:12:24,481 So is this code going to work on your machine? 1386 01:12:27,920 --> 01:12:30,700 Is it going to get good cache performance? 1387 01:12:30,700 --> 01:12:33,340 So this code is going to work for all cache sizes, 1388 01:12:33,340 --> 01:12:38,370 because I didn't tune it for any particular cache size. 1389 01:12:38,370 --> 01:12:42,250 And this is what's known as a cache-oblivious algorithm. 1390 01:12:42,250 --> 01:12:44,300 It doesn't have any voodoo tuning parameters, 1391 01:12:44,300 --> 01:12:47,030 it has no explicit knowledge of the caches, 1392 01:12:47,030 --> 01:12:49,540 and it's essentially passively auto-tuning itself 1393 01:12:49,540 --> 01:12:53,710 for the particular cache size of your machine. 1394 01:12:53,710 --> 01:12:56,620 It can also work for multi-level caches 1395 01:12:56,620 --> 01:12:59,470 automatically, because I never specified what level of cache 1396 01:12:59,470 --> 01:13:00,940 I'm analyzing this for. 1397 01:13:00,940 --> 01:13:03,170 I can analyze it for any level of cache, 1398 01:13:03,170 --> 01:13:06,330 and it's still going to give me good cache complexity. 1399 01:13:06,330 --> 01:13:08,680 And this is also good in multiprogramming environments, 1400 01:13:08,680 --> 01:13:10,490 where you might have other jobs running 1401 01:13:10,490 --> 01:13:12,410 and you don't know your effective cache size. 1402 01:13:12,410 --> 01:13:14,660 This is just going to passively auto-tune for whatever 1403 01:13:14,660 --> 01:13:15,700 cache size is available. 1404 01:13:18,780 --> 01:13:21,620 It turns out that the best cache-oblivious codes to date 1405 01:13:21,620 --> 01:13:24,150 work on arbitrary rectangular matrices. 1406 01:13:24,150 --> 01:13:26,480 I just talked about square matrices, 1407 01:13:26,480 --> 01:13:29,000 but the best codes work on rectangular matrices. 
1408 01:13:29,000 --> 01:13:30,440 And they perform binary splitting 1409 01:13:30,440 --> 01:13:32,000 instead of eight-way splitting. 1410 01:13:32,000 --> 01:13:37,130 And you split on the largest of i, j, and k. 1411 01:13:37,130 --> 01:13:39,590 So this is what the best cache-oblivious matrix 1412 01:13:39,590 --> 01:13:41,060 multiplication algorithm does. 1413 01:13:44,970 --> 01:13:46,101 Any questions? 1414 01:13:50,940 --> 01:13:54,440 So I only talked about the serial setting so far. 1415 01:13:54,440 --> 01:13:56,090 I was assuming that these algorithms 1416 01:13:56,090 --> 01:13:58,190 ran on just a single thread. 1417 01:13:58,190 --> 01:14:02,674 What happens if I go to multiple processors? 1418 01:14:02,674 --> 01:14:05,340 It turns out that the results do generalize 1419 01:14:05,340 --> 01:14:08,380 to a parallel context. 1420 01:14:08,380 --> 01:14:10,770 So this is the recursive parallel matrix multiply 1421 01:14:10,770 --> 01:14:13,710 code that we saw before. 1422 01:14:13,710 --> 01:14:17,040 And notice that we're executing four subcalls in parallel, 1423 01:14:17,040 --> 01:14:19,620 doing a sync, and then doing four more 1424 01:14:19,620 --> 01:14:20,385 subcalls in parallel. 1425 01:14:23,310 --> 01:14:25,920 So let's try to analyze the number of cache 1426 01:14:25,920 --> 01:14:27,540 misses in this parallel code. 1427 01:14:27,540 --> 01:14:30,210 And to do that, we're going to use this theorem, which 1428 01:14:30,210 --> 01:14:32,910 says that let Q sub p be the number of cache 1429 01:14:32,910 --> 01:14:34,980 misses in a deterministic Cilk computation 1430 01:14:34,980 --> 01:14:39,000 when run on P processors, each with a private cache of size M. 1431 01:14:39,000 --> 01:14:41,610 And let S sub p be the number of successful steals 1432 01:14:41,610 --> 01:14:43,830 during the computation.
1433 01:14:43,830 --> 01:14:46,800 In the ideal cache model, the number of cache 1434 01:14:46,800 --> 01:14:50,970 misses we're going to have is Q sub p equal to Q sub 1 1435 01:14:50,970 --> 01:14:55,830 plus big O of number of steals times M over B. 1436 01:14:55,830 --> 01:14:59,520 So the number of cache misses in the parallel context is 1437 01:14:59,520 --> 01:15:02,730 equal to the number of cache misses when you run it serially 1438 01:15:02,730 --> 01:15:05,970 plus this term here, which is the number of steals 1439 01:15:05,970 --> 01:15:09,670 times M over B. 1440 01:15:09,670 --> 01:15:13,650 And the proof for this goes as follows-- so recall, 1441 01:15:13,650 --> 01:15:16,200 in the Cilk runtime system, we can 1442 01:15:16,200 --> 01:15:18,900 have workers steal tasks from other workers 1443 01:15:18,900 --> 01:15:20,580 when they don't have work to do. 1444 01:15:20,580 --> 01:15:23,520 And after a worker steals a task from another worker, 1445 01:15:23,520 --> 01:15:26,700 its cache becomes completely cold in the worst case, 1446 01:15:26,700 --> 01:15:29,790 because it wasn't actually working on that subproblem 1447 01:15:29,790 --> 01:15:31,080 before. 1448 01:15:31,080 --> 01:15:33,750 But after M over B cold cache misses, 1449 01:15:33,750 --> 01:15:36,630 its cache is going to become identical to what it would 1450 01:15:36,630 --> 01:15:38,500 be in the serial execution. 1451 01:15:38,500 --> 01:15:40,590 So we just need to pay M over B cache 1452 01:15:40,590 --> 01:15:44,130 misses to make it so that the cache looks the same as 1453 01:15:44,130 --> 01:15:47,010 if it were executing serially. 1454 01:15:47,010 --> 01:15:48,630 And the same is true when a worker 1455 01:15:48,630 --> 01:15:52,380 resumes a stolen subcomputation after a Cilk sync.
1456 01:15:52,380 --> 01:15:55,230 And the number of times that these two situations can happen 1457 01:15:55,230 --> 01:15:57,795 is 2 times S sub p-- 1458 01:15:57,795 --> 01:16:00,270 2 times the number of steals. 1459 01:16:00,270 --> 01:16:03,780 And each time, we have to pay M over B cache misses. 1460 01:16:03,780 --> 01:16:06,870 And this is where this additive term comes from-- order 1461 01:16:06,870 --> 01:16:13,260 S sub p times M over B. 1462 01:16:13,260 --> 01:16:16,920 We also know that the number of steals in a Cilk program 1463 01:16:16,920 --> 01:16:21,770 is upper-bounded by P times T infinity 1464 01:16:21,770 --> 01:16:24,150 in expectation, where P is the number of processors 1465 01:16:24,150 --> 01:16:27,390 and T infinity is the span of your computation. 1466 01:16:27,390 --> 01:16:30,060 So if you can minimize the span of your computation, 1467 01:16:30,060 --> 01:16:34,170 then this also gives you good cache bounds. 1468 01:16:34,170 --> 01:16:37,140 So the moral of the story here is that minimizing 1469 01:16:37,140 --> 01:16:41,010 the number of cache misses in the serial elision 1470 01:16:41,010 --> 01:16:44,370 essentially minimizes them in the parallel execution 1471 01:16:44,370 --> 01:16:46,080 for a low-span algorithm. 1472 01:16:48,690 --> 01:16:51,660 So in this recursive matrix multiplication algorithm, 1473 01:16:51,660 --> 01:16:55,910 the span is as follows-- 1474 01:16:55,910 --> 01:16:58,920 so T infinity of n is 2 times T infinity of n over 2 1475 01:16:58,920 --> 01:17:01,260 plus theta of 1. 1476 01:17:01,260 --> 01:17:02,670 Since we're doing a sync here, we 1477 01:17:02,670 --> 01:17:06,960 have to pay the critical path length of two subcalls. 1478 01:17:06,960 --> 01:17:09,180 This solves to theta of n. 1479 01:17:09,180 --> 01:17:12,150 And applying the previous lemma, this gives us 1480 01:17:12,150 --> 01:17:17,190 a cache miss bound of theta of n cubed over B square root of M.
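The four-spawn, sync, four-spawn structure described above can be sketched as follows. This is not the course's actual code: it is written in plain serial C (the serial elision), with comments marking where the cilk_spawn and cilk_sync keywords would go; the quadrant layout and the name mm_dac are illustrative.

```c
#include <assert.h>

/* Recursive matrix multiply on row-major matrices with leading
 * dimension ld, n a power of 2, accumulating into C. Each level
 * splits the matrices into four n/2-by-n/2 quadrants. */
static void mm_dac(double *C, const double *A, const double *B,
                   int n, int ld) {
    if (n == 1) { *C += *A * *B; return; }
    int h = n / 2;
    /* quadrant (r, c) of a matrix M: */
    #define X(M, r, c) ((M) + (r) * h * ld + (c) * h)
    /* First four subcalls -- each would be a cilk_spawn: */
    mm_dac(X(C,0,0), X(A,0,0), X(B,0,0), h, ld);
    mm_dac(X(C,0,1), X(A,0,0), X(B,0,1), h, ld);
    mm_dac(X(C,1,0), X(A,1,0), X(B,0,0), h, ld);
    mm_dac(X(C,1,1), X(A,1,0), X(B,0,1), h, ld);
    /* cilk_sync would go here, since the next four subcalls
     * write the same quadrants of C again: */
    mm_dac(X(C,0,0), X(A,0,1), X(B,1,0), h, ld);
    mm_dac(X(C,0,1), X(A,0,1), X(B,1,1), h, ld);
    mm_dac(X(C,1,0), X(A,1,1), X(B,1,0), h, ld);
    mm_dac(X(C,1,1), X(A,1,1), X(B,1,1), h, ld);
    #undef X
}
```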
1481 01:17:17,190 --> 01:17:20,550 This cache miss bound is just the same as in the serial execution. 1482 01:17:20,550 --> 01:17:24,150 And then this additive term is going to be order P times n-- 1483 01:17:24,150 --> 01:17:29,570 that's the span-- times M over B. 1484 01:17:29,570 --> 01:17:35,510 So that was a parallel algorithm for matrix multiply. 1485 01:17:35,510 --> 01:17:39,320 And we saw that we can also get good cache bounds there. 1486 01:17:39,320 --> 01:17:41,430 So here's a summary of what we talked about today. 1487 01:17:41,430 --> 01:17:45,950 We talked about associativity in caches, different ways 1488 01:17:45,950 --> 01:17:47,790 you can design a cache. 1489 01:17:47,790 --> 01:17:49,520 We talked about the ideal cache model 1490 01:17:49,520 --> 01:17:52,940 that's useful for analyzing algorithms. 1491 01:17:52,940 --> 01:17:55,910 We talked about cache-aware algorithms 1492 01:17:55,910 --> 01:17:58,110 that have explicit knowledge of the cache. 1493 01:17:58,110 --> 01:18:01,850 And the example we used was tiled matrix multiply. 1494 01:18:01,850 --> 01:18:03,980 Then we came up with a much simpler algorithm 1495 01:18:03,980 --> 01:18:09,290 that was cache-oblivious using divide and conquer. 1496 01:18:09,290 --> 01:18:11,510 And then in Thursday's lecture, we'll 1497 01:18:11,510 --> 01:18:14,730 actually see much more on cache-oblivious algorithm 1498 01:18:14,730 --> 01:18:15,230 design. 1499 01:18:15,230 --> 01:18:16,897 And then you'll also have an opportunity 1500 01:18:16,897 --> 01:18:20,150 to analyze the cache efficiency of some algorithms 1501 01:18:20,150 --> 01:18:22,690 in the next homework.