1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:21,632 --> 00:00:22,590 JULIAN SHUN: All right. 9 00:00:22,590 --> 00:00:25,920 So we've talked a little bit about caching before, 10 00:00:25,920 --> 00:00:30,330 but today we're going to talk in much more detail about caching 11 00:00:30,330 --> 00:00:34,680 and how to design cache-efficient algorithms. 12 00:00:34,680 --> 00:00:38,070 So first, let's look at the caching hardware 13 00:00:38,070 --> 00:00:41,830 on modern machines today. 14 00:00:41,830 --> 00:00:43,710 So here's what the cache hierarchy looks 15 00:00:43,710 --> 00:00:46,140 like for a multicore chip. 16 00:00:46,140 --> 00:00:49,310 We have a whole bunch of processors. 17 00:00:49,310 --> 00:00:53,040 They all have their own private L1 caches 18 00:00:53,040 --> 00:00:56,220 for both data and instructions. 19 00:00:56,220 --> 00:00:58,050 They also have a private L2 cache. 20 00:00:58,050 --> 00:01:01,480 And then they share a last level cache, or L3 cache, 21 00:01:01,480 --> 00:01:05,129 which is also called LLC. 22 00:01:05,129 --> 00:01:07,080 They're all connected to a memory controller 23 00:01:07,080 --> 00:01:09,480 that can access DRAM. 24 00:01:09,480 --> 00:01:12,810 And then, oftentimes, you'll have multiple chips 25 00:01:12,810 --> 00:01:16,710 on the same server, and these chips 26 00:01:16,710 --> 00:01:18,880 would be connected through a network. 
27 00:01:18,880 --> 00:01:20,910 So here we have a bunch of multicore chips 28 00:01:20,910 --> 00:01:24,130 that are connected together. 29 00:01:24,130 --> 00:01:27,300 So we can see that there are different levels of memory 30 00:01:27,300 --> 00:01:30,160 here. 31 00:01:30,160 --> 00:01:32,520 And the sizes of each one of these levels of memory 32 00:01:32,520 --> 00:01:33,750 are different. 33 00:01:33,750 --> 00:01:36,690 So the sizes tend to go up as you move up 34 00:01:36,690 --> 00:01:39,480 the memory hierarchy. 35 00:01:39,480 --> 00:01:44,970 The L1 caches tend to be about 32 kilobytes. 36 00:01:44,970 --> 00:01:47,327 In fact, these are the specifications for the machines 37 00:01:47,327 --> 00:01:48,660 that you're using in this class. 38 00:01:48,660 --> 00:01:51,660 So 32 kilobytes for both the L1 data cache 39 00:01:51,660 --> 00:01:54,540 and the L1 instruction cache. 40 00:01:54,540 --> 00:01:57,580 256 kilobytes for the L2 cache. 41 00:01:57,580 --> 00:02:01,200 So the L2 cache tends to be about 8 to 10 times 42 00:02:01,200 --> 00:02:03,570 larger than the L1 cache. 43 00:02:03,570 --> 00:02:06,790 And then the last level cache, the size is 30 megabytes. 44 00:02:06,790 --> 00:02:10,610 So this is typically on the order of tens of megabytes. 45 00:02:10,610 --> 00:02:14,250 And then DRAM is on the order of gigabytes. 46 00:02:14,250 --> 00:02:18,320 So here we have 128 gigabytes of DRAM. 47 00:02:18,320 --> 00:02:21,480 And nowadays, you can actually get machines 48 00:02:21,480 --> 00:02:25,440 that have terabytes of DRAM. 49 00:02:25,440 --> 00:02:29,880 So the associativity tends to go up as you move up 50 00:02:29,880 --> 00:02:30,780 the cache hierarchy. 51 00:02:30,780 --> 00:02:32,970 And I'll talk more about associativity 52 00:02:32,970 --> 00:02:34,980 on the next couple of slides. 53 00:02:34,980 --> 00:02:37,800 The time to access the memory also tends to go up. 
54 00:02:37,800 --> 00:02:39,870 So the latency tends to go up as you move up 55 00:02:39,870 --> 00:02:41,020 the memory hierarchy. 56 00:02:41,020 --> 00:02:44,490 So the L1 caches are the quickest to access, 57 00:02:44,490 --> 00:02:48,270 about two nanoseconds, just rough numbers. 58 00:02:48,270 --> 00:02:50,380 The L2 cache is a little bit slower-- 59 00:02:50,380 --> 00:02:52,810 so say four nanoseconds. 60 00:02:52,810 --> 00:02:55,410 Last level cache, maybe six nanoseconds. 61 00:02:55,410 --> 00:02:57,240 And then when you have to go to DRAM, 62 00:02:57,240 --> 00:03:00,930 it's about an order of magnitude slower-- so 50 nanoseconds 63 00:03:00,930 --> 00:03:03,280 in this example. 64 00:03:03,280 --> 00:03:09,420 And the reason why the memories further down in the cache 65 00:03:09,420 --> 00:03:11,070 hierarchy are faster is because they're 66 00:03:11,070 --> 00:03:14,650 using more expensive materials to manufacture these things. 67 00:03:14,650 --> 00:03:18,120 But since they tend to be more expensive, we can't fit as much 68 00:03:18,120 --> 00:03:19,720 of that on the machines. 69 00:03:19,720 --> 00:03:22,620 So that's why the faster memories are smaller 70 00:03:22,620 --> 00:03:24,690 than the slower memories. 71 00:03:24,690 --> 00:03:26,880 But if we're able to take advantage of locality 72 00:03:26,880 --> 00:03:31,167 in our programs, then we can make use of the fast memory 73 00:03:31,167 --> 00:03:32,000 as much as possible. 74 00:03:32,000 --> 00:03:36,730 And we'll talk about ways to do that in this lecture today. 75 00:03:36,730 --> 00:03:39,000 There's also the latency across the network, which 76 00:03:39,000 --> 00:03:42,660 tends to be cheaper than going to main memory 77 00:03:42,660 --> 00:03:47,475 but slower than doing a last level cache access. 
78 00:03:50,520 --> 00:03:52,410 And there's a lot of work in trying 79 00:03:52,410 --> 00:03:55,770 to get the cache coherence protocols right, as we 80 00:03:55,770 --> 00:03:56,860 mentioned before. 81 00:03:56,860 --> 00:03:59,730 So since these processors all have private caches, 82 00:03:59,730 --> 00:04:01,200 we need to make sure that they all 83 00:04:01,200 --> 00:04:03,510 see a consistent view of memory when 84 00:04:03,510 --> 00:04:05,670 they're trying to access the same memory 85 00:04:05,670 --> 00:04:08,290 addresses in parallel. 86 00:04:08,290 --> 00:04:11,340 So we talked about the MSI cache protocol before. 87 00:04:11,340 --> 00:04:13,500 And there are many other protocols out there, 88 00:04:13,500 --> 00:04:16,510 and you can read more about these things online. 89 00:04:16,510 --> 00:04:18,730 But these are very hard to get right, 90 00:04:18,730 --> 00:04:20,700 and there's a lot of verification involved 91 00:04:20,700 --> 00:04:23,110 in trying to prove that the cache coherence protocols are 92 00:04:23,110 --> 00:04:23,610 correct. 93 00:04:27,490 --> 00:04:29,050 So any questions so far? 94 00:04:33,600 --> 00:04:34,100 OK. 95 00:04:34,100 --> 00:04:38,210 So let's talk about the associativity of a cache. 96 00:04:38,210 --> 00:04:41,690 So here I'm showing you a fully associative cache. 97 00:04:41,690 --> 00:04:43,700 And in a fully associative cache, 98 00:04:43,700 --> 00:04:47,060 a cache block can reside anywhere in the cache. 99 00:04:47,060 --> 00:04:50,760 And a basic unit of movement here is a cache block. 100 00:04:50,760 --> 00:04:53,750 In this example, the cache block size is 4 bytes, 101 00:04:53,750 --> 00:04:57,050 but on the machines that we're using for this class, 102 00:04:57,050 --> 00:05:00,110 the cache block size is 64 bytes. 103 00:05:00,110 --> 00:05:04,470 But for this example, I'm going to use a four byte cache line. 104 00:05:04,470 --> 00:05:07,160 So each row here corresponds to one cache line. 
105 00:05:07,160 --> 00:05:10,310 And a fully associative cache means that each line here 106 00:05:10,310 --> 00:05:13,225 can go anywhere in the cache. 107 00:05:13,225 --> 00:05:14,600 And then here we're also assuming 108 00:05:14,600 --> 00:05:17,420 a cache size of 32 bytes. 109 00:05:17,420 --> 00:05:19,450 So, in total, it can store eight cache lines, 110 00:05:19,450 --> 00:05:21,245 since each cache line is 4 bytes. 111 00:05:24,970 --> 00:05:28,840 So to find a block in a fully associative cache, 112 00:05:28,840 --> 00:05:30,970 you have to actually search the entire cache, 113 00:05:30,970 --> 00:05:35,740 because a cache line can appear anywhere in the cache. 114 00:05:35,740 --> 00:05:38,860 And there's a tag associated with each of these cache lines 115 00:05:38,860 --> 00:05:42,610 here that basically specifies which 116 00:05:42,610 --> 00:05:45,670 of the memory addresses in virtual memory space 117 00:05:45,670 --> 00:05:47,740 it corresponds to. 118 00:05:47,740 --> 00:05:49,440 So for the fully associative cache, 119 00:05:49,440 --> 00:05:51,940 we're actually going to use most of the bits of that address 120 00:05:51,940 --> 00:05:53,160 as a tag. 121 00:05:53,160 --> 00:05:54,910 We don't actually need the two lower order 122 00:05:54,910 --> 00:05:56,980 bits, because the things are being 123 00:05:56,980 --> 00:06:00,010 moved at the granularity of cache lines, which 124 00:06:00,010 --> 00:06:00,670 are four bytes. 125 00:06:00,670 --> 00:06:03,190 So the two lower order bits are always going to be the same, 126 00:06:03,190 --> 00:06:05,560 but we're just going to use the rest of the bits 127 00:06:05,560 --> 00:06:07,070 to store the tag. 128 00:06:07,070 --> 00:06:09,640 So if our address space is 64 bits, 129 00:06:09,640 --> 00:06:12,550 then we're going to use 62 bits to store the tag in a fully 130 00:06:12,550 --> 00:06:14,800 associative caching scheme. 
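As a quick check on the tag arithmetic above, here is a minimal sketch using the lecture's toy numbers (4-byte lines, 64-bit addresses):

```python
w = 64  # bits in a virtual address
B = 4   # bytes per cache line in this toy example

# The low log2(B) bits select a byte within the line; in a fully
# associative cache, all of the remaining bits form the tag.
offset_bits = (B - 1).bit_length()  # log2(4) = 2
tag_bits = w - offset_bits          # 64 - 2 = 62
```

With the 64-byte lines on the class machines, the same arithmetic would give a 58-bit tag instead.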
131 00:06:14,800 --> 00:06:18,010 And when a cache becomes full, a block 132 00:06:18,010 --> 00:06:22,000 has to be evicted to make room for a new block. 133 00:06:22,000 --> 00:06:24,790 And there are various ways that you can 134 00:06:24,790 --> 00:06:26,660 decide how to evict a block. 135 00:06:26,660 --> 00:06:29,260 So this is known as the replacement policy. 136 00:06:29,260 --> 00:06:32,820 One common replacement policy is LRU, or Least Recently Used. 137 00:06:32,820 --> 00:06:34,720 So you basically kick the thing out that 138 00:06:34,720 --> 00:06:39,020 has been used the farthest in the past. 139 00:06:39,020 --> 00:06:41,980 There are other schemes, such as second chance and clock 140 00:06:41,980 --> 00:06:44,020 replacement, but we're not going to talk 141 00:06:44,020 --> 00:06:47,080 too much about the different replacement schemes today. 142 00:06:47,080 --> 00:06:50,780 But you can feel free to read about these things online. 143 00:06:53,470 --> 00:06:55,450 So what's a disadvantage of this scheme? 144 00:07:05,170 --> 00:07:05,670 Yes? 145 00:07:05,670 --> 00:07:07,270 AUDIENCE: It's slow. 146 00:07:07,270 --> 00:07:08,020 JULIAN SHUN: Yeah. 147 00:07:08,020 --> 00:07:08,830 Why is it slow? 148 00:07:08,830 --> 00:07:12,440 AUDIENCE: Because you have to go all the way [INAUDIBLE]. 149 00:07:12,440 --> 00:07:13,190 JULIAN SHUN: Yeah. 150 00:07:13,190 --> 00:07:15,590 So the disadvantage is that searching 151 00:07:15,590 --> 00:07:18,200 for a cache line in the cache can be pretty slow, because you 152 00:07:18,200 --> 00:07:21,380 have to search the entire cache in the worst case, 153 00:07:21,380 --> 00:07:25,370 since a cache block can reside anywhere in the cache. 154 00:07:25,370 --> 00:07:28,010 So even though the search can go on in parallel in hardware, 155 00:07:28,010 --> 00:07:31,340 it's still expensive in terms of power and performance 156 00:07:31,340 --> 00:07:35,030 to have to search most of the cache every time. 
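The LRU policy described above can be sketched in a few lines. This is a toy software model of a fully associative cache, not how hardware implements it (real hardware typically uses cheaper approximations of LRU):

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully associative cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> cached line (order = recency)

    def access(self, tag):
        """Return True on a hit, False on a miss (evicting LRU if full)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False
```

For example, with a capacity of two lines, accessing tags a, b, a, c evicts b, so a later access to b misses again.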
157 00:07:35,030 --> 00:07:37,580 So let's look at another extreme. 158 00:07:37,580 --> 00:07:40,010 This is a direct mapped cache. 159 00:07:40,010 --> 00:07:42,860 So in a direct mapped cache, each cache block 160 00:07:42,860 --> 00:07:45,690 can only go in one place in the cache. 161 00:07:45,690 --> 00:07:48,890 So I've color-coded these cache blocks here. 162 00:07:48,890 --> 00:07:53,990 So the red blocks can only go in the first row of this cache, 163 00:07:53,990 --> 00:07:57,030 the orange ones can only go in the second row, and so on. 164 00:08:00,380 --> 00:08:06,110 And the position which a cache block can go into 165 00:08:06,110 --> 00:08:09,140 is known as that cache block's set. 166 00:08:09,140 --> 00:08:11,240 So the set determines the location 167 00:08:11,240 --> 00:08:14,480 in the cache for each particular block. 168 00:08:14,480 --> 00:08:19,927 So let's look at how the virtual memory address is divided up, 169 00:08:19,927 --> 00:08:21,510 and which of the bits we're going 170 00:08:21,510 --> 00:08:24,380 to use to figure out where a cache block should 171 00:08:24,380 --> 00:08:25,610 go in the cache. 172 00:08:25,610 --> 00:08:29,450 So we have the offset, we have the set, 173 00:08:29,450 --> 00:08:31,820 and then the tag fields. 174 00:08:31,820 --> 00:08:35,179 The offset just tells us which position 175 00:08:35,179 --> 00:08:37,669 we want to access within a cache block. 176 00:08:37,669 --> 00:08:40,010 So since a cache block has B bytes, 177 00:08:40,010 --> 00:08:43,850 we only need log base 2 of B bits as the offset. 178 00:08:43,850 --> 00:08:45,350 And the reason why we have the offset 179 00:08:45,350 --> 00:08:47,383 is because we're not always accessing something 180 00:08:47,383 --> 00:08:48,800 at the beginning of a cache block. 181 00:08:48,800 --> 00:08:50,800 We might want to access something in the middle. 
182 00:08:50,800 --> 00:08:52,190 And that's why we need the offset 183 00:08:52,190 --> 00:08:54,755 to specify where in the cache block we want to access. 184 00:08:57,730 --> 00:08:59,530 Then there's a set field. 185 00:08:59,530 --> 00:09:05,020 And the set field is going to determine which position 186 00:09:05,020 --> 00:09:08,110 in the cache that cache block can go into. 187 00:09:08,110 --> 00:09:12,790 So there are eight possible positions for each cache block. 188 00:09:12,790 --> 00:09:16,240 And therefore, we only need log base 2 of 8 bits-- 189 00:09:16,240 --> 00:09:19,120 so three bits for the set in this example. 190 00:09:19,120 --> 00:09:23,200 And more generally, it's going to be log base 2 of M over B. 191 00:09:23,200 --> 00:09:25,815 And here, M over B is 8. 192 00:09:25,815 --> 00:09:27,940 And then, finally, we're going to use the remaining 193 00:09:27,940 --> 00:09:29,030 bits as a tag. 194 00:09:29,030 --> 00:09:32,800 So w minus log base 2 of M bits for the tag. 195 00:09:32,800 --> 00:09:36,250 And that gets stored along with the cache block in the cache. 196 00:09:36,250 --> 00:09:39,070 And that's going to uniquely identify 197 00:09:39,070 --> 00:09:44,560 which of the memory blocks the cache block corresponds to 198 00:09:44,560 --> 00:09:47,430 in virtual memory. 199 00:09:47,430 --> 00:09:53,110 And you can verify that the sum of all these quantities 200 00:09:53,110 --> 00:09:55,190 here sums to w bits. 201 00:09:55,190 --> 00:09:58,120 So in total, we have a w bit address space. 202 00:09:58,120 --> 00:09:59,863 And the sum of those three things is w. 203 00:10:03,034 --> 00:10:06,880 So what's the advantage and disadvantage of this scheme? 204 00:10:16,880 --> 00:10:19,760 So first, what's a good thing about this scheme compared 205 00:10:19,760 --> 00:10:21,990 to the previous scheme that we saw? 206 00:10:21,990 --> 00:10:22,490 Yes? 207 00:10:22,490 --> 00:10:23,250 AUDIENCE: Faster. 
208 00:10:23,250 --> 00:10:24,000 JULIAN SHUN: Yeah. 209 00:10:24,000 --> 00:10:26,240 It's fast because you only have to check one place. 210 00:10:26,240 --> 00:10:27,620 Because each cache block can only 211 00:10:27,620 --> 00:10:30,410 go in one place in the cache, and that's the only place 212 00:10:30,410 --> 00:10:32,810 you have to check when you try to do a lookup. 213 00:10:32,810 --> 00:10:34,740 If the cache block is there, then you find it. 214 00:10:34,740 --> 00:10:38,750 If it's not, then you know it's not in the cache. 215 00:10:38,750 --> 00:10:42,020 What's the downside to this scheme? 216 00:10:42,020 --> 00:10:42,520 Yeah? 217 00:10:42,520 --> 00:10:44,437 AUDIENCE: You only end up putting the red ones 218 00:10:44,437 --> 00:10:47,350 into the cache and you have mostly every [INAUDIBLE], which 219 00:10:47,350 --> 00:10:48,520 is totally [INAUDIBLE]. 220 00:10:48,520 --> 00:10:49,270 JULIAN SHUN: Yeah. 221 00:10:49,270 --> 00:10:50,440 So good answer. 222 00:10:50,440 --> 00:10:54,630 So the downside is that you might, for example, just 223 00:10:54,630 --> 00:10:58,740 be accessing the red cache blocks 224 00:10:58,740 --> 00:11:01,260 and then not accessing any of the other cache blocks. 225 00:11:01,260 --> 00:11:04,140 They'll all get mapped to the same location in the cache, 226 00:11:04,140 --> 00:11:06,240 and then they'll keep evicting each other, 227 00:11:06,240 --> 00:11:09,150 even though there's a lot of empty space in the cache. 228 00:11:09,150 --> 00:11:11,130 And this is known as a conflict miss. 229 00:11:11,130 --> 00:11:15,000 And these can be very bad for performance 230 00:11:15,000 --> 00:11:16,760 and very hard to debug. 
231 00:11:16,760 --> 00:11:19,140 So that's one downside of a direct-mapped 232 00:11:19,140 --> 00:11:22,950 cache: you can get these conflict misses where you have 233 00:11:22,950 --> 00:11:25,050 to evict things from the cache even though there's 234 00:11:25,050 --> 00:11:26,205 empty space in the cache. 235 00:11:29,720 --> 00:11:32,330 So as we said, finding a block is very fast. 236 00:11:32,330 --> 00:11:35,810 Only a single location in the cache has to be searched. 237 00:11:35,810 --> 00:11:38,390 But you might suffer from conflict 238 00:11:38,390 --> 00:11:40,620 misses if you keep accessing things in the same set 239 00:11:40,620 --> 00:11:45,140 repeatedly without accessing the things in the other sets. 240 00:11:45,140 --> 00:11:46,220 So any questions? 241 00:11:53,030 --> 00:11:53,530 OK. 242 00:11:53,530 --> 00:11:58,870 So these are sort of the two extremes for cache design. 243 00:11:58,870 --> 00:12:01,060 There's actually a hybrid solution 244 00:12:01,060 --> 00:12:03,872 called set associative cache. 245 00:12:03,872 --> 00:12:07,180 And in a set associative cache, you still have sets, 246 00:12:07,180 --> 00:12:11,200 but each of the sets contains more than one line now. 247 00:12:11,200 --> 00:12:14,970 So all the red blocks still map to the red set, 248 00:12:14,970 --> 00:12:16,990 but there's actually two possible locations 249 00:12:16,990 --> 00:12:20,020 for the red blocks now. 250 00:12:20,020 --> 00:12:24,730 So in this case, this is known as a two-way associative cache, 251 00:12:24,730 --> 00:12:28,870 since there are two possible locations inside each set. 252 00:12:28,870 --> 00:12:33,670 And again, a cache block's set determines k possible cache 253 00:12:33,670 --> 00:12:35,140 locations for that block. 254 00:12:35,140 --> 00:12:38,440 So within a set it's fully associative, 255 00:12:38,440 --> 00:12:42,040 but each block can only go in one of the sets. 
256 00:12:44,590 --> 00:12:48,190 So let's look again at how the bits 257 00:12:48,190 --> 00:12:50,680 in the address are divided up. 258 00:12:50,680 --> 00:12:53,770 So we still have the tag, set, and offset fields. 259 00:12:53,770 --> 00:12:58,180 The offset field is still log base 2 of B bits. 260 00:12:58,180 --> 00:13:04,510 The set field is going to take log base 2 of M over kB bits. 261 00:13:04,510 --> 00:13:07,320 So the number of sets we have is M over kB. 262 00:13:07,320 --> 00:13:11,080 So we need log base 2 of that number 263 00:13:11,080 --> 00:13:14,230 to represent the set of a block. 264 00:13:14,230 --> 00:13:17,590 And then, finally, we use the remaining bits as a tag, 265 00:13:17,590 --> 00:13:22,730 so it's going to be w minus log base 2 of M over k. 266 00:13:22,730 --> 00:13:25,900 And now, to find a block in the cache, 267 00:13:25,900 --> 00:13:30,400 only k locations of its set must be searched. 268 00:13:30,400 --> 00:13:33,970 So you basically find which set the cache block maps to, 269 00:13:33,970 --> 00:13:36,130 and then you check all k locations 270 00:13:36,130 --> 00:13:41,320 within that set to see if that cache block is there. 271 00:13:41,320 --> 00:13:44,358 And whenever you try 272 00:13:44,358 --> 00:13:46,900 to put something in the cache because it's not there, 273 00:13:46,900 --> 00:13:48,067 you have to evict something. 274 00:13:48,067 --> 00:13:51,010 And you evict something from the same set as the block 275 00:13:51,010 --> 00:13:53,780 that you're placing into the cache. 276 00:13:53,780 --> 00:13:56,410 So for this example, I showed a two-way associative cache. 277 00:13:56,410 --> 00:13:59,200 But in practice, the associativity is usually bigger-- 278 00:13:59,200 --> 00:14:04,090 say eight-way, 16-way, or sometimes 20-way. 
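The field widths just described can be written as a small function of the cache size M, block size B, and associativity k (all assumed to be powers of two). This is a sketch of the arithmetic, not anything the hardware runs:

```python
import math

def field_widths(M, B, k, w=64):
    """Bit widths of (tag, set, offset) for a k-way set-associative
    cache with M bytes total, B-byte blocks, and w-bit addresses."""
    offset_bits = int(math.log2(B))
    set_bits = int(math.log2(M // (k * B)))  # number of sets is M/(kB)
    tag_bits = w - set_bits - offset_bits    # equals w - log2(M/k)
    return tag_bits, set_bits, offset_bits
```

With k = 1 this reduces to the direct-mapped split, and with k = M/B (a single set) the set field disappears and the tag takes all remaining bits, matching the 62-bit tag of the fully associative 32-byte toy cache.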
279 00:14:04,090 --> 00:14:09,490 And as you keep increasing the associativity, 280 00:14:09,490 --> 00:14:13,130 it's going to look more and more like a fully associative cache. 281 00:14:13,130 --> 00:14:15,460 And if you have a one-way associative cache, 282 00:14:15,460 --> 00:14:17,050 then that's just a direct-mapped cache. 283 00:14:17,050 --> 00:14:21,310 So this is sort of a hybrid in between a fully associative 284 00:14:21,310 --> 00:14:24,325 cache and a direct-mapped cache. 285 00:14:27,620 --> 00:14:30,650 So any questions on set-associative caches? 286 00:14:38,310 --> 00:14:38,810 OK. 287 00:14:38,810 --> 00:14:43,340 So let's go over a taxonomy of different types of cache 288 00:14:43,340 --> 00:14:45,510 misses that you can incur. 289 00:14:45,510 --> 00:14:48,620 So the first type of cache miss is called a cold miss. 290 00:14:48,620 --> 00:14:50,150 And this is the cache miss that you 291 00:14:50,150 --> 00:14:53,705 have to incur the first time you access a cache block. 292 00:14:53,705 --> 00:14:55,580 And if you need to access this piece of data, 293 00:14:55,580 --> 00:14:58,220 there's no way to get around getting a cold miss for this. 294 00:14:58,220 --> 00:15:01,225 Because your cache starts out not having this block, 295 00:15:01,225 --> 00:15:02,600 and the first time you access it, 296 00:15:02,600 --> 00:15:06,960 you have to bring it into cache. 297 00:15:06,960 --> 00:15:09,860 Then there are capacity misses. 298 00:15:09,860 --> 00:15:12,660 So capacity misses are cache misses 299 00:15:12,660 --> 00:15:14,690 you get because the cache is full 300 00:15:14,690 --> 00:15:16,590 and it can't fit all of the cache blocks 301 00:15:16,590 --> 00:15:18,870 that you want to access. 302 00:15:18,870 --> 00:15:21,540 So you get a capacity miss when the previous cache 303 00:15:21,540 --> 00:15:23,970 copy would have been evicted even with a fully 304 00:15:23,970 --> 00:15:24,870 associative scheme. 
305 00:15:24,870 --> 00:15:28,260 So even if all of the possible locations in your cache 306 00:15:28,260 --> 00:15:31,230 could be used for a particular cache line, 307 00:15:31,230 --> 00:15:33,750 that cache line still has to be evicted because there's not 308 00:15:33,750 --> 00:15:34,420 enough space. 309 00:15:34,420 --> 00:15:37,530 So that's what's called a capacity miss. 310 00:15:37,530 --> 00:15:41,010 And you can deal with capacity misses 311 00:15:41,010 --> 00:15:44,610 by introducing more locality into your code, both spatial 312 00:15:44,610 --> 00:15:46,440 and temporal locality. 313 00:15:46,440 --> 00:15:48,690 And we'll look at ways to reduce the capacity 314 00:15:48,690 --> 00:15:51,420 misses of algorithms later on in this lecture. 315 00:15:53,930 --> 00:15:55,830 Then there are conflict misses. 316 00:15:55,830 --> 00:16:00,000 And conflict misses happen in set-associative caches 317 00:16:00,000 --> 00:16:06,420 when you have too many blocks from the same set wanting 318 00:16:06,420 --> 00:16:08,640 to go into the cache. 319 00:16:08,640 --> 00:16:10,770 And some of these have to be evicted, 320 00:16:10,770 --> 00:16:14,130 because the set can't fit all of the blocks. 321 00:16:14,130 --> 00:16:15,720 And these blocks wouldn't have been 322 00:16:15,720 --> 00:16:18,540 evicted if you had a fully associative scheme, so these 323 00:16:18,540 --> 00:16:21,750 are what's called conflict misses. 324 00:16:21,750 --> 00:16:25,800 For example, if you have 16 things in a set 325 00:16:25,800 --> 00:16:29,820 and you keep accessing 17 things that all belong in the set, 326 00:16:29,820 --> 00:16:32,310 something's going to get kicked out 327 00:16:32,310 --> 00:16:35,340 every time you want to access something. 328 00:16:35,340 --> 00:16:38,280 And these cache evictions might not 329 00:16:38,280 --> 00:16:41,115 have happened if you had a fully associative cache. 330 00:16:44,600 --> 00:16:46,460 And then, finally, there are sharing misses. 
331 00:16:46,460 --> 00:16:50,810 So sharing misses only happen in a parallel context. 332 00:16:50,810 --> 00:16:52,940 And we talked a little bit about true sharing 333 00:16:52,940 --> 00:16:56,300 and false sharing misses in prior lectures. 334 00:16:56,300 --> 00:16:59,270 So let's just review this briefly. 335 00:16:59,270 --> 00:17:03,860 So a sharing miss can happen if multiple processors are 336 00:17:03,860 --> 00:17:06,619 accessing the same cache line and at least one of them 337 00:17:06,619 --> 00:17:08,869 is writing to that cache line. 338 00:17:08,869 --> 00:17:10,460 If all of the processors are just 339 00:17:10,460 --> 00:17:13,010 reading from the cache line, then the cache coherence 340 00:17:13,010 --> 00:17:16,250 protocol knows how to make it work so that you don't get 341 00:17:16,250 --> 00:17:16,880 misses. 342 00:17:16,880 --> 00:17:19,670 They can all access the same cache line at the same time 343 00:17:19,670 --> 00:17:22,099 if nobody's modifying it. 344 00:17:22,099 --> 00:17:24,290 But if at least one processor is modifying it, 345 00:17:24,290 --> 00:17:26,359 you could get either true sharing misses 346 00:17:26,359 --> 00:17:28,250 or false sharing misses. 347 00:17:28,250 --> 00:17:31,580 So a true sharing miss is when two processors are 348 00:17:31,580 --> 00:17:36,590 accessing the same data on the same cache line. 349 00:17:36,590 --> 00:17:38,750 And as you recall from a previous lecture, 350 00:17:38,750 --> 00:17:41,150 if one of the two processors is writing to this cache 351 00:17:41,150 --> 00:17:43,640 line, whenever it does a write it 352 00:17:43,640 --> 00:17:46,370 needs to acquire the cache line in exclusive mode 353 00:17:46,370 --> 00:17:51,710 and then invalidate that cache line in all other caches. 
354 00:17:51,710 --> 00:17:54,020 So then when another processor 355 00:17:54,020 --> 00:17:55,820 tries to access the same memory location, 356 00:17:55,820 --> 00:17:58,130 it has to bring it back into its own cache, 357 00:17:58,130 --> 00:18:02,260 and then you get a cache miss there. 358 00:18:02,260 --> 00:18:04,430 A false sharing miss happens if two processors 359 00:18:04,430 --> 00:18:07,070 are accessing different data that just happen to reside 360 00:18:07,070 --> 00:18:08,870 on the same cache line. 361 00:18:08,870 --> 00:18:10,670 Because the basic unit of movement 362 00:18:10,670 --> 00:18:13,580 is a cache line in the architecture. 363 00:18:13,580 --> 00:18:15,860 So even if you're accessing different things, 364 00:18:15,860 --> 00:18:17,480 if they are on the same cache line, 365 00:18:17,480 --> 00:18:20,810 you're still going to get a sharing miss. 366 00:18:20,810 --> 00:18:22,940 And false sharing is pretty hard to deal with, 367 00:18:22,940 --> 00:18:26,030 because, in general, you don't know what data 368 00:18:26,030 --> 00:18:28,282 gets placed on what cache line. 369 00:18:28,282 --> 00:18:29,990 There are certain heuristics you can use. 370 00:18:29,990 --> 00:18:32,510 For example, if you're mallocing a big memory region, 371 00:18:32,510 --> 00:18:35,430 you know that that memory region is contiguous, 372 00:18:35,430 --> 00:18:37,670 so you can space your accesses far enough apart 373 00:18:37,670 --> 00:18:40,310 by different processors so they don't touch the same cache 374 00:18:40,310 --> 00:18:41,110 line. 375 00:18:41,110 --> 00:18:43,910 But if you're just declaring local variables on the stack, 376 00:18:43,910 --> 00:18:45,710 you don't know where the compiler 377 00:18:45,710 --> 00:18:50,810 is going to decide to place these variables 378 00:18:50,810 --> 00:18:54,480 in the virtual memory address space. 
379 00:18:54,480 --> 00:18:57,050 So these are four different types of cache 380 00:18:57,050 --> 00:19:00,150 misses that you should know about. 381 00:19:00,150 --> 00:19:02,690 And there's many models out there 382 00:19:02,690 --> 00:19:05,840 for analyzing the cache performance of algorithms. 383 00:19:05,840 --> 00:19:08,720 And some of the models ignore some of these different types 384 00:19:08,720 --> 00:19:10,640 of cache misses. 385 00:19:10,640 --> 00:19:13,940 So just be aware of this when you're looking at algorithm 386 00:19:13,940 --> 00:19:16,010 analysis, because not all of the models 387 00:19:16,010 --> 00:19:18,120 will capture all of these different types of cache 388 00:19:18,120 --> 00:19:18,620 misses. 389 00:19:22,830 --> 00:19:27,540 So let's look at a bad case for conflict misses. 390 00:19:27,540 --> 00:19:33,270 So here I want to access a submatrix within a larger 391 00:19:33,270 --> 00:19:34,440 matrix. 392 00:19:34,440 --> 00:19:39,540 And recall that matrices are stored in row-major order. 393 00:19:39,540 --> 00:19:44,850 And let's say our matrix is 4,096 columns by 4,096 rows 394 00:19:44,850 --> 00:19:47,670 and it stores doubles. 395 00:19:47,670 --> 00:19:50,190 So therefore, each row here is going 396 00:19:50,190 --> 00:19:55,140 to contain 2 to the 15th bytes, because 4,096 397 00:19:55,140 --> 00:19:58,800 is 2 to the 12th, and we have doubles, 398 00:19:58,800 --> 00:20:00,110 which take eight bytes. 399 00:20:00,110 --> 00:20:03,390 So 2 to the 12th times 2 to the 3rd, which is 2 to the 15th. 400 00:20:06,750 --> 00:20:11,280 We're going to assume the word width is 64, which is standard. 401 00:20:11,280 --> 00:20:15,060 We're going to assume that we have a cache size of 32k. 402 00:20:15,060 --> 00:20:19,710 And the cache block size is 64, which, again, is standard. 403 00:20:19,710 --> 00:20:22,125 And let's say we have a four-way associative cache. 
404 00:20:26,520 --> 00:20:31,860 So let's look at how the bits are divided up. 405 00:20:31,860 --> 00:20:36,270 So again we have this offset, which 406 00:20:36,270 --> 00:20:38,867 takes log base 2 of B bits. 407 00:20:38,867 --> 00:20:41,325 So how many bits do we have for the offset in this example? 408 00:20:48,300 --> 00:20:48,800 Right. 409 00:20:48,800 --> 00:20:50,030 So we have 6 bits. 410 00:20:50,030 --> 00:20:53,930 So it's just log base 2 of 64. 411 00:20:53,930 --> 00:20:56,180 What about for the set? 412 00:20:56,180 --> 00:20:59,030 How many bits do we have for that? 413 00:20:59,030 --> 00:21:00,350 7. 414 00:21:00,350 --> 00:21:02,280 Who said 7? 415 00:21:02,280 --> 00:21:02,780 Yeah. 416 00:21:02,780 --> 00:21:04,220 So it is 7. 417 00:21:04,220 --> 00:21:10,130 So M is 32k, which is 2 to the 15th. 418 00:21:10,130 --> 00:21:17,310 And then k is 2 to the 2nd, and B is 2 to the 6th. 419 00:21:17,310 --> 00:21:21,050 So it's 2 to the 15th divided by 2 to the 8th, which is 2 to the 7th. 420 00:21:21,050 --> 00:21:23,930 And log base 2 of that is 7. 421 00:21:23,930 --> 00:21:25,940 And finally, what about the tag field? 422 00:21:29,660 --> 00:21:31,990 AUDIENCE: 51. 423 00:21:31,990 --> 00:21:33,100 JULIAN SHUN: 51. 424 00:21:33,100 --> 00:21:33,730 Why is that? 425 00:21:33,730 --> 00:21:36,330 AUDIENCE: 64 minus 13. 426 00:21:36,330 --> 00:21:37,080 JULIAN SHUN: Yeah. 427 00:21:37,080 --> 00:21:43,880 So it's just 64 minus 7 minus 6, which is 51. 428 00:21:43,880 --> 00:21:44,380 OK. 429 00:21:44,380 --> 00:21:47,890 So let's say that we want to access a submatrix 430 00:21:47,890 --> 00:21:49,710 within this larger matrix. 431 00:21:49,710 --> 00:21:52,660 Let's say we want to access a 32 by 32 submatrix. 432 00:21:52,660 --> 00:21:57,220 And this is pretty common in matrix algorithms, where 433 00:21:57,220 --> 00:21:59,810 you want to access submatrices, especially in divide 434 00:21:59,810 --> 00:22:01,591 and conquer algorithms. 
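Plugging the machine parameters from this example into the field arithmetic confirms the numbers worked out above (6 offset bits, 7 set bits, a 51-bit tag, and 2 to the 15th bytes per row):

```python
import math

M = 32 * 1024        # cache size: 32 KB = 2^15 bytes
B = 64               # block size: 2^6 bytes
k = 4                # four-way set associative
w = 64               # address width in bits

offset_bits = int(math.log2(B))          # log2(64) = 6
set_bits = int(math.log2(M // (k * B)))  # log2(2^15 / 2^8) = 7
tag_bits = w - set_bits - offset_bits    # 64 - 7 - 6 = 51

# Each matrix row is 4,096 doubles of 8 bytes each: 2^15 bytes.
row_bytes = 4096 * 8
```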
435 00:22:04,240 --> 00:22:09,850 And let's say we want to access a column of this submatrix A. 436 00:22:09,850 --> 00:22:13,180 So the addresses of the elements that we're going to access 437 00:22:13,180 --> 00:22:14,050 are as follows-- 438 00:22:14,050 --> 00:22:17,290 so let's say the first element in the column 439 00:22:17,290 --> 00:22:19,600 is stored at address x. 440 00:22:19,600 --> 00:22:21,280 Then the second element in the column 441 00:22:21,280 --> 00:22:24,640 is going to be stored at address x plus 2 to the 15th, 442 00:22:24,640 --> 00:22:27,910 because each row has 2 to the 15th bytes, 443 00:22:27,910 --> 00:22:29,650 and we're skipping over an entire row 444 00:22:29,650 --> 00:22:34,490 here to get to the element in the next row of the submatrix. 445 00:22:34,490 --> 00:22:36,460 So we're going to add 2 to the 15th. 446 00:22:36,460 --> 00:22:38,020 And then to get the third element, 447 00:22:38,020 --> 00:22:40,660 we're going to add 2 times 2 to the 15th. 448 00:22:40,660 --> 00:22:43,420 And so on, until we get to the last element, 449 00:22:43,420 --> 00:22:48,490 which is x plus 31 times 2 to the 15th. 450 00:22:48,490 --> 00:22:50,350 So which fields of the address are 451 00:22:50,350 --> 00:22:54,850 changing as we go through one column of this submatrix? 452 00:23:05,586 --> 00:23:09,002 AUDIENCE: You're just adding multiple [INAUDIBLE] tag 453 00:23:09,002 --> 00:23:10,000 the [INAUDIBLE]. 454 00:23:10,000 --> 00:23:10,750 JULIAN SHUN: Yeah. 455 00:23:10,750 --> 00:23:13,490 So it's just going to be the tag that's changing. 456 00:23:13,490 --> 00:23:17,360 The set and the offset are going to stay the same, because we're 457 00:23:17,360 --> 00:23:22,190 just using the lower 13 bits to store the set and the offset. 458 00:23:22,190 --> 00:23:24,890 And therefore, when we increment by 2 to the 15th, 459 00:23:24,890 --> 00:23:28,920 we're not going to touch the set and the offset.
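One quick way to check this claim is to mask off the low 13 bits (7 set bits plus 6 offset bits) and observe that adding any multiple of 2 to the 15th leaves them unchanged. A minimal sketch, using the bit layout from the example:

```c
#include <assert.h>

/* In this example the set and offset fields together occupy the
 * low 13 bits of the address.  Adding multiples of 2^15 (the row
 * length in bytes) never changes those bits, so every element of
 * the column maps to the same cache set. */
static long set_and_offset(long addr) { return addr & ((1L << 13) - 1); }
```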
460 00:23:28,920 --> 00:23:32,060 So all of these addresses fall into the same set. 461 00:23:32,060 --> 00:23:35,640 And this is a problem, because our cache 462 00:23:35,640 --> 00:23:37,160 is only four-way associative. 463 00:23:37,160 --> 00:23:42,860 So we can only fit four cache lines in each set. 464 00:23:42,860 --> 00:23:45,860 And here, we're accessing 32 of these things. 465 00:23:45,860 --> 00:23:50,510 So by the time we get to the next column of A, 466 00:23:50,510 --> 00:23:53,280 all the things that we accessed in the current column of A 467 00:23:53,280 --> 00:23:56,360 are going to be evicted from cache already. 468 00:23:56,360 --> 00:23:58,970 And this is known as a conflict miss, 469 00:23:58,970 --> 00:24:01,850 because if you had a fully associative cache, 470 00:24:01,850 --> 00:24:04,730 this might not have happened, because you could actually 471 00:24:04,730 --> 00:24:09,940 use any location in the cache to store these cache blocks. 472 00:24:09,940 --> 00:24:13,720 So does anybody have any questions on why 473 00:24:13,720 --> 00:24:15,060 we get conflict misses here? 474 00:24:22,860 --> 00:24:27,110 So anybody have any ideas on how to fix this? 475 00:24:27,110 --> 00:24:29,300 So what can I do to make it so that I'm not 476 00:24:29,300 --> 00:24:32,990 incrementing by exactly 2 to the 15th every time? 477 00:24:39,696 --> 00:24:40,654 Yeah. 478 00:24:40,654 --> 00:24:43,050 AUDIENCE: So pad the matrix? 479 00:24:43,050 --> 00:24:44,020 JULIAN SHUN: Yeah. 480 00:24:44,020 --> 00:24:46,270 So one solution is to pad the matrix. 481 00:24:46,270 --> 00:24:49,060 You can add some constant amount of space 482 00:24:49,060 --> 00:24:50,920 to the end of the matrix. 483 00:24:50,920 --> 00:24:53,320 So each row is going to be longer than 2 484 00:24:53,320 --> 00:24:54,550 to the 15th bytes. 485 00:24:54,550 --> 00:24:57,400 So maybe you add some small constant like 17. 486 00:24:57,400 --> 00:25:00,130 So add 17 bytes to the end of each row.
487 00:25:00,130 --> 00:25:04,090 And now, when you access a column of this submatrix, 488 00:25:04,090 --> 00:25:07,000 you're not just incrementing by 2 to the 15th, 489 00:25:07,000 --> 00:25:10,570 you're also adding some small integer. 490 00:25:10,570 --> 00:25:14,535 And that's going to cause the set and the offset fields 491 00:25:14,535 --> 00:25:15,910 to change as well, and you're not 492 00:25:15,910 --> 00:25:18,640 going to get as many conflict misses. 493 00:25:18,640 --> 00:25:22,610 So that's one way to solve the problem. 494 00:25:22,610 --> 00:25:25,570 It turns out that if you're doing a matrix multiplication 495 00:25:25,570 --> 00:25:27,910 algorithm, that's a cubic work algorithm, 496 00:25:27,910 --> 00:25:31,630 and you can basically afford to copy the submatrix 497 00:25:31,630 --> 00:25:34,270 into a temporary 32 by 32 matrix, 498 00:25:34,270 --> 00:25:36,580 do all the operations on the temporary matrix, 499 00:25:36,580 --> 00:25:39,760 and then copy it back out to the original matrix. 500 00:25:39,760 --> 00:25:42,610 The copying only takes quadratic work 501 00:25:42,610 --> 00:25:45,160 to do across the whole algorithm. 502 00:25:45,160 --> 00:25:48,070 And since the whole algorithm takes cubic work, 503 00:25:48,070 --> 00:25:50,620 the quadratic work is a lower order term. 504 00:25:50,620 --> 00:25:54,790 So you can use temporary space to make sure that you 505 00:25:54,790 --> 00:25:56,050 don't get conflict misses. 506 00:25:58,560 --> 00:25:59,490 Any questions? 507 00:26:06,030 --> 00:26:09,340 So this was conflict misses. 508 00:26:09,340 --> 00:26:10,900 So conflict misses are important. 509 00:26:10,900 --> 00:26:13,180 But usually, we're going to be first concerned 510 00:26:13,180 --> 00:26:15,820 about getting good spatial and temporal locality, 511 00:26:15,820 --> 00:26:19,240 because those are usually the higher order 512 00:26:19,240 --> 00:26:21,070 factors in the performance of a program. 
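Before moving on, the padding idea above can be sanity-checked numerically. The following is a hypothetical sketch (functions are my own) that counts how many distinct cache sets a strided column walk touches, using the 7 set bits and 64-byte blocks from the running example:

```c
#include <assert.h>

/* Bits 6..12 of the address form the 7-bit set index in this example. */
static int set_index(long addr) { return (int)((addr >> 6) & 0x7f); }

/* Count the distinct sets touched when accessing n elements spaced
 * `stride` bytes apart, starting at byte address `base`. */
static int distinct_sets(long base, long stride, int n) {
    int seen[128] = {0}, count = 0;
    for (int i = 0; i < n; i++) {
        int s = set_index(base + (long)i * stride);
        if (!seen[s]) { seen[s] = 1; count++; }
    }
    return count;
}
```

With the unpadded stride of 2 to the 15th, all 32 column accesses land in a single set; padding each row by 17 bytes spreads them over several sets, which is enough for a four-way associative cache to keep the whole column resident.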
513 00:26:21,070 --> 00:26:24,250 And once we get good spatial and temporal locality 514 00:26:24,250 --> 00:26:25,840 in our program, we can then start 515 00:26:25,840 --> 00:26:28,720 worrying about conflict misses, for example, 516 00:26:28,720 --> 00:26:32,860 by using temporary space or padding our data 517 00:26:32,860 --> 00:26:35,650 by some small constants so that we don't 518 00:26:35,650 --> 00:26:37,210 have as many conflict misses. 519 00:26:41,120 --> 00:26:43,170 So now, I want to talk about a model 520 00:26:43,170 --> 00:26:45,270 that we can use to analyze the cache 521 00:26:45,270 --> 00:26:46,530 performance of algorithms. 522 00:26:46,530 --> 00:26:51,010 And this is called the ideal-cache model. 523 00:26:51,010 --> 00:26:57,030 So in this model, we have a two-level cache hierarchy. 524 00:26:57,030 --> 00:27:01,440 So we have the cache and then main memory. 525 00:27:01,440 --> 00:27:05,205 The cache is of size M, and the cache line size 526 00:27:05,205 --> 00:27:06,750 is B bytes. 527 00:27:06,750 --> 00:27:10,245 And therefore, we can fit M over B cache lines inside our cache. 528 00:27:13,020 --> 00:27:15,930 This model assumes that the cache is fully associative, 529 00:27:15,930 --> 00:27:18,920 so any cache block can go anywhere in the cache. 530 00:27:18,920 --> 00:27:23,070 And it also assumes an optimal omniscient replacement policy. 531 00:27:23,070 --> 00:27:25,140 So this means that when we want to evict a cache 532 00:27:25,140 --> 00:27:26,600 block from the cache, we're going 533 00:27:26,600 --> 00:27:28,410 to pick the thing to evict that gives us 534 00:27:28,410 --> 00:27:30,060 the best performance overall-- 535 00:27:30,060 --> 00:27:31,830 the one that gives us the lowest number of cache 536 00:27:31,830 --> 00:27:34,210 misses throughout our entire algorithm.
537 00:27:34,210 --> 00:27:36,960 So we're assuming that we know the sequence of memory requests 538 00:27:36,960 --> 00:27:38,858 throughout the entire algorithm. 539 00:27:38,858 --> 00:27:41,400 And that's why it's called the omniscient replacement 540 00:27:41,400 --> 00:27:41,900 policy. 541 00:27:45,370 --> 00:27:49,000 And if something is in cache, you can operate on it for free. 542 00:27:49,000 --> 00:27:51,040 And if something is in main memory, 543 00:27:51,040 --> 00:27:52,810 you have to bring it into cache and then 544 00:27:52,810 --> 00:27:54,070 you incur a cache miss. 545 00:27:56,990 --> 00:27:59,880 So there are two performance measures that we care about-- 546 00:27:59,880 --> 00:28:01,890 first, we care about the ordinary work, 547 00:28:01,890 --> 00:28:04,830 which is just the ordinary running time of a program. 548 00:28:04,830 --> 00:28:07,740 So this is the same as before when 549 00:28:07,740 --> 00:28:09,360 we were analyzing algorithms. 550 00:28:09,360 --> 00:28:11,160 It's just the total number of operations 551 00:28:11,160 --> 00:28:13,690 that the program does. 552 00:28:13,690 --> 00:28:15,420 And the number of cache misses is 553 00:28:15,420 --> 00:28:17,190 going to be the number of lines we 554 00:28:17,190 --> 00:28:21,893 have to transfer between the main memory and the cache. 555 00:28:21,893 --> 00:28:23,310 So the number of cache misses just 556 00:28:23,310 --> 00:28:24,930 counts the number of cache transfers, 557 00:28:24,930 --> 00:28:27,570 whereas the work counts all the operations that you 558 00:28:27,570 --> 00:28:29,227 have to do in the algorithm. 559 00:28:32,640 --> 00:28:35,490 So ideally, we would like to come up 560 00:28:35,490 --> 00:28:38,970 with algorithms that have a low number of cache misses 561 00:28:38,970 --> 00:28:42,540 without increasing the work from the traditional standard 562 00:28:42,540 --> 00:28:44,550 algorithm.
563 00:28:44,550 --> 00:28:47,060 Sometimes we can do that, sometimes we can't do that. 564 00:28:47,060 --> 00:28:49,470 And then there's a trade-off between the work 565 00:28:49,470 --> 00:28:51,210 and the number of cache misses. 566 00:28:51,210 --> 00:28:53,850 And it's a trade-off that you have 567 00:28:53,850 --> 00:28:56,910 to decide whether it's worthwhile as a performance 568 00:28:56,910 --> 00:28:57,960 engineer. 569 00:28:57,960 --> 00:28:59,790 Today, we're going to look at an algorithm 570 00:28:59,790 --> 00:29:01,915 where you can actually reduce the number of cache 571 00:29:01,915 --> 00:29:03,780 misses without increasing the work. 572 00:29:03,780 --> 00:29:06,090 So you basically get the best of both worlds. 573 00:29:08,880 --> 00:29:11,430 So any questions on this ideal cache model? 574 00:29:19,430 --> 00:29:23,810 So this model is just used for analyzing algorithms. 575 00:29:23,810 --> 00:29:27,530 You can't actually buy one of these caches at the store. 576 00:29:27,530 --> 00:29:31,760 So this is a very ideal cache, and they don't exist. 577 00:29:31,760 --> 00:29:35,000 But it turns out that this optimal omniscient replacement 578 00:29:35,000 --> 00:29:38,580 policy has nice theoretical properties. 579 00:29:38,580 --> 00:29:43,970 And this is a very important lemma that was proved in 1985. 580 00:29:43,970 --> 00:29:46,720 It's called the LRU lemma. 581 00:29:46,720 --> 00:29:48,770 It was proved by Sleator and Tarjan. 582 00:29:48,770 --> 00:29:51,950 And the lemma says, suppose that an algorithm incurs 583 00:29:51,950 --> 00:29:56,540 Q cache misses on an ideal cache of size M. Then, 584 00:29:56,540 --> 00:30:01,280 on a fully associative cache of size 2M that uses the LRU, 585 00:30:01,280 --> 00:30:04,760 or Least Recently Used, replacement policy, 586 00:30:04,760 --> 00:30:08,900 it incurs at most 2Q cache misses.
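To see the LRU policy itself in action, here is a small, hypothetical miss-counting sketch of a fully associative LRU cache (the function and constants are my own, for illustration; it demonstrates the replacement policy, not the Sleator-Tarjan proof):

```c
#include <assert.h>

#define MAXLINES 64

/* Count the misses a fully associative LRU cache incurs on a trace of
 * byte addresses.  The cache holds `nlines` lines of B = 64 bytes each.
 * A linear scan per access is fine for illustration. */
static long lru_misses(const long *trace, int n, int nlines) {
    const long B = 64;
    long line[MAXLINES];   /* block id cached in each line (-1 = empty) */
    long last[MAXLINES];   /* time of most recent use */
    long misses = 0;
    for (int i = 0; i < nlines; i++) { line[i] = -1; last[i] = -1; }
    for (int t = 0; t < n; t++) {
        long blk = trace[t] / B;
        int slot = -1;
        for (int i = 0; i < nlines; i++)
            if (line[i] == blk) { slot = i; break; }
        if (slot < 0) {             /* miss: evict the least recently used */
            misses++;
            slot = 0;
            for (int i = 1; i < nlines; i++)
                if (last[i] < last[slot]) slot = i;
            line[slot] = blk;
        }
        last[slot] = t;             /* refresh recency on hit or fill */
    }
    return misses;
}
```

The same trace can miss or hit depending on the cache size, which is exactly the kind of gap the lemma's factor-of-2 size increase absorbs.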
587 00:30:08,900 --> 00:30:12,980 So what this says is if I can bound the number of cache 588 00:30:12,980 --> 00:30:16,700 misses for an algorithm on the ideal cache, 589 00:30:16,700 --> 00:30:19,820 then if I take a fully associative cache that's twice 590 00:30:19,820 --> 00:30:23,220 the size and use the LRU replacement policy, 591 00:30:23,220 --> 00:30:25,280 which is a pretty practical policy, 592 00:30:25,280 --> 00:30:26,900 then the algorithm is going to incur, 593 00:30:26,900 --> 00:30:31,160 at most, twice the number of cache misses. 594 00:30:31,160 --> 00:30:33,890 And the implication of this lemma 595 00:30:33,890 --> 00:30:36,590 is that for asymptotic analyses, you 596 00:30:36,590 --> 00:30:40,040 can assume either the optimal replacement policy or the LRU 597 00:30:40,040 --> 00:30:41,930 replacement policy as convenient. 598 00:30:41,930 --> 00:30:46,010 Because the numbers of cache misses 599 00:30:46,010 --> 00:30:50,270 are just going to be within a constant factor of each other. 600 00:30:50,270 --> 00:30:52,610 So this is a very important lemma. 601 00:30:52,610 --> 00:30:54,650 It basically makes 602 00:30:54,650 --> 00:31:00,306 it much easier for us to analyze the cache misses of algorithms. 603 00:31:03,780 --> 00:31:06,240 And here's a software engineering principle 604 00:31:06,240 --> 00:31:08,770 that I want to point out. 605 00:31:08,770 --> 00:31:13,480 So first, when you're trying to get good performance, 606 00:31:13,480 --> 00:31:16,540 you should come up with a theoretically good algorithm 607 00:31:16,540 --> 00:31:20,670 that has good bounds on the work and the cache complexity. 608 00:31:20,670 --> 00:31:23,130 And then after you come up with an algorithm that's 609 00:31:23,130 --> 00:31:26,040 theoretically good, then you start engineering 610 00:31:26,040 --> 00:31:27,150 for detailed performance.
611 00:31:27,150 --> 00:31:30,630 You start worrying about the details, such as real-world 612 00:31:30,630 --> 00:31:34,770 caches not being fully associative, and, for example, 613 00:31:34,770 --> 00:31:37,080 loads and stores having different costs with respect 614 00:31:37,080 --> 00:31:39,090 to bandwidth and latency. 615 00:31:39,090 --> 00:31:41,340 But coming up with a theoretically good algorithm 616 00:31:41,340 --> 00:31:43,980 is the first order bit to getting good performance. 617 00:31:48,840 --> 00:31:49,812 Questions? 618 00:31:58,090 --> 00:32:00,550 So let's start analyzing the number of cache 619 00:32:00,550 --> 00:32:02,320 misses in a program. 620 00:32:02,320 --> 00:32:04,090 So here's a lemma. 621 00:32:04,090 --> 00:32:07,990 The lemma says, suppose that a program reads a set of r data 622 00:32:07,990 --> 00:32:13,480 segments, where the i-th segment consists of s sub i bytes. 623 00:32:13,480 --> 00:32:17,110 And suppose that the sum of the sizes of all the segments 624 00:32:17,110 --> 00:32:22,360 is equal to N. And we're going to assume that N is less than M 625 00:32:22,360 --> 00:32:23,120 over 3. 626 00:32:23,120 --> 00:32:26,260 So the sum of the sizes of the segments 627 00:32:26,260 --> 00:32:30,100 is less than the cache size divided by 3. 628 00:32:30,100 --> 00:32:32,320 We're also going to assume that N over r 629 00:32:32,320 --> 00:32:34,870 is greater than or equal to B. So recall 630 00:32:34,870 --> 00:32:38,650 that r is the number of data segments we have, 631 00:32:38,650 --> 00:32:41,090 and N is the total size of the segments. 632 00:32:41,090 --> 00:32:46,080 So what does N over r mean, semantically? 633 00:32:46,080 --> 00:32:46,580 Yes. 634 00:32:46,580 --> 00:32:47,950 AUDIENCE: Average [INAUDIBLE]. 635 00:32:47,950 --> 00:32:48,700 JULIAN SHUN: Yeah. 636 00:32:48,700 --> 00:32:53,390 So N over r is just the average size of a segment.
637 00:32:53,390 --> 00:32:56,390 And here we're saying that the average size of a segment 638 00:32:56,390 --> 00:33:01,790 is at least B-- so at least the size of a cache line. 639 00:33:01,790 --> 00:33:04,830 So if these two assumptions hold, then all of the segments 640 00:33:04,830 --> 00:33:07,590 are going to fit into cache, and the number of cache 641 00:33:07,590 --> 00:33:13,590 misses to read them all is, at most, 3 times N over B. 642 00:33:13,590 --> 00:33:20,490 So if you had just a single array of size N, 643 00:33:20,490 --> 00:33:21,990 then the number of cache misses you 644 00:33:21,990 --> 00:33:24,180 would need to read that array into cache 645 00:33:24,180 --> 00:33:25,920 is going to be N over B. And this 646 00:33:25,920 --> 00:33:29,280 is saying that, even if our data is divided 647 00:33:29,280 --> 00:33:32,040 into a bunch of segments, as long as the average length 648 00:33:32,040 --> 00:33:35,580 of the segments is large enough, then the number of cache misses 649 00:33:35,580 --> 00:33:41,550 is just a constant factor worse than reading a single array. 650 00:33:41,550 --> 00:33:44,160 So let's try to prove this cache miss lemma. 651 00:33:48,000 --> 00:33:50,220 So here's the proof. 652 00:33:50,220 --> 00:33:52,290 A single segment s sub i is going 653 00:33:52,290 --> 00:33:58,350 to incur at most s sub i over B plus 2 cache misses. 654 00:33:58,350 --> 00:34:01,800 So does anyone want to tell me where the s sub i over B plus 2 655 00:34:01,800 --> 00:34:02,370 comes from? 656 00:34:09,540 --> 00:34:13,170 So let's say this is a segment that we're analyzing, 657 00:34:13,170 --> 00:34:16,320 and this is how it's aligned in virtual memory. 658 00:34:21,900 --> 00:34:22,400 Yes? 659 00:34:22,400 --> 00:34:25,310 AUDIENCE: How many blocks it could overlap, worst case. 660 00:34:25,310 --> 00:34:26,060 JULIAN SHUN: Yeah.
661 00:34:26,060 --> 00:34:29,870 So s sub i over B plus 2 is the number of blocks that could 662 00:34:29,870 --> 00:34:32,610 overlap within the worst case. 663 00:34:32,610 --> 00:34:36,949 So you need s sub i over B cache misses just 664 00:34:36,949 --> 00:34:39,949 to load those s sub i bytes. 665 00:34:39,949 --> 00:34:43,400 But then the beginning and the end of that segment 666 00:34:43,400 --> 00:34:47,360 might not be perfectly aligned with a cache line boundary. 667 00:34:47,360 --> 00:34:49,670 And therefore, you could waste, at most, one block 668 00:34:49,670 --> 00:34:51,320 on each side of the segment. 669 00:34:51,320 --> 00:34:55,310 So that's where the plus 2 comes from. 670 00:34:55,310 --> 00:34:57,560 So to get the total number of cache 671 00:34:57,560 --> 00:35:03,170 misses, we just have to sum this quantity from i equals 1 to r. 672 00:35:03,170 --> 00:35:06,620 So if I sum s sub i over B from i equals 1 to r, 673 00:35:06,620 --> 00:35:08,810 I just get N over B, by definition. 674 00:35:08,810 --> 00:35:12,640 And then I sum 2 from i equals 1 to r. 675 00:35:12,640 --> 00:35:14,840 So that just gives me 2r. 676 00:35:14,840 --> 00:35:17,180 Now, I'm going to multiply the top and the bottom 677 00:35:17,180 --> 00:35:21,080 with the second term by B. So 2r B over B now. 678 00:35:21,080 --> 00:35:24,200 And then that's less than or equal to N over B 679 00:35:24,200 --> 00:35:29,730 plus 2N over B. So where did I get this inequality here? 680 00:35:29,730 --> 00:35:32,420 Why do I know that 2r B is less than or equal to 2N? 681 00:35:35,500 --> 00:35:36,000 Yes? 682 00:35:36,000 --> 00:35:38,760 AUDIENCE: You know that the N is greater than or equal to B r. 683 00:35:38,760 --> 00:35:38,940 JULIAN SHUN: Yeah. 684 00:35:38,940 --> 00:35:41,250 So you know that N is greater than or equal to B 685 00:35:41,250 --> 00:35:43,380 r by this assumption up here. 686 00:35:43,380 --> 00:35:46,830 So therefore, r B is less than or equal to N. 
687 00:35:46,830 --> 00:35:51,450 And then, N over B plus 2N over B just sums up to 3N over B. 688 00:35:51,450 --> 00:35:55,335 So in the worst case, we're going to incur 3N over B cache 689 00:35:55,335 --> 00:35:55,835 misses. 690 00:36:00,800 --> 00:36:03,340 So any questions on this cache miss lemma? 691 00:36:07,620 --> 00:36:11,520 So the important thing to remember here is that if you 692 00:36:11,520 --> 00:36:14,070 have a whole bunch of data segments and the average length 693 00:36:14,070 --> 00:36:15,780 of your segments is large enough-- 694 00:36:15,780 --> 00:36:18,540 bigger than a cache block size-- 695 00:36:18,540 --> 00:36:21,690 then you can access all of these segments just 696 00:36:21,690 --> 00:36:24,360 like a single array. 697 00:36:24,360 --> 00:36:25,980 It only increases the number of cache 698 00:36:25,980 --> 00:36:27,810 misses by a constant factor. 699 00:36:27,810 --> 00:36:29,892 And if you're doing an asymptotic analysis, 700 00:36:29,892 --> 00:36:30,850 then it doesn't matter. 701 00:36:30,850 --> 00:36:33,360 So we're going to be using this cache miss lemma later 702 00:36:33,360 --> 00:36:35,160 on when we analyze algorithms. 703 00:36:40,720 --> 00:36:44,200 So another assumption that we're going to need 704 00:36:44,200 --> 00:36:46,840 is called the tall cache assumption. 705 00:36:46,840 --> 00:36:49,450 And the tall cache assumption basically 706 00:36:49,450 --> 00:36:52,390 says that the cache is taller than it is wide. 707 00:36:52,390 --> 00:36:55,750 So it says that B squared is less than c M 708 00:36:55,750 --> 00:36:58,750 for some sufficiently small constant c less than 709 00:36:58,750 --> 00:37:02,050 or equal to 1. 710 00:37:02,050 --> 00:37:05,830 So in other words, it says that the number of cache lines 711 00:37:05,830 --> 00:37:13,660 M over B you have is going to be bigger than B. 712 00:37:13,660 --> 00:37:16,330 And this tall cache assumption is usually 713 00:37:16,330 --> 00:37:17,650 satisfied in practice.
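Going back to the cache miss lemma for a moment, the "s sub i over B plus 2" bound per segment is easy to check directly: a segment of s bytes starting at an arbitrary address touches at most that many B-byte blocks. A minimal sketch (the function name is my own):

```c
#include <assert.h>

/* Number of B-byte cache blocks the byte range [addr, addr + size)
 * touches: the block holding the first byte through the block
 * holding the last byte, inclusive. */
static long blocks_touched(long addr, long size, long B) {
    return (addr + size - 1) / B - addr / B + 1;
}
```

Perfect alignment gives exactly size over B blocks; any misalignment wastes at most one block at each end of the segment, which is where the plus 2 in the lemma comes from.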
714 00:37:17,650 --> 00:37:22,090 So here are the cache line sizes and the cache 715 00:37:22,090 --> 00:37:24,460 sizes on the machines that we're using. 716 00:37:24,460 --> 00:37:28,990 So the cache line size is 64 bytes, and the L1 cache size 717 00:37:28,990 --> 00:37:31,390 is 32 kilobytes. 718 00:37:31,390 --> 00:37:36,400 So 64 bytes squared, that's 2 to the 12th. 719 00:37:36,400 --> 00:37:39,420 And 32 kilobytes is 2 to the 15th bytes. 720 00:37:39,420 --> 00:37:41,510 So 2 to the 12th is less than 2 to the 15th, 721 00:37:41,510 --> 00:37:44,530 so it satisfies the tall cache assumption. 722 00:37:44,530 --> 00:37:46,540 And as we go up the memory hierarchy, 723 00:37:46,540 --> 00:37:49,990 the cache size increases, but the cache line length 724 00:37:49,990 --> 00:37:51,080 stays the same. 725 00:37:51,080 --> 00:37:53,230 So the caches become even taller 726 00:37:53,230 --> 00:37:57,160 as we move up the memory hierarchy. 727 00:37:57,160 --> 00:38:00,468 So let's see why this tall cache assumption is 728 00:38:00,468 --> 00:38:01,260 going to be useful. 729 00:38:04,550 --> 00:38:06,300 To see that, we're going to look at what's 730 00:38:06,300 --> 00:38:07,770 wrong with a short cache. 731 00:38:07,770 --> 00:38:11,580 So in a short cache, our lines are going to be very wide, 732 00:38:11,580 --> 00:38:14,190 and they're wider than the number of lines 733 00:38:14,190 --> 00:38:18,200 that we can have in our cache. 734 00:38:18,200 --> 00:38:19,950 And let's say we're working with an n 735 00:38:19,950 --> 00:38:24,120 by n submatrix stored in row-major order. 736 00:38:24,120 --> 00:38:27,810 If you have a short cache, then even if n squared 737 00:38:27,810 --> 00:38:29,700 is less than c M, meaning that you 738 00:38:29,700 --> 00:38:33,540 can fit all the bytes of the submatrix in cache, 739 00:38:33,540 --> 00:38:37,620 you might still not be able to fit it into a short cache.
740 00:38:37,620 --> 00:38:40,650 And this picture sort of illustrates this. 741 00:38:40,650 --> 00:38:43,050 So we have n rows here. 742 00:38:43,050 --> 00:38:46,290 But we can only fit M over B of the rows in the cache, 743 00:38:46,290 --> 00:38:48,960 because the cache lines are so long, 744 00:38:48,960 --> 00:38:51,045 and we're actually wasting a lot of space 745 00:38:51,045 --> 00:38:52,170 on each of the cache lines. 746 00:38:52,170 --> 00:38:54,570 We're only using a very small fraction of each cache line 747 00:38:54,570 --> 00:38:58,690 to store the row of this submatrix. 748 00:38:58,690 --> 00:39:00,960 If this were the entire matrix, then 749 00:39:00,960 --> 00:39:05,250 it would actually be OK, because consecutive rows 750 00:39:05,250 --> 00:39:08,850 are going to be placed together consecutively in memory. 751 00:39:08,850 --> 00:39:10,740 But if this is a submatrix, then we 752 00:39:10,740 --> 00:39:14,070 can't be guaranteed that the next row is going to be placed 753 00:39:14,070 --> 00:39:17,220 right after the current row. 754 00:39:17,220 --> 00:39:19,290 And oftentimes, we have to deal with submatrices 755 00:39:19,290 --> 00:39:22,110 when we're doing recursive matrix algorithms. 756 00:39:25,330 --> 00:39:27,760 So this is what's wrong with short caches. 757 00:39:27,760 --> 00:39:32,340 And that's why we want to assume the tall cache assumption. 758 00:39:32,340 --> 00:39:34,210 And we can assume that, because it's usually 759 00:39:34,210 --> 00:39:35,185 satisfied in practice. 760 00:39:37,945 --> 00:39:40,080 The TLB actually tends to be short. 761 00:39:40,080 --> 00:39:42,550 It only has a couple of entries, so it might not satisfy 762 00:39:42,550 --> 00:39:44,020 the tall cache assumption. 763 00:39:44,020 --> 00:39:50,060 But all of the other caches will satisfy this assumption. 764 00:39:50,060 --> 00:39:51,100 Any questions? 765 00:39:54,630 --> 00:39:56,797 OK.
766 00:39:56,797 --> 00:39:58,880 So here's another lemma that's going to be useful. 767 00:39:58,880 --> 00:40:03,220 This is called the submatrix caching lemma. 768 00:40:03,220 --> 00:40:06,310 So suppose that we have an n by n matrix A, 769 00:40:06,310 --> 00:40:08,650 and it's read into a tall cache that 770 00:40:08,650 --> 00:40:13,190 satisfies B squared less than c M for some constant c less than 771 00:40:13,190 --> 00:40:15,580 or equal to 1. 772 00:40:15,580 --> 00:40:19,840 And suppose that n squared is less than M over 3, 773 00:40:19,840 --> 00:40:24,280 but it's greater than or equal to c M. Then 774 00:40:24,280 --> 00:40:27,580 A is going to fit into cache, and the number of cache 775 00:40:27,580 --> 00:40:31,600 misses required to read all of A's elements into cache is, 776 00:40:31,600 --> 00:40:38,470 at most, 3n squared over B. 777 00:40:38,470 --> 00:40:42,900 So let's see why this is true. 778 00:40:42,900 --> 00:40:45,120 So we're going to let big N denote 779 00:40:45,120 --> 00:40:48,930 the total number of bytes that we need to access. 780 00:40:48,930 --> 00:40:50,940 So big N is going to be equal to n squared. 781 00:40:53,800 --> 00:40:56,550 And we're going to use the cache miss lemma, which 782 00:40:56,550 --> 00:40:59,160 says that if the average length of our segments 783 00:40:59,160 --> 00:41:02,310 is large enough, then we can read all of the segments 784 00:41:02,310 --> 00:41:05,770 in just like it were a single contiguous array. 785 00:41:05,770 --> 00:41:09,930 So the segments here are going to be the rows of the submatrix. 786 00:41:09,930 --> 00:41:13,230 So r, the number of segments, 787 00:41:13,230 --> 00:41:16,470 is going to be little n. 788 00:41:16,470 --> 00:41:18,040 And the segment length is also going 789 00:41:18,040 --> 00:41:21,660 to be little n, since we're working with a square submatrix 790 00:41:21,660 --> 00:41:24,090 here.
791 00:41:24,090 --> 00:41:30,120 And then we also have the cache block size B is less than 792 00:41:30,120 --> 00:41:32,310 or equal to n. 793 00:41:32,310 --> 00:41:36,090 And that's equal to big N over r. 794 00:41:36,090 --> 00:41:39,750 And where do we get this property that B is less than 795 00:41:39,750 --> 00:41:42,600 or equal to n? 796 00:41:42,600 --> 00:41:46,110 So I made some assumptions up here, 797 00:41:46,110 --> 00:41:50,070 where I can use to infer that B is less than or equal to n. 798 00:41:50,070 --> 00:41:53,150 Does anybody see where? 799 00:41:53,150 --> 00:41:53,922 Yeah. 800 00:41:53,922 --> 00:41:55,850 AUDIENCE: So B squared is less than c M, 801 00:41:55,850 --> 00:41:57,300 and c M is [INAUDIBLE] 802 00:41:57,300 --> 00:41:58,050 JULIAN SHUN: Yeah. 803 00:41:58,050 --> 00:42:00,435 So I know that B squared is less than c 804 00:42:00,435 --> 00:42:02,820 M. C M is less than or equal to n squared. 805 00:42:02,820 --> 00:42:05,250 So therefore, B squared is less than n squared, 806 00:42:05,250 --> 00:42:09,360 and B is less than n. 807 00:42:09,360 --> 00:42:15,060 So now, I also have that N is less than M 808 00:42:15,060 --> 00:42:18,450 over 3, just by assumption. 809 00:42:18,450 --> 00:42:20,810 And therefore, I can use the cache miss lemma. 810 00:42:20,810 --> 00:42:23,700 So the cache miss lemma tells me that I only 811 00:42:23,700 --> 00:42:26,610 need a total of 3n squared over B cache 812 00:42:26,610 --> 00:42:28,120 misses to read this whole thing in. 813 00:42:32,780 --> 00:42:35,150 Any questions on the submatrix caching lemma? 814 00:42:48,980 --> 00:42:53,198 So now, let's analyze matrix multiplication. 815 00:42:53,198 --> 00:42:55,490 How many of you have seen matrix multiplication before? 816 00:42:59,250 --> 00:43:00,130 So a couple of you. 
817 00:43:03,340 --> 00:43:07,150 So here's what the code looks like 818 00:43:07,150 --> 00:43:11,260 for the standard cubic work matrix multiplication 819 00:43:11,260 --> 00:43:12,980 algorithm. 820 00:43:12,980 --> 00:43:15,430 So we have two input matrices, A and B, 821 00:43:15,430 --> 00:43:18,610 and we're going to store the result in C. 822 00:43:18,610 --> 00:43:22,930 And the height and the width of our matrix is n. 823 00:43:22,930 --> 00:43:25,798 We're just going to deal with square matrices here, 824 00:43:25,798 --> 00:43:27,340 but what I'm going to talk about also 825 00:43:27,340 --> 00:43:30,770 extends to non-square matrices. 826 00:43:30,770 --> 00:43:33,450 And then we just have three loops here. 827 00:43:33,450 --> 00:43:37,600 We're going to loop through i from 0 to n minus 1, j from 0 828 00:43:37,600 --> 00:43:40,540 to n minus 1, and k from 0 to n minus 1. 829 00:43:40,540 --> 00:43:43,225 And then we're going to let C of i n plus j 830 00:43:43,225 --> 00:43:48,280 be incremented by A of i n plus k times B of k n plus j. 831 00:43:48,280 --> 00:43:53,200 So that's just the standard code for matrix multiply. 832 00:43:53,200 --> 00:43:57,105 So what's the work of this algorithm? 833 00:43:57,105 --> 00:44:02,140 It should be review for all of you. 834 00:44:02,140 --> 00:44:02,740 n cubed. 835 00:44:05,790 --> 00:44:08,850 So now, let's analyze the number of cache 836 00:44:08,850 --> 00:44:11,400 misses this algorithm is going to incur. 837 00:44:11,400 --> 00:44:13,680 And again, we're going to assume that the matrix is 838 00:44:13,680 --> 00:44:16,770 in row-major order, and we satisfy the tall cache 839 00:44:16,770 --> 00:44:17,760 assumption.
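Written out as actual C, the triple loop just described looks like this (row-major storage, with C accumulating the product):

```c
#include <assert.h>

/* Standard cubic-work matrix multiply from the lecture: C += A * B,
 * where A, B, and C are n x n matrices of doubles stored in
 * row-major order. */
static void matmul(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Note the access patterns the analysis below depends on: A is read along a row (unit stride), while B is read down a column (stride of n doubles).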
840 00:44:20,640 --> 00:44:23,100 We're also going to analyze the number of cache 841 00:44:23,100 --> 00:44:25,723 misses in matrix B, because it turns out 842 00:44:25,723 --> 00:44:27,390 that the number of cache misses incurred 843 00:44:27,390 --> 00:44:29,850 by matrix B is going to dominate the number of cache 844 00:44:29,850 --> 00:44:31,470 misses overall. 845 00:44:31,470 --> 00:44:33,720 And there are three cases we need to consider. 846 00:44:33,720 --> 00:44:37,110 The first case is when n is greater than c M 847 00:44:37,110 --> 00:44:39,570 over B for some constant c. 848 00:44:42,890 --> 00:44:44,900 And we're going to analyze matrix B, as I said. 849 00:44:44,900 --> 00:44:48,650 And we're also going to assume LRU, because we can. 850 00:44:48,650 --> 00:44:50,300 If you recall, the LRU lemma says 851 00:44:50,300 --> 00:44:52,390 that whatever we analyze using the LRU 852 00:44:52,390 --> 00:44:55,160 is just going to be within a constant factor of what we analyze 853 00:44:55,160 --> 00:44:56,420 using the ideal cache. 854 00:45:01,220 --> 00:45:07,460 So to do this matrix multiplication, 855 00:45:07,460 --> 00:45:10,940 I'm going to go through one row of A and one column of B 856 00:45:10,940 --> 00:45:12,740 and do the dot product there. 857 00:45:12,740 --> 00:45:17,460 This is what happens in the innermost loop. 858 00:45:17,460 --> 00:45:19,010 And how many cache misses am I going 859 00:45:19,010 --> 00:45:24,110 to incur when I go down one column of B here? 860 00:45:24,110 --> 00:45:29,120 So here, I have the case where n is greater than M over B. 861 00:45:29,120 --> 00:45:38,430 So I can't fit one block from each row into the cache. 862 00:45:38,430 --> 00:45:40,490 So how many cache misses do I have the first time 863 00:45:40,490 --> 00:45:41,840 I go down a column of B? 864 00:45:44,440 --> 00:45:45,990 So how many rows of B do I have? 865 00:45:48,820 --> 00:45:49,700 n.
866 00:45:49,700 --> 00:45:54,850 Yeah, and how many cache misses do I need for each row? 867 00:45:54,850 --> 00:45:55,350 One. 868 00:45:55,350 --> 00:45:58,590 So in total, I'm going to need n cache misses 869 00:45:58,590 --> 00:46:02,280 for the first column of B. 870 00:46:02,280 --> 00:46:04,020 What about the second column of B? 871 00:46:08,980 --> 00:46:12,090 So recall that I'm assuming the LRU replacement policy here. 872 00:46:12,090 --> 00:46:13,590 So when the cache is full, I'm going 873 00:46:13,590 --> 00:46:17,030 to evict the thing that was least recently used-- 874 00:46:17,030 --> 00:46:18,610 used the furthest in the past. 875 00:46:26,932 --> 00:46:28,140 Sorry, could you repeat that? 876 00:46:28,140 --> 00:46:29,080 AUDIENCE: [INAUDIBLE]. 877 00:46:29,080 --> 00:46:29,830 JULIAN SHUN: Yeah. 878 00:46:29,830 --> 00:46:30,997 So it's still going to be n. 879 00:46:30,997 --> 00:46:33,462 Why is that? 880 00:46:33,462 --> 00:46:38,350 AUDIENCE: Because there are [INAUDIBLE] integer. 881 00:46:38,350 --> 00:46:39,822 JULIAN SHUN: Yeah. 882 00:46:39,822 --> 00:46:41,280 It's still going to be n, because I 883 00:46:41,280 --> 00:46:45,030 can't fit one cache block from each row into my cache. 884 00:46:45,030 --> 00:46:48,630 And by the time I get back to the top of my matrix B, 885 00:46:48,630 --> 00:46:52,130 the top block has already been evicted from the cache, 886 00:46:52,130 --> 00:46:53,410 and I have to load it back in. 887 00:46:53,410 --> 00:46:56,070 And this is the same for every other block that I access. 888 00:46:56,070 --> 00:46:58,680 So I'm, again, going to need n cache misses 889 00:46:58,680 --> 00:47:01,200 for the second column of B. And this 890 00:47:01,200 --> 00:47:05,400 is going to be the same for all the columns of B. 891 00:47:05,400 --> 00:47:09,790 And then I have to do this again for the second row of A. 
892 00:47:09,790 --> 00:47:13,120 So in total, I'm going to need theta of n 893 00:47:13,120 --> 00:47:15,730 cubed number of cache misses. 894 00:47:15,730 --> 00:47:21,710 And this is one cache miss per entry that I access in B. 895 00:47:21,710 --> 00:47:25,420 And this is not very good, because the total work was also 896 00:47:25,420 --> 00:47:26,270 theta of n cubed. 897 00:47:26,270 --> 00:47:29,170 So I'm not gaining anything from having any locality 898 00:47:29,170 --> 00:47:32,900 in this algorithm here. 899 00:47:32,900 --> 00:47:36,440 So any questions on this analysis? 900 00:47:36,440 --> 00:47:39,410 So this is just case 1. 901 00:47:39,410 --> 00:47:41,580 Let's look at case 2. 902 00:47:41,580 --> 00:47:46,130 So in this case, n is less than c M over B. 903 00:47:46,130 --> 00:47:50,270 So I can fit one block from each row of B into cache. 904 00:47:50,270 --> 00:47:55,370 And then n is also greater than another constant, c prime times 905 00:47:55,370 --> 00:48:00,080 square root of M, so I can't fit the whole matrix into cache. 906 00:48:00,080 --> 00:48:02,600 And again, let's analyze the number of cache 907 00:48:02,600 --> 00:48:07,432 misses incurred by accessing B, assuming LRU. 908 00:48:07,432 --> 00:48:08,890 So how many cache misses am I going 909 00:48:08,890 --> 00:48:12,882 to incur for the first column of B? 910 00:48:12,882 --> 00:48:13,382 AUDIENCE: n. 911 00:48:13,382 --> 00:48:14,007 JULIAN SHUN: n. 912 00:48:14,007 --> 00:48:15,530 So that's the same as before. 913 00:48:15,530 --> 00:48:18,470 What about the second column of B? 914 00:48:18,470 --> 00:48:24,260 So by the time I get to the beginning of the matrix here, 915 00:48:24,260 --> 00:48:26,690 is the top block going to be in cache? 916 00:48:29,940 --> 00:48:33,330 So who thinks the block is still going to be in cache when 917 00:48:33,330 --> 00:48:35,410 I get back to the beginning? 918 00:48:35,410 --> 00:48:35,910 Yeah. 
919 00:48:35,910 --> 00:48:37,320 So a couple of people. 920 00:48:37,320 --> 00:48:39,000 Who thinks it's going to be out of cache? 921 00:48:42,550 --> 00:48:46,660 So it turns out it is going to be in cache, because I 922 00:48:46,660 --> 00:48:50,710 can fit one block for every row of B into my cache 923 00:48:50,710 --> 00:48:53,980 since I have n less than c M over B. 924 00:48:53,980 --> 00:48:58,668 So therefore, when I get to the beginning of the second column, 925 00:48:58,668 --> 00:49:01,210 that block is still going to be in cache, because I loaded it 926 00:49:01,210 --> 00:49:03,050 in when I was accessing the first column. 927 00:49:03,050 --> 00:49:04,800 So I'm not going to incur any cache misses 928 00:49:04,800 --> 00:49:07,450 for the second column. 929 00:49:07,450 --> 00:49:14,230 And, in general, if I can fit B columns or some constant 930 00:49:14,230 --> 00:49:19,540 times B columns into cache, then I 931 00:49:19,540 --> 00:49:23,830 can reduce the number of cache misses I have by a factor of B. 932 00:49:23,830 --> 00:49:26,365 So I only need to incur a cache miss the first time I 933 00:49:26,365 --> 00:49:29,190 access a block and not for all the subsequent accesses. 934 00:49:33,250 --> 00:49:37,740 And the same is true for the second row of A. 935 00:49:37,740 --> 00:49:40,500 And since I have n rows of A, I'm 936 00:49:40,500 --> 00:49:44,850 going to have n times theta of n squared over B cache misses. 937 00:49:44,850 --> 00:49:46,530 For each row of A, I'm going to incur 938 00:49:46,530 --> 00:49:49,260 n squared over B cache misses. 939 00:49:49,260 --> 00:49:52,750 So the overall number of cache misses is n cubed over B. 940 00:49:52,750 --> 00:49:55,110 And this is because inside matrix B 941 00:49:55,110 --> 00:49:56,850 I can exploit spatial locality. 942 00:49:56,850 --> 00:50:00,000 Once I load in a block, I can reuse it the next time 943 00:50:00,000 --> 00:50:02,280 I traverse down a column that's nearby. 
944 00:50:06,780 --> 00:50:08,400 Any questions on this analysis? 945 00:50:16,640 --> 00:50:18,530 So let's look at the third case. 946 00:50:18,530 --> 00:50:23,120 And here, n is less than c prime times square root of M. 947 00:50:23,120 --> 00:50:27,810 So this means that the entire matrix fits into cache. 948 00:50:27,810 --> 00:50:30,350 So let's analyze the number of cache misses for matrix B 949 00:50:30,350 --> 00:50:32,150 again, assuming LRU. 950 00:50:32,150 --> 00:50:34,100 So how many cache misses do I have now? 951 00:50:36,950 --> 00:50:39,300 So let's count the total number of cache 952 00:50:39,300 --> 00:50:50,750 misses I have for every time I go through a row of A. Yes. 953 00:50:50,750 --> 00:50:53,540 AUDIENCE: Is it just n for the first column? 954 00:50:56,030 --> 00:50:56,780 JULIAN SHUN: Yeah. 955 00:50:56,780 --> 00:51:00,110 So for the first column, it's going to be n. 956 00:51:00,110 --> 00:51:04,000 What about the second column? 957 00:51:04,000 --> 00:51:05,950 AUDIENCE: [INAUDIBLE] the second [INAUDIBLE].. 958 00:51:05,950 --> 00:51:07,420 JULIAN SHUN: Right. 959 00:51:07,420 --> 00:51:11,042 So basically, for the first row of A, 960 00:51:11,042 --> 00:51:13,000 the analysis is going to be the same as before. 961 00:51:13,000 --> 00:51:16,870 I need n squared over B cache misses to load B into the cache. 962 00:51:16,870 --> 00:51:18,750 What about the second row of A? 963 00:51:18,750 --> 00:51:20,500 How many cache misses am I going to incur? 964 00:51:27,262 --> 00:51:30,230 AUDIENCE: [INAUDIBLE]. 965 00:51:30,230 --> 00:51:30,980 JULIAN SHUN: Yeah. 966 00:51:30,980 --> 00:51:32,420 So for the second row of A, I'm not 967 00:51:32,420 --> 00:51:33,770 going to incur any cache misses. 968 00:51:33,770 --> 00:51:36,173 Because once I load B into cache, 969 00:51:36,173 --> 00:51:37,340 it's going to stay in cache. 
970 00:51:37,340 --> 00:51:39,470 Because the entire matrix can fit in cache, 971 00:51:39,470 --> 00:51:44,870 given that I assumed n is less than c prime times square root of M. 972 00:51:44,870 --> 00:51:46,340 So total number of cache misses I 973 00:51:46,340 --> 00:51:50,900 need for matrix B is theta of n squared over B since everything 974 00:51:50,900 --> 00:51:51,660 fits in cache. 975 00:51:51,660 --> 00:51:54,770 And I just apply the submatrix caching lemma from before. 976 00:51:58,100 --> 00:52:00,290 Overall, this is not a very good algorithm. 977 00:52:00,290 --> 00:52:02,360 Because as you recall, in case 1 I 978 00:52:02,360 --> 00:52:06,410 needed a cubic number of cache misses. 979 00:52:09,200 --> 00:52:12,980 What happens if I swap the order of the inner two loops? 980 00:52:12,980 --> 00:52:16,850 So recall that this was one of the optimizations in lecture 1, 981 00:52:16,850 --> 00:52:19,910 when Charles was talking about matrix multiplication 982 00:52:19,910 --> 00:52:22,250 and how to speed it up. 983 00:52:22,250 --> 00:52:26,450 So if I swapped the order of the two inner loops, 984 00:52:26,450 --> 00:52:31,190 then, for every iteration, what I'm doing 985 00:52:31,190 --> 00:52:35,450 is I'm actually going over a row of C and a row of B, 986 00:52:35,450 --> 00:52:40,520 and A stays fixed inside the innermost iteration. 987 00:52:40,520 --> 00:52:42,950 So now, when I analyze the number of cache 988 00:52:42,950 --> 00:52:45,920 misses of matrix B, assuming LRU, 989 00:52:45,920 --> 00:52:47,840 I'm going to benefit from spatial locality, 990 00:52:47,840 --> 00:52:49,970 since I'm going row by row and the matrix is 991 00:52:49,970 --> 00:52:53,030 stored in row-major order. 992 00:52:53,030 --> 00:52:55,700 So across all of the rows, I'm just 993 00:52:55,700 --> 00:53:00,380 going to require theta of n squared over B cache misses. 994 00:53:00,380 --> 00:53:04,142 And I have to do this n times for the outer loop. 
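The loop interchange being analyzed here, with j moved to the innermost position so that the inner loop scans along a row of B and a row of C, might be sketched as follows (same assumptions as the earlier sketch: my own function name, double elements, C zero-initialized by the caller; hoisting A's element into a local is an optional touch):

```c
#include <stddef.h>

// The same multiply with the two inner loops swapped (i, k, j order).
// The innermost loop now walks along a row of B and a row of C in
// row-major order, so consecutive accesses fall in the same cache block,
// while the element of A stays fixed.
void matmul_ikj(const double *A, const double *B, double *C, size_t n) {
  for (size_t i = 0; i < n; ++i)
    for (size_t k = 0; k < n; ++k) {
      double a = A[i * n + k];  // A is fixed inside the innermost loop
      for (size_t j = 0; j < n; ++j)
        C[i * n + j] += a * B[k * n + j];
    }
}
```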
995 00:53:04,142 --> 00:53:05,600 So in total, I'm going to get theta 996 00:53:05,600 --> 00:53:08,450 of n cubed over B cache misses. 997 00:53:08,450 --> 00:53:10,700 So if you swap the order of the inner two loops, 998 00:53:10,700 --> 00:53:13,697 this significantly improves the locality of your algorithm, 999 00:53:13,697 --> 00:53:15,530 and you can benefit from spatial locality. 1000 00:53:15,530 --> 00:53:18,500 That's why we saw a significant performance improvement 1001 00:53:18,500 --> 00:53:23,750 in the first lecture when we swapped the order of the loops. 1002 00:53:23,750 --> 00:53:24,560 Any questions? 1003 00:53:31,280 --> 00:53:34,210 So does anybody think we can do better than n 1004 00:53:34,210 --> 00:53:36,140 cubed over B cache misses? 1005 00:53:36,140 --> 00:53:39,440 Or do you think that it's the best you can do? 1006 00:53:39,440 --> 00:53:41,510 So how many people think you can do better? 1007 00:53:46,010 --> 00:53:46,510 Yeah. 1008 00:53:46,510 --> 00:53:49,480 And how many people think this is the best you can do? 1009 00:53:53,780 --> 00:53:55,970 And how many people don't care? 1010 00:54:00,660 --> 00:54:03,960 So it turns out you can do better. 1011 00:54:03,960 --> 00:54:06,060 And we're going to do better by using 1012 00:54:06,060 --> 00:54:09,870 an optimization called tiling. 1013 00:54:09,870 --> 00:54:12,210 So how this is going to work is instead 1014 00:54:12,210 --> 00:54:13,910 of just having three for loops, I'm 1015 00:54:13,910 --> 00:54:15,570 going to have six for loops. 1016 00:54:15,570 --> 00:54:19,220 And I'm going to loop over tiles. 1017 00:54:19,220 --> 00:54:22,070 So I've got a loop over s by s submatrices. 1018 00:54:22,070 --> 00:54:24,110 And within each submatrix, I'm going 1019 00:54:24,110 --> 00:54:27,050 to do all of the computation I need for that submatrix 1020 00:54:27,050 --> 00:54:30,270 before moving on to the next submatrix. 
1021 00:54:30,270 --> 00:54:32,840 So the three innermost loops are going 1022 00:54:32,840 --> 00:54:36,710 to loop inside a submatrix, and the three outermost loops 1023 00:54:36,710 --> 00:54:39,110 are going to loop within the larger matrix, 1024 00:54:39,110 --> 00:54:42,710 one submatrix at a time. 1025 00:54:42,710 --> 00:54:45,330 So let's analyze the work of this algorithm. 1026 00:54:48,150 --> 00:54:54,380 So the work that we need to do for a submatrix of size 1027 00:54:54,380 --> 00:54:58,610 s by s is going to be s cubed, since that's just a bound 1028 00:54:58,610 --> 00:55:00,950 for matrix multiplication. 1029 00:55:00,950 --> 00:55:04,160 And then the number of times I have to operate on submatrices 1030 00:55:04,160 --> 00:55:07,590 is going to be n over s cubed. 1031 00:55:07,590 --> 00:55:11,210 And you can see this if you just consider each submatrix to be 1032 00:55:11,210 --> 00:55:13,820 a single element, and then using the same cubic work 1033 00:55:13,820 --> 00:55:18,740 analysis on the smaller matrix. 1034 00:55:18,740 --> 00:55:22,710 So the work is n over s cubed times s cubed, 1035 00:55:22,710 --> 00:55:24,620 which is equal to theta of n cubed. 1036 00:55:24,620 --> 00:55:27,800 So the work of this tiled matrix multiply is the same 1037 00:55:27,800 --> 00:55:31,820 as the version that didn't do tiling. 1038 00:55:31,820 --> 00:55:34,040 And now, let's analyze the number of cache misses. 1039 00:55:38,390 --> 00:55:42,020 So we're going to tune s so that the submatrices just 1040 00:55:42,020 --> 00:55:43,100 fit into cache. 1041 00:55:43,100 --> 00:55:46,250 So we're going to set s to be equal to theta 1042 00:55:46,250 --> 00:55:53,990 of square root of M. We actually need to make this 1/3 1043 00:55:53,990 --> 00:55:55,760 square root of M, because we need to fit 1044 00:55:55,760 --> 00:55:57,800 three submatrices in the cache. 
1045 00:55:57,800 --> 00:55:59,780 But it's going to be some constant times square 1046 00:55:59,780 --> 00:56:02,780 root of M. 1047 00:56:02,780 --> 00:56:07,190 The submatrix caching lemma implies that for each submatrix 1048 00:56:07,190 --> 00:56:10,550 we're going to need s squared over B misses to load it in. 1049 00:56:10,550 --> 00:56:13,850 And once we load it into cache, it fits entirely into cache, 1050 00:56:13,850 --> 00:56:16,430 so we can do all of our computations within cache 1051 00:56:16,430 --> 00:56:18,230 and not incur any more cache misses. 1052 00:56:21,530 --> 00:56:23,540 So therefore, the total number of cache 1053 00:56:23,540 --> 00:56:26,027 misses we're going to incur is 1054 00:56:26,027 --> 00:56:27,860 going to be the number of subproblems, which 1055 00:56:27,860 --> 00:56:30,860 is n over s cubed, times the number of cache 1056 00:56:30,860 --> 00:56:35,210 misses per subproblem, which is s squared over B. 1057 00:56:35,210 --> 00:56:37,530 And if you multiply this out, you're 1058 00:56:37,530 --> 00:56:43,070 going to get n cubed over B times square root of M. 1059 00:56:43,070 --> 00:56:45,500 So here, I plugged in square root of M for s. 1060 00:56:48,440 --> 00:56:49,940 And this is a pretty cool result, 1061 00:56:49,940 --> 00:56:51,950 because it says that you can actually do better 1062 00:56:51,950 --> 00:56:53,540 than the n cubed over B bound. 1063 00:56:53,540 --> 00:56:58,520 You can improve this bound by a factor of square root of M. 1064 00:56:58,520 --> 00:57:00,950 And in practice, square root of M 1065 00:57:00,950 --> 00:57:04,230 is actually not insignificant. 1066 00:57:04,230 --> 00:57:07,250 So, for example, if you're looking at the last level 1067 00:57:07,250 --> 00:57:10,290 cache, the size of that is on the order of megabytes. 1068 00:57:10,290 --> 00:57:12,080 So square root of M is going to be 1069 00:57:12,080 --> 00:57:13,340 on the order of thousands. 
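The six-loop tiled version being analyzed might be sketched as follows. This is a sketch under the simplifying assumption that the tile size s divides n evenly; s is the parameter you would tune so that three s-by-s tiles fit in cache at once.

```c
#include <stddef.h>

// Tiled matrix multiply: the outer three loops walk over s x s
// submatrices (tiles), and the inner three loops do all of the
// work for one triple of tiles before moving on. Assumes s
// divides n, and that the caller zero-initializes C. s would be
// tuned so that three s x s tiles fit in the cache together.
void matmul_tiled(const double *A, const double *B, double *C,
                  size_t n, size_t s) {
  for (size_t ih = 0; ih < n; ih += s)
    for (size_t jh = 0; jh < n; jh += s)
      for (size_t kh = 0; kh < n; kh += s)
        for (size_t i = ih; i < ih + s; ++i)
          for (size_t j = jh; j < jh + s; ++j)
            for (size_t k = kh; k < kh + s; ++k)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```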
1070 00:57:13,340 --> 00:57:15,710 So this significantly improves the performance 1071 00:57:15,710 --> 00:57:18,110 of the matrix multiplication code 1072 00:57:18,110 --> 00:57:20,750 if you tune s so that the submatrices just 1073 00:57:20,750 --> 00:57:23,540 fit in the cache. 1074 00:57:23,540 --> 00:57:26,180 It turns out that this bound is optimal. 1075 00:57:26,180 --> 00:57:30,590 It was shown in 1981. 1076 00:57:30,590 --> 00:57:32,760 So for cubic work matrix multiplication, 1077 00:57:32,760 --> 00:57:33,950 this is the best you can do. 1078 00:57:33,950 --> 00:57:35,960 If you use another matrix multiply algorithm, 1079 00:57:35,960 --> 00:57:40,380 like Strassen's algorithm, you can do better. 1080 00:57:40,380 --> 00:57:42,230 So I want you to remember this bound. 1081 00:57:42,230 --> 00:57:44,910 It's a very important bound to know. 1082 00:57:44,910 --> 00:57:48,050 It says that for matrix multiplication 1083 00:57:48,050 --> 00:57:51,440 you can benefit both from spatial locality as well 1084 00:57:51,440 --> 00:57:53,160 as temporal locality. 1085 00:57:53,160 --> 00:57:58,820 So I get spatial locality from the B term in the denominator. 1086 00:57:58,820 --> 00:58:00,500 And then the square root of M term 1087 00:58:00,500 --> 00:58:02,510 comes from temporal locality, since I'm 1088 00:58:02,510 --> 00:58:04,730 doing all of the work inside a submatrix 1089 00:58:04,730 --> 00:58:07,310 before I evict that submatrix from cache. 1090 00:58:10,190 --> 00:58:13,250 Any questions on this analysis? 1091 00:58:13,250 --> 00:58:15,640 So what's one issue with this algorithm here? 1092 00:58:19,920 --> 00:58:20,697 Yes. 1093 00:58:20,697 --> 00:58:23,030 AUDIENCE: It's not portable, like different architecture 1094 00:58:23,030 --> 00:58:24,120 [INAUDIBLE]. 1095 00:58:24,120 --> 00:58:24,870 JULIAN SHUN: Yeah. 1096 00:58:24,870 --> 00:58:27,930 So the problem here is I have to tune s 1097 00:58:27,930 --> 00:58:30,910 for my particular machine. 
1098 00:58:30,910 --> 00:58:32,670 And I call this a voodoo parameter. 1099 00:58:32,670 --> 00:58:36,420 It's sort of like a magic number I put into my program 1100 00:58:36,420 --> 00:58:39,900 so that it fits in the cache on the particular machine I'm 1101 00:58:39,900 --> 00:58:40,920 running on. 1102 00:58:40,920 --> 00:58:42,630 And this makes the code not portable, 1103 00:58:42,630 --> 00:58:46,200 because if I try to run this code on another machine, 1104 00:58:46,200 --> 00:58:49,480 the cache sizes might be different there, 1105 00:58:49,480 --> 00:58:51,450 and then I won't get the same performance 1106 00:58:51,450 --> 00:58:53,130 as I did on my machine. 1107 00:58:55,710 --> 00:58:57,840 And this is also an issue even if you're running it 1108 00:58:57,840 --> 00:58:59,423 on the same machine, because you might 1109 00:58:59,423 --> 00:59:01,620 have other programs running at the same time 1110 00:59:01,620 --> 00:59:03,330 and using up part of the cache. 1111 00:59:03,330 --> 00:59:06,540 So you don't actually know how much of the cache 1112 00:59:06,540 --> 00:59:10,020 your program actually gets to use in a multiprogramming 1113 00:59:10,020 --> 00:59:11,036 environment. 1114 00:59:14,610 --> 00:59:17,280 And then this was also just for one level of cache. 1115 00:59:17,280 --> 00:59:20,550 If we want to optimize for two levels of caches, 1116 00:59:20,550 --> 00:59:23,910 we're going to have two voodoo parameters, s and t. 1117 00:59:23,910 --> 00:59:27,370 We're going to have submatrices and sub-submatrices. 1118 00:59:27,370 --> 00:59:29,970 And then we have to tune both of these parameters 1119 00:59:29,970 --> 00:59:32,310 to get the best performance on our machine. 1120 00:59:32,310 --> 00:59:34,410 And multi-dimensional tuning optimization 1121 00:59:34,410 --> 00:59:36,790 can't be done simply with binary search. 
1122 00:59:36,790 --> 00:59:38,790 So if you're just tuning for one level of cache, 1123 00:59:38,790 --> 00:59:41,220 you can do a binary search on the parameter s, 1124 00:59:41,220 --> 00:59:43,470 but here you can't do binary search. 1125 00:59:43,470 --> 00:59:47,910 So it's much more expensive to optimize here. 1126 00:59:47,910 --> 00:59:51,180 And the code becomes a little bit messier. 1127 00:59:51,180 --> 00:59:55,580 You have nine for loops instead of six. 1128 00:59:55,580 --> 00:59:59,330 And how many levels of caches do we have on the machines 1129 00:59:59,330 --> 01:00:00,870 that we're using today? 1130 01:00:00,870 --> 01:00:01,630 AUDIENCE: Three. 1131 01:00:01,630 --> 01:00:02,810 JULIAN SHUN: Three. 1132 01:00:02,810 --> 01:00:06,920 So for a three-level cache, you have three voodoo parameters. 1133 01:00:06,920 --> 01:00:08,510 You have 12 nested for loops. 1134 01:00:08,510 --> 01:00:11,480 This code becomes very ugly. 1135 01:00:11,480 --> 01:00:13,310 And you have to tune these parameters 1136 01:00:13,310 --> 01:00:15,300 for your particular machine. 1137 01:00:15,300 --> 01:00:17,870 And this makes the code not very portable, 1138 01:00:17,870 --> 01:00:19,970 as one student pointed out. 1139 01:00:19,970 --> 01:00:21,650 And in a multiprogramming environment, 1140 01:00:21,650 --> 01:00:23,990 you don't actually know the effective cache size 1141 01:00:23,990 --> 01:00:25,490 that your program has access to. 1142 01:00:25,490 --> 01:00:28,073 Because other jobs are running at the same time, and therefore 1143 01:00:28,073 --> 01:00:30,948 it's very easy to mistune the parameters. 1144 01:00:30,948 --> 01:00:31,740 Was there a question? 1145 01:00:31,740 --> 01:00:33,130 No? 1146 01:00:33,130 --> 01:00:35,310 So any questions? 1147 01:00:35,310 --> 01:00:35,810 Yeah. 1148 01:00:35,810 --> 01:00:37,563 AUDIENCE: Is there a way to programmatically get 1149 01:00:37,563 --> 01:00:38,850 the size of the cache? 
1150 01:00:38,850 --> 01:00:40,120 [INAUDIBLE] 1151 01:00:40,120 --> 01:00:40,870 JULIAN SHUN: Yeah. 1152 01:00:40,870 --> 01:00:43,610 So you can auto-tune your program 1153 01:00:43,610 --> 01:00:47,090 so that it's optimized for the cache sizes 1154 01:00:47,090 --> 01:00:48,283 of your particular machine. 1155 01:00:48,283 --> 01:00:49,658 AUDIENCE: [INAUDIBLE] instruction 1156 01:00:49,658 --> 01:00:52,640 to get the size of the cache [INAUDIBLE].. 1157 01:00:52,640 --> 01:00:56,473 JULIAN SHUN: Instruction to get the size of your cache. 1158 01:00:56,473 --> 01:00:57,390 I'm not actually sure. 1159 01:00:57,390 --> 01:00:57,890 Do you know? 1160 01:00:57,890 --> 01:00:59,172 AUDIENCE: [INAUDIBLE] in-- 1161 01:00:59,172 --> 01:01:00,534 AUDIENCE: [INAUDIBLE]. 1162 01:01:00,534 --> 01:01:02,410 AUDIENCE: Yeah, in the proc-- 1163 01:01:07,595 --> 01:01:09,400 JULIAN SHUN: Yeah, proc cpuinfo. 1164 01:01:09,400 --> 01:01:10,980 AUDIENCE: Yeah. proc cpuinfo or something like that. 1165 01:01:10,980 --> 01:01:11,730 JULIAN SHUN: Yeah. 1166 01:01:11,730 --> 01:01:14,260 So you can probably get that as well. 1167 01:01:14,260 --> 01:01:16,367 AUDIENCE: And I think if you google, 1168 01:01:16,367 --> 01:01:17,950 I think you'll find it pretty quickly. 1169 01:01:17,950 --> 01:01:18,300 JULIAN SHUN: Yeah. 1170 01:01:18,300 --> 01:01:18,925 AUDIENCE: Yeah. 1171 01:01:23,400 --> 01:01:25,710 JULIAN SHUN: But even if you do that, and you're 1172 01:01:25,710 --> 01:01:27,960 running this program when other jobs are running, 1173 01:01:27,960 --> 01:01:30,570 you don't actually know how much cache your program has access 1174 01:01:30,570 --> 01:01:30,780 to. 1175 01:01:30,780 --> 01:01:31,280 Yes? 1176 01:01:31,280 --> 01:01:34,140 AUDIENCE: Is cache architecture and stuff like that 1177 01:01:34,140 --> 01:01:37,110 optimized around matrix problems? 1178 01:01:37,110 --> 01:01:38,355 JULIAN SHUN: No. 1179 01:01:38,355 --> 01:01:41,370 They're actually general purpose. 
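As a concrete follow-up to the question above: on Linux with glibc, sysconf can report the cache sizes directly. These _SC_LEVEL* names are a glibc extension rather than standard POSIX, so treat this as a sketch for that environment; reading /proc/cpuinfo or running lscpu, as mentioned in the discussion, are alternatives.

```c
#include <unistd.h>

// Query the cache size in bytes for a given level using glibc's
// sysconf extensions. Returns -1 when the value is unavailable or
// the level is unknown. Not portable beyond Linux/glibc.
long cache_size_bytes(int level) {
  switch (level) {
    case 1: return sysconf(_SC_LEVEL1_DCACHE_SIZE);
    case 2: return sysconf(_SC_LEVEL2_CACHE_SIZE);
    case 3: return sysconf(_SC_LEVEL3_CACHE_SIZE);
    default: return -1;
  }
}
```

Note that, as pointed out in the lecture, even a correctly queried size only bounds what your program can use; other jobs sharing the machine shrink the effective cache.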
1180 01:01:41,370 --> 01:01:43,320 Today, we're just looking at matrix multiply, 1181 01:01:43,320 --> 01:01:46,290 but on Thursday's lecture we'll actually 1182 01:01:46,290 --> 01:01:47,880 be looking at many other problems 1183 01:01:47,880 --> 01:01:50,848 and how to optimize them for the cache hierarchy. 1184 01:01:56,180 --> 01:01:57,312 Other questions? 1185 01:02:01,790 --> 01:02:06,500 So this was a good algorithm in terms of cache performance, 1186 01:02:06,500 --> 01:02:07,935 but it wasn't very portable. 1187 01:02:07,935 --> 01:02:09,310 So let's see if we can do better. 1188 01:02:09,310 --> 01:02:12,050 Let's see if we can come up with a simpler design 1189 01:02:12,050 --> 01:02:15,390 where we still get pretty good cache performance. 1190 01:02:15,390 --> 01:02:19,250 So we're going to turn to divide and conquer. 1191 01:02:19,250 --> 01:02:21,770 We're going to look at the recursive matrix multiplication 1192 01:02:21,770 --> 01:02:24,750 algorithm that we saw before. 1193 01:02:24,750 --> 01:02:26,750 Again, we're going to deal with square matrices, 1194 01:02:26,750 --> 01:02:30,330 but the results generalize to non-square matrices. 1195 01:02:30,330 --> 01:02:33,800 So how this works is we're going to split 1196 01:02:33,800 --> 01:02:37,340 our [INAUDIBLE] matrices into four submatrices or four 1197 01:02:37,340 --> 01:02:38,990 quadrants. 1198 01:02:38,990 --> 01:02:41,220 And then for each quadrant of the output matrix, 1199 01:02:41,220 --> 01:02:45,110 it's just going to be the sum of two matrix multiplies on n 1200 01:02:45,110 --> 01:02:46,700 over 2 by n over 2 matrices. 1201 01:02:46,700 --> 01:02:51,260 So C 1 1 is going to be A 1 1 times B 1 1, 1202 01:02:51,260 --> 01:02:54,530 plus A 1 2 times B 2 1. 1203 01:02:54,530 --> 01:02:56,900 And then we're going to do this recursively. 
1204 01:02:56,900 --> 01:03:00,140 So at every level of recursion we're 1205 01:03:00,140 --> 01:03:04,070 going to get eight multiply-adds of n over 2 1206 01:03:04,070 --> 01:03:07,580 by n over 2 matrices. 1207 01:03:07,580 --> 01:03:10,440 Here's what the recursive code looks like. 1208 01:03:10,440 --> 01:03:14,660 You can see that we have eight recursive calls here. 1209 01:03:14,660 --> 01:03:17,060 The base case here is of size 1. 1210 01:03:17,060 --> 01:03:19,760 In practice, you want to coarsen the base case to overcome 1211 01:03:19,760 --> 01:03:20,930 function call overheads. 1212 01:03:23,690 --> 01:03:27,480 Let's also look at what these values here correspond to. 1213 01:03:27,480 --> 01:03:31,890 So I've color coded these so that they correspond 1214 01:03:31,890 --> 01:03:33,570 to particular elements in the submatrix 1215 01:03:33,570 --> 01:03:36,330 that I'm looking at on the right. 1216 01:03:36,330 --> 01:03:39,060 So these values here correspond to the index 1217 01:03:39,060 --> 01:03:41,700 of the first element in each of my quadrants. 1218 01:03:41,700 --> 01:03:43,920 So the first element in my first quadrant 1219 01:03:43,920 --> 01:03:47,250 is just going to have an offset of 0. 1220 01:03:47,250 --> 01:03:50,370 And then the first element of my second quadrant, 1221 01:03:50,370 --> 01:03:51,870 that's going to be on the same row 1222 01:03:51,870 --> 01:03:54,120 as the first element in my first quadrant. 1223 01:03:54,120 --> 01:04:02,790 So I just need to add the width of my quadrant, which 1224 01:04:02,790 --> 01:04:04,410 is n over 2. 1225 01:04:04,410 --> 01:04:09,480 And then to get the first element in quadrant 2 1, 1226 01:04:09,480 --> 01:04:12,850 I'm going to jump over n over 2 rows. 1227 01:04:12,850 --> 01:04:16,140 And each row has length row size, 1228 01:04:16,140 --> 01:04:18,930 so it's just going to be n over 2 times row size. 
1229 01:04:18,930 --> 01:04:23,400 And then to get the first element in quadrant 2 2, 1230 01:04:23,400 --> 01:04:27,810 it's just the first element in quadrant 2 1 plus n over 2. 1231 01:04:27,810 --> 01:04:30,450 So that's n over 2 times the quantity row size plus 1. 1232 01:04:34,540 --> 01:04:38,390 So let's analyze the work of this algorithm. 1233 01:04:38,390 --> 01:04:41,930 So what's the recurrence for this algorithm-- 1234 01:04:41,930 --> 01:04:44,750 for the work of this algorithm? 1235 01:04:44,750 --> 01:04:46,300 So how many subproblems do we have? 1236 01:04:46,300 --> 01:04:47,078 AUDIENCE: Eight. 1237 01:04:47,078 --> 01:04:47,870 JULIAN SHUN: Eight. 1238 01:04:47,870 --> 01:04:53,840 And what's the size of each subproblem? n over 2. 1239 01:04:53,840 --> 01:04:57,800 And how much work are we doing to set up the recursive calls? 1240 01:05:00,887 --> 01:05:03,250 A constant amount of work. 1241 01:05:03,250 --> 01:05:06,580 So the recurrence is W of n is equal to 8 W 1242 01:05:06,580 --> 01:05:09,280 of n over 2 plus theta of 1. 1243 01:05:09,280 --> 01:05:12,560 And what does that solve to? 1244 01:05:12,560 --> 01:05:13,440 n cubed. 1245 01:05:13,440 --> 01:05:16,500 So it's one of the three cases of the master theorem. 1246 01:05:20,850 --> 01:05:24,360 We're actually going to analyze this in more detail 1247 01:05:24,360 --> 01:05:25,920 by drawing out the recursion tree. 1248 01:05:25,920 --> 01:05:29,190 And this is going to give us more intuition about why 1249 01:05:29,190 --> 01:05:32,540 the master theorem is true. 
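Putting the quadrant offsets just described together, the recursive multiply might be sketched as follows. This is a sketch with my own function and variable names: n is the side of the current submatrix, row_size is the row length of the full matrix, n is assumed to be a power of 2, and the size-1 base case would be coarsened in practice, as noted above.

```c
#include <stddef.h>

// Recursive (divide-and-conquer) matrix multiply, C += A * B.
// Each n x n submatrix splits into four quadrants; quadrant 1 1
// starts at offset 0, quadrant 1 2 at n/2, quadrant 2 1 at
// (n/2)*row_size, and quadrant 2 2 at (n/2)*(row_size + 1).
// Each output quadrant is the sum of two recursive multiplies.
void matmul_rec(const double *A, const double *B, double *C,
                size_t n, size_t row_size) {
  if (n == 1) {               // base case; coarsen in practice
    C[0] += A[0] * B[0];
    return;
  }
  size_t h = n / 2;           // assumes n is a power of 2
  size_t q21 = h * row_size;  // offset of quadrant 2 1
  matmul_rec(A,           B,           C,           h, row_size); // C11 += A11*B11
  matmul_rec(A + h,       B + q21,     C,           h, row_size); // C11 += A12*B21
  matmul_rec(A,           B + h,       C + h,       h, row_size); // C12 += A11*B12
  matmul_rec(A + h,       B + q21 + h, C + h,       h, row_size); // C12 += A12*B22
  matmul_rec(A + q21,     B,           C + q21,     h, row_size); // C21 += A21*B11
  matmul_rec(A + q21 + h, B + q21,     C + q21,     h, row_size); // C21 += A22*B21
  matmul_rec(A + q21,     B + h,       C + q21 + h, h, row_size); // C22 += A21*B12
  matmul_rec(A + q21 + h, B + q21 + h, C + q21 + h, h, row_size); // C22 += A22*B22
}
```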
1256 01:05:45,950 --> 01:05:47,570 Here, I'm just labeling this with one. 1257 01:05:47,570 --> 01:05:48,820 So I'm ignoring the constants. 1258 01:05:48,820 --> 01:05:52,670 But it's not going to matter for asymptotic analysis. 1259 01:05:52,670 --> 01:05:54,560 And then I'm going to branch again 1260 01:05:54,560 --> 01:05:58,250 into eight subproblems of size n over 4. 1261 01:05:58,250 --> 01:06:01,790 And eventually, I'm going to get down to the leaves. 1262 01:06:01,790 --> 01:06:06,342 And how many levels do I have until I get to the leaves? 1263 01:06:11,510 --> 01:06:12,010 Yes? 1264 01:06:12,010 --> 01:06:12,750 AUDIENCE: Log n. 1265 01:06:12,750 --> 01:06:13,500 JULIAN SHUN: Yeah. 1266 01:06:13,500 --> 01:06:17,790 So log n-- what's the base of the log? 1267 01:06:17,790 --> 01:06:18,290 Yeah. 1268 01:06:18,290 --> 01:06:21,000 So it's log base 2 of n, because I'm dividing my problem 1269 01:06:21,000 --> 01:06:22,470 size by 2 every time. 1270 01:06:24,942 --> 01:06:26,400 And therefore, the number of leaves 1271 01:06:26,400 --> 01:06:28,950 I have is going to be 8 to the log base 2 of n, 1272 01:06:28,950 --> 01:06:31,500 because I'm branching it eight ways every time. 1273 01:06:31,500 --> 01:06:35,400 8 to the log base 2 of n is the same as n to the log base 1274 01:06:35,400 --> 01:06:37,230 2 of 8, which is n cubed. 1275 01:06:40,660 --> 01:06:44,740 The amount of work I'm doing at the top level is constant. 1276 01:06:44,740 --> 01:06:47,530 So I'm just going to say 1 here. 1277 01:06:47,530 --> 01:06:52,450 At the next level, it's eight times, then 64. 1278 01:06:52,450 --> 01:06:54,210 And then when I get to the leaves, 1279 01:06:54,210 --> 01:06:55,900 it's going to be theta of n cubed, 1280 01:06:55,900 --> 01:06:58,330 since I have m cubed leaves, and they're all 1281 01:06:58,330 --> 01:07:01,090 doing constant work. 
1282 01:07:01,090 --> 01:07:04,060 And the work is geometrically increasing as I go down 1283 01:07:04,060 --> 01:07:05,020 the recursion tree. 1284 01:07:05,020 --> 01:07:07,780 So the overall work is just dominated by the work 1285 01:07:07,780 --> 01:07:09,850 I need to do at the leaves. 1286 01:07:09,850 --> 01:07:13,780 So the overall work is just going to be theta of n cubed. 1287 01:07:13,780 --> 01:07:15,430 And this is the same as the looping 1288 01:07:15,430 --> 01:07:18,100 versions of matrix multiply-- 1289 01:07:18,100 --> 01:07:20,410 they're all cubic work. 1290 01:07:20,410 --> 01:07:22,990 Now, let's analyze the number of cache misses of this divide 1291 01:07:22,990 --> 01:07:26,260 and conquer algorithm. 1292 01:07:26,260 --> 01:07:29,540 So now, my recurrence is going to be different. 1293 01:07:29,540 --> 01:07:34,400 My base case now is when the submatrix fits in the cache-- 1294 01:07:34,400 --> 01:07:38,200 so when n squared is less than c M. And when that's true, 1295 01:07:38,200 --> 01:07:40,690 I just need to load that submatrix into cache, 1296 01:07:40,690 --> 01:07:43,300 and then I don't incur any more cache misses. 1297 01:07:43,300 --> 01:07:45,390 So I need theta of n squared over B cache 1298 01:07:45,390 --> 01:07:49,840 misses when n squared is less than c M for some sufficiently 1299 01:07:49,840 --> 01:07:52,360 small constant c, less than or equal to 1. 1300 01:07:52,360 --> 01:07:56,680 And then, otherwise, I recurse into 8 subproblems of size n 1301 01:07:56,680 --> 01:07:57,460 over 2. 1302 01:07:57,460 --> 01:07:59,290 And then I add theta of 1, because I'm 1303 01:07:59,290 --> 01:08:03,740 doing a constant amount of work to set up the recursive calls. 1304 01:08:03,740 --> 01:08:06,700 And I get this theta of n squared over B term 1305 01:08:06,700 --> 01:08:08,935 from the submatrix caching lemma. 
1306 01:08:08,935 --> 01:08:12,430 It says I can just load the entire matrix into cache 1307 01:08:12,430 --> 01:08:15,020 with this many cache misses. 1308 01:08:15,020 --> 01:08:18,359 So the difference between the cache analysis here 1309 01:08:18,359 --> 01:08:20,859 and the work analysis before is that I have a different base 1310 01:08:20,859 --> 01:08:22,510 case. 1311 01:08:22,510 --> 01:08:24,460 And I think in all of the algorithms 1312 01:08:24,460 --> 01:08:26,979 that you've seen before, the base case was always 1313 01:08:26,979 --> 01:08:27,970 of a constant size. 1314 01:08:27,970 --> 01:08:29,800 But here, we're working with a base case 1315 01:08:29,800 --> 01:08:31,350 that's not of a constant size. 1316 01:08:34,359 --> 01:08:36,790 So let's try to analyze this using the recursion tree 1317 01:08:36,790 --> 01:08:38,390 approach. 1318 01:08:38,390 --> 01:08:42,260 So at the top level, I have a problem of size n 1319 01:08:42,260 --> 01:08:44,649 that I'm going to branch into eight problems of size n 1320 01:08:44,649 --> 01:08:45,160 over 2. 1321 01:08:45,160 --> 01:08:48,170 And then I'm also going to incur a constant number of cache 1322 01:08:48,170 --> 01:08:48,670 misses. 1323 01:08:48,670 --> 01:08:51,580 I'm just going to say 1 here. 1324 01:08:51,580 --> 01:08:54,850 Then I'm going to branch again. 1325 01:08:54,850 --> 01:08:58,210 And then, eventually, I'm going to get to the base case 1326 01:08:58,210 --> 01:09:01,840 where n squared is less than c M. 1327 01:09:01,840 --> 01:09:05,649 And when n squared is less than c M, then the number of cache 1328 01:09:05,649 --> 01:09:07,300 misses that I'm going to incur is going 1329 01:09:07,300 --> 01:09:12,460 to be theta of c M over B. So I can just plug in c M here 1330 01:09:12,460 --> 01:09:15,790 for n squared. 1331 01:09:15,790 --> 01:09:17,830 And the number of levels of recursion 1332 01:09:17,830 --> 01:09:22,340 I have in this recursion tree is no longer just log base 2 of n.
1333 01:09:22,340 --> 01:09:27,370 I'm going to have log base 2 of n minus log base 2 1334 01:09:27,370 --> 01:09:31,149 of square root of c M number of levels, which 1335 01:09:31,149 --> 01:09:33,850 is the same as log base 2 of n minus 1/2 times 1336 01:09:33,850 --> 01:09:40,390 log base 2 of c M. And then, the number of leaves I get 1337 01:09:40,390 --> 01:09:44,710 is going to be 8 to this number of levels here. 1338 01:09:44,710 --> 01:09:50,680 So it's 8 to log base 2 of n minus 1/2 of log base 2 of c M. 1339 01:09:50,680 --> 01:09:56,400 And this is equal to theta of n cubed over M to the 3/2. 1340 01:09:56,400 --> 01:10:00,580 So the n cubed comes from the 8 to the log base 2 of n term. 1341 01:10:00,580 --> 01:10:07,450 And then if I do 8 to the negative 1/2 of log base 2 1342 01:10:07,450 --> 01:10:12,520 of c M, that's just going to give me M to the 3/2 1343 01:10:12,520 --> 01:10:13,480 in the denominator. 1344 01:10:16,210 --> 01:10:19,160 So any questions on how I computed the number of levels 1345 01:10:19,160 --> 01:10:20,627 of this recursion tree here? 1346 01:10:29,400 --> 01:10:32,110 So I'm basically dividing my problem size by 2 1347 01:10:32,110 --> 01:10:35,410 until I get to a problem size that fits in the cache. 1348 01:10:35,410 --> 01:10:40,180 So that means n is less than square root of c M. 1349 01:10:40,180 --> 01:10:42,310 So therefore, I can subtract that many levels 1350 01:10:42,310 --> 01:10:43,556 from my recursion tree. 1351 01:10:46,248 --> 01:10:47,790 And then to get the number of leaves, 1352 01:10:47,790 --> 01:10:49,320 since I'm branching eight ways, I 1353 01:10:49,320 --> 01:10:52,630 just do 8 to the power of the number of levels I have. 1354 01:10:52,630 --> 01:10:54,713 And then that gives me the total number of leaves. 1355 01:10:58,580 --> 01:11:00,320 So now, let's analyze the number of cache 1356 01:11:00,320 --> 01:11:03,440 misses I need at each level of this recursion tree.
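The level and leaf counts above can be sanity-checked numerically. Here's a small C sketch using made-up concrete values, not numbers from the lecture: n = 1024 and cM = 64, so the square root of cM is 8, the tree has log2(1024) - (1/2)log2(64) = 10 - 3 = 7 levels, and the leaf count 8^7 should match n cubed over (cM) to the 3/2.

```c
#include <assert.h>
#include <stdint.h>

/* Helpers for checking the recursion-tree arithmetic with exact
 * integer math (both n and cM are assumed powers of 2 here). */
static uint64_t ipow(uint64_t base, unsigned exp) {
    uint64_t r = 1;
    while (exp--) r *= base;   /* repeated multiplication */
    return r;
}

static unsigned ilog2(uint64_t x) {  /* x assumed a power of 2 */
    unsigned l = 0;
    while (x >>= 1) l++;
    return l;
}
```

With n = 1024 and cM = 64, the number of levels is ilog2(1024) - ilog2(64)/2 = 7, and ipow(8, 7) equals ipow(1024, 3) / 512, since (cM) to the 3/2 is 8 cubed = 512.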
1357 01:11:03,440 --> 01:11:05,630 At the top level, I have a constant number 1358 01:11:05,630 --> 01:11:06,710 of cache misses-- 1359 01:11:06,710 --> 01:11:08,240 let's just say 1. 1360 01:11:08,240 --> 01:11:12,530 At the next level, I have 8, then 64. 1361 01:11:12,530 --> 01:11:14,540 And then at the leaves, I'm going 1362 01:11:14,540 --> 01:11:18,050 to have theta of n cubed over B times square root of M cache 1363 01:11:18,050 --> 01:11:18,960 misses. 1364 01:11:18,960 --> 01:11:21,620 And I got this quantity just by multiplying 1365 01:11:21,620 --> 01:11:23,660 the number of leaves by the number 1366 01:11:23,660 --> 01:11:25,040 of cache misses per leaf. 1367 01:11:25,040 --> 01:11:28,730 So the number of leaves is n cubed over M to the 3/2. 1368 01:11:28,730 --> 01:11:32,150 The cache misses per leaf is theta of c M over B. 1369 01:11:32,150 --> 01:11:35,640 So I lose one factor of M in the denominator. 1370 01:11:35,640 --> 01:11:37,940 I'm left with the square root of M at the bottom. 1371 01:11:37,940 --> 01:11:41,450 And then I also divide by the block size B. 1372 01:11:41,450 --> 01:11:45,110 So overall, I get n cubed over B times square root of M cache 1373 01:11:45,110 --> 01:11:46,070 misses. 1374 01:11:46,070 --> 01:11:48,440 And again, this is a geometric series. 1375 01:11:48,440 --> 01:11:50,690 And the number of cache misses at the leaves 1376 01:11:50,690 --> 01:11:53,372 dominates all of the other levels. 1377 01:11:53,372 --> 01:11:54,830 So the total number of cache misses 1378 01:11:54,830 --> 01:11:57,980 I have is going to be theta of n cubed 1379 01:11:57,980 --> 01:12:00,896 over B times square root of M. 1380 01:12:00,896 --> 01:12:04,630 And notice that I'm getting the same number of cache 1381 01:12:04,630 --> 01:12:07,330 misses as I did with the tiling version of the code. 1382 01:12:07,330 --> 01:12:09,710 But here, I don't actually have to tune my code 1383 01:12:09,710 --> 01:12:12,510 for the particular cache size.
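As a concrete illustration, here is a minimal serial sketch of the divide-and-conquer multiply being analyzed. This is not the course's actual code: it assumes row-major n-by-n matrices with n a power of 2, accumulates into C, and uses a single-element base case, which a practical version would coarsen for performance (coarsening only changes constants, not the asymptotic cache bound).

```c
#include <assert.h>

/* Divide-and-conquer matrix multiply sketch: recurses into the 8
 * subproblems of size n/2 analyzed above. (ci, cj), (ai, aj), and
 * (bi, bj) are the top-left corners of the current submatrices of
 * C, A, and B inside the full n-by-n row-major arrays. */
static void matmul_rec(double *C, const double *A, const double *B,
                       int n,                       /* full dimension */
                       int ci, int cj, int ai, int aj, int bi, int bj,
                       int size) {                  /* subproblem size */
    if (size == 1) {
        C[ci * n + cj] += A[ai * n + aj] * B[bi * n + bj];
        return;
    }
    int h = size / 2;
    for (int i = 0; i < 2; i++)      /* the 8 subproblems of size h: */
        for (int j = 0; j < 2; j++)  /* C_ij += A_ik * B_kj          */
            for (int k = 0; k < 2; k++)
                matmul_rec(C, A, B, n,
                           ci + i * h, cj + j * h,
                           ai + i * h, aj + k * h,
                           bi + k * h, bj + j * h, h);
}
```

Note that nothing in the recursion consults a cache size, which is exactly why the same code adapts to whatever M and B a machine happens to have.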
1384 01:12:12,510 --> 01:12:14,958 So what cache sizes does this code work for? 1385 01:12:22,130 --> 01:12:24,481 So is this code going to work on your machine? 1386 01:12:27,920 --> 01:12:30,700 Is it going to get good cache performance? 1387 01:12:30,700 --> 01:12:33,340 So this code is going to work for all cache sizes, 1388 01:12:33,340 --> 01:12:38,370 because I didn't tune it for any particular cache size. 1389 01:12:38,370 --> 01:12:42,250 And this is what's known as a cache-oblivious algorithm. 1390 01:12:42,250 --> 01:12:44,300 It doesn't have any voodoo tuning parameters, 1391 01:12:44,300 --> 01:12:47,030 it has no explicit knowledge of the caches, 1392 01:12:47,030 --> 01:12:49,540 and it's essentially passively auto-tuning itself 1393 01:12:49,540 --> 01:12:53,710 for the particular cache size of your machine. 1394 01:12:53,710 --> 01:12:56,620 It can also work for multi-level caches 1395 01:12:56,620 --> 01:12:59,470 automatically, because I never specified what level of cache 1396 01:12:59,470 --> 01:13:00,940 I'm analyzing this for. 1397 01:13:00,940 --> 01:13:03,170 I can analyze it for any level of cache, 1398 01:13:03,170 --> 01:13:06,330 and it's still going to give me good cache complexity. 1399 01:13:06,330 --> 01:13:08,680 And this is also good in multiprogramming environments, 1400 01:13:08,680 --> 01:13:10,490 where you might have other jobs running 1401 01:13:10,490 --> 01:13:12,410 and you don't know your effective cache size. 1402 01:13:12,410 --> 01:13:14,660 This is just going to passively auto-tune for whatever 1403 01:13:14,660 --> 01:13:15,700 cache size is available. 1404 01:13:18,780 --> 01:13:21,620 It turns out that the best cache-oblivious codes to date 1405 01:13:21,620 --> 01:13:24,150 work on arbitrary rectangular matrices. 1406 01:13:24,150 --> 01:13:26,480 I just talked about square matrices, 1407 01:13:26,480 --> 01:13:29,000 but the best codes work on rectangular matrices. 
1408 01:13:29,000 --> 01:13:30,440 And they perform binary splitting 1409 01:13:30,440 --> 01:13:32,000 instead of eight-way splitting. 1410 01:13:32,000 --> 01:13:37,130 And you split on the largest of i, j, and k. 1411 01:13:37,130 --> 01:13:39,590 So this is what the best cache-oblivious matrix 1412 01:13:39,590 --> 01:13:41,060 multiplication algorithm does. 1413 01:13:44,970 --> 01:13:46,101 Any questions? 1414 01:13:50,940 --> 01:13:54,440 So I only talked about the serial setting so far. 1415 01:13:54,440 --> 01:13:56,090 I was assuming that these algorithms 1416 01:13:56,090 --> 01:13:58,190 ran on just a single thread. 1417 01:13:58,190 --> 01:14:02,674 What happens if I go to multiple processors? 1418 01:14:02,674 --> 01:14:05,340 It turns out that the results do generalize 1419 01:14:05,340 --> 01:14:08,380 to a parallel context. 1420 01:14:08,380 --> 01:14:10,770 So this is the recursive parallel matrix multiply 1421 01:14:10,770 --> 01:14:13,710 code that we saw before. 1422 01:14:13,710 --> 01:14:17,040 And notice that we're executing four subcalls in parallel, 1423 01:14:17,040 --> 01:14:19,620 doing a sync, and then doing four more 1424 01:14:19,620 --> 01:14:20,385 subcalls in parallel. 1425 01:14:23,310 --> 01:14:25,920 So let's try to analyze the number of cache 1426 01:14:25,920 --> 01:14:27,540 misses in this parallel code. 1427 01:14:27,540 --> 01:14:30,210 And to do that, we're going to use this theorem, which 1428 01:14:30,210 --> 01:14:32,910 says that let Q sub p be the number of cache 1429 01:14:32,910 --> 01:14:34,980 misses in a deterministic Cilk computation 1430 01:14:34,980 --> 01:14:39,000 when run on P processors, each with a private cache of size M. 1431 01:14:39,000 --> 01:14:41,610 And let S sub p be the number of successful steals 1432 01:14:41,610 --> 01:14:43,830 during the computation.
1433 01:14:43,830 --> 01:14:46,800 In the ideal cache model, the number of cache 1434 01:14:46,800 --> 01:14:50,970 misses we're going to have is Q sub p equal to Q sub 1 1435 01:14:50,970 --> 01:14:55,830 plus big O of number of steals times M over B. 1436 01:14:55,830 --> 01:14:59,520 So the number of cache misses in the parallel context is 1437 01:14:59,520 --> 01:15:02,730 equal to the number of cache misses when you run it serially 1438 01:15:02,730 --> 01:15:05,970 plus this term here, which is the number of steals 1439 01:15:05,970 --> 01:15:09,670 times M over B. 1440 01:15:09,670 --> 01:15:13,650 And the proof for this goes as follows-- so recall, 1441 01:15:13,650 --> 01:15:16,200 in the Cilk runtime system, we can 1442 01:15:16,200 --> 01:15:18,900 have workers steal tasks from other workers 1443 01:15:18,900 --> 01:15:20,580 when they don't have work to do. 1444 01:15:20,580 --> 01:15:23,520 And after a worker steals a task from another worker, 1445 01:15:23,520 --> 01:15:26,700 its cache becomes completely cold in the worst case, 1446 01:15:26,700 --> 01:15:29,790 because it wasn't actually working on that subproblem 1447 01:15:29,790 --> 01:15:31,080 before. 1448 01:15:31,080 --> 01:15:33,750 But after M over B cold cache misses, 1449 01:15:33,750 --> 01:15:36,630 its cache is going to become identical to what it would 1450 01:15:36,630 --> 01:15:38,500 be in the serial execution. 1451 01:15:38,500 --> 01:15:40,590 So we just need to pay M over B cache 1452 01:15:40,590 --> 01:15:44,130 misses to make it so that the cache looks the same as 1453 01:15:44,130 --> 01:15:47,010 if it were executing serially. 1454 01:15:47,010 --> 01:15:48,630 And the same is true when a worker 1455 01:15:48,630 --> 01:15:52,380 resumes a stolen subcomputation after a Cilk sync.
1456 01:15:52,380 --> 01:15:55,230 And the number of times that these two situations can happen 1457 01:15:55,230 --> 01:15:57,795 is 2 times S sub p-- 1458 01:15:57,795 --> 01:16:00,270 2 times the number of steals. 1459 01:16:00,270 --> 01:16:03,780 And each time, we have to pay M over B cache misses. 1460 01:16:03,780 --> 01:16:06,870 And this is where this additive term comes from-- order 1461 01:16:06,870 --> 01:16:13,260 S sub p times M over B. 1462 01:16:13,260 --> 01:16:16,920 We also know that the number of steals in a Cilk program 1463 01:16:16,920 --> 01:16:21,770 is upper-bounded by P times T infinity 1464 01:16:21,770 --> 01:16:24,150 in expectation, where P is the number of processors 1465 01:16:24,150 --> 01:16:27,390 and T infinity is the span of your computation. 1466 01:16:27,390 --> 01:16:30,060 So if you can minimize the span of your computation, 1467 01:16:30,060 --> 01:16:34,170 then this also gives you good cache bounds. 1468 01:16:34,170 --> 01:16:37,140 So the moral of the story here is that minimizing 1469 01:16:37,140 --> 01:16:41,010 the number of cache misses in the serial elision 1470 01:16:41,010 --> 01:16:44,370 essentially minimizes them in the parallel execution 1471 01:16:44,370 --> 01:16:46,080 for a low-span algorithm. 1472 01:16:48,690 --> 01:16:51,660 So in this recursive matrix multiplication algorithm, 1473 01:16:51,660 --> 01:16:55,910 the span is as follows-- 1474 01:16:55,910 --> 01:16:58,920 so T infinity of n is 2 times T infinity of n over 2 1475 01:16:58,920 --> 01:17:01,260 plus theta of 1. 1476 01:17:01,260 --> 01:17:02,670 Since we're doing a sync here, we 1477 01:17:02,670 --> 01:17:06,960 have to pay the critical path length of two subcalls. 1478 01:17:06,960 --> 01:17:09,180 This solves to theta of n. 1479 01:17:09,180 --> 01:17:12,150 And applying the previous lemma, this gives us 1480 01:17:12,150 --> 01:17:17,190 a cache miss bound of theta of n cubed over B square root of M.
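The four-spawn, sync, four-spawn structure described above can be sketched as follows. This is not the course's actual code: it is written in plain serial C (the serial elision), with comments marking where the cilk_spawn and cilk_sync keywords would go; the quadrant layout and the name mm_dac are illustrative.

```c
#include <assert.h>

/* Recursive matrix multiply on row-major matrices with leading
 * dimension ld, n a power of 2, accumulating into C. Each level
 * splits the matrices into four n/2-by-n/2 quadrants. */
static void mm_dac(double *C, const double *A, const double *B,
                   int n, int ld) {
    if (n == 1) { *C += *A * *B; return; }
    int h = n / 2;
    /* quadrant (r, c) of a matrix M: */
    #define X(M, r, c) ((M) + (r) * h * ld + (c) * h)
    /* First four subcalls -- each would be a cilk_spawn: */
    mm_dac(X(C,0,0), X(A,0,0), X(B,0,0), h, ld);
    mm_dac(X(C,0,1), X(A,0,0), X(B,0,1), h, ld);
    mm_dac(X(C,1,0), X(A,1,0), X(B,0,0), h, ld);
    mm_dac(X(C,1,1), X(A,1,0), X(B,0,1), h, ld);
    /* cilk_sync would go here, since the next four subcalls
     * write the same quadrants of C again: */
    mm_dac(X(C,0,0), X(A,0,1), X(B,1,0), h, ld);
    mm_dac(X(C,0,1), X(A,0,1), X(B,1,1), h, ld);
    mm_dac(X(C,1,0), X(A,1,1), X(B,1,0), h, ld);
    mm_dac(X(C,1,1), X(A,1,1), X(B,1,1), h, ld);
    #undef X
}
```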
1481 01:17:17,190 --> 01:17:20,550 This cache miss bound is just the same as in the serial execution. 1482 01:17:20,550 --> 01:17:24,150 And then this additive term is going to be order P times n-- 1483 01:17:24,150 --> 01:17:29,570 that's the span-- times M over B. 1484 01:17:29,570 --> 01:17:35,510 So that was a parallel algorithm for matrix multiply. 1485 01:17:35,510 --> 01:17:39,320 And we saw that we can also get good cache bounds there. 1486 01:17:39,320 --> 01:17:41,430 So here's a summary of what we talked about today. 1487 01:17:41,430 --> 01:17:45,950 We talked about associativity in caches, different ways 1488 01:17:45,950 --> 01:17:47,790 you can design a cache. 1489 01:17:47,790 --> 01:17:49,520 We talked about the ideal cache model 1490 01:17:49,520 --> 01:17:52,940 that's useful for analyzing algorithms. 1491 01:17:52,940 --> 01:17:55,910 We talked about cache-aware algorithms 1492 01:17:55,910 --> 01:17:58,110 that have explicit knowledge of the cache. 1493 01:17:58,110 --> 01:18:01,850 And the example we used was tiled matrix multiply. 1494 01:18:01,850 --> 01:18:03,980 Then we came up with a much simpler algorithm 1495 01:18:03,980 --> 01:18:09,290 that was cache-oblivious using divide and conquer. 1496 01:18:09,290 --> 01:18:11,510 And then in Thursday's lecture, we'll 1497 01:18:11,510 --> 01:18:14,730 actually see much more on cache-oblivious algorithm 1498 01:18:14,730 --> 01:18:15,230 design. 1499 01:18:15,230 --> 01:18:16,897 And then you'll also have an opportunity 1500 01:18:16,897 --> 01:18:20,150 to analyze the cache efficiency of some algorithms 1501 01:18:20,150 --> 01:18:22,690 in the next homework.