1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:22,001 --> 00:00:24,910 PROFESSOR: Hey, everybody. 9 00:00:24,910 --> 00:00:28,090 It's my pleasure once again to welcome 10 00:00:28,090 --> 00:00:35,590 TB Schardl, who is the author of your Tapir compiler, 11 00:00:35,590 --> 00:00:41,072 to talk about the Cilk runtime system. 12 00:00:41,072 --> 00:00:42,280 TAO SCHARDL: Thanks, Charles. 13 00:00:42,280 --> 00:00:46,110 Can anyone hear me in the back, seem good? 14 00:00:46,110 --> 00:00:48,253 OK. 15 00:00:48,253 --> 00:00:49,420 Thanks for the introduction. 16 00:00:49,420 --> 00:00:52,448 Today I'll be talking about the Cilk runtime system. 17 00:00:52,448 --> 00:00:53,740 This is pretty exciting for me. 18 00:00:53,740 --> 00:00:56,080 This is a lecture that's not about compilers. 19 00:00:56,080 --> 00:01:00,410 I get to talk about something a little different for once. 20 00:01:00,410 --> 00:01:01,840 It should be a fun lecture. 21 00:01:01,840 --> 00:01:03,790 Recently, as I understand it, you've 22 00:01:03,790 --> 00:01:07,180 been looking at storage allocation, 23 00:01:07,180 --> 00:01:10,870 both in the serial case as well as the parallel case. 24 00:01:10,870 --> 00:01:15,520 And you've already done Cilk programming for a while, 25 00:01:15,520 --> 00:01:17,200 at this point. 26 00:01:17,200 --> 00:01:19,270 This lecture, honestly, is a bit of a non 27 00:01:19,270 --> 00:01:25,600 sequitur in terms of the overall flow of the course.
28 00:01:25,600 --> 00:01:27,460 And it's also an advanced topic. 29 00:01:27,460 --> 00:01:30,400 The Cilk runtime system is a pretty complicated piece 30 00:01:30,400 --> 00:01:31,310 of software. 31 00:01:31,310 --> 00:01:35,560 But nevertheless, I believe you should have enough background 32 00:01:35,560 --> 00:01:39,070 to at least start to understand and appreciate 33 00:01:39,070 --> 00:01:42,940 some of the aspects of the design of the Cilk runtime 34 00:01:42,940 --> 00:01:44,240 system. 35 00:01:44,240 --> 00:01:47,770 So that's why we're talking about that today. 36 00:01:47,770 --> 00:01:50,950 Just to quickly recall something that you're all, 37 00:01:50,950 --> 00:01:55,120 I'm sure, intimately familiar with by this point, what's 38 00:01:55,120 --> 00:01:56,965 Cilk programming all about? 39 00:01:56,965 --> 00:01:58,840 Well, Cilk is a parallel programming language 40 00:01:58,840 --> 00:02:02,770 that allows you to make your software run faster 41 00:02:02,770 --> 00:02:04,960 using parallel processors. 42 00:02:04,960 --> 00:02:07,810 And to use Cilk, it's pretty straightforward. 43 00:02:07,810 --> 00:02:10,570 You may start with some serial code that 44 00:02:10,570 --> 00:02:13,870 runs in some running time-- we'll denote that as Ts 45 00:02:13,870 --> 00:02:15,910 for certain parts of the lecture. 46 00:02:15,910 --> 00:02:18,580 If you wanted to run in parallel using Cilk, 47 00:02:18,580 --> 00:02:22,390 you just insert Cilk keywords in choice locations. 48 00:02:22,390 --> 00:02:24,910 For example, you can parallelize the outer loop 49 00:02:24,910 --> 00:02:28,870 in this matrix multiply kernel, and that will let your code run 50 00:02:28,870 --> 00:02:32,450 in time Tp on P processors. 51 00:02:32,450 --> 00:02:36,580 And ideally, Tp should be less than Ts. 
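To make the "insert Cilk keywords in choice locations" step concrete, here is a minimal sketch of a matrix multiply kernel with the outer loop parallelized. The slide's actual code is not in the transcript, so the function name `matmul`, the size `N`, and the `USE_CILK` build flag are all assumptions made for illustration. Without the flag, the keyword is erased and the same source runs as the serial code whose running time is Ts.

```c
/* USE_CILK is a hypothetical build flag for this sketch. When it is not
   defined, cilk_for becomes a plain for loop and the code is serial C. */
#ifdef USE_CILK
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

#define N 4  /* illustrative matrix dimension */

/* Computes C = A * B for N x N matrices. */
void matmul(double A[N][N], double B[N][N], double C[N][N]) {
    /* Only the outer loop carries the Cilk keyword: iterations over i
       write disjoint rows of C, so they may run in parallel. */
    cilk_for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    }
}
```

The keyword only marks the loop as allowed to run in parallel; deciding which processor runs which iterations is left to the runtime system, which is the subject of the rest of the lecture.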
52 00:02:36,580 --> 00:02:39,040 Now, just adding keywords is all you 53 00:02:39,040 --> 00:02:42,370 need to do to tell Cilk to execute 54 00:02:42,370 --> 00:02:43,930 the computation in parallel. 55 00:02:43,930 --> 00:02:46,630 What does Cilk do in light of those keywords? 56 00:02:46,630 --> 00:02:52,270 At a very high level, Cilk and specifically its runtime system 57 00:02:52,270 --> 00:02:55,120 takes care of the task of scheduling and load 58 00:02:55,120 --> 00:02:58,570 balancing the computation on the parallel processors 59 00:02:58,570 --> 00:03:01,610 and on the multicore system in general. 60 00:03:01,610 --> 00:03:04,500 So after you've denoted logical parallelism in the program using 61 00:03:04,500 --> 00:03:07,140 Cilk spawn, Cilk sync, and Cilk for, 62 00:03:07,140 --> 00:03:09,700 the Cilk scheduler maps that computation 63 00:03:09,700 --> 00:03:10,930 onto the processors. 64 00:03:10,930 --> 00:03:12,700 And it does so dynamically at runtime, 65 00:03:12,700 --> 00:03:15,460 based on whatever processing resources happen 66 00:03:15,460 --> 00:03:19,300 to be available, and Cilk uses a randomized work stealing 67 00:03:19,300 --> 00:03:23,790 scheduler which guarantees that that mapping is efficient 68 00:03:23,790 --> 00:03:27,670 and the execution runs efficiently. 69 00:03:27,670 --> 00:03:30,190 Now you've all been using the Cilk platform for a while. 70 00:03:30,190 --> 00:03:33,850 In its basic usage, you write some Cilk code, possibly 71 00:03:33,850 --> 00:03:36,370 by parallelizing ordinary serial code, 72 00:03:36,370 --> 00:03:38,570 you feed that to a compiler, you get a binary, 73 00:03:38,570 --> 00:03:44,240 you run the binary with some particular input 74 00:03:44,240 --> 00:03:45,940 on a multicore system. 75 00:03:45,940 --> 00:03:47,710 You get parallel performance. 76 00:03:47,710 --> 00:03:51,910 Today, we're going to look at how exactly does Cilk work?
77 00:03:51,910 --> 00:03:54,850 What's the magic that goes on, hidden 78 00:03:54,850 --> 00:03:58,490 by the boxes on this diagram? 79 00:03:58,490 --> 00:04:02,470 And the very first thing to note is that this picture 80 00:04:02,470 --> 00:04:04,860 is a little bit-- 81 00:04:04,860 --> 00:04:07,420 the first simplification that we're going to break 82 00:04:07,420 --> 00:04:10,900 is that it's not really just Cilk source and the Cilk 83 00:04:10,900 --> 00:04:11,830 compiler. 84 00:04:11,830 --> 00:04:17,470 There's also a runtime system library, libcilkrts.so, in case 85 00:04:17,470 --> 00:04:19,750 you've seen that file or messages 86 00:04:19,750 --> 00:04:21,760 about that file on your system. 87 00:04:21,760 --> 00:04:24,280 And really it's the compiler and the runtime library, 88 00:04:24,280 --> 00:04:28,180 that work together to implement Cilk's runtime system, 89 00:04:28,180 --> 00:04:31,180 to do the work stealing and do the efficient scheduling 90 00:04:31,180 --> 00:04:34,600 and load balancing. 91 00:04:34,600 --> 00:04:39,810 Now we might suspect that if you just take a look at the code 92 00:04:39,810 --> 00:04:42,060 that you get when you compile a Cilk program, 93 00:04:42,060 --> 00:04:45,120 that might tell you something about how Cilk works. 94 00:04:45,120 --> 00:04:50,100 Here's C pseudocode for the results when you compile 95 00:04:50,100 --> 00:04:53,570 a simple piece of Cilk code. 96 00:04:53,570 --> 00:04:55,335 It's a bit complicated. 97 00:04:55,335 --> 00:04:56,460 I think that's fair to say. 98 00:04:56,460 --> 00:04:57,813 There's a lot going on here. 99 00:04:57,813 --> 00:04:59,730 There is one function in the original program, 100 00:04:59,730 --> 00:05:01,000 now there are two. 101 00:05:01,000 --> 00:05:02,700 There's some new variables, there's 102 00:05:02,700 --> 00:05:06,840 some calls to functions that look a little bit strange, 103 00:05:06,840 --> 00:05:09,300 there's a lot going on in the compiled results. 
104 00:05:09,300 --> 00:05:12,810 This isn't exactly easy to interpret or understand, 105 00:05:12,810 --> 00:05:15,720 and this doesn't even bring into the picture the runtime system 106 00:05:15,720 --> 00:05:16,235 library. 107 00:05:16,235 --> 00:05:18,360 The runtime system library, you can find the source 108 00:05:18,360 --> 00:05:19,290 code online. 109 00:05:19,290 --> 00:05:21,720 It's a little less than 20,000 lines of code. 110 00:05:21,720 --> 00:05:24,090 It's also kind of complicated. 111 00:05:24,090 --> 00:05:26,490 So rather than dive into the code directly, 112 00:05:26,490 --> 00:05:30,120 what we're going to do today is an attempt 113 00:05:30,120 --> 00:05:32,370 at a top-down approach to understanding 114 00:05:32,370 --> 00:05:34,080 how the Cilk runtime system works, 115 00:05:34,080 --> 00:05:36,460 and some of the design considerations. 116 00:05:36,460 --> 00:05:38,430 So we're going to start by talking about some 117 00:05:38,430 --> 00:05:41,640 of the required functionality that we need out of the Cilk 118 00:05:41,640 --> 00:05:44,010 runtime system, as well as some performance 119 00:05:44,010 --> 00:05:48,030 considerations for how the runtime system should work. 120 00:05:48,030 --> 00:05:51,390 And then we'll take a look at how the worker deques in Cilk 121 00:05:51,390 --> 00:05:54,480 get implemented, how spawning actually works, 122 00:05:54,480 --> 00:05:56,910 how stealing a computation works, 123 00:05:56,910 --> 00:06:01,350 and how synchronization works within Cilk. 124 00:06:01,350 --> 00:06:02,370 That all sound good? 125 00:06:02,370 --> 00:06:04,070 Any questions so far? 126 00:06:04,070 --> 00:06:06,630 This should all be review, more or less. 127 00:06:10,890 --> 00:06:15,550 OK, so let's talk a little bit about required functionality. 128 00:06:15,550 --> 00:06:18,210 You've seen this picture before, I hope. 
129 00:06:18,210 --> 00:06:20,670 This picture illustrated the execution model 130 00:06:20,670 --> 00:06:21,450 of a Cilk program. 131 00:06:21,450 --> 00:06:25,110 Here we have everyone's favorite exponential time Fibonacci 132 00:06:25,110 --> 00:06:27,370 routine, parallelized using Cilk. 133 00:06:27,370 --> 00:06:30,010 This is not an efficient way to compute Fibonacci numbers, 134 00:06:30,010 --> 00:06:32,670 but it's a nice didactic example for understanding 135 00:06:32,670 --> 00:06:35,940 parallel computation, especially the Cilk model. 136 00:06:35,940 --> 00:06:39,360 And as we saw many lectures ago, when 137 00:06:39,360 --> 00:06:41,700 you run this program on a given input, 138 00:06:41,700 --> 00:06:43,170 the execution of the program can be 139 00:06:43,170 --> 00:06:47,070 modeled as a computation dag. 140 00:06:47,070 --> 00:06:50,040 And this computation dag unfolds dynamically 141 00:06:50,040 --> 00:06:53,050 as the program executes. 142 00:06:53,050 --> 00:06:56,250 But I want to stop and take a hard look 143 00:06:56,250 --> 00:07:00,390 at exactly what that dynamic execution looks like when we've 144 00:07:00,390 --> 00:07:06,668 got parallel processors and work stealing all coming into play. 145 00:07:06,668 --> 00:07:08,460 So we'll stick with this Fibonacci routine, 146 00:07:08,460 --> 00:07:11,370 and we'll imagine we've just got one processor on the system, 147 00:07:11,370 --> 00:07:12,080 to start. 148 00:07:12,080 --> 00:07:13,580 And we're just going to use this one 149 00:07:13,580 --> 00:07:16,160 processor to execute fib(4). 150 00:07:16,160 --> 00:07:18,420 And it's going to take some time to do it, 151 00:07:18,420 --> 00:07:24,410 just to make the story interesting. 
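For reference while following the walkthrough, here is a sketch of the exponential-time Fibonacci routine being described. The slide itself is not in the transcript, so the exact types are an assumption, and `USE_CILK` is a hypothetical build flag for this sketch: without it the keywords are erased and the routine is ordinary serial C, which matches how a single processor executes it.

```c
#include <stdint.h>

/* USE_CILK is a hypothetical build flag for this sketch. Without it,
   cilk_spawn and cilk_sync expand to nothing (the serial elision). */
#ifdef USE_CILK
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Deliberately inefficient: a didactic example, not a good way to
   compute Fibonacci numbers. */
int64_t fib(int64_t n) {
    if (n < 2) return n;          /* base case: no spawn is reached */
    int64_t x, y;
    x = cilk_spawn fib(n - 1);    /* child may run in parallel with... */
    y = fib(n - 2);               /* ...this continuation of the parent */
    cilk_sync;                    /* wait only for this frame's children */
    return x + y;
}
```

The line after the spawn is the continuation: the strand that becomes available in the parent frame when the spawn occurs.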
152 00:07:24,410 --> 00:07:28,420 So we start executing this computation, 153 00:07:28,420 --> 00:07:31,760 and that one processor is just going to execute the Fibonacci 154 00:07:31,760 --> 00:07:35,330 routine from beginning up to the Cilk spawn statement, 155 00:07:35,330 --> 00:07:37,850 as if it's ordinary serial code, because it 156 00:07:37,850 --> 00:07:40,730 is ordinary serial code. 157 00:07:40,730 --> 00:07:43,640 At this point the processor hits the Cilk spawn statement. 158 00:07:43,640 --> 00:07:46,903 What happens now? 159 00:07:46,903 --> 00:07:47,570 Anyone remember? 160 00:07:50,170 --> 00:07:51,170 What happens to the dag? 161 00:08:05,322 --> 00:08:09,047 AUDIENCE: It branches down [INAUDIBLE] 162 00:08:09,047 --> 00:08:10,880 TAO SCHARDL: It branches downward and spawns 163 00:08:10,880 --> 00:08:13,960 another process, more or less. 164 00:08:13,960 --> 00:08:16,300 The way we model that-- 165 00:08:16,300 --> 00:08:19,855 the Cilk spawn is of a routine fib of n minus 1. 166 00:08:19,855 --> 00:08:22,330 In this case, that'll be fib(3). 167 00:08:22,330 --> 00:08:24,520 And so, like an ordinary function call, 168 00:08:24,520 --> 00:08:27,078 we're going to get a brand new frame for fib(3). 169 00:08:27,078 --> 00:08:28,870 And that's going to have some strand that's 170 00:08:28,870 --> 00:08:30,310 available to execute. 171 00:08:30,310 --> 00:08:32,830 But the spawn is not your typical function call. 172 00:08:32,830 --> 00:08:37,059 It actually allows some other computation to run in parallel. 173 00:08:37,059 --> 00:08:38,980 And so the way we model that in this picture 174 00:08:38,980 --> 00:08:41,500 is that we get a new frame for fib(3). 175 00:08:41,500 --> 00:08:43,780 There's a strand available to execute there. 176 00:08:43,780 --> 00:08:47,110 And the continuation, the green strand, 177 00:08:47,110 --> 00:08:52,540 is now available in the frame fib(4). 178 00:08:52,540 --> 00:08:54,190 But no one's necessarily executing it. 
179 00:08:54,190 --> 00:08:57,940 It's just kind of faded in the picture. 180 00:08:57,940 --> 00:08:59,530 So once the spawn has occurred, what's 181 00:08:59,530 --> 00:09:00,613 the processor going to do? 182 00:09:00,613 --> 00:09:02,950 The processor is actually going to dive in and start 183 00:09:02,950 --> 00:09:07,090 executing fib(3), as if it were an ordinary function call. 184 00:09:07,090 --> 00:09:10,630 Yes, there's a strand available within the frame of fib(4), 185 00:09:10,630 --> 00:09:13,230 but the processor isn't going to worry about that strand. 186 00:09:13,230 --> 00:09:15,920 It's just going to say, oh, fib(4) calls fib(3), 187 00:09:15,920 --> 00:09:18,490 going to start computing fib(3). 188 00:09:18,490 --> 00:09:21,250 Sound good? 189 00:09:21,250 --> 00:09:24,790 And so the processor dives down from pink 190 00:09:24,790 --> 00:09:26,320 strand to pink strand. 191 00:09:26,320 --> 00:09:28,570 The instruction pointer for the processor 192 00:09:28,570 --> 00:09:30,910 returns to the beginning of the fib routine, 193 00:09:30,910 --> 00:09:34,780 because we're now calling fib once again. 194 00:09:34,780 --> 00:09:37,120 And this process repeats. 195 00:09:37,120 --> 00:09:40,570 It executes the pink strand up until the Cilk spawn, 196 00:09:40,570 --> 00:09:42,340 just like ordinary serial code. 197 00:09:42,340 --> 00:09:45,460 The spawn occurs-- and we've already seen this picture 198 00:09:45,460 --> 00:09:46,600 before-- 199 00:09:46,600 --> 00:09:49,205 the spawn allows another strand to execute in parallel. 200 00:09:49,205 --> 00:09:50,830 But it also creates a frame for fib(2). 201 00:09:53,430 --> 00:09:56,560 And the processor dives into fib(2), 202 00:09:56,560 --> 00:10:00,220 resetting the instruction pointer to the beginning of fib, 203 00:10:00,220 --> 00:10:02,890 P1 executes up to the spawn.
204 00:10:02,890 --> 00:10:05,290 Once again, we get another strand to execute, 205 00:10:05,290 --> 00:10:07,810 as well as an invocation of fib(1). 206 00:10:07,810 --> 00:10:10,880 Processor dives even further. 207 00:10:10,880 --> 00:10:11,650 So that's fine. 208 00:10:11,650 --> 00:10:14,110 This is just the processor doing more or less 209 00:10:14,110 --> 00:10:16,810 ordinary serial execution of this fib routine, 210 00:10:16,810 --> 00:10:19,120 but it's also allowing some strands 211 00:10:19,120 --> 00:10:21,040 to be executed in parallel. 212 00:10:21,040 --> 00:10:23,230 This is the one processor situation, 213 00:10:23,230 --> 00:10:24,310 looks pretty good so far. 214 00:10:28,110 --> 00:10:29,910 Right, and in the fib(1) case, it 215 00:10:29,910 --> 00:10:32,010 doesn't make it as far through the pink strand 216 00:10:32,010 --> 00:10:34,860 because, in fact, we hit the base case. 217 00:10:34,860 --> 00:10:36,750 But now let's bring in some more processors. 218 00:10:36,750 --> 00:10:39,000 Suppose that another processor finally 219 00:10:39,000 --> 00:10:42,870 shows up, says I'm bored, I want to do some work, 220 00:10:42,870 --> 00:10:44,690 and decides to steal some computation. 221 00:10:44,690 --> 00:10:49,500 It's going to discover the green strand in the frame fib(4), 222 00:10:49,500 --> 00:10:51,090 and P2 is just going to jump in there 223 00:10:51,090 --> 00:10:53,460 and start executing that strand. 224 00:10:53,460 --> 00:10:56,820 And if we think really hard about what this means, 225 00:10:56,820 --> 00:10:59,010 P2 is another processor on the system. 226 00:10:59,010 --> 00:11:00,990 It has its own set of registers. 227 00:11:00,990 --> 00:11:02,940 It has its own instruction pointer. 228 00:11:02,940 --> 00:11:05,880 And so what Cilk somehow allows to happen 229 00:11:05,880 --> 00:11:09,720 is for P2 to just jump right into the middle 230 00:11:09,720 --> 00:11:12,870 of this fib(4) routine, which is already executing.
231 00:11:12,870 --> 00:11:14,370 It just sets the instruction pointer 232 00:11:14,370 --> 00:11:17,370 to point at that green instruction, 233 00:11:17,370 --> 00:11:20,730 at the call to fib of n minus 2. 234 00:11:20,730 --> 00:11:24,240 And it's just going to pick up where processor 1 left off, 235 00:11:24,240 --> 00:11:30,270 when it executed up to this point in fib(4), somehow. 236 00:11:30,270 --> 00:11:32,670 In this case, it executes fib of n minus 2. 237 00:11:32,670 --> 00:11:35,520 That calls fib(2), creates a new strand, 238 00:11:35,520 --> 00:11:37,760 it's just an ordinary function call. 239 00:11:37,760 --> 00:11:39,510 It's going to descend into that new frame. 240 00:11:39,510 --> 00:11:42,630 It's going to return to the beginning of fib. 241 00:11:42,630 --> 00:11:45,043 All that's well and good. 242 00:11:45,043 --> 00:11:47,460 Another processor might come along and steal another piece 243 00:11:47,460 --> 00:11:48,780 of the computation. 244 00:11:48,780 --> 00:11:52,080 It steals another green strand, and so once again, 245 00:11:52,080 --> 00:11:55,170 this processor needs to jump into the middle of an executing 246 00:11:55,170 --> 00:11:56,658 function. 247 00:11:56,658 --> 00:11:58,200 Its instruction pointer is just going 248 00:11:58,200 --> 00:12:01,350 to point at this call of the fib of n minus 2. 249 00:12:01,350 --> 00:12:03,660 Somehow, it's going to have the state of this executing 250 00:12:03,660 --> 00:12:07,320 function available, despite having independent registers. 251 00:12:07,320 --> 00:12:09,960 And it needs to just start from this location, 252 00:12:09,960 --> 00:12:13,205 with all the parameters set appropriately, 253 00:12:13,205 --> 00:12:14,580 and start executing this function 254 00:12:14,580 --> 00:12:16,860 as if it's an ordinary function. 255 00:12:16,860 --> 00:12:21,630 It calls fib of 3 minus 2, which is fib(1). 256 00:12:21,630 --> 00:12:24,390 And now these processors might start executing in parallel.
257 00:12:24,390 --> 00:12:28,180 P1 might return from its base case routine 258 00:12:28,180 --> 00:12:30,378 up to the parent call of fib of n minus 2 259 00:12:30,378 --> 00:12:31,920 and start executing its continuation, 260 00:12:31,920 --> 00:12:33,210 because that wasn't stolen. 261 00:12:33,210 --> 00:12:36,180 Meanwhile, P3 descends into the execution of fib(1). 262 00:12:39,290 --> 00:12:41,970 And then in another step, P3 and P2 263 00:12:41,970 --> 00:12:44,190 make some progress executing their computation. 264 00:12:44,190 --> 00:12:46,650 P2 encounters a Cilk spawn statement, 265 00:12:46,650 --> 00:12:49,360 which creates a new frame and allows another strand 266 00:12:49,360 --> 00:12:50,870 to execute in parallel. 267 00:12:50,870 --> 00:12:54,520 P3 encounters the base case routine and says, 268 00:12:54,520 --> 00:12:55,915 OK, it's time to return. 269 00:12:55,915 --> 00:12:57,540 And all of that can happen in parallel, 270 00:12:57,540 --> 00:13:01,990 and somehow the Cilk system has to coordinate all of this. 271 00:13:01,990 --> 00:13:03,420 But we already have one mystery. 272 00:13:03,420 --> 00:13:06,570 How does a processor start executing from the middle 273 00:13:06,570 --> 00:13:08,490 of a running function? 274 00:13:08,490 --> 00:13:13,380 The running function and its state lived on P1 initially, 275 00:13:13,380 --> 00:13:17,580 and then P2 and P3 somehow find that state, 276 00:13:17,580 --> 00:13:19,200 hop into the middle of the function, 277 00:13:19,200 --> 00:13:21,690 and just start running. 278 00:13:21,690 --> 00:13:22,680 That's kind of strange. 279 00:13:22,680 --> 00:13:23,700 How does that happen? 280 00:13:23,700 --> 00:13:25,870 How does the Cilk runtime system make that happen? 281 00:13:25,870 --> 00:13:27,120 This is one thing to consider. 282 00:13:29,905 --> 00:13:31,280 Another thing to consider is what 283 00:13:31,280 --> 00:13:32,600 happens when we hit a sync.
284 00:13:32,600 --> 00:13:35,270 We'll talk about how these issues get addressed later on, 285 00:13:35,270 --> 00:13:38,990 but let's lay out all of the considerations upfront, 286 00:13:38,990 --> 00:13:41,990 before we-- just see how bad the problem is before we 287 00:13:41,990 --> 00:13:46,350 try to solve it bit by bit. 288 00:13:46,350 --> 00:13:48,915 So now, let's take this picture again and progress it 289 00:13:48,915 --> 00:13:49,790 a little bit further. 290 00:13:49,790 --> 00:13:52,910 Let's suppose that processor three 291 00:13:52,910 --> 00:13:54,720 decides to execute the return. 292 00:13:54,720 --> 00:13:58,670 It's going to return to an invocation of fib(3). 293 00:13:58,670 --> 00:14:05,030 And the next statement is a Cilk sync statement. 294 00:14:05,030 --> 00:14:08,330 But processor three can't execute the sync 295 00:14:08,330 --> 00:14:13,310 because the computation of fib(2) in this case-- 296 00:14:13,310 --> 00:14:14,810 that's being done by processor one-- 297 00:14:14,810 --> 00:14:16,790 that computation is not done yet. 298 00:14:16,790 --> 00:14:19,790 So the execution can't proceed past the sync. 299 00:14:19,790 --> 00:14:23,920 So somehow P3 needs to say, OK, there is a sync statement, 300 00:14:23,920 --> 00:14:26,420 but we can't execute beyond this point 301 00:14:26,420 --> 00:14:29,780 because, specifically, it's waiting on processor one. 302 00:14:29,780 --> 00:14:31,670 It doesn't care what processor two is doing. 303 00:14:31,670 --> 00:14:34,610 Processor two is having a dandy time executing fib(2) 304 00:14:34,610 --> 00:14:35,980 on the other side of the tree. 305 00:14:35,980 --> 00:14:37,748 Processor three shouldn't care. 306 00:14:37,748 --> 00:14:39,290 So processor three can't do something 307 00:14:39,290 --> 00:14:41,960 like, OK, all processors need to stop, 308 00:14:41,960 --> 00:14:44,330 get to this point in the code, and then the execution 309 00:14:44,330 --> 00:14:44,830 can proceed.
310 00:14:44,830 --> 00:14:47,430 No, no, it just needs to wait on processor one. 311 00:14:47,430 --> 00:14:51,920 Somehow the Cilk system has to allow that fine grain 312 00:14:51,920 --> 00:14:56,150 synchronization to happen in this nested pattern. 313 00:14:56,150 --> 00:14:59,420 So how does a Cilk sync wait on only the nested sub 314 00:14:59,420 --> 00:15:01,670 computations within the program? 315 00:15:01,670 --> 00:15:03,420 How does it figure out how to do that? 316 00:15:03,420 --> 00:15:06,717 How does the Cilk runtime system implement this? 317 00:15:06,717 --> 00:15:08,050 So that's another consideration. 318 00:15:08,050 --> 00:15:11,780 OK, so at this point, we have three top level considerations. 319 00:15:11,780 --> 00:15:14,300 A single worker needs to be able to execute this program as 320 00:15:14,300 --> 00:15:15,980 if it's an ordinary serial program. 321 00:15:15,980 --> 00:15:18,830 Thieves have to be able to jump into the middle of executing 322 00:15:18,830 --> 00:15:21,950 functions and pick up from where they left off, 323 00:15:21,950 --> 00:15:24,550 from where other processors in the system left off. 324 00:15:24,550 --> 00:15:28,310 Syncs have to be able to stall functions appropriately, 325 00:15:28,310 --> 00:15:34,880 based only on those functions' nested child sub computations. 326 00:15:34,880 --> 00:15:36,860 So we have three big considerations 327 00:15:36,860 --> 00:15:39,950 that we need to pick apart so far. 328 00:15:39,950 --> 00:15:42,080 That's not the whole story, though. 329 00:15:42,080 --> 00:15:44,330 Any ideas what other functionality we 330 00:15:44,330 --> 00:15:47,960 need to worry about, for implementing this Cilk system? 331 00:15:47,960 --> 00:15:51,230 It's kind of an open ended question, but any thoughts? 332 00:16:07,660 --> 00:16:13,180 We have serial execution, spawning, stealing, and syncing 333 00:16:13,180 --> 00:16:15,790 as top level concerns. 
334 00:16:15,790 --> 00:16:18,850 Anyone remember some other features of Cilk 335 00:16:18,850 --> 00:16:23,140 that the runtime system magically makes happen, 336 00:16:23,140 --> 00:16:25,025 correctly? 337 00:16:25,025 --> 00:16:27,150 It's probably been a while since you've seen those. 338 00:16:27,150 --> 00:16:27,720 Yeah. 339 00:16:27,720 --> 00:16:29,820 AUDIENCE: Cilk for loops divide and conquer? 340 00:16:29,820 --> 00:16:32,770 TAO SCHARDL: The Cilk for loops divide and conquer. 341 00:16:32,770 --> 00:16:38,170 Somehow, the runtime system does have to implement Cilk fors. 342 00:16:38,170 --> 00:16:41,200 The Cilk fors end up getting implemented internally, 343 00:16:41,200 --> 00:16:42,490 with spawns and syncs. 344 00:16:42,490 --> 00:16:46,090 That's courtesy of the compiler. 345 00:16:46,090 --> 00:16:49,180 Yeah, courtesy of the compiler. 346 00:16:49,180 --> 00:16:51,490 So we won't look too hard at Cilk fors today, 347 00:16:51,490 --> 00:16:54,820 but that's definitely one concern. 348 00:16:54,820 --> 00:16:56,090 Good observation. 349 00:16:56,090 --> 00:17:00,580 Any other thoughts, sort of low level system details 350 00:17:00,580 --> 00:17:04,118 that Cilk needs to implement correctly? 351 00:17:09,380 --> 00:17:12,500 Cache coherence-- it actually doesn't 352 00:17:12,500 --> 00:17:15,470 need to worry too much about cache coherence 353 00:17:15,470 --> 00:17:19,790 although, given the latest performance numbers 354 00:17:19,790 --> 00:17:22,010 I've seen from Cilk, maybe it should worry more 355 00:17:22,010 --> 00:17:24,613 about the cache. 356 00:17:24,613 --> 00:17:26,030 But it turns out the hardware does 357 00:17:26,030 --> 00:17:28,700 a pretty good job maintaining the cache coherence 358 00:17:28,700 --> 00:17:30,320 protocol itself. 359 00:17:30,320 --> 00:17:31,670 But good guess.
360 00:17:40,645 --> 00:17:42,020 It's not really a tough question, 361 00:17:42,020 --> 00:17:48,080 because it's really just calling back memories of old lectures. 362 00:17:48,080 --> 00:17:50,270 I think you recently had a quiz on this material, 363 00:17:50,270 --> 00:17:53,300 so it's probably safe to say that all that material has 364 00:17:53,300 --> 00:17:57,680 been paged out of your brain at this point. 365 00:17:57,680 --> 00:18:01,070 So I'll just spoil the fun for you. 366 00:18:01,070 --> 00:18:03,700 Cilk has a notion of a cactus stack. 367 00:18:03,700 --> 00:18:05,730 So we talked a little bit about processors 368 00:18:05,730 --> 00:18:07,730 jumping into the middle of an executing function 369 00:18:07,730 --> 00:18:13,220 and somehow having the state of that function available. 370 00:18:13,220 --> 00:18:14,960 One consideration is register state, 371 00:18:14,960 --> 00:18:17,720 but another consideration is the stack itself. 372 00:18:17,720 --> 00:18:20,810 And Cilk supports C's rule for pointers, 373 00:18:20,810 --> 00:18:25,850 namely that children can see pointers into parent frames, 374 00:18:25,850 --> 00:18:29,150 but parents can't see pointers into child frames. 375 00:18:29,150 --> 00:18:32,030 Now each processor, each worker in a Cilk system, 376 00:18:32,030 --> 00:18:35,330 needs to have its own view of the stack. 377 00:18:35,330 --> 00:18:38,180 But those views aren't necessarily independent. 378 00:18:38,180 --> 00:18:41,420 In this picture, all five processors 379 00:18:41,420 --> 00:18:47,900 share the same view of the frame for the instantiation of A, 380 00:18:47,900 --> 00:18:50,000 then processors three through five all share 381 00:18:50,000 --> 00:18:53,120 the same view for the instantiation of C.
382 00:18:53,120 --> 00:18:56,330 So somehow, Cilk has to make all of those views 383 00:18:56,330 --> 00:19:01,310 available and consistent but not quite the same, sort 384 00:19:01,310 --> 00:19:05,450 of consistent as we get with cache coherence. 385 00:19:05,450 --> 00:19:08,630 Cilk somehow has to implement this cactus stack. 386 00:19:08,630 --> 00:19:13,455 So that's another consideration that we have to worry about. 387 00:19:13,455 --> 00:19:16,130 And then there's one more kind of funny detail. 388 00:19:16,130 --> 00:19:19,735 If we take another look at work stealing itself-- 389 00:19:19,735 --> 00:19:23,300 you may remember we had this picture from several lectures 390 00:19:23,300 --> 00:19:25,910 ago where we have processors on the system, 391 00:19:25,910 --> 00:19:29,780 each maintains its own deque of frames, 392 00:19:29,780 --> 00:19:33,710 and workers are allowed to steal frames from each other. 393 00:19:33,710 --> 00:19:37,760 But if we take a look at how this all unfolds, 394 00:19:37,760 --> 00:19:40,910 yes we may have a processor that performs a call, 395 00:19:40,910 --> 00:19:44,090 and that'll push another frame for a called function 396 00:19:44,090 --> 00:19:46,480 onto its deque on the bottom. 397 00:19:46,480 --> 00:19:48,770 It may spawn, and that'll push a spawn frame 398 00:19:48,770 --> 00:19:50,600 onto the bottom of its deque. 399 00:19:50,600 --> 00:19:52,580 But if we fast forward a little bit 400 00:19:52,580 --> 00:19:55,070 and we end up with a worker with nothing to do, 401 00:19:55,070 --> 00:19:56,870 that worker is going to go ahead and steal, 402 00:19:56,870 --> 00:20:01,750 picking another worker in the system at random. 403 00:20:01,750 --> 00:20:04,120 And it's going to steal from the top of the deque. 404 00:20:04,120 --> 00:20:07,400 But it's not just going to steal the topmost item on the deque. 405 00:20:07,400 --> 00:20:10,760 It's actually going to steal a chunk of items from the deque.
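As a way to picture stealing a chunk rather than a single item, here is a small sequential sketch. It assumes the rule the lecture states for this picture, that a thief takes everything through the parent of the next spawned frame, and it models the deque as a flat array of frames tagged as called or spawned. The names `deque`, `push_bottom`, and `steal_top` are hypothetical, and there is no locking or concurrency here, so this is a model of the bookkeeping only, not the runtime's actual implementation.

```c
#include <stddef.h>

/* Hypothetical model: each deque entry records only how the frame
   was created. */
typedef enum { CALL_FRAME, SPAWN_FRAME } frame_kind;

#define DEQUE_CAP 64  /* illustrative fixed capacity */

typedef struct {
    frame_kind frames[DEQUE_CAP];
    size_t top;     /* index of the oldest frame; thieves take from here */
    size_t bottom;  /* one past the newest frame; the owner pushes here */
} deque;

void deque_init(deque *d) { d->top = d->bottom = 0; }

/* Owner side: a call or a spawn pushes a frame on the bottom.
   Returns 1 on success, 0 if the model's fixed capacity is exhausted. */
int push_bottom(deque *d, frame_kind k) {
    if (d->bottom == DEQUE_CAP) return 0;
    d->frames[d->bottom++] = k;
    return 1;
}

/* Thief side: scan below the topmost frame for the next spawned frame
   and take every frame above it, so the last frame taken is that
   spawned frame's parent. Returns the number of frames stolen, or 0
   when no spawned frame lies below the top (no continuation to resume). */
size_t steal_top(deque *d) {
    for (size_t i = d->top + 1; i < d->bottom; ++i) {
        if (d->frames[i] == SPAWN_FRAME) {
            size_t stolen = i - d->top;
            d->top = i;   /* the victim keeps the spawned frame and below */
            return stolen;
        }
    }
    return 0;
}
```

For a deque holding, from top to bottom, a spawned frame, two called frames, a spawned frame, and a called frame, this model steals three frames and leaves the victim with the last two. The thief cannot know how many frames it will take until it has scanned the deque.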
406 00:20:10,760 --> 00:20:15,170 In particular, if it selects the third processor 407 00:20:15,170 --> 00:20:18,530 in this picture, third from the left, 408 00:20:18,530 --> 00:20:23,570 this thief is going to steal everything 409 00:20:23,570 --> 00:20:27,160 through the parent of the next spawned frame. 410 00:20:27,160 --> 00:20:29,940 It needs to take this whole stack of frames, 411 00:20:29,940 --> 00:20:33,470 and it's not clear a priori how many frames 412 00:20:33,470 --> 00:20:37,335 the worker is going to have to steal in this case. 413 00:20:37,335 --> 00:20:39,460 But nevertheless, it needs to take all those frames 414 00:20:39,460 --> 00:20:40,420 and resume execution. 415 00:20:40,420 --> 00:20:44,080 After all, that bottom was a call frame that it just stole. 416 00:20:44,080 --> 00:20:45,700 That's where there's a continuation 417 00:20:45,700 --> 00:20:48,460 with work available to be done in parallel. 418 00:20:51,440 --> 00:20:53,233 And so, if we think about it, there 419 00:20:53,233 --> 00:20:54,650 are a lot of questions that arise. 420 00:20:54,650 --> 00:20:56,890 What's involved in stealing frames? 421 00:20:56,890 --> 00:21:00,280 What synchronization does this system have to implement? 422 00:21:00,280 --> 00:21:02,100 What happens to the stack? 423 00:21:02,100 --> 00:21:04,600 It looks like we just shifted some frames from one processor 424 00:21:04,600 --> 00:21:07,390 to another, but the first processor, the victim, 425 00:21:07,390 --> 00:21:09,820 still needs access to the data in that stack. 426 00:21:09,820 --> 00:21:13,300 So how does that part work, and how does any of this actually 427 00:21:13,300 --> 00:21:16,360 become efficient? 428 00:21:16,360 --> 00:21:19,060 So now we have a pretty decent list of functionality 429 00:21:19,060 --> 00:21:21,340 that we need out of the Cilk runtime system. 430 00:21:21,340 --> 00:21:23,650 We need serial execution to work. 
431 00:21:23,650 --> 00:21:26,350 We need thieves to be able to jump into the middle of running 432 00:21:26,350 --> 00:21:27,310 functions. 433 00:21:27,310 --> 00:21:32,290 We need syncs to synchronize in this nested, fine-grained way. 434 00:21:32,290 --> 00:21:36,190 We need to implement a cactus stack for all the workers 435 00:21:36,190 --> 00:21:37,570 to see. 436 00:21:37,570 --> 00:21:41,860 And thieves have to deal with mixtures of spawned frames 437 00:21:41,860 --> 00:21:45,190 and called frames that may be available 438 00:21:45,190 --> 00:21:48,730 when they steal a computation. 439 00:21:48,730 --> 00:21:50,770 So that's a bunch of considerations. 440 00:21:50,770 --> 00:21:53,380 Is this the whole picture? 441 00:21:53,380 --> 00:21:55,600 Well, there's a little bit more to it than that. 442 00:21:55,600 --> 00:21:57,100 So before I give you any answers, I'm 443 00:21:57,100 --> 00:22:00,008 just going to keep raising questions. 444 00:22:00,008 --> 00:22:02,050 And now I want to raise some questions concerning 445 00:22:02,050 --> 00:22:03,430 the performance of the system. 446 00:22:03,430 --> 00:22:06,310 How do we want to design the system 447 00:22:06,310 --> 00:22:12,580 to get good parallel execution times? 448 00:22:12,580 --> 00:22:15,080 Well, if we take a look at the work-stealing bounds for Cilk, 449 00:22:15,080 --> 00:22:17,480 Cilk's work-stealing scheduler 450 00:22:17,480 --> 00:22:20,830 achieves an expected running time of Tp, 451 00:22:20,830 --> 00:22:24,770 on P processors, which is proportional to the work 452 00:22:24,770 --> 00:22:27,200 of the computation divided by the number of processors, 453 00:22:27,200 --> 00:22:31,160 plus something on the order of the span of the computation. 454 00:22:31,160 --> 00:22:34,490 Now if we take a look at this running time bound, 455 00:22:34,490 --> 00:22:37,500 we can decompose it into two pieces.
456 00:22:37,500 --> 00:22:40,280 The T1 over P part, that's really the time 457 00:22:40,280 --> 00:22:44,960 that the parallel workers on the system spend doing actual work. 458 00:22:44,960 --> 00:22:48,170 There are P of those workers, and they're all making progress 459 00:22:48,170 --> 00:22:50,000 on the work of the computation. 460 00:22:50,000 --> 00:22:52,760 That comes out to T1 over P. 461 00:22:52,760 --> 00:22:55,450 The other part of the bound, order T infinity, 462 00:22:55,450 --> 00:22:58,040 turns out to be the time that workers 463 00:22:58,040 --> 00:23:01,940 spend stealing computation from each other. 464 00:23:01,940 --> 00:23:04,880 And ideally, what we want when we parallelize a program using 465 00:23:04,880 --> 00:23:09,440 Cilk, is we want to see this program achieve linear speedup. 466 00:23:09,440 --> 00:23:14,870 That means that if we give the program more processors to run, 467 00:23:14,870 --> 00:23:17,960 if we increase P, we want to see the execution time 468 00:23:17,960 --> 00:23:21,820 decrease, linearly, with P. 469 00:23:21,820 --> 00:23:26,310 And that means we want the workers in the Cilk system 470 00:23:26,310 --> 00:23:28,460 to spend most of the time doing useful work. 471 00:23:28,460 --> 00:23:30,470 We don't want the workers spending a lot of time 472 00:23:30,470 --> 00:23:31,512 stealing from each other. 473 00:23:34,660 --> 00:23:38,060 In fact, we want even more than this. 474 00:23:38,060 --> 00:23:41,650 We don't just want work divided by number of processors. 475 00:23:41,650 --> 00:23:44,290 We really care about how the performance compares 476 00:23:44,290 --> 00:23:47,950 to the running time of the original serial code 477 00:23:47,950 --> 00:23:50,140 that we were given, that we parallelized. 478 00:23:50,140 --> 00:23:53,800 That original serial code ran in time Ts.
479 00:23:53,800 --> 00:23:56,200 And now we parallelize it using Cilk spawn, Cilk sync, 480 00:23:56,200 --> 00:23:59,090 or in this case, Cilk for. 481 00:23:59,090 --> 00:24:01,583 And ideally, with sufficient parallelism, 482 00:24:01,583 --> 00:24:03,250 we'll guarantee that the running time is 483 00:24:03,250 --> 00:24:07,320 going to be Tp, proportional to the work per processor, T1 484 00:24:07,320 --> 00:24:10,780 divided by P. But we really want speedup compared 485 00:24:10,780 --> 00:24:14,200 to Ts. So that's our goal. 486 00:24:14,200 --> 00:24:18,130 We want Tp to be proportional to Ts over P. 487 00:24:18,130 --> 00:24:20,620 That says that we want the serial running time 488 00:24:20,620 --> 00:24:24,580 to be pretty close to the work of the parallel computation. 489 00:24:24,580 --> 00:24:28,120 So the one processor running time of our Cilk code, ideally, 490 00:24:28,120 --> 00:24:31,390 should look pretty close to the running time 491 00:24:31,390 --> 00:24:32,590 of the original serial code. 492 00:24:35,610 --> 00:24:38,090 So just to put these pieces together, 493 00:24:38,090 --> 00:24:41,180 if we were originally given a serial program that 494 00:24:41,180 --> 00:24:44,330 ran in time Ts, and we parallelize it using Cilk, 495 00:24:44,330 --> 00:24:46,430 we end up with a parallel program with work T1 496 00:24:46,430 --> 00:24:48,050 and span T infinity. 497 00:24:48,050 --> 00:24:51,410 We want to achieve linear speedup on P processors, 498 00:24:51,410 --> 00:24:54,320 compared to the original serial running time. 499 00:24:54,320 --> 00:24:56,490 In order to do that, we need two things. 500 00:24:56,490 --> 00:24:58,260 We need ample parallelism. 501 00:24:58,260 --> 00:25:01,220 T1 over T infinity should be a lot bigger than P. 502 00:25:01,220 --> 00:25:05,780 And we've seen why that's the case in lectures past. 503 00:25:05,780 --> 00:25:08,690 We also want what's called high work efficiency.
504 00:25:08,690 --> 00:25:11,060 We want the ratio of the serial running time divided 505 00:25:11,060 --> 00:25:13,670 by the work of the Cilk computation 506 00:25:13,670 --> 00:25:15,755 to be pretty close to one, as close as possible. 507 00:25:19,330 --> 00:25:23,670 Now, the Cilk runtime system is designed with these two 508 00:25:23,670 --> 00:25:25,020 observations in mind. 509 00:25:25,020 --> 00:25:27,330 And in particular, the Cilk runtime system 510 00:25:27,330 --> 00:25:29,910 says, suppose that we have a Cilk program that 511 00:25:29,910 --> 00:25:31,950 has ample parallelism. 512 00:25:31,950 --> 00:25:33,600 It has sufficient parallelism to make 513 00:25:33,600 --> 00:25:38,280 good use of the available parallel processors. 514 00:25:38,280 --> 00:25:40,020 Then in implementing the Cilk runtime, 515 00:25:40,020 --> 00:25:44,298 we have a goal to maintain high work efficiency. 516 00:25:44,298 --> 00:25:45,840 And to maintain high work efficiency, 517 00:25:45,840 --> 00:25:48,000 the Cilk runtime system abides by what's 518 00:25:48,000 --> 00:25:50,460 called the work-first principle, which 519 00:25:50,460 --> 00:25:53,550 is to optimize the ordinary serial execution 520 00:25:53,550 --> 00:25:57,280 of the program, even at the expense of some additional cost 521 00:25:57,280 --> 00:25:57,780 to steals. 522 00:26:01,570 --> 00:26:06,372 Now at 30,000 feet, the way that the Cilk runtime system 523 00:26:06,372 --> 00:26:07,830 implements the work-first principle 524 00:26:07,830 --> 00:26:10,150 and makes all these components work 525 00:26:10,150 --> 00:26:14,200 is by dividing the job between both the compiler 526 00:26:14,200 --> 00:26:16,870 and the runtime system library.
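Written out, the bounds discussed above look as follows. This is a summary of the spoken argument rather than a slide from the lecture, with $T_S$ the running time of the original serial code, $T_1$ the work, $T_\infty$ the span, and $T_P$ the running time on $P$ processors. The last line assumes the ample-parallelism condition on the line before it also holds.

```latex
\begin{align*}
T_P &\approx \frac{T_1}{P} + O(T_\infty)
  && \text{(expected work-stealing bound)} \\
\frac{T_1}{T_\infty} \gg P
  &\;\Longrightarrow\; T_P \approx \frac{T_1}{P}
  && \text{(ample parallelism)} \\
\frac{T_S}{T_1} \approx 1
  &\;\Longrightarrow\; T_P \approx \frac{T_S}{P}
  && \text{(high work efficiency)}
\end{align*}
```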
527 00:26:16,870 --> 00:26:20,990 The compiler uses a handful of small data structures, 528 00:26:20,990 --> 00:26:23,110 including workers and stack frames, 529 00:26:23,110 --> 00:26:25,270 and implements optimized fast paths 530 00:26:25,270 --> 00:26:28,840 for execution of functions, which should be 531 00:26:28,840 --> 00:26:31,630 executed when no steals occur. 532 00:26:31,630 --> 00:26:34,213 The runtime system library handles issues 533 00:26:34,213 --> 00:26:35,380 with the parallel execution. 534 00:26:35,380 --> 00:26:38,320 It uses larger data structures that maintain parallel 535 00:26:38,320 --> 00:26:40,110 runtime state. 536 00:26:40,110 --> 00:26:42,760 And it handles slower paths of execution, 537 00:26:42,760 --> 00:26:46,180 in particular when steals actually occur. 538 00:26:46,180 --> 00:26:47,680 So those are all the considerations. 539 00:26:47,680 --> 00:26:49,927 We have a lot of functionality requirements 540 00:26:49,927 --> 00:26:51,760 and we have some performance considerations. 541 00:26:51,760 --> 00:26:53,650 We want to optimize the work, even 542 00:26:53,650 --> 00:26:56,020 at the expense of some steals. 543 00:26:56,020 --> 00:26:59,050 Let's finally take a look at how Cilk works. 544 00:26:59,050 --> 00:27:02,140 How do we deal with all these problems? 545 00:27:02,140 --> 00:27:07,150 I imagine some of you may have some ideas as to how you might 546 00:27:07,150 --> 00:27:13,418 tackle one issue or another, but let's see what really happens. 547 00:27:13,418 --> 00:27:14,710 Let's start from the beginning. 548 00:27:14,710 --> 00:27:16,590 How do we implement a worker deque? 549 00:27:20,650 --> 00:27:22,630 Now for this discussion, we're going 550 00:27:22,630 --> 00:27:26,050 to use a running example with just a really, really 551 00:27:26,050 --> 00:27:27,350 simple Cilk routine. 552 00:27:27,350 --> 00:27:29,830 It's not even as complicated as fib.
553 00:27:29,830 --> 00:27:33,010 We're going to have a function foo that, at one point, 554 00:27:33,010 --> 00:27:36,880 spawns a function bar, in the continuation calls baz, 555 00:27:36,880 --> 00:27:39,670 performs a sync, and then returns. 556 00:27:39,670 --> 00:27:42,130 And just to establish some terminology, 557 00:27:42,130 --> 00:27:44,980 foo will be what we call a spawning function, 558 00:27:44,980 --> 00:27:48,300 meaning that foo is capable of executing a Cilk spawn 559 00:27:48,300 --> 00:27:49,630 statement. 560 00:27:49,630 --> 00:27:52,720 The function bar is spawned by foo. 561 00:27:52,720 --> 00:27:55,870 We can see that from the Cilk spawn in front of bar. 562 00:27:55,870 --> 00:27:58,870 And the call to baz occurs in the continuation of that Cilk 563 00:27:58,870 --> 00:28:03,835 spawn, simple picture. 564 00:28:03,835 --> 00:28:05,140 Everyone good so far? 565 00:28:05,140 --> 00:28:07,630 Any questions about the functionality requirements, 566 00:28:07,630 --> 00:28:10,447 terminology, performance considerations? 567 00:28:13,020 --> 00:28:13,520 OK. 568 00:28:16,290 --> 00:28:19,750 So now we're going to take a hard look at just one worker 569 00:28:19,750 --> 00:28:21,480 and we're going to say, conceptually, we 570 00:28:21,480 --> 00:28:24,810 have this deque-like structure which has spawned frames 571 00:28:24,810 --> 00:28:25,805 and called frames. 572 00:28:25,805 --> 00:28:27,930 Let's ignore the rest of the workers on the system. 573 00:28:27,930 --> 00:28:29,160 Let's not worry about-- 574 00:28:29,160 --> 00:28:32,490 well, we'll worry a little bit about how steals can work, 575 00:28:32,490 --> 00:28:35,100 but we're just going to focus on the actions 576 00:28:35,100 --> 00:28:37,200 that one worker performs. 577 00:28:37,200 --> 00:28:39,857 How do we implement this deque? 578 00:28:39,857 --> 00:28:41,940 And we want the worker to operate on its own deque, 579 00:28:41,940 --> 00:28:42,930 a lot like a stack.
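The running example described above can be sketched roughly as follows. The bodies of bar and baz, the integer arguments, and the return values are placeholders, since the lecture only gives the shape of foo. The two macros are an assumption that lets the sketch build with an ordinary C compiler by producing the serial elision of the program; with a Cilk compiler you would drop them and include `<cilk/cilk.h>` instead.

```c
/* Assumption: no Cilk compiler is available, so the keywords are defined
 * away, which yields the serial elision of the program. */
#define cilk_spawn
#define cilk_sync

/* Placeholder bodies; the lecture does not say what bar and baz compute. */
static int bar(int n) { return n + 1; }
static int baz(int n) { return 2 * n; }

/* foo is a spawning function: it is capable of executing a cilk_spawn. */
int foo(int n) {
    int x, y;
    x = cilk_spawn bar(n);  /* bar is spawned by foo */
    y = baz(n);             /* baz is called in the continuation of the spawn */
    cilk_sync;              /* wait for the spawned call to bar to finish */
    return x + y;
}
```

With these placeholder bodies, foo(3) adds bar(3) and baz(3), whether or not the spawn actually runs in parallel.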
580 00:28:42,930 --> 00:28:44,972 It's going to push and pop frames from the bottom 581 00:28:44,972 --> 00:28:45,930 of the deque. 582 00:28:45,930 --> 00:28:47,820 Steals need to be able to transfer 583 00:28:47,820 --> 00:28:50,370 ownership of several consecutive frames 584 00:28:50,370 --> 00:28:52,410 from the top of the deque. 585 00:28:52,410 --> 00:28:54,908 And thieves need to be able to resume a continuation. 586 00:28:57,660 --> 00:29:01,510 So the way that the Cilk system does this, 587 00:29:01,510 --> 00:29:04,783 to bring this concept into an implementation, 588 00:29:04,783 --> 00:29:06,950 is that it's going to implement the deque externally 589 00:29:06,950 --> 00:29:08,510 from the actual call stack. 590 00:29:08,510 --> 00:29:11,660 Those frames will still be in a stack somewhere 591 00:29:11,660 --> 00:29:14,690 and they'll be managed, roughly speaking, 592 00:29:14,690 --> 00:29:18,710 with a standard calling convention. 593 00:29:18,710 --> 00:29:21,800 But the worker is going to maintain a separate deque data 594 00:29:21,800 --> 00:29:27,170 structure, which will contain pointers into this stack. 595 00:29:27,170 --> 00:29:29,540 And the worker itself will maintain the deque 596 00:29:29,540 --> 00:29:30,860 using head and tail pointers. 597 00:29:33,840 --> 00:29:37,080 Now in addition to this picture, the frames 598 00:29:37,080 --> 00:29:38,668 that are available to be stolen-- 599 00:29:38,668 --> 00:29:40,710 the frames that have computation that a thief can 600 00:29:40,710 --> 00:29:42,600 come along and execute-- 601 00:29:42,600 --> 00:29:46,470 those frames will store an additional local structure 602 00:29:46,470 --> 00:29:49,260 that will contain information necessary for stealing 603 00:29:49,260 --> 00:29:51,370 to occur. 604 00:29:51,370 --> 00:29:52,380 Does this make sense? 605 00:29:52,380 --> 00:29:54,810 Questions so far?
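As a rough picture of the design just described, here is a minimal single-threaded sketch of a deque that lives outside the call stack and holds pointers to frames. All names here (frame, worker, push_bottom, pop_bottom, steal_top) are made up for illustration. The sketch has no synchronization, and its steal takes a single entry, whereas the lecture notes that a real steal transfers several consecutive frames.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAPACITY 64  /* arbitrary size for the sketch */

/* Stand-in for a frame that lives on the ordinary call stack. */
typedef struct frame { const char *name; } frame;

/* Hypothetical worker: the deque holds pointers into the call stack.
 * head marks the top (the steal end), tail the bottom (the worker's end). */
typedef struct worker {
    frame *deque[DEQUE_CAPACITY];
    size_t head;  /* index of the oldest entry */
    size_t tail;  /* index one past the newest entry */
} worker;

void deque_init(worker *w) { w->head = w->tail = 0; }

/* The worker pushes at the bottom, like a stack push. */
void push_bottom(worker *w, frame *f) {
    assert(w->tail < DEQUE_CAPACITY);
    w->deque[w->tail++] = f;
}

/* The worker pops from the bottom; returns NULL if the deque is empty. */
frame *pop_bottom(worker *w) {
    if (w->tail == w->head) return NULL;
    return w->deque[--w->tail];
}

/* A thief takes from the top; returns NULL if there is nothing to steal. */
frame *steal_top(worker *w) {
    if (w->head == w->tail) return NULL;
    return w->deque[w->head++];
}
```

The point of the layout is that the worker's own pushes and pops only touch the tail, while a thief only touches the head, so the common serial path stays cheap.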
606 00:29:54,810 --> 00:29:57,870 Ordinary call stack, deque lives outside of it, 607 00:29:57,870 --> 00:30:02,340 worker points at the deque, pretty simple design. 608 00:30:09,230 --> 00:30:13,620 So I mentioned that the compiler used relatively lightweight 609 00:30:13,620 --> 00:30:16,050 structures. 610 00:30:16,050 --> 00:30:17,440 This is essentially one of them. 611 00:30:17,440 --> 00:30:21,450 And if we take a look at the implementation of the Cilk 612 00:30:21,450 --> 00:30:25,440 runtime system, this is the essence of it. 613 00:30:25,440 --> 00:30:28,110 There are some additional implementation details, 614 00:30:28,110 --> 00:30:30,750 but these are the core-- 615 00:30:30,750 --> 00:30:35,083 this is, in a sense, the core piece of the design. 616 00:30:35,083 --> 00:30:36,250 So the rest is just details. 617 00:30:36,250 --> 00:30:37,940 The Intel Cilk Plus runtime system 618 00:30:37,940 --> 00:30:43,620 takes this design and elaborates on it in a variety of ways. 619 00:30:43,620 --> 00:30:46,115 And we're going to take a look at those elaborations. 620 00:30:46,115 --> 00:30:47,490 First off, what we'll see is that 621 00:30:47,490 --> 00:30:49,410 every spawned subcomputation ends up 622 00:30:49,410 --> 00:30:52,650 being executed within its own helper function, which 623 00:30:52,650 --> 00:30:54,720 the compiler will generate. 624 00:30:54,720 --> 00:30:57,680 That's called a spawn helper function. 625 00:30:57,680 --> 00:30:59,180 And then the runtime system is going 626 00:30:59,180 --> 00:31:03,300 to maintain a few basic data structures as the workers 627 00:31:03,300 --> 00:31:04,185 execute their work. 628 00:31:04,185 --> 00:31:06,060 There'll be a structure for the worker, which 629 00:31:06,060 --> 00:31:08,610 will look similar to what we just saw in the previous slide. 
630 00:31:08,610 --> 00:31:11,280 There'll be a Cilk stack frame structure 631 00:31:11,280 --> 00:31:14,970 for each instantiation of a spawning function, 632 00:31:14,970 --> 00:31:16,765 some function that can perform a spawn. 633 00:31:16,765 --> 00:31:18,390 And there'll be a stack-frame structure 634 00:31:18,390 --> 00:31:25,150 for each spawn helper, each instantiation that is spawned. 635 00:31:25,150 --> 00:31:27,400 Now if we take another look at the compiled code 636 00:31:27,400 --> 00:31:31,180 we had before, some of it starts to make some sense. 637 00:31:31,180 --> 00:31:35,710 Originally, we had our spawning function foo and a statement 638 00:31:35,710 --> 00:31:38,200 that spawned off a call to bar. 639 00:31:38,200 --> 00:31:41,450 And in the C pseudocode of the compiled results, 640 00:31:41,450 --> 00:31:43,400 we see that we have two functions. 641 00:31:43,400 --> 00:31:44,627 The first function foo-- 642 00:31:44,627 --> 00:31:46,960 that's our spawning function-- it's got a bunch of stuff 643 00:31:46,960 --> 00:31:50,578 in it, and we'll figure out what that's doing in a second. 644 00:31:50,578 --> 00:31:52,870 But there's a second function, and that second function 645 00:31:52,870 --> 00:31:55,380 is the spawn helper. 646 00:31:55,380 --> 00:31:57,190 And that spawn helper actually contains 647 00:31:57,190 --> 00:32:02,890 a statement which calls bar and ultimately saves the result. 648 00:32:02,890 --> 00:32:03,730 Make sense? 649 00:32:03,730 --> 00:32:08,880 Now we're starting to understand some of the confusing C 650 00:32:08,880 --> 00:32:10,110 pseudocode we saw before. 651 00:32:16,470 --> 00:32:19,270 And if we take a look at each of these routines we see, 652 00:32:19,270 --> 00:32:23,360 indeed, there is a stack frame structure. 653 00:32:23,360 --> 00:32:27,340 And so in Intel Cilk Plus it's called a Cilk RTS stack frame, 654 00:32:27,340 --> 00:32:29,180 very creative name, I know.
655 00:32:29,180 --> 00:32:31,570 And it's just added as an extra local variable 656 00:32:31,570 --> 00:32:33,012 in each of these functions. 657 00:32:33,012 --> 00:32:34,720 You get one inside of foo, because that's 658 00:32:34,720 --> 00:32:37,720 a spawning function, and you get one inside of the spawn helper. 659 00:32:41,120 --> 00:32:43,940 Now if we dive into the Cilk stack frame structure itself, 660 00:32:43,940 --> 00:32:47,660 by cracking open the source code for the Intel Cilk Plus 661 00:32:47,660 --> 00:32:51,120 runtime, we see that there are a lot of fields in the structure. 662 00:32:51,120 --> 00:32:55,280 The main fields are as follows-- there is a buffer, a context 663 00:32:55,280 --> 00:32:58,160 buffer, and that's going to contain enough information 664 00:32:58,160 --> 00:33:01,190 to resume a function at a continuation, 665 00:33:01,190 --> 00:33:03,800 in particular, after a Cilk spawn or, in fact, 666 00:33:03,800 --> 00:33:05,990 after a Cilk sync statement. 667 00:33:05,990 --> 00:33:09,500 There's an additional integer in the stack frame called flags, 668 00:33:09,500 --> 00:33:12,580 which will summarize the state of the Cilk stack frame, 669 00:33:12,580 --> 00:33:14,750 and we'll see a little bit more about that later. 670 00:33:14,750 --> 00:33:17,540 And there's going to be a pointer to a parent Cilk stack 671 00:33:17,540 --> 00:33:21,980 frame that's somewhere above this Cilk RTS stack frame, 672 00:33:21,980 --> 00:33:23,600 somewhere in the call stack. 673 00:33:23,600 --> 00:33:25,460 So these Cilk RTS stack frames, these 674 00:33:25,460 --> 00:33:30,740 are the extra bit of state that the Cilk runtime system adds 675 00:33:30,740 --> 00:33:32,150 to the ordinary call stack. 676 00:33:35,073 --> 00:33:37,240 So if we take a look at the actual worker structure, 677 00:33:37,240 --> 00:33:38,800 it's a lot like what we saw before. 678 00:33:38,800 --> 00:33:41,560 We have a deque that's external to the call stack.
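The three main fields named above can be pictured with a struct like the following. This is a simplified sketch, not the actual Intel Cilk Plus definition: the type name, the field names, and the choice of `jmp_buf` for the context buffer are assumptions, and the real structure has more fields. The small helper only shows what following the parent pointers up the call stack looks like.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical, pared-down version of the per-function stack frame record. */
typedef struct sketch_stack_frame {
    jmp_buf ctx;                        /* context buffer: enough state to resume
                                           the function at a continuation */
    unsigned int flags;                 /* summarizes the state of this frame */
    struct sketch_stack_frame *parent;  /* closest such record above this one
                                           on the call stack, or NULL */
} sketch_stack_frame;

/* Follows the parent pointers up the call stack and counts the records. */
size_t sketch_depth(const sketch_stack_frame *sf) {
    size_t depth = 0;
    for (; sf != NULL; sf = sf->parent) depth++;
    return depth;
}
```

Because each record is just a local variable of its function, the chain of parent pointers is a linked list threaded through the ordinary call stack.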
679 00:33:41,560 --> 00:33:46,700 The Cilk worker maintains head and tail pointers to the deque. 680 00:33:46,700 --> 00:33:49,030 The Cilk workers are also going to maintain a pointer 681 00:33:49,030 --> 00:33:52,150 to the current Cilk RTS stack frame, which 682 00:33:52,150 --> 00:33:56,560 will tend to be somewhere near the bottom of the stack. 683 00:34:02,880 --> 00:34:05,860 OK, so those are the basic data structures that a single worker 684 00:34:05,860 --> 00:34:07,650 is going to maintain. 685 00:34:07,650 --> 00:34:09,239 That includes the deque. 686 00:34:09,239 --> 00:34:12,420 Let's see them all in action, shall we? 687 00:34:12,420 --> 00:34:15,120 Any questions about that so far, before we 688 00:34:15,120 --> 00:34:17,050 start watching pointers fly? 689 00:34:17,050 --> 00:34:17,616 Yeah. 690 00:34:17,616 --> 00:34:19,830 AUDIENCE: I guess with the previous slide, 691 00:34:19,830 --> 00:34:22,480 there were arrows on the workers' call stack. 692 00:34:22,480 --> 00:34:25,920 What do you [INAUDIBLE]? 693 00:34:25,920 --> 00:34:29,580 TAO SCHARDL: What do the arrows among the elements on the call 694 00:34:29,580 --> 00:34:31,050 stack mean? 695 00:34:31,050 --> 00:34:33,540 So in this picture of the call stack, 696 00:34:33,540 --> 00:34:35,850 function instantiations are actually in green, 697 00:34:35,850 --> 00:34:39,360 and local variables-- specifically the Cilk RTS stack 698 00:34:39,360 --> 00:34:41,010 frames-- 699 00:34:41,010 --> 00:34:43,139 those show up in beige. 700 00:34:43,139 --> 00:34:48,900 So foo SF is the Cilk RTS stack frame inside the instantiation 701 00:34:48,900 --> 00:34:49,777 of foo. 702 00:34:49,777 --> 00:34:51,360 It's just a local variable that's also 703 00:34:51,360 --> 00:34:53,880 stored in the stack, right? 
704 00:34:53,880 --> 00:34:58,440 Now, the Cilk RTS stack frame maintains a parent pointer, 705 00:34:58,440 --> 00:35:02,170 and it maintains a pointer up to some Cilk RTS stack 706 00:35:02,170 --> 00:35:03,660 frame above it on the stack. 707 00:35:03,660 --> 00:35:06,090 It's just another local variable, also stored 708 00:35:06,090 --> 00:35:07,570 in the stack. 709 00:35:07,570 --> 00:35:10,290 So when we step away and look at the whole call stack 710 00:35:10,290 --> 00:35:14,640 with all the function frames and the Cilk RTS stack frames, 711 00:35:14,640 --> 00:35:17,715 that's where we get the pointers climbing up the stack. 712 00:35:17,715 --> 00:35:20,735 We're good? 713 00:35:20,735 --> 00:35:22,208 Other questions? 714 00:35:27,610 --> 00:35:30,880 All right, let's make some pointers fly. 715 00:35:30,880 --> 00:35:32,680 OK, this is going to be kind of a letdown, 716 00:35:32,680 --> 00:35:35,780 because the first thing we're going to look at is some code. 717 00:35:35,780 --> 00:35:37,947 So we're not going to have pointers flying just yet. 718 00:35:40,540 --> 00:35:43,900 We can take a look at the code for the spawning function foo, 719 00:35:43,900 --> 00:35:45,630 at this point. 720 00:35:45,630 --> 00:35:48,970 And there's a lot of extra code in here, clearly. 721 00:35:48,970 --> 00:35:51,490 I've highlighted a lot of stuff on this slide, 722 00:35:51,490 --> 00:35:53,980 and all the highlighted material is 723 00:35:53,980 --> 00:35:58,340 related to the execution of the Cilk runtime system. 724 00:35:58,340 --> 00:36:00,140 But basically, if we look at this code, 725 00:36:00,140 --> 00:36:03,880 we can understand each of these pieces. 726 00:36:03,880 --> 00:36:07,160 Each of them has some role to play in making the Cilk runtime 727 00:36:07,160 --> 00:36:08,020 system work. 728 00:36:08,020 --> 00:36:10,870 So at the very beginning, we have our Cilk stack frame 729 00:36:10,870 --> 00:36:11,920 structure. 
730 00:36:11,920 --> 00:36:15,310 And there's a call to this enter frame 731 00:36:15,310 --> 00:36:17,560 function, and all that really does 732 00:36:17,560 --> 00:36:19,510 is initialize the stack frame. 733 00:36:19,510 --> 00:36:21,610 That's all the function is doing. 734 00:36:21,610 --> 00:36:24,490 Later on, we find that there's this setjmp routine-- 735 00:36:24,490 --> 00:36:26,920 we'll talk a lot more about setjmp in a bit-- 736 00:36:26,920 --> 00:36:30,820 but, at this point, we can say the setjmp prepares 737 00:36:30,820 --> 00:36:32,840 the function for a spawn. 738 00:36:32,840 --> 00:36:37,960 And inside the conditional, where 739 00:36:37,960 --> 00:36:39,730 the setjmp occurs as a predicate, 740 00:36:39,730 --> 00:36:41,508 we have a call to spawn_bar. 741 00:36:41,508 --> 00:36:43,300 If we remember from a couple of slides ago, 742 00:36:43,300 --> 00:36:45,530 spawn_bar was our spawn helper function. 743 00:36:45,530 --> 00:36:48,520 So here, we're just invoking the spawn helper. 744 00:36:48,520 --> 00:36:51,100 Later on in the code, we have another blob 745 00:36:51,100 --> 00:36:55,510 of conditionals with a Cilk RTS sync call, deep inside. 746 00:36:55,510 --> 00:36:57,100 All that code performs a sync. 747 00:36:57,100 --> 00:37:01,150 We'll talk about that a bit near the end of lecture. 748 00:37:01,150 --> 00:37:03,940 And finally, at the end of the spawning function, 749 00:37:03,940 --> 00:37:07,570 we have a call to pop frame, which just cleans up 750 00:37:07,570 --> 00:37:12,070 the Cilk stack frame structure within this function. 751 00:37:12,070 --> 00:37:14,680 And then there's a call to leave frame, which essentially 752 00:37:14,680 --> 00:37:17,990 cleans up the deque. 753 00:37:17,990 --> 00:37:20,310 That's the spawning function. 754 00:37:20,310 --> 00:37:21,560 This is the spawn helper. 755 00:37:21,560 --> 00:37:22,710 It looks somewhat similar.
756 00:37:22,710 --> 00:37:26,000 I've added extra whitespace just to make the slide 757 00:37:26,000 --> 00:37:28,550 a little bit prettier. 758 00:37:28,550 --> 00:37:30,890 And in some ways, it's similar to the spawning function 759 00:37:30,890 --> 00:37:31,400 itself. 760 00:37:31,400 --> 00:37:34,430 We have a Cilk RTS stack frame [INAUDIBLE] spawn helper, 761 00:37:34,430 --> 00:37:36,258 another call to enter frame, which 762 00:37:36,258 --> 00:37:37,550 is just a little bit different. 763 00:37:37,550 --> 00:37:42,000 But essentially, it initializes the stack frame. 764 00:37:42,000 --> 00:37:45,260 Its reason to be is similar to the enter frame 765 00:37:45,260 --> 00:37:47,400 call we saw before. 766 00:37:47,400 --> 00:37:49,790 There's a call to Cilk RTS detach, 767 00:37:49,790 --> 00:37:53,280 which performs a bunch of updates on the deque. 768 00:37:53,280 --> 00:37:54,770 Then there is the actual invocation 769 00:37:54,770 --> 00:37:56,570 of the spawned subroutine. 770 00:37:56,570 --> 00:37:58,653 This is where we're calling bar. 771 00:37:58,653 --> 00:38:00,320 And finally, at the end of the function, 772 00:38:00,320 --> 00:38:03,650 there is a call to pop frame, to clean up the stack structure, 773 00:38:03,650 --> 00:38:06,920 and a call to leave frame, which will clean up the deque 774 00:38:06,920 --> 00:38:08,750 and possibly return. 775 00:38:08,750 --> 00:38:10,127 It'll try to return. 776 00:38:10,127 --> 00:38:11,210 We'll see more about that. 777 00:38:14,510 --> 00:38:17,050 So let's watch all of this in action. 778 00:38:17,050 --> 00:38:18,500 Question? 779 00:38:18,500 --> 00:38:19,020 OK, cool. 780 00:38:22,390 --> 00:38:23,870 Let's see all of this in action. 781 00:38:23,870 --> 00:38:25,840 We'll start off with a pretty boring picture. 782 00:38:25,840 --> 00:38:28,190 All we've got on our call stack is main, 783 00:38:28,190 --> 00:38:30,225 and our Cilk worker has nothing on its deque.
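To make the shape of the two compiled functions easier to follow, here is a sketch in which every runtime call is replaced by a stand-in that only records the order of events. Everything named sketch_* is hypothetical and is not the Intel Cilk Plus runtime API; the stubs do none of the real deque or worker updates, the sync is reduced to a single stub call rather than the blob of conditionals on the slide, and bar and baz have placeholder bodies.

```c
#include <setjmp.h>
#include <string.h>

typedef struct sketch_stack_frame {
    jmp_buf ctx;                        /* context buffer */
    int flags;                          /* summary of the frame's state */
    struct sketch_stack_frame *parent;  /* record above this one on the stack */
} sketch_stack_frame;

char sketch_trace[256];  /* order in which the stand-in calls happen */

static void note(const char *event) {
    strcat(sketch_trace, event);
    strcat(sketch_trace, ";");
}

/* Stand-ins: each one only logs its name (enter also clears the record). */
static void sketch_enter_frame(sketch_stack_frame *sf)      { sf->flags = 0; sf->parent = NULL; note("enter"); }
static void sketch_enter_frame_fast(sketch_stack_frame *sf) { sf->flags = 0; sf->parent = NULL; note("enter_fast"); }
static void sketch_detach(sketch_stack_frame *sf)           { (void)sf; note("detach"); }
static void sketch_sync(sketch_stack_frame *sf)             { (void)sf; note("sync"); }
static void sketch_pop_frame(sketch_stack_frame *sf)        { (void)sf; note("pop"); }
static void sketch_leave_frame(sketch_stack_frame *sf)      { (void)sf; note("leave"); }

static int bar(int n) { note("bar"); return n + 1; }  /* placeholder body */
static int baz(int n) { note("baz"); return 2 * n; }  /* placeholder body */

/* Spawn helper: runs the spawned call to bar and saves its result. */
static void spawn_bar(int *x, int n) {
    sketch_stack_frame sf;
    sketch_enter_frame_fast(&sf);
    sketch_detach(&sf);          /* after this, the continuation could be stolen */
    *x = bar(n);
    sketch_pop_frame(&sf);
    sketch_leave_frame(&sf);
}

/* Spawning function: the compiled form of foo on the no-steal path. */
int foo(int n) {
    sketch_stack_frame sf;
    int x = 0, y;
    sketch_enter_frame(&sf);
    if (!setjmp(sf.ctx))         /* prepares for the spawn; 0 on a direct call */
        spawn_bar(&x, n);
    y = baz(n);                  /* the continuation of the spawn */
    sketch_sync(&sf);
    sketch_pop_frame(&sf);
    sketch_leave_frame(&sf);
    return x + y;
}
```

Running foo once and reading sketch_trace gives the order of events on the serial fast path: the helper's calls are nested between foo's setjmp and its call to baz.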
784 00:38:33,190 --> 00:38:36,100 But now we suppose that main calls our spawning function 785 00:38:36,100 --> 00:38:38,590 foo, and the spawning function foo 786 00:38:38,590 --> 00:38:41,813 contains a Cilk RTS stack frame. 787 00:38:41,813 --> 00:38:44,230 What we're going to do in the Cilk worker, what that enter 788 00:38:44,230 --> 00:38:48,153 frame call is going to perform, all it's going to do 789 00:38:48,153 --> 00:38:49,570 is update the current stack frame. 790 00:38:49,570 --> 00:38:51,520 We now have a Cilk RTS stack frame, 791 00:38:51,520 --> 00:38:56,460 make sure the worker points at it, that's all. 792 00:38:56,460 --> 00:38:59,250 Fast forward a little bit, and foo encounters 793 00:38:59,250 --> 00:39:02,100 this Cilk spawn of bar. 794 00:39:02,100 --> 00:39:04,890 And in the C pseudocode that's compiled for foo, 795 00:39:04,890 --> 00:39:07,410 we have a setjmp routine. 796 00:39:07,410 --> 00:39:11,083 This setjmp is kind of a magical function. 797 00:39:11,083 --> 00:39:12,750 This is the function that allows thieves 798 00:39:12,750 --> 00:39:15,210 to steal the continuation. 799 00:39:15,210 --> 00:39:19,290 And in particular, the setjmp takes, as an argument, 800 00:39:19,290 --> 00:39:20,160 a buffer. 801 00:39:20,160 --> 00:39:21,750 In this case, it's the context buffer 802 00:39:21,750 --> 00:39:24,322 that we have in the Cilk RTS stack frame. 803 00:39:24,322 --> 00:39:25,780 And what the setjmp will do is it 804 00:39:25,780 --> 00:39:28,920 will store information that's necessary to resume 805 00:39:28,920 --> 00:39:32,850 the function at the location of the setjmp. 806 00:39:32,850 --> 00:39:35,280 And it stores that information into the buffer. 807 00:39:35,280 --> 00:39:37,620 Can anyone guess what that information might be? 808 00:39:45,900 --> 00:39:49,900 AUDIENCE: The instruction points at [INAUDIBLE].
809 00:39:49,900 --> 00:39:52,870 TAO SCHARDL: Instruction pointer or stack pointer, 810 00:39:52,870 --> 00:39:55,678 I believe both of those are in the frame. 811 00:39:55,678 --> 00:39:57,220 Yeah, both of those are in the frame. 812 00:39:57,220 --> 00:39:58,210 Good, what else? 813 00:40:06,421 --> 00:40:09,820 AUDIENCE: All the registers are in use. 814 00:40:09,820 --> 00:40:12,830 TAO SCHARDL: All the registers are currently in use. 815 00:40:12,830 --> 00:40:14,800 Does it need all the registers? 816 00:40:14,800 --> 00:40:17,352 You're absolutely on the right track, 817 00:40:17,352 --> 00:40:19,810 but is there any way it could restrict the set of registers 818 00:40:19,810 --> 00:40:20,530 it needs to save? 819 00:40:25,420 --> 00:40:29,318 AUDIENCE: The registers are used later in the execution. 820 00:40:29,318 --> 00:40:30,610 TAO SCHARDL: That's part of it. 821 00:40:30,610 --> 00:40:32,260 setjmp isn't that clever, though, 822 00:40:32,260 --> 00:40:37,120 so it just stores a predetermined set of registers. 823 00:40:37,120 --> 00:40:39,230 But there is another way to restrict the set. 824 00:40:46,146 --> 00:40:50,468 AUDIENCE: [INAUDIBLE] 825 00:40:50,468 --> 00:40:52,260 TAO SCHARDL: Only registers used as parameters 826 00:40:52,260 --> 00:40:57,880 in the called function, yeah, close enough. 827 00:40:57,880 --> 00:41:00,590 Callee-saved registers. 828 00:41:00,590 --> 00:41:04,460 So registers that the function might-- 829 00:41:04,460 --> 00:41:08,390 that it's the responsibility of foo to save, 830 00:41:08,390 --> 00:41:12,290 this goes all the way back to that discussion in lecture, 831 00:41:12,290 --> 00:41:15,140 I don't remember which small number, talking 832 00:41:15,140 --> 00:41:17,785 about the calling convention. 833 00:41:17,785 --> 00:41:19,160 These registers need to be saved, 834 00:41:19,160 --> 00:41:21,620 as well as the instruction pointer and various stack 835 00:41:21,620 --> 00:41:22,400 pointers.
836 00:41:22,400 --> 00:41:24,830 Those are what gets saved into the buffer. 837 00:41:24,830 --> 00:41:27,343 The other registers, well, we're about to call a function, 838 00:41:27,343 --> 00:41:29,510 it's up to that other function to save the registers 839 00:41:29,510 --> 00:41:30,680 appropriately. 840 00:41:30,680 --> 00:41:32,780 So we don't need to worry about those. 841 00:41:36,936 --> 00:41:37,488 So all good? 842 00:41:37,488 --> 00:41:38,530 Any questions about that? 843 00:41:42,820 --> 00:41:45,290 All right, so this setjmp routine, 844 00:41:45,290 --> 00:41:47,290 let's take it for granted that when 845 00:41:47,290 --> 00:41:51,790 we call setjmp on this given buffer, it returns zero. 846 00:41:51,790 --> 00:41:53,682 That's a good lie for now. 847 00:41:53,682 --> 00:41:54,640 We'll just run with it. 848 00:41:54,640 --> 00:41:56,430 So setjmp returns zero. 849 00:41:56,430 --> 00:41:58,690 The condition says, if not zero-- 850 00:41:58,690 --> 00:42:00,760 which turns out to be true-- 851 00:42:00,760 --> 00:42:02,380 and so the next thing that happens 852 00:42:02,380 --> 00:42:06,010 is this call to the spawn helper, spawn_bar, 853 00:42:06,010 --> 00:42:07,410 in this case. 854 00:42:07,410 --> 00:42:11,990 When we call spawn_bar, what happens to our stack? 855 00:42:11,990 --> 00:42:14,455 So this should look pretty routine. 856 00:42:14,455 --> 00:42:16,900 We're doing a function call, and so we 857 00:42:16,900 --> 00:42:20,950 push the frame for the called function onto the stack. 858 00:42:20,950 --> 00:42:23,380 And that called function, spawn_bar, 859 00:42:23,380 --> 00:42:25,652 contains a local variable, which is 860 00:42:25,652 --> 00:42:26,860 this [INAUDIBLE] stack frame. 861 00:42:26,860 --> 00:42:29,072 So that also gets pushed onto the stack, 862 00:42:29,072 --> 00:42:30,030 pretty straightforward. 863 00:42:30,030 --> 00:42:33,460 We've seen function calls many times before.
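The "returns zero" behavior can be seen with the standard C `setjmp` and `longjmp` on their own. The sketch below is plain C, not Cilk runtime code, and the names setjmp_paths and setjmp_demo are made up. A direct call to `setjmp` returns 0, so the body of the `if` runs; a later `longjmp` on the same buffer makes that same `setjmp` return again with a nonzero value, so the `if` is skipped and execution continues after it. How the Cilk runtime uses the saved buffer is not covered at this point in the lecture, so nothing here should be read as its mechanism.

```c
#include <setjmp.h>

typedef struct setjmp_paths {
    int direct;   /* times the body of the if ran (setjmp returned 0) */
    int resumed;  /* times the code after the if ran */
} setjmp_paths;

static jmp_buf ctx;  /* plays the role of the context buffer */

setjmp_paths setjmp_demo(void) {
    /* static, so its value stays well defined across the longjmp */
    static setjmp_paths p;
    p.direct = 0;
    p.resumed = 0;
    if (!setjmp(ctx)) {
        /* Direct call: setjmp returned 0. In the compiled foo, this is
         * where the spawn helper would be called. */
        p.direct++;
        longjmp(ctx, 1);  /* jump back; the setjmp above now returns 1 */
    }
    /* Reached only after the longjmp: this stands in for the continuation. */
    p.resumed++;
    return p;
}
```

Each path runs exactly once here: first the direct path, then the resumed one.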
864 00:42:33,460 --> 00:42:35,650 This should look pretty familiar. 865 00:42:35,650 --> 00:42:39,113 Now we do this Cilk RTS enter frame fast routine. 866 00:42:39,113 --> 00:42:40,780 And I mentioned before that that's going 867 00:42:40,780 --> 00:42:44,200 to update the worker structure. 868 00:42:48,638 --> 00:42:49,930 So what's going to happen here? 869 00:42:49,930 --> 00:42:54,250 Well, we have a brand new Cilk RTS stack frame on the stack. 870 00:42:54,250 --> 00:42:57,070 Any guesses as to what change we make? 871 00:43:02,430 --> 00:43:04,380 What would enter frame do? 872 00:43:04,380 --> 00:43:07,687 AUDIENCE: [INAUDIBLE] 873 00:43:07,687 --> 00:43:09,270 TAO SCHARDL: Point current stack frame 874 00:43:09,270 --> 00:43:11,327 to the spawn_bar stack frame, you're right. 875 00:43:11,327 --> 00:43:11,910 Anything else? 876 00:43:18,306 --> 00:43:20,110 Hope I got this animation right. 877 00:43:30,550 --> 00:43:34,840 What are the various fields within the stack frame? 878 00:43:34,840 --> 00:43:36,820 And what did-- sorry, I don't know your name. 879 00:43:36,820 --> 00:43:37,528 What's your name? 880 00:43:40,407 --> 00:43:41,330 AUDIENCE: I'm Greg. 881 00:43:41,330 --> 00:43:44,370 TAO SCHARDL: Greg, what did Greg ask about before, 882 00:43:44,370 --> 00:43:46,460 when we saw an earlier picture of the call stack? 883 00:43:58,292 --> 00:44:00,760 AUDIENCE: Set a pointer to the parent. 884 00:44:00,760 --> 00:44:03,803 TAO SCHARDL: Set a pointer to the parent, exactly. 885 00:44:03,803 --> 00:44:05,220 So what we're going to do is we're 886 00:44:05,220 --> 00:44:06,980 going to take this call stack, we'll 887 00:44:06,980 --> 00:44:09,120 do the enter frame fast routine. 888 00:44:09,120 --> 00:44:12,870 That establishes this parent pointer in our brand new stack 889 00:44:12,870 --> 00:44:14,010 frame. 890 00:44:14,010 --> 00:44:16,637 And we update the worker's current stack frame to point 891 00:44:16,637 --> 00:44:17,220 at the bottom.
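The two updates that enter frame performs can be sketched in a few lines of C. This is a minimal illustration, not the runtime's actual code: the type and function names (`sketch_frame`, `sketch_worker`, `sketch_enter_frame`) are hypothetical stand-ins, and the real structures carry many more fields. The one detail worth noticing is the order: the old current-frame pointer is read into the parent field before it is overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the runtime's structures. */
typedef struct sketch_frame {
    struct sketch_frame *parent;   /* frame of the function that called or spawned us */
} sketch_frame;

typedef struct {
    sketch_frame *current_frame;   /* innermost Cilk frame this worker is executing */
} sketch_worker;

/* Link a freshly pushed frame into the worker's chain of frames. */
static void sketch_enter_frame(sketch_worker *w, sketch_frame *sf) {
    assert(w != NULL && sf != NULL);
    sf->parent = w->current_frame;  /* 1. record the old current frame as the parent */
    w->current_frame = sf;          /* 2. only then point the worker at the new frame */
}
```

If the two assignments were swapped, the new frame would end up as its own parent, which is why the parent pointer has to be set first.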
892 00:44:17,220 --> 00:44:18,754 Yeah, question? 893 00:44:18,754 --> 00:44:21,870 AUDIENCE: How does enter frame know what the parent is? 894 00:44:21,870 --> 00:44:24,870 TAO SCHARDL: How does enter frame know what the parent is? 895 00:44:24,870 --> 00:44:25,740 Good question. 896 00:44:25,740 --> 00:44:29,950 Enter frame knows the worker. 897 00:44:29,950 --> 00:44:33,510 Or rather, enter frame can do a call, which will give it access 898 00:44:33,510 --> 00:44:35,915 to the Cilk worker structure. 899 00:44:35,915 --> 00:44:38,100 And because it can do a call, it can 900 00:44:38,100 --> 00:44:41,553 read the current stack frame pointer in the worker. 901 00:44:41,553 --> 00:44:43,220 AUDIENCE: So we do [INAUDIBLE] before we 902 00:44:43,220 --> 00:44:46,990 change the current [INAUDIBLE]? 903 00:44:46,990 --> 00:44:50,320 TAO SCHARDL: Yeah, in this case we do. 904 00:44:50,320 --> 00:44:55,950 So we add the parent pointer, then we delete and update. 905 00:44:55,950 --> 00:44:59,505 So, good catch. 906 00:44:59,505 --> 00:45:00,948 Any other questions? 907 00:45:05,560 --> 00:45:06,060 Cool. 908 00:45:08,640 --> 00:45:11,190 All right, now we encounter this thing, Cilk RTS detach. 909 00:45:11,190 --> 00:45:13,080 This one's kind of exciting. 910 00:45:13,080 --> 00:45:18,720 Finally we get to do something to the deque. 911 00:45:18,720 --> 00:45:20,280 Any guesses what we do? 912 00:45:20,280 --> 00:45:22,770 How do we update the deque? 913 00:45:22,770 --> 00:45:23,870 Here's a hint. 914 00:45:23,870 --> 00:45:27,450 Cilk RTS detach allows-- 915 00:45:27,450 --> 00:45:31,380 this is the function that allows some computation to be stolen. 916 00:45:31,380 --> 00:45:35,810 Once Cilk RTS detach is done executing, 917 00:45:35,810 --> 00:45:38,610 a thief could come along and steal the continuation 918 00:45:38,610 --> 00:45:40,320 of the Cilk spawn. 
919 00:45:40,320 --> 00:45:46,350 So what would Cilk RTS detach do to our worker 920 00:45:46,350 --> 00:45:47,260 and its structures? 921 00:45:52,060 --> 00:45:52,810 Yeah, in the back. 922 00:45:52,810 --> 00:45:55,750 AUDIENCE: Push the stack frame to the worker deque? 923 00:45:55,750 --> 00:45:58,510 TAO SCHARDL: Push the stack frame to the worker deque, 924 00:45:58,510 --> 00:46:00,220 specifically at the tail. 925 00:46:03,100 --> 00:46:05,725 Right, I gave it away by clicking the animation, 926 00:46:05,725 --> 00:46:07,690 oh well. 927 00:46:07,690 --> 00:46:11,920 Now the thing that's available to be stolen is inside of foo. 928 00:46:11,920 --> 00:46:14,350 So what ends up getting pushed onto the deque 929 00:46:14,350 --> 00:46:16,660 is not the current stack frame, but in fact 930 00:46:16,660 --> 00:46:20,590 its immediate parent, so the stack frame of foo. 931 00:46:20,590 --> 00:46:23,270 That gets pushed onto the tail of the deque. 932 00:46:23,270 --> 00:46:27,340 And we now push something onto the tail of a deque. 933 00:46:27,340 --> 00:46:30,610 And so we advance the tail pointer. 934 00:46:30,610 --> 00:46:32,110 Still good, everyone? 935 00:46:32,110 --> 00:46:33,730 I see some nods. 936 00:46:33,730 --> 00:46:35,158 I see at least one nod. 937 00:46:35,158 --> 00:46:35,700 I'll take it. 938 00:46:37,980 --> 00:46:39,730 But feel free to ask questions, of course. 939 00:46:43,120 --> 00:46:46,540 And then of course there is this invocation of bar. 940 00:46:46,540 --> 00:46:48,340 This does what you might expect. 941 00:46:48,340 --> 00:46:51,310 It calls bar, no magic here. 942 00:46:51,310 --> 00:46:54,890 Well, no new magic here. 943 00:46:54,890 --> 00:46:58,360 OK, fast forward, let's suppose that bar finally returns. 944 00:46:58,360 --> 00:47:00,280 And now we return to the statement 945 00:47:00,280 --> 00:47:02,500 after bar in the spawn helper. 946 00:47:02,500 --> 00:47:04,630 That statement is the pop frame. 
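As a rough sketch of the detach step just described, the code below pushes the parent of the current frame (foo's frame, not the spawn helper's) onto the tail of the worker's deque and advances the tail. All names here are hypothetical, the deque is a small fixed array, and the sketch is single-threaded; the real runtime also has to make this push visible to thieves safely, which is not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DEQUE_CAP 8

/* Hypothetical, simplified stand-ins for the runtime's structures. */
typedef struct sketch_frame {
    struct sketch_frame *parent;
} sketch_frame;

typedef struct {
    sketch_frame *deque[SKETCH_DEQUE_CAP];
    size_t head;                   /* thieves take from this end */
    size_t tail;                   /* the owning worker pushes and pops here */
    sketch_frame *current_frame;   /* the spawn helper's frame at detach time */
} sketch_worker;

/* Make the parent's continuation available to be stolen. */
static void sketch_detach(sketch_worker *w) {
    sketch_frame *parent = w->current_frame->parent;
    assert(parent != NULL && w->tail < SKETCH_DEQUE_CAP);
    w->deque[w->tail] = parent;    /* push foo's frame, not the helper's */
    w->tail++;                     /* advance the tail pointer */
}
```

After this call the head still points at the oldest entry, so a thief that takes from the head would find foo's frame there.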
947 00:47:07,210 --> 00:47:10,120 Actually, since we just returned from bar, 948 00:47:10,120 --> 00:47:12,100 we need to get rid of bar from the stack frame. 949 00:47:12,100 --> 00:47:14,410 Good, now we can execute the pop frame. 950 00:47:14,410 --> 00:47:17,050 What would the pop frame do? 951 00:47:17,050 --> 00:47:19,220 It's going to clean up the stack frame structure. 952 00:47:19,220 --> 00:47:22,370 So what would that entail, any guesses? 953 00:47:27,140 --> 00:47:29,640 AUDIENCE: I guess it would move the current stack frame back 954 00:47:29,640 --> 00:47:31,338 to the parent stack frame? 955 00:47:31,338 --> 00:47:33,880 TAO SCHARDL: Move the current stack frame back to the parent, 956 00:47:33,880 --> 00:47:36,030 very good. 957 00:47:36,030 --> 00:47:43,330 I think that's largely it. 958 00:47:43,330 --> 00:47:45,780 I guess there's one other thing it can do. 959 00:47:45,780 --> 00:47:47,710 It's kind of optional, given that it's going 960 00:47:47,710 --> 00:47:51,870 to garbage the memory anyway. 961 00:47:51,870 --> 00:47:54,690 So it updates the current stack frame to point to the parent, 962 00:47:54,690 --> 00:47:56,648 and now it no longer needs that parent pointer. 963 00:47:56,648 --> 00:48:00,360 So it can clean that up, in principle. 964 00:48:00,360 --> 00:48:03,000 And then there's this call to Cilk RTS leave frame. 965 00:48:03,000 --> 00:48:07,590 This is magic-- well, not really, but it's not obvious. 966 00:48:07,590 --> 00:48:10,782 This is a function call that may or may not return. 967 00:48:10,782 --> 00:48:12,240 Welcome to the Cilk runtime system. 968 00:48:12,240 --> 00:48:14,150 You end up with calls to functions 969 00:48:14,150 --> 00:48:15,990 that you may never return from. 970 00:48:15,990 --> 00:48:19,620 This happens all the time. 
971 00:48:19,620 --> 00:48:23,490 And the Cilk RTS leave frame may or may not 972 00:48:23,490 --> 00:48:26,730 return, based entirely on the status 973 00:48:26,730 --> 00:48:29,610 of the deque, what content is currently 974 00:48:29,610 --> 00:48:33,870 sitting on the worker's deque. 975 00:48:33,870 --> 00:48:35,880 Anyone have a guess as to why the leave frame 976 00:48:35,880 --> 00:48:40,560 routine might not return, in the conventional sense? 977 00:48:43,133 --> 00:48:45,300 AUDIENCE: There's nothing else for the worker to do, 978 00:48:45,300 --> 00:48:48,958 so it'll sit there spinning. 979 00:48:48,958 --> 00:48:51,250 TAO SCHARDL: If there's nothing left to do on the deque, 980 00:48:51,250 --> 00:48:53,040 then it's going to-- sorry, say again? 981 00:48:53,040 --> 00:48:57,190 AUDIENCE: It'll just wait until there's work you can steal? 982 00:48:57,190 --> 00:48:59,540 TAO SCHARDL: Right, if there's nothing on the deque, 983 00:48:59,540 --> 00:49:02,540 then it has nowhere to return to. 984 00:49:02,540 --> 00:49:08,003 And so naturally, as we've seen from Cilk workers in the past, 985 00:49:08,003 --> 00:49:10,420 it discovers there's nothing on the deque, there's no work 986 00:49:10,420 --> 00:49:12,520 to do, time to turn to a life of crime, 987 00:49:12,520 --> 00:49:14,680 and try to steal work from someone else. 988 00:49:17,330 --> 00:49:18,880 So there are two possible scenarios. 989 00:49:18,880 --> 00:49:23,350 The pop could succeed and execution continues as normal, 990 00:49:23,350 --> 00:49:26,140 or it fails and it becomes a thief. 991 00:49:26,140 --> 00:49:28,900 Now which of these two cases do you 992 00:49:28,900 --> 00:49:32,091 think is more important for the runtime system to optimize? 993 00:49:40,440 --> 00:49:44,750 Success, case one, exactly, so why is that? 994 00:49:50,074 --> 00:49:52,943 AUDIENCE: [INAUDIBLE] 995 00:49:52,943 --> 00:49:54,610 TAO SCHARDL: At least, we hope so, yeah.
996 00:49:54,610 --> 00:49:58,330 We assume-- this hearkens all the way back to that work first 997 00:49:58,330 --> 00:49:59,440 principle-- 998 00:49:59,440 --> 00:50:01,690 we assume that in the common case, 999 00:50:01,690 --> 00:50:03,520 workers are doing useful work, they're 1000 00:50:03,520 --> 00:50:06,850 not just spending their time stealing from each other. 1001 00:50:06,850 --> 00:50:11,470 And therefore, ideally, we want to assume 1002 00:50:11,470 --> 00:50:15,400 that the worker will do what's normal, 1003 00:50:15,400 --> 00:50:17,970 just an ordinary serial execution. 1004 00:50:17,970 --> 00:50:20,120 In a normal serial execution, there 1005 00:50:20,120 --> 00:50:25,280 is something on the deque, the pop succeeds, that's case one. 1006 00:50:25,280 --> 00:50:28,060 So what we'll see is that the runtime system, in fact, 1007 00:50:28,060 --> 00:50:31,557 does a little bit of optimization on case one. 1008 00:50:31,557 --> 00:50:33,640 Let's talk about something a little more exciting. 1009 00:50:33,640 --> 00:50:35,545 How about stealing computation. 1010 00:50:35,545 --> 00:50:39,060 We like stealing stuff from each other. 1011 00:50:39,060 --> 00:50:41,096 Yes? 1012 00:50:41,096 --> 00:50:53,803 AUDIENCE: [INAUDIBLE] 1013 00:50:53,803 --> 00:50:55,720 TAO SCHARDL: Where does it return the results? 1014 00:50:55,720 --> 00:50:59,770 So where does it return the result in the spawn bar? 1015 00:50:59,770 --> 00:51:05,600 The answer you can kind of see two lines above this. 1016 00:51:05,600 --> 00:51:08,060 So in this case, in the original Cilk code, 1017 00:51:08,060 --> 00:51:11,150 we had X equals Cilk spawn of bar. 1018 00:51:11,150 --> 00:51:15,200 And here, what are the parameters to our spawn bar 1019 00:51:15,200 --> 00:51:15,700 function? 1020 00:51:24,150 --> 00:51:29,760 X and N. Now N is the input to bar, right? 1021 00:51:29,760 --> 00:51:30,570 So what's X? 
1022 00:51:39,300 --> 00:51:45,163 AUDIENCE: [INAUDIBLE] 1023 00:51:45,163 --> 00:51:46,830 TAO SCHARDL: You can rewind a little bit 1024 00:51:46,830 --> 00:51:50,300 and see that you are correct. 1025 00:51:50,300 --> 00:51:51,850 There we go. 1026 00:51:51,850 --> 00:51:56,190 Yeah, so in the original Cilk code, we had X equals Cilk spawn bar. 1027 00:51:56,190 --> 00:51:59,700 That's the same X. All that Cilk does 1028 00:51:59,700 --> 00:52:02,850 is pass a pointer to the memory allocated 1029 00:52:02,850 --> 00:52:07,680 for that variable down to the spawn helper. 1030 00:52:07,680 --> 00:52:11,550 And now the spawn helper, when it calls bar and that returns, 1031 00:52:11,550 --> 00:52:16,530 it gets stored into that storage in the parent stack frame. 1032 00:52:16,530 --> 00:52:18,780 Good catch. 1033 00:52:18,780 --> 00:52:20,190 Good observation. 1034 00:52:20,190 --> 00:52:21,555 Any questions about that? 1035 00:52:21,555 --> 00:52:25,060 Does that make sense? 1036 00:52:25,060 --> 00:52:25,560 Cool. 1037 00:52:30,520 --> 00:52:32,620 Probably used too many animations in these slides. 1038 00:52:36,980 --> 00:52:40,070 All right, now let's talk about stealing. 1039 00:52:40,070 --> 00:52:43,190 How does a worker steal computation? 1040 00:52:43,190 --> 00:52:47,000 Now the conceptual diagram we had before 1041 00:52:47,000 --> 00:52:49,730 saw this one worker, with nothing on its deque, 1042 00:52:49,730 --> 00:52:52,160 take a couple of frames from another worker's deque 1043 00:52:52,160 --> 00:52:55,130 and just slide them on over. 1044 00:52:55,130 --> 00:52:58,590 What does that actually look like in the implementation? 1045 00:52:58,590 --> 00:53:01,940 Well, we're still going to take from the top of the deque, 1046 00:53:01,940 --> 00:53:05,600 but now we have a picture that's a little bit more 1047 00:53:05,600 --> 00:53:09,050 accurate in terms of the structures that are really 1048 00:53:09,050 --> 00:53:10,260 implemented in the system.
1049 00:53:10,260 --> 00:53:13,220 So we have the call stack of the victim, 1050 00:53:13,220 --> 00:53:16,520 and the victim also has a deque data structure and a Cilk 1051 00:53:16,520 --> 00:53:18,860 worker data structure, with head and tail pointers 1052 00:53:18,860 --> 00:53:21,860 and a current stack frame. 1053 00:53:21,860 --> 00:53:25,470 So what happens when a thief comes along out of nowhere? 1054 00:53:25,470 --> 00:53:27,530 It's bored, it has nothing on its deque. 1055 00:53:27,530 --> 00:53:29,720 Head and tail pointers both point to the top. 1056 00:53:29,720 --> 00:53:32,330 Current stack frame has nothing. 1057 00:53:32,330 --> 00:53:34,270 What's the thief going to do? 1058 00:53:34,270 --> 00:53:34,910 Any guesses? 1059 00:53:57,300 --> 00:53:59,490 How does this thief take the content 1060 00:53:59,490 --> 00:54:00,790 from the worker's deque? 1061 00:54:11,190 --> 00:54:14,200 AUDIENCE: The worker sets their current stack frame 1062 00:54:14,200 --> 00:54:22,150 to the one that [INAUDIBLE] 1063 00:54:22,150 --> 00:54:26,410 TAO SCHARDL: Exactly right, yeah. 1064 00:54:26,410 --> 00:54:27,220 Sorry, was that-- 1065 00:54:27,220 --> 00:54:29,540 I didn't mean to interrupt. 1066 00:54:29,540 --> 00:54:30,400 All right, cool. 1067 00:54:30,400 --> 00:54:34,210 So the red highlighting should give a little bit of a hint. 1068 00:54:34,210 --> 00:54:37,780 The current stack frame in the thief 1069 00:54:37,780 --> 00:54:39,790 is going to end up pointing to the stack frame 1070 00:54:39,790 --> 00:54:43,210 at the top of the deque, pointed to by the top of the deque. 1071 00:54:43,210 --> 00:54:47,060 And the head of the deque needs to be updated. 1072 00:54:47,060 --> 00:54:51,220 So let's just see all those pointers shuffle. 1073 00:54:51,220 --> 00:54:54,920 The thief is going to target the head of the deque. 1074 00:54:54,920 --> 00:54:59,862 It's going to dequeue that item from the top of the deque.
1075 00:54:59,862 --> 00:55:01,570 It's going to set the current stack frame 1076 00:55:01,570 --> 00:55:05,680 to point to that item, and it will delete the pointer 1077 00:55:05,680 --> 00:55:08,530 on the deque. 1078 00:55:08,530 --> 00:55:11,160 That make sense? 1079 00:55:11,160 --> 00:55:12,300 Cool. 1080 00:55:12,300 --> 00:55:17,640 Now the victim and the thief are on different processors, 1081 00:55:17,640 --> 00:55:20,310 and this scenario involves shuffling a lot of pointers 1082 00:55:20,310 --> 00:55:21,620 around. 1083 00:55:21,620 --> 00:55:25,050 So if we think about this process, 1084 00:55:25,050 --> 00:55:27,240 there needs to be some way to handle 1085 00:55:27,240 --> 00:55:30,188 the concurrent accesses that are going to occur 1086 00:55:30,188 --> 00:55:31,230 on the head of the deque. 1087 00:55:33,993 --> 00:55:35,660 You haven't talked about synchronization 1088 00:55:35,660 --> 00:55:38,160 yet in this class, that's going to be a couple lectures down 1089 00:55:38,160 --> 00:55:39,733 the road. 1090 00:55:39,733 --> 00:55:41,150 I'll give you a couple of spoilers 1091 00:55:41,150 --> 00:55:42,980 for those synchronization lectures. 1092 00:55:42,980 --> 00:55:45,650 First off, synchronization is expensive. 1093 00:55:45,650 --> 00:55:48,290 And second, reasoning about synchronization 1094 00:55:48,290 --> 00:55:52,598 is a source of massive headaches. 1095 00:55:52,598 --> 00:55:54,640 Congratulations, you now know those two lectures. 1096 00:55:54,640 --> 00:55:55,515 No, I'm just kidding. 1097 00:55:55,515 --> 00:55:58,820 Go to the lectures, you'll learn a lot, they're great. 1098 00:55:58,820 --> 00:56:02,540 In the Cilk runtime system, the way 1099 00:56:02,540 --> 00:56:07,820 that those concurrent accesses are handled 1100 00:56:07,820 --> 00:56:11,930 is by using a protocol known as the THE protocol. 1101 00:56:11,930 --> 00:56:17,570 This is pseudo code for most of the logic in the THE protocol. 
1102 00:56:17,570 --> 00:56:20,630 There's a protocol that the worker, executing work 1103 00:56:20,630 --> 00:56:21,910 normally, follows. 1104 00:56:21,910 --> 00:56:23,905 And there is the protocol for the thief. 1105 00:56:23,905 --> 00:56:26,030 I'm not going to walk through all the lines of code 1106 00:56:26,030 --> 00:56:28,490 here and describe what they do. 1107 00:56:28,490 --> 00:56:32,390 I'll just give you the very high level view of this protocol. 1108 00:56:32,390 --> 00:56:34,610 From the thief's perspective, the thief 1109 00:56:34,610 --> 00:56:38,660 always grabs a lock on the deque before doing any operations 1110 00:56:38,660 --> 00:56:40,340 on the deque. 1111 00:56:40,340 --> 00:56:43,430 Always acquire the lock first. 1112 00:56:43,430 --> 00:56:48,160 For the worker, it's a little bit more optimized. 1113 00:56:48,160 --> 00:56:51,460 So what the worker will do is optimistically try 1114 00:56:51,460 --> 00:56:55,120 to pop something from the bottom of the deque. 1115 00:56:55,120 --> 00:56:58,720 And only if it looks like that pop operation fails 1116 00:56:58,720 --> 00:57:01,120 does the worker do something more complicated. 1117 00:57:01,120 --> 00:57:04,490 Only then does it try to acquire a lock on the deque, 1118 00:57:04,490 --> 00:57:08,350 then try to pop something off, see if it really 1119 00:57:08,350 --> 00:57:13,810 succeeds or fails, and possibly turn to a life of crime. 1120 00:57:13,810 --> 00:57:15,860 So the worker's protocol looks longer, 1121 00:57:15,860 --> 00:57:19,930 but that's just because the worker implements 1122 00:57:19,930 --> 00:57:24,880 a special case, which is optimized for the common case. 1123 00:57:24,880 --> 00:57:28,420 This is essentially where the leave frame routine, 1124 00:57:28,420 --> 00:57:33,010 that we saw before, is optimized for case one, optimized 1125 00:57:33,010 --> 00:57:36,730 for the pop from the deque succeeding. 1126 00:57:36,730 --> 00:57:39,390 Any questions about that? 
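The slide's pseudocode is not reproduced in this transcript, so the following is only a sketch of the high-level description above, written under several assumptions: the names (`the_deque`, `the_pop`, `the_steal`) are hypothetical, a simple spinlock stands in for whatever lock the runtime uses, and all atomics use the default sequentially consistent ordering for simplicity. The thief always takes the lock; the worker decrements the tail optimistically and takes the lock only if that pop appears to fail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define THE_DEQUE_CAP 64

/* Hypothetical deque: thieves take from head, the owner works at tail. */
typedef struct {
    atomic_flag lock;
    atomic_int head;
    atomic_int tail;
    void *frames[THE_DEQUE_CAP];
} the_deque;

static void the_init(the_deque *d) {
    atomic_flag_clear(&d->lock);
    atomic_init(&d->head, 0);
    atomic_init(&d->tail, 0);
}

static void the_lock(the_deque *d)   { while (atomic_flag_test_and_set(&d->lock)) { } }
static void the_unlock(the_deque *d) { atomic_flag_clear(&d->lock); }

/* Owner only: push a frame at the tail (what detach does). */
static void the_push(the_deque *d, void *frame) {
    int t = atomic_load(&d->tail);
    assert(t < THE_DEQUE_CAP);
    d->frames[t] = frame;
    atomic_store(&d->tail, t + 1);
}

/* Owner only: returns true if the pop succeeds (case one, the common case). */
static bool the_pop(the_deque *d) {
    int t = atomic_fetch_sub(&d->tail, 1) - 1;   /* optimistic pop, no lock */
    if (atomic_load(&d->head) > t) {             /* looks like a thief got there first */
        atomic_fetch_add(&d->tail, 1);           /* undo the optimistic pop */
        the_lock(d);                             /* slow path: retry under the lock */
        t = atomic_fetch_sub(&d->tail, 1) - 1;
        if (atomic_load(&d->head) > t) {         /* the deque really is empty */
            atomic_fetch_add(&d->tail, 1);
            the_unlock(d);
            return false;                        /* caller would now become a thief */
        }
        the_unlock(d);
    }
    return true;
}

/* Thief: always locks first; returns the stolen frame or NULL. */
static void *the_steal(the_deque *d) {
    the_lock(d);
    int h = atomic_fetch_add(&d->head, 1);       /* claim the head entry */
    if (h + 1 > atomic_load(&d->tail)) {         /* nothing there after all */
        atomic_fetch_sub(&d->head, 1);
        the_unlock(d);
        return NULL;
    }
    void *frame = d->frames[h];
    the_unlock(d);
    return frame;
}
```

The owner's fast path costs one atomic decrement and one load, which matches the goal of keeping the successful pop cheap and pushing the locking cost onto thieves and onto the rare failed pop.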
1127 00:57:39,390 --> 00:57:43,775 Seem clear from 30,000 feet? 1128 00:57:43,775 --> 00:57:46,190 Cool. 1129 00:57:46,190 --> 00:57:49,280 OK, so that's how a worker steals work 1130 00:57:49,280 --> 00:57:53,510 from the top of the victim's deque. 1131 00:57:53,510 --> 00:57:56,330 Now, that thief needs to resume a continuation. 1132 00:57:56,330 --> 00:57:59,900 And this is that whole process about jumping into the middle 1133 00:57:59,900 --> 00:58:01,550 of an executing function. 1134 00:58:01,550 --> 00:58:03,470 It already has a frame, it already 1135 00:58:03,470 --> 00:58:05,630 has a [INAUDIBLE] state going on, 1136 00:58:05,630 --> 00:58:09,340 and all that was established by a different processor. 1137 00:58:09,340 --> 00:58:13,220 So somehow that thief has to magically come up 1138 00:58:13,220 --> 00:58:16,880 with the right state and start executing that function. 1139 00:58:16,880 --> 00:58:18,780 How does that happen? 1140 00:58:18,780 --> 00:58:21,200 Well, this has to do with a routine that's 1141 00:58:21,200 --> 00:58:24,920 the complement of the set jump routine we saw before. 1142 00:58:24,920 --> 00:58:28,580 The complement of set jump is what's called long jump. 1143 00:58:28,580 --> 00:58:30,902 So Cilk uses, in particular Cilk thieves, 1144 00:58:30,902 --> 00:58:32,360 use the long jump function in order 1145 00:58:32,360 --> 00:58:34,550 to resume a stolen continuation. 1146 00:58:34,550 --> 00:58:36,830 Previously, in our spawning function foo, 1147 00:58:36,830 --> 00:58:39,970 we had this set jump call. 1148 00:58:39,970 --> 00:58:44,390 And that set jump saved some state to a local buffer, 1149 00:58:44,390 --> 00:58:49,160 in particular the buffer in the stack frame of foo. 1150 00:58:49,160 --> 00:58:53,420 Now the thief has just created this Cilk worker structure, 1151 00:58:53,420 --> 00:58:56,540 where the current stack frame is pointing 1152 00:58:56,540 --> 00:58:59,720 at the stack frame of foo. 
1153 00:58:59,720 --> 00:59:02,850 And so what the thief will do is it'll execute a call, 1154 00:59:02,850 --> 00:59:07,970 it'll execute the statement, it will execute the long jump 1155 00:59:07,970 --> 00:59:11,920 function, passing that particular stack frame's buffer 1156 00:59:11,920 --> 00:59:15,190 and an additional argument, and that long jump 1157 00:59:15,190 --> 00:59:18,010 will take the register state stored in the buffer, 1158 00:59:18,010 --> 00:59:20,840 put that register state into the worker, 1159 00:59:20,840 --> 00:59:24,190 and then let the worker proceed. 1160 00:59:24,190 --> 00:59:25,400 That make sense? 1161 00:59:25,400 --> 00:59:26,475 Any questions about that? 1162 00:59:31,030 --> 00:59:34,660 This is kind of a wacky routine because, if you remember, 1163 00:59:34,660 --> 00:59:37,840 one of the registers stored in that buffer 1164 00:59:37,840 --> 00:59:39,970 is an instruction pointer. 1165 00:59:39,970 --> 00:59:42,627 And so it's going to read the instruction pointer out 1166 00:59:42,627 --> 00:59:43,210 of the buffer. 1167 00:59:43,210 --> 00:59:45,585 It's also going to read a bunch of callee-saved registers 1168 00:59:45,585 --> 00:59:47,980 and stack pointers out of the buffer. 1169 00:59:47,980 --> 00:59:51,760 And it is going to say, that's my register state now, 1170 00:59:51,760 --> 00:59:53,170 that's what the thief says. 1171 00:59:53,170 --> 00:59:55,350 It just stole that register state. 1172 00:59:55,350 --> 01:00:01,030 And it's going to set its RIP to be the RIP it just read. 1173 01:00:01,030 --> 01:00:07,375 So what does that mean for where the long jump routine returns? 1174 01:00:16,452 --> 01:00:18,160 AUDIENCE: It returns into the stack frame 1175 01:00:18,160 --> 01:00:21,790 above the [INAUDIBLE] 1176 01:00:21,790 --> 01:00:23,290 TAO SCHARDL: Returns into the stack frame 1177 01:00:23,290 --> 01:00:25,690 above the one it just stole.
1178 01:00:25,690 --> 01:00:29,020 More or less, but more specifically, 1179 01:00:29,020 --> 01:00:32,222 where in that function does it return? 1180 01:00:32,222 --> 01:00:33,870 AUDIENCE: Just after the call. 1181 01:00:33,870 --> 01:00:35,396 TAO SCHARDL: Which call? 1182 01:00:35,396 --> 01:00:37,690 AUDIENCE: [INAUDIBLE] 1183 01:00:37,690 --> 01:00:43,870 TAO SCHARDL: To the spawn_bar, here? 1184 01:00:43,870 --> 01:00:50,000 Almost, very, very close, very, very close. 1185 01:00:50,000 --> 01:00:52,840 What ends up happening is that the long jump effectively 1186 01:00:52,840 --> 01:00:55,375 returns from the set jump a second time. 1187 01:00:57,980 --> 01:01:02,260 This is the weird protocol between set jump and long jump. 1188 01:01:02,260 --> 01:01:05,320 Set jump, you pass it a buffer, it saves the register state, 1189 01:01:05,320 --> 01:01:06,370 and then it returns. 1190 01:01:06,370 --> 01:01:09,220 And it returns immediately, and on its direct invocation, 1191 01:01:09,220 --> 01:01:12,280 that set jump call returns the value zero, 1192 01:01:12,280 --> 01:01:14,380 as we mentioned before. 1193 01:01:14,380 --> 01:01:19,270 Now if you invoke a long jump using the same buffer, 1194 01:01:19,270 --> 01:01:23,950 that causes the processor to effectively return 1195 01:01:23,950 --> 01:01:26,800 from the same set jump call. 1196 01:01:26,800 --> 01:01:29,320 They use the same buffer. 1197 01:01:29,320 --> 01:01:31,570 But now it's going to return with a different value, 1198 01:01:31,570 --> 01:01:33,700 and it's going to return with the value specified 1199 01:01:33,700 --> 01:01:35,200 in the second argument. 1200 01:01:35,200 --> 01:01:38,290 So invoking long jump of buffer X returns 1201 01:01:38,290 --> 01:01:40,900 from that set jump with the value 1202 01:01:40,900 --> 01:01:47,320 X.
So when the thief executes a long jump 1203 01:01:47,320 --> 01:01:51,520 with the appropriate buffer, and the second argument is one, 1204 01:01:51,520 --> 01:01:53,097 what happens? 1205 01:01:53,097 --> 01:01:54,430 Can anyone walk me through this? 1206 01:01:54,430 --> 01:01:56,250 Oh, it's on the slide, OK. 1207 01:01:59,770 --> 01:02:03,850 So now that set jump effectively returns a second time, 1208 01:02:03,850 --> 01:02:07,430 but now it returns with a value one. 1209 01:02:07,430 --> 01:02:09,760 And now the predicate gets evaluated. 1210 01:02:09,760 --> 01:02:14,200 So if not one, which would be if false, 1211 01:02:14,200 --> 01:02:17,380 well don't do the consequent, because the predicate 1212 01:02:17,380 --> 01:02:18,880 was false. 1213 01:02:18,880 --> 01:02:21,930 And that means it's going to skip the call to spawn_bar, 1214 01:02:21,930 --> 01:02:25,360 and it'll just fall through and execute the stuff right 1215 01:02:25,360 --> 01:02:29,950 after that conditional, which happens to be 1216 01:02:29,950 --> 01:02:33,670 the continuation of the spawn. 1217 01:02:33,670 --> 01:02:36,100 That's kind of neat. 1218 01:02:36,100 --> 01:02:38,024 I think that's kind of neat, being unbiased. 1219 01:02:38,024 --> 01:02:39,607 Anyone else think that's kind of neat? 1220 01:02:43,270 --> 01:02:44,230 Excellent. 1221 01:02:44,230 --> 01:02:46,980 Anyone desperately confused about this set jump, long jump 1222 01:02:46,980 --> 01:02:47,480 nonsense? 1223 01:02:52,650 --> 01:02:55,170 Any questions you want to ask, or just 1224 01:02:55,170 --> 01:02:57,420 generally confused about why these things 1225 01:02:57,420 --> 01:02:58,950 exist in modern computing? 1226 01:03:02,310 --> 01:03:02,964 Yeah. 1227 01:03:02,964 --> 01:03:04,386 AUDIENCE: Is there any reason you couldn't just 1228 01:03:04,386 --> 01:03:06,344 add, like, [INAUDIBLE] to the instruction pointer 1229 01:03:06,344 --> 01:03:09,137 and jump over the call, instead?
1230 01:03:09,137 --> 01:03:11,220 TAO SCHARDL: Is there any reason you couldn't just 1231 01:03:11,220 --> 01:03:14,190 add some fixed offset to the instruction pointer 1232 01:03:14,190 --> 01:03:16,500 to jump over the call? 1233 01:03:16,500 --> 01:03:20,070 In principle, I think, if you can statically 1234 01:03:20,070 --> 01:03:22,800 compute the distance you need to jump, 1235 01:03:22,800 --> 01:03:26,550 then you can just add that to RIP and let the long jump 1236 01:03:26,550 --> 01:03:28,200 do its thing. 1237 01:03:28,200 --> 01:03:31,630 Or rather, the thief will just adopt that RIP 1238 01:03:31,630 --> 01:03:32,880 and end up in the right place. 1239 01:03:37,150 --> 01:03:40,870 What's done here is-- 1240 01:03:40,870 --> 01:03:43,750 basically, this was the protocol that the existing set 1241 01:03:43,750 --> 01:03:46,150 jump and long jump routines implement. 1242 01:03:46,150 --> 01:03:52,120 And I imagine it's a bit more flexible of a protocol 1243 01:03:52,120 --> 01:03:55,890 than what you strictly need for the Cilk runtime. 1244 01:03:55,890 --> 01:03:58,183 And so, you know, it ends up working out. 1245 01:03:58,183 --> 01:04:00,100 But if you can statically compute that offset, 1246 01:04:00,100 --> 01:04:01,892 there's no reason in principle you couldn't 1247 01:04:01,892 --> 01:04:03,950 adopt a different approach. 1248 01:04:03,950 --> 01:04:05,742 So, good observation. 1249 01:04:08,940 --> 01:04:09,970 Any questions? 1250 01:04:09,970 --> 01:04:12,113 Any other questions? 1251 01:04:12,113 --> 01:04:13,530 It's fine to be generally confused 1252 01:04:13,530 --> 01:04:15,390 why there are routines, set jump and long jump, 1253 01:04:15,390 --> 01:04:17,220 with this wacky behavior. 1254 01:04:17,220 --> 01:04:21,090 Compiler writers have that reaction all the time. 1255 01:04:21,090 --> 01:04:24,750 These are a nightmare to compile.
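The set jump and long jump behavior discussed here corresponds to the standard C `setjmp` and `longjmp` from `<setjmp.h>`, and the "return twice" effect can be seen in a small single-threaded sketch. The function and variable names below are hypothetical. One important simplification: in Cilk the long jump is performed by a thief on another worker, whereas here the helper itself performs it, because jumping between threads is outside what standard C guarantees.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf foo_buf;          /* stands in for the buffer kept in foo's stack frame */
static int helper_calls;         /* how many times the spawn helper was entered */
static int continuation_runs;    /* how many times the continuation was executed */

static void spawn_helper_sketch(void) {
    helper_calls++;
    /* Play the thief: make setjmp in foo_sketch "return" a second time,
     * this time with the value 1 given as the second argument. */
    longjmp(foo_buf, 1);
}

static void foo_sketch(void) {
    if (!setjmp(foo_buf)) {      /* direct invocation returns 0: take the branch */
        spawn_helper_sketch();   /* does not return normally in this sketch */
    }
    /* Continuation of the spawn. We arrive here because setjmp returned 1,
     * so the condition was false and the helper call was skipped. */
    assert(helper_calls == 1);
    continuation_runs++;
}
```

Calling `foo_sketch()` once enters the helper once and runs the continuation once, even though the helper never executes a normal return.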
1256 01:04:24,750 --> 01:04:30,990 Anyway, OK, so we've seen how a thief can take some computation 1257 01:04:30,990 --> 01:04:33,210 off of a victim's deque, and we've 1258 01:04:33,210 --> 01:04:36,570 seen how the thief can jump right 1259 01:04:36,570 --> 01:04:38,460 into the middle of an executing function 1260 01:04:38,460 --> 01:04:41,242 with the appropriate register state. 1261 01:04:41,242 --> 01:04:42,450 Is this the end of the story? 1262 01:04:42,450 --> 01:04:44,460 Is there anything else we need to talk about, 1263 01:04:44,460 --> 01:04:47,280 with respect to stealing? 1264 01:04:47,280 --> 01:04:50,078 Or, more pointedly, what else do we not need to talk about 1265 01:04:50,078 --> 01:04:51,120 with respect to stealing? 1266 01:05:02,020 --> 01:05:04,760 You're welcome to answer, if you like. 1267 01:05:04,760 --> 01:05:05,260 OK. 1268 01:05:08,092 --> 01:05:09,550 Hey, remember that list of concerns 1269 01:05:09,550 --> 01:05:13,180 we had at the beginning? 1270 01:05:13,180 --> 01:05:16,162 List of requirements is what it was called. 1271 01:05:21,960 --> 01:05:25,260 We will talk about syncs, but not just yet. 1272 01:05:28,230 --> 01:05:31,080 What other thing was brought up? 1273 01:05:31,080 --> 01:05:33,060 Remember this slide from a previous lecture? 1274 01:05:35,797 --> 01:05:36,630 Here's another hint. 1275 01:05:36,630 --> 01:05:39,090 So the register state is certainly 1276 01:05:39,090 --> 01:05:41,520 part of the state of an executing function. 1277 01:05:41,520 --> 01:05:44,930 What else defines the state of an executing function? 1278 01:05:44,930 --> 01:05:48,073 Where does the other state of the function live? 1279 01:05:52,710 --> 01:05:55,280 It lives on the stack, so what is there to talk 1280 01:05:55,280 --> 01:05:56,890 about regarding the stack? 1281 01:06:00,890 --> 01:06:02,321 AUDIENCE: Cactus stack. 1282 01:06:02,321 --> 01:06:05,800 TAO SCHARDL: The cactus stack, exactly.
1283 01:06:05,800 --> 01:06:08,380 So you mentioned before that thieves 1284 01:06:08,380 --> 01:06:11,600 need to implement this cactus stack abstraction 1285 01:06:11,600 --> 01:06:13,840 for the Cilk runtime system. 1286 01:06:16,840 --> 01:06:19,510 Why exactly do we need this cactus stack? 1287 01:06:19,510 --> 01:06:24,380 What's wrong with just having the thief use the victim's 1288 01:06:24,380 --> 01:06:24,880 stack? 1289 01:06:32,640 --> 01:06:40,422 AUDIENCE: [INAUDIBLE] 1290 01:06:40,422 --> 01:06:42,880 TAO SCHARDL: The victim might just free up a bunch of stuff 1291 01:06:42,880 --> 01:06:45,680 and then it's no longer accessible. 1292 01:06:45,680 --> 01:06:49,960 So it can free some amount of stuff, in particular everything 1293 01:06:49,960 --> 01:06:53,860 up to the function foo, but in fact 1294 01:06:53,860 --> 01:06:55,900 it can't return from the function foo 1295 01:06:55,900 --> 01:06:57,430 because some other-- 1296 01:06:57,430 --> 01:07:01,060 well, assuming that the Cilk RTS leave frame thing 1297 01:07:01,060 --> 01:07:02,628 is implemented-- 1298 01:07:02,628 --> 01:07:04,420 the function foo is no longer in the stack, 1299 01:07:04,420 --> 01:07:06,490 it won't ever reach it. 1300 01:07:06,490 --> 01:07:09,430 So it won't return from the function foo 1301 01:07:09,430 --> 01:07:13,180 while another worker is working on it. 1302 01:07:13,180 --> 01:07:14,890 But good observation. 1303 01:07:14,890 --> 01:07:17,440 There is something else that can go wrong 1304 01:07:17,440 --> 01:07:20,895 if the thief just directly uses the victim's stack. 1305 01:07:30,880 --> 01:07:33,130 Well, let's take a hint from the slide we have so far. 1306 01:07:33,130 --> 01:07:35,010 So the example that's going to be shown 1307 01:07:35,010 --> 01:07:38,660 is that the thief steals the continuation of foo, 1308 01:07:38,660 --> 01:07:40,785 and then the thief is going to call a function baz. 
1309 01:07:44,180 --> 01:07:46,910 So the thief is using the victim's stack, 1310 01:07:46,910 --> 01:07:48,860 and then it calls a function baz. 1311 01:07:48,860 --> 01:07:49,790 What goes wrong? 1312 01:07:57,020 --> 01:07:58,920 AUDIENCE: The victim has called something, 1313 01:07:58,920 --> 01:08:02,430 but underneath, there is some other function 1314 01:08:02,430 --> 01:08:05,790 stack [INAUDIBLE] 1315 01:08:05,790 --> 01:08:06,960 TAO SCHARDL: Exactly. 1316 01:08:06,960 --> 01:08:10,110 The victim in this picture, for example, 1317 01:08:10,110 --> 01:08:13,680 has some other functions on its stack below foo. 1318 01:08:13,680 --> 01:08:17,729 So if the thief does any function calls and is using 1319 01:08:17,729 --> 01:08:21,660 the same stack, it's going to scribble all over the state 1320 01:08:21,660 --> 01:08:24,000 of, in this case spawn bar, and bar, 1321 01:08:24,000 --> 01:08:27,609 which the victim is trying to use and maintain. 1322 01:08:27,609 --> 01:08:31,160 So the thief will end up corrupting the victim's stack. 1323 01:08:31,160 --> 01:08:33,660 And if you think about it, it's also possible for the victim 1324 01:08:33,660 --> 01:08:35,010 to corrupt the thief's stack. 1325 01:08:35,010 --> 01:08:37,950 They can't share a stack, but they 1326 01:08:37,950 --> 01:08:42,149 do want to share some amount of data on the stack. 1327 01:08:42,149 --> 01:08:44,520 They do both care about the state of foo, 1328 01:08:44,520 --> 01:08:48,310 and that needs to be consistent across all the workers. 1329 01:08:48,310 --> 01:08:53,370 But we at least need a separate call stack for the thief. 1330 01:08:53,370 --> 01:08:55,500 We'd rather not do unnecessary work 1331 01:08:55,500 --> 01:08:59,399 in order to initialize this call stack, however. 
1332 01:08:59,399 --> 01:09:03,660 We really need this call stack for things that the thief might 1333 01:09:03,660 --> 01:09:07,439 invoke, local variables the thief might need, 1334 01:09:07,439 --> 01:09:10,970 or functions that the thief might call or spawn. 1335 01:09:10,970 --> 01:09:15,000 OK, so how do we implement the cactus stack? 1336 01:09:15,000 --> 01:09:17,680 We have a victim stack, we have a thief stack, 1337 01:09:17,680 --> 01:09:22,500 and we have a pretty cute trick, in my opinion. 1338 01:09:22,500 --> 01:09:25,160 So the thief steals the continuation. 1339 01:09:25,160 --> 01:09:29,100 It's going to do a little bit of magic with its stack pointers. 1340 01:09:29,100 --> 01:09:31,229 What it's going to do is it's going 1341 01:09:31,229 --> 01:09:34,470 to use the RBP it was given, which points at the victim's 1342 01:09:34,470 --> 01:09:37,800 stack, and it's going to set the stack pointer 1343 01:09:37,800 --> 01:09:40,260 to point at its own stack. 1344 01:09:40,260 --> 01:09:44,670 So RBP is over there, and RSP, for the thief, 1345 01:09:44,670 --> 01:09:48,600 is pointing to the beginning of the thief's call stack. 1346 01:09:48,600 --> 01:09:50,850 And that is basically fine. 1347 01:09:50,850 --> 01:09:54,570 The thief can access all the state in the function foo, 1348 01:09:54,570 --> 01:09:57,570 as offsets from RBP, but if the thief 1349 01:09:57,570 --> 01:10:00,060 needs to do any function calls, we 1350 01:10:00,060 --> 01:10:02,760 have a calling convention that involves 1351 01:10:02,760 --> 01:10:08,830 saving RBP and updating RSP in order to execute the call. 
1352 01:10:08,830 --> 01:10:12,060 So in particular, the thief calls the function baz, 1353 01:10:12,060 --> 01:10:16,260 it saves its current value of RBP onto its own stack, 1354 01:10:16,260 --> 01:10:20,040 it advances RSP, it says RBP equals RSP, 1355 01:10:20,040 --> 01:10:22,740 it pushes the stack frame for baz onto the stack, 1356 01:10:22,740 --> 01:10:25,460 and it advances RSP a little bit further. 1357 01:10:25,460 --> 01:10:31,020 And just like that, the thief is churning away on its own stack. 1358 01:10:31,020 --> 01:10:33,970 So just with this magic of RBP pointing there and RSP 1359 01:10:33,970 --> 01:10:39,575 pointing here, we got our cactus stack. 1360 01:10:39,575 --> 01:10:40,450 Everyone follow that? 1361 01:10:47,100 --> 01:10:49,780 Anyone desperately confused by this stack pointer? 1362 01:10:54,430 --> 01:10:58,320 Who thinks this is kind of a neat trick? 1363 01:10:58,320 --> 01:11:00,657 All right, cool. 1364 01:11:00,657 --> 01:11:02,490 Anyone think this is a really mundane trick? 1365 01:11:02,490 --> 01:11:05,810 Hopefully no one thinks it's a mundane trick. 1366 01:11:05,810 --> 01:11:10,080 OK, there's like half a hand there, that's fine. 1367 01:11:10,080 --> 01:11:12,450 I think this is a neat trick, just messing around 1368 01:11:12,450 --> 01:11:13,530 with the stack pointers. 1369 01:11:13,530 --> 01:11:17,340 Are there any worries about using RBP and RSP this way? 1370 01:11:17,340 --> 01:11:24,210 Any concerns that you might think of from using these two 1371 01:11:24,210 --> 01:11:28,020 stack pointers as described? 1372 01:11:28,020 --> 01:11:31,470 In a past lecture, briefly mentioned 1373 01:11:31,470 --> 01:11:35,790 was a compiler optimization for dealing with stacks. 1374 01:11:35,790 --> 01:11:36,455 Yeah. 
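[Editor's sketch] The RBP/RSP trick described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the Cilk runtime's code: memory is a plain dictionary, the addresses 0x1000 and 0x2000 are arbitrary, each frame holds one local, and the return address that a real x86-64 call pushes is left out. The names `Worker`, `push`, and `call` are hypothetical.

```python
# Toy model of the RBP/RSP trick. Memory is a dict from address to value,
# and stacks grow toward lower addresses, as on x86-64.
# Simplification: a real call also pushes a return address; this model does not.
memory = {}

class Worker:
    def __init__(self, stack_top):
        self.rbp = stack_top  # base (frame) pointer
        self.rsp = stack_top  # stack pointer

def push(worker, value):
    worker.rsp -= 8
    memory[worker.rsp] = value

def call(worker, local_value):
    """Model a call prologue: save RBP, set RBP = RSP, push one local."""
    push(worker, worker.rbp)   # saved RBP goes on whatever stack RSP points into
    worker.rbp = worker.rsp
    push(worker, local_value)

# The victim runs foo, then bar, on its own stack.
victim = Worker(stack_top=0x1000)
call(victim, "foo local")
foo_rbp = victim.rbp
call(victim, "bar local")
bar_local_addr = victim.rsp

# The thief resumes foo's continuation: RBP points into the victim's stack,
# RSP points at the start of the thief's own stack.
thief = Worker(stack_top=0x2000)
thief.rbp = foo_rbp

foo_local_seen_by_thief = memory[thief.rbp - 8]  # foo's state, read as an RBP offset
call(thief, "baz local")                         # baz's frame lands on the thief's stack

print(foo_local_seen_by_thief)       # foo's local, read through RBP
print(memory[bar_local_addr])        # bar's frame on the victim's stack is untouched
print(memory[thief.rbp] == foo_rbp)  # baz's saved RBP links back to foo's frame
```

In this model the saved RBP in baz's frame is the link from the thief's stack back to foo's frame on the victim's stack, which is the branch point of the cactus stack.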
1375 01:11:36,455 --> 01:11:45,152 AUDIENCE: [INAUDIBLE] We were offsetting [INAUDIBLE] 1376 01:11:45,152 --> 01:11:47,360 TAO SCHARDL: Right, there was a compiler optimization 1377 01:11:47,360 --> 01:11:51,290 that said, in certain cases you don't need both the base 1378 01:11:51,290 --> 01:11:52,820 pointer and the stack pointer. 1379 01:11:52,820 --> 01:11:54,337 You can do all offsets. 1380 01:11:54,337 --> 01:11:56,170 I think it's actually off the stack pointer, 1381 01:11:56,170 --> 01:11:57,545 and then the base pointer becomes 1382 01:11:57,545 --> 01:11:59,990 an additional general purpose register. 1383 01:11:59,990 --> 01:12:02,660 That optimization clearly does not 1384 01:12:02,660 --> 01:12:05,960 work if you need both the base pointer and the stack pointer 1385 01:12:05,960 --> 01:12:08,510 to do this wacky trick. 1386 01:12:11,150 --> 01:12:15,950 The answer is that the Cilk compiler specifically 1387 01:12:15,950 --> 01:12:18,170 says, if this function has a continuation that 1388 01:12:18,170 --> 01:12:21,530 could be stolen, don't do that optimization. 1389 01:12:21,530 --> 01:12:26,665 It's super illegal, it's very bad, don't do the optimization. 1390 01:12:26,665 --> 01:12:28,040 So that ends up being the answer. 1391 01:12:28,040 --> 01:12:30,190 And it costs us a general purpose register 1392 01:12:30,190 --> 01:12:32,450 for Cilk functions, not the biggest loss 1393 01:12:32,450 --> 01:12:35,983 in the world, all right. 1394 01:12:35,983 --> 01:12:37,400 There's a little bit of time left, 1395 01:12:37,400 --> 01:12:41,897 so we can talk about synchronizing computation. 1396 01:12:41,897 --> 01:12:43,480 I'll give you a brief version of this. 1397 01:12:43,480 --> 01:12:46,300 This part gets fairly complicated, 1398 01:12:46,300 --> 01:12:48,820 and so I'll give you a high level summary 1399 01:12:48,820 --> 01:12:51,560 of how all of this works. 
1400 01:12:51,560 --> 01:12:54,920 So just to page back in some context, 1401 01:12:54,920 --> 01:12:57,520 we have this scenario where different processors are 1402 01:12:57,520 --> 01:13:01,150 executing different parts of our computation dag, 1403 01:13:01,150 --> 01:13:04,300 and one processor might encounter a Cilk sync statement 1404 01:13:04,300 --> 01:13:07,600 that it can't execute because some other processor is busy 1405 01:13:07,600 --> 01:13:11,320 executing a spawned subcomputation. 1406 01:13:11,320 --> 01:13:14,500 Now, in this case, P3 is waiting on P1 1407 01:13:14,500 --> 01:13:18,430 to finish its execution before the sync can proceed. 1408 01:13:18,430 --> 01:13:20,770 And synchronization needs to happen, really, 1409 01:13:20,770 --> 01:13:24,040 only on the subcomputation that P1 is executing. 1410 01:13:24,040 --> 01:13:26,380 P2 shouldn't play a role in this. 1411 01:13:29,420 --> 01:13:31,835 So what exactly happens when a worker reaches a Cilk 1412 01:13:31,835 --> 01:13:34,810 sync before all the spawned subcomputations return? 1413 01:13:34,810 --> 01:13:37,750 Well, we'd like the worker to become a thief. 1414 01:13:37,750 --> 01:13:39,550 We'd rather the worker not just sit there 1415 01:13:39,550 --> 01:13:43,030 and wait until all the spawned subcomputations return. 1416 01:13:43,030 --> 01:13:46,920 That's a waste of a perfectly good worker. 1417 01:13:46,920 --> 01:13:49,900 But we also can't let the worker's current function 1418 01:13:49,900 --> 01:13:51,390 frame disappear. 1419 01:13:51,390 --> 01:13:53,140 There is a spawned subcomputation 1420 01:13:53,140 --> 01:13:54,460 that's using that frame. 1421 01:13:54,460 --> 01:13:56,110 That frame is its parent. 1422 01:13:56,110 --> 01:13:57,850 It may be accessing state in that frame, 1423 01:13:57,850 --> 01:14:00,220 it may be trying to save a return value 1424 01:14:00,220 --> 01:14:03,280 to some location in that frame. 
1425 01:14:03,280 --> 01:14:06,730 And so the frame has to persist, even 1426 01:14:06,730 --> 01:14:09,100 if the worker that's working on the frame 1427 01:14:09,100 --> 01:14:11,980 goes off and becomes a thief. 1428 01:14:11,980 --> 01:14:15,210 Moreover, in the future, that subcomputation, we believe, 1429 01:14:15,210 --> 01:14:17,560 should return. 1430 01:14:17,560 --> 01:14:20,350 And that worker must resume the frame 1431 01:14:20,350 --> 01:14:24,910 and actually execute past the Cilk sync. 1432 01:14:24,910 --> 01:14:26,650 Finally, the Cilk sync should only 1433 01:14:26,650 --> 01:14:28,810 apply to the nested subcomputations 1434 01:14:28,810 --> 01:14:31,470 underneath its function, not the program in general. 1435 01:14:31,470 --> 01:14:36,460 And so we don't allow ourselves synchronization, just among all 1436 01:14:36,460 --> 01:14:38,120 the workers, wholesale. 1437 01:14:38,120 --> 01:14:40,120 We don't say, OK, we've hit a sync, 1438 01:14:40,120 --> 01:14:42,910 every worker in the system must reach 1439 01:14:42,910 --> 01:14:44,410 some point in the execution. 1440 01:14:44,410 --> 01:14:49,930 We only care about this nested synchronization. 1441 01:14:49,930 --> 01:14:51,430 So if we think about this, and we're 1442 01:14:51,430 --> 01:14:53,410 talking about nested synchronization 1443 01:14:53,410 --> 01:14:56,500 for computations under a function, 1444 01:14:56,500 --> 01:14:58,300 we have this notion of cactus stack, 1445 01:14:58,300 --> 01:15:03,280 we have this notion of a tree of function invocations. 1446 01:15:03,280 --> 01:15:05,660 We may immediately start to think about, 1447 01:15:05,660 --> 01:15:09,130 well, what if we just maintain some state, in a tree, 1448 01:15:09,130 --> 01:15:12,250 to keep track of who needs to synchronize with whom, 1449 01:15:12,250 --> 01:15:14,590 which computations are waiting on which 1450 01:15:14,590 --> 01:15:16,690 other computations to finish? 
1451 01:15:16,690 --> 01:15:18,940 And, in fact, that's essentially what the Cilk runtime 1452 01:15:18,940 --> 01:15:19,690 system does. 1453 01:15:19,690 --> 01:15:24,760 It maintains a tree of states called full frames, 1454 01:15:24,760 --> 01:15:26,740 and those full frames store state 1455 01:15:26,740 --> 01:15:28,480 for the parallel subcomputations. 1456 01:15:28,480 --> 01:15:31,900 And those full frames keep track of which 1457 01:15:31,900 --> 01:15:36,950 subcomputations are outstanding and how they relate to each other. 1458 01:15:36,950 --> 01:15:39,550 This is a high level picture of a full frame. 1459 01:15:39,550 --> 01:15:43,870 There are lots of details elided, to be honest. 1460 01:15:43,870 --> 01:15:46,300 But at 30,000 feet, a full frame keeps 1461 01:15:46,300 --> 01:15:49,930 track of a bunch of information for the parallel execution-- 1462 01:15:49,930 --> 01:15:53,060 I know, I'm giving you the quick version of this-- 1463 01:15:53,060 --> 01:15:55,930 including pointers to parent frames 1464 01:15:55,930 --> 01:15:58,810 and possibly pointers to child frames, or at least the number 1465 01:15:58,810 --> 01:16:01,967 of outstanding child frames. 1466 01:16:01,967 --> 01:16:03,550 The processors in the system 1467 01:16:03,550 --> 01:16:05,740 work on what are called active full frames. 1468 01:16:05,740 --> 01:16:07,750 In the diagram, those full frames 1469 01:16:07,750 --> 01:16:12,350 are the rounded rectangles highlighted in dark blue. 1470 01:16:12,350 --> 01:16:15,960 Other full frames in the system are, what we call, suspended. 1471 01:16:15,960 --> 01:16:20,830 They're waiting on some subcomputation to return. 1472 01:16:20,830 --> 01:16:23,440 That's what a full frame tree can look like under 1473 01:16:23,440 --> 01:16:24,650 some execution. 1474 01:16:24,650 --> 01:16:28,390 Let's see how a full frame tree can come into being, just 1475 01:16:28,390 --> 01:16:31,450 by working through an animation. 
1476 01:16:31,450 --> 01:16:33,940 So suppose we have some worker with a bunch of spawned 1477 01:16:33,940 --> 01:16:35,620 and called frames on its deque. 1478 01:16:35,620 --> 01:16:39,880 No other workers have anything on their deques. 1479 01:16:39,880 --> 01:16:45,320 And finally, some worker wants to steal. 1480 01:16:45,320 --> 01:16:50,380 And I'll admit, this animation is crafted slightly, just 1481 01:16:50,380 --> 01:16:54,460 to make the pictures a little bit nicer. 1482 01:16:54,460 --> 01:16:56,380 It can look more complicated in practice, 1483 01:16:56,380 --> 01:17:00,460 don't worry, if that was actually a worry of yours. 1484 01:17:00,460 --> 01:17:02,500 So what's going to happen, the thief 1485 01:17:02,500 --> 01:17:06,430 is going to take some frames from the top of the victim's 1486 01:17:06,430 --> 01:17:07,108 deque. 1487 01:17:07,108 --> 01:17:09,400 And it's actually going to steal not just those frames, 1488 01:17:09,400 --> 01:17:12,727 but the whole full frame structure along with it. 1489 01:17:12,727 --> 01:17:14,560 The full frame structure is just represented 1490 01:17:14,560 --> 01:17:15,830 with this rounded rectangle. 1491 01:17:15,830 --> 01:17:19,390 In fact, it's a constant size thing. 1492 01:17:19,390 --> 01:17:22,570 But the thief is going to take the whole full frame structure. 1493 01:17:22,570 --> 01:17:27,580 And it's going to give the victim a brand new full frame 1494 01:17:27,580 --> 01:17:33,700 and establish the child to parent pointer in the victim's 1495 01:17:33,700 --> 01:17:35,980 new full frame. 1496 01:17:35,980 --> 01:17:37,270 That's kind of weird. 1497 01:17:37,270 --> 01:17:40,420 It's not obvious why the thief would take the full frame 1498 01:17:40,420 --> 01:17:45,520 as it's stealing computation, at least not from one step. 1499 01:17:45,520 --> 01:17:48,700 But we can see why it helps, just given one more step. 
1500 01:17:48,700 --> 01:17:51,000 So let's fast forward this picture a little bit, 1501 01:17:51,000 --> 01:17:56,350 and now we have another worker try to steal some computation, 1502 01:17:56,350 --> 01:17:59,650 and we have a little bit more stuff going on. 1503 01:17:59,650 --> 01:18:02,170 So this worker might randomly select the last worker 1504 01:18:02,170 --> 01:18:05,880 on the right, steal computation from the top of its deque, 1505 01:18:05,880 --> 01:18:08,920 and it's going to steal the full frame along 1506 01:18:08,920 --> 01:18:14,350 with the deque frames. 1507 01:18:14,350 --> 01:18:17,680 And because it stole the full frame, 1508 01:18:17,680 --> 01:18:21,910 all pointers to that full frame from any child subcomputations 1509 01:18:21,910 --> 01:18:24,170 are still valid. 1510 01:18:24,170 --> 01:18:26,470 The child computation on the left 1511 01:18:26,470 --> 01:18:30,120 still points to the correct full frame. 1512 01:18:30,120 --> 01:18:33,340 The full frame that was stolen has the parent context 1513 01:18:33,340 --> 01:18:35,650 of that child, and so we need to make sure 1514 01:18:35,650 --> 01:18:39,330 that pointer is still good. 1515 01:18:39,330 --> 01:18:42,310 If it created a new full frame for itself, 1516 01:18:42,310 --> 01:18:45,730 then you would have to update the child pointers somehow, 1517 01:18:45,730 --> 01:18:48,670 and that requires more synchronization and a more 1518 01:18:48,670 --> 01:18:50,800 complicated protocol. 1519 01:18:50,800 --> 01:18:54,010 Synchronization is expensive, protocols are complicated. 1520 01:18:54,010 --> 01:18:57,970 This ends up saving some complexity. 1521 01:18:57,970 --> 01:19:01,710 And then it creates a frame for the child, 1522 01:19:01,710 --> 01:19:03,200 and we can see this process unfold 1523 01:19:03,200 --> 01:19:07,170 just a little bit further. 1524 01:19:07,170 --> 01:19:10,800 And if we hold on for a few steals, we end up with a tree. 
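[Editor's sketch] The steal protocol just described, where the thief takes the existing full frame and the victim gets a fresh one pointing at it, can be written out as a small Python model. It is only an illustration of the pointer bookkeeping: the class and field names (`FullFrame`, `outstanding_children`, `steal`) are hypothetical, and the real runtime's locking, deques, and stack frames are left out.

```python
class FullFrame:
    """Toy full frame: a parent pointer plus a count of outstanding children."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.outstanding_children = 0
        if parent is not None:
            parent.outstanding_children += 1

class Worker:
    def __init__(self, name, full_frame=None):
        self.name = name
        self.full_frame = full_frame

def steal(thief, victim, new_label):
    """The thief takes the victim's full frame; the victim gets a new child frame."""
    stolen = victim.full_frame
    thief.full_frame = stolen
    # The victim keeps running the spawned child, so its new full frame
    # points up at the stolen frame, which is the child's parent context.
    victim.full_frame = FullFrame(new_label, parent=stolen)
    return stolen

root = FullFrame("root")
w0, w1, w2, w3 = (Worker(f"w{i}") for i in range(4))
w0.full_frame = root

steal(w1, w0, "A")   # w1 now runs root; w0 gets child frame A
frame_a = w0.full_frame
steal(w2, w1, "B")   # w2 now runs root; w1 gets child frame B
frame_b = w1.full_frame
steal(w3, w0, "C")   # w3 now runs A; w0 gets child frame C under A
frame_c = w0.full_frame

# A's parent pointer was never rewritten, yet it still names the frame
# that w2 is now running, because the full frame object itself moved.
print(frame_a.parent is w2.full_frame)
print(root.outstanding_children, frame_a.outstanding_children)
```

Had the thief allocated a new full frame for itself instead, every existing child such as A would need its parent pointer updated during the steal, which is the extra synchronization the lecture says this design avoids.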
1525 01:19:10,800 --> 01:19:14,550 We have two children pointing to one parent, 1526 01:19:14,550 --> 01:19:18,580 and one of those children has its own child. 1527 01:19:18,580 --> 01:19:20,010 Great. 1528 01:19:20,010 --> 01:19:22,680 Now suppose that some worker says, oh, I encountered a sync, 1529 01:19:22,680 --> 01:19:24,240 can I synchronize? 1530 01:19:24,240 --> 01:19:27,120 In this case, the worker has an outstanding child computation 1531 01:19:27,120 --> 01:19:30,400 so it can't synchronize. 1532 01:19:30,400 --> 01:19:32,490 And so we can't recycle the full frame, 1533 01:19:32,490 --> 01:19:36,350 we can't recycle any of the stack for this child. 1534 01:19:36,350 --> 01:19:39,700 And so, instead, the worker will suspend this full frame, 1535 01:19:39,700 --> 01:19:42,486 turning it from dark blue to light blue in our picture, 1536 01:19:42,486 --> 01:19:44,470 and it goes and becomes a thief. 1537 01:19:48,440 --> 01:19:50,340 The program has ample parallelism. 1538 01:19:50,340 --> 01:19:52,590 What do we expect to typically happen when the program 1539 01:19:52,590 --> 01:19:54,858 execution reaches a Cilk sync? 1540 01:19:54,858 --> 01:19:56,400 We're kind of out of time, so I think 1541 01:19:56,400 --> 01:19:58,830 I'm just going to spoil the answer for this, unless anyone 1542 01:19:58,830 --> 01:20:00,700 has a guess handy. 1543 01:20:06,280 --> 01:20:08,830 So what's the common case for a Cilk sync? 1544 01:20:17,000 --> 01:20:19,690 For the sake of time, the common case 1545 01:20:19,690 --> 01:20:22,443 is that the executing function has no outstanding children. 1546 01:20:22,443 --> 01:20:23,860 All the workers on the system were 1547 01:20:23,860 --> 01:20:26,140 busy doing their own thing, there 1548 01:20:26,140 --> 01:20:29,166 is no synchronization that's necessary. 1549 01:20:29,166 --> 01:20:32,140 And so how does the runtime optimize this case? 
1550 01:20:32,140 --> 01:20:36,980 It ends up having the full frame 1551 01:20:36,980 --> 01:20:40,360 use some bits of an associated stack frame, 1552 01:20:40,360 --> 01:20:43,470 in particular the flag field. 1553 01:20:43,470 --> 01:20:46,090 And that's why, when we look at the compiled code for a Cilk 1554 01:20:46,090 --> 01:20:50,170 sync, we see some conditions that evaluate the flags 1555 01:20:50,170 --> 01:20:53,380 within the local stack frame. 1556 01:20:53,380 --> 01:20:56,410 That's just an optimization to say, if you don't need a sync, 1557 01:20:56,410 --> 01:21:01,960 don't do any computation, otherwise some steals really 1558 01:21:01,960 --> 01:21:07,237 did occur, go ahead and execute the Cilk RTS sync routine. 1559 01:21:07,237 --> 01:21:09,070 There are a bunch of other runtime features. 1560 01:21:09,070 --> 01:21:11,260 If you take a look at that picture for a long time, 1561 01:21:11,260 --> 01:21:15,130 you may be dissatisfied with what that implies about some 1562 01:21:15,130 --> 01:21:16,642 of the protocols. 1563 01:21:16,642 --> 01:21:18,850 And there's a lot more code within the runtime system 1564 01:21:18,850 --> 01:21:21,490 itself, to implement a variety of other features such 1565 01:21:21,490 --> 01:21:25,360 as support for C++ exceptions, reducer hyperobjects, 1566 01:21:25,360 --> 01:21:29,920 and a form of IDs called pedigrees. 1567 01:21:29,920 --> 01:21:32,170 We won't talk about that today. 1568 01:21:32,170 --> 01:21:34,460 I'm actually all out of time. 1569 01:21:34,460 --> 01:21:38,510 Thanks for listening to all this about the Cilk runtime system. 1570 01:21:38,510 --> 01:21:41,250 Feel free to ask any questions after class.
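[Editor's sketch] The flag check at a Cilk sync described above might look roughly like the following Python model. This is a sketch under assumptions, not the compiled code or the runtime's API: the flag name `FRAME_UNSYNCHED`, the `StackFrame` class, and both functions are hypothetical stand-ins, and the model assumes the runtime sets the flag on a frame when its continuation is stolen.

```python
FRAME_UNSYNCHED = 0x1  # hypothetical flag bit: set when a steal has occurred on this frame

class StackFrame:
    def __init__(self):
        self.flags = 0

def runtime_sync(frame, outstanding_children):
    """Stand-in for the slow path taken only after a steal."""
    if outstanding_children > 0:
        # The frame must persist, so it is suspended and the worker goes stealing.
        return "suspend frame; worker becomes a thief"
    frame.flags &= ~FRAME_UNSYNCHED
    return "continue past sync"

def cilk_sync(frame, outstanding_children=0):
    """Stand-in for the check the compiler emits at a sync."""
    if not (frame.flags & FRAME_UNSYNCHED):
        return "fast path: nothing to wait for"   # the common case, no runtime call
    return runtime_sync(frame, outstanding_children)

never_stolen = StackFrame()
print(cilk_sync(never_stolen))

stolen_from = StackFrame()
stolen_from.flags |= FRAME_UNSYNCHED              # what a steal would do in this model
print(cilk_sync(stolen_from, outstanding_children=1))
print(cilk_sync(stolen_from, outstanding_children=0))
```

The point of the design is that a function whose continuation was never stolen pays only for one flag test at each sync.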