The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES E. LEISERSON: Hi, it's my great pleasure to introduce, again, TB Schardl. TB is not only a fabulous, world-class performance engineer, he is a world-class performance meta-engineer. In other words, building the tools and such to make it so that people can engineer fast code. And he's the author of the technology that we're using in our compiler, the Tapir technology that's in the open compiler for parallelism. So he implemented all of that, and all the optimizations, and so forth, which has greatly improved the quality of the programming environment. So today, he's going to talk about something near and dear to his heart, which is compilers, and what they can and cannot do.

TAO B. SCHARDL: Great, thank you very much for that introduction. Can everyone hear me in the back? Yes, great. All right, so as I understand it, last lecture you talked about multithreaded algorithms. And you spent the lecture studying those algorithms, analyzing them in a theoretical sense, essentially analyzing their asymptotic running times, work and span complexity. This lecture is not that at all. We're not going to do that kind of math anywhere in the course of this lecture. Instead, this lecture is going to take a look at compilers, as the professor mentioned, and what compilers can and cannot do.

So the last time you saw me standing up here was back in lecture five. And during that lecture we talked about LLVM IR and x86-64 assembly, and how C code got translated into assembly code via LLVM IR. In this lecture, we're going to talk more about what happens between the LLVM IR and assembly stages.
And, essentially, that's what happens when the compiler is allowed to edit and optimize the code in its IR representation, while it's producing the assembly. So last time, we were talking about this IR, and the assembly. And this time, they called the compiler guy back, I suppose, to tell you about the boxes in the middle.

Now, even though you're predominantly dealing with C code within this class, I hope that some of the lessons from today's lecture you will be able to take away into any job that you pursue in the future, because there are a lot of languages today that do end up being compiled: C and C++, Rust, Swift, even Haskell, Julia, Halide, the list goes on and on. And those languages all get compiled for a variety of different what we call backends, different machine architectures, not just x86-64. And, in fact, a lot of those languages get compiled using very similar compilation technology to what you have in the Clang/LLVM compiler that you're using in this class. In fact, many of those languages today are optimized by LLVM itself. LLVM is the internal engine within the compiler that actually does all of the optimization. So that's my hope, that the lessons you'll learn here today don't just apply to 172. They'll, in fact, apply to software that you use and develop for many years down the road.

But let's take a step back, and ask ourselves, why bother studying compiler optimizations at all? Why should we take a look at what's going on within this, up to this point, black box of software? Any ideas? Any suggestions? In the back?

AUDIENCE: [INAUDIBLE]

TAO B. SCHARDL: You can avoid manually trying to optimize things that the compiler will do for you, great answer. Great, great answer. Any other answers?

AUDIENCE: You learn how to best write your code to take advantage of the compiler optimizations.
TAO B. SCHARDL: You can learn how to write your code to take advantage of the compiler optimizations, how to suggest to the compiler what it should or should not do as you're constructing your program, great answer as well. Very good, in the front.

AUDIENCE: It might help for debugging if the compiler has bugs.

TAO B. SCHARDL: It can absolutely help for debugging when the compiler itself has bugs. The compiler is a big piece of software. And you may have noticed that a lot of software contains bugs. The compiler is no exception. And it helps to understand where the compiler might have made a mistake, or where the compiler simply just didn't do what you thought it should be able to do. Understanding more of what happens in the compiler can demystify some of those oddities. Good answer. Any other thoughts?

AUDIENCE: It's fun.

TAO B. SCHARDL: It's fun. Well, OK, so in my completely biased opinion, I would agree that it's fun to understand what the compiler does. You may have different opinions. That's OK. I won't judge.

So I put together a list of reasons why, in general, we may care about what goes on inside the compiler. I highlighted that last point from this list, my bad. Compilers can have a really big impact on software. It's kind of like this. Imagine that you're working on some software project. And you have a teammate on your team who's pretty quiet but extremely smart. And what that teammate does is, whenever that teammate gets access to some code, they jump in and immediately start trying to make that code work faster. And that's really cool, because that teammate does good work. And, oftentimes, you see that what the teammate produces is, indeed, much faster code than what you wrote. Now, in other industries, you might just sit back and say, this teammate does fantastic work. Maybe they don't talk very often. But that's OK. Teammate, you do you. But in this class, we're performance engineers.
We want to understand what that teammate did to the software. How did that teammate get so much performance out of the code? The compiler is kind of like that teammate. And so understanding what the compiler does is valuable in that sense.

As mentioned before, compilers can save you performance engineering work. If you understand that the compiler can do some optimization for you, then you don't have to do it yourself. And that means that you can continue writing simple, and readable, and maintainable code without sacrificing performance. You can also understand the differences between the source code and whatever you might see show up in either the LLVM IR or the assembly, if you have to look at the assembly language produced for your executable.

And compilers can make mistakes. Sometimes, that's because of a genuine bug in the compiler. And other times, it's because the compiler just couldn't understand something about what was going on. And having some insight into how the compiler reasons about code can help you understand why those mistakes were made, or figure out ways to work around those mistakes, or let you write meaningful bug reports to the compiler developers. And, of course, understanding compilers can help you use them more effectively. Plus, I think it's fun.

So the first thing to understand about a compiler is the basic anatomy of how the compiler works. The compiler takes as input LLVM IR. And up until this point, we thought of it as just a big black box that does stuff to the IR, and out pops more LLVM IR, but it's somehow optimized. In fact, what's going on within that black box is that the compiler is executing a sequence of what we call transformation passes on the code. Each transformation pass takes a look at its input, and analyzes that code, and then tries to edit the code in an effort to optimize the code's performance. Now, a transformation pass might end up running multiple times. And those passes run in some order.
That order ends up being a predetermined order that the compiler writers found to work pretty well on their tests. That's about the level of insight that went into picking the order. It seems to work well.

Now, some good news. In terms of trying to understand what the compiler does, you can actually just ask the compiler, what did you do? And you've already used this functionality, as I understand, in some of your assignments. You've already asked the compiler to give you a report specifically about whether or not it could vectorize some code. But, in fact, LLVM, the compiler you have access to, can produce reports not just for vectorization, but for a lot of the different transformation passes that it tries to perform. And there's some syntax that you have to pass to the compiler, some compiler flags that you have to specify in order to get those reports. Those are described on the slide. I won't walk you through that text. You can look at the slides afterwards. At the end of the day, the string that you're passing is actually a regular expression. If you know what regular expressions are, great, then you can use that to narrow down the search for your report. If you don't, and you just want to see the whole report, just provide dot star as a string and you're good to go.

That's the good news. You can get the compiler to tell you exactly what it did. The bad news is that when you ask the compiler what it did, it will give you a report. And the report looks something like this. In fact, I've highlighted most of the report for this particular piece of code, because the report ends up being very long. And as you might have noticed just from reading some of the text, there are definitely English words in this text. And there are pointers to pieces of code that you've compiled. But it is very jargony, and hard to understand. This isn't the easiest report to make sense of.
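For reference, a sketch of the kind of invocation being described; Clang's -Rpass family of flags takes the regular expression mentioned above, and the source file name here is just a placeholder:

```c
/*
 * A minimal sketch (not from the slides) of asking Clang for optimization
 * reports.  Each -Rpass* flag takes a regular expression selecting which
 * passes to report on; ".*" matches every pass.
 *
 *   clang -O3 -Rpass=.* -Rpass-missed=.* -Rpass-analysis=.* -c example.c
 *
 * -Rpass           reports transformations that were applied,
 * -Rpass-missed    reports transformations that were considered but not applied,
 * -Rpass-analysis  reports the analysis behind those decisions.
 */
```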
OK, so that's some good news and some bad news about these compiler reports. The good news is, you can ask the compiler. And it'll happily tell you all about the things that it did. It can tell you about which transformation passes were successfully able to transform the code. It can tell you conclusions that it drew about its analysis of the code. But the bad news is, these reports are kind of complicated. They can be long. They use a lot of internal compiler jargon, which, if you're not familiar with that jargon, makes it hard to understand. It also turns out that not all of the transformation passes in the compiler give you these nice reports. So you don't get to see the whole picture. And, in general, the reports don't really tell you the whole story about what the compiler did or did not do. And we'll see another example of that later on.

So part of the goal of today's lecture is to get some context for understanding the reports that you might see if you pass those flags to the compiler. And the structure of today's lecture is basically divided up into two parts. First, I want to give you some examples of compiler optimizations, just simple examples so you get a sense as to how a compiler mechanically reasons about the code it's given, and tries to optimize that code. We'll take a look at how the compiler optimizes a single scalar value, how it can optimize a structure, how it can optimize function calls, and how it can optimize loops, just simple examples to give some flavor. And then in the second half of lecture, I have a few case studies for you which get into diagnosing ways in which the compiler failed, not due to bugs, per se, but simply didn't do an optimization you might have expected it to do. But, to be frank, I think all those case studies are really cool. But it's not totally crucial that we get through every single case study during today's lecture. The slides will be available afterwards.
So when we get to that part, we'll just see how many case studies we can cover. Sound good? Any questions so far?

All right, let's get to it. Let's start with a quick overview of compiler optimizations. So here is a summary of the various-- oh, I forgot that I just copied this slide from a previous lecture given in this class. You might recognize this slide, I think, from lecture two. Sorry about that. That's OK. We can fix this. We'll just go ahead and add this slide right now. We need to change the title. So let's cross that out and put in our new title.

OK, so, great, and now we should double-check these lists and make sure that they're accurate. Data structures, we'll come back to data structures. Loops: hoisting, yeah, the compiler can do hoisting. Sentinels, not really, the compiler is not good at sentinels. Loop unrolling, yeah, it absolutely does loop unrolling. Loop fusion, yeah, it can, but there are some restrictions that apply. Your mileage might vary. Eliminate wasted iterations, some restrictions might apply. OK, logic: constant folding and propagation, yeah, it's good on that. Common subexpression elimination, yeah, it can find common subexpressions, you're fine there. It knows algebra, yeah, good. Short-circuiting, yes, absolutely. Ordering tests, depends on the tests-- I'll give it to the compiler. But I'll say, restrictions apply. Creating a fast path, compilers aren't that smart about fast paths. They come up with really boring fast paths. I'm going to take that off the list. Combining tests, again, it kind of depends on the tests. Functions: compilers are pretty good at functions. So inlining, it can do that. Tail-recursion elimination, yes, absolutely. Coarsening, not so much.

OK, great. Let's come back to data structures, which we skipped before.
Packing, augmentation-- OK, honestly, the compiler does a lot with data structures, but really none of those things. The compiler isn't smart about data structures in that particular way. Really, the way that the compiler is smart about data structures is shown here, if we expand this list to include even more compiler optimizations.

Bottom line with data structures: the compiler knows a lot about architecture. And it really has put a lot of effort into figuring out how to use registers really effectively. Reading and writing a register is super fast. Touching memory is not so fast. And so the compiler works really hard to allocate registers, put anything that lives in memory ordinarily into registers, manipulate aggregate types to use registers, as we'll see in a couple of slides, and align data that has to live in memory. Compilers are good at that.

Compilers are also good at loops. We already saw some example optimizations on the previous slide. It can vectorize. It does a lot of other cool stuff. Unswitching is a cool optimization that I won't cover here. Idiom replacement, it finds common patterns, and does something smart with those. Fission, skewing, tiling, interchange, those all try to process the iterations of the loop in some clever way to make stuff go fast. And some restrictions apply. Those are really in development in LLVM.

Logic, it does a lot more with logic than what we saw before. It can eliminate instructions that aren't necessary. It can do strength reduction, another cool optimization. I think we saw that one in the Bentley slides. It gets rid of dead code. It can do more idiom replacement. Branch reordering is kind of like reordering tests. Global value numbering, another cool optimization that we won't talk about today. Functions, it can do more on switching. It can eliminate arguments that aren't necessary. So the compiler can do a lot of stuff for you.
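As a concrete illustration of two of the loop and logic items just listed, hoisting and strength reduction, here is a small hand-written before-and-after sketch (mine, not from the slides); the compiler performs equivalent rewrites on the IR rather than on the source:

```c
#include <stddef.h>

/* Before: "limit * scale" is loop-invariant but written inside the loop,
 * and each access to in[i]/out[i] implies a multiply of i by the element
 * size to form an address. */
void scale_before(double *out, const double *in, size_t n,
                  double limit, double scale) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * (limit * scale);   /* invariant product recomputed */
    }
}

/* After: the invariant product is hoisted out of the loop, and the indexed
 * accesses are strength-reduced into pointers that are bumped by one element
 * per iteration (a multiply per access becomes an add). */
void scale_after(double *out, const double *in, size_t n,
                 double limit, double scale) {
    double factor = limit * scale;          /* hoisted out of the loop */
    const double *p = in;
    double *q = out;
    for (size_t i = 0; i < n; i++) {
        *q++ = *p++ * factor;               /* strength-reduced addressing */
    }
}
```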
And at the end of the day, writing down this whole list is kind of a futile activity, because it changes over time. Compilers are a moving target. Compiler developers, they're software engineers like you and me. And they're clever. And they're trying to apply all their clever software engineering practice to this compiler code base to make it do more stuff. And so they are constantly adding new optimizations to the compiler, new clever analyses, all the time. So, really, what we're going to look at today is just a couple examples to get a flavor for what the compiler does internally.

Now, if you want to follow along with how the compiler works, the good news is, by and large, you can take a look at the LLVM IR to see what happens as the compiler processes your code. You don't need to look at the assembly. That's generally true. But there are some exceptions.

So, for example, if we have these three snippets of C code on the left, and we look at what your LLVM compiler generates, in terms of the IR, we can see that there are some optimizations reflected, but not too many interesting ones. The multiply by 8 turns into a shift left operation by 3, because 8 is a power of 2. That's straightforward. Good, we can see that in the IR. The multiply by 15 still looks like a multiply by 15. No changes there. The divide by 71 looks like a divide by 71. Again, no changes there.

Now, with arithmetic ops, the difference between what you see in the LLVM IR and what you see in the assembly, this is where it's most pronounced, at least in my experience, because if we take a look at these same snippets of C code, and we look at the corresponding x86 assembly for it, we get the stuff on the right. And this looks different. Let's pick through what this assembly code does one line at a time. So the first one in the C code takes the argument n, and multiplies it by 8.
And in the assembly, we have this LEA instruction. Anyone remember what the LEA instruction does? I see one person shaking their head. That's a perfectly reasonable response. Yeah, go for it?

Load effective address, what does that mean? Load the address, but don't actually access memory. Another way to phrase that, do this address calculation, and give me the result of the address calculation. Don't read or write memory at that address. Just do the calculation. That's what loading an effective address means, essentially. But you're exactly right. The LEA instruction does an address calculation, and stores the result in the register on the right. Anyone remember enough about x86 address calculations to tell me how that LEA in particular works, the first LEA on the slide? Yeah?

AUDIENCE: [INAUDIBLE]

TAO B. SCHARDL: The part before the first comma, in this case nothing, gets added to the product of the second two arguments in those parens. You're exactly right. So this LEA takes the value 8, multiplies it by whatever is in register RDI, which holds the value n. And it stores the result into RAX. So, perfect, it does a multiply by 8.

The address calculator is only capable of a small range of operations. It can do additions. And it can multiply by 1, 2, 4, or 8. That's it. So it's a really simple circuit in the hardware. But it's fast. It's optimized heavily by modern processors. And so if the compiler can use it, it tends to try to use these LEA instructions. So good job.

How about the next one? Multiply by 15 turns into these two LEA instructions. Can anyone tell me how these work?

AUDIENCE: [INAUDIBLE]

TAO B. SCHARDL: You're basically multiplying by 5 and multiplying by 3, exactly right. We can step through this as well. If we look at the first LEA instruction, we take RDI, which stores the value n. We multiply that by 4.
We add it to the original value of RDI. And so that computes 4 times n, plus n, which is 5 times n. And that result gets stored into RAX. Good, we've effectively multiplied by 5. The next instruction takes whatever is in RAX, which is now 5n, multiplies that by 2, adds it to whatever is currently in RAX, which is once again 5n. So that computes 2 times 5n, plus 5n, which is 3 times 5n, which is 15n. So just like that, we've done our multiply with two LEA instructions.

How about the last one? In this last piece of code, we take the argument in RDI. We move it into EAX. We then move the value 3,871,519,817, and put that into ECX, as you do. We multiply those two values together. And then we shift the product right by 38. So, obviously, this divides by 71. Any guesses as to how this performs the division operation we want? Both of you answered. I might still call on you. I'll give a little more time for someone else to raise their hand. Go for it.

AUDIENCE: [INAUDIBLE]

TAO B. SCHARDL: It has a lot to do with 2 to the 38, very good. Yeah, all right, any further guesses before I give the answer away? Yeah, in the back?

AUDIENCE: [INAUDIBLE]

TAO B. SCHARDL: Kind of. So this is what's technically called a magic number. And, yes, it's technically called a magic number. And this magic number is equal to 2 to the 38, divided by 71, plus 1 to deal with some rounding effects. What this code does is it says, let's compute n divided by 71, by first computing n divided by 71, times 2 to the 38, and then shifting off the lower 38 bits with that shift right operation. And by converting the operation into this, it's able to replace the division operation with a multiply. And if you remember, hopefully, from the architecture lecture, multiply operations, they're not the cheapest things in the world. But they're not too bad.
Division is really expensive. If you want fast code, never divide. Also, never compute modulus, or access memory. Yeah, question?

AUDIENCE: Why did you choose 38?

TAO B. SCHARDL: Why did I choose 38? I think it chose 38 because 38 works. There's actually a formula for it-- pretty much, it doesn't want to choose a value that's too large, or else it'll overflow. And it doesn't want to choose a value that's too small, or else you lose precision. So it's able to find a balancing point.

If you want to know more about magic numbers, I recommend checking out this book called Hacker's Delight. For any of you who are familiar with this book, it is a book full of bit tricks. Seriously, that's the entire book. It's just a book full of bit tricks. And there's a whole section in there describing how you do division by various constants using multiplication, either signed or unsigned. It's very cool. But a magic number to convert a division into a multiply, that's the kind of thing that you might see from the assembly. That's one of these examples of arithmetic operations that are really optimized at the very last step. But for the rest of the optimizations, fortunately we can focus on the IR. Any questions about that so far? Cool.
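To tie the arithmetic examples together, here is a small runnable sketch (the function names and test harness are mine, not the lecture's) of the three rewrites discussed: multiply by 8 as a shift, multiply by 15 as two LEA-style add-and-scale steps, and divide by 71 as a multiply by the magic number 3,871,519,817 followed by a right shift by 38:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

uint32_t mul8(uint32_t n)  { return n << 3; }   /* n * 8, as a shift left by 3 */

uint32_t mul15(uint32_t n) {
    uint32_t t = n + 4 * n;        /* first LEA-style step:  5n  = n + 4n      */
    return t + 2 * t;              /* second LEA-style step: 15n = 5n + 2*(5n) */
}

uint32_t div71(uint32_t n) {
    /* Magic number = 2^38 / 71 + 1 = 3,871,519,817 (rounded up).              */
    /* n / 71 == (n * magic) >> 38 for every 32-bit unsigned n.                */
    return (uint32_t)(((uint64_t)n * 3871519817u) >> 38);
}

int main(void) {
    /* Spot-check the rewrites against the plain operations. */
    for (uint32_t n = 0; n < 1000000; n++) {
        assert(mul8(n)  == n * 8);
        assert(mul15(n) == n * 15);
        assert(div71(n) == n / 71);
    }
    printf("magic-number division by 71 agrees with n / 71\n");
    return 0;
}
```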
OK, so for the next part of the lecture, I want to show you a couple example optimizations in terms of the LLVM IR. And to show you these optimizations, we'll have a little bit of code that we'll work through, a running example, if you will. And this running example will be some code that I stole from, I think it was, a serial program that simulates the behavior of n massive bodies in 2D space under the law of gravitation. So we've got a whole bunch of point masses. Those point masses have varying masses. And we just want to simulate what happens due to gravity as these masses interact in the plane.

At a high level, the n-body code is pretty simple. We have a top-level simulate routine, which just loops over all the time steps during which we want to perform this simulation. And at each time step, it calculates the various forces acting on those different bodies. And then it updates the position of each body, based on those forces. In order to do that calculation, it has some internal data structures, one to represent each body, which contains a couple of vector types. And we define our own vector type to store two double-precision floating-point values.

Now, we don't need to see the entire code in order to look at some compiler optimizations. The one routine that we will take a look at is this one to update the positions. This is a simple loop that takes each body, one at a time, computes the new velocity on that body, based on the forces acting on that body, and uses vector operations to do that. Then it updates the position of that body, again using these vector operations that we've defined. And then it stores the results into the data structure for that body.

So all these methods within this code make use of these basic routines on 2D vectors, points in x, y, or points in 2D space. And these routines are pretty simple. There is one to add two vectors. There's another to scale a vector by a scalar value. And there's a third to compute the length, which we won't actually look at too much. Everyone good so far?

OK, so let's try to start simple. Let's take a look at just one of these one-line vector operations, vec scale. All vec scale does is it takes one of these vector inputs and a scalar value a. And it multiplies x by a, and y by a, and stores the results into a vector type, and returns it. Great, couldn't be simpler.
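For reference, a minimal sketch in C of the 2D vector type and the routines just described (the names vec_t, vec_add, and vec_scale are placeholders; the lecture's actual code may differ):

```c
/* A 2D vector holding two double-precision values, as described above. */
typedef struct vec_t {
    double x, y;
} vec_t;

/* Add two vectors componentwise. */
vec_t vec_add(vec_t u, vec_t v) {
    vec_t out;
    out.x = u.x + v.x;
    out.y = u.y + v.y;
    return out;
}

/* Scale a vector by a scalar: multiply both fields by a and return the
 * result by value. */
vec_t vec_scale(vec_t v, double a) {
    vec_t out;
    out.x = v.x * a;
    out.y = v.y * a;
    return out;
}
```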
If we compile this with no optimizations whatsoever, and we take a look at the LLVM IR, we get that, which is a little more complicated than you might imagine. The good news, though, is that if you turn on optimizations, and you just turn on the first level of optimization, just -O1, whereas we got this code before, now we get this, which is far, far simpler, and so simple I can blow up the font size so you can actually read the code on the slide. So, to see again: no optimizations, optimizations. So a lot of stuff happened to optimize this simple function. We're going to see what those optimizations actually were.

But, first, let's pick apart what's going on in this function. We have our vec scale routine in LLVM IR. It takes a structure as its first argument. And that's represented using two doubles. It takes a scalar as the second argument. And what the operation does is it multiplies those two fields by the third argument, the double a. It then packs those values into a struct that it'll return. And, finally, it returns that struct. So that's what the optimized code does.

Let's see actually how we get to this optimized code. And we'll do this one step at a time. Let's start by optimizing the operations on a single scalar value. That's why I picked this example. So we go back to the -O0 code. And we just pick out the operations that dealt with that scalar value. We narrow our scope down to just these lines. So the argument, double a, is the final argument in the list. And what we see is that within the vector scale routine, compiled at -O0, we allocate some local storage. We store that double a into the local storage. And then later on, we'll load the value out of the local storage before the multiply. And then we load it again before the other multiply.

OK, any ideas how we could make this code faster? Don't store it in memory, what a great idea. How do we get around not storing it in memory? Save it in a register. In particular, what property of LLVM IR makes that really easy? There are infinite registers. And, in fact, the argument is already in a register. It's already in the register percent 2, if I recall.
653 00:31:50,830 --> 00:31:53,560 It's already there. 654 00:31:53,560 --> 00:31:56,530 So how do we go about optimizing that code in this case? 655 00:31:56,530 --> 00:32:00,430 Well, let's find the places where we're using the value. 656 00:32:00,430 --> 00:32:04,750 And we're using the value loaded from memory. 657 00:32:04,750 --> 00:32:08,080 And what we're going to do is just replace those loads 658 00:32:08,080 --> 00:32:10,090 from memory with the original argument. 659 00:32:10,090 --> 00:32:12,430 We know exactly what operation we're trying to do. 660 00:32:12,430 --> 00:32:15,670 We know we're trying to do a multiply 661 00:32:15,670 --> 00:32:18,340 by the original parameter. 662 00:32:18,340 --> 00:32:20,950 So we just find those two uses. 663 00:32:20,950 --> 00:32:22,090 We cross them out. 664 00:32:22,090 --> 00:32:27,010 And we put in the input parameter in its place. 665 00:32:27,010 --> 00:32:29,110 That make sense? 666 00:32:29,110 --> 00:32:31,670 Questions so far? 667 00:32:31,670 --> 00:32:33,040 Cool. 668 00:32:33,040 --> 00:32:36,370 So now, those multipliers aren't using the values 669 00:32:36,370 --> 00:32:38,247 returned by the loads. 670 00:32:38,247 --> 00:32:39,830 How further can we optimize this code? 671 00:32:45,900 --> 00:32:47,290 Delete the loads. 672 00:32:47,290 --> 00:32:48,290 What else can we delete? 673 00:32:55,980 --> 00:32:58,380 So there's no address calculation here 674 00:32:58,380 --> 00:33:03,090 just because the code is so simple, but good insight. 675 00:33:03,090 --> 00:33:07,920 The allocation and the store, great. 676 00:33:07,920 --> 00:33:09,870 So those loads are dead code. 677 00:33:09,870 --> 00:33:11,550 The store is dead code. 678 00:33:11,550 --> 00:33:12,960 The allocation is dead code. 679 00:33:12,960 --> 00:33:15,690 We eliminate all that dead code. 680 00:33:15,690 --> 00:33:17,040 We got rid of those loads. 681 00:33:17,040 --> 00:33:19,450 We just used the value living in the register. 682 00:33:19,450 --> 00:33:23,080 And we've already eliminated a bunch of instructions. 683 00:33:23,080 --> 00:33:26,920 So the net effect of that was to turn the code optimizer at 00 684 00:33:26,920 --> 00:33:29,730 that we had in the background into the code we have 685 00:33:29,730 --> 00:33:34,230 in the foreground, which is slightly shorter, 686 00:33:34,230 --> 00:33:36,190 but not that much. 687 00:33:36,190 --> 00:33:39,960 So it's a little bit faster, but not that much faster. 688 00:33:39,960 --> 00:33:42,350 How do we optimize this function further? 689 00:33:42,350 --> 00:33:45,180 Do it for every variable we have. 690 00:33:45,180 --> 00:33:47,310 In particular, the only other variable we have 691 00:33:47,310 --> 00:33:50,130 is a structure that we're passing in. 692 00:33:50,130 --> 00:33:55,300 So we want to do this kind of optimization on the structure. 693 00:33:55,300 --> 00:33:58,380 Make sense? 694 00:33:58,380 --> 00:34:02,130 So let's see how we optimize this structure. 695 00:34:02,130 --> 00:34:03,660 Now, the problem is that structures 696 00:34:03,660 --> 00:34:07,350 are harder to handle than individual scalar values, 697 00:34:07,350 --> 00:34:10,020 because, in general, you can't store the whole structure 698 00:34:10,020 --> 00:34:11,840 in just a single register. 699 00:34:11,840 --> 00:34:14,969 It's more complicated to juggle all the data 700 00:34:14,969 --> 00:34:17,310 within a structure. 
701 00:34:17,310 --> 00:34:18,929 But, nevertheless, let's take a look 702 00:34:18,929 --> 00:34:21,239 at the code that operates on the structure, 703 00:34:21,239 --> 00:34:23,280 or at least operates on the structure 704 00:34:23,280 --> 00:34:26,620 that we pass in to the function. 705 00:34:26,620 --> 00:34:28,350 So when we eliminate all the other code, 706 00:34:28,350 --> 00:34:31,420 we see that we've got an allocation. 707 00:34:31,420 --> 00:34:32,989 See if I animations here, yeah, I do. 708 00:34:32,989 --> 00:34:34,860 We have an allocation. 709 00:34:34,860 --> 00:34:38,560 So we can store the structure onto the stack. 710 00:34:38,560 --> 00:34:40,380 Then we have an address calculation 711 00:34:40,380 --> 00:34:43,560 that lets us store the first part of the structure 712 00:34:43,560 --> 00:34:45,449 onto the stack. 713 00:34:45,449 --> 00:34:46,949 We have a second address calculation 714 00:34:46,949 --> 00:34:49,800 to store the second field on the stack. 715 00:34:49,800 --> 00:34:52,469 And later on, when we need those values, 716 00:34:52,469 --> 00:34:55,980 we load the first field out of memory. 717 00:34:55,980 --> 00:34:58,020 And we load the second field out of memory. 718 00:34:58,020 --> 00:35:00,870 It's a very similar pattern to what we had before, 719 00:35:00,870 --> 00:35:03,990 except we've got more going on in this case. 720 00:35:08,480 --> 00:35:12,340 So how do we go about optimizing this structure? 721 00:35:12,340 --> 00:35:16,420 Any ideas, high level ideas? 722 00:35:16,420 --> 00:35:19,690 Ultimately, we want to get rid of all of the memory 723 00:35:19,690 --> 00:35:26,170 references and all that storage for the structure. 724 00:35:26,170 --> 00:35:28,750 How do we reason through eliminating all that stuff 725 00:35:28,750 --> 00:35:33,640 in a mechanical fashion, based on what we've seen so far? 726 00:35:33,640 --> 00:35:35,411 Go for it. 727 00:35:35,411 --> 00:35:39,794 AUDIENCE: [INAUDIBLE] 728 00:35:43,458 --> 00:35:46,000 TAO B. SCHARDL: They are passed in using separate parameters, 729 00:35:46,000 --> 00:35:50,120 separate registers if you will, as a quirk of how LLVM does it. 730 00:35:50,120 --> 00:35:55,158 So given that insight, how would you optimize it? 731 00:35:55,158 --> 00:35:58,567 AUDIENCE: [INAUDIBLE] 732 00:36:01,600 --> 00:36:03,600 TAO B. SCHARDL: Cross out percent 12, percent 6, 733 00:36:03,600 --> 00:36:07,640 and put in the relevant field. 734 00:36:07,640 --> 00:36:08,677 Cool. 735 00:36:08,677 --> 00:36:10,510 Let me phrase that a little bit differently. 736 00:36:10,510 --> 00:36:13,680 Let's do this one field at a time. 737 00:36:13,680 --> 00:36:16,660 We've got a structure, which has multiple fields. 738 00:36:16,660 --> 00:36:18,900 Let's just take it one step at a time. 739 00:36:23,140 --> 00:36:25,980 All right, so let's look at the first field. 740 00:36:25,980 --> 00:36:29,320 And let's look at the operations that deal with the first field. 741 00:36:29,320 --> 00:36:34,710 We have, in our code, in our LLVM IR, some address 742 00:36:34,710 --> 00:36:38,787 calculations that refer to the same field of the structure. 743 00:36:38,787 --> 00:36:40,870 In this case, I believe it's the first field, yes. 744 00:36:45,300 --> 00:36:49,220 And, ultimately, we end up loading from this location 745 00:36:49,220 --> 00:36:51,260 in local memory. 746 00:36:51,260 --> 00:36:54,485 So what value is this load going to retrieve? 
How do we know that both address calculations refer to the same field? Good question. What we do in this case is very careful analysis of the math that's going on. We know that the alloca, the location in local memory, that's just a fixed location. And from that, we can interpret what each of the instructions does in terms of an address calculation. And we can determine that they're the same value.

So we have this location in memory that we operate on. And before we do a multiply, we end up loading from that location in memory. So what value do we know is going to be loaded by that load instruction? Go for it.

AUDIENCE: So what we're doing right now is taking some value, and then storing it, and then getting it back out, and putting it back.

TAO B. SCHARDL: Not putting it back, but don't worry about putting it back.

AUDIENCE: So we don't need to put it somewhere just to take it back out?

TAO B. SCHARDL: Correct. Correct. So what are we multiplying in that multiply, which value? The first element of the struct. It's percent zero. It's the value that we stored right there. Does that make sense? Everyone see that? Any questions about that?

All right, so we're storing the first element of the struct into this location. Later, we load it out of that same location. Nothing else happened to that location. So let's go ahead and optimize it just the same way we optimized the scalar. We see that we use the result of the load right there. But we know that load is going to return the first field of our struct input. So we'll just cross it out, and replace it with that field. So now we're not using the result of that load. What do we get to do as the compiler? I can tell you know the answer. Delete the dead code, delete all of it. Remove the now-dead code, which is all those address calculations, as well as the load operation, and the store operation.
796 00:39:34,800 --> 00:39:36,930 And that's pretty much it. 797 00:39:36,930 --> 00:39:39,770 Yeah, good. 798 00:39:39,770 --> 00:39:42,030 So we replace that operation. 799 00:39:42,030 --> 00:39:46,800 And we got rid of a bunch of other code from our function. 800 00:39:46,800 --> 00:39:50,970 We've now optimized one of the two fields in our struct. 801 00:39:50,970 --> 00:39:51,810 What do we do next? 802 00:39:55,510 --> 00:39:58,190 Optimize the next one. 803 00:39:58,190 --> 00:39:59,330 That happens similarly. 804 00:39:59,330 --> 00:40:02,090 I won't walk you through that a second time. 805 00:40:02,090 --> 00:40:04,760 We find where we're using the result of that load. 806 00:40:04,760 --> 00:40:09,238 We can cross it out, and replace it with the appropriate input, 807 00:40:09,238 --> 00:40:11,030 and then delete all the relevant dead code. 808 00:40:11,030 --> 00:40:13,550 And now, we get to delete the original allocation 809 00:40:13,550 --> 00:40:16,133 because nothing's getting stored to that memory. 810 00:40:16,133 --> 00:40:16,800 That make sense? 811 00:40:16,800 --> 00:40:18,360 Any questions about that? 812 00:40:18,360 --> 00:40:19,910 Yeah? 813 00:40:19,910 --> 00:40:23,690 AUDIENCE: So when we first compile it to LLVM IR, 814 00:40:23,690 --> 00:40:25,420 does it unpack the struct and just 815 00:40:25,420 --> 00:40:28,572 put in separate parameters? 816 00:40:28,572 --> 00:40:30,530 TAO B. SCHARDL: When we first compile to LLVM IR, 817 00:40:30,530 --> 00:40:32,870 do we unpack the struct and pass in the separate parameters? 818 00:40:32,870 --> 00:40:34,400 AUDIENCE: Like, how we have three parameters here 819 00:40:34,400 --> 00:40:35,108 that are doubles. 820 00:40:35,108 --> 00:40:39,721 Wasn't our original C code just a struct of vectors with 821 00:40:39,721 --> 00:40:40,730 the doubles? 822 00:40:40,730 --> 00:40:44,780 TAO B. SCHARDL: So LLVM IR in this case, when we compiled it 823 00:40:44,780 --> 00:40:50,360 at -O0, decided to pass it as separate parameters, 824 00:40:50,360 --> 00:40:54,350 just as its representation. 825 00:40:54,350 --> 00:40:56,660 So in that sense, yes. 826 00:40:56,660 --> 00:41:00,440 But it was still doing the standard, 827 00:41:00,440 --> 00:41:02,870 create some local storage, store the parameters 828 00:41:02,870 --> 00:41:05,930 onto local storage, and then all operations just 829 00:41:05,930 --> 00:41:07,760 read out of local storage. 830 00:41:07,760 --> 00:41:11,810 It's the standard thing that the compiler generates when 831 00:41:11,810 --> 00:41:13,980 it's asked to compile C code. 832 00:41:13,980 --> 00:41:17,180 And with no other optimizations, that's what you get. 833 00:41:17,180 --> 00:41:19,230 That makes sense? 834 00:41:19,230 --> 00:41:19,845 Yeah? 835 00:41:19,845 --> 00:41:22,430 AUDIENCE: What are all the align eights? 836 00:41:22,430 --> 00:41:24,680 TAO B. SCHARDL: What are all the align eights doing? 837 00:41:24,680 --> 00:41:27,770 The align eights are attributes that 838 00:41:27,770 --> 00:41:31,340 specify the alignment of that location in memory. 839 00:41:31,340 --> 00:41:34,340 This is alignment information that the compiler 840 00:41:34,340 --> 00:41:38,240 either determines by analysis, or implements 841 00:41:38,240 --> 00:41:41,180 as part of a standard. 842 00:41:41,180 --> 00:41:44,060 So they're specifying how values are aligned in memory.
843 00:41:44,060 --> 00:41:47,180 That matters a lot more for ultimate code generation, 844 00:41:47,180 --> 00:41:49,310 unless we're able to just delete the memory 845 00:41:49,310 --> 00:41:51,020 references altogether. 846 00:41:51,020 --> 00:41:51,886 Make sense? 847 00:41:51,886 --> 00:41:53,670 Cool. 848 00:41:53,670 --> 00:41:54,743 Any other questions? 849 00:41:58,610 --> 00:42:02,940 All right, so we optimized the first field. 850 00:42:02,940 --> 00:42:05,880 We optimize the second field in a similar way. 851 00:42:05,880 --> 00:42:08,610 Turns out, there's additional optimizations 852 00:42:08,610 --> 00:42:10,620 that need to happen in order to return 853 00:42:10,620 --> 00:42:14,610 a structure from this function. 854 00:42:14,610 --> 00:42:17,160 Those operations can be optimized in a similar way. 855 00:42:17,160 --> 00:42:18,300 They're shown here. 856 00:42:18,300 --> 00:42:21,150 We're not going to go through exactly how that works. 857 00:42:21,150 --> 00:42:23,070 But at the end of the day, after we've 858 00:42:23,070 --> 00:42:27,210 optimized all of that code we end up with this. 859 00:42:27,210 --> 00:42:30,930 We end up with our function compiled at -O1. 860 00:42:30,930 --> 00:42:32,477 And it's far simpler. 861 00:42:32,477 --> 00:42:33,810 I think it's far more intuitive. 862 00:42:33,810 --> 00:42:35,643 This is what I would imagine the code should 863 00:42:35,643 --> 00:42:40,920 look like when I wrote the C code in the first place. 864 00:42:40,920 --> 00:42:41,970 Take your input. 865 00:42:41,970 --> 00:42:44,070 Do a couple of multiplications. 866 00:42:44,070 --> 00:42:48,310 And then it does the operations to create the return value, 867 00:42:48,310 --> 00:42:51,460 and ultimately returns that value. 868 00:42:51,460 --> 00:42:54,330 So, in summary, the compiler works 869 00:42:54,330 --> 00:42:57,570 hard to transform data structures and scalar 870 00:42:57,570 --> 00:42:59,370 values to store as much as it possibly 871 00:42:59,370 --> 00:43:02,760 can purely within registers, and avoid using 872 00:43:02,760 --> 00:43:06,064 any local storage, if possible. 873 00:43:06,064 --> 00:43:09,360 Everyone good with that so far? 874 00:43:09,360 --> 00:43:11,250 Cool. 875 00:43:11,250 --> 00:43:12,900 Let's move on to another optimization. 876 00:43:12,900 --> 00:43:15,600 Let's talk about function calls. 877 00:43:15,600 --> 00:43:17,790 Let's take a look at how the compiler 878 00:43:17,790 --> 00:43:19,260 can optimize function calls. 879 00:43:19,260 --> 00:43:20,940 By and large, these optimizations 880 00:43:20,940 --> 00:43:29,510 will occur if you pass optimization level 2 or higher, 881 00:43:29,510 --> 00:43:31,310 just FYI. 882 00:43:31,310 --> 00:43:33,490 So from our original C code, we had 883 00:43:33,490 --> 00:43:37,150 some lines that performed a bunch of vector operations. 884 00:43:37,150 --> 00:43:40,690 We had a vec add that added two vectors together, one of which 885 00:43:40,690 --> 00:43:42,880 was the result of a vec scale, which 886 00:43:42,880 --> 00:43:47,270 scaled the result of a vec add by some scalar value. 887 00:43:47,270 --> 00:43:52,353 So we had this chain of calls in our code.
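(For reference, here is a minimal C sketch of what such a chain of calls might look like. The struct layout and the names vec_t, vec_add, vec_scale, and update are assumptions standing in for the lecture's actual source, not the real code.)

    /* Hypothetical sketch of the chain of vector calls described above. */
    typedef struct { double x, y; } vec_t;   /* assumed two-field struct of doubles */

    static vec_t vec_add(vec_t a, vec_t b) {
        vec_t r = { a.x + b.x, a.y + b.y };
        return r;
    }

    static vec_t vec_scale(vec_t v, double s) {
        vec_t r = { v.x * s, v.y * s };
        return r;
    }

    /* The chain: add two vectors, scale that result, add it to another vector. */
    vec_t update(vec_t p, vec_t v, vec_t a, double dt) {
        return vec_add(p, vec_scale(vec_add(v, a), dt));
    }

(Compiled at -O0, each of those calls really happens; what follows is about how the compiler treats them at higher optimization levels.)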
888 00:43:52,353 --> 00:43:54,270 And if we take a look at the code compiled 889 00:43:54,270 --> 00:43:57,130 at -O0, what we end up with is this snippet shown 890 00:43:57,130 --> 00:44:01,720 on the bottom, which performs some operations on these vector 891 00:44:01,720 --> 00:44:04,720 structures, does this multiply operation, 892 00:44:04,720 --> 00:44:07,000 and then calls this vector scale routine, 893 00:44:07,000 --> 00:44:12,230 the vector scale routine that we decided to focus on first. 894 00:44:12,230 --> 00:44:18,340 So any ideas for how we go about optimizing this? 895 00:44:21,880 --> 00:44:25,810 So to give you a little bit of a hint, what the compiler sees 896 00:44:25,810 --> 00:44:29,320 when it looks at that call is it sees a snippet containing 897 00:44:29,320 --> 00:44:30,920 the call instruction. 898 00:44:30,920 --> 00:44:36,730 And in our example, it also sees the code for the vec scale 899 00:44:36,730 --> 00:44:38,620 function that we were just looking at. 900 00:44:38,620 --> 00:44:40,870 And we're going to suppose that it's already optimized 901 00:44:40,870 --> 00:44:42,280 vec scale as best as it can. 902 00:44:42,280 --> 00:44:45,260 It's produced this code for the vec scale routine. 903 00:44:45,260 --> 00:44:47,830 And so it sees that call instruction. 904 00:44:47,830 --> 00:44:52,400 And it sees this code for the function that's being called. 905 00:44:52,400 --> 00:44:54,790 So what could the compiler do at this point 906 00:44:54,790 --> 00:45:01,570 to try to make the code above even faster? 907 00:45:04,498 --> 00:45:08,402 AUDIENCE: [INAUDIBLE] 908 00:45:09,638 --> 00:45:11,180 TAO B. SCHARDL: You're exactly right. 909 00:45:11,180 --> 00:45:15,020 Remove the call, and just put the body of the vec scale code 910 00:45:15,020 --> 00:45:17,450 right there in place of the call. 911 00:45:17,450 --> 00:45:20,130 It takes a little bit of effort to pull that off. 912 00:45:20,130 --> 00:45:22,070 But, roughly speaking, yeah, we're 913 00:45:22,070 --> 00:45:25,220 just going to copy and paste this code in our function 914 00:45:25,220 --> 00:45:28,800 into the place where we're calling the function. 915 00:45:28,800 --> 00:45:30,620 And so if we do that simple copy paste, 916 00:45:30,620 --> 00:45:34,358 we end up with some garbage code as an intermediate. 917 00:45:34,358 --> 00:45:35,900 We had to do a little bit of renaming 918 00:45:35,900 --> 00:45:39,040 to make everything work out. 919 00:45:39,040 --> 00:45:40,580 But at this point, we have the code 920 00:45:40,580 --> 00:45:43,910 from our function in the place of that call. 921 00:45:43,910 --> 00:45:46,782 And now, we can observe that to restore correctness, 922 00:45:46,782 --> 00:45:47,990 we don't want to do the call. 923 00:45:47,990 --> 00:45:51,980 And we don't want to do the return that we just 924 00:45:51,980 --> 00:45:54,200 pasted in place. 925 00:45:54,200 --> 00:45:55,610 So we'll just go ahead and remove 926 00:45:55,610 --> 00:45:58,370 both that call and the return. 927 00:45:58,370 --> 00:46:00,350 That is called function inlining. 928 00:46:00,350 --> 00:46:03,260 We identify some function call, or the compiler 929 00:46:03,260 --> 00:46:04,790 identifies some function call. 930 00:46:04,790 --> 00:46:06,710 And it takes the body of the function, 931 00:46:06,710 --> 00:46:11,360 and just pastes it right in place of that call. 932 00:46:11,360 --> 00:46:13,520 Sound good? 933 00:46:13,520 --> 00:46:14,480 Make sense? 934 00:46:14,480 --> 00:46:15,200 Anyone confused?
935 00:46:21,472 --> 00:46:22,930 Raise your hand if you're confused. 936 00:46:29,370 --> 00:46:32,610 Now, once you've done some amount of function inlining, 937 00:46:32,610 --> 00:46:35,680 we can actually do some more optimizations. 938 00:46:35,680 --> 00:46:37,470 So here, we have the code after we got rid 939 00:46:37,470 --> 00:46:39,530 of the unnecessary call and return. 940 00:46:39,530 --> 00:46:42,840 And we have a couple multiply operations sitting in place. 941 00:46:42,840 --> 00:46:44,370 That looks fine. 942 00:46:44,370 --> 00:46:47,070 But if we expand our scope just a little bit, 943 00:46:47,070 --> 00:46:49,500 what we see is that we have some operations 944 00:46:49,500 --> 00:46:53,670 happening that were sitting there already 945 00:46:53,670 --> 00:46:56,215 after the original call. 946 00:46:56,215 --> 00:46:57,840 What the compiler can do is it can take 947 00:46:57,840 --> 00:46:59,970 a look at these instructions. 948 00:46:59,970 --> 00:47:02,940 And long story short, it realizes 949 00:47:02,940 --> 00:47:05,130 that all these instructions do is 950 00:47:05,130 --> 00:47:08,280 pack some data into a structure, and then immediately unpack 951 00:47:08,280 --> 00:47:09,690 the structure. 952 00:47:09,690 --> 00:47:12,630 So it's like you put a bunch of stuff into a bag, 953 00:47:12,630 --> 00:47:15,540 and then immediately dump out the bag. 954 00:47:15,540 --> 00:47:17,010 That was kind of a waste of time. 955 00:47:17,010 --> 00:47:18,637 That's kind of a waste of code. 956 00:47:18,637 --> 00:47:19,470 Let's get rid of it. 957 00:47:23,540 --> 00:47:24,830 Those operations are useless. 958 00:47:24,830 --> 00:47:25,580 Let's delete them. 959 00:47:25,580 --> 00:47:29,252 The compiler has a great time deleting dead code. 960 00:47:29,252 --> 00:47:30,710 It's like it's what it lives to do. 961 00:47:33,410 --> 00:47:36,410 All right, now, in fact, in the original code, 962 00:47:36,410 --> 00:47:38,090 we didn't just have one function call. 963 00:47:38,090 --> 00:47:40,340 We had a whole sequence of function calls. 964 00:47:40,340 --> 00:47:44,180 And if we expand our LLVM IR snippet even a little further, 965 00:47:44,180 --> 00:47:45,770 we can include those two function 966 00:47:45,770 --> 00:47:49,730 calls, the original call to vec add, followed by the code 967 00:47:49,730 --> 00:47:52,490 that we've now optimized by inlining, 968 00:47:52,490 --> 00:47:56,960 ultimately followed by yet another call to vec add. 969 00:47:56,960 --> 00:48:00,290 Minor spoiler, the vec add routine, once it's optimized, 970 00:48:00,290 --> 00:48:04,420 looks pretty similar to the vec scale routine. 971 00:48:04,420 --> 00:48:06,650 And, in particular, it has comparable size 972 00:48:06,650 --> 00:48:08,570 to the vector scale routine. 973 00:48:08,570 --> 00:48:11,620 So what's the compiler going to do to those two call sites? 974 00:48:20,710 --> 00:48:24,460 Inline it, do more inlining, inlining is great. 975 00:48:24,460 --> 00:48:28,840 We'll inline these functions as well, 976 00:48:28,840 --> 00:48:31,430 and then remove all of the additional, now-useless 977 00:48:31,430 --> 00:48:32,600 instructions. 978 00:48:32,600 --> 00:48:34,220 We'll walk through that process. 979 00:48:34,220 --> 00:48:37,980 The result of that process looks something like this.
980 00:48:37,980 --> 00:48:40,040 So in the original C code, we had this vec 981 00:48:40,040 --> 00:48:42,250 add, which called the vec scale as one 982 00:48:42,250 --> 00:48:44,000 of its arguments, which called the vec add 983 00:48:44,000 --> 00:48:45,500 as one of its arguments. 984 00:48:45,500 --> 00:48:48,000 And what we end up with in the optimized IR 985 00:48:48,000 --> 00:48:50,600 is just a bunch of straight line code that performs 986 00:48:50,600 --> 00:48:52,580 floating point operations. 987 00:48:52,580 --> 00:48:57,860 It's almost as if the compiler took the original C code, 988 00:48:57,860 --> 00:49:00,800 and transformed it into the equivalent C code shown 989 00:49:00,800 --> 00:49:03,740 on the bottom, where it just operates 990 00:49:03,740 --> 00:49:07,970 on a whole bunch of doubles, and just does primitive operations. 991 00:49:07,970 --> 00:49:12,230 So function inlining, as well as the additional transformations 992 00:49:12,230 --> 00:49:14,600 it was able to perform as a result, 993 00:49:14,600 --> 00:49:17,030 together those were able to eliminate 994 00:49:17,030 --> 00:49:18,360 all of those function calls. 995 00:49:18,360 --> 00:49:20,330 It was able to completely eliminate 996 00:49:20,330 --> 00:49:25,130 any costs associated with the function call abstraction, 997 00:49:25,130 --> 00:49:27,270 at least in this code. 998 00:49:27,270 --> 00:49:27,950 Make sense? 999 00:49:30,500 --> 00:49:32,060 I think that's pretty cool. 1000 00:49:32,060 --> 00:49:34,520 You write code that has a bunch of function calls, 1001 00:49:34,520 --> 00:49:37,250 because that's how you've constructed your interfaces. 1002 00:49:37,250 --> 00:49:39,500 But you're not really paying for those function calls. 1003 00:49:39,500 --> 00:49:41,210 Function calls aren't the cheapest operation 1004 00:49:41,210 --> 00:49:42,830 in the world, especially if you think 1005 00:49:42,830 --> 00:49:44,420 about everything that goes on in terms 1006 00:49:44,420 --> 00:49:47,090 of the registers and the stack. 1007 00:49:47,090 --> 00:49:50,420 But the compiler is able to avoid all of that overhead, 1008 00:49:50,420 --> 00:49:54,540 and just perform the floating point operations we care about. 1009 00:49:54,540 --> 00:49:57,380 OK, well, if function inlining is so great, 1010 00:49:57,380 --> 00:50:00,560 and it enables so many great optimizations, 1011 00:50:00,560 --> 00:50:03,248 why doesn't the compiler just inline every function call? 1012 00:50:06,320 --> 00:50:08,190 Go for it. 1013 00:50:08,190 --> 00:50:12,630 Recursion, it's really hard to inline a recursive call. 1014 00:50:12,630 --> 00:50:15,940 In general, you can't inline a function into itself, 1015 00:50:15,940 --> 00:50:17,940 although it turns out there are some exceptions. 1016 00:50:17,940 --> 00:50:20,580 So, yes, recursion creates problems 1017 00:50:20,580 --> 00:50:21,900 with function inlining. 1018 00:50:21,900 --> 00:50:23,670 Any other thoughts? 1019 00:50:23,670 --> 00:50:25,545 In the back. 1020 00:50:25,545 --> 00:50:29,505 AUDIENCE: [INAUDIBLE] 1021 00:50:38,057 --> 00:50:40,140 TAO B. SCHARDL: You're definitely on to something. 1022 00:50:40,140 --> 00:50:43,170 So we had to do a bunch of this renaming stuff 1023 00:50:43,170 --> 00:50:45,090 when we inlined the first time, and when 1024 00:50:45,090 --> 00:50:47,760 we inlined every single time. 1025 00:50:47,760 --> 00:50:51,870 And even though LLVM IR has an infinite number of registers, 1026 00:50:51,870 --> 00:50:53,760 the machine doesn't.
1027 00:50:53,760 --> 00:50:56,790 And so all of that renaming does create a problem. 1028 00:50:56,790 --> 00:50:59,370 But there are other problems as well of 1029 00:50:59,370 --> 00:51:02,770 a similar nature when you start inlining all those functions. 1030 00:51:02,770 --> 00:51:06,100 For example, you copy pasted a bunch of code. 1031 00:51:06,100 --> 00:51:09,422 And that made the original call site even bigger, and bigger, 1032 00:51:09,422 --> 00:51:10,380 and bigger, and bigger. 1033 00:51:10,380 --> 00:51:13,950 And programs, we generally don't think about the space 1034 00:51:13,950 --> 00:51:15,125 they take in memory. 1035 00:51:15,125 --> 00:51:16,500 But they do take space in memory. 1036 00:51:16,500 --> 00:51:19,120 And that does have an impact on performance. 1037 00:51:19,120 --> 00:51:22,140 So great answer, any other thoughts? 1038 00:51:25,056 --> 00:51:29,430 AUDIENCE: [INAUDIBLE] 1039 00:51:35,487 --> 00:51:37,570 TAO B. SCHARDL: If your function becomes too long, 1040 00:51:37,570 --> 00:51:39,443 then it may not fit in instruction cache. 1041 00:51:39,443 --> 00:51:41,110 And that can increase the amount of time 1042 00:51:41,110 --> 00:51:43,850 it takes just to execute the function. 1043 00:51:43,850 --> 00:51:47,367 Right, because you're now not getting cache hits, 1044 00:51:47,367 --> 00:51:47,950 exactly right. 1045 00:51:47,950 --> 00:51:50,570 That's one of the problems with this code size blow 1046 00:51:50,570 --> 00:51:52,630 up from inlining everything. 1047 00:51:52,630 --> 00:51:54,010 Any other thoughts? 1048 00:51:54,010 --> 00:51:54,810 Any final thoughts? 1049 00:52:03,290 --> 00:52:05,790 So there are three main reasons why the compiler 1050 00:52:05,790 --> 00:52:07,140 won't inline every function. 1051 00:52:07,140 --> 00:52:11,070 I think we touched on two of them here. 1052 00:52:11,070 --> 00:52:13,770 For some function calls, like recursive calls, 1053 00:52:13,770 --> 00:52:15,960 it's impossible to inline them, because you can't 1054 00:52:15,960 --> 00:52:18,450 inline a function into itself. 1055 00:52:18,450 --> 00:52:21,300 But there are exceptions to that, namely 1056 00:52:21,300 --> 00:52:22,710 recursive tail calls. 1057 00:52:22,710 --> 00:52:26,280 If the last thing in a function is a function call, 1058 00:52:26,280 --> 00:52:28,110 then it turns out you can effectively 1059 00:52:28,110 --> 00:52:31,860 inline that function call as an optimization. 1060 00:52:31,860 --> 00:52:34,680 We're not going to talk too much about how that works. 1061 00:52:34,680 --> 00:52:36,940 But there are corner cases. 1062 00:52:36,940 --> 00:52:42,120 But, in general, you can't inline a recursive call. 1063 00:52:42,120 --> 00:52:43,800 The compiler has another problem. 1064 00:52:43,800 --> 00:52:47,570 Namely, if the function that you're calling 1065 00:52:47,570 --> 00:52:50,070 is in a different castle, if it's in a different compilation 1066 00:52:50,070 --> 00:52:54,240 unit, literally in a different file 1067 00:52:54,240 --> 00:52:57,720 that's compiled independently, then the compiler 1068 00:52:57,720 --> 00:53:00,238 can't very well inline that function, 1069 00:53:00,238 --> 00:53:02,030 because it doesn't know about the function. 1070 00:53:02,030 --> 00:53:05,280 It doesn't have access to that function's code. 1071 00:53:05,280 --> 00:53:07,020 There is a way to get around that problem 1072 00:53:07,020 --> 00:53:09,750 with modern compiler technology that involves whole program 1073 00:53:09,750 --> 00:53:11,040 optimization.
1074 00:53:11,040 --> 00:53:13,440 And I think there's some backup slides that will tell you 1075 00:53:13,440 --> 00:53:16,260 how to do that with LLVM. 1076 00:53:16,260 --> 00:53:19,350 But, in general, if it's in a different compilation unit, 1077 00:53:19,350 --> 00:53:21,390 it can't be inline. 1078 00:53:21,390 --> 00:53:24,060 And, finally, as touched on, function inlining 1079 00:53:24,060 --> 00:53:28,200 can increase code size, which can hurt performance. 1080 00:53:28,200 --> 00:53:31,620 OK, so some functions are OK to inline. 1081 00:53:31,620 --> 00:53:34,110 Other functions could create this performance problem, 1082 00:53:34,110 --> 00:53:35,890 because you've increased code size. 1083 00:53:35,890 --> 00:53:38,820 So how does the compiler know whether or not 1084 00:53:38,820 --> 00:53:42,660 inlining any particular function at a call site 1085 00:53:42,660 --> 00:53:45,480 could hurt performance? 1086 00:53:45,480 --> 00:53:47,780 Any guesses? 1087 00:53:47,780 --> 00:53:48,844 Yeah? 1088 00:53:48,844 --> 00:53:52,580 AUDIENCE: [INAUDIBLE] 1089 00:53:55,975 --> 00:53:56,850 TAO B. SCHARDL: Yeah. 1090 00:53:56,850 --> 00:53:59,740 So the compiler has some cost model, which gives it 1091 00:53:59,740 --> 00:54:02,740 some information about, how much will it 1092 00:54:02,740 --> 00:54:06,370 cost to inline that function? 1093 00:54:06,370 --> 00:54:07,690 Is the cost model always right? 1094 00:54:10,560 --> 00:54:12,040 It is not. 1095 00:54:12,040 --> 00:54:15,270 So the answer, how does the compiler know, 1096 00:54:15,270 --> 00:54:17,400 is, really, it doesn't know. 1097 00:54:17,400 --> 00:54:21,210 It makes a best guess using that cost model, 1098 00:54:21,210 --> 00:54:24,000 and other heuristics, to determine, 1099 00:54:24,000 --> 00:54:27,840 when does it make sense to try to inline a function? 1100 00:54:27,840 --> 00:54:29,820 And because it's making a best guess, 1101 00:54:29,820 --> 00:54:33,490 sometimes the compiler guesses wrong. 1102 00:54:33,490 --> 00:54:35,430 So to wrap up this part, here are just 1103 00:54:35,430 --> 00:54:38,160 a couple of tips for controlling function inlining 1104 00:54:38,160 --> 00:54:39,630 in your own programs. 1105 00:54:39,630 --> 00:54:42,810 If there's a function that you know must always be inlined, 1106 00:54:42,810 --> 00:54:46,470 no matter what happens, you can mark that function 1107 00:54:46,470 --> 00:54:49,963 with a special attribute, namely the always inline attribute. 1108 00:54:49,963 --> 00:54:51,630 For example, if you have a function that 1109 00:54:51,630 --> 00:54:53,900 does some complex address calculation, 1110 00:54:53,900 --> 00:54:57,330 and it should be inlined rather than called, 1111 00:54:57,330 --> 00:55:00,413 you may want to mark that with an always inline attribute. 1112 00:55:00,413 --> 00:55:02,580 Similarly, if you have a function that really should 1113 00:55:02,580 --> 00:55:04,980 never be inlined, it's never cost effective 1114 00:55:04,980 --> 00:55:08,160 to inline that function, you can mark that function 1115 00:55:08,160 --> 00:55:11,100 with the no inline attribute. 1116 00:55:11,100 --> 00:55:15,150 And, finally, if you want to enable more function inlining 1117 00:55:15,150 --> 00:55:19,560 in the compiler, you can use link time optimization, or LTO, 1118 00:55:19,560 --> 00:55:22,380 to enable whole program optimization. 1119 00:55:22,380 --> 00:55:24,940 Won't go into that during these slides. 
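(As a concrete illustration of those tips, here is roughly what they look like in C with Clang or GCC. The function names are made up for the example; the GNU-style attribute spellings and the -flto flag are standard, but this is a sketch, not code from the lecture.)

    /* Force inlining of a small helper, e.g. a complex address calculation. */
    static inline __attribute__((always_inline))
    double *elem_ptr(double *base, int row, int col, int ncols) {
        return &base[(long)row * ncols + col];
    }

    /* Forbid inlining of a function that is never cost effective to inline. */
    __attribute__((noinline))
    void report_error(const char *msg);

    /* Compile and link with -flto to enable link-time (whole-program)
       optimization, which allows inlining across compilation units. */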
1120 00:55:24,940 --> 00:55:28,170 Let's move on, and talk about loop optimizations. 1121 00:55:28,170 --> 00:55:31,590 Any questions so far, before continue? 1122 00:55:31,590 --> 00:55:32,213 Yeah? 1123 00:55:32,213 --> 00:55:35,460 AUDIENCE: [INAUDIBLE] 1124 00:55:35,460 --> 00:55:36,688 TAO B. SCHARDL: Sorry? 1125 00:55:36,688 --> 00:55:40,520 AUDIENCE: [INAUDIBLE] 1126 00:55:42,773 --> 00:55:44,190 TAO B. SCHARDL: Does static inline 1127 00:55:44,190 --> 00:55:47,100 guarantee you the compiler will always inline it? 1128 00:55:47,100 --> 00:55:49,440 It actually doesn't. 1129 00:55:49,440 --> 00:55:54,420 The inline keyword will provide a hint to the compiler 1130 00:55:54,420 --> 00:55:56,700 that it should think about inlining the function. 1131 00:55:56,700 --> 00:55:58,890 But it doesn't provide any guarantees. 1132 00:55:58,890 --> 00:56:01,230 If you want a strong guarantee, use the always inline 1133 00:56:01,230 --> 00:56:03,048 attribute. 1134 00:56:03,048 --> 00:56:03,965 Good question, though. 1135 00:56:08,060 --> 00:56:10,967 All right, loop optimizations-- 1136 00:56:10,967 --> 00:56:12,800 you've already seen some loop optimizations. 1137 00:56:12,800 --> 00:56:17,010 You've seen vectorization, for example. 1138 00:56:17,010 --> 00:56:19,400 It turns out, the compiler does a lot of work 1139 00:56:19,400 --> 00:56:21,590 to try to optimize loops. 1140 00:56:21,590 --> 00:56:24,230 So first, why is that? 1141 00:56:24,230 --> 00:56:27,890 Why would the compiler engineers invest so much effort 1142 00:56:27,890 --> 00:56:30,480 into optimizing loops? 1143 00:56:30,480 --> 00:56:32,218 Why loops in particular? 1144 00:56:42,470 --> 00:56:44,640 They're extremely common control structure 1145 00:56:44,640 --> 00:56:47,310 that also has a branch. 1146 00:56:47,310 --> 00:56:48,930 Both things are true. 1147 00:56:48,930 --> 00:56:52,710 I think there's a higher level reason, though, 1148 00:56:52,710 --> 00:56:55,854 or more fundamental reason, if you will. 1149 00:56:55,854 --> 00:56:56,788 Yeah? 1150 00:56:56,788 --> 00:57:00,787 AUDIENCE: Most of the time, the loop takes up the most time. 1151 00:57:00,787 --> 00:57:02,870 TAO B. SCHARDL: Most of the time the loop takes up 1152 00:57:02,870 --> 00:57:04,070 the most time. 1153 00:57:04,070 --> 00:57:05,120 You got it. 1154 00:57:05,120 --> 00:57:09,830 Loops account for a lot of the execution time of programs. 1155 00:57:09,830 --> 00:57:12,050 The way I like to think about this 1156 00:57:12,050 --> 00:57:14,270 is with a really simple thought experiment. 1157 00:57:14,270 --> 00:57:16,790 Let's imagine that you've got a machine with a two gigahertz 1158 00:57:16,790 --> 00:57:17,360 processor. 1159 00:57:17,360 --> 00:57:19,670 We've chosen these values to be easier 1160 00:57:19,670 --> 00:57:23,413 to think about using mental math. 1161 00:57:23,413 --> 00:57:24,830 Suppose you've got a two gigahertz 1162 00:57:24,830 --> 00:57:26,870 processor with 16 cores. 1163 00:57:26,870 --> 00:57:29,570 Each core executes one instruction per cycle. 1164 00:57:29,570 --> 00:57:32,120 And suppose you've got a program which 1165 00:57:32,120 --> 00:57:35,900 contains a trillion instructions and ample parallelism 1166 00:57:35,900 --> 00:57:37,490 for those 16 cores. 1167 00:57:37,490 --> 00:57:41,560 But all of those instructions are simple, straight line code. 1168 00:57:41,560 --> 00:57:42,900 There are no branches. 1169 00:57:42,900 --> 00:57:43,850 There are no loops. 
1170 00:57:43,850 --> 00:57:46,760 There are no complicated operations like I/O. 1171 00:57:46,760 --> 00:57:50,180 It's just a bunch of really simple straight line code. 1172 00:57:50,180 --> 00:57:52,310 Each instruction takes a cycle to execute. 1173 00:57:52,310 --> 00:57:56,060 The processor executes one instruction per cycle. 1174 00:57:56,060 --> 00:58:01,640 How long does it take to run this code, to execute 1175 00:58:01,640 --> 00:58:04,175 the entire terabyte binary? 1176 00:58:15,740 --> 00:58:19,770 2 to the 40th cycles for 2 to the 40th instructions. 1177 00:58:19,770 --> 00:58:24,610 But you're using a two gigahertz processor and 16 cores. 1178 00:58:24,610 --> 00:58:26,650 And you've got ample parallelism in the program 1179 00:58:26,650 --> 00:58:28,930 to keep them all saturated. 1180 00:58:28,930 --> 00:58:30,304 So how much time? 1181 00:58:35,174 --> 00:58:38,110 AUDIENCE: 32 seconds. 1182 00:58:38,110 --> 00:58:43,210 TAO B. SCHARDL: 32 seconds, nice job. 1183 00:58:43,210 --> 00:58:47,620 Someone here has mastered power-of-2 arithmetic in their head. 1184 00:58:47,620 --> 00:58:50,860 It's a good skill to have, especially in Course 6. 1185 00:58:50,860 --> 00:58:53,770 Yeah, so if you have just a bunch of simple, 1186 00:58:53,770 --> 00:58:57,610 straight line code, and you have a terabyte of it. 1187 00:58:57,610 --> 00:58:58,690 That's a lot of code. 1188 00:58:58,690 --> 00:59:01,330 That is a big binary. 1189 00:59:01,330 --> 00:59:04,035 And, yet, the program, this processor, 1190 00:59:04,035 --> 00:59:05,410 this relatively simple processor, 1191 00:59:05,410 --> 00:59:08,980 can execute the whole thing in just about 30 seconds. 1192 00:59:08,980 --> 00:59:11,290 Now, in your experience working with software, 1193 00:59:11,290 --> 00:59:12,880 you might have noticed that there 1194 00:59:12,880 --> 00:59:17,480 are some programs that take longer than 30 seconds to run. 1195 00:59:17,480 --> 00:59:22,420 And some of those programs don't have terabyte size binaries. 1196 00:59:22,420 --> 00:59:25,720 The reason that those programs take longer to run, 1197 00:59:25,720 --> 00:59:27,760 by and large, is loops. 1198 00:59:27,760 --> 00:59:30,580 So loops account for a lot of the execution 1199 00:59:30,580 --> 00:59:31,960 time in real programs. 1200 00:59:34,718 --> 00:59:36,760 Now, you've already seen some loop optimizations. 1201 00:59:36,760 --> 00:59:38,802 We're just going to take a look at one other loop 1202 00:59:38,802 --> 00:59:42,040 optimization today, namely code hoisting, also known 1203 00:59:42,040 --> 00:59:44,360 as loop invariant code motion. 1204 00:59:44,360 --> 00:59:46,540 To look at that, we're going to take 1205 00:59:46,540 --> 00:59:48,370 a look at a different snippet of code 1206 00:59:48,370 --> 00:59:50,500 from the n-body simulation. 1207 00:59:50,500 --> 00:59:53,860 This code calculates the forces acting 1208 00:59:53,860 --> 00:59:55,980 on each of the n bodies. 1209 00:59:55,980 --> 00:59:58,810 And it does it with a doubly nested loop. 1210 00:59:58,810 --> 01:00:01,943 For all i from zero to the number of bodies, 1211 01:00:01,943 --> 01:00:03,610 and for all j from zero to the number of bodies, as long 1212 01:00:03,610 --> 01:00:05,470 as you're not looking at the same body, 1213 01:00:05,470 --> 01:00:10,210 call this add force routine, which calls calculate force to 1214 01:00:10,210 --> 01:00:13,690 calculate the force between those two bodies. 1215 01:00:13,690 --> 01:00:16,600 And add that force to one of the bodies.
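(A rough C sketch of that doubly nested loop follows. The type body_t and the helpers calculate_force and add_force are assumed names standing in for the lecture's actual code.)

    typedef struct { double x, y; } vec_t;                          /* assumed 2-D vector */
    typedef struct { vec_t pos, vel, force; double mass; } body_t;  /* assumed body type */

    vec_t calculate_force(const body_t *a, const body_t *b);  /* assumed helper */
    void add_force(body_t *b, vec_t f);                       /* assumed helper */

    void calculate_forces(int nbodies, body_t *bodies) {
        for (int i = 0; i < nbodies; ++i) {
            for (int j = 0; j < nbodies; ++j) {
                if (i == j) continue;  /* skip computing a body's force on itself */
                add_force(&bodies[i], calculate_force(&bodies[i], &bodies[j]));
            }
        }
    }

(Note that the address calculation &bodies[i] sits inside the inner loop even though it only depends on i; that detail matters in a moment.)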
1216 01:00:16,600 --> 01:00:19,810 That's all that's going on in this code. 1217 01:00:19,810 --> 01:00:22,330 If we translate this code into LLVM IR, 1218 01:00:22,330 --> 01:00:25,810 we end up with, hopefully unsurprisingly, 1219 01:00:25,810 --> 01:00:28,210 a doubly nested loop. 1220 01:00:28,210 --> 01:00:29,510 It looks something like this. 1221 01:00:29,510 --> 01:00:31,930 The body of the code, the body of the innermost loop, 1222 01:00:31,930 --> 01:00:35,170 has been elided, just so things can fit on the slide. 1223 01:00:35,170 --> 01:00:37,900 But we can see the overall structure. 1224 01:00:37,900 --> 01:00:41,070 On the outside, we have some outer loop control. 1225 01:00:41,070 --> 01:00:45,010 This should look familiar from lecture five, hopefully. 1226 01:00:45,010 --> 01:00:48,278 Inside of that outer loop, we have an inner loop. 1227 01:00:48,278 --> 01:00:50,320 And at the top and the bottom of that inner loop, 1228 01:00:50,320 --> 01:00:52,420 we have the inner loop control. 1229 01:00:52,420 --> 01:00:54,670 And within that inner loop, we do 1230 01:00:54,670 --> 01:00:57,190 have one branch, which can skip a bunch of code 1231 01:00:57,190 --> 01:01:01,930 if you're looking at the same body for i and j. 1232 01:01:01,930 --> 01:01:06,130 But, otherwise, we have the loop body of the innermost loop, 1233 01:01:06,130 --> 01:01:08,590 basic structure. 1234 01:01:08,590 --> 01:01:11,290 Now, if we just zoom in on the top part 1235 01:01:11,290 --> 01:01:15,910 of this doubly-nested loop, just the topmost three basic blocks, 1236 01:01:15,910 --> 01:01:19,240 take a look at more of the code that's going on here, 1237 01:01:19,240 --> 01:01:22,200 we end up with something that looks like this. 1238 01:01:22,200 --> 01:01:23,950 And if you remember some of the discussion 1239 01:01:23,950 --> 01:01:26,680 from lecture five about the loop induction variables, 1240 01:01:26,680 --> 01:01:29,830 and what that looks like in LLVM IR, what you find 1241 01:01:29,830 --> 01:01:32,710 is that for the outer loop we have an induction variable 1242 01:01:32,710 --> 01:01:33,430 at the very top. 1243 01:01:33,430 --> 01:01:37,270 It's that weird phi instruction, once again. 1244 01:01:37,270 --> 01:01:39,640 Inside that outer loop, we have the loop control 1245 01:01:39,640 --> 01:01:43,090 for the inner loop, which has its own induction variable. 1246 01:01:43,090 --> 01:01:44,800 Once again, we have another phi node. 1247 01:01:44,800 --> 01:01:46,750 That's how we can spot it. 1248 01:01:46,750 --> 01:01:50,360 And then we have the body of the innermost loop. 1249 01:01:50,360 --> 01:01:51,610 And this is just the start of it. 1250 01:01:51,610 --> 01:01:54,260 It's just a couple of address calculations. 1251 01:01:54,260 --> 01:01:56,920 But can anyone tell me some interesting property 1252 01:01:56,920 --> 01:02:00,370 about just a couple of these address calculations 1253 01:02:00,370 --> 01:02:02,532 that could lead to an optimization? 1254 01:02:05,400 --> 01:02:07,670 AUDIENCE: [INAUDIBLE] 1255 01:02:07,670 --> 01:02:10,070 TAO B. SCHARDL: The first two address calculations only 1256 01:02:10,070 --> 01:02:14,600 depend on the outermost loop variable, the iteration 1257 01:02:14,600 --> 01:02:18,920 variable for the outer loop, exactly right. 1258 01:02:18,920 --> 01:02:21,614 So what can we do with those instructions? 1259 01:02:31,460 --> 01:02:33,260 Bring them out of the inner loop.
1260 01:02:33,260 --> 01:02:35,840 Why should we keep computing these addresses 1261 01:02:35,840 --> 01:02:38,750 in the innermost loop when we could just compute them once 1262 01:02:38,750 --> 01:02:40,460 in the outer loop? 1263 01:02:40,460 --> 01:02:45,120 That optimization is called code hoisting, or loop invariant 1264 01:02:45,120 --> 01:02:46,110 code motion. 1265 01:02:46,110 --> 01:02:48,260 Those instructions are invariant to the code 1266 01:02:48,260 --> 01:02:49,400 in the innermost loop. 1267 01:02:49,400 --> 01:02:51,430 So you hoist them out. 1268 01:02:51,430 --> 01:02:53,210 And once you hoist them out, you end up 1269 01:02:53,210 --> 01:02:57,260 with a transformed loop that looks something like this. 1270 01:02:57,260 --> 01:03:01,040 What we have is the same outer loop control at the very top. 1271 01:03:01,040 --> 01:03:04,410 But now, we're doing some address calculations there. 1272 01:03:04,410 --> 01:03:06,620 And we no longer have those address calculations 1273 01:03:06,620 --> 01:03:07,320 on the inside. 1274 01:03:10,310 --> 01:03:13,100 And as a result, those hoisted calculations 1275 01:03:13,100 --> 01:03:17,150 are performed just once per iteration of the outer loop, 1276 01:03:17,150 --> 01:03:20,590 rather than once per iteration of the inner loop. 1277 01:03:20,590 --> 01:03:23,110 And so those instructions are run far fewer times. 1278 01:03:23,110 --> 01:03:24,860 You get to save a lot of running time. 1279 01:03:28,450 --> 01:03:29,920 So the effect of this optimization 1280 01:03:29,920 --> 01:03:31,337 in terms of C code, because it can 1281 01:03:31,337 --> 01:03:34,080 be a little tedious to look at LLVM IR, 1282 01:03:34,080 --> 01:03:35,590 is essentially like this. 1283 01:03:35,590 --> 01:03:38,580 We took this doubly-nested loop in C. 1284 01:03:38,580 --> 01:03:43,390 We're calling add force of blah, blah, blah, calculate force, 1285 01:03:43,390 --> 01:03:44,480 blah, blah, blah. 1286 01:03:44,480 --> 01:03:48,340 And now, we just move the address calculation 1287 01:03:48,340 --> 01:03:51,130 to get the ith body that we care about. 1288 01:03:51,130 --> 01:03:53,710 We move that to the outer loop. 1289 01:03:53,710 --> 01:03:56,410 Now, this was an example of loop invariant code motion on just 1290 01:03:56,410 --> 01:03:57,790 a couple address calculations. 1291 01:03:57,790 --> 01:04:00,400 In general, the compiler will try 1292 01:04:00,400 --> 01:04:04,630 to prove that some calculation is invariant across all 1293 01:04:04,630 --> 01:04:05,680 the iterations of a loop. 1294 01:04:05,680 --> 01:04:07,120 And whenever it can prove that, it 1295 01:04:07,120 --> 01:04:10,030 will try to hoist that code out of the loop. 1296 01:04:10,030 --> 01:04:13,210 If it can get code out of the body of a loop, 1297 01:04:13,210 --> 01:04:15,250 that reduces the running time of the loop, 1298 01:04:15,250 --> 01:04:16,960 saves a lot of execution time. 1299 01:04:16,960 --> 01:04:20,550 Huge bang for the buck. 1300 01:04:20,550 --> 01:04:21,160 Make sense? 1301 01:04:21,160 --> 01:04:25,130 Any questions about that so far? 1302 01:04:25,130 --> 01:04:27,190 All right, so just to summarize this part, 1303 01:04:27,190 --> 01:04:28,600 what can the compiler do? 1304 01:04:28,600 --> 01:04:31,480 The compiler optimizes code by performing a sequence 1305 01:04:31,480 --> 01:04:33,100 of transformation passes. 1306 01:04:33,100 --> 01:04:35,680 All those passes are pretty mechanical. 1307 01:04:35,680 --> 01:04:37,570 The compiler goes through the code.
1308 01:04:37,570 --> 01:04:40,675 It tries to find some property, like this address calculation 1309 01:04:40,675 --> 01:04:43,120 is the same as that address calculation. 1310 01:04:43,120 --> 01:04:46,620 And so this load will return the same value as that store, 1311 01:04:46,620 --> 01:04:47,620 and so on, and so forth. 1312 01:04:47,620 --> 01:04:49,840 And based on that analysis, it tries 1313 01:04:49,840 --> 01:04:55,180 to get rid of some dead code, and replace certain register 1314 01:04:55,180 --> 01:04:57,323 values with other register values, 1315 01:04:57,323 --> 01:04:59,240 replace things that live in memory with things 1316 01:04:59,240 --> 01:05:00,900 that just live in registers. 1317 01:05:00,900 --> 01:05:04,660 A lot of the transformations resemble Bentley-rule work 1318 01:05:04,660 --> 01:05:06,610 optimizations that you've seen in lecture two. 1319 01:05:06,610 --> 01:05:08,650 So as you're studying for your upcoming quiz, 1320 01:05:08,650 --> 01:05:10,960 you can kind of get two for one by looking 1321 01:05:10,960 --> 01:05:15,410 at those Bentley-rule optimizations. 1322 01:05:15,410 --> 01:05:18,430 And one transformation pass, in particular function inlining, 1323 01:05:18,430 --> 01:05:19,660 was a good example of this. 1324 01:05:19,660 --> 01:05:22,630 One transformation can enable other transformations. 1325 01:05:22,630 --> 01:05:26,627 And those together can compound to give you fast code. 1326 01:05:26,627 --> 01:05:28,960 In general, compilers perform a lot more transformations 1327 01:05:28,960 --> 01:05:30,650 than just the ones we saw today. 1328 01:05:30,650 --> 01:05:33,310 But there are things that the compiler can't do. 1329 01:05:33,310 --> 01:05:34,750 Here's one very simple example. 1330 01:05:37,025 --> 01:05:38,650 In this case, we're taking another look 1331 01:05:38,650 --> 01:05:40,900 at this calculate forces routine. 1332 01:05:40,900 --> 01:05:44,740 Although the compiler can optimize the code 1333 01:05:44,740 --> 01:05:47,050 by moving address calculations out of the loop, 1334 01:05:47,050 --> 01:05:50,350 one thing that it can't do is exploit symmetry 1335 01:05:50,350 --> 01:05:51,630 in the problem. 1336 01:05:51,630 --> 01:05:54,100 So in this problem, what's going on 1337 01:05:54,100 --> 01:05:57,130 is we're computing the forces on any pair of bodies 1338 01:05:57,130 --> 01:05:59,350 using the law of gravitation. 1339 01:05:59,350 --> 01:06:03,940 And it turns out that the force acting on one body by another 1340 01:06:03,940 --> 01:06:07,210 is exactly the opposite of the force acting on the other body 1341 01:06:07,210 --> 01:06:08,610 by the one. 1342 01:06:08,610 --> 01:06:12,910 So F of 1, 2 is equal to minus F of 2, 1. 1343 01:06:12,910 --> 01:06:15,610 The compiler will not figure that out. 1344 01:06:15,610 --> 01:06:17,230 The compiler knows algebra. 1345 01:06:17,230 --> 01:06:18,760 It doesn't know physics. 1346 01:06:18,760 --> 01:06:20,370 So it won't be able to figure out 1347 01:06:20,370 --> 01:06:21,980 that there's symmetry in this problem 1348 01:06:21,980 --> 01:06:26,880 that it could use to avoid wasted operations. 1349 01:06:26,880 --> 01:06:27,490 Make sense? 1350 01:06:29,933 --> 01:06:31,350 All right, so that was an overview 1351 01:06:31,350 --> 01:06:33,600 of some simple compiler optimizations. 1352 01:06:33,600 --> 01:06:38,460 We now have some examples of some case studies 1353 01:06:38,460 --> 01:06:42,080 to see where the compiler can get tripped up.
1354 01:06:42,080 --> 01:06:44,580 And it doesn't matter if we get through all of these or not. 1355 01:06:44,580 --> 01:06:46,450 You'll have access to the slides afterwards. 1356 01:06:46,450 --> 01:06:47,908 But I think these are kind of cool. 1357 01:06:47,908 --> 01:06:48,960 So shall we take a look? 1358 01:06:52,950 --> 01:06:58,200 Simple question-- does the compiler vectorize this loop? 1359 01:07:04,290 --> 01:07:08,720 So just to go over what this loop does, it's a simple loop. 1360 01:07:08,720 --> 01:07:13,100 The function takes two vectors as inputs, 1361 01:07:13,100 --> 01:07:15,470 or two arrays as inputs, I should say-- 1362 01:07:15,470 --> 01:07:21,920 an array called y, of length n, and an array x of length n, 1363 01:07:21,920 --> 01:07:24,230 and some scalar value a. 1364 01:07:24,230 --> 01:07:26,090 And all that this function does is 1365 01:07:26,090 --> 01:07:30,200 it loops over each element of the vector, multiplies x of i 1366 01:07:30,200 --> 01:07:34,790 by the input scalar, adds the product into y of i. 1367 01:07:34,790 --> 01:07:36,380 So does the loop vectorize? 1368 01:07:36,380 --> 01:07:37,270 Yes? 1369 01:07:37,270 --> 01:07:41,500 AUDIENCE: [INAUDIBLE] 1370 01:07:42,920 --> 01:07:44,578 TAO B. SCHARDL: y and x could overlap. 1371 01:07:44,578 --> 01:07:46,870 And there is no information about whether they overlap. 1372 01:07:46,870 --> 01:07:49,520 So does it vectorize? 1373 01:07:49,520 --> 01:07:51,990 We have a vote for no. 1374 01:07:51,990 --> 01:07:55,860 Anyone think that it does vectorize? 1375 01:07:55,860 --> 01:07:57,360 You made a very convincing argument. 1376 01:07:57,360 --> 01:08:04,850 So everyone believes that this loop does not vectorize. 1377 01:08:04,850 --> 01:08:07,590 Is that true? 1378 01:08:07,590 --> 01:08:10,860 Anyone uncertain? 1379 01:08:10,860 --> 01:08:14,220 Anyone unwilling to commit to yes or no right here? 1380 01:08:16,402 --> 01:08:18,569 All right, a bunch of people are unwilling to commit 1381 01:08:18,569 --> 01:08:19,319 to yes or no. 1382 01:08:19,319 --> 01:08:21,990 All right, let's resolve this question. 1383 01:08:21,990 --> 01:08:23,740 Let's first ask for the report. 1384 01:08:23,740 --> 01:08:26,590 Let's look at the vectorization report. 1385 01:08:26,590 --> 01:08:27,390 We compile it. 1386 01:08:27,390 --> 01:08:29,490 We pass the flags to get the vectorization report. 1387 01:08:29,490 --> 01:08:33,750 And the vectorization report says, yes, it 1388 01:08:33,750 --> 01:08:37,590 does vectorize this loop, which is interesting, 1389 01:08:37,590 --> 01:08:40,460 because we have this great argument that says, 1390 01:08:40,460 --> 01:08:44,060 but you don't know how these addresses fit in memory. 1391 01:08:44,060 --> 01:08:46,920 You don't know if x and y overlap with each other. 1392 01:08:46,920 --> 01:08:50,160 How can you possibly vectorize? 1393 01:08:50,160 --> 01:08:52,720 Kind of a mystery. 1394 01:08:52,720 --> 01:08:57,540 Well, if we take a look at the actual compiled code when we 1395 01:08:57,540 --> 01:09:01,210 optimize this at -O2, turns out you can pass certain flags 1396 01:09:01,210 --> 01:09:04,590 to the compiler, and get it to print out not just the LLVM IR, 1397 01:09:04,590 --> 01:09:08,490 but the LLVM IR formatted as a control flow graph.
1398 01:09:08,490 --> 01:09:13,200 And the control flow graph for this simple two line function 1399 01:09:13,200 --> 01:09:17,609 is the thing on the right, which you obviously 1400 01:09:17,609 --> 01:09:20,819 can't read, because it's a little bit 1401 01:09:20,819 --> 01:09:22,319 small, in terms of its text. 1402 01:09:22,319 --> 01:09:26,520 And it seems to have a lot going on. 1403 01:09:26,520 --> 01:09:29,130 So I took the liberty of redrawing that control flow 1404 01:09:29,130 --> 01:09:32,520 graph with none of the code inside, 1405 01:09:32,520 --> 01:09:35,010 just to get a picture of what the structure looks 1406 01:09:35,010 --> 01:09:37,740 like for this compiled function. 1407 01:09:37,740 --> 01:09:42,130 And, structurally speaking, it looks like this. 1408 01:09:42,130 --> 01:09:45,312 And with a bit of practice staring at control flow graphs, 1409 01:09:45,312 --> 01:09:47,729 which you might get if you spend way too much time working 1410 01:09:47,729 --> 01:09:50,819 on compilers, you might look at this control flow graph, 1411 01:09:50,819 --> 01:09:55,020 and think, this graph looks a little too complicated 1412 01:09:55,020 --> 01:09:59,010 for the two line function that we gave as input. 1413 01:09:59,010 --> 01:10:02,170 So what's going on here? 1414 01:10:02,170 --> 01:10:04,783 Well, we've got three different loops in this code. 1415 01:10:04,783 --> 01:10:06,450 And it turns out that one of those loops 1416 01:10:06,450 --> 01:10:08,910 is full of vector operations. 1417 01:10:08,910 --> 01:10:13,100 OK, the other two loops are not full of vector operations. 1418 01:10:13,100 --> 01:10:15,480 That's unvectorized code. 1419 01:10:15,480 --> 01:10:17,190 And then there's this basic block right 1420 01:10:17,190 --> 01:10:20,460 at the top that has a conditional branch 1421 01:10:20,460 --> 01:10:23,460 at the end of it, branching to either the vectorized loop 1422 01:10:23,460 --> 01:10:24,960 or the unvectorized loop. 1423 01:10:24,960 --> 01:10:27,280 And, yeah, there's a lot of other control flow going on 1424 01:10:27,280 --> 01:10:27,780 as well. 1425 01:10:27,780 --> 01:10:32,610 But we can focus on just these components for the time being. 1426 01:10:32,610 --> 01:10:35,910 So what's that conditional branch doing? 1427 01:10:35,910 --> 01:10:38,400 Well, we can zoom in on just this one basic block, 1428 01:10:38,400 --> 01:10:43,590 and actually show it to be readable on the slide. 1429 01:10:43,590 --> 01:10:46,830 And the basic block looks like this. 1430 01:10:46,830 --> 01:10:49,530 So let's just study this LLVM IR code. 1431 01:10:49,530 --> 01:10:54,320 In this case, we have got the address of y stored in register 1432 01:10:54,320 --> 01:10:56,940 0. The address of x is stored in register 2. 1433 01:10:56,940 --> 01:10:59,290 And register 3 stores the value of n. 1434 01:10:59,290 --> 01:11:01,200 So one instruction at a time, who 1435 01:11:01,200 --> 01:11:05,010 can tell me what the first instruction in this code does? 1436 01:11:05,010 --> 01:11:06,286 Yes? 1437 01:11:06,286 --> 01:11:09,640 AUDIENCE: [INAUDIBLE] 1438 01:11:09,640 --> 01:11:11,455 TAO B. SCHARDL: Gets the address of y. 1439 01:11:14,263 --> 01:11:15,560 Is that what you said? 1440 01:11:19,090 --> 01:11:21,130 So it does use the address of y. 1441 01:11:21,130 --> 01:11:24,790 It's an address calculation that operates on register 0, which 1442 01:11:24,790 --> 01:11:26,320 stores the address of y. 1443 01:11:26,320 --> 01:11:31,302 But it's not just computing the address of y.
1444 01:11:31,302 --> 01:11:33,628 AUDIENCE: [INAUDIBLE] 1445 01:11:33,628 --> 01:11:35,420 TAO B. SCHARDL: It's getting me the address 1446 01:11:35,420 --> 01:11:36,830 of the nth element of y. 1447 01:11:36,830 --> 01:11:40,010 It's adding in whatever is in register 3, which is the value 1448 01:11:40,010 --> 01:11:42,860 n, into the address of y. 1449 01:11:42,860 --> 01:11:46,100 So that computes the address y plus n. 1450 01:11:46,100 --> 01:11:50,130 This is testing your memory of pointer arithmetic 1451 01:11:50,130 --> 01:11:52,460 in C just a little bit. 1452 01:11:52,460 --> 01:11:53,420 But don't worry. 1453 01:11:53,420 --> 01:11:55,070 It won't be too rough. 1454 01:11:55,070 --> 01:11:57,290 So that's what the first address calculation does. 1455 01:11:57,290 --> 01:11:59,875 What does the next instruction do? 1456 01:11:59,875 --> 01:12:02,150 AUDIENCE: It does x plus n. 1457 01:12:02,150 --> 01:12:04,388 TAO B. SCHARDL: That computes x plus n, very good. 1458 01:12:04,388 --> 01:12:06,778 How about the next one? 1459 01:12:12,992 --> 01:12:16,440 AUDIENCE: It compares whether x plus n and y plus n 1460 01:12:16,440 --> 01:12:18,880 are the same. 1461 01:12:18,880 --> 01:12:22,785 TAO B. SCHARDL: It compares x plus n, versus y plus n. 1462 01:12:22,785 --> 01:12:29,250 AUDIENCE: [INAUDIBLE] compares the 33, which is x plus n, 1463 01:12:29,250 --> 01:12:30,660 and compares it to y. 1464 01:12:30,660 --> 01:12:35,590 So if x plus n is bigger than y, there's overlap. 1465 01:12:35,590 --> 01:12:37,930 TAO B. SCHARDL: Right, so it does a comparison. 1466 01:12:37,930 --> 01:12:40,030 We'll take that a little more slowly. 1467 01:12:40,030 --> 01:12:42,490 It does a comparison of x plus n versus y, and checks: 1468 01:12:42,490 --> 01:12:44,290 is x plus n greater than y? 1469 01:12:44,290 --> 01:12:45,430 Perfect. 1470 01:12:45,430 --> 01:12:47,644 How about the next instruction? 1471 01:12:51,572 --> 01:12:53,050 Yeah? 1472 01:12:53,050 --> 01:12:55,698 AUDIENCE: It compares y plus n versus x. 1473 01:12:55,698 --> 01:12:57,240 TAO B. SCHARDL: It compares y plus n, 1474 01:12:57,240 --> 01:12:59,930 versus x, is y plus n even greater than x. 1475 01:12:59,930 --> 01:13:02,476 How about the last instruction before the branch? 1476 01:13:14,335 --> 01:13:14,960 Yep, go for it? 1477 01:13:14,960 --> 01:13:16,220 AUDIENCE: [INAUDIBLE] 1478 01:13:16,220 --> 01:13:19,420 TAO B. SCHARDL: [INAUDIBLE] one of the results. 1479 01:13:19,420 --> 01:13:22,430 So this computes the comparison, is x plus n 1480 01:13:22,430 --> 01:13:23,930 greater than y, bit-wise ANDed with, 1481 01:13:23,930 --> 01:13:28,330 is y plus n greater than x. 1482 01:13:28,330 --> 01:13:29,840 Fair enough. 1483 01:13:29,840 --> 01:13:31,850 So what does the result of that condition mean? 1484 01:13:31,850 --> 01:13:34,700 I think we've pretty much already spoiled the answer. 1485 01:13:34,700 --> 01:13:36,910 Anyone want to hear it one last time? 1486 01:13:40,326 --> 01:13:42,766 We had this whole setup. 1487 01:13:45,710 --> 01:13:46,242 Go for it. 1488 01:13:46,242 --> 01:13:47,200 AUDIENCE: They overlap. 1489 01:13:47,200 --> 01:13:49,218 TAO B. SCHARDL: Checks if they overlap. 1490 01:13:49,218 --> 01:13:51,010 So let's look at this condition in a couple 1491 01:13:51,010 --> 01:13:52,430 of different situations.
1492 01:13:52,430 --> 01:13:55,210 If we have x living in one place in memory, 1493 01:13:55,210 --> 01:13:57,790 and y living in another place in memory, 1494 01:13:57,790 --> 01:14:02,770 then no matter how we evaluate this condition, 1495 01:14:02,770 --> 01:14:05,740 if we check whether both y plus n is greater than x, 1496 01:14:05,740 --> 01:14:11,300 and x plus n is greater than y, the result will be false. 1497 01:14:11,300 --> 01:14:15,380 But if we have this situation, where 1498 01:14:15,380 --> 01:14:20,600 x and y overlap in some portion of memory, 1499 01:14:20,600 --> 01:14:23,210 then it turns out that regardless of whether x or y is 1500 01:14:23,210 --> 01:14:25,910 first, x plus n will be greater than y, and y 1501 01:14:25,910 --> 01:14:28,040 plus n will be greater than x. 1502 01:14:28,040 --> 01:14:30,060 And the condition will return true. 1503 01:14:30,060 --> 01:14:32,090 In other words, the condition returns true, 1504 01:14:32,090 --> 01:14:35,960 if and only if these portions of memory pointed to by x and y 1505 01:14:35,960 --> 01:14:38,470 alias. 1506 01:14:38,470 --> 01:14:41,240 So going back to our original looping code, 1507 01:14:41,240 --> 01:14:44,810 we have a situation where we have a branch based on 1508 01:14:44,810 --> 01:14:46,280 whether or not they alias. 1509 01:14:46,280 --> 01:14:50,900 And in one case, it executes the vectorized loop. 1510 01:14:50,900 --> 01:14:55,190 And in another case, it executes the non-vectorized code. 1511 01:14:55,190 --> 01:14:57,620 In particular, it executes the vectorized loop 1512 01:14:57,620 --> 01:15:01,030 if they don't alias. 1513 01:15:01,030 --> 01:15:04,130 So, returning to our original question, 1514 01:15:04,130 --> 01:15:06,590 does this code get vectorized? 1515 01:15:06,590 --> 01:15:09,800 The answer is yes and no. 1516 01:15:09,800 --> 01:15:12,780 So if you voted yes, you're actually right. 1517 01:15:12,780 --> 01:15:15,950 If you voted no, and you were persuaded, you were right. 1518 01:15:15,950 --> 01:15:18,960 And if you didn't commit to an answer, I can't help you. 1519 01:15:21,472 --> 01:15:22,430 But that's interesting. 1520 01:15:22,430 --> 01:15:27,560 The compiler actually generated multiple versions of this loop, 1521 01:15:27,560 --> 01:15:30,110 due to uncertainty about memory aliasing. 1522 01:15:30,110 --> 01:15:31,422 Yeah, question? 1523 01:15:31,422 --> 01:15:36,342 AUDIENCE: [INAUDIBLE] 1524 01:15:47,180 --> 01:15:49,520 TAO B. SCHARDL: So the question is, could the compiler 1525 01:15:49,520 --> 01:15:52,010 figure out this condition statically 1526 01:15:52,010 --> 01:15:53,630 while it's compiling the function? 1527 01:15:53,630 --> 01:15:55,463 Because we know the function is going to get 1528 01:15:55,463 --> 01:15:57,950 called from somewhere. 1529 01:15:57,950 --> 01:16:01,100 The answer is, sometimes it can. 1530 01:16:01,100 --> 01:16:03,200 A lot of times it can't. 1531 01:16:03,200 --> 01:16:05,370 If it's not capable of inlining this function, 1532 01:16:05,370 --> 01:16:08,660 for example, then it probably doesn't have enough information 1533 01:16:08,660 --> 01:16:11,848 to tell whether or not these two pointers will alias. 1534 01:16:11,848 --> 01:16:13,640 For example, you're just building a library 1535 01:16:13,640 --> 01:16:17,417 with a bunch of vector routines. 1536 01:16:17,417 --> 01:16:19,250 You don't know the code that's going to call 1537 01:16:19,250 --> 01:16:23,090 this routine eventually.
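(In C terms, the multiversioning the compiler produced for that two-line loop amounts to something like the sketch below. The function name scale_add is an assumption, and the no-overlap branch is written as an ordinary loop here, standing in for the version the compiler actually emits with vector instructions.)

    /* Sketch of the runtime aliasing check the compiler generated;
       names and structure are illustrative, not the exact emitted code. */
    void scale_add(double *y, double *x, double a, int n) {
        if (x + n > y && y + n > x) {
            /* The arrays may overlap: run the safe, unvectorized loop. */
            for (int i = 0; i < n; ++i)
                y[i] += a * x[i];
        } else {
            /* Provably no overlap: run the vectorized version
               (conceptually the same loop, executed with vector instructions). */
            for (int i = 0; i < n; ++i)
                y[i] += a * x[i];
        }
    }

(Annotating the pointers with restrict, as discussed next, can let the compiler drop that runtime check and the extra loop versions.)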
1538 01:16:23,090 --> 01:16:25,080 Now, in general, memory aliasing, 1539 01:16:25,080 --> 01:16:28,010 this will be the last point before we wrap up, in general, 1540 01:16:28,010 --> 01:16:30,925 memory aliasing can cause a lot of issues 1541 01:16:30,925 --> 01:16:32,550 when it comes to compiler optimization. 1542 01:16:32,550 --> 01:16:36,320 It can cause the compiler to act very conservatively. 1543 01:16:36,320 --> 01:16:39,470 In this example, we have a simple serial base case 1544 01:16:39,470 --> 01:16:41,555 for a matrix multiply routine. 1545 01:16:41,555 --> 01:16:43,430 But we don't know anything about the pointers 1546 01:16:43,430 --> 01:16:46,400 to the C, A, or B matrices. 1547 01:16:46,400 --> 01:16:48,620 And when we try to compile this and optimize it, 1548 01:16:48,620 --> 01:16:52,130 the compiler complains that it can't do loop invariant code 1549 01:16:52,130 --> 01:16:55,310 motion, because it doesn't know anything about these pointers. 1550 01:16:55,310 --> 01:16:58,310 It could be that the pointer changes 1551 01:16:58,310 --> 01:16:59,480 within the innermost loop. 1552 01:16:59,480 --> 01:17:02,120 So it can't move some calculation out 1553 01:17:02,120 --> 01:17:02,930 to an outer loop. 1554 01:17:05,760 --> 01:17:10,070 Compilers try to deal with this statically using an analysis 1555 01:17:10,070 --> 01:17:12,600 technique called alias analysis. 1556 01:17:12,600 --> 01:17:14,960 And they do try very hard to figure out, 1557 01:17:14,960 --> 01:17:18,740 when are these pointers going to alias? 1558 01:17:18,740 --> 01:17:22,280 Or when are they guaranteed to not alias? 1559 01:17:22,280 --> 01:17:25,220 Now, in general, it turns out that alias analysis 1560 01:17:25,220 --> 01:17:26,150 isn't just hard. 1561 01:17:26,150 --> 01:17:27,470 It's undecidable. 1562 01:17:27,470 --> 01:17:30,940 If only it were hard, maybe we'd have some hope. 1563 01:17:30,940 --> 01:17:32,930 But compilers, in practice, are faced 1564 01:17:32,930 --> 01:17:34,460 with this undecidable question. 1565 01:17:34,460 --> 01:17:37,550 And they try a variety of tricks to get useful alias analysis 1566 01:17:37,550 --> 01:17:38,870 results in practice. 1567 01:17:38,870 --> 01:17:42,570 For example, based on information in the source code, 1568 01:17:42,570 --> 01:17:44,960 the compiler might annotate instructions 1569 01:17:44,960 --> 01:17:48,860 with various metadata to track this aliasing information. 1570 01:17:48,860 --> 01:17:54,140 For example, TBAA is aliasing information based on types. 1571 01:17:54,140 --> 01:17:57,092 There's some scoping information for aliasing. 1572 01:17:57,092 --> 01:17:58,550 There is some information that says 1573 01:17:58,550 --> 01:18:01,640 it's guaranteed not to alias with this other operation, 1574 01:18:01,640 --> 01:18:03,080 all kinds of metadata. 1575 01:18:03,080 --> 01:18:04,580 Now, what can you do as a programmer 1576 01:18:04,580 --> 01:18:08,330 to avoid these issues of memory aliasing? 1577 01:18:08,330 --> 01:18:10,850 Always annotate your pointers, kids. 1578 01:18:10,850 --> 01:18:13,310 Always annotate your pointers. 1579 01:18:13,310 --> 01:18:15,170 The restrict keyword you've seen before. 1580 01:18:15,170 --> 01:18:18,730 It tells the compiler, address calculations based off 1581 01:18:18,730 --> 01:18:21,830 this pointer won't alias with address calculations 1582 01:18:21,830 --> 01:18:23,670 based off other pointers. 1583 01:18:23,670 --> 01:18:26,110 The const keyword provides a little more information. 
1584 01:18:26,110 --> 01:18:29,740 It says, these addresses will only be read from. 1585 01:18:29,740 --> 01:18:31,700 They won't be written to. 1586 01:18:31,700 --> 01:18:35,030 And that can enable a lot more compiler optimizations. 1587 01:18:35,030 --> 01:18:36,830 Now, that's all the time that we have. 1588 01:18:36,830 --> 01:18:39,950 There are a couple of other cool case studies in the slides. 1589 01:18:39,950 --> 01:18:42,390 You're welcome to peruse the slides afterwards. 1590 01:18:42,390 --> 01:18:44,490 Thanks for listening.