1 00:00:15,480 --> 00:00:17,090 PETER SZOLOVITS: So today I'm going 2 00:00:17,090 --> 00:00:20,010 to talk about precision medicine. 3 00:00:20,010 --> 00:00:22,800 And we don't really have a very precise idea 4 00:00:22,800 --> 00:00:25,080 of what precision medicine is. 5 00:00:25,080 --> 00:00:26,880 And so I'm going to start by talking 6 00:00:26,880 --> 00:00:29,370 about that a little bit. 7 00:00:29,370 --> 00:00:31,920 David talked about disease subtyping. 8 00:00:31,920 --> 00:00:35,280 And if you think about how do you figure out 9 00:00:35,280 --> 00:00:38,250 what are the subtypes of a disease, 10 00:00:38,250 --> 00:00:40,350 you do it by some kind of clustering 11 00:00:40,350 --> 00:00:42,910 on a bunch of different sorts of data. 12 00:00:42,910 --> 00:00:45,990 And so we have data like demographics, comorbidities, 13 00:00:45,990 --> 00:00:48,870 vital signs, medications, procedures, 14 00:00:48,870 --> 00:00:54,330 disease trajectories, whatever those mean, image similarities. 15 00:00:54,330 --> 00:00:57,000 And today, mostly I'm going to focus on genetics. 16 00:00:57,000 --> 00:01:01,020 Because this was the great hope of the Human Genome Project, 17 00:01:01,020 --> 00:01:04,830 that as we understood more about the genetic influences 18 00:01:04,830 --> 00:01:08,220 on disease, it would help us create 19 00:01:08,220 --> 00:01:12,390 precise ways of dealing with various diseases 20 00:01:12,390 --> 00:01:16,470 and figuring out the right therapies for them and so on. 21 00:01:16,470 --> 00:01:19,560 So I want to start by reviewing a little bit 22 00:01:19,560 --> 00:01:24,390 a study that was done by the National Research Council, 23 00:01:24,390 --> 00:01:28,470 so the National Academies, and it's called "Toward Precision 24 00:01:28,470 --> 00:01:30,010 Medicine." 25 00:01:30,010 --> 00:01:34,080 This was fairly recent, 2017. 26 00:01:34,080 --> 00:01:36,150 And they have some interesting observations. 
27 00:01:36,150 --> 00:01:39,750 So they start off and they say, well, 28 00:01:39,750 --> 00:01:43,380 why is this relevant now, when it may not 29 00:01:43,380 --> 00:01:45,420 have been relevant before? 30 00:01:45,420 --> 00:01:48,330 And of course, the biggie is new capabilities 31 00:01:48,330 --> 00:01:51,870 to compile molecular data on patients on a scale that 32 00:01:51,870 --> 00:01:54,370 was unimaginable 20 years ago. 33 00:01:54,370 --> 00:01:57,960 So people estimated that getting the first human genome 34 00:01:57,960 --> 00:01:59,370 cost about $3 billion. 35 00:02:02,070 --> 00:02:07,560 Today, getting a human genome costs less than $1,000. 36 00:02:07,560 --> 00:02:11,730 I have some figures later in the talk showing some of the ads 37 00:02:11,730 --> 00:02:14,520 that people are running. 38 00:02:14,520 --> 00:02:17,460 Increasing success in utilizing molecular information 39 00:02:17,460 --> 00:02:19,830 to improve diagnosis and treatment, 40 00:02:19,830 --> 00:02:21,780 we'll talk about some of those. 41 00:02:21,780 --> 00:02:27,720 Advances in IT so that we have bigger capabilities of dealing 42 00:02:27,720 --> 00:02:30,690 with so-called big data-- 43 00:02:30,690 --> 00:02:34,170 a perfect storm among stakeholders 44 00:02:34,170 --> 00:02:36,270 that has made them much more receptive 45 00:02:36,270 --> 00:02:38,320 to this kind of information. 46 00:02:38,320 --> 00:02:42,630 So the fact that costs in the health-care system in the US 47 00:02:42,630 --> 00:02:46,920 keep rising and quality doesn't keep rising proportionately 48 00:02:46,920 --> 00:02:48,960 makes everybody desperate to come up 49 00:02:48,960 --> 00:02:52,600 with new ways of dealing with this problem. 50 00:02:52,600 --> 00:02:58,620 And so this looks like the next great hope for how to do it. 51 00:02:58,620 --> 00:03:01,780 And shifting public attitudes toward molecular data-- 52 00:03:01,780 --> 00:03:05,980 so how many of you have seen the movie Gattaca? 
53 00:03:05,980 --> 00:03:06,970 A few. 54 00:03:06,970 --> 00:03:10,090 So that's a dystopian view of what 55 00:03:10,090 --> 00:03:13,960 happens when people are genotyped and can therefore 56 00:03:13,960 --> 00:03:16,720 be tracked by their genetics. 57 00:03:16,720 --> 00:03:20,620 And it is true that there are horror stories that can happen. 58 00:03:20,620 --> 00:03:25,600 But nevertheless, people seem to be more relaxed today 59 00:03:25,600 --> 00:03:29,620 about allowing that kind of data to be collected and used. 60 00:03:29,620 --> 00:03:34,630 Because they see the potential benefits outweighing the costs. 61 00:03:34,630 --> 00:03:37,990 Not everybody-- but that continues 62 00:03:37,990 --> 00:03:41,270 to be a serious issue. 63 00:03:41,270 --> 00:03:44,680 So this report goes on and says, you know, 64 00:03:44,680 --> 00:03:47,680 let's think about how to integrate 65 00:03:47,680 --> 00:03:51,190 all kinds of different data about individuals. 66 00:03:51,190 --> 00:03:53,170 And they start off and they say, you know, 67 00:03:53,170 --> 00:03:57,280 one good example of this has been Google Maps. 68 00:03:57,280 --> 00:04:00,350 So Google Maps has a coordinate system, 69 00:04:00,350 --> 00:04:03,040 which is basically longitude and latitude, 70 00:04:03,040 --> 00:04:05,140 for every point on the Earth. 71 00:04:05,140 --> 00:04:08,800 And they can use that coordinate system in order 72 00:04:08,800 --> 00:04:13,720 to layer on top of each other information about postal codes, 73 00:04:13,720 --> 00:04:17,950 built structures, census tracts, land use, transportation, 74 00:04:17,950 --> 00:04:19,870 everything. 75 00:04:19,870 --> 00:04:22,420 And they said, wow, this is really cool, if only we 76 00:04:22,420 --> 00:04:25,240 could do this in health care. 
77 00:04:25,240 --> 00:04:28,950 And so their vision is to try to do that in health care 78 00:04:28,950 --> 00:04:31,620 by saying, well, what corresponds 79 00:04:31,620 --> 00:04:36,390 to latitude and longitude is individual patients. 80 00:04:36,390 --> 00:04:39,720 And these individual patients have various kinds 81 00:04:39,720 --> 00:04:42,990 of data about them, including their microbiome, 82 00:04:42,990 --> 00:04:47,910 their epigenome, their genome, clinical signs and symptoms, 83 00:04:47,910 --> 00:04:52,090 the exposome, what are they exposed to. 84 00:04:52,090 --> 00:04:54,630 And so there's been a real attempt 85 00:04:54,630 --> 00:04:59,100 to go out and create large collections of data 86 00:04:59,100 --> 00:05:03,160 that bring together all of this kind of information. 87 00:05:03,160 --> 00:05:07,560 One of those that is notable is the Department of Health-- 88 00:05:07,560 --> 00:05:11,280 well, NIH basically started a project about a year 89 00:05:11,280 --> 00:05:16,050 and a half ago called All of Us, sounds sort of menacing. 90 00:05:16,050 --> 00:05:19,320 But it's really a million of us. 91 00:05:19,320 --> 00:05:23,160 And they have asked institutions around the United States 92 00:05:23,160 --> 00:05:26,970 to get volunteers to volunteer to provide 93 00:05:26,970 --> 00:05:31,690 their genetic information, their clinical data, where they live, 94 00:05:31,690 --> 00:05:34,110 where they commute, things like that, 95 00:05:34,110 --> 00:05:37,500 so that they can get environmental data about them. 96 00:05:37,500 --> 00:05:41,790 And then it's meant to be an ongoing collection of data 97 00:05:41,790 --> 00:05:44,850 about a million people who are supposed 98 00:05:44,850 --> 00:05:48,360 to be a representative sample of the United States. 
99 00:05:48,360 --> 00:05:50,550 So you'll see in some of the projects I talk 100 00:05:50,550 --> 00:05:53,310 about later today that many of the studies 101 00:05:53,310 --> 00:05:57,130 have been done in populations of European ancestry. 102 00:05:57,130 --> 00:06:00,930 And so they may not apply to people of other ethnicities. 103 00:06:00,930 --> 00:06:05,370 This is attempting to sample accurately so 104 00:06:05,370 --> 00:06:09,120 that the fraction of African Americans and Asians 105 00:06:09,120 --> 00:06:12,570 and Hispanics and so on corresponds 106 00:06:12,570 --> 00:06:16,180 to the sample in the United States population. 107 00:06:16,180 --> 00:06:17,950 There's a long history. 108 00:06:17,950 --> 00:06:21,120 How many of you have heard of the Framingham Heart Study? 109 00:06:21,120 --> 00:06:22,350 So a lot of people. 110 00:06:22,350 --> 00:06:25,710 So Framingham, in the 1940s, agreed 111 00:06:25,710 --> 00:06:29,370 to become the subject of a long-term experiment. 112 00:06:29,370 --> 00:06:35,340 I think it's now run by Boston University, where every year 113 00:06:35,340 --> 00:06:40,520 or two they go out and they survey-- 114 00:06:40,520 --> 00:06:42,510 I can't remember the number of people. 115 00:06:42,510 --> 00:06:47,370 It started as something like 50,000 people-- 116 00:06:47,370 --> 00:06:49,920 about their habits and whether they smoke, 117 00:06:49,920 --> 00:06:52,020 and what their weight and height is, 118 00:06:52,020 --> 00:06:54,870 and any clinical problems they've had, 119 00:06:54,870 --> 00:06:57,300 surgical procedures, et cetera. 120 00:06:57,300 --> 00:06:59,580 And they've been collecting that database now 121 00:06:59,580 --> 00:07:05,160 over several generations of people that descend from those. 122 00:07:05,160 --> 00:07:08,260 And they've started collecting genetic data as well. 123 00:07:08,260 --> 00:07:12,810 So All of Us is really doing this on a very large scale. 
124 00:07:12,810 --> 00:07:16,560 Now, the vision of this study 125 00:07:16,560 --> 00:07:20,370 was to say that we're going to build this information 126 00:07:20,370 --> 00:07:24,270 commons, which collects all this kind of information, 127 00:07:24,270 --> 00:07:29,100 and then we're going to develop knowledge from that information 128 00:07:29,100 --> 00:07:30,780 or from that data. 129 00:07:30,780 --> 00:07:33,900 And that knowledge will become the substrate on which 130 00:07:33,900 --> 00:07:36,340 biomedical research can rest. 131 00:07:36,340 --> 00:07:39,510 So if we find significant associations, 132 00:07:39,510 --> 00:07:41,310 then that suggests that one should 133 00:07:41,310 --> 00:07:44,190 do studies, which will not necessarily 134 00:07:44,190 --> 00:07:46,950 be answered by the data that they've collected. 135 00:07:46,950 --> 00:07:49,890 You may have to grow knock-out mice or something in order 136 00:07:49,890 --> 00:07:52,560 to test whether an idea really works. 137 00:07:52,560 --> 00:07:54,660 But this is a way of integrating all 138 00:07:54,660 --> 00:07:56,920 of that type of information. 139 00:07:56,920 --> 00:07:59,530 And of course, it can affect diagnosis, treatment, 140 00:07:59,530 --> 00:08:02,490 and health outcomes, which are the holy grail for what 141 00:08:02,490 --> 00:08:05,880 you'd like to do in medicine. 142 00:08:05,880 --> 00:08:09,060 Now, here's an interesting problem. 143 00:08:09,060 --> 00:08:14,310 So the focus, notice, was on taxonomies. 144 00:08:14,310 --> 00:08:21,540 So Sam Johnson was a very famous 18th century British writer. 145 00:08:21,540 --> 00:08:26,310 And he built encyclopedias and dictionaries, 146 00:08:26,310 --> 00:08:30,690 and was a poet and a reviewer and a commentator, 147 00:08:30,690 --> 00:08:34,200 and did all kinds of fancy things. 
148 00:08:34,200 --> 00:08:36,900 And one of his quotes is, "My diseases 149 00:08:36,900 --> 00:08:40,380 are an asthma and a dropsy and, what is less curable, 150 00:08:40,380 --> 00:08:42,820 75," years old. 151 00:08:42,820 --> 00:08:46,380 So he was funny, too. 152 00:08:46,380 --> 00:08:52,020 Now, if you look up dropsy in a dictionary-- 153 00:08:52,020 --> 00:08:55,370 how many of you have heard of dropsy? 154 00:08:55,370 --> 00:08:56,570 A couple. 155 00:08:56,570 --> 00:08:59,094 So how did you hear of it? 156 00:08:59,094 --> 00:09:01,310 AUDIENCE: From Jane Austen novels. 157 00:09:01,310 --> 00:09:01,810 [LAUGHS] 158 00:09:01,810 --> 00:09:02,768 PETER SZOLOVITS: Sorry? 159 00:09:02,768 --> 00:09:03,470 From a novel? 160 00:09:03,470 --> 00:09:04,490 AUDIENCE: Novels. 161 00:09:04,490 --> 00:09:05,407 PETER SZOLOVITS: Yeah. 162 00:09:05,407 --> 00:09:08,043 AUDIENCE: I've heard of dropsy [INAUDIBLE].. 163 00:09:08,043 --> 00:09:08,960 PETER SZOLOVITS: Yeah. 164 00:09:08,960 --> 00:09:12,380 I mean, I learned about it by watching Masterpiece Theatre 165 00:09:12,380 --> 00:09:18,200 with 19th century people, where the grandmother would take 166 00:09:18,200 --> 00:09:21,260 to her bed with the dropsy. 167 00:09:21,260 --> 00:09:23,870 And it didn't turn out well, typically. 168 00:09:23,870 --> 00:09:27,050 But it took a long time for those people to die. 169 00:09:27,050 --> 00:09:31,010 So dropsy is water sickness, swelling, edema, et cetera. 170 00:09:31,010 --> 00:09:32,910 It's actually not a disease. 171 00:09:32,910 --> 00:09:35,720 It's a symptom of a whole bunch of diseases. 172 00:09:35,720 --> 00:09:39,540 So it could be pulmonary disease, heart failure, 173 00:09:39,540 --> 00:09:42,200 kidney disease, et cetera. 174 00:09:42,200 --> 00:09:43,220 And it's interesting. 175 00:09:43,220 --> 00:09:44,360 I look back on this. 176 00:09:44,360 --> 00:09:47,840 I couldn't find it for putting together this lecture. 
177 00:09:47,840 --> 00:09:52,190 But at one point, I did discover that the last time dropsy was 178 00:09:52,190 --> 00:09:56,060 listed as the cause of death of a patient in the United States 179 00:09:56,060 --> 00:09:58,520 was in 1949. 180 00:09:58,520 --> 00:10:04,100 So since then, it's disappeared as a disease from the taxonomy. 181 00:10:04,100 --> 00:10:07,070 And if you talk to pulmonary people, 182 00:10:07,070 --> 00:10:09,950 they suspect that asthma, which is still 183 00:10:09,950 --> 00:10:13,770 a disease in our current lexicon, 184 00:10:13,770 --> 00:10:15,890 may be very much like dropsy. 185 00:10:15,890 --> 00:10:17,250 It's not a disease. 186 00:10:17,250 --> 00:10:21,050 It's a symptom of a whole bunch of underlying causes. 187 00:10:21,050 --> 00:10:22,940 And the idea is that we need to get 188 00:10:22,940 --> 00:10:27,110 good enough and precise enough at being able to figure out 189 00:10:27,110 --> 00:10:29,670 what these are. 190 00:10:29,670 --> 00:10:35,300 So I talked to my friend Zack Kohane at Harvard 191 00:10:35,300 --> 00:10:38,780 a few weeks ago when I started preparing this lecture. 192 00:10:38,780 --> 00:10:41,480 And he has the following idea. 193 00:10:41,480 --> 00:10:46,080 And the example I'm going to show you is from him. 194 00:10:46,080 --> 00:10:48,110 So he says, well, look, we should 195 00:10:48,110 --> 00:10:50,930 have this precision medicine modality 196 00:10:50,930 --> 00:10:54,920 space, which is this high-dimensional space that 197 00:10:54,920 --> 00:11:01,160 contains all of that information that is in the NRC report. 198 00:11:01,160 --> 00:11:07,880 And then what we do is, in this high-dimensional space, 199 00:11:07,880 --> 00:11:11,630 if we're lucky, we're going to find clusters of data. 200 00:11:11,630 --> 00:11:14,030 So this always happens. 
201 00:11:14,030 --> 00:11:17,960 If you ever take a very high-dimensional data set 202 00:11:17,960 --> 00:11:21,800 and put it into its very high-dimensional representation 203 00:11:21,800 --> 00:11:24,950 space, it's almost never the case 204 00:11:24,950 --> 00:11:30,320 that the data is scattered uniformly through the space. 205 00:11:30,320 --> 00:11:33,350 If that were true, it wouldn't help us very much. 206 00:11:33,350 --> 00:11:35,450 But generally, it's not true. 207 00:11:35,450 --> 00:11:37,160 And what you find is that the data 208 00:11:37,160 --> 00:11:40,790 tends to be on lower-dimensional manifolds. 209 00:11:40,790 --> 00:11:43,650 So it's in subsets of the space. 210 00:11:43,650 --> 00:11:45,800 And so a lot of the trick in trying 211 00:11:45,800 --> 00:11:49,040 to analyze this kind of data is figuring out 212 00:11:49,040 --> 00:11:53,000 what those lower-dimensional manifolds look like. 213 00:11:53,000 --> 00:11:58,220 And often you will find among a very large data set a cluster 214 00:11:58,220 --> 00:12:00,920 of patients like this. 215 00:12:00,920 --> 00:12:06,020 And then Zack's approach is to say, well, if your patient-- 216 00:12:06,020 --> 00:12:08,810 it's hard to represent three dimensions in two. 217 00:12:08,810 --> 00:12:11,780 But if your patient falls somewhere in the middle of such 218 00:12:11,780 --> 00:12:14,420 a cluster, then that probably means 219 00:12:14,420 --> 00:12:17,630 that they're kind of normal for that cluster, 220 00:12:17,630 --> 00:12:21,500 whereas if they fall somewhere at the edge of such a cluster, 221 00:12:21,500 --> 00:12:23,660 that probably means that there's something odd 222 00:12:23,660 --> 00:12:26,900 going on that is worth investigating, because they're 223 00:12:26,900 --> 00:12:28,820 unusual. 224 00:12:28,820 --> 00:12:34,370 So then he gave me an example of a patient of his. 225 00:12:34,370 --> 00:12:37,570 And let me give you a minute to read this. 
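[Editor's sketch, not part of the lecture.] The idea of a patient sitting in the middle of a cluster versus at its edge can be illustrated with a small amount of code. This is a minimal sketch under stated assumptions: patients are plain numeric vectors, the cluster is already known, "edge" is measured as Euclidean distance from the cluster centroid, and the function names and sample points are hypothetical. It is not the method used by Zack Kohane or described in the NRC report.

```python
import math

def centroid(points):
    """Mean position of a cluster, one coordinate per dimension."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

def distance(a, b):
    """Plain Euclidean distance (an assumed choice of distance function)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def outlier_scores(points):
    """Z-score of each point's distance from the cluster centroid.

    Scores near or below zero suggest a point that is typical for the
    cluster; large positive scores suggest a point near the edge.
    """
    c = centroid(points)
    dists = [distance(p, c) for p in points]
    mean = sum(dists) / len(dists)
    sd = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [(d - mean) / sd if sd > 0 else 0.0 for d in dists]

# Hypothetical two-dimensional "patients"; the last one sits far from the rest.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (3.0, 3.2)]
scores = outlier_scores(cluster)
most_unusual = scores.index(max(scores))
```

Where to draw the line between "typical" and "unusual" is a judgment call, since the scores form a continuum rather than two clean groups.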
226 00:12:50,902 --> 00:12:51,402 Yeah? 227 00:12:51,402 --> 00:12:53,460 AUDIENCE: What's an armamentarium? 228 00:12:56,893 --> 00:12:58,935 PETER SZOLOVITS: Where does it say armamentarium? 229 00:12:58,935 --> 00:12:59,730 AUDIENCE: [INAUDIBLE] 230 00:12:59,730 --> 00:13:00,813 PETER SZOLOVITS: Oh, yeah. 231 00:13:00,813 --> 00:13:03,810 So an armamentarium, historically, 232 00:13:03,810 --> 00:13:07,230 is the set of arms that are available to an army. 233 00:13:07,230 --> 00:13:10,560 So this is the set of treatments that are available to a doctor. 234 00:13:13,050 --> 00:13:15,008 AUDIENCE: Is that the only word you don't know? 235 00:13:15,008 --> 00:13:17,378 [LAUGHTER] 236 00:13:17,378 --> 00:13:18,330 It's the only word-- 237 00:13:18,330 --> 00:13:19,960 AUDIENCE: If I start asking-- 238 00:13:19,960 --> 00:13:21,252 AUDIENCE: Based on [INAUDIBLE]. 239 00:13:21,252 --> 00:13:22,218 AUDIENCE: Oh, OK. 240 00:13:22,218 --> 00:13:24,150 AUDIENCE: In the world. 241 00:13:24,150 --> 00:13:26,570 Some of it, I thought I could understand. 242 00:13:26,570 --> 00:13:28,028 PETER SZOLOVITS: Well, you probably 243 00:13:28,028 --> 00:13:29,750 know what antibiotics are. 244 00:13:29,750 --> 00:13:33,440 And immunosuppressants, you've probably heard of. 245 00:13:33,440 --> 00:13:37,820 Anyway, it's a bunch of different therapies. 246 00:13:37,820 --> 00:13:41,270 So this is what's called a sick puppy. 247 00:13:41,270 --> 00:13:44,900 It's a kid who is not doing well. 248 00:13:44,900 --> 00:13:49,220 They started life, at age three, with ulcerative colitis, 249 00:13:49,220 --> 00:13:52,010 which was well-controlled by the kinds of medications 250 00:13:52,010 --> 00:13:56,090 that they normally give people with that disease. 251 00:13:56,090 --> 00:13:59,570 And then all of a sudden, 10 years later, 252 00:13:59,570 --> 00:14:03,710 he breaks out with this horrible abdominal pain and diarrhea 253 00:14:03,710 --> 00:14:07,490 and blood in his stool. 
254 00:14:07,490 --> 00:14:12,717 And they try a bunch of stuff that they think ought to work, 255 00:14:12,717 --> 00:14:13,550 and it doesn't work. 256 00:14:16,220 --> 00:14:23,030 So the kid was facing some fairly drastic options, 257 00:14:23,030 --> 00:14:28,410 like cutting out the part of his colon that was inflamed. 258 00:14:28,410 --> 00:14:31,950 So your colon is an important part of your digestive tract. 259 00:14:31,950 --> 00:14:36,770 And so losing it is not fun and would have bad consequences 260 00:14:36,770 --> 00:14:40,520 for the rest of his life. 261 00:14:40,520 --> 00:14:46,850 But what they did is they said, well, 262 00:14:46,850 --> 00:14:51,790 why is he not responding to any of these therapies? 263 00:14:51,790 --> 00:14:57,330 And the difficulty, you can imagine, 264 00:14:57,330 --> 00:15:01,050 in that cloud-of-points picture, is, 265 00:15:01,050 --> 00:15:04,860 how do you figure out whether the person is an outlier 266 00:15:04,860 --> 00:15:07,820 or is in the middle of one of these clusters, 267 00:15:07,820 --> 00:15:09,870 when it depends on a lot of things? 268 00:15:09,870 --> 00:15:13,560 In this kid's case, what it depended on most significantly 269 00:15:13,560 --> 00:15:16,920 was the last six months of his experience, 270 00:15:16,920 --> 00:15:21,980 where, before, he was doing OK with the standard treatment. 271 00:15:21,980 --> 00:15:24,830 So that cloud might have represented people 272 00:15:24,830 --> 00:15:27,650 with ulcerative colitis who were well-controlled 273 00:15:27,650 --> 00:15:29,330 by the standard treatment. 274 00:15:29,330 --> 00:15:33,360 And now, all of a sudden, he becomes an outlier. 275 00:15:33,360 --> 00:15:38,850 So what happened in this case is they said, 276 00:15:38,850 --> 00:15:42,170 well, maybe there are different groups 277 00:15:42,170 --> 00:15:44,420 of ulcerative colitis patients. 
278 00:15:44,420 --> 00:15:48,080 So maybe there are ones who have a lifelong remission 279 00:15:48,080 --> 00:15:51,260 after treatment with a commonly used monoclonal antibody. 280 00:15:51,260 --> 00:15:56,090 So that's the center of the cluster. 281 00:15:56,090 --> 00:15:59,540 Maybe there are people who have multi-year remission 282 00:15:59,540 --> 00:16:03,230 but become refractory to these drugs. 283 00:16:03,230 --> 00:16:09,200 And after other treatments, they have to undergo a colectomy. 284 00:16:09,200 --> 00:16:12,560 So that's the removal of the colon. 285 00:16:12,560 --> 00:16:15,590 And then there are people who have, initially, a remission, 286 00:16:15,590 --> 00:16:18,230 but then no standard therapy works. 287 00:16:18,230 --> 00:16:22,550 So that's what this kid is in, this cluster. 288 00:16:22,550 --> 00:16:29,470 So how do you treat this as a machine learning problem 289 00:16:29,470 --> 00:16:31,750 from the point of view of having lots of data 290 00:16:31,750 --> 00:16:34,060 about lots of different patients? 291 00:16:34,060 --> 00:16:38,560 And the challenges, of course, include things like, 292 00:16:38,560 --> 00:16:42,250 what's your distance function in doing the kind of clustering 293 00:16:42,250 --> 00:16:44,650 that people typically do? 294 00:16:44,650 --> 00:16:47,920 How do you define what an outlier is? 295 00:16:47,920 --> 00:16:51,280 Because there's always a continuum where it just 296 00:16:51,280 --> 00:16:54,190 gets more and more diffuse. 297 00:16:54,190 --> 00:16:57,610 What's the best representation for time-varying data, 298 00:16:57,610 --> 00:17:00,160 which is critical in this case? 299 00:17:00,160 --> 00:17:04,069 What's the optimal weighting or normalization of dimensions? 300 00:17:04,069 --> 00:17:07,630 So does every dimension in this very high-dimensional space 301 00:17:07,630 --> 00:17:09,040 count the same? 
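[Editor's sketch, not part of the lecture.] The questions about distance functions, normalization, and weighting of dimensions can be made concrete. The sketch below assumes two common but by no means unique choices: z-score normalization of each feature column, and a Euclidean distance with a per-dimension weight. The function names and the toy feature values are hypothetical.

```python
import math

def standardize(rows):
    """Z-score each column so features on different scales become comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def weighted_distance(a, b, weights):
    """Euclidean distance in which each dimension counts in proportion to its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Two hypothetical patients described by two features on very different scales.
raw = [[1.0, 100.0], [3.0, 300.0]]
z = standardize(raw)

equal = weighted_distance(z[0], z[1], [1.0, 1.0])        # both dimensions count the same
second_only = weighted_distance(z[0], z[1], [0.0, 1.0])  # first dimension ignored
```

Changing the weights changes which patients look like neighbors, which is one reason the right weighting probably differs from problem to problem.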
302 00:17:09,040 --> 00:17:11,380 Or are differences along certain dimensions 303 00:17:11,380 --> 00:17:13,990 more important than those among others? 304 00:17:13,990 --> 00:17:17,900 And does that, in fact, vary from problem to problem? 305 00:17:17,900 --> 00:17:21,160 The answer is probably yes. 306 00:17:21,160 --> 00:17:25,300 So how do we find the neighborhood for the patient? 307 00:17:25,300 --> 00:17:29,110 Well, I'm going to give you some clues 308 00:17:29,110 --> 00:17:32,590 by starting with a shallow dive into genetics. 309 00:17:32,590 --> 00:17:37,390 So if you've taken a molecular cell biology class, 310 00:17:37,390 --> 00:17:39,080 this should not be news to you. 311 00:17:39,080 --> 00:17:41,320 And I'm going to run through it pretty quickly. 312 00:17:41,320 --> 00:17:43,480 If you haven't, then I hope at least 313 00:17:43,480 --> 00:17:46,120 you'll pick up some of the vocabulary. 314 00:17:46,120 --> 00:17:49,240 So a wise biologist said, "Biology 315 00:17:49,240 --> 00:17:51,670 is the science of exceptions." 316 00:17:51,670 --> 00:17:53,990 There are almost no rules. 317 00:17:53,990 --> 00:17:57,820 About 25 years ago, the biology department 318 00:17:57,820 --> 00:18:02,590 here taught a special class for engineering faculty 319 00:18:02,590 --> 00:18:05,320 to try to explain to us what they 320 00:18:05,320 --> 00:18:08,150 were teaching in their introductory biology, 321 00:18:08,150 --> 00:18:10,240 molecular biology classes. 322 00:18:10,240 --> 00:18:12,400 And I remember, I was sitting next 323 00:18:12,400 --> 00:18:14,800 to Jerry Sussman, one of my colleagues. 
324 00:18:14,800 --> 00:18:19,150 And after we heard some lecture about the 47 ways 325 00:18:19,150 --> 00:18:23,460 that some theory doesn't apply in many, many cases, 326 00:18:23,460 --> 00:18:25,780 Jerry turns to me and he says, you know, 327 00:18:25,780 --> 00:18:28,540 the problem with this field is there are just too many damned 328 00:18:28,540 --> 00:18:29,950 exceptions. 329 00:18:29,950 --> 00:18:31,990 There are no theories. 330 00:18:31,990 --> 00:18:34,540 It's all exceptions. 331 00:18:34,540 --> 00:18:39,310 And so even biologists recognize this. 332 00:18:39,310 --> 00:18:43,570 Now, people have observed, ever since human beings walked 333 00:18:43,570 --> 00:18:45,610 the earth, that children tend to be 334 00:18:45,610 --> 00:18:49,030 similar to their parents in many ways. 335 00:18:49,030 --> 00:18:53,470 And until Gregor Mendel, this was a great mystery. 336 00:18:53,470 --> 00:18:57,077 Why is it that you are like your parents? 337 00:18:57,077 --> 00:18:58,660 I mean, you must have gotten something 338 00:18:58,660 --> 00:19:01,750 from them that sort of carries through and makes 339 00:19:01,750 --> 00:19:03,790 you similar to them. 340 00:19:03,790 --> 00:19:06,880 So Mendel had this notion of having 341 00:19:06,880 --> 00:19:10,810 discrete factors of inheritance, which he called genes. 342 00:19:10,810 --> 00:19:13,060 He had no idea what these were. 343 00:19:13,060 --> 00:19:17,080 But conceptually, he knew that they must exist. 344 00:19:17,080 --> 00:19:20,860 And then he did a bunch of experiments on pea plants, 345 00:19:20,860 --> 00:19:23,920 showing that peas that are wrinkled 346 00:19:23,920 --> 00:19:28,330 tend to have offspring peas that are also wrinkled. 347 00:19:28,330 --> 00:19:33,010 And he worked out the genetics of what we now 348 00:19:33,010 --> 00:19:36,430 call Mendelian inheritance, namely 349 00:19:36,430 --> 00:19:41,710 dominant versus recessive inheritance patterns. 
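[Editor's sketch, not part of the lecture.] Dominant versus recessive inheritance can be worked through as a small enumeration. The sketch assumes a single gene with two alleles, written "R" (dominant, round peas) and "r" (recessive, wrinkled peas), and assumes each parent passes on either allele with equal chance. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Count offspring genotypes over every pairing of one allele from each parent."""
    offspring = Counter()
    for a, b in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first, so "rR" becomes "Rr".
        genotype = "".join(sorted(a + b))
        offspring[genotype] += 1
    return offspring

def phenotype(genotype, dominant="R"):
    """One copy of the dominant allele is enough to show the dominant trait."""
    return "round" if dominant in genotype else "wrinkled"

# Cross two hybrid plants, each carrying one round and one wrinkled allele.
genotypes = cross("Rr", "Rr")
phenotypes = Counter()
for g, count in genotypes.items():
    phenotypes[phenotype(g)] += count
```

Under these assumptions the four equally likely pairings give three round offspring for every wrinkled one, and wrinkled peas crossed with wrinkled peas yield only wrinkled offspring.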
350 00:19:41,710 --> 00:19:47,980 Then Johann Miescher came along some years later, 351 00:19:47,980 --> 00:19:52,270 and he discovered a weird thing in cells 352 00:19:52,270 --> 00:19:56,545 called nuclein, which is now known as DNA. 353 00:20:00,220 --> 00:20:07,940 But it wasn't until 1952 that Hershey and Chase said, hey, 354 00:20:07,940 --> 00:20:11,950 it's DNA that is carrying this genetic information 355 00:20:11,950 --> 00:20:15,400 from generation to generation. 356 00:20:15,400 --> 00:20:18,100 And then, of course, Watson, Crick, and Franklin, 357 00:20:18,100 --> 00:20:22,120 the following year, deciphered the structure of DNA, 358 00:20:22,120 --> 00:20:25,180 that it's this double helix, and then figured out 359 00:20:25,180 --> 00:20:28,480 what the mechanism must be that allows DNA 360 00:20:28,480 --> 00:20:30,920 to transmit this information. 361 00:20:30,920 --> 00:20:32,840 So you have a double helix. 362 00:20:32,840 --> 00:20:44,570 You match the four letters A, C, T, G opposite each other, 363 00:20:44,570 --> 00:20:47,380 and you can replicate this DNA by splitting it 364 00:20:47,380 --> 00:20:50,050 apart and growing another strand that 365 00:20:50,050 --> 00:20:52,430 is the complement of the first one. 366 00:20:52,430 --> 00:20:53,470 Now you have two. 367 00:20:53,470 --> 00:20:57,980 And you can have children, pass on this information to them. 368 00:20:57,980 --> 00:21:00,110 So that was a big deal. 369 00:21:00,110 --> 00:21:04,330 So a gene is defined by the National Center 370 00:21:04,330 --> 00:21:07,930 for Biotechnology Information as a fundamental physical 371 00:21:07,930 --> 00:21:11,980 and functional unit of heredity that's a DNA sequence located 372 00:21:11,980 --> 00:21:14,830 on a specific site on a chromosome which 373 00:21:14,830 --> 00:21:17,590 encodes a specific functional product, 374 00:21:17,590 --> 00:21:20,350 namely RNA or a protein. 375 00:21:20,350 --> 00:21:22,710 I'll come back to that in a minute. 
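[Editor's sketch, not part of the lecture.] The replication mechanism described above, splitting the double helix and growing a complementary strand on each half, can be written out in a few lines. The sketch treats a strand as a plain string of the letters A, C, T, G and ignores strand direction and all of the chemistry; the function names are hypothetical.

```python
# Base pairing in DNA: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Build the strand that pairs base by base with the given strand."""
    return "".join(COMPLEMENT[base] for base in strand)

def replicate(helix):
    """Split a double helix and grow a new partner for each old strand.

    `helix` is a pair (strand, partner). Each daughter helix keeps one
    old strand and gains one newly built complementary strand.
    """
    strand, partner = helix
    daughter1 = (strand, complement(strand))
    daughter2 = (complement(partner), partner)
    return daughter1, daughter2

original = ("ACTGGT", complement("ACTGGT"))
copy1, copy2 = replicate(original)
```

Because each base determines its partner, both daughters carry the same sequence as the original, which is what lets the information pass from generation to generation.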
376 00:21:22,710 --> 00:21:24,940 The remaining mystery is it's still very 377 00:21:24,940 --> 00:21:29,980 hard to figure out what parts of the DNA code genes. 378 00:21:29,980 --> 00:21:32,090 So you would think we might have solved this, 379 00:21:32,090 --> 00:21:34,210 but we haven't quite. 380 00:21:34,210 --> 00:21:37,840 And what does the rest, which is the vast majority of the DNA, 381 00:21:37,840 --> 00:21:41,920 do if it's not encoding genes? 382 00:21:41,920 --> 00:21:45,250 And then, how does the folding and the geometry, 383 00:21:45,250 --> 00:21:49,383 the topology of these structures, 384 00:21:49,383 --> 00:21:50,425 influence their function? 385 00:21:53,270 --> 00:21:57,250 So I went back and I read some of Francis Crick's work 386 00:21:57,250 --> 00:21:59,660 from the 1950s. 387 00:21:59,660 --> 00:22:01,360 And it's very interesting. 388 00:22:01,360 --> 00:22:06,130 This hypothesis was considered controversial and tentative 389 00:22:06,130 --> 00:22:07,420 at the time. 390 00:22:07,420 --> 00:22:11,290 So he said, "The specificity of a piece of nucleic acid 391 00:22:11,290 --> 00:22:14,950 is expressed solely by the sequence of its bases, 392 00:22:14,950 --> 00:22:16,960 and this sequence is a simple code 393 00:22:16,960 --> 00:22:21,010 for the amino acid sequence of a particular protein." 394 00:22:21,010 --> 00:22:24,190 And there were people arguing that he was just flat wrong, 395 00:22:24,190 --> 00:22:25,450 that this was not true. 396 00:22:25,450 --> 00:22:28,900 Of course, it turned out he was right. 397 00:22:28,900 --> 00:22:31,480 And then the central dogma is the transfer 398 00:22:31,480 --> 00:22:35,110 of information from nucleic acid to nucleic acid 399 00:22:35,110 --> 00:22:38,810 or from nucleic acid to protein may be possible. 400 00:22:38,810 --> 00:22:42,790 But transfer from protein to protein or from protein 401 00:22:42,790 --> 00:22:45,820 to nucleic acid is impossible. 
402 00:22:45,820 --> 00:22:48,160 And that's not quite true. 403 00:22:48,160 --> 00:22:51,380 But it's a good first approximation. 404 00:22:51,380 --> 00:22:57,740 So this is where things stood back about 60 years ago. 405 00:22:57,740 --> 00:23:00,800 And then a few Nobel prizes later, 406 00:23:00,800 --> 00:23:03,890 we began to understand some of the mechanism of how 407 00:23:03,890 --> 00:23:05,180 this works. 408 00:23:05,180 --> 00:23:07,490 And of course, how it works is that you 409 00:23:07,490 --> 00:23:13,250 have DNA, which is these four bases, double stranded. 410 00:23:13,250 --> 00:23:17,400 RNA gets produced in the process of transcription. 411 00:23:17,400 --> 00:23:20,720 So this thing unfolds. 412 00:23:20,720 --> 00:23:27,710 An RNA strand is built along the DNA and separates from the DNA, 413 00:23:27,710 --> 00:23:30,380 creating a single-stranded RNA. 414 00:23:30,380 --> 00:23:33,320 And then it goes and hooks up with a ribosome. 415 00:23:33,320 --> 00:23:38,990 And the ribosome takes that RNA and takes the codes 416 00:23:38,990 --> 00:23:42,020 in triplets, and each triplet stands 417 00:23:42,020 --> 00:23:45,710 for a particular amino acid, which it then assembles 418 00:23:45,710 --> 00:23:48,020 in sequence and creates proteins, which 419 00:23:48,020 --> 00:23:52,300 are sequences of amino acids. 420 00:23:52,300 --> 00:23:57,210 Now, it's very complicated. 421 00:23:57,210 --> 00:23:59,340 Because there's three-dimensionality 422 00:23:59,340 --> 00:24:00,720 and there's time involved. 423 00:24:00,720 --> 00:24:05,310 And the rate constants-- this is chemistry, after all. 424 00:24:05,310 --> 00:24:10,245 So again, a few more Nobel prizes later, 425 00:24:10,245 --> 00:24:15,330 we have that transcription, that process of turning DNA 426 00:24:15,330 --> 00:24:19,140 into RNA, is regulated by promoter, repressor, 427 00:24:19,140 --> 00:24:22,860 and enhancer regions on the genome. 
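[Editor's sketch, not part of the lecture.] The two steps described above, transcription of DNA into single-stranded RNA and translation of RNA triplets into amino acids, can be sketched as follows. Several simplifications are assumed: the template strand is written so that its complement reads left to right as the messenger RNA, only a handful of entries from the standard genetic code are included, and regulation, splicing, and rate constants are ignored. The function names and the sample sequence are hypothetical.

```python
# Template-strand DNA base -> base placed in the growing RNA strand.
# RNA uses U (uracil) where DNA would use T.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# A small subset of the standard genetic code, enough for the example.
CODON_TABLE = {
    "AUG": "Met", "AAA": "Lys", "CCG": "Pro", "UUU": "Phe", "GGC": "Gly",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def transcribe(template):
    """Build the RNA strand complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)

def translate(mrna):
    """Read the RNA three bases at a time and assemble amino acids until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons outside the subset
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

template = "TACTTTGGCATT"   # hypothetical template strand
mrna = transcribe(template)
protein = translate(mrna)
```

The full genetic code has 64 triplets mapping to 20 amino acids plus stop signals, so a complete table would be needed for real sequences.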
428 00:24:22,860 --> 00:24:27,960 And the proteins mediate this process by binding to the DNA 429 00:24:27,960 --> 00:24:31,560 and causing the beginning of transcription, 430 00:24:31,560 --> 00:24:35,340 or causing it to run faster or causing it to run slower, 431 00:24:35,340 --> 00:24:38,730 or they interfere with it, et cetera. 432 00:24:38,730 --> 00:24:42,000 There are also these enhancers, some of which 433 00:24:42,000 --> 00:24:47,280 are very far away from the coding region, that 434 00:24:47,280 --> 00:24:51,210 make huge differences in how much of the RNA, 435 00:24:51,210 --> 00:24:54,510 and therefore how much of the protein, is made. 436 00:24:54,510 --> 00:24:57,270 And the current understanding of that 437 00:24:57,270 --> 00:25:00,360 is that, if here is the gene, it may 438 00:25:00,360 --> 00:25:03,750 be that the strand of DNA loops around. 439 00:25:03,750 --> 00:25:08,700 And the enhancer, even though it's distant in genetic units, 440 00:25:08,700 --> 00:25:12,130 is actually in close physical proximity, 441 00:25:12,130 --> 00:25:16,860 and therefore can encourage more of this transcription 442 00:25:16,860 --> 00:25:20,160 to take place. 443 00:25:20,160 --> 00:25:23,550 By the way, if you're interested in this stuff, of course 444 00:25:23,550 --> 00:25:28,230 MIT teaches a lot of courses in how to do this. 445 00:25:28,230 --> 00:25:31,860 Dave Gifford and Manolis Kellis both teach 446 00:25:31,860 --> 00:25:36,000 computational courses in how to apply computational methods 447 00:25:36,000 --> 00:25:39,820 to try to decipher this kind of activity. 448 00:25:39,820 --> 00:25:43,470 So repressors prevent the activator from binding or alter 449 00:25:43,470 --> 00:25:46,770 the activator in order to change the rate constants. 450 00:25:46,770 --> 00:25:50,340 And so this is another mechanism.
451 00:25:50,340 --> 00:25:53,930 Now, one of the problems is that if you 452 00:25:53,930 --> 00:26:02,510 look at the total amount of DNA in your genes, in your cells, 453 00:26:02,510 --> 00:26:07,160 only about 1 and 1/2% are exons, which 454 00:26:07,160 --> 00:26:14,820 are the parts that code for mRNA, and eventually protein. 455 00:26:14,820 --> 00:26:19,600 So the question is what does the other 98 and 1/2% do? 456 00:26:19,600 --> 00:26:23,650 There was this unfortunate tendency in the biology 457 00:26:23,650 --> 00:26:27,100 community to call that junk DNA, which 458 00:26:27,100 --> 00:26:29,200 of course is a terrible notion. 459 00:26:29,200 --> 00:26:31,990 Because evolution would certainly have gotten 460 00:26:31,990 --> 00:26:34,540 rid of it if it was truly junk. 461 00:26:34,540 --> 00:26:39,710 Because our cells spend a lot of energy building this stuff. 462 00:26:39,710 --> 00:26:42,700 And every time a cell divides, it 463 00:26:42,700 --> 00:26:45,640 rebuilds all that so-called junk DNA. 464 00:26:45,640 --> 00:26:48,700 So it can't possibly be junk. 465 00:26:48,700 --> 00:26:51,280 But the question is, what does it do? 466 00:26:51,280 --> 00:26:54,890 And we don't really know for a lot of it. 467 00:26:54,890 --> 00:26:57,370 So there are introns-- 468 00:26:57,370 --> 00:26:58,580 I'll show you a picture. 469 00:26:58,580 --> 00:27:02,830 There are segments of the coding region that don't wind up 470 00:27:02,830 --> 00:27:04,180 as part of the RNA. 471 00:27:04,180 --> 00:27:05,890 They're spliced out. 472 00:27:05,890 --> 00:27:08,650 And we don't quite know why. 473 00:27:08,650 --> 00:27:11,350 There are these regulatory sequences, 474 00:27:11,350 --> 00:27:15,070 which is only about 5%, that are those promoters 475 00:27:15,070 --> 00:27:22,880 and repressors and enhancers that I talked about. 
476 00:27:22,880 --> 00:27:26,230 And then there's a whole bunch of repetitive DNA that 477 00:27:26,230 --> 00:27:30,980 includes transposable elements, related sequences. 478 00:27:30,980 --> 00:27:34,150 And mostly, we don't understand what it all does. 479 00:27:36,880 --> 00:27:39,700 Hypotheses are things like, well, maybe 480 00:27:39,700 --> 00:27:43,870 it's a storehouse of potentially useful DNA 481 00:27:43,870 --> 00:27:47,750 so that if environmental conditions change a lot, 482 00:27:47,750 --> 00:27:49,720 then the cell doesn't have to reinvent 483 00:27:49,720 --> 00:27:51,490 the stuff from scratch. 484 00:27:51,490 --> 00:27:55,740 It saved it from previous times in evolution 485 00:27:55,740 --> 00:27:57,460 when that may have been useful. 486 00:27:57,460 --> 00:28:01,640 But that's pretty much pure speculation at this point. 487 00:28:01,640 --> 00:28:04,300 So just recently, the Killian Lecture 488 00:28:04,300 --> 00:28:09,520 was given by Gerald Fink, who's a geneticist here. 489 00:28:09,520 --> 00:28:14,590 And his claim is that a gene is not any segment of DNA 490 00:28:14,590 --> 00:28:17,260 that produces RNA or protein. 491 00:28:17,260 --> 00:28:20,770 But it's any segment of DNA that is transcribed 492 00:28:20,770 --> 00:28:23,740 into RNA that has some function, whatever 493 00:28:23,740 --> 00:28:29,060 it is, not necessarily building proteins, but just anything. 494 00:28:29,060 --> 00:28:33,200 And I think that view is becoming accepted. 495 00:28:33,200 --> 00:28:39,130 So I promised you a little bit of more complexity. 496 00:28:39,130 --> 00:28:42,520 So when you look at your DNA in eukaryotes, 497 00:28:42,520 --> 00:28:45,820 like us, here's the promoter. 498 00:28:45,820 --> 00:28:51,070 And then here is the sequence of the genome. 499 00:28:51,070 --> 00:28:55,420 When this gets transcribed, it gets transcribed into something 500 00:28:55,420 --> 00:28:59,320 called pre-mRNA, messenger RNA. 
501 00:28:59,320 --> 00:29:03,120 And then there's this process of alternative splicing 502 00:29:03,120 --> 00:29:09,100 that splices out the introns and leaves only the exons. 503 00:29:09,100 --> 00:29:11,620 But sometimes it doesn't leave all the exons. 504 00:29:11,620 --> 00:29:13,640 It only leaves some of them. 505 00:29:13,640 --> 00:29:17,470 And so the same gene can, under various circumstances, 506 00:29:17,470 --> 00:29:19,840 produce different mRNA, which then 507 00:29:19,840 --> 00:29:22,310 produces different proteins. 508 00:29:22,310 --> 00:29:25,540 And again, there's a lot of mysteries about exactly 509 00:29:25,540 --> 00:29:27,940 how all this works. 510 00:29:27,940 --> 00:29:31,210 Nevertheless, that's the basic mechanism. 511 00:29:31,210 --> 00:29:36,325 And then here, I've just listed a few 512 00:29:36,325 --> 00:29:38,860 of the complexity problems. 513 00:29:38,860 --> 00:29:43,630 So there are things like, RNA can turn into DNA. 514 00:29:43,630 --> 00:29:46,810 This is a trick that viruses use a lot. 515 00:29:46,810 --> 00:29:50,260 They incorporate themselves into your cell, 516 00:29:50,260 --> 00:29:55,390 create a DNA complement to the RNA, 517 00:29:55,390 --> 00:29:58,670 and then use that to generate more viruses. 518 00:29:58,670 --> 00:30:03,340 So this is very typical of a viral infection. 519 00:30:03,340 --> 00:30:06,700 Prions, we also don't understand very well. 520 00:30:06,700 --> 00:30:11,740 This is like mad cow disease, where these proteins are able 521 00:30:11,740 --> 00:30:16,000 to cause changes in other proteins without going through 522 00:30:16,000 --> 00:30:21,130 the RNA/DNA-mediated mechanisms. 
523 00:30:21,130 --> 00:30:24,100 There are DNA-modifying proteins, 524 00:30:24,100 --> 00:30:26,920 the most important of which is the stuff involved 525 00:30:26,920 --> 00:30:32,200 in CRISPR-Cas9, which is this relatively new discovery 526 00:30:32,200 --> 00:30:37,540 about how bacteria are able to use a mechanism that they stole 527 00:30:37,540 --> 00:30:45,670 from viruses to edit the genetic complement of themselves, 528 00:30:45,670 --> 00:30:50,110 and more importantly, of other viruses that attack them. 529 00:30:50,110 --> 00:30:53,120 So it's an antiviral defense mechanism. 530 00:30:53,120 --> 00:30:56,890 And we're now figuring out how to use it to do gene editing. 531 00:30:56,890 --> 00:31:01,270 You may have read about this Chinese guy who actually went 532 00:31:01,270 --> 00:31:05,980 out and edited the genome of a couple of girls who were born 533 00:31:05,980 --> 00:31:11,050 in China, incorporating some, I think, resistance 534 00:31:11,050 --> 00:31:14,390 against HIV infections in their genome. 535 00:31:14,390 --> 00:31:17,080 And of course, this is probably way too early 536 00:31:17,080 --> 00:31:19,570 to do experiments on human beings, 537 00:31:19,570 --> 00:31:22,810 because they haven't demonstrated that this is safe. 538 00:31:22,810 --> 00:31:25,990 But maybe that'll become accepted. 539 00:31:25,990 --> 00:31:29,680 George Church at Harvard has been going around-- 540 00:31:29,680 --> 00:31:32,080 he likes to rattle people's chains. 541 00:31:32,080 --> 00:31:33,850 And he's been going around saying, well, 542 00:31:33,850 --> 00:31:39,280 the guy, he was unethical and was a slob, but what he's doing 543 00:31:39,280 --> 00:31:41,110 is a really great idea. 544 00:31:41,110 --> 00:31:45,480 So we'll see where that goes.
545 00:31:45,480 --> 00:31:48,750 And then there are these retrotransposons, 546 00:31:48,750 --> 00:31:53,460 where pieces of DNA in eukarya just pop out 547 00:31:53,460 --> 00:31:56,670 of wherever they are and insert themselves 548 00:31:56,670 --> 00:31:59,860 in some other place in the genome. 549 00:31:59,860 --> 00:32:02,770 And in plants, this happens a lot. 550 00:32:02,770 --> 00:32:08,550 So for example, wheat seems to have a huge number of copies 551 00:32:08,550 --> 00:32:13,320 of DNA segments that maybe it had only one of, 552 00:32:13,320 --> 00:32:16,730 but it's replicated through this mechanism. 553 00:32:19,690 --> 00:32:23,390 Last bit of complexity-- 554 00:32:23,390 --> 00:32:26,520 so we have various kinds of RNA. 555 00:32:26,520 --> 00:32:29,740 There's long non-coding RNA, which 556 00:32:29,740 --> 00:32:32,640 seems to participate in gene regulation. 557 00:32:32,640 --> 00:32:39,700 There is RNA interference, that there are these small RNA 558 00:32:39,700 --> 00:32:44,260 pieces that will actually latch onto the RNA produced 559 00:32:44,260 --> 00:32:46,770 by the standard genetic mechanism 560 00:32:46,770 --> 00:32:50,260 and prevent it from being translated into protein. 561 00:32:50,260 --> 00:32:53,530 This was another Nobel Prize a few years ago. 562 00:32:53,530 --> 00:32:56,700 Almost everything in this field, if you're first, 563 00:32:56,700 --> 00:32:58,150 you get a Nobel Prize for it. 564 00:33:00,960 --> 00:33:03,420 Once the proteins are made, they're 565 00:33:03,420 --> 00:33:06,010 degraded differentially. 566 00:33:06,010 --> 00:33:08,130 So there are different mechanisms 567 00:33:08,130 --> 00:33:11,490 in the cell that destroy certain kinds of proteins 568 00:33:11,490 --> 00:33:13,660 much faster than others. 569 00:33:13,660 --> 00:33:16,290 And so the production rate doesn't tell you 570 00:33:16,290 --> 00:33:19,830 how much is going to be there at any particular time. 
571 00:33:19,830 --> 00:33:24,630 And then there's this secondary and tertiary structure, 572 00:33:24,630 --> 00:33:27,030 where there's actually-- 573 00:33:27,030 --> 00:33:27,780 what is it? 574 00:33:27,780 --> 00:33:31,530 It's a mile of DNA in each of your cells. 575 00:33:31,530 --> 00:33:34,140 So it wouldn't fit. 576 00:33:34,140 --> 00:33:41,550 And so it gets wrapped up on these acetylated histones 577 00:33:41,550 --> 00:33:44,610 to produce something called chromatin. 578 00:33:44,610 --> 00:33:47,700 And again, we don't quite understand how this all works. 579 00:33:47,700 --> 00:33:50,850 Because you'd think that if you wrap stuff up like this, 580 00:33:50,850 --> 00:33:55,290 it would become inaccessible to transcription. 581 00:33:55,290 --> 00:33:58,560 And therefore, it's not clear how it gets expressed. 582 00:33:58,560 --> 00:34:01,840 But somehow or other, the cell is able to do that. 583 00:34:01,840 --> 00:34:06,840 So there's a lot yet to learn in this area. 584 00:34:06,840 --> 00:34:09,480 Now, the reason we're interested in all this 585 00:34:09,480 --> 00:34:14,400 is because, if you plot Moore's law for how quickly computers 586 00:34:14,400 --> 00:34:17,610 are becoming cheaper per performance, 587 00:34:17,610 --> 00:34:22,500 and you plot the cost of gene sequencing, 588 00:34:22,500 --> 00:34:24,270 it keeps going down. 589 00:34:24,270 --> 00:34:29,120 And it goes down much faster even than Moore's law. 590 00:34:29,120 --> 00:34:31,239 So this is pretty remarkable. 591 00:34:31,239 --> 00:34:36,070 And it means that, as I said, that $3 billion first genome now 592 00:34:36,070 --> 00:34:39,010 costs just a few hundred dollars.
593 00:34:39,010 --> 00:34:44,420 In fact, if you're just interested in the whole exome, 594 00:34:44,420 --> 00:34:51,710 so only the 2%, roughly, of the DNA that 595 00:34:51,710 --> 00:34:55,400 produces genetic coding, you can now 596 00:34:55,400 --> 00:34:58,490 go to this company, which I have nothing to do with. 597 00:34:58,490 --> 00:35:00,950 I just pulled this off the web. 598 00:35:00,950 --> 00:35:09,000 But for $299, they will give you 50-times coverage 599 00:35:09,000 --> 00:35:12,110 on about six gigabases. 600 00:35:12,110 --> 00:35:17,300 And if you pay them an extra $100, 601 00:35:17,300 --> 00:35:19,860 they'll do it at 100x coverage. 602 00:35:19,860 --> 00:35:22,530 So these techniques are very noisy. 603 00:35:22,530 --> 00:35:25,340 And so it's important to get lots of replicates 604 00:35:25,340 --> 00:35:30,430 in order to reassemble what you think is going on. 605 00:35:30,430 --> 00:35:34,320 A slightly more recent phenomenon is people say, well, 606 00:35:34,320 --> 00:35:36,930 not only can we sequence your DNA 607 00:35:36,930 --> 00:35:42,320 but we can sequence the RNA that got transcribed from the DNA. 608 00:35:42,320 --> 00:35:49,880 And in fact, you can buy a kit for $360 that will take 609 00:35:49,880 --> 00:35:53,220 the RNA from individual cells-- 610 00:35:53,220 --> 00:35:58,770 so these are like picoliter amounts of stuff. 611 00:35:58,770 --> 00:36:04,250 And it will give you the RNA sequence for $360 for up to 100 612 00:36:04,250 --> 00:36:09,120 cells, so $3, $3.50 per cell. 613 00:36:09,120 --> 00:36:11,030 So people are very excited. 614 00:36:11,030 --> 00:36:13,730 And there are now also companies that will 615 00:36:13,730 --> 00:36:16,580 sell you advanced analysis. 
616 00:36:16,580 --> 00:36:19,100 So they will correlate the data that you 617 00:36:19,100 --> 00:36:22,310 are getting with different databases 618 00:36:22,310 --> 00:36:26,540 and figure out whether this represents 619 00:36:26,540 --> 00:36:30,260 a dominant or a recessive or an X-linked model, 620 00:36:30,260 --> 00:36:34,760 if you have familial data and functional annotation 621 00:36:34,760 --> 00:36:37,250 of candidate genes, et cetera. 622 00:36:37,250 --> 00:36:40,610 And so, for example, starting about three years ago, 623 00:36:40,610 --> 00:36:46,070 if you walk into the Dana-Farber with a newly diagnosed cancer, 624 00:36:46,070 --> 00:36:50,420 a solid-tumor cancer, they will take a sample of that cancer, 625 00:36:50,420 --> 00:36:54,560 send it off to companies like this, or their own labs, 626 00:36:54,560 --> 00:36:58,190 and do sequencing and do analysis and try to figure out 627 00:36:58,190 --> 00:37:02,780 exactly which damaged genes that you have may be causing 628 00:37:02,780 --> 00:37:07,820 the cancer, and maybe more importantly, since it's still 629 00:37:07,820 --> 00:37:13,730 a pretty empirical field, which unusual variants of your genes 630 00:37:13,730 --> 00:37:16,400 suggest that certain drugs are likely to be 631 00:37:16,400 --> 00:37:20,270 more effective in treating your cancer than other drugs. 632 00:37:20,270 --> 00:37:25,131 So this has become completely routine in cancer care 633 00:37:25,131 --> 00:37:27,185 and in a few other domains. 634 00:37:30,530 --> 00:37:35,250 So now I'm going to switch to a more technical set of material. 635 00:37:35,250 --> 00:37:38,390 So if you want to characterize disease subtypes 636 00:37:38,390 --> 00:37:42,710 using gene expression arrays, microarrays, here's 637 00:37:42,710 --> 00:37:43,740 one way to do it. 638 00:37:43,740 --> 00:37:47,070 And this is a famous paper by Alizadeh.
639 00:37:47,070 --> 00:37:51,500 It was essentially the first of this class of papers 640 00:37:51,500 --> 00:37:54,220 back in 2001, I think. 641 00:37:54,220 --> 00:37:56,420 Yeah, 2001. 642 00:37:56,420 --> 00:37:59,840 And since then, there have been probably tens or hundreds 643 00:37:59,840 --> 00:38:02,420 of thousands of other papers published 644 00:38:02,420 --> 00:38:06,930 doing similar kinds of analyses on other data sets. 645 00:38:06,930 --> 00:38:09,510 So what they did is they said, OK, we're 646 00:38:09,510 --> 00:38:16,200 going to extract the coding RNA. 647 00:38:16,200 --> 00:38:21,510 We're going to create complementary DNA from it. 648 00:38:21,510 --> 00:38:24,420 We're going to use a technique to amplify that, 649 00:38:24,420 --> 00:38:27,660 because we're starting with teeny-tiny quantities. 650 00:38:27,660 --> 00:38:34,380 And then we're going to take a microarray, which is either 651 00:38:34,380 --> 00:38:38,580 a glass slide with tens or hundreds of thousands 652 00:38:38,580 --> 00:38:43,200 of spotted bits of DNA on it or it's 653 00:38:43,200 --> 00:38:46,770 a silicon chip with wells that, again, 654 00:38:46,770 --> 00:38:51,360 have tens or hundreds of thousands of bits of DNA in it. 655 00:38:51,360 --> 00:38:54,640 Now, where does that DNA come from? 656 00:38:54,640 --> 00:38:57,360 Initially, it was just a random collection 657 00:38:57,360 --> 00:39:02,790 of pieces of genes from the genome. 658 00:39:02,790 --> 00:39:05,920 Since then, they've gotten somewhat more sophisticated. 659 00:39:05,920 --> 00:39:12,090 But the idea is that I'm going to take the amplified cDNA, 660 00:39:12,090 --> 00:39:15,870 I'm going to mark with one of these jellyfish proteins that 661 00:39:15,870 --> 00:39:18,750 glows under light, and then I'm going 662 00:39:18,750 --> 00:39:23,370 to flow it over this slide or over this set of wells. 
663 00:39:23,370 --> 00:39:29,410 And the complementary parts of the complementary DNA 664 00:39:29,410 --> 00:39:35,556 will stick to the samples of DNA that are in this well. 665 00:39:35,556 --> 00:39:39,260 OK-- stands to reason. 666 00:39:39,260 --> 00:39:43,040 An alternative is that you take normal tissue as well 667 00:39:43,040 --> 00:39:47,150 as, say, the cancerous tissue, you mark the normal tissue 668 00:39:47,150 --> 00:39:51,770 with green fluorescent jellyfish stuff 669 00:39:51,770 --> 00:39:54,750 and you mark the cancer with red, 670 00:39:54,750 --> 00:39:57,260 and then you flow both of them in equal amounts 671 00:39:57,260 --> 00:39:58,550 over the array. 672 00:39:58,550 --> 00:40:00,500 That lets you measure a ratio. 673 00:40:00,500 --> 00:40:03,650 And you don't have as much of a calibration problem 674 00:40:03,650 --> 00:40:07,640 about trying to figure out the exact value. 675 00:40:07,640 --> 00:40:10,760 And then you cluster these samples by nearness 676 00:40:10,760 --> 00:40:12,470 in the expression space. 677 00:40:12,470 --> 00:40:16,670 And you cluster the genes by expression similarity 678 00:40:16,670 --> 00:40:18,240 across samples. 679 00:40:18,240 --> 00:40:20,540 So it used to be called bi-clustering. 680 00:40:20,540 --> 00:40:24,110 And I'll talk in a few minutes about a particular technique 681 00:40:24,110 --> 00:40:26,370 for doing this. 682 00:40:26,370 --> 00:40:30,750 So this is a typical microarray experiment. 683 00:40:30,750 --> 00:40:34,790 The RNA is turned into its complementary DNA, 684 00:40:34,790 --> 00:40:37,440 flowed over the microarray chip. 685 00:40:37,440 --> 00:40:39,720 And you get out a bunch of spots that 686 00:40:39,720 --> 00:40:43,960 are to various degrees of green and red. 687 00:40:43,960 --> 00:40:48,370 And then you calculate their ratio. 688 00:40:48,370 --> 00:40:50,620 And then you do this bi-clustering. 
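[Editor's sketch] The ratio-then-cluster procedure just described can be sketched in plain Python. This is a hedged illustration under stated assumptions, not the method of the paper: the intensities are made up, the sample names are hypothetical, and the choice of log2 ratios, Euclidean distance, and average linkage is the editor's, since the lecture does not specify them.

```python
import math

# Hypothetical spot intensities: one row per sample, one column per gene.
# "red" stands for the labeled tumor cDNA, "green" for the normal reference.
red = {
    "sample_a": [800.0, 100.0, 400.0],
    "sample_b": [780.0, 120.0, 390.0],
    "sample_c": [100.0, 900.0, 410.0],
}
green = {
    "sample_a": [200.0, 400.0, 400.0],
    "sample_b": [200.0, 400.0, 400.0],
    "sample_c": [200.0, 400.0, 400.0],
}


def log_ratios(red_row, green_row):
    """log2(red/green) per gene; the ratio sidesteps absolute calibration."""
    return [math.log2(r / g) for r, g in zip(red_row, green_row)]


def distance(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster(profiles):
    """Average-linkage agglomerative clustering; returns the merge order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(distance(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


profiles = {name: log_ratios(red[name], green[name]) for name in red}
merges = cluster(profiles)
print(merges)
```

Running the same `cluster` function on the transposed matrix (one profile per gene across samples) gives the second half of the bi-clustering; the two dendrograms are what order the rows and columns of the heat map.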
689 00:40:50,620 --> 00:40:53,370 And what you get is a hierarchical clustering 690 00:40:53,370 --> 00:40:57,540 of genes and a hierarchical clustering, in their case, 691 00:40:57,540 --> 00:41:01,240 of breast cancer biopsy specimens that express 692 00:41:01,240 --> 00:41:02,790 these genes in different ways. 693 00:41:06,090 --> 00:41:09,150 So this was pretty revolutionary, 694 00:41:09,150 --> 00:41:12,730 because the answers made sense. 695 00:41:12,730 --> 00:41:16,590 So when they did this on 19 cell lines 696 00:41:16,590 --> 00:41:22,230 in 65 breast tumor samples and a whole bunch of genes, 697 00:41:22,230 --> 00:41:26,400 they came up with a clustering that said, hmm, 698 00:41:26,400 --> 00:41:31,950 it looks like there are some samples that have 699 00:41:31,950 --> 00:41:34,810 this endothelial cell cluster. 700 00:41:34,810 --> 00:41:37,230 So it's a particular kind of problem. 701 00:41:37,230 --> 00:41:42,240 And you could correlate it with pathology 702 00:41:42,240 --> 00:41:47,280 from the tumor slides and different subclasses. 703 00:41:47,280 --> 00:41:51,330 And then this is a very typical kind of heat map 704 00:41:51,330 --> 00:41:53,820 that you see in this type of study. 705 00:42:00,270 --> 00:42:04,590 Another study from 65 breast carcinoma samples, 706 00:42:04,590 --> 00:42:08,700 using the gene list that they curated before, 707 00:42:08,700 --> 00:42:12,540 looks like it clusters the expression levels 708 00:42:12,540 --> 00:42:14,040 into these five clusters. 709 00:42:17,210 --> 00:42:18,500 It's a little hard to look at. 710 00:42:18,500 --> 00:42:21,770 I mean, when I stare at these, it's not obvious to me 711 00:42:21,770 --> 00:42:25,790 why the mathematics came up with exactly those clusters rather 712 00:42:25,790 --> 00:42:27,080 than some others. 713 00:42:27,080 --> 00:42:30,150 But you can see that there is some sense to it. 
714 00:42:30,150 --> 00:42:34,640 So here you see a lot of greens at this end of it 715 00:42:34,640 --> 00:42:38,460 and not very much at this end, and vice versa. 716 00:42:38,460 --> 00:42:40,850 So there is some difference between these clusters. 717 00:42:40,850 --> 00:42:41,450 Yeah? 718 00:42:41,450 --> 00:42:43,533 AUDIENCE: How did they come up with the gene list? 719 00:42:43,533 --> 00:42:46,254 And does anyone ever do this kind of cluster analysis 720 00:42:46,254 --> 00:42:48,195 without coming up with a gene list first? 721 00:42:48,195 --> 00:42:49,070 PETER SZOLOVITS: Yes. 722 00:42:49,070 --> 00:42:52,850 So I'm going to talk in a minute about modern genome-wide 723 00:42:52,850 --> 00:42:55,670 association studies, where basically you 724 00:42:55,670 --> 00:42:59,630 say, I'm going to look at every gene known to man. 725 00:42:59,630 --> 00:43:04,760 So they still have a list, but the list is 20,000 or 25,000. 726 00:43:04,760 --> 00:43:06,770 It's whatever we know about. 727 00:43:06,770 --> 00:43:09,310 And that's another way of doing it. 728 00:43:09,310 --> 00:43:15,830 So what was compelling about this work, this group's work, 729 00:43:15,830 --> 00:43:21,560 is a later analysis showed that these five subtypes actually 730 00:43:21,560 --> 00:43:26,930 had different survival rates, and at the p = 0.01 level 731 00:43:26,930 --> 00:43:29,030 of statistical significance. 732 00:43:29,030 --> 00:43:31,130 You've seen these survival curves, of course, 733 00:43:31,130 --> 00:43:33,500 before from David's lecture.
734 00:43:33,500 --> 00:43:37,340 But this is pretty impressive that doing something 735 00:43:37,340 --> 00:43:40,340 that had nothing to do with the clinical condition 736 00:43:40,340 --> 00:43:41,250 of the patient-- 737 00:43:41,250 --> 00:43:45,570 this is purely based on their gene expression levels-- 738 00:43:45,570 --> 00:43:48,740 you were able to find clusters that actually 739 00:43:48,740 --> 00:43:50,780 behave differently, clinically. 740 00:43:50,780 --> 00:43:53,840 So some of them do better than others. 741 00:43:53,840 --> 00:43:57,650 So this paper and this approach to work 742 00:43:57,650 --> 00:44:02,360 set off a huge set of additional work. 743 00:44:02,360 --> 00:44:06,110 This was, again, back in the Alizadeh paper. 744 00:44:06,110 --> 00:44:12,350 They did a similar relationship between 96 samples 745 00:44:12,350 --> 00:44:15,860 of normal and malignant lymphocytes. 746 00:44:15,860 --> 00:44:20,180 And they get a similar result, where 747 00:44:20,180 --> 00:44:23,450 the clusters that they identify here 748 00:44:23,450 --> 00:44:28,820 correspond to sort of well-understood existing 749 00:44:28,820 --> 00:44:31,150 types of lymphoma. 750 00:44:31,150 --> 00:44:35,350 So this, again, gives you some confidence 751 00:44:35,350 --> 00:44:41,080 that what you're extracting from these genetic correlations 752 00:44:41,080 --> 00:44:45,820 is meaningful in the terms that people who deal with lymphomas 753 00:44:45,820 --> 00:44:48,530 think about, about the topic. 754 00:44:48,530 --> 00:44:51,160 But of course, it can give you much more detail. 755 00:44:51,160 --> 00:44:53,770 Because people's intuitions may not 756 00:44:53,770 --> 00:44:59,210 be as effective as these large-scale data analyses. 757 00:44:59,210 --> 00:45:02,410 So to get to your question about generalizing this, 758 00:45:02,410 --> 00:45:06,050 I mean, here's one way that I look at this. 
759 00:45:06,050 --> 00:45:12,280 If I list all the genes and I list all the phenotypes-- 760 00:45:12,280 --> 00:45:14,110 now, we're a little more sure of what 761 00:45:14,110 --> 00:45:16,760 the genes are than of what the phenotypes are. 762 00:45:16,760 --> 00:45:19,690 So that's an interesting problem. 763 00:45:19,690 --> 00:45:23,530 Then I can do a bunch of analyses. 764 00:45:23,530 --> 00:45:27,460 So what is a phenotype? 765 00:45:27,460 --> 00:45:31,720 Well, it can be a diagnosed disease, like breast cancer. 766 00:45:31,720 --> 00:45:35,320 Or it can be the type of lymphoma from the two examples 767 00:45:35,320 --> 00:45:36,880 I've just shown you. 768 00:45:36,880 --> 00:45:40,000 It can also be a qualitative or a quantitative trait. 769 00:45:40,000 --> 00:45:41,020 It could be your weight. 770 00:45:41,020 --> 00:45:42,340 It could be your eye color. 771 00:45:42,340 --> 00:45:48,790 It could be almost anything that is clinically known about you. 772 00:45:48,790 --> 00:45:50,860 And it could even be behavior. 773 00:45:50,860 --> 00:45:58,940 It could be things like, what is your daily output of Twitter 774 00:45:58,940 --> 00:45:59,440 posts? 775 00:46:02,240 --> 00:46:04,670 That's a perfectly reasonable trait. 776 00:46:04,670 --> 00:46:06,830 I don't know if it's genetically predictable. 777 00:46:06,830 --> 00:46:12,170 But you'll see some surprising things that are. 778 00:46:12,170 --> 00:46:14,300 So then, how do you analyze this? 779 00:46:14,300 --> 00:46:19,070 Well, if you start by looking at a particular phenotype 780 00:46:19,070 --> 00:46:22,070 and say, what genes are associated with this, 781 00:46:22,070 --> 00:46:25,250 then you're doing what's called a GWAS, or a Genome-Wide 782 00:46:25,250 --> 00:46:27,260 Association Study. 783 00:46:27,260 --> 00:46:29,120 So you look for genetic differences 784 00:46:29,120 --> 00:46:32,810 that correspond to specific phenotypic differences.
785 00:46:32,810 --> 00:46:35,810 And usually, you're looking at things like single nucleotide 786 00:46:35,810 --> 00:46:37,640 polymorphisms. 787 00:46:37,640 --> 00:46:40,850 So this is places where your genome differs 788 00:46:40,850 --> 00:46:44,030 from the reference genome, the most common genome 789 00:46:44,030 --> 00:46:47,760 in the human population, at one particular locus. 790 00:46:47,760 --> 00:46:51,590 So you have a C instead of a G or something one place 791 00:46:51,590 --> 00:46:53,120 in your genes. 792 00:46:53,120 --> 00:46:57,710 Copy number variations, there are stretches of DNA 793 00:46:57,710 --> 00:47:01,010 that have repeats in them. 794 00:47:01,010 --> 00:47:03,810 And the number of repeats is variable. 795 00:47:03,810 --> 00:47:06,020 So one of the most famous ones of these 796 00:47:06,020 --> 00:47:09,330 is the one associated with Huntington's disease. 797 00:47:09,330 --> 00:47:14,090 It turns out that if you have up to 20-something repeats 798 00:47:14,090 --> 00:47:17,790 of a certain section of DNA, you're perfectly healthy. 799 00:47:17,790 --> 00:47:20,220 But if you're above 30 something, 800 00:47:20,220 --> 00:47:23,760 then you're going to die of Huntington's disease. 801 00:47:23,760 --> 00:47:26,780 And again, we don't quite understand these mechanisms. 802 00:47:26,780 --> 00:47:28,790 But these are empirically known. 803 00:47:28,790 --> 00:47:31,580 So copy number variations are important, 804 00:47:31,580 --> 00:47:38,030 gene expression levels, which I've talked about a minute ago. 805 00:47:38,030 --> 00:47:41,150 But the trick here in a GWAS is to look 806 00:47:41,150 --> 00:47:44,690 at a very wide set of genes rather than 807 00:47:44,690 --> 00:47:48,260 just a limited set of samples that you 808 00:47:48,260 --> 00:47:50,300 know you're interested in. 
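[Editor's sketch] The two kinds of genetic difference just described, a single-base change and a variable number of repeats, can each be detected with a few lines of Python. This is a minimal sketch on made-up sequences; the function names are hypothetical, the sequences are assumed to be already aligned, and the repeat unit "CAG" is the one usually cited for Huntington's disease rather than something stated in the lecture.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length,
    which real pipelines have to establish first.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


def count_tandem_repeats(sequence: str, unit: str) -> int:
    """Length, in units, of the longest uninterrupted run of `unit`."""
    longest = current = 0
    i = 0
    while i <= len(sequence) - len(unit):
        if sequence.startswith(unit, i):
            current += 1
            longest = max(longest, current)
            i += len(unit)
        else:
            current = 0
            i += 1
    return longest


# Made-up example: a C in place of a G at position 4 (0-based).
snps = find_snps("ACGTGACCTA", "ACGTCACCTA")

# Made-up example: four consecutive copies of the repeat unit.
repeats = count_tandem_repeats("TTCAGCAGCAGCAGAA", "CAG")
print(snps, repeats)
```

No clinical threshold is encoded here; the lecture only gives the cutoffs as "20-something" and "30 something", so the sketch stops at counting.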
809 00:47:50,300 --> 00:47:52,650 Now, the other approach is the opposite, 810 00:47:52,650 --> 00:47:56,600 which is to say, let's look at a particular gene 811 00:47:56,600 --> 00:47:59,600 and figure out what's it correlated with. 812 00:47:59,600 --> 00:48:05,000 And so that's called a PheWAS, a Phenome-Wide Association Study. 813 00:48:05,000 --> 00:48:10,340 And now what you do is you list all the different phenotypes. 814 00:48:10,340 --> 00:48:14,180 And you say, well, we can do the same kind of analysis 815 00:48:14,180 --> 00:48:17,570 to say which of them are disproportionately 816 00:48:17,570 --> 00:48:22,490 present in people who have that genetic variant. 817 00:48:22,490 --> 00:48:25,220 So here's what a typical GWAS looks like. 818 00:48:25,220 --> 00:48:29,190 This is called a Manhattan plot, which I think is pretty funny. 819 00:48:29,190 --> 00:48:33,540 But it does kind of look like the skyline of Manhattan. 820 00:48:33,540 --> 00:48:37,730 So this is all of your genes laid out in sequence 821 00:48:37,730 --> 00:48:40,250 along your chromosomes. 822 00:48:40,250 --> 00:48:45,980 And you take a particular phenotype and you 823 00:48:45,980 --> 00:48:51,200 say, what is the difference in the ratio of expression 824 00:48:51,200 --> 00:48:55,250 levels between people who have this disease and people who 825 00:48:55,250 --> 00:48:57,200 don't have this disease? 826 00:48:57,200 --> 00:49:01,140 And something like this gene, whatever it is, clearly there 827 00:49:01,140 --> 00:49:04,670 is an enormous difference in its expression level. 828 00:49:04,670 --> 00:49:07,760 And so you would be surprised if this gene didn't have something 829 00:49:07,760 --> 00:49:09,920 to do with the disease. 830 00:49:09,920 --> 00:49:15,740 And similarly, you can calculate different significance levels. 
831 00:49:15,740 --> 00:49:18,470 You have to do something like a Bonferroni correction, 832 00:49:18,470 --> 00:49:23,400 because you are testing so many hypotheses simultaneously. 833 00:49:23,400 --> 00:49:26,210 And so typically, the top of these lines 834 00:49:26,210 --> 00:49:29,630 is the Bonferroni-corrected threshold. 835 00:49:29,630 --> 00:49:33,780 And then you say, OK, this guy, this guy, this guy, this guy, 836 00:49:33,780 --> 00:49:36,420 and this guy come above that threshold. 837 00:49:36,420 --> 00:49:38,300 So these are good candidate genes 838 00:49:38,300 --> 00:49:41,840 to think that may be associated with this disease. 839 00:49:41,840 --> 00:49:46,140 Now, can you go out and start treating people based on that? 840 00:49:46,140 --> 00:49:48,380 Well, it's probably not a good idea. 841 00:49:48,380 --> 00:49:51,740 Because there are many reasons why this analysis 842 00:49:51,740 --> 00:49:52,640 might have failed. 843 00:49:52,640 --> 00:49:56,240 All the lessons that you've heard about confounders 844 00:49:56,240 --> 00:49:58,520 come in very strongly here. 845 00:49:58,520 --> 00:50:00,620 And so typically, what biologists 846 00:50:00,620 --> 00:50:03,230 do is they do this kind of analysis. 847 00:50:03,230 --> 00:50:08,030 They then create a strain of knock-out mice 848 00:50:08,030 --> 00:50:10,895 who have some analog of whatever disease 849 00:50:10,895 --> 00:50:12,020 it is that you're studying. 850 00:50:15,410 --> 00:50:17,870 And they see whether, in fact, knocking out 851 00:50:17,870 --> 00:50:24,590 a certain gene, like this guy, cures or creates 852 00:50:24,590 --> 00:50:27,470 the disease that you're interested in in this mouse 853 00:50:27,470 --> 00:50:28,280 model. 854 00:50:28,280 --> 00:50:31,130 And then you have a more mechanistic explanation 855 00:50:31,130 --> 00:50:32,960 for what the relationship might be. 
856 00:50:38,770 --> 00:50:42,010 So basically, you're looking at the ratio 857 00:50:42,010 --> 00:50:47,790 of the odds of having the disease if you have a SNP, 858 00:50:47,790 --> 00:50:52,570 or if you have a genetic variant, to having the disease 859 00:50:52,570 --> 00:50:54,600 if you don't have the genetic variant. 860 00:50:54,600 --> 00:50:55,210 Yeah? 861 00:50:55,210 --> 00:50:57,145 AUDIENCE: I'm just curious on the class size. 862 00:50:57,145 --> 00:50:58,770 It seems like the Bonferroni correction 863 00:50:58,770 --> 00:51:01,920 is being very limiting here, potentially conservative. 864 00:51:01,920 --> 00:51:04,930 And I'm curious if there are specific computational 865 00:51:04,930 --> 00:51:07,060 techniques adapted to this scenario that 866 00:51:07,060 --> 00:51:10,133 allow you to sort of mine a bit more effectively than those. 867 00:51:10,133 --> 00:51:11,050 PETER SZOLOVITS: Yeah. 868 00:51:11,050 --> 00:51:14,740 So if you talk to the statisticians, who 869 00:51:14,740 --> 00:51:19,030 are more expert at this than the computer scientists typically, 870 00:51:19,030 --> 00:51:21,160 they will tell you that Bonferroni 871 00:51:21,160 --> 00:51:24,940 is a very conservative kind of correction. 872 00:51:24,940 --> 00:51:28,840 And if you can impose some sort of structure 873 00:51:28,840 --> 00:51:33,370 on the set of genes that you're testing, then you can cheat. 874 00:51:33,370 --> 00:51:39,100 And you can say, well, you know, these 75 genes actually 875 00:51:39,100 --> 00:51:41,420 are all part of the same mechanism. 876 00:51:41,420 --> 00:51:43,780 And we're really testing the mechanism and not 877 00:51:43,780 --> 00:51:46,000 the individual gene. 878 00:51:46,000 --> 00:51:49,240 And therefore, instead of making a Bonferroni correction 879 00:51:49,240 --> 00:51:53,470 for 75 of these guys, we only have to do it for one. 880 00:51:53,470 --> 00:51:57,610 And so you can reduce the Bonferroni correction that way. 
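The Bonferroni arithmetic in this exchange can be sketched in a few lines of Python. This is a minimal illustration only: the p-values and the choice of 0.05 as the overall error rate are made-up assumptions, and the 75-gene grouping simply mirrors the example given in the answer.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Indices of the p-values that survive a Bonferroni correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Hypothetical case: 75 genes tested one at a time.
per_gene = bonferroni_threshold(0.05, 75)

# Same 75 genes treated as one mechanism, so there is a single test.
per_mechanism = bonferroni_threshold(0.05, 1)

print(per_gene, per_mechanism)
```

Grouping the genes loosens the cutoff by a factor of 75, which is exactly why people get nervous about who decides what counts as one mechanism.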
881 00:51:57,610 --> 00:52:00,490 But people get nervous when you do that. 882 00:52:00,490 --> 00:52:06,040 Because your incentive as a researcher 883 00:52:06,040 --> 00:52:10,090 is to show statistically significant results. 884 00:52:10,090 --> 00:52:12,790 But that whole question of p-values 885 00:52:12,790 --> 00:52:15,880 keeps coming under discussion. 886 00:52:15,880 --> 00:52:20,920 So the head of the American Statistical Association, 887 00:52:20,920 --> 00:52:24,070 about 15 years ago-- he's the Stanford professor. 888 00:52:24,070 --> 00:52:31,360 And he published what became a very notorious article saying, 889 00:52:31,360 --> 00:52:33,640 you know, we got it all wrong. 890 00:52:33,640 --> 00:52:37,505 Statistical significance is not significance 891 00:52:37,505 --> 00:52:40,450 in the standard English sense of the word. 892 00:52:40,450 --> 00:52:42,940 And so he called for various other ways 893 00:52:42,940 --> 00:52:46,840 and was more sympathetic to Bayesian kinds of reasoning 894 00:52:46,840 --> 00:52:48,250 and things like that. 895 00:52:48,250 --> 00:52:50,870 So there may be some gradual movement to that. 896 00:52:50,870 --> 00:52:54,310 But this is a huge can of worms to which we don't have 897 00:52:54,310 --> 00:52:55,930 a very good mechanistic answer. 898 00:52:58,660 --> 00:52:59,860 All right. 899 00:52:59,860 --> 00:53:03,760 So if you do these GWASs-- 900 00:53:03,760 --> 00:53:06,400 and this is the real problem with them 901 00:53:06,400 --> 00:53:10,160 is that most of what you see is down here. 902 00:53:10,160 --> 00:53:15,365 So you have things with common variants. 903 00:53:18,300 --> 00:53:21,270 But they have very small effect sizes 904 00:53:21,270 --> 00:53:24,120 when you look at what their effect is 905 00:53:24,120 --> 00:53:26,590 on a particular disease. 
906 00:53:26,590 --> 00:53:31,540 And so that same Zach Kohane that I mentioned earlier 907 00:53:31,540 --> 00:53:34,560 has always been challenging people doing this kind of work, 908 00:53:34,560 --> 00:53:36,950 saying, look-- 909 00:53:36,950 --> 00:53:42,080 for example, we did a GWAS with Kat Liao, 910 00:53:42,080 --> 00:53:45,890 who was a guest interviewee here when I was lecturing 911 00:53:45,890 --> 00:53:47,540 earlier in the semester. 912 00:53:47,540 --> 00:53:48,930 She's a rheumatologist. 913 00:53:48,930 --> 00:53:51,360 And we did a genome-wide association study. 914 00:53:51,360 --> 00:53:54,650 We found a bunch of genes that had odds ratios 915 00:53:54,650 --> 00:53:59,360 of like 1.1 to 1, 1.2 to 1. 916 00:53:59,360 --> 00:54:01,160 And they're statistically significant. 917 00:54:01,160 --> 00:54:03,410 Because if you collect enough data, 918 00:54:03,410 --> 00:54:06,950 everything is statistically significant. 919 00:54:06,950 --> 00:54:12,440 But are they significant in the other sense of significance? 920 00:54:12,440 --> 00:54:15,260 Well, so Zach's argument was that if you 921 00:54:15,260 --> 00:54:18,800 look at something like the odds ratio of lung cancer 922 00:54:18,800 --> 00:54:21,910 for people who do and don't smoke, 923 00:54:21,910 --> 00:54:25,750 the odds ratio is eight. 924 00:54:25,750 --> 00:54:33,100 So when you compare 1.1 to eight, you should be ashamed. 925 00:54:33,100 --> 00:54:36,310 You're not doing very well in terms of elucidating 926 00:54:36,310 --> 00:54:38,470 what the effects really are. 927 00:54:38,470 --> 00:54:41,650 And so Zach actually has argued very strongly 928 00:54:41,650 --> 00:54:44,110 that rather than focusing all our attention 929 00:54:44,110 --> 00:54:49,060 on these genetic factors that have very weak relationships, 930 00:54:49,060 --> 00:54:52,330 we should instead focus more on clinical things that 931 00:54:52,330 --> 00:54:56,020 often have stronger predictive relationships.
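The point that a small odds ratio becomes statistically significant once you collect enough data can be checked numerically. The sketch below is an illustration under stated assumptions: the 2x2 counts are invented, and the p-value uses the standard Wald (normal) approximation on the log odds ratio, which is not necessarily the test used in the study being described.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: variant and disease        b: variant, no disease
    c: no variant, disease        d: no variant, no disease
    """
    return (a * d) / (b * c)


def odds_ratio_p_value(a, b, c, d):
    """Two-sided p-value for log(OR) = 0, Wald normal approximation."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = abs(log_or) / se
    return math.erfc(z / math.sqrt(2))


# Invented counts: the same odds ratio of 1.1 at two sample sizes.
small = (110, 1000, 100, 1000)
large = tuple(1000 * n for n in small)

print(odds_ratio(*small), odds_ratio_p_value(*small))
print(odds_ratio(*large), odds_ratio_p_value(*large))
```

The effect size is identical in both tables. Only the p-value moves, which is the distinction between statistical significance and significance in the ordinary sense.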
932 00:54:56,020 --> 00:54:59,480 And some combination, of course, is best. 933 00:54:59,480 --> 00:55:04,870 Now, it is true that we know a whole bunch of highly penetrant 934 00:55:04,870 --> 00:55:06,760 Mendelian mutations. 935 00:55:06,760 --> 00:55:10,660 So these are ones where, one change in your genome, 936 00:55:10,660 --> 00:55:14,800 and all of a sudden you have some terrible disease. 937 00:55:14,800 --> 00:55:19,100 And I think when the Genome Project started in the 1990s, 938 00:55:19,100 --> 00:55:21,580 there was an expectation that we would 939 00:55:21,580 --> 00:55:24,520 find a whole bunch more things like that 940 00:55:24,520 --> 00:55:26,860 from knowing the genome. 941 00:55:26,860 --> 00:55:29,810 And that expectation was dashed. 942 00:55:29,810 --> 00:55:34,190 Because what we discovered is that our predecessors 943 00:55:34,190 --> 00:55:36,380 were actually pretty good at recognizing 944 00:55:36,380 --> 00:55:40,250 those kinds of diseases, from Mendel 945 00:55:40,250 --> 00:55:42,290 on, with the wrinkled peas. 946 00:55:42,290 --> 00:55:45,740 If you see a family in which there's a segregation 947 00:55:45,740 --> 00:55:48,860 pattern where you can see who has the disease 948 00:55:48,860 --> 00:55:52,880 and who doesn't and what their relationships are, 949 00:55:52,880 --> 00:55:55,520 you can get a pretty good idea of what 950 00:55:55,520 --> 00:55:57,770 genes or what genetic variants are 951 00:55:57,770 --> 00:56:00,300 associated with that disease. 952 00:56:00,300 --> 00:56:04,650 And it turns out we had found almost all of them. 953 00:56:04,650 --> 00:56:08,680 And so there weren't a whole lot more that are highly penetrant 954 00:56:08,680 --> 00:56:10,590 Mendelian mutations. 955 00:56:10,590 --> 00:56:14,430 And so what we had is mostly these common variants 956 00:56:14,430 --> 00:56:17,400 with small effects. 
957 00:56:17,400 --> 00:56:20,610 What's really interesting and worth working on 958 00:56:20,610 --> 00:56:24,180 is these rare variants with small effects. 959 00:56:24,180 --> 00:56:30,690 So the mystery kid, like the kid whose case I showed you, 960 00:56:30,690 --> 00:56:33,750 probably has some interesting genetics 961 00:56:33,750 --> 00:56:39,290 that is quite uncommon, and obviously, for a long time, 962 00:56:39,290 --> 00:56:41,300 had a small effect. 963 00:56:41,300 --> 00:56:44,120 But then all of a sudden, something happened. 964 00:56:44,120 --> 00:56:48,800 And there is this whole field called unknown disease 965 00:56:48,800 --> 00:56:53,870 diagnosis that says, what do you do when some weirdo walks in 966 00:56:53,870 --> 00:56:57,530 off the street and you have no idea what's going on? 967 00:56:57,530 --> 00:57:01,580 And there are now companies-- so I was a judge in a challenge 968 00:57:01,580 --> 00:57:04,130 about four or five years ago, where 969 00:57:04,130 --> 00:57:09,890 we took eight kids like this and we genotyped them, 970 00:57:09,890 --> 00:57:12,440 and we genotyped their parents and their grandparents 971 00:57:12,440 --> 00:57:13,400 and their siblings. 972 00:57:13,400 --> 00:57:15,840 And we took all their clinical data. 973 00:57:15,840 --> 00:57:19,220 This was with the consent of their parents, of course. 974 00:57:19,220 --> 00:57:22,250 And we made it available as a contest. 975 00:57:22,250 --> 00:57:24,860 And we had 20-something participants 976 00:57:24,860 --> 00:57:27,890 from around the world who tried to figure out something 977 00:57:27,890 --> 00:57:30,950 useful to say about these kids. 978 00:57:30,950 --> 00:57:33,410 And you go through a pipeline. 979 00:57:33,410 --> 00:57:35,870 And we did this in two rounds. 980 00:57:35,870 --> 00:57:39,380 The first round, the pipelines all looked very different. 
981 00:57:39,380 --> 00:57:41,060 And the second round, a couple of years 982 00:57:41,060 --> 00:57:44,120 later, the pipelines had pretty much converged. 983 00:57:44,120 --> 00:57:47,030 And I see now that there is a company that did well 984 00:57:47,030 --> 00:57:51,020 in one of these challenges that now sells this as a service, 985 00:57:51,020 --> 00:57:54,560 like I showed you before, different company. 986 00:57:54,560 --> 00:57:58,880 And so you send them the genetic makeup 987 00:57:58,880 --> 00:58:02,000 of some kid with a weird condition 988 00:58:02,000 --> 00:58:05,300 and the genetic makeup of their family, 989 00:58:05,300 --> 00:58:08,270 and it tries to guess which genes 990 00:58:08,270 --> 00:58:13,580 might be involved in causing the problem that this child has. 991 00:58:17,550 --> 00:58:19,510 That's not the answer, of course. 992 00:58:19,510 --> 00:58:24,790 Because that's just a sort of suspicion of a problem. 993 00:58:24,790 --> 00:58:28,980 And then you have to go out and do real biological work 994 00:58:28,980 --> 00:58:31,080 to try to reproduce that scenario 995 00:58:31,080 --> 00:58:33,690 and see what the effects really are. 996 00:58:33,690 --> 00:58:37,800 But at least in a couple of cases out of those eight, 997 00:58:37,800 --> 00:58:42,690 those hints have, in fact, led to a much better understanding 998 00:58:42,690 --> 00:58:45,630 of what caused the problems in these children. 999 00:58:48,600 --> 00:58:49,875 That was fun, by the way. 1000 00:58:49,875 --> 00:58:53,400 I got my name as an author on one 1001 00:58:53,400 --> 00:58:55,890 of these things that looks like a high-energy physics 1002 00:58:55,890 --> 00:58:56,920 experiment. 1003 00:58:56,920 --> 00:59:01,080 The first two pages of the paper is just the list of authors. 1004 00:59:01,080 --> 00:59:04,245 So it's kind of interesting. 
1005 00:59:06,840 --> 00:59:10,770 Now, here's a more recent study, which 1006 00:59:10,770 --> 00:59:14,640 is a genome-wide association study of type 2 diabetes. 1007 00:59:14,640 --> 00:59:16,680 It's not quite genome-wide, because they 1008 00:59:16,680 --> 00:59:18,990 didn't study every locus. 1009 00:59:18,990 --> 00:59:22,680 But they studied a hundred loci that 1010 00:59:22,680 --> 00:59:26,860 have been associated with type 2 diabetes in previous studies. 1011 00:59:26,860 --> 00:59:29,850 So of course, if you're not the first person doing 1012 00:59:29,850 --> 00:59:33,120 this kind of work, you can rely on the literature, where 1013 00:59:33,120 --> 00:59:36,820 other people have already come up with some interesting ideas. 1014 00:59:36,820 --> 00:59:39,210 So they wound up selecting 94 type 1015 00:59:39,210 --> 00:59:42,030 2 diabetes-associated variants. 1016 00:59:42,030 --> 00:59:45,810 So these are the glycemic traits, fasting insulin, 1017 00:59:45,810 --> 00:59:51,390 fasting glucose, et cetera; things about your body, 1018 00:59:51,390 --> 00:59:54,120 your body mass index, height, weight, circumference, 1019 00:59:54,120 --> 00:59:57,630 et cetera; lipid levels of various sorts, 1020 00:59:57,630 --> 01:00:00,390 associations with different diseases, 1021 01:00:00,390 --> 01:00:03,300 coronary artery disease, renal function, et cetera. 1022 01:00:07,110 --> 01:00:09,640 And let me come back to this. 1023 01:00:09,640 --> 01:00:11,800 So what they did is they said, OK, 1024 01:00:11,800 --> 01:00:13,710 here's the way we're going to model this. 1025 01:00:13,710 --> 01:00:15,870 We have an association matrix that 1026 01:00:15,870 --> 01:00:20,220 has 47 traits by 94 genetic factors. 1027 01:00:20,220 --> 01:00:22,740 So we make a matrix out of that. 1028 01:00:22,740 --> 01:00:25,230 And then they did something funny. 1029 01:00:25,230 --> 01:00:26,850 So they doubled the traits.
1030 01:00:26,850 --> 01:00:31,830 The technology for matrix factorization 1031 01:00:31,830 --> 01:00:34,890 is called non-negative matrix factorization. 1032 01:00:34,890 --> 01:00:37,320 And since many of those associations 1033 01:00:37,320 --> 01:00:40,410 were negative, what they did is, for each trait 1034 01:00:40,410 --> 01:00:43,380 that had both positive and negative values, 1035 01:00:43,380 --> 01:00:46,050 they duplicated the column. 1036 01:00:46,050 --> 01:00:51,180 They created one column that had positive associations and one 1037 01:00:51,180 --> 01:00:55,260 column that had the negation of the negative associations 1038 01:00:55,260 --> 01:00:56,740 with zeros everywhere else. 1039 01:00:56,740 --> 01:00:59,262 So that's how they dealt with that problem. 1040 01:00:59,262 --> 01:01:00,720 And then they said, OK, we're going 1041 01:01:00,720 --> 01:01:04,920 to apply matrix factorization to factor X 1042 01:01:04,920 --> 01:01:10,090 into two matrices, W and H. And I drew those here on the board. 1043 01:01:10,090 --> 01:01:12,630 So you have one matrix that-- 1044 01:01:12,630 --> 01:01:17,070 well, this is your original 47 by 94 matrix. 1045 01:01:17,070 --> 01:01:21,570 And the question is, can you find two smaller matrices that 1046 01:01:21,570 --> 01:01:26,640 are 47 by K and K by 94, that when you multiply these 1047 01:01:26,640 --> 01:01:29,700 together, you get back some close approximation 1048 01:01:29,700 --> 01:01:31,560 to that matrix. 1049 01:01:31,560 --> 01:01:34,620 Now, if you've been looking at the literature, 1050 01:01:34,620 --> 01:01:37,590 there are all kinds of ideas like auto-encoders. 1051 01:01:37,590 --> 01:01:42,000 And these are all basically the same underlying idea. 1052 01:01:42,000 --> 01:01:45,210 It's an unsupervised method that says, 1053 01:01:45,210 --> 01:01:48,330 can we find interesting patterns in the data 1054 01:01:48,330 --> 01:01:50,820 by doing some kind of dimension reduction? 
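The column-doubling trick and the factorization of X into W and H can be sketched with NumPy. Several things here are assumptions made for illustration: traits are laid out as columns, single-sign columns are kept by magnitude, the data is a toy matrix, and the solver is the plain multiplicative-update form of non-negative matrix factorization. The study described uses a regularized variant with relevance weights, which this sketch does not implement.

```python
import numpy as np


def split_signed_columns(X):
    """Make a signed matrix non-negative.

    A column with both signs becomes two columns: the positive part, and
    the negated negative part, with zeros everywhere else.
    Assumption: a single-sign column is kept once, by magnitude.
    """
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if (col > 0).any() and (col < 0).any():
            cols.append(np.maximum(col, 0.0))
            cols.append(np.maximum(-col, 0.0))
        else:
            cols.append(np.abs(col))
    return np.column_stack(cols)


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (n x m) by W (n x k) @ H (k x m), all non-negative.

    Uses multiplicative updates that reduce the squared (L2)
    reconstruction error; no penalty terms are included.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruction_error(X, W, H):
    """Frobenius norm of the difference between X and W @ H."""
    return float(np.linalg.norm(X - W @ H))


# Toy signed association matrix: 4 variants (rows) by 3 traits (columns).
signed = np.array([[1.0, -2.0, 3.0],
                   [-1.0, 2.0, 0.5],
                   [2.0, -1.0, 1.0],
                   [0.5, 1.0, 2.0]])
X = split_signed_columns(signed)
W, H = nmf(X, k=2)
print(X.shape, reconstruction_error(X, W, H))
```

Running `nmf` for several values of k and comparing the reconstruction error is one crude way to compare candidate values of K, though without the penalty terms the error only goes down as k grows.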
1055 01:01:50,820 --> 01:01:55,320 And this is one of those methods for doing dimension reduction. 1056 01:01:55,320 --> 01:02:00,120 So what's nice about this one is that when 1057 01:02:00,120 --> 01:02:07,710 they get their W and H, they predict X from that. 1058 01:02:07,710 --> 01:02:10,800 And then they know, of course, what the error is. 1059 01:02:10,800 --> 01:02:15,270 And they say, well, minimizing that error is our objective. 1060 01:02:15,270 --> 01:02:17,820 So that also lets them get at the question of, 1061 01:02:17,820 --> 01:02:19,860 what's the right K? 1062 01:02:19,860 --> 01:02:21,750 And that's an important problem. 1063 01:02:21,750 --> 01:02:23,820 Because normally clustering methods 1064 01:02:23,820 --> 01:02:25,980 like hierarchical clustering, you 1065 01:02:25,980 --> 01:02:28,860 have to specify what the number of clusters 1066 01:02:28,860 --> 01:02:30,310 is that you're looking for. 1067 01:02:30,310 --> 01:02:33,240 And that's hard to do a priori, whereas this technique 1068 01:02:33,240 --> 01:02:38,110 can suggest at least which one fits the data best. 1069 01:02:38,110 --> 01:02:42,450 And so the loss function is some regularized L2 distance 1070 01:02:42,450 --> 01:02:48,450 between the reconstruction, W times H and X, and some penalty 1071 01:02:48,450 --> 01:02:52,470 terms based on the size of W and H 1072 01:02:52,470 --> 01:02:54,960 coupled by these relevance weights that-- 1073 01:02:54,960 --> 01:02:59,000 you can look at the paper, which I think I referred to in here 1074 01:02:59,000 --> 01:03:01,590 and I asked you to read. 1075 01:03:01,590 --> 01:03:04,050 And then they do Gibbs sampling and a whole bunch 1076 01:03:04,050 --> 01:03:08,400 of computational tricks to speed up the process. 1077 01:03:08,400 --> 01:03:13,650 So they got about 17,000 people from four different studies. 1078 01:03:13,650 --> 01:03:15,580 They're all of European ancestry.
1079 01:03:15,580 --> 01:03:18,660 So there's the usual generalization problem of, 1080 01:03:18,660 --> 01:03:23,620 how do you apply this to people from other parts of the world? 1081 01:03:23,620 --> 01:03:28,600 And they did individual-level analysis 1082 01:03:28,600 --> 01:03:33,010 of all the individuals with type 2 diabetes from these. 1083 01:03:33,010 --> 01:03:37,060 And the results were that they found five subtypes-- 1084 01:03:37,060 --> 01:03:47,800 again, five-- which were present on 82.3% of iterations. 1085 01:03:47,800 --> 01:03:50,200 By the way, total random aside, there's 1086 01:03:50,200 --> 01:03:54,040 a wonderful video at Caltech of the woman who 1087 01:03:54,040 --> 01:03:58,660 just made the picture of the black hole shadow. 1088 01:03:58,660 --> 01:04:02,410 And she makes arguments very much like this. 1089 01:04:02,410 --> 01:04:05,830 We tried a whole bunch of different ways 1090 01:04:05,830 --> 01:04:08,500 of coming up with this picture. 1091 01:04:08,500 --> 01:04:11,230 And what we decided was true is whatever 1092 01:04:11,230 --> 01:04:14,350 showed up in almost all of the different methods 1093 01:04:14,350 --> 01:04:15,480 of reconstructing it. 1094 01:04:15,480 --> 01:04:18,370 So this is kind of a similar argument. 1095 01:04:18,370 --> 01:04:22,780 And their interpretations, medically, are that one of them 1096 01:04:22,780 --> 01:04:27,000 is involved with variations in the beta cells. 1097 01:04:27,000 --> 01:04:31,240 So these are the cells in your pancreas that make insulin. 1098 01:04:31,240 --> 01:04:35,600 One of them is in variations in proinsulin, 1099 01:04:35,600 --> 01:04:38,630 which is a predecessor of insulin that 1100 01:04:38,630 --> 01:04:40,660 is under different controls. 1101 01:04:40,660 --> 01:04:47,620 And then three others have to do with obesity, bad things 1102 01:04:47,620 --> 01:04:52,135 about your lipid metabolism, and then your liver function. 
1103 01:04:54,860 --> 01:04:57,170 And if you look at their results, 1104 01:04:57,170 --> 01:05:03,410 the top spider diagrams, so the way to interpret these 1105 01:05:03,410 --> 01:05:08,720 is that the middle circle, octagon, 1106 01:05:08,720 --> 01:05:14,030 the one in the very middle, is the one with negative data. 1107 01:05:14,030 --> 01:05:17,330 The one in between that and the outside 1108 01:05:17,330 --> 01:05:19,580 is with zero correlation. 1109 01:05:19,580 --> 01:05:22,820 And the outside one is with positive correlation. 1110 01:05:22,820 --> 01:05:25,970 And what you see is that different factors 1111 01:05:25,970 --> 01:05:29,850 have different influences in these different clusters. 1112 01:05:29,850 --> 01:05:31,400 So these are the factors that are 1113 01:05:31,400 --> 01:05:35,330 most informative in figuring out which cluster somebody belongs 1114 01:05:35,330 --> 01:05:36,230 to. 1115 01:05:36,230 --> 01:05:40,660 And they indeed look considerably different. 1116 01:05:40,660 --> 01:05:42,700 I'm not going to have you read this. 1117 01:05:42,700 --> 01:05:44,470 But it'll be in the slides. 1118 01:05:44,470 --> 01:05:46,480 Now, one thing that's interesting-- 1119 01:05:46,480 --> 01:05:50,770 and again, this won't be on the final exam. 1120 01:05:50,770 --> 01:05:52,087 But look at these numbers. 1121 01:05:52,087 --> 01:05:52,795 They're all tiny. 1122 01:05:59,200 --> 01:06:02,450 Some of them are hugely statistically significant. 1123 01:06:02,450 --> 01:06:09,490 So DI, whatever that is, contributes 0.05 units 1124 01:06:09,490 --> 01:06:17,260 to having beta-cell type of this disease at a p-value of 6.6 1125 01:06:17,260 --> 01:06:20,140 times 10 to the minus 37th. 1126 01:06:20,140 --> 01:06:21,740 So it's definitely there. 1127 01:06:21,740 --> 01:06:22,960 It's definitely an effect. 1128 01:06:22,960 --> 01:06:25,920 But it's not a very big effect. 
1129 01:06:25,920 --> 01:06:29,520 And what strikes me every time I look at studies 1130 01:06:29,520 --> 01:06:33,840 like this is just how small those effects are, 1131 01:06:33,840 --> 01:06:36,510 whether you're predicting some output 1132 01:06:36,510 --> 01:06:39,090 like the level of insulin in the patient, 1133 01:06:39,090 --> 01:06:42,480 or whether you're predicting something like a category 1134 01:06:42,480 --> 01:06:44,400 membership, as in this table. 1135 01:06:47,130 --> 01:06:51,510 So as I said, PheWAS is a reverse GWAS. 1136 01:06:51,510 --> 01:06:55,650 And the first paper that introduced the terminology 1137 01:06:55,650 --> 01:07:02,270 was by Josh Denny and colleagues at Vanderbilt in 2010. 1138 01:07:02,270 --> 01:07:07,460 And so they did not quite a phenome-wide association. 1139 01:07:07,460 --> 01:07:12,860 But they said, we're going to take 25,000 samples 1140 01:07:12,860 --> 01:07:16,550 from the Vanderbilt biobank, and we're 1141 01:07:16,550 --> 01:07:19,550 going to take the first 6,000 European Americans 1142 01:07:19,550 --> 01:07:23,170 with samples, no other criteria for selection. 1143 01:07:23,170 --> 01:07:24,560 Why European Americans? 1144 01:07:24,560 --> 01:07:28,130 Because all the GWAS data is about European Americans. 1145 01:07:28,130 --> 01:07:31,140 So they wanted to be able to compare to that. 1146 01:07:31,140 --> 01:07:33,230 And then they said, let's pick not 1147 01:07:33,230 --> 01:07:37,920 one SNP but five different SNPs that we're interested in. 1148 01:07:37,920 --> 01:07:39,620 So they picked these, which are known 1149 01:07:39,620 --> 01:07:42,710 to be associated with coronary artery disease 1150 01:07:42,710 --> 01:07:47,870 and carotid artery stenosis, atrial fibrillation, 1151 01:07:47,870 --> 01:07:51,890 multiple sclerosis and lupus, rheumatoid arthritis 1152 01:07:51,890 --> 01:07:53,100 and Crohn's disease. 
1153 01:07:53,100 --> 01:07:55,940 So it's a nice grab-bag of interesting disease 1154 01:07:55,940 --> 01:07:58,130 associations. 1155 01:07:58,130 --> 01:08:01,130 And then the hard work they did was 1156 01:08:01,130 --> 01:08:04,160 they went through the tens of thousands 1157 01:08:04,160 --> 01:08:08,690 of different billing codes that were available. 1158 01:08:08,690 --> 01:08:16,010 And they, by hand, clustered them into 744 case groups 1159 01:08:16,010 --> 01:08:19,010 and said, OK, these are the phenotypes 1160 01:08:19,010 --> 01:08:22,130 that we're interested in. 1161 01:08:22,130 --> 01:08:25,100 And that data set, by the way, is still available. 1162 01:08:25,100 --> 01:08:27,200 And it's been used by a lot of other people, 1163 01:08:27,200 --> 01:08:32,310 because nobody wants to repeat that analysis. 1164 01:08:32,310 --> 01:08:34,279 So now what you see is something very 1165 01:08:34,279 --> 01:08:38,689 similar to what you saw in GWAS, except here, what we 1166 01:08:38,689 --> 01:08:41,359 have is the ICD-9 code group. 1167 01:08:41,359 --> 01:08:46,069 I guess by the time this got published, it was up to 1,000. 1168 01:08:46,069 --> 01:08:53,210 And these are the same kinds of odds ratios 1169 01:08:53,210 --> 01:09:00,439 for the genetic expression of those markers. 1170 01:09:00,439 --> 01:09:06,500 And what you find, again, is that this is the p-equal 0.05. 1171 01:09:06,500 --> 01:09:10,130 That's the Bonferroni-corrected version. 1172 01:09:10,130 --> 01:09:13,609 And only multiple sclerosis comes up 1173 01:09:13,609 --> 01:09:17,840 for this particular SNP, which was one of the ones 1174 01:09:17,840 --> 01:09:19,460 that they expected to come up. 1175 01:09:19,460 --> 01:09:23,720 But they were interested to see what else lights up when 1176 01:09:23,720 --> 01:09:25,729 you do this sort of analysis. 
1177 01:09:25,729 --> 01:09:30,529 And what they discovered is that malignant neoplasm 1178 01:09:30,529 --> 01:09:34,939 of the rectum, benign digestive tract neoplasms-- 1179 01:09:34,939 --> 01:09:39,290 so there's something going on about cancer that is somehow 1180 01:09:39,290 --> 01:09:42,600 related to this single-nucleotide polymorphism, 1181 01:09:42,600 --> 01:09:45,120 not at a statistically high enough level, 1182 01:09:45,120 --> 01:09:47,390 but it's still kind of intriguing 1183 01:09:47,390 --> 01:09:49,370 that there may be some relationship there. 1184 01:09:49,370 --> 01:09:50,120 Yeah? 1185 01:09:50,120 --> 01:09:52,827 AUDIENCE: So is this data at all public? 1186 01:09:52,827 --> 01:09:54,410 Or is this at one particular hospital? 1187 01:09:54,410 --> 01:09:55,750 Or who has this data? 1188 01:09:55,750 --> 01:09:57,033 Would it be combined? 1189 01:09:57,033 --> 01:09:57,950 PETER SZOLOVITS: Yeah. 1190 01:09:57,950 --> 01:10:01,430 I don't believe that you can get their data unless-- 1191 01:10:01,430 --> 01:10:02,690 I think, if-- 1192 01:10:02,690 --> 01:10:04,670 I mean, they're pretty good about collaborating 1193 01:10:04,670 --> 01:10:05,690 with people. 1194 01:10:05,690 --> 01:10:11,450 So if you're willing to become a volunteer 1195 01:10:11,450 --> 01:10:15,020 employee at Vanderbilt, they could probably take you. 1196 01:10:15,020 --> 01:10:17,330 But I just made that up. 1197 01:10:17,330 --> 01:10:21,410 But every hospital has very strong controls. 1198 01:10:21,410 --> 01:10:25,340 Now, what is available is the NCBI 1199 01:10:25,340 --> 01:10:28,490 has GEO, the Gene Expression Omnibus, which 1200 01:10:28,490 --> 01:10:30,050 has enormous amounts-- 1201 01:10:30,050 --> 01:10:35,940 like, I think, hundreds of billions of sample data. 1202 01:10:35,940 --> 01:10:40,370 But you don't often know exactly what the sample is from. 
1203 01:10:40,370 --> 01:10:42,860 So it comes with an accession number 1204 01:10:42,860 --> 01:10:48,620 and an English description of what kind of data it is. 1205 01:10:48,620 --> 01:10:50,840 And there are actually lots of papers 1206 01:10:50,840 --> 01:10:52,580 where people have done natural language 1207 01:10:52,580 --> 01:10:56,270 processing on those English descriptions in order 1208 01:10:56,270 --> 01:10:59,270 to try to figure out what kind of data this is. 1209 01:10:59,270 --> 01:11:01,140 And then they can make use of it. 1210 01:11:01,140 --> 01:11:03,230 So you can be clever. 1211 01:11:03,230 --> 01:11:04,880 And there's a ton of data out there, 1212 01:11:04,880 --> 01:11:09,090 but it's not well-curated data. 1213 01:11:09,090 --> 01:11:11,700 Now, what's interesting is you don't always 1214 01:11:11,700 --> 01:11:12,720 get what you expect. 1215 01:11:12,720 --> 01:11:16,770 So for example, that SNP was selected 1216 01:11:16,770 --> 01:11:19,680 because it's thought to be associated 1217 01:11:19,680 --> 01:11:23,370 with multiple sclerosis and lupus. 1218 01:11:23,370 --> 01:11:28,550 But in reality, the association with lupus is not significant. 1219 01:11:28,550 --> 01:11:32,580 It's a p-value of 0.5, which is not very impressive. 1220 01:11:32,580 --> 01:11:37,110 The association with multiple sclerosis is significant. 1221 01:11:37,110 --> 01:11:40,620 And so they found, in this particular study, 1222 01:11:40,620 --> 01:11:46,190 a couple of things that had been expected but didn't work out. 1223 01:11:46,190 --> 01:11:51,770 So for example, this SNP, which was associated with coronary 1224 01:11:51,770 --> 01:11:56,270 artery disease and thought to be associated with this carotid 1225 01:11:56,270 --> 01:12:00,590 plaque deposition in your carotid artery, just isn't. 1226 01:12:00,590 --> 01:12:04,460 p-value of 0.82 is not impressive at all. 1227 01:12:07,630 --> 01:12:09,370 OK, onward.
1228 01:12:09,370 --> 01:12:12,070 So that was done for SNPs. 1229 01:12:12,070 --> 01:12:14,560 Now, a very popular idea today is 1230 01:12:14,560 --> 01:12:18,700 to look at expression levels, partly because of those prices 1231 01:12:18,700 --> 01:12:22,390 I showed you where you can very cheaply get expression 1232 01:12:22,390 --> 01:12:24,850 levels from lots of samples. 1233 01:12:24,850 --> 01:12:28,060 And so there's this whole notion of Expression Quantitative 1234 01:12:28,060 --> 01:12:32,570 Trait Loci, or EQTL, that says, hey, 1235 01:12:32,570 --> 01:12:36,160 instead of working as hard as the Vanderbilt guys did 1236 01:12:36,160 --> 01:12:40,320 to figure out these hundreds of categories of disease, 1237 01:12:40,320 --> 01:12:44,820 let's just take your gene expression levels 1238 01:12:44,820 --> 01:12:49,960 and use those as defining the trait that we're interested in. 1239 01:12:49,960 --> 01:12:52,530 So now we're looking at the relationship 1240 01:12:52,530 --> 01:12:57,210 between your genome and the expression levels. 1241 01:12:57,210 --> 01:12:59,700 And so you might say, well, that ought to be easy. 1242 01:12:59,700 --> 01:13:03,300 Because if the gene is there, it's going to get expressed. 1243 01:13:03,300 --> 01:13:05,460 But of course, that's not telling you 1244 01:13:05,460 --> 01:13:09,330 whether the gene is being activated or repressed 1245 01:13:09,330 --> 01:13:13,680 or enhanced, or whether any of these other complications 1246 01:13:13,680 --> 01:13:16,080 that I talked about earlier are present. 1247 01:13:16,080 --> 01:13:19,230 And so this is an interesting empirical question. 1248 01:13:19,230 --> 01:13:26,190 And so people say, well, maybe a small genetic variation 1249 01:13:26,190 --> 01:13:31,710 will cause different expression levels of some RNA. 1250 01:13:31,710 --> 01:13:33,450 And we can measure these, and then 1251 01:13:33,450 --> 01:13:36,840 use those to do this kind of analysis. 
1252 01:13:41,770 --> 01:13:45,580 So differential expression in different populations-- 1253 01:13:45,580 --> 01:13:48,650 there is evidence that, for example, 1254 01:13:48,650 --> 01:13:53,290 if you take 16 people of African descent, 1255 01:13:53,290 --> 01:13:57,970 then 17% of the genes in a small sample 1256 01:13:57,970 --> 01:14:02,380 of 16 people differ in their expression level 1257 01:14:02,380 --> 01:14:05,350 among those individuals; and similarly, 1258 01:14:05,350 --> 01:14:15,550 26% in this Asian population and 17% to 29% in a HapMap sample. 1259 01:14:15,550 --> 01:14:17,980 Of course, some of these differences 1260 01:14:17,980 --> 01:14:22,720 may be because of confounders like environment, 1261 01:14:22,720 --> 01:14:27,670 different tissues, limited correlation of these expression 1262 01:14:27,670 --> 01:14:30,070 levels to disease phenotypes. 1263 01:14:30,070 --> 01:14:32,890 Nevertheless, this type of analysis 1264 01:14:32,890 --> 01:14:38,290 has uncovered relationships between these EQTLs and asthma 1265 01:14:38,290 --> 01:14:40,270 and Crohn's disease. 1266 01:14:40,270 --> 01:14:42,517 So I'll let you read the conclusion of one 1267 01:14:42,517 --> 01:14:43,225 of these studies. 1268 01:14:56,480 --> 01:15:00,700 So this is saying what I said before, that we probably know 1269 01:15:00,700 --> 01:15:03,350 all the Mendelian diseases. 1270 01:15:03,350 --> 01:15:06,580 So the diseases that we're interested in understanding 1271 01:15:06,580 --> 01:15:09,670 better today are the ones that are not Mendelian, 1272 01:15:09,670 --> 01:15:14,530 but they're some complicated combination of effects 1273 01:15:14,530 --> 01:15:17,770 from different genes. 1274 01:15:17,770 --> 01:15:21,770 And that makes it, of course, a much harder problem. 
1275 01:15:21,770 --> 01:15:27,090 There is an interesting recent paper-- 1276 01:15:27,090 --> 01:15:28,360 well, not that recent-- 1277 01:15:28,360 --> 01:15:35,560 2005-- that uses Bayesian network technology 1278 01:15:35,560 --> 01:15:37,280 to try to get at this. 1279 01:15:37,280 --> 01:15:40,930 And so they say, well, if you have some quantitative trait 1280 01:15:40,930 --> 01:15:45,400 locus and you treat the RNA expression level 1281 01:15:45,400 --> 01:15:50,365 as this expression quantitative trait locus, and then 1282 01:15:50,365 --> 01:15:55,780 you take C as some complex trait, which might be a disease 1283 01:15:55,780 --> 01:15:58,000 or it might be a proclivity for something, 1284 01:15:58,000 --> 01:16:02,050 or it might be one of Josh Denny's categories 1285 01:16:02,050 --> 01:16:04,840 or whatever, then there are a number 1286 01:16:04,840 --> 01:16:07,810 of different Bayesian network-style models that you 1287 01:16:07,810 --> 01:16:09,190 can build. 1288 01:16:09,190 --> 01:16:14,290 So you can say, ah, the genetic variant 1289 01:16:14,290 --> 01:16:18,820 causes a difference in gene expression, which 1290 01:16:18,820 --> 01:16:21,580 in turn causes the disease. 1291 01:16:21,580 --> 01:16:24,640 Or you could say, hmm, the genetic trait 1292 01:16:24,640 --> 01:16:27,640 causes the disease, which in turn causes 1293 01:16:27,640 --> 01:16:31,840 the observable difference in gene expression. 1294 01:16:31,840 --> 01:16:38,560 Or you can say that the genetic variant causes 1295 01:16:38,560 --> 01:16:43,120 both the expression level and the disease, 1296 01:16:43,120 --> 01:16:45,280 but they're not necessarily coupled. 1297 01:16:45,280 --> 01:16:48,820 So they may be conditionally independent 1298 01:16:48,820 --> 01:16:50,950 given the genetic variant. 
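The three structures just described can be compared by asking which one assigns the greatest likelihood to the data. The sketch below is a minimal illustration, not the paper's implementation. It assumes linear-Gaussian relationships, simulates data in which the variant L drives expression R, which drives the trait C, and uses the labels "causal", "reactive", and "independent" only as convenient names for the three structures. The term for L itself is the same in all three models, so it is left out.

```python
import math
import random

def gaussian_loglik(x, y):
    """Maximized log-likelihood of y = a + b*x + Gaussian noise."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / n                      # maximum-likelihood noise variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

# Simulated (made-up) data where the truth is L -> R -> C.
rng = random.Random(0)
n = 500
L = [rng.choice([0, 1, 2]) for _ in range(n)]   # allele dosage
R = [l + rng.gauss(0, 1) for l in L]            # expression depends on L
C = [r + rng.gauss(0, 1) for r in R]            # trait depends on R

# Each structure factorizes into two regressions with the same parameter count.
scores = {
    "causal (L -> R -> C)": gaussian_loglik(L, R) + gaussian_loglik(R, C),
    "reactive (L -> C -> R)": gaussian_loglik(L, C) + gaussian_loglik(C, R),
    "independent (R <- L -> C)": gaussian_loglik(L, R) + gaussian_loglik(L, C),
}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: log-likelihood = {score:.1f}")
print("highest likelihood:", best)
```

Because all three candidates have the same number of free parameters, comparing raw likelihoods is fair here; with structures of different complexity a penalized score would be the safer choice.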
1299 01:16:50,950 --> 01:16:53,380 Or you can have more complex issues, 1300 01:16:53,380 --> 01:16:57,100 like you could have the gene causing changes 1301 01:16:57,100 --> 01:17:01,000 in expression level of a whole bunch of different RNA, 1302 01:17:01,000 --> 01:17:05,470 which combined cause some disease. 1303 01:17:05,470 --> 01:17:08,530 Or you can have different genetic changes 1304 01:17:08,530 --> 01:17:12,040 all impacting the expression of some RNA, which 1305 01:17:12,040 --> 01:17:13,870 causes the disease. 1306 01:17:13,870 --> 01:17:17,680 Or-- just wait for it. 1307 01:17:17,680 --> 01:17:20,600 Oops. 1308 01:17:20,600 --> 01:17:26,010 You can have models like this that say, 1309 01:17:26,010 --> 01:17:28,700 we have some environmental contributions 1310 01:17:28,700 --> 01:17:31,220 and a bunch of different genes which 1311 01:17:31,220 --> 01:17:37,550 affect the expression of a bunch of different EQTLs, which 1312 01:17:37,550 --> 01:17:40,640 cause a bunch of clinical traits, which 1313 01:17:40,640 --> 01:17:44,930 cause changes in a bunch of reactive RNA, which 1314 01:17:44,930 --> 01:17:48,530 cause comorbidities. 1315 01:17:48,530 --> 01:17:51,860 So the approach that they take is 1316 01:17:51,860 --> 01:17:57,800 to say, well, we can generate a large set of hypotheses 1317 01:17:57,800 --> 01:18:01,300 like this, and then just calculate 1318 01:18:01,300 --> 01:18:05,750 the likelihood of the data given each of these hypotheses. 1319 01:18:05,750 --> 01:18:09,680 And whichever one assigns the greatest likelihood to the data 1320 01:18:09,680 --> 01:18:12,185 is most likely to be the one that's close to correct. 1321 01:18:15,050 --> 01:18:18,740 So let me just blast through the rest of this quickly. 1322 01:18:18,740 --> 01:18:22,310 Scaling up genome-phenome association studies-- 1323 01:18:22,310 --> 01:18:26,630 the UK Biobank is sort of like this All of Us project. 
1324 01:18:26,630 --> 01:18:30,460 But they do make their data available. 1325 01:18:30,460 --> 01:18:34,340 All of Us will, also, but it hasn't been collected yet. 1326 01:18:34,340 --> 01:18:36,455 UK Biobank has about half a million 1327 01:18:36,455 --> 01:18:41,210 de-identified individuals with full exome sequencing, 1328 01:18:41,210 --> 01:18:46,520 although they only have about 10% of what they want now. 1329 01:18:46,520 --> 01:18:51,110 And many of them will have worn 24-hour activity monitors so 1330 01:18:51,110 --> 01:18:53,760 that we have behavioral data. 1331 01:18:53,760 --> 01:18:56,240 Some of them have had repeat measurements. 1332 01:18:56,240 --> 01:18:58,880 They do online questionnaires. 1333 01:18:58,880 --> 01:19:02,930 About a fifth of them will have imaging. 1334 01:19:02,930 --> 01:19:05,730 And it's linked to their electronic health record. 1335 01:19:05,730 --> 01:19:07,490 So we know if they died or if they 1336 01:19:07,490 --> 01:19:12,230 had cancer or various hospital episodes, et cetera. 1337 01:19:12,230 --> 01:19:19,640 And there's a website here which publishes the latest analyses. 1338 01:19:19,640 --> 01:19:23,180 And so you see, on April 18, genetic variants that 1339 01:19:23,180 --> 01:19:27,050 protect against obesity and type 2 diabetes discovered, 1340 01:19:27,050 --> 01:19:30,530 moderate meat-eaters are at risk of bowel cancer, 1341 01:19:30,530 --> 01:19:33,930 and research identifies genetic causes of poor sleep. 1342 01:19:33,930 --> 01:19:35,690 So this is all over the place. 1343 01:19:35,690 --> 01:19:38,390 But these are all the studies that are being done by this. 1344 01:19:42,050 --> 01:19:42,910 I'll skip this. 1345 01:19:42,910 --> 01:19:46,190 But there's a group here at MGH and the Broad that 1346 01:19:46,190 --> 01:19:49,730 is using this data to do, at large scale, 1347 01:19:49,730 --> 01:19:54,860 many, many genome-wide association studies.
1348 01:19:54,860 --> 01:19:57,830 And one of the things that I promised you, 1349 01:19:57,830 --> 01:20:00,860 which is interesting, is from these studies, they say, 1350 01:20:00,860 --> 01:20:05,370 well, the heritability of height is pretty good. 1351 01:20:05,370 --> 01:20:10,910 It's about 0.46 with a p-value of 10 to the minus 109th. 1352 01:20:10,910 --> 01:20:14,660 So your height is definitely determined, in large part, 1353 01:20:14,660 --> 01:20:16,580 by your parents' height. 1354 01:20:16,580 --> 01:20:18,650 But what's interesting is that whether you 1355 01:20:18,650 --> 01:20:21,680 get a college degree or not is determined 1356 01:20:21,680 --> 01:20:24,500 by whether your parents got a college degree or not. 1357 01:20:24,500 --> 01:20:27,440 This is probably not genetic. 1358 01:20:27,440 --> 01:20:29,540 Or it's only partly genetic. 1359 01:20:29,540 --> 01:20:34,850 But it clearly has confounders from money and social status 1360 01:20:34,850 --> 01:20:36,770 and various things like that. 1361 01:20:36,770 --> 01:20:40,250 And then what I found amusing is that even 1362 01:20:40,250 --> 01:20:47,000 TV-watching is partly heritable from your genetics. 1363 01:20:50,850 --> 01:20:55,230 Fortunately, my parents watch a lot of TV. 1364 01:20:55,230 --> 01:20:57,150 The last thing I wanted to mention, 1365 01:20:57,150 --> 01:20:59,220 but I'm not going to have time to get into it, 1366 01:20:59,220 --> 01:21:02,390 is this notion of gene set enrichment analysis. 1367 01:21:02,390 --> 01:21:05,490 It's what I was saying before, that genes typically 1368 01:21:05,490 --> 01:21:08,260 don't act by themselves. 1369 01:21:08,260 --> 01:21:11,400 And so if you think back on high school biology, 1370 01:21:11,400 --> 01:21:14,300 you probably learned about the Krebs cycle 1371 01:21:14,300 --> 01:21:17,170 that powers cellular mechanisms.
1372 01:21:17,170 --> 01:21:19,440 So if you break any part of that cycle, 1373 01:21:19,440 --> 01:21:21,940 your cells don't get enough energy. 1374 01:21:21,940 --> 01:21:24,570 And so it stands to reason that if you 1375 01:21:24,570 --> 01:21:27,840 want to understand that sort of metabolism, 1376 01:21:27,840 --> 01:21:30,510 you shouldn't be looking at an individual gene. 1377 01:21:30,510 --> 01:21:33,330 But you should be looking at all of the genes that 1378 01:21:33,330 --> 01:21:36,070 are involved in that process. 1379 01:21:36,070 --> 01:21:39,450 And so there have been many attempts to try to do this. 1380 01:21:39,450 --> 01:21:46,560 The Broad Institute here has a set of, originally, 1,300 1381 01:21:46,560 --> 01:21:48,710 biologically-defined gene sets. 1382 01:21:48,710 --> 01:21:52,350 So these were ones that interacted with each other 1383 01:21:52,350 --> 01:21:55,750 in controlling some important mechanism in the body. 1384 01:21:55,750 --> 01:21:58,380 They're now up to 18,000. 1385 01:21:58,380 --> 01:22:02,190 For example, genes involved in oxidative phosphorylation 1386 01:22:02,190 --> 01:22:05,790 and muscle tissue show reduced expression in diabetics, 1387 01:22:05,790 --> 01:22:09,300 although the average decrease per gene is only 20%. 1388 01:22:09,300 --> 01:22:11,040 So they have these sets. 1389 01:22:11,040 --> 01:22:15,060 And from those, there is a very nice technique 1390 01:22:15,060 --> 01:22:18,090 that is able to pull-- 1391 01:22:24,300 --> 01:22:27,180 it's essentially a way of strengthening 1392 01:22:27,180 --> 01:22:31,710 the genome-wide associations by allowing you to associate them 1393 01:22:31,710 --> 01:22:33,750 with these sets of genes. 1394 01:22:33,750 --> 01:22:36,810 And the approach that they take is quite clever.
1395 01:22:36,810 --> 01:22:40,930 They say, if we take all the genes in a gene set 1396 01:22:40,930 --> 01:22:45,430 and we order them by their correlation with whatever trait 1397 01:22:45,430 --> 01:22:49,450 we're interested in, then the genes 1398 01:22:49,450 --> 01:22:52,420 that are closer to the beginning of that 1399 01:22:52,420 --> 01:22:54,370 are more likely to be involved. 1400 01:22:54,370 --> 01:22:57,970 Because they're the ones that are most strongly associated. 1401 01:22:57,970 --> 01:23:00,790 And so they have this random walk process 1402 01:23:00,790 --> 01:23:04,060 that finds sort of the maximum place where 1403 01:23:04,060 --> 01:23:06,640 you can say anything before that is 1404 01:23:06,640 --> 01:23:08,890 likely to be associated with the disease 1405 01:23:08,890 --> 01:23:11,080 that you're interested in. 1406 01:23:11,080 --> 01:23:16,930 And they've had a number of successes of showing enrichment 1407 01:23:16,930 --> 01:23:22,870 in various diseases and various biological factors. 1408 01:23:22,870 --> 01:23:26,650 The last thing I want to say is a little bit disappointing. 1409 01:23:26,650 --> 01:23:30,310 I was just really looking for the killer paper 1410 01:23:30,310 --> 01:23:32,890 to talk about that uses some really 1411 01:23:32,890 --> 01:23:36,160 sophisticated deep learning, machine learning. 1412 01:23:36,160 --> 01:23:40,370 And as far as I can tell, it doesn't exist yet. 1413 01:23:40,370 --> 01:23:45,700 So most of these methods are based on clustering techniques 1414 01:23:45,700 --> 01:23:49,000 or on clever ideas, like the one for gene 1415 01:23:49,000 --> 01:23:52,270 set enrichment analysis. 1416 01:23:52,270 --> 01:23:55,780 But they're not neural network types of techniques. 1417 01:23:55,780 --> 01:23:58,540 They're not immensely sophisticated.
1418 01:23:58,540 --> 01:24:02,080 So what you see coming up is things like Bayesian networks 1419 01:24:02,080 --> 01:24:06,040 and clustering and matrix factorization and so on, which 1420 01:24:06,040 --> 01:24:10,390 sort of sound like 10-, 15-, 20-year-old technologies. 1421 01:24:10,390 --> 01:24:15,770 And I haven't seen examples yet of the hot off the presses, 1422 01:24:15,770 --> 01:24:19,870 we built an 83-layer neural network 1423 01:24:19,870 --> 01:24:23,110 that outperforms these other methods. 1424 01:24:23,110 --> 01:24:25,180 I suspect that that's coming. 1425 01:24:25,180 --> 01:24:28,630 It just hasn't hit yet, as far as I know. 1426 01:24:28,630 --> 01:24:31,760 If you know of such papers, by all means, let me know. 1427 01:24:31,760 --> 01:24:32,260 All right. 1428 01:24:32,260 --> 01:24:34,050 Thank you.
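The random walk behind gene set enrichment analysis, described earlier in the lecture, can be sketched in a few lines. This is a simplified, unweighted illustration under stated assumptions, not the Broad Institute implementation. As the method is usually formulated, all measured genes are ranked by their correlation with the trait, the walk steps up at each member of the gene set and down otherwise, and the peak marks the "maximum place" mentioned above. The published method also weights each step by the gene's correlation and assesses significance by permutation, both of which are omitted here. The gene names and the function `enrichment_score` are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes: genes ordered from most to least correlated with the trait.
    gene_set: the set of genes being tested for enrichment.
    Returns (score, index of the peak in ranked_genes).
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best, best_idx = 0.0, 0.0, -1
    for i, hit in enumerate(hits):
        # Step up on a set member, down otherwise; the walk ends back at zero.
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best, best_idx = running, i
    return best, best_idx

# Made-up example: ten genes already ranked by correlation with a trait.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
pathway = {"g1", "g2", "g4"}

score, peak = enrichment_score(ranked, pathway)
leading_edge = [g for g in ranked[: peak + 1] if g in pathway]
print(f"enrichment score = {score:.3f}, peak at rank {peak + 1}")
print("set members before the peak:", leading_edge)
```

A score near 1 means the set's genes cluster at the top of the ranking; a set scattered evenly through the list keeps the walk close to zero.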