1 00:00:15,775 --> 00:00:17,400 PETER SZOLOVITS: So last time we talked 2 00:00:17,400 --> 00:00:20,580 about what medicine does, and today I 3 00:00:20,580 --> 00:00:24,330 want to take a deep dive into medical data. 4 00:00:24,330 --> 00:00:28,800 And I'm going to use as examples a lot of stuff from the MIMIC 5 00:00:28,800 --> 00:00:31,470 database, which is one of the databases 6 00:00:31,470 --> 00:00:34,180 that we're going to be using in this class. 7 00:00:34,180 --> 00:00:36,480 Some of you are probably familiar with it, 8 00:00:36,480 --> 00:00:38,670 and some of you are not. 9 00:00:38,670 --> 00:00:41,280 And there are, I hope, some takeaway lessons 10 00:00:41,280 --> 00:00:43,600 from this discussion. 11 00:00:43,600 --> 00:00:48,450 So for example, a few years ago, when 12 00:00:48,450 --> 00:00:51,630 MIMIC-III was about to be released, 13 00:00:51,630 --> 00:00:54,270 I was playing with the data, and I 14 00:00:54,270 --> 00:00:58,440 looked at the distribution of heart rates 15 00:00:58,440 --> 00:01:00,660 in the CareVue part of the database. 16 00:01:00,660 --> 00:01:03,690 So MIMIC, for those of you who don't know, 17 00:01:03,690 --> 00:01:10,410 has intensive care data from about 60-something thousand 18 00:01:10,410 --> 00:01:13,380 admissions to intensive care units at the Beth Israel 19 00:01:13,380 --> 00:01:18,720 Deaconess Medical Center over a period of about 12 years. 20 00:01:18,720 --> 00:01:21,960 And one of the technical difficulties 21 00:01:21,960 --> 00:01:25,440 that we encountered is that in the middle of that time period, 22 00:01:25,440 --> 00:01:29,040 the hospital shifted from one information system 23 00:01:29,040 --> 00:01:32,190 that they used in their intensive care unit to another. 24 00:01:32,190 --> 00:01:33,480 CareVue is the old one. 25 00:01:33,480 --> 00:01:35,310 MetaVision is the new one. 26 00:01:35,310 --> 00:01:37,420 And of course, they're not exactly compatible. 
27 00:01:37,420 --> 00:01:40,510 So we'll see some examples of that. 28 00:01:40,510 --> 00:01:41,980 So this is the old data. 29 00:01:41,980 --> 00:01:44,100 So this is from CareVue. 30 00:01:44,100 --> 00:01:46,650 And you look at that and say, well, heart rates 31 00:01:46,650 --> 00:01:53,250 range from 40 to 200 roughly, which is OK. 32 00:01:53,250 --> 00:01:55,380 But then there's this funny thing. 33 00:01:55,380 --> 00:01:57,900 There are two peaks. 34 00:01:57,900 --> 00:02:02,070 So where, if ever, do you see two peaks 35 00:02:02,070 --> 00:02:04,890 in physiological data? 36 00:02:08,090 --> 00:02:10,380 Not typical. 37 00:02:10,380 --> 00:02:14,043 And so my initial reaction was-- 38 00:02:14,043 --> 00:02:17,001 [LAUGHTER] 39 00:02:18,480 --> 00:02:21,170 So then I looked a little closer, 40 00:02:21,170 --> 00:02:24,650 and I said, hmm, what do the heart rates look 41 00:02:24,650 --> 00:02:27,450 like from these two systems? 42 00:02:27,450 --> 00:02:29,660 And if you look in CareVue, you see the picture 43 00:02:29,660 --> 00:02:30,620 that I just showed you. 44 00:02:30,620 --> 00:02:32,690 And if you look in MetaVision, you 45 00:02:32,690 --> 00:02:35,300 see this other picture, which looks more like what 46 00:02:35,300 --> 00:02:38,610 you would normally expect. 47 00:02:38,610 --> 00:02:40,940 And so I'm sitting there scratching my head going, 48 00:02:40,940 --> 00:02:44,790 OK, there must be some difference between these. 49 00:02:44,790 --> 00:02:48,740 It's not that, simultaneous with the switchover 50 00:02:48,740 --> 00:02:53,060 of the hospital from one information system to another, 51 00:02:53,060 --> 00:02:55,430 the physiology of people changed, and all 52 00:02:55,430 --> 00:02:58,580 of a sudden some subset of people 53 00:02:58,580 --> 00:03:01,950 started having faster heart rates. 54 00:03:01,950 --> 00:03:02,930 Right? 
55 00:03:02,930 --> 00:03:05,510 But if you think about that what subset of people 56 00:03:05,510 --> 00:03:08,118 have faster heart rates? 57 00:03:08,118 --> 00:03:08,910 AUDIENCE: Athletes. 58 00:03:08,910 --> 00:03:10,380 PETER SZOLOVITS: Hmm? 59 00:03:10,380 --> 00:03:13,320 AUDIENCE: Babies? 60 00:03:13,320 --> 00:03:15,280 AUDIENCE: If you're in a stress test. 61 00:03:15,280 --> 00:03:17,210 PETER SZOLOVITS: Unh-hmm. 62 00:03:17,210 --> 00:03:18,252 AUDIENCE: Is it children? 63 00:03:18,252 --> 00:03:19,419 PETER SZOLOVITS: Yeah, kids. 64 00:03:21,490 --> 00:03:23,590 So I said, hmm, interesting. 65 00:03:23,590 --> 00:03:26,300 So anyway, if you look at the statistics, 66 00:03:26,300 --> 00:03:29,820 you see that the mean heart rate in CareVue is 108, 67 00:03:29,820 --> 00:03:35,050 and the mean heart rate in MetaVision is 87. 68 00:03:35,050 --> 00:03:39,310 But of course, means are not that meaningful 69 00:03:39,310 --> 00:03:43,480 when you look at these bimodal distributions. 70 00:03:43,480 --> 00:03:46,540 So then I said, well, what if we just look at adults? 71 00:03:46,540 --> 00:03:54,230 So we look at people from age greater than 1 up to age 90. 72 00:03:54,230 --> 00:03:56,940 And I'll say a word about that in a minute. 73 00:03:56,940 --> 00:03:58,840 And I look at those two distributions. 74 00:03:58,840 --> 00:04:00,100 They look pretty close. 75 00:04:00,100 --> 00:04:02,120 They look pretty similar. 76 00:04:02,120 --> 00:04:06,160 So that means that the number of patients 77 00:04:06,160 --> 00:04:09,010 of different ages in the adult group 78 00:04:09,010 --> 00:04:12,220 is similar in the two data sets. 79 00:04:12,220 --> 00:04:17,290 But if I don't exclude the very young or the very old, 80 00:04:17,290 --> 00:04:20,200 then I see this funny distribution 81 00:04:20,200 --> 00:04:22,840 where I have suppressed ages greater than 90 82 00:04:22,840 --> 00:04:24,700 but not the young. 
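The point about means and bimodal distributions can be sketched in a few lines of Python. The records below are invented for illustration (they are not MIMIC rows), and the tuple layout and the `mean_hr` helper are assumptions. The sketch only shows that a pooled mean is pulled upward by a subgroup of very young patients, and that restricting to the age range mentioned above brings the two systems close together.

```python
from statistics import mean

# Hypothetical (system, age_in_years, heart_rate) records, invented for illustration.
records = [
    ("carevue", 0, 150), ("carevue", 0, 160), ("carevue", 60, 85), ("carevue", 75, 80),
    ("metavision", 55, 88), ("metavision", 70, 82), ("metavision", 80, 78),
]

def mean_hr(rows, system, min_age=None, max_age=None):
    """Mean heart rate for one system, optionally restricted to min_age < age <= max_age."""
    rates = [hr for s, age, hr in rows
             if s == system
             and (min_age is None or age > min_age)
             and (max_age is None or age <= max_age)]
    return mean(rates)

all_carevue = mean_hr(records, "carevue")            # pulled up by the age-0 rows
adult_carevue = mean_hr(records, "carevue", 1, 90)   # age greater than 1, up to 90
adult_metavision = mean_hr(records, "metavision", 1, 90)

print(all_carevue, adult_carevue, adult_metavision)
```

With these made-up rows, the unrestricted CareVue mean is far above either adult mean, while the two adult means differ by less than one beat per minute.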
83 00:04:24,700 --> 00:04:26,890 And what you see is that in CareVue there's 84 00:04:26,890 --> 00:04:29,080 this giant spike at age 0. 85 00:04:32,440 --> 00:04:37,570 So what happened at the hospital is that under the old system 86 00:04:37,570 --> 00:04:41,860 it was also being used in the NICU, the Neonatal Intensive 87 00:04:41,860 --> 00:04:43,110 Care Unit. 88 00:04:43,110 --> 00:04:46,750 And the new system was not being used in the NICU. 89 00:04:46,750 --> 00:04:52,090 And therefore, they didn't capture data about babies. 90 00:04:52,090 --> 00:04:57,100 And in fact, if you look at age versus heart 91 00:04:57,100 --> 00:05:00,520 rate of the entire population, you 92 00:05:00,520 --> 00:05:03,200 see two very peculiar things. 93 00:05:03,200 --> 00:05:07,120 So here are the adults that we've been talking about, 94 00:05:07,120 --> 00:05:08,290 and here are the babies. 95 00:05:08,290 --> 00:05:11,080 And sure enough, they have higher heart rates. 96 00:05:11,080 --> 00:05:13,892 And then here are these 300-year-old people. 97 00:05:13,892 --> 00:05:15,040 [LAUGHTER] 98 00:05:15,040 --> 00:05:18,460 You go, wow, I don't think I'm going to have a heart rate when 99 00:05:18,460 --> 00:05:21,000 I'm 300 years old. 100 00:05:21,000 --> 00:05:23,020 So who are those people? 101 00:05:26,230 --> 00:05:28,960 Anybody have a clue? 102 00:05:28,960 --> 00:05:29,612 Yeah? 103 00:05:29,612 --> 00:05:30,940 AUDIENCE: Entry errors. 104 00:05:30,940 --> 00:05:31,898 PETER SZOLOVITS: Sorry? 105 00:05:31,898 --> 00:05:34,107 AUDIENCE: Entry errors? 106 00:05:34,107 --> 00:05:35,940 PETER SZOLOVITS: There are too many of them. 107 00:05:35,940 --> 00:05:38,890 Yeah, entry errors is always a possibility, 108 00:05:38,890 --> 00:05:41,470 but there's quite a few data points there. 109 00:05:41,470 --> 00:05:42,176 Yeah? 110 00:05:42,176 --> 00:05:45,582 AUDIENCE: [INAUDIBLE] 111 00:05:45,582 --> 00:05:46,540 PETER SZOLOVITS: Close. 
112 00:05:46,540 --> 00:05:48,460 It's not quite missing data. 113 00:05:48,460 --> 00:05:52,150 So HIPAA, the Health Insurance Portability and Accountability 114 00:05:52,150 --> 00:05:56,620 Act, defines a set of criteria about protecting 115 00:05:56,620 --> 00:05:58,850 personal health information. 116 00:05:58,850 --> 00:06:01,540 And one of the things you are not allowed to do 117 00:06:01,540 --> 00:06:04,270 is to specify the age of somebody 118 00:06:04,270 --> 00:06:07,600 who is 90 years old or older. 119 00:06:07,600 --> 00:06:11,200 And the reason is because the number of 97-year-olds 120 00:06:11,200 --> 00:06:12,800 is pretty small. 121 00:06:12,800 --> 00:06:16,420 And so if I tell you that Willy is 97 years old, 122 00:06:16,420 --> 00:06:19,120 then you're going to be able to pick him out of a population 123 00:06:19,120 --> 00:06:24,640 relatively easily, and so it's prohibited to say that. 124 00:06:24,640 --> 00:06:28,150 So as a result, everybody who's 90 or older 125 00:06:28,150 --> 00:06:32,350 gets labeled as being 300 years old in the database. 126 00:06:32,350 --> 00:06:34,840 It's an artifact. 127 00:06:34,840 --> 00:06:41,770 It's like back in my youth, I worked as a computer programmer 128 00:06:41,770 --> 00:06:45,370 at a health sciences computing facility at UCLA. 129 00:06:45,370 --> 00:06:49,390 And we used to have a convention that missing data was 130 00:06:49,390 --> 00:06:53,272 represented by 99999. 131 00:06:53,272 --> 00:06:57,640 And of course, if you average that into a real data set, 132 00:06:57,640 --> 00:07:01,720 you get garbage, which people did regularly. 133 00:07:01,720 --> 00:07:03,850 So there are problems with this, and we're 134 00:07:03,850 --> 00:07:05,530 running into one of those. 135 00:07:05,530 --> 00:07:11,050 If you look at just the adults, the two systems 136 00:07:11,050 --> 00:07:12,620 actually look very similar. 
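A minimal Python sketch of the two artifacts just described, using made-up numbers: ages of 90 or older stored as 300 (or slightly more), and missing values stored as 99999. The values, the constant names, and the `clean_mean` helper are hypothetical and are not part of MIMIC. The only point is that sentinel values need to be masked before averaging.

```python
from statistics import mean

AGE_SENTINEL_FLOOR = 300   # assumption: de-identified ages of 90+ appear as 300 or more
MISSING_SENTINEL = 99999   # assumption: the legacy "missing value" code mentioned above

def clean_mean(values, is_sentinel):
    """Average only the values that are not sentinels; return None if none remain."""
    kept = [v for v in values if not is_sentinel(v)]
    return mean(kept) if kept else None

ages = [45, 67, 82, 300, 300.5]      # made-up ages in years
heart_rates = [72, 88, 99999, 95]    # made-up heart rates, one of them missing

naive_age = mean(ages)                                              # distorted by the 300s
masked_age = clean_mean(ages, lambda a: a >= AGE_SENTINEL_FLOOR)
naive_hr = mean(heart_rates)                                        # garbage: includes 99999
masked_hr = clean_mean(heart_rates, lambda h: h == MISSING_SENTINEL)

print(naive_age, masked_age, naive_hr, masked_hr)
```

Masking drops the affected patients from the average rather than guessing their true ages, which is a design choice; whether that is acceptable depends on the analysis.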
137 00:07:12,620 --> 00:07:16,150 So the blue and red dots are the two systems, 138 00:07:16,150 --> 00:07:19,180 and I've drawn the trend lines between them, 139 00:07:19,180 --> 00:07:21,680 and you can see that they're very similar. 140 00:07:21,680 --> 00:07:23,720 So it looks like as you get older, 141 00:07:23,720 --> 00:07:27,130 your heart rate declines very slightly. 142 00:07:27,130 --> 00:07:29,890 But it does so equally in the two data sets. 143 00:07:29,890 --> 00:07:30,390 Yeah? 144 00:07:30,390 --> 00:07:32,330 AUDIENCE: On the previous slide, beyond 300, 145 00:07:32,330 --> 00:07:35,320 it looks like they're older than 300? 146 00:07:35,320 --> 00:07:37,600 PETER SZOLOVITS: Well, that's because the ages there 147 00:07:37,600 --> 00:07:41,120 are computed at the time that the heart rate is measured. 148 00:07:41,120 --> 00:07:44,170 And so if you are 300 years old when 149 00:07:44,170 --> 00:07:46,420 you're admitted to the hospital, if you 150 00:07:46,420 --> 00:07:48,910 stay in the hospital for six months, 151 00:07:48,910 --> 00:07:51,220 then you're 300 and 1/2 years old 152 00:07:51,220 --> 00:07:52,595 by the time of that measurement. 153 00:07:52,595 --> 00:07:55,150 [LAUGHS] So that's why there are data points 154 00:07:55,150 --> 00:07:58,000 to the right of 300. 155 00:07:58,000 --> 00:08:01,240 Yeah, good catch. 156 00:08:01,240 --> 00:08:03,580 OK, and then this is what the babies look like. 157 00:08:03,580 --> 00:08:06,730 And of course, they do have higher heart rates. 158 00:08:06,730 --> 00:08:08,500 And here are the oldsters. 159 00:08:08,500 --> 00:08:13,600 So actually, there are people out to 310 years 160 00:08:13,600 --> 00:08:18,430 old because maybe they were discharged from the hospital. 161 00:08:18,430 --> 00:08:21,410 And then at age 100, they came back. 
162 00:08:21,410 --> 00:08:23,320 You know, maybe they were 90 years old 163 00:08:23,320 --> 00:08:26,540 at the time they were initially admitted. 10 years later, 164 00:08:26,540 --> 00:08:29,650 they came back, and we recorded more data about them, 165 00:08:29,650 --> 00:08:33,600 and so this is all relative to that 300. 166 00:08:33,600 --> 00:08:35,679 OK, so that's just one example. 167 00:08:35,679 --> 00:08:38,799 And the lesson there is be careful 168 00:08:38,799 --> 00:08:42,070 when you look at data because it can really easily fool 169 00:08:42,070 --> 00:08:45,850 you 'cause there are all kinds of funny things about the way 170 00:08:45,850 --> 00:08:49,360 it's collected, about these artifactual things 171 00:08:49,360 --> 00:08:54,370 like 300-year-old patients and so on. 172 00:08:54,370 --> 00:08:57,100 So here's a catalog of the types of data 173 00:08:57,100 --> 00:08:59,330 that are available to us. 174 00:08:59,330 --> 00:09:03,400 So we have the typical kind of electronic health record data 175 00:09:03,400 --> 00:09:05,110 from hospitals-- 176 00:09:05,110 --> 00:09:09,010 demographics, age, sex, socioeconomic status, 177 00:09:09,010 --> 00:09:11,740 insurance type, language, religion, living situation, 178 00:09:11,740 --> 00:09:15,700 family structure, location, work, et cetera. 179 00:09:15,700 --> 00:09:17,020 We have vital signs-- 180 00:09:17,020 --> 00:09:19,840 your weight, your height, your pulse, respiration rate, 181 00:09:19,840 --> 00:09:22,430 body temperature, et cetera. 182 00:09:22,430 --> 00:09:24,070 So these are typically the things 183 00:09:24,070 --> 00:09:27,040 that if you ever go to a doctor's office, 184 00:09:27,040 --> 00:09:29,860 or you go into a hospital, the nurse 185 00:09:29,860 --> 00:09:33,130 will take you aside and weigh you and measure your height 186 00:09:33,130 --> 00:09:37,930 and check your blood pressure and take your temperature 187 00:09:37,930 --> 00:09:38,890 and stuff like that. 
188 00:09:38,890 --> 00:09:41,380 These are standard vital signs, and so we 189 00:09:41,380 --> 00:09:44,380 have lots of those recorded. 190 00:09:44,380 --> 00:09:47,920 Medications-- prescription medications, 191 00:09:47,920 --> 00:09:50,420 over-the-counter drugs, illegal drugs 192 00:09:50,420 --> 00:09:52,660 if you're willing not to lie to your health care 193 00:09:52,660 --> 00:09:56,050 provider, alcohol. 194 00:09:56,050 --> 00:09:58,870 Again, one of my earliest days, I 195 00:09:58,870 --> 00:10:03,010 was hanging out with a cardiologist at Tufts Medical 196 00:10:03,010 --> 00:10:06,700 Center, and we see this elderly lady 197 00:10:06,700 --> 00:10:11,140 who looks kind of terrible. 198 00:10:11,140 --> 00:10:13,060 And we're talking to her-- 199 00:10:13,060 --> 00:10:14,860 well, the doctor is talking to her. 200 00:10:14,860 --> 00:10:18,300 I'm trying to stay out of the way. 201 00:10:18,300 --> 00:10:24,430 And he says, so do you drink alcohol? 202 00:10:24,430 --> 00:10:27,760 And she says, oh, no, never touch the stuff. 203 00:10:27,760 --> 00:10:30,740 And then we talk some more, and we go out 204 00:10:30,740 --> 00:10:32,440 of the patient's room. 205 00:10:32,440 --> 00:10:38,360 And the doctor turns to me out of earshot of the patient 206 00:10:38,360 --> 00:10:41,540 and says, oh, she's a chronic drunk. 207 00:10:41,540 --> 00:10:43,400 I said, well, how do you know? 208 00:10:43,400 --> 00:10:45,710 And he says, well, from lab tests, 209 00:10:45,710 --> 00:10:49,760 from the appearance of her skin, from her general demeanor, 210 00:10:49,760 --> 00:10:53,750 from various sort of ineffable factors. 211 00:10:53,750 --> 00:10:57,200 And so patients lie. 212 00:10:57,200 --> 00:10:59,750 They really do because they don't want to tell you things. 213 00:11:02,910 --> 00:11:05,250 Medications, by the way, is a big deal. 
214 00:11:05,250 --> 00:11:09,350 So there is this whole field called med rec, medication 215 00:11:09,350 --> 00:11:14,240 reconciliation, which is the hospital's or the doctor's 216 00:11:14,240 --> 00:11:17,600 office's attempt to figure out what medications 217 00:11:17,600 --> 00:11:19,440 you're actually taking. 218 00:11:19,440 --> 00:11:21,800 So I'm a member of the MIT health plan, 219 00:11:21,800 --> 00:11:25,070 and if I sign into my health plan account, 220 00:11:25,070 --> 00:11:29,270 it tells me that I'm taking some pills that I 221 00:11:29,270 --> 00:11:36,320 got 12 years ago as part of a laboratory test, 222 00:11:36,320 --> 00:11:39,500 where I took two pills which were supposed 223 00:11:39,500 --> 00:11:41,390 to have some physiological effect, 224 00:11:41,390 --> 00:11:43,070 and then they measured that. 225 00:11:43,070 --> 00:11:44,990 And I've never gotten another pill 226 00:11:44,990 --> 00:11:47,900 and never taken one since then, nor would it 227 00:11:47,900 --> 00:11:50,270 be particularly good for me. 228 00:11:50,270 --> 00:11:52,670 But it's still on my record, and there's 229 00:11:52,670 --> 00:11:56,720 no notice of it ever having been discontinued. 230 00:11:56,720 --> 00:11:59,090 And that's a real problem because if you're 231 00:11:59,090 --> 00:12:01,250 taking care of a patient, you'd like 232 00:12:01,250 --> 00:12:04,030 to understand what drugs they're actually taking, 233 00:12:04,030 --> 00:12:07,760 and it's hard to know. 234 00:12:07,760 --> 00:12:10,490 Then lab tests-- so this is the things 235 00:12:10,490 --> 00:12:14,810 that you imagine that we do a lot of, 236 00:12:14,810 --> 00:12:18,530 and these are components of the blood and the urine mainly, 237 00:12:18,530 --> 00:12:23,540 but also of the stool, saliva, spinal fluid, fluid taken off 238 00:12:23,540 --> 00:12:28,010 the belly, joint fluid, bone marrow, stuff 239 00:12:28,010 --> 00:12:30,030 coming out of your lungs. 
240 00:12:30,030 --> 00:12:33,290 It's anything and any place where 241 00:12:33,290 --> 00:12:36,800 you can produce some specimen, they 242 00:12:36,800 --> 00:12:39,380 can send it to a lab and measure things in it, 243 00:12:39,380 --> 00:12:42,600 and they measure lots and lots of different kinds of things. 244 00:12:42,600 --> 00:12:44,660 And these are often useful. 245 00:12:44,660 --> 00:12:48,740 Pathology, qualitative and quantitative examination 246 00:12:48,740 --> 00:12:51,830 of any body tissue, for example, biopsy samples 247 00:12:51,830 --> 00:12:55,340 or surgical scraps. 248 00:12:55,340 --> 00:12:57,290 You know, if they do an operation, 249 00:12:57,290 --> 00:12:59,960 they cut something out of you, that typically 250 00:12:59,960 --> 00:13:01,765 winds up on a pathologist's bench, 251 00:13:01,765 --> 00:13:05,230 who then tries to figure out what its characteristics are 252 00:13:05,230 --> 00:13:09,590 and that's, again, useful information. 253 00:13:09,590 --> 00:13:13,310 Microbiology-- ever since Pasteur, 254 00:13:13,310 --> 00:13:16,580 we know that organisms cause disease. 255 00:13:16,580 --> 00:13:21,950 And so we're quite interested in knowing what organisms 256 00:13:21,950 --> 00:13:23,750 are growing inside your body. 257 00:13:26,480 --> 00:13:30,770 And typically, testing is not only to identify the organism 258 00:13:30,770 --> 00:13:33,140 but also to figure out which antibiotics 259 00:13:33,140 --> 00:13:36,260 it's sensitive to and insensitive to. 260 00:13:36,260 --> 00:13:41,900 And so you'll see things like reports of sensitivity testing 261 00:13:41,900 --> 00:13:44,270 at various dilutions. 
262 00:13:44,270 --> 00:13:47,300 In other words, they try to give a strong dose 263 00:13:47,300 --> 00:13:49,870 of an antibiotic, a weaker dose, a weaker 264 00:13:49,870 --> 00:13:52,250 dose, a weaker dose, a weaker dose 265 00:13:52,250 --> 00:13:55,550 to see which is the minimum level of dosing 266 00:13:55,550 --> 00:13:58,220 that's enough to kill the bacteria. 267 00:13:58,220 --> 00:14:01,580 There's a comma missing there, but input, output of fluids 268 00:14:01,580 --> 00:14:06,380 is another important thing because people, especially 269 00:14:06,380 --> 00:14:10,940 in the hospital, often get either dehydrated or 270 00:14:10,940 --> 00:14:11,990 overhydrated. 271 00:14:11,990 --> 00:14:13,790 And neither of those is good for you, 272 00:14:13,790 --> 00:14:16,940 and so trying to keep track of what's going into you 273 00:14:16,940 --> 00:14:20,990 and what's coming out of you is important. 274 00:14:20,990 --> 00:14:22,740 Then there are tons of notes. 275 00:14:22,740 --> 00:14:26,960 So an important one that we're going to look at in this class 276 00:14:26,960 --> 00:14:28,700 is discharge summaries. 277 00:14:28,700 --> 00:14:31,640 So these are the typically long notes 278 00:14:31,640 --> 00:14:35,390 that are written at the end of a hospitalization. 279 00:14:35,390 --> 00:14:41,570 So this is a summary of why you came in, what they did to you, 280 00:14:41,570 --> 00:14:45,410 the main things they discovered about you, 281 00:14:45,410 --> 00:14:48,470 and then plans for what to do after your discharge. 282 00:14:48,470 --> 00:14:50,340 Where are you going to go? 283 00:14:50,340 --> 00:14:52,250 What drugs are you going to be taking? 284 00:14:52,250 --> 00:14:55,650 When are you supposed to come back for follow up, et cetera. 285 00:14:55,650 --> 00:14:58,850 I'll show you an excruciatingly long one of those 286 00:14:58,850 --> 00:15:01,650 later in the lecture today. 
287 00:15:01,650 --> 00:15:04,760 But we also have notes from attendings and/or residents, 288 00:15:04,760 --> 00:15:08,990 nurses, various specialties, consultants. 289 00:15:08,990 --> 00:15:11,390 The referring physician-- if somebody sends you 290 00:15:11,390 --> 00:15:14,360 to the hospital, that doctor will usually 291 00:15:14,360 --> 00:15:17,310 write a note saying this is what I'm interested in. 292 00:15:17,310 --> 00:15:20,110 Here's why I'm sending in the patient. 293 00:15:20,110 --> 00:15:24,090 There are letters back to the referring physician saying, OK, 294 00:15:24,090 --> 00:15:25,460 this is what we found out. 295 00:15:25,460 --> 00:15:28,810 Here's the answer to the question you were asking. 296 00:15:28,810 --> 00:15:30,820 There are emergency department notes. 297 00:15:30,820 --> 00:15:34,450 So that's often the first contact between the patient 298 00:15:34,450 --> 00:15:36,410 and the health care system. 299 00:15:36,410 --> 00:15:38,620 So these are all important. 300 00:15:38,620 --> 00:15:43,300 And then there's tons and tons of billing data. 301 00:15:43,300 --> 00:15:48,610 So remember the EHR systems were initially 302 00:15:48,610 --> 00:15:51,550 designed by accountants. 303 00:15:51,550 --> 00:15:54,520 And they were designed for the purpose of billing. 304 00:15:54,520 --> 00:15:58,810 And so we capture a lot of data about formalized ways 305 00:15:58,810 --> 00:16:01,730 of describing the condition of the patient 306 00:16:01,730 --> 00:16:04,540 and what was done to the patient in order 307 00:16:04,540 --> 00:16:07,720 to submit the right bills. 308 00:16:07,720 --> 00:16:12,320 You obviously want to bill for as much as possible. 309 00:16:12,320 --> 00:16:14,830 But you have to be able to justify the bills that you 310 00:16:14,830 --> 00:16:18,610 submit because insurance companies and Medicare 311 00:16:18,610 --> 00:16:23,112 and Medicaid don't have a good sense of humor. 
312 00:16:23,112 --> 00:16:27,040 And if you submit bills for things that you can't justify, 313 00:16:27,040 --> 00:16:28,180 then you get penalized. 314 00:16:31,690 --> 00:16:33,610 And then there are administrative data 315 00:16:33,610 --> 00:16:36,100 like, which service are you on? 316 00:16:36,100 --> 00:16:40,300 So this is occasionally a confusing thing. 317 00:16:40,300 --> 00:16:43,990 You can go into the hospital and have heart problems, 318 00:16:43,990 --> 00:16:46,750 but it turns out that the heart intensive care 319 00:16:46,750 --> 00:16:51,400 unit, the cardiac intensive care unit, is full up with patients. 320 00:16:51,400 --> 00:16:56,110 But there's an extra bed in the pulmonary intensive care unit, 321 00:16:56,110 --> 00:16:58,790 and so they stick you in that unit, 322 00:16:58,790 --> 00:17:01,300 but you're still on the cardiology service. 323 00:17:01,300 --> 00:17:05,140 And so there are these sort of mixture kinds of cases that you 324 00:17:05,140 --> 00:17:06,700 still have to take care of. 325 00:17:06,700 --> 00:17:09,579 Transfers are when you get transferred from one place 326 00:17:09,579 --> 00:17:13,119 to another in the hospital. 327 00:17:13,119 --> 00:17:17,890 Imaging data-- so I'm not going to talk about that much today, 328 00:17:17,890 --> 00:17:23,410 but there are X-rays, ultrasound, CT, MRI, PET scans, 329 00:17:23,410 --> 00:17:27,760 retinal scans, endoscopy, photographs of your skin 330 00:17:27,760 --> 00:17:28,790 and stuff like that. 331 00:17:28,790 --> 00:17:31,750 So this is all imaging data, and there's 332 00:17:31,750 --> 00:17:33,550 been a tremendous amount of progress 333 00:17:33,550 --> 00:17:35,980 recently in applying machine learning 334 00:17:35,980 --> 00:17:39,590 techniques to try to interpret the contents of these data. 335 00:17:39,590 --> 00:17:42,590 So these are also very important. 336 00:17:42,590 --> 00:17:47,170 And then there's the whole quantified self movement. 
337 00:17:47,170 --> 00:17:50,185 I mean, how many of you wear an activity tracker? 338 00:17:52,900 --> 00:17:54,520 Only about 1/3? 339 00:17:54,520 --> 00:17:56,980 I'm surprised at a place like MIT. 340 00:17:56,980 --> 00:17:58,900 [LAUGHTER] 341 00:17:58,900 --> 00:18:02,740 So you know, we measure steps and elevation change 342 00:18:02,740 --> 00:18:03,590 and workouts. 343 00:18:03,590 --> 00:18:08,140 And you can record vital signs and diet 344 00:18:08,140 --> 00:18:12,040 and your blood sugar, especially if you're diabetic; 345 00:18:12,040 --> 00:18:16,710 allergies, allergic incidents. 346 00:18:16,710 --> 00:18:20,380 There's all this mindfulness, mood, sleep, pain, 347 00:18:20,380 --> 00:18:22,180 sexual activity. 348 00:18:22,180 --> 00:18:24,520 And then people have developed this idea 349 00:18:24,520 --> 00:18:27,340 of N of 1 experiments. 350 00:18:27,340 --> 00:18:29,800 For example, I had a student some years 351 00:18:29,800 --> 00:18:33,490 ago who suffered from psoriasis. 352 00:18:33,490 --> 00:18:36,730 It's a grody condition of the skin. 353 00:18:36,730 --> 00:18:39,160 And the problem is there are no good cures for it. 354 00:18:39,160 --> 00:18:42,130 And so people who suffer from psoriasis 355 00:18:42,130 --> 00:18:43,750 try all kinds of things. 356 00:18:43,750 --> 00:18:46,210 You know, they stop eating nuts for a while, 357 00:18:46,210 --> 00:18:49,120 or they douse themselves with vinegar. 358 00:18:49,120 --> 00:18:54,040 Or they do whatever crazy thing comes to mind. 359 00:18:54,040 --> 00:18:57,970 And we don't have a good theory for how to treat this disease. 360 00:18:57,970 --> 00:19:01,280 But on the other hand, some things work for some people. 361 00:19:01,280 --> 00:19:03,460 And so there's a whole methodology that 362 00:19:03,460 --> 00:19:06,820 has been developed that says, when you try these things, 363 00:19:06,820 --> 00:19:08,770 act like a scientist. 364 00:19:08,770 --> 00:19:10,360 Have hypotheses. 
365 00:19:10,360 --> 00:19:11,320 Take good notes. 366 00:19:11,320 --> 00:19:14,380 Collect good data. 367 00:19:14,380 --> 00:19:18,040 Be cognizant of things like onset periods, where 368 00:19:18,040 --> 00:19:21,190 you know you may have to drip vinegar on yourself for a week 369 00:19:21,190 --> 00:19:23,270 before you see any effect. 370 00:19:23,270 --> 00:19:26,890 So if that doesn't do a thing after one day, don't stop. 371 00:19:26,890 --> 00:19:31,510 And furthermore, if you stop then don't start something new 372 00:19:31,510 --> 00:19:35,470 immediately because you will then 373 00:19:35,470 --> 00:19:39,010 be confused about whether this is the effect of the thing 374 00:19:39,010 --> 00:19:42,250 you were on before or the new thing that you're trying. 375 00:19:42,250 --> 00:19:46,480 So there's all sorts of ideas like that. 376 00:19:46,480 --> 00:19:50,740 So this is a slide from our paper on MIMIC-III. 377 00:19:50,740 --> 00:19:54,370 And it gives you a kind of overview of what's 378 00:19:54,370 --> 00:19:56,390 going on with the patient. 379 00:19:56,390 --> 00:19:59,210 So if you look at this-- 380 00:19:59,210 --> 00:20:01,570 I'm going to point with my hands-- 381 00:20:01,570 --> 00:20:04,150 at the top is something very important. 382 00:20:04,150 --> 00:20:07,640 This patient starts off at full code. 383 00:20:07,640 --> 00:20:11,390 That means that if something bad happens to him, 384 00:20:11,390 --> 00:20:15,340 he wants everything to be done to try to save him. 385 00:20:15,340 --> 00:20:17,470 And he winds up in comfort measures 386 00:20:17,470 --> 00:20:21,110 only, which means that if something bad happens to him, 387 00:20:21,110 --> 00:20:23,350 he wants to die-- 388 00:20:23,350 --> 00:20:28,360 or his family does if he's unconscious. 389 00:20:28,360 --> 00:20:30,560 So what else do we know about this guy? 390 00:20:30,560 --> 00:20:34,730 Well GCS is the Glasgow Coma Score. 
391 00:20:34,730 --> 00:20:38,090 And it's a way of quantifying people's level 392 00:20:38,090 --> 00:20:39,620 of consciousness. 393 00:20:39,620 --> 00:20:43,310 And you see that at the beginning 394 00:20:43,310 --> 00:20:47,120 this patient is oriented, and then gets confused. 395 00:20:47,120 --> 00:20:51,410 And finally, is only making incomprehensible words 396 00:20:51,410 --> 00:20:53,980 or sounds. 397 00:20:53,980 --> 00:20:57,590 Motor, he's able to obey commands. 398 00:20:57,590 --> 00:21:01,610 Eventually, he's only able to flex 399 00:21:01,610 --> 00:21:03,560 when you stimulate his muscles. 400 00:21:03,560 --> 00:21:06,970 So he's no longer conscious. 401 00:21:06,970 --> 00:21:11,900 Eye movements-- he's able to follow you spontaneously. 402 00:21:11,900 --> 00:21:14,000 He's able to orient to speech. 403 00:21:14,000 --> 00:21:16,200 And eventually no orientation at all. 404 00:21:16,200 --> 00:21:20,700 So this is clearly somebody who's going downhill quickly 405 00:21:20,700 --> 00:21:25,160 and, in fact, dies at the end of this episode. 406 00:21:25,160 --> 00:21:28,220 Now, we then look at labs so we can 407 00:21:28,220 --> 00:21:32,240 see what is their level of platelets at about the time 408 00:21:32,240 --> 00:21:35,000 that they're measured, their creatinine level, 409 00:21:35,000 --> 00:21:38,970 their white blood cell count, the neutrophils percentage, et 410 00:21:38,970 --> 00:21:40,220 cetera. 411 00:21:40,220 --> 00:21:44,450 And there's not every possible data point on the slide. 412 00:21:44,450 --> 00:21:47,150 This is just illustrative. 413 00:21:47,150 --> 00:21:49,590 The next section is medications. 414 00:21:49,590 --> 00:21:51,920 So the person is on morphine. 415 00:21:51,920 --> 00:21:55,330 They're on Vancomycin, which is an antibiotic. 416 00:21:55,330 --> 00:21:57,440 Piperacillin-- I don't know what that is. 417 00:21:57,440 --> 00:21:58,842 Does somebody know? 
418 00:21:58,842 --> 00:22:00,288 AUDIENCE: Antibiotic. 419 00:22:00,288 --> 00:22:01,463 PETER SZOLOVITS: It's what? 420 00:22:01,463 --> 00:22:02,822 AUDIENCE: It's antibiotic. 421 00:22:02,822 --> 00:22:05,090 PETER SZOLOVITS: OK. 422 00:22:05,090 --> 00:22:09,210 Sodium chloride 0.9%, so that's just keeping him hydrated. 423 00:22:09,210 --> 00:22:11,310 Amiodarone and dextrose. 424 00:22:11,310 --> 00:22:14,480 So dextrose is giving him some energy. 425 00:22:14,480 --> 00:22:18,030 And then these are the various measurements. 426 00:22:18,030 --> 00:22:22,010 So you see the heart rate, for example, is up pretty high 427 00:22:22,010 --> 00:22:24,290 and is going up near the end. 428 00:22:24,290 --> 00:22:28,310 The oxygen saturation starts off pretty good. 429 00:22:28,310 --> 00:22:33,540 But here we're down to 60% or 50% 430 00:22:33,540 --> 00:22:37,640 O2 sat, which is supposed to be above about 92 431 00:22:37,640 --> 00:22:40,100 in order to be considered reasonable. 432 00:22:40,100 --> 00:22:42,770 So again, this is a very consistent picture 433 00:22:42,770 --> 00:22:48,300 of things going very badly wrong for this particular patient. 434 00:22:48,300 --> 00:22:52,080 So this is all the data in the database. 435 00:22:52,080 --> 00:22:54,680 Now, if you want to try to analyze some of this stuff, 436 00:22:54,680 --> 00:22:59,450 you can say, well, let's look at the ages 437 00:22:59,450 --> 00:23:03,360 at the time of the last lab measurement in the database. 438 00:23:03,360 --> 00:23:06,330 So we have the times of all the lab measurements. 439 00:23:06,330 --> 00:23:13,580 So we can see that many of the ICU population are fairly old. 440 00:23:13,580 --> 00:23:18,120 There's a relatively small number of young people 441 00:23:18,120 --> 00:23:22,610 and then a growing number of older people in both females 442 00:23:22,610 --> 00:23:24,140 and males. 
443 00:23:24,140 --> 00:23:29,430 If we look at age at admission by gender-- 444 00:23:29,430 --> 00:23:32,870 so this is age at admission not age at the time the last lab 445 00:23:32,870 --> 00:23:35,060 measurement was done-- 446 00:23:35,060 --> 00:23:37,160 it's a pretty similar curve. 447 00:23:37,160 --> 00:23:42,800 So we see that females were 64.21 448 00:23:42,800 --> 00:23:51,470 at time of last lab measurement; 63.5 at the time of admission. 449 00:23:51,470 --> 00:23:55,580 So we can look at demographics, and demographics typically 450 00:23:55,580 --> 00:24:00,560 includes these kinds of factors, which I've mentioned before. 451 00:24:00,560 --> 00:24:02,900 And again, if we're interested in the relationship 452 00:24:02,900 --> 00:24:06,960 between this and, for example, the age distribution, 453 00:24:06,960 --> 00:24:12,860 we see that if you look at the different admission types-- 454 00:24:12,860 --> 00:24:18,260 so you can be either admitted for an emergency 455 00:24:18,260 --> 00:24:22,670 for some urgent care or electively. 456 00:24:22,670 --> 00:24:25,850 And it doesn't seem to make a whole lot of difference, 457 00:24:25,850 --> 00:24:31,260 at least in the means of the population age distribution. 458 00:24:31,260 --> 00:24:35,390 On the other hand, if you look at insurance type 459 00:24:35,390 --> 00:24:38,320 and, say, who's paying the bills, 460 00:24:38,320 --> 00:24:42,190 there is a big difference in the age distributions. 461 00:24:42,190 --> 00:24:48,710 Now, why do you think that private insurance drops way off 462 00:24:48,710 --> 00:24:51,331 at about 65? 463 00:24:51,331 --> 00:24:54,000 AUDIENCE: Isn't insurance always covered for everyone 464 00:24:54,000 --> 00:24:56,110 by the state health? 465 00:24:56,110 --> 00:24:57,860 PETER SZOLOVITS: It's because of Medicare. 466 00:24:57,860 --> 00:25:02,450 So Medicare covers people who are 65 years old. 
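The age comparisons above (by gender, admission type, and insurance) all come down to grouping admissions by one demographic field and summarizing age within each group. A minimal sketch of that step follows. The rows are synthetic and made up for illustration, and the field names `insurance` and `age` and the helper `mean_age_by` are hypothetical, not the MIMIC schema or the lecturer's code.

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows for illustration only; these are not values from MIMIC.
admissions = [
    {"insurance": "Private", "age": 45},
    {"insurance": "Private", "age": 58},
    {"insurance": "Medicare", "age": 71},
    {"insurance": "Medicare", "age": 83},
    {"insurance": "Self Pay", "age": 39},
]


def mean_age_by(rows, key):
    """Group rows by the value of `key` and return the mean age per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["age"])
    return {name: mean(ages) for name, ages in groups.items()}


age_by_insurance = mean_age_by(admissions, "insurance")
```

The same helper could be called with any other categorical field. A mean alone hides the shape of the distribution, which is why the lecture looks at the full age curves as well.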
467 00:25:02,450 --> 00:25:04,370 There's a terrible story I have to tell you. 468 00:25:04,370 --> 00:25:07,430 I was talking to somebody at an insurance company who's 469 00:25:07,430 --> 00:25:10,370 a bit cynical, and he said suppose 470 00:25:10,370 --> 00:25:14,150 that you see a 63-year-old patient who's developing type 471 00:25:14,150 --> 00:25:16,100 2 diabetes, what should you do for him? 472 00:25:19,210 --> 00:25:21,180 Well, there are standard things you 473 00:25:21,180 --> 00:25:23,130 should do for somebody developing 474 00:25:23,130 --> 00:25:26,730 type 2 diabetes, like get him to eat better, get 475 00:25:26,730 --> 00:25:29,040 him to lose weight, get him to exercise more, 476 00:25:29,040 --> 00:25:31,000 et cetera, et cetera. 477 00:25:31,000 --> 00:25:33,570 But his cynical answer was absolutely nothing. 478 00:25:36,680 --> 00:25:37,180 Why? 479 00:25:37,180 --> 00:25:40,330 Well it's very cheap to do nothing. 480 00:25:40,330 --> 00:25:42,760 Most people who develop type 2 diabetes 481 00:25:42,760 --> 00:25:46,670 don't get real sick in the next two years. 482 00:25:46,670 --> 00:25:49,630 And by the time this patient is 65, 483 00:25:49,630 --> 00:25:51,670 he'll be the government's responsibility, 484 00:25:51,670 --> 00:25:55,440 not the insurance company's. 485 00:25:55,440 --> 00:25:56,977 Nice. 486 00:25:56,977 --> 00:25:57,602 AUDIENCE: Yeah. 487 00:26:01,107 --> 00:26:03,190 PETER SZOLOVITS: So of course a lot of the elderly 488 00:26:03,190 --> 00:26:07,840 are insured by Medicare or Medicaid, not that surprising. 489 00:26:07,840 --> 00:26:10,810 Self-pay is a pretty small number 490 00:26:10,810 --> 00:26:14,030 because it's insanely expensive to pay for your own health 491 00:26:14,030 --> 00:26:16,810 care. 492 00:26:16,810 --> 00:26:18,250 What about where you came from? 493 00:26:21,580 --> 00:26:25,330 Were you referred from a clinic, or were 494 00:26:25,330 --> 00:26:27,070 you an emergency room admit? 
495 00:26:27,070 --> 00:26:32,530 Or were you referred from an HMO or et cetera? 496 00:26:32,530 --> 00:26:38,170 And other than a transfer from a skilled nursing facility 497 00:26:38,170 --> 00:26:42,490 or transfer within the facility, within the hospital, 498 00:26:42,490 --> 00:26:44,200 it doesn't make much difference. 499 00:26:44,200 --> 00:26:47,350 The averages there and the distributions 500 00:26:47,350 --> 00:26:50,290 look moderately similar. 501 00:26:50,290 --> 00:26:53,440 If you're coming from a skilled nursing facility, 502 00:26:53,440 --> 00:26:56,290 if you are in a skilled nursing facility, 503 00:26:56,290 --> 00:27:00,600 you're probably old because younger people don't typically 504 00:27:00,600 --> 00:27:02,560 need skilled nursing care. 505 00:27:02,560 --> 00:27:06,490 And I'm not sure why transfers within the facility 506 00:27:06,490 --> 00:27:10,780 are significantly younger ages, but that's 507 00:27:10,780 --> 00:27:13,060 true from the MIMIC data. 508 00:27:15,610 --> 00:27:18,730 What about age at admission by language? 509 00:27:18,730 --> 00:27:20,950 So some people speak English. 510 00:27:20,950 --> 00:27:25,180 Some people speak not available. 511 00:27:25,180 --> 00:27:28,490 Some people speak Spanish, et cetera. 512 00:27:28,490 --> 00:27:30,970 So it turns out the Russians are the oldest. 513 00:27:33,610 --> 00:27:36,550 And that may have to do with immigration patterns, 514 00:27:36,550 --> 00:27:38,520 or I don't know exactly why. 515 00:27:41,040 --> 00:27:45,700 But that's what the data show. 516 00:27:45,700 --> 00:27:49,120 If you do it by ethnicity, it turns out 517 00:27:49,120 --> 00:27:53,590 that African-Americans, on the whole, 518 00:27:53,590 --> 00:27:56,080 are somewhat younger than whites. 519 00:27:56,080 --> 00:27:58,510 And Hispanics are somewhat younger yet. 
520 00:28:01,300 --> 00:28:06,520 So that means that those subpopulations apparently 521 00:28:06,520 --> 00:28:11,680 need intensive care earlier in life than whites. 522 00:28:11,680 --> 00:28:15,670 So this is a topic that's very hot right now, 523 00:28:15,670 --> 00:28:20,250 discussions about how bias might play into health care. 524 00:28:20,250 --> 00:28:20,864 Yeah? 525 00:28:20,864 --> 00:28:23,770 AUDIENCE: What does unable to obtain mean? 526 00:28:23,770 --> 00:28:25,690 PETER SZOLOVITS: It just means that somebody 527 00:28:25,690 --> 00:28:28,000 refused to say what their ethnicity was. 528 00:28:28,000 --> 00:28:29,540 AUDIENCE: When they were asked this? 529 00:28:29,540 --> 00:28:30,910 PETER SZOLOVITS: Yeah. 530 00:28:30,910 --> 00:28:31,750 I think. 531 00:28:31,750 --> 00:28:34,150 I'm not positive. 532 00:28:34,150 --> 00:28:36,610 AUDIENCE: So just to confirm. 533 00:28:36,610 --> 00:28:40,722 This also represents Boston's population dynamics too, right? 534 00:28:40,722 --> 00:28:42,430 PETER SZOLOVITS: It's the catchment basin 535 00:28:42,430 --> 00:28:44,530 of the Beth Israel Deaconess Hospital, 536 00:28:44,530 --> 00:28:47,020 which is Boston clearly. 537 00:28:47,020 --> 00:28:51,940 But there are-- it turns out that a lot of North Shore 538 00:28:51,940 --> 00:28:56,140 people go to Mass General, and so different hospitals have 539 00:28:56,140 --> 00:28:57,328 different catchment basins. 540 00:28:57,328 --> 00:28:59,870 AUDIENCE: Does it have anything to do with like, is this just 541 00:28:59,870 --> 00:29:00,640 the ICU? 542 00:29:00,640 --> 00:29:04,820 Or is this everybody who goes to the hospital or the ER? 543 00:29:04,820 --> 00:29:07,270 PETER SZOLOVITS: These are all people who at some point 544 00:29:07,270 --> 00:29:10,040 were in the ICU. 545 00:29:10,040 --> 00:29:12,580 So these are the sicker patients. 546 00:29:12,580 --> 00:29:13,690 Yeah? 
547 00:29:13,690 --> 00:29:15,630 AUDIENCE: So just want to double-check 548 00:29:15,630 --> 00:29:18,172 there's a higher proportion of black, African American people 549 00:29:18,172 --> 00:29:21,539 in the population here as well because the red is 550 00:29:21,539 --> 00:29:23,950 higher than the others? 551 00:29:23,950 --> 00:29:26,230 PETER SZOLOVITS: No, actually-- 552 00:29:26,230 --> 00:29:27,970 I don't remember if I have that graph-- 553 00:29:27,970 --> 00:29:29,883 I think this is cumulative. 554 00:29:29,883 --> 00:29:31,570 AUDIENCE: Oh, OK. 555 00:29:31,570 --> 00:29:35,530 PETER SZOLOVITS: So most people are 556 00:29:35,530 --> 00:29:41,680 white for whatever definition of white we're using. 557 00:29:41,680 --> 00:29:44,680 And I think it's only the increment that you see on top. 558 00:29:49,450 --> 00:29:50,950 All right, how about marital status? 559 00:29:53,490 --> 00:29:55,740 Well, according to this, it's bad to be single. 560 00:30:03,280 --> 00:30:06,130 So I could sort of see that for hospitalization. 561 00:30:06,130 --> 00:30:09,580 I'm not sure why it's true for the ICU 562 00:30:09,580 --> 00:30:13,870 because if you don't have anybody at home to take care 563 00:30:13,870 --> 00:30:16,780 of you when you get sick, it seems reasonable 564 00:30:16,780 --> 00:30:19,150 that you'd be more likely to wind up in the hospital. 565 00:30:19,150 --> 00:30:21,480 But I don't know why you'd wind up in intensive care. 566 00:30:21,480 --> 00:30:22,080 Yeah? 567 00:30:22,080 --> 00:30:24,790 AUDIENCE: Isn't it possible that those are also 568 00:30:24,790 --> 00:30:28,510 single people are probably younger than married people, 569 00:30:28,510 --> 00:30:30,130 and those are probably younger than-- 570 00:30:30,130 --> 00:30:30,790 PETER SZOLOVITS: Yes, yeah. 571 00:30:30,790 --> 00:30:31,450 AUDIENCE: [INAUDIBLE] people. 572 00:30:31,450 --> 00:30:33,533 PETER SZOLOVITS: Yeah, that's probably also right. 
573 00:30:39,330 --> 00:30:41,610 So here's an interesting question, 574 00:30:41,610 --> 00:30:43,380 a little bit related to something you'll 575 00:30:43,380 --> 00:30:46,800 see on the next problem set. 576 00:30:46,800 --> 00:30:50,520 So could we predict in-hospital mortality 577 00:30:50,520 --> 00:30:52,305 from just these demographic features? 578 00:30:56,100 --> 00:31:00,120 So I'm using a tool in language called 579 00:31:00,120 --> 00:31:02,940 R. This is a general linear model, 580 00:31:02,940 --> 00:31:06,540 and I've set it up to do basically logistic regression. 581 00:31:06,540 --> 00:31:09,180 And it says I'm predicting whether you 582 00:31:09,180 --> 00:31:16,920 die in the hospital based on these demographic factors. 583 00:31:16,920 --> 00:31:19,710 And it turns out that the only ones that 584 00:31:19,710 --> 00:31:23,940 are highly significant are age. 585 00:31:23,940 --> 00:31:27,300 So that's not surprising, that older people 586 00:31:27,300 --> 00:31:29,940 are more likely to die than younger people. 587 00:31:29,940 --> 00:31:32,100 It's generally true. 588 00:31:32,100 --> 00:31:36,210 And if I'm unable to obtain your ethnicity, 589 00:31:36,210 --> 00:31:42,540 or I don't know your ethnicity, then you're more likely to die. 590 00:31:42,540 --> 00:31:46,980 I have no clue why that might be the case. 591 00:31:46,980 --> 00:31:49,810 And other things are not as significant. 592 00:31:49,810 --> 00:31:52,170 So if you speak Spanish or English, 593 00:31:52,170 --> 00:31:54,180 you're slightly less likely to die. 594 00:31:54,180 --> 00:31:56,970 You see a negative contribution here. 595 00:31:56,970 --> 00:32:01,560 And if you speak Russian, you're slightly less likely to die. 596 00:32:01,560 --> 00:32:06,030 But it's significant not at the p equal 0.05 level, 597 00:32:06,030 --> 00:32:10,170 but it is at the p equal 0.06 level. 
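The model described above is a logistic regression of in-hospital death on demographic features. The lecture fits it with R; the sketch below is a stand-in written in plain Python so it runs without extra packages, and it is not the lecturer's code. The data are synthetic: age is given a real effect on the outcome and a made-up `married` flag is given none. The function names, learning rate, and epoch count are all assumptions, and the sketch does not compute the p-values discussed in the lecture.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights (intercept first) by batch gradient descent on mean log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Synthetic cohort (assumption): risk of death rises with age, marriage has no effect.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    age = rng.uniform(20, 90)
    age_scaled = (age - 60) / 10  # centre and scale so gradient descent is stable
    married = float(rng.random() < 0.5)
    p_death = sigmoid(-1.5 + 0.5 * age_scaled)
    X.append([age_scaled, married])
    y.append(1 if rng.random() < p_death else 0)

intercept, age_coef, married_coef = fit_logistic(X, y)
```

A positive `age_coef` corresponds to the finding that older patients are more likely to die. Judging which coefficients are significant needs standard errors as well, which a statistics package such as R's `glm` reports and this sketch omits.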
598 00:32:10,170 --> 00:32:13,260 And marriage doesn't seem to make much difference 599 00:32:13,260 --> 00:32:17,190 in predicting whether you're going to die or not. 600 00:32:17,190 --> 00:32:20,770 Now, remember, this is ICU patients. 601 00:32:20,770 --> 00:32:24,570 And we're looking at in-hospital mortality. 602 00:32:24,570 --> 00:32:26,607 AUDIENCE: For ethnicity, can they 603 00:32:26,607 --> 00:32:28,190 learn that at any point in this study, 604 00:32:28,190 --> 00:32:29,482 or just right at the beginning? 605 00:32:29,482 --> 00:32:30,445 Or do you know? 606 00:32:30,445 --> 00:32:31,320 Because I don't know. 607 00:32:31,320 --> 00:32:32,340 PROFESSOR: I don't know. 608 00:32:32,340 --> 00:32:34,170 AUDIENCE: Because it could be that 609 00:32:34,170 --> 00:32:37,880 unable to obtain means that they died before we can ask them. 610 00:32:37,880 --> 00:32:39,630 PROFESSOR: No, because there wouldn't 611 00:32:39,630 --> 00:32:44,040 be that many of those people, I think. 612 00:32:44,040 --> 00:32:46,260 There are not that many people who don't 613 00:32:46,260 --> 00:32:51,300 live past the intake interview. 614 00:32:51,300 --> 00:32:52,290 And they do ask them. 615 00:32:55,350 --> 00:32:58,420 AUDIENCE: [INAUDIBLE] 616 00:32:58,420 --> 00:33:00,560 PROFESSOR: Yeah, that would be an example. 617 00:33:00,560 --> 00:33:05,860 But I don't think you'd see enough such people to show up 618 00:33:05,860 --> 00:33:08,800 statistically. 619 00:33:08,800 --> 00:33:11,410 OK. 620 00:33:11,410 --> 00:33:14,720 Well, so I've already mentioned that there 621 00:33:14,720 --> 00:33:18,180 is this problem of having moved from CareVue 622 00:33:18,180 --> 00:33:21,590 to MetaVision just in the MIMIC database. 623 00:33:21,590 --> 00:33:23,890 But of course, this is a much bigger problem 624 00:33:23,890 --> 00:33:25,720 around the country and around the world, 625 00:33:25,720 --> 00:33:29,890 because every hospital has its own way of keeping records.
626 00:33:29,890 --> 00:33:34,180 And wouldn't it be nice if we had standards? 627 00:33:34,180 --> 00:33:36,310 And of course, there's this funny phrase, 628 00:33:36,310 --> 00:33:38,470 the wonderful thing about standards 629 00:33:38,470 --> 00:33:41,530 is that there's so many to choose from. 630 00:33:41,530 --> 00:33:44,770 So for example, if you look at prescriptions in the MIMIC 631 00:33:44,770 --> 00:33:49,280 database, here are two particular prescriptions 632 00:33:49,280 --> 00:33:53,800 for subject number 57139 admitted 633 00:33:53,800 --> 00:33:58,690 on admission ID 155470. 634 00:33:58,690 --> 00:34:03,310 And so they have the same start date but different end dates. 635 00:34:03,310 --> 00:34:08,199 One is a prescription for Tylenol, acetaminophen, 636 00:34:08,199 --> 00:34:13,469 and the other is for clobetasol propionate 0.05% cream. 637 00:34:13,469 --> 00:34:16,960 That's a skin lotion thing for-- 638 00:34:16,960 --> 00:34:20,530 I think it's a steroid skin cream. 639 00:34:20,530 --> 00:34:23,710 So if you look in the BI's database, 640 00:34:23,710 --> 00:34:26,170 they have their own private formulary 641 00:34:26,170 --> 00:34:32,679 code where this thing is acet325 and this thing 642 00:34:32,679 --> 00:34:38,380 is clob.05C30, right? 643 00:34:38,380 --> 00:34:40,300 And if you look, there's also 644 00:34:40,300 --> 00:34:45,520 something called a GSN, which is some commercial coding 645 00:34:45,520 --> 00:34:47,590 system for drugs. 646 00:34:47,590 --> 00:34:51,219 Maybe having to do with who their drug supplier is 647 00:34:51,219 --> 00:34:53,090 at the hospital. 648 00:34:53,090 --> 00:34:55,310 And these have different codes. 649 00:34:55,310 --> 00:35:01,090 There's the National Drug Code, which is an FDA assigned nine 650 00:35:01,090 --> 00:35:06,280 digit code that specifies who made the drug, what form it's 651 00:35:06,280 --> 00:35:09,670 in, and what's its strength.
652 00:35:09,670 --> 00:35:11,650 And so you get these. 653 00:35:11,650 --> 00:35:14,890 Then there's a human readable description 654 00:35:14,890 --> 00:35:20,450 that says Tylenol comes in 325 milligram tablets. 655 00:35:20,450 --> 00:35:24,640 And the clobetasol comes in 30 gram tubes. 656 00:35:24,640 --> 00:35:31,030 And the dose is supposed to be 325 to 650, i.e. one 657 00:35:31,030 --> 00:35:34,510 to two tablets measured in milligrams. 658 00:35:34,510 --> 00:35:40,270 The dose here is one application, whatever that is. 659 00:35:40,270 --> 00:35:42,640 I don't know what the 0.01 means. 660 00:35:42,640 --> 00:35:45,370 And this is a tablet and that's a tube. 661 00:35:45,370 --> 00:35:47,260 And this is taken orally. 662 00:35:47,260 --> 00:35:51,970 That's administered on the skin, right? 663 00:35:51,970 --> 00:35:55,230 So this is a local database. 664 00:35:55,230 --> 00:36:00,490 AUDIENCE: For a doctor, they just [INAUDIBLE] 665 00:36:00,490 --> 00:36:03,250 PROFESSOR: At most hospitals, that's true now. 666 00:36:03,250 --> 00:36:05,350 It wasn't true when the MIMIC database 667 00:36:05,350 --> 00:36:08,170 started being collected. 668 00:36:08,170 --> 00:36:12,430 And the BI was relatively late in moving 669 00:36:12,430 --> 00:36:15,490 to that compared to some of the other hospitals in the Boston 670 00:36:15,490 --> 00:36:16,510 area. 671 00:36:16,510 --> 00:36:19,720 Each hospital has its own desiderata 672 00:36:19,720 --> 00:36:23,100 for what it thinks is most important. 673 00:36:23,100 --> 00:36:27,910 And I think the BI just didn't prioritize it as much as some 674 00:36:27,910 --> 00:36:31,180 of the other hospitals. 675 00:36:31,180 --> 00:36:36,950 OK, so then I said, well, if you look at prescriptions, 676 00:36:36,950 --> 00:36:38,260 how often are they given? 677 00:36:38,260 --> 00:36:42,310 So remember, we have about 60,000 ICU stays.
678 00:36:42,310 --> 00:36:49,810 And so iso-osmotic dextrose was given 87,000 times 679 00:36:49,810 --> 00:36:51,200 to various people. 680 00:36:51,200 --> 00:36:54,090 Sodium chloride 0.9 percent flush. 681 00:36:54,090 --> 00:36:56,960 Do you know what that is? 682 00:36:56,960 --> 00:36:59,360 Have you ever had an IV? 683 00:36:59,360 --> 00:37:02,000 So periodically, the nurse comes by 684 00:37:02,000 --> 00:37:04,670 and squirts a little bit of stuff in the IV 685 00:37:04,670 --> 00:37:07,130 to make sure that it hasn't clogged up. 686 00:37:07,130 --> 00:37:09,920 That's what that is. 687 00:37:09,920 --> 00:37:12,878 Insulin, SW. 688 00:37:12,878 --> 00:37:13,420 I don't know. 689 00:37:13,420 --> 00:37:14,480 Salt water? 690 00:37:14,480 --> 00:37:16,730 I don't know what SW is. 691 00:37:16,730 --> 00:37:20,670 Magnesium sulfate, dextrose five in water. 692 00:37:20,670 --> 00:37:22,340 Furosemide is a diuretic. 693 00:37:22,340 --> 00:37:26,030 Potassium chloride replenishes potassium 694 00:37:26,030 --> 00:37:29,330 that people are often low on. 695 00:37:29,330 --> 00:37:36,590 And then you go, so why is there this D5W and that D5W? 696 00:37:36,590 --> 00:37:42,080 And that's probably some data in the system, OK? 697 00:37:42,080 --> 00:37:46,130 One of them has an NDC code associated with it 698 00:37:46,130 --> 00:37:49,680 and the other one doesn't but probably should. 699 00:37:49,680 --> 00:37:50,240 Yeah. 700 00:37:50,240 --> 00:37:52,490 AUDIENCE: I was actually going to ask, do the zeros mean 701 00:37:52,490 --> 00:37:54,155 that they're standard across hospitals 702 00:37:54,155 --> 00:37:57,410 or just that we don't have the data? 703 00:37:57,410 --> 00:38:02,300 PROFESSOR: The NDC code should be standard across the country, 704 00:38:02,300 --> 00:38:05,120 because those are FDA assigned codes. 705 00:38:05,120 --> 00:38:08,810 But not every hospital uses them, OK?
706 00:38:08,810 --> 00:38:10,880 And for the ones that say zero, I'm 707 00:38:10,880 --> 00:38:13,490 not sure why they're not associated 708 00:38:13,490 --> 00:38:18,220 with a code in this hospital's database. 709 00:38:18,220 --> 00:38:26,990 OK, next most common, you see normal saline, 710 00:38:26,990 --> 00:38:29,570 0.9 percent sodium chloride. 711 00:38:29,570 --> 00:38:32,570 So that was the same stuff as the flush solution 712 00:38:32,570 --> 00:38:35,450 but this time not being used for flush. 713 00:38:35,450 --> 00:38:38,270 Metoprolol is a beta blocker. 714 00:38:38,270 --> 00:38:43,760 Here's another insulin this time with an NDC code, et cetera. 715 00:38:43,760 --> 00:38:48,410 I love bag and vial, OK? 716 00:38:48,410 --> 00:38:51,680 So these are not exactly medications. 717 00:38:51,680 --> 00:38:57,760 A bag is literally like a baggy that they put something into, 718 00:38:57,760 --> 00:38:59,690 and a vial is literally something 719 00:38:59,690 --> 00:39:01,700 that they put pills in. 720 00:39:01,700 --> 00:39:03,590 And why is that in the database? 721 00:39:03,590 --> 00:39:06,830 Because they get to charge for it, OK? 722 00:39:06,830 --> 00:39:08,710 And I don't know what the charge is, 723 00:39:08,710 --> 00:39:10,280 but it wouldn't surprise me if you're 724 00:39:10,280 --> 00:39:15,870 paying $5 for a plastic bag to put something in. 725 00:39:15,870 --> 00:39:21,170 OK, so if we say, well, how many pharmacy orders are there 726 00:39:21,170 --> 00:39:26,390 per admission at this hospital, and the answer is a lot. 727 00:39:26,390 --> 00:39:28,190 So if you look at-- 728 00:39:28,190 --> 00:39:30,380 it's a very long tailed distribution, 729 00:39:30,380 --> 00:39:32,930 goes out to about 2,500. 
730 00:39:32,930 --> 00:39:38,980 But you see, if I blow up just the numbers up to about 200, 731 00:39:38,980 --> 00:39:41,690 there's a very large number of people 732 00:39:41,690 --> 00:39:50,540 with two prescriptions filled, and then a fairly declining 733 00:39:50,540 --> 00:39:52,610 number with more. 734 00:39:52,610 --> 00:39:54,240 And then it's a very long tail. 735 00:39:54,240 --> 00:39:58,820 So can you imagine 2,500 things prescribed for you 736 00:39:58,820 --> 00:40:03,110 during a hospital stay? 737 00:40:03,110 --> 00:40:05,840 Well, a little more about standards, 738 00:40:05,840 --> 00:40:10,490 so NDC is probably the best of the coding systems. 739 00:40:10,490 --> 00:40:13,380 And it's developed by the FDA. 740 00:40:13,380 --> 00:40:16,460 The picture up on the top right shows 741 00:40:16,460 --> 00:40:20,870 that the first four digits are the so-called labeler. 742 00:40:20,870 --> 00:40:23,660 That's usually the person who produced the drugs, 743 00:40:23,660 --> 00:40:26,870 or at least the person who distributes them. 744 00:40:26,870 --> 00:40:32,490 The second four digit number is the form of the drug, 745 00:40:32,490 --> 00:40:37,910 so whether it's capsules, or tablets, or liquid, or whatever 746 00:40:37,910 --> 00:40:39,330 and the dose. 747 00:40:39,330 --> 00:40:44,630 And then the last two digits are a package code which 748 00:40:44,630 --> 00:40:49,080 translates into the total number of doses that are in a package, 749 00:40:49,080 --> 00:40:49,580 right? 750 00:40:49,580 --> 00:40:51,440 So this is a godsend. 751 00:40:51,440 --> 00:40:54,170 And all of the robotic pharmacies and so on 752 00:40:54,170 --> 00:41:00,500 rely on using this kind of information nowadays. 
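The three-part NDC layout just described (labeler, product, package) can be sketched as a small parser. Two things here go beyond the lecture and should be read as assumptions: segment lengths vary in practice, and a widely used normalization left-pads each segment with zeros to a fixed 5-4-2, 11-digit form. The function names and the example codes are made up for illustration and do not refer to real products.

```python
def split_ndc(ndc):
    """Split a hyphenated NDC string into (labeler, product, package) strings."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected labeler-product-package, got {ndc!r}")
    return tuple(parts)


def to_11_digit(ndc):
    """Left-pad each segment with zeros to the 5-4-2 layout and drop hyphens."""
    labeler, product, package = split_ndc(ndc)
    if len(labeler) > 5 or len(product) > 4 or len(package) > 2:
        raise ValueError(f"segment too long for 5-4-2 layout: {ndc!r}")
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


# Made-up codes in three different segment layouts (not real drugs).
examples = ["1234-5678-90", "12345-678-90", "12345-6789-1"]
normalized = [to_11_digit(code) for code in examples]
```

Padding only works while the hyphens are still present. Once a code has been stored as a bare digit string, the segment boundaries are ambiguous, and mapping it back generally needs a lookup table.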
753 00:41:00,500 --> 00:41:05,180 Unfortunately, they ran out of four digit numbers, 754 00:41:05,180 --> 00:41:07,670 and so there's now a-- 755 00:41:07,670 --> 00:41:11,180 they added an extra digit, but they didn't do it 756 00:41:11,180 --> 00:41:13,280 systematically, and so sometimes they 757 00:41:13,280 --> 00:41:15,620 added an extra digit to the labeler 758 00:41:15,620 --> 00:41:18,180 and sometimes to the product code. 759 00:41:18,180 --> 00:41:20,720 And so there is a nightmare of translations 760 00:41:20,720 --> 00:41:23,600 between the old codes and the new codes. 761 00:41:23,600 --> 00:41:25,790 And you have to have a code dictionary in order 762 00:41:25,790 --> 00:41:28,700 to do it properly and so on. 763 00:41:28,700 --> 00:41:31,730 OK, well, if that weren't good enough, 764 00:41:31,730 --> 00:41:35,510 the International Council for the Harmonization 765 00:41:35,510 --> 00:41:39,860 of Technical requirements for Pharmaceuticals for Human Use 766 00:41:39,860 --> 00:41:42,800 developed another coding system called MedDRA, which 767 00:41:42,800 --> 00:41:45,930 is also used in various places. 768 00:41:45,930 --> 00:41:48,980 And this is an international standard, 769 00:41:48,980 --> 00:41:51,750 which is, of course, incompatible with the NDC. 770 00:41:56,830 --> 00:42:02,270 CPT is the Common Procedural Terminology, which we'll 771 00:42:02,270 --> 00:42:04,040 talk about in a little bit. 772 00:42:04,040 --> 00:42:06,860 And they have a subrange of their codes 773 00:42:06,860 --> 00:42:11,400 which also correspond to medication administration. 774 00:42:11,400 --> 00:42:15,980 And so this is yet another way of coding giving medicines. 775 00:42:15,980 --> 00:42:23,180 And then the HCPCS is yet another set 776 00:42:23,180 --> 00:42:26,360 of codes for specifying what medicines 777 00:42:26,360 --> 00:42:29,210 you've given to somebody. 
778 00:42:29,210 --> 00:42:34,820 And then I had mentioned this GSN number, which apparently 779 00:42:34,820 --> 00:42:36,890 the Beth Israel uses. 780 00:42:36,890 --> 00:42:40,130 This is a commercial coding system from a company 781 00:42:40,130 --> 00:42:43,400 called First Databank that is in the business 782 00:42:43,400 --> 00:42:46,310 of trying to produce standards. 783 00:42:46,310 --> 00:42:48,380 But in this case, they're producing 784 00:42:48,380 --> 00:42:52,280 ones that are pretty redundant with other existing standards. 785 00:42:52,280 --> 00:42:55,160 But nevertheless, for historical reasons, 786 00:42:55,160 --> 00:42:58,900 or for whatever reasons, people are using these. 787 00:42:58,900 --> 00:43:01,400 OK, enough of drugs. 788 00:43:01,400 --> 00:43:04,950 So what procedures were done to a patient? 789 00:43:04,950 --> 00:43:07,880 If you look in MIMIC, there are three tables. 790 00:43:07,880 --> 00:43:13,850 There's procedures ICD, which has ICD-9 codes for about 791 00:43:13,850 --> 00:43:16,550 a quarter million procedures. 792 00:43:16,550 --> 00:43:20,290 There's CPT events, which has about half a million, 793 00:43:20,290 --> 00:43:27,740 600,000 events that are coded in the CPT terminology. 794 00:43:27,740 --> 00:43:32,090 And then MetaVision, the newer of the two systems, 795 00:43:32,090 --> 00:43:35,420 has about a quarter million procedure events 796 00:43:35,420 --> 00:43:37,770 that are coded in that system. 797 00:43:37,770 --> 00:43:43,280 So some examples, here's the most common ICD-9 procedure 798 00:43:43,280 --> 00:43:44,300 codes. 799 00:43:44,300 --> 00:43:52,700 So ICD-9 code 3893 of which there are 14,000 instances 800 00:43:52,700 --> 00:43:57,410 is venous catheterization, not elsewhere classified. 801 00:43:57,410 --> 00:44:00,710 So what's venous catheterization? 802 00:44:00,710 --> 00:44:05,540 It's when somebody sticks an IV in your vein, OK? 803 00:44:05,540 --> 00:44:06,500 Very common.
804 00:44:06,500 --> 00:44:08,270 You show up at a hospital. 805 00:44:08,270 --> 00:44:13,800 Before they ask you your name, they stick an IV in your arm. 806 00:44:13,800 --> 00:44:16,910 That's a billable event, too. 807 00:44:16,910 --> 00:44:21,440 Then insertion of an endotracheal tube, 808 00:44:21,440 --> 00:44:23,780 you know, if you're having any problems like that, 809 00:44:23,780 --> 00:44:26,860 they stick something down your throat. 810 00:44:26,860 --> 00:44:30,750 Enteral infusion of concentrated nutritional substances, 811 00:44:30,750 --> 00:44:32,990 so if you're not able to eat, then they 812 00:44:32,990 --> 00:44:36,290 feed you through a stomach tube, OK? 813 00:44:36,290 --> 00:44:39,620 So that's what that is. 814 00:44:39,620 --> 00:44:42,800 Continuous invasive mechanical ventilation 815 00:44:42,800 --> 00:44:45,740 for less than 96 consecutive hours, 816 00:44:45,740 --> 00:44:49,910 so this is being put on a ventilator that's 817 00:44:49,910 --> 00:44:52,080 breathing for you, et cetera. 818 00:44:52,080 --> 00:44:56,070 So you see that there is a very long tail of these. 819 00:44:56,070 --> 00:44:58,490 So those are the ICD-9 codes. 820 00:44:58,490 --> 00:45:02,450 Now, CPT has its own procedure codes 821 00:45:02,450 --> 00:45:05,880 that go into a tremendous amount of detail. 822 00:45:05,880 --> 00:45:09,050 So for example, this is the medicine subsection, 823 00:45:09,050 --> 00:45:12,200 and it shows you the kinds of drugs 824 00:45:12,200 --> 00:45:15,410 that you're being administered that 825 00:45:15,410 --> 00:45:21,110 are involved in dialysis, or psychiatry, or vaccines, 826 00:45:21,110 --> 00:45:22,730 or whatever. 827 00:45:22,730 --> 00:45:27,530 And then here are the surgical and the radiological codes. 828 00:45:27,530 --> 00:45:29,810 And there's tons and tons of detail on these. 829 00:45:29,810 --> 00:45:30,470 Yeah. 830 00:45:30,470 --> 00:45:34,760 AUDIENCE: So how can they put these codes as 1,000 to 1,022?
831 00:45:34,760 --> 00:45:36,800 This is really annoying for anyone-- 832 00:45:36,800 --> 00:45:38,790 PROFESSOR: No, these are categories. 833 00:45:38,790 --> 00:45:45,650 So if you drill down, there's a fanout of that tree 834 00:45:45,650 --> 00:45:49,300 and you get down to individual codes. 835 00:45:49,300 --> 00:45:54,860 Just as a nasty surprise, CPT is owned by the American College 836 00:45:54,860 --> 00:45:59,600 of Physicians, and they could sue me 837 00:45:59,600 --> 00:46:03,980 if I showed you the actual codes because they're copyrighted. 838 00:46:06,950 --> 00:46:10,485 And you have to pay them if you use those codes. 839 00:46:10,485 --> 00:46:10,985 It's crazy. 840 00:46:13,730 --> 00:46:17,435 OK, so if you look at the number of all 841 00:46:17,435 --> 00:46:20,090 of these codes per admission, you 842 00:46:20,090 --> 00:46:22,760 see a distribution like this. 843 00:46:22,760 --> 00:46:24,860 Or if I separate them out, you see 844 00:46:24,860 --> 00:46:29,540 that there are more ICD-9 codes and fewer of the CPT 845 00:46:29,540 --> 00:46:33,050 and the codes that are in MetaVision. 846 00:46:33,050 --> 00:46:37,650 But they look somewhat similar in their distributions. 847 00:46:37,650 --> 00:46:39,930 OK, lab measurements. 848 00:46:39,930 --> 00:46:43,760 So you send off a sputum sample, blood, urine, 849 00:46:43,760 --> 00:46:46,460 piece of your brain, something. 850 00:46:46,460 --> 00:46:50,420 They stick it in some goo and measure something about it. 851 00:46:50,420 --> 00:46:52,330 So what is it that they're measuring? 852 00:46:52,330 --> 00:46:55,840 Well, it turns out that hematocrit 853 00:46:55,840 --> 00:46:57,460 is the most common measurement. 854 00:46:57,460 --> 00:47:02,480 So this is how much hemoglobin is in your blood, 855 00:47:02,480 --> 00:47:06,370 or what fraction in your blood, and is very important 856 00:47:06,370 --> 00:47:08,750 for sick people. 
857 00:47:08,750 --> 00:47:11,890 And the second most important is potassium, 858 00:47:11,890 --> 00:47:15,340 then sodium, creatinine, chloride, urea nitrogen, 859 00:47:15,340 --> 00:47:16,770 bicarbonate, et cetera. 860 00:47:16,770 --> 00:47:19,600 So this is a long, long list of different things 861 00:47:19,600 --> 00:47:23,990 that can be measured, and all the stuff is in the database. 862 00:47:23,990 --> 00:47:29,770 So for example, here's patient number two in the database. 863 00:47:29,770 --> 00:47:39,490 And on July 17 of 2138, this is part of the deidentification 864 00:47:39,490 --> 00:47:42,730 process to make it difficult to figure out 865 00:47:42,730 --> 00:47:45,070 who the patient actually is. 866 00:47:45,070 --> 00:47:52,450 This person got a test for their blood 867 00:47:52,450 --> 00:47:57,618 and they reported atypical lymphocytes. 868 00:47:57,618 --> 00:47:59,410 So there are a couple of interesting things 869 00:47:59,410 --> 00:48:01,940 to note here. 870 00:48:01,940 --> 00:48:07,130 One is that some things have a value and others don't. 871 00:48:07,130 --> 00:48:09,100 So this is a qualitative measure, 872 00:48:09,100 --> 00:48:11,800 so there's no value associated with it. 873 00:48:11,800 --> 00:48:15,700 Just the fact of the label tells you what the result of the test 874 00:48:15,700 --> 00:48:16,930 was. 875 00:48:16,930 --> 00:48:18,760 The other thing that's interesting 876 00:48:18,760 --> 00:48:21,160 is this last column, which is LOINC, 877 00:48:21,160 --> 00:48:25,390 and I'll say a word about that in a minute-- 878 00:48:25,390 --> 00:48:28,030 actually right now. 879 00:48:28,030 --> 00:48:32,140 So LOINC is the Logical Observation Identifiers Names 880 00:48:32,140 --> 00:48:33,730 and Codes. 881 00:48:33,730 --> 00:48:35,950 It was developed by our colleagues 882 00:48:35,950 --> 00:48:41,890 at Regenstrief Clinic in Indiana about 15 years ago, maybe 20 883 00:48:41,890 --> 00:48:44,110 years ago at this point.
884 00:48:44,110 --> 00:48:48,700 And the attempt was to say every different type of laboratory 885 00:48:48,700 --> 00:48:51,890 test ought to have a unique name, 886 00:48:51,890 --> 00:48:53,680 and they ought to be hierarchical 887 00:48:53,680 --> 00:48:57,220 so that if you have, for example, three different ways 888 00:48:57,220 --> 00:49:00,070 of measuring serum potassium, that they're 889 00:49:00,070 --> 00:49:02,500 related to each other but that they're 890 00:49:02,500 --> 00:49:04,930 distinct from each other, because there may 891 00:49:04,930 --> 00:49:07,930 be circumstances under which the errors that you 892 00:49:07,930 --> 00:49:13,310 get from one measurement versus another are different. 893 00:49:13,310 --> 00:49:16,450 And so this is the standard way. 894 00:49:16,450 --> 00:49:19,630 If you send off your blood sample to a lab, 895 00:49:19,630 --> 00:49:23,440 they send back a string like this to the hospital 896 00:49:23,440 --> 00:49:27,220 or to your doctor's office that says, 897 00:49:27,220 --> 00:49:30,250 it's coded in this OBX coding system, 898 00:49:30,250 --> 00:49:34,720 and here is the LOINC code, and this 899 00:49:34,720 --> 00:49:38,050 is the SNOMED interpretation. 900 00:49:38,050 --> 00:49:42,430 And so this string is the way that your hospital's EHR 901 00:49:42,430 --> 00:49:45,400 or your doctor's office system figures out what 902 00:49:45,400 --> 00:49:47,740 the result of the test was. 903 00:49:47,740 --> 00:49:52,300 HL7 is this 30-something year old organization 904 00:49:52,300 --> 00:49:55,660 that has been working on standardizing stuff like this. 905 00:49:55,660 --> 00:50:00,260 And LOINC is part of their standardization. 906 00:50:00,260 --> 00:50:02,830 So if you look at these, you say, well, again, 907 00:50:02,830 --> 00:50:05,500 how many tests per admission?
908 00:50:05,500 --> 00:50:09,400 Again, a huge, long tail up to about 15,000 909 00:50:09,400 --> 00:50:14,150 for a very small number of patients. 910 00:50:14,150 --> 00:50:18,100 If you look at lab tests per admission, 911 00:50:18,100 --> 00:50:21,850 you can do a log transform and get 912 00:50:21,850 --> 00:50:25,120 something that looks like a more reasonable distribution. 913 00:50:25,120 --> 00:50:27,940 By the way, that's a very generic lesson when we're 914 00:50:27,940 --> 00:50:31,000 going to do analyses of these data, is that, 915 00:50:31,000 --> 00:50:35,890 often, doing a transform of some sort, like in this case, a log, 916 00:50:35,890 --> 00:50:38,110 takes some funny looking distribution 917 00:50:38,110 --> 00:50:40,990 and turns it into something that looks plausibly 918 00:50:40,990 --> 00:50:44,490 normal, which is better for a lot of the techniques we use. 919 00:50:44,490 --> 00:50:45,560 Yeah. 920 00:50:45,560 --> 00:50:50,237 AUDIENCE: [INAUDIBLE] means the same thing? 921 00:50:50,237 --> 00:50:51,070 Like, for instance-- 922 00:50:51,070 --> 00:50:51,760 PROFESSOR: Yes. 923 00:50:51,760 --> 00:50:53,727 AUDIENCE: --hematocrit [INAUDIBLE] 924 00:50:53,727 --> 00:50:54,310 PROFESSOR: Yes 925 00:50:54,310 --> 00:50:55,018 AUDIENCE: --same? 926 00:50:55,018 --> 00:50:55,602 PROFESSOR: Yes 927 00:50:55,602 --> 00:50:56,590 AUDIENCE: Always same? 928 00:50:56,590 --> 00:51:00,520 PROFESSOR: Yes, that's the whole idea of creating the standard. 929 00:51:00,520 --> 00:51:02,980 And that has been pretty successful, pretty 930 00:51:02,980 --> 00:51:06,160 successfully adopted. 931 00:51:06,160 --> 00:51:08,130 OK, chart events. 932 00:51:08,130 --> 00:51:10,540 So these are the things that nurses typically 933 00:51:10,540 --> 00:51:13,220 enter at the bedside. 934 00:51:13,220 --> 00:51:18,370 And so there are 5.1, 5.2 million heart rates 935 00:51:18,370 --> 00:51:20,940 measured in the MIMIC database. 
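The log-transform lesson above can be sketched in a few lines. This is a minimal illustration, not the analysis shown on the slides: the counts in `labs_per_admission` are made-up values chosen to have a long right tail, and `log10_counts` is a hypothetical helper name.

```python
import math
import statistics

def log10_counts(counts):
    """Return log10 of each positive count; zero counts are dropped."""
    return [math.log10(c) for c in counts if c > 0]

# Hypothetical labs-per-admission counts with a long right tail
# (illustrative values only, not taken from MIMIC).
labs_per_admission = [12, 30, 45, 80, 150, 300, 900, 15000]

# On the raw scale, one extreme admission pulls the mean far from the median.
raw_mean = statistics.mean(labs_per_admission)
raw_median = statistics.median(labs_per_admission)

# On the log scale, mean and median sit close together, which is one
# rough sign that the distribution is closer to symmetric.
logged = log10_counts(labs_per_admission)
log_mean = statistics.mean(logged)
log_median = statistics.median(logged)

print(f"raw: mean={raw_mean:.1f}, median={raw_median:.1f}")
print(f"log10: mean={log_mean:.2f}, median={log_median:.2f}")
```

Comparing mean and median is only a quick check of skew; a histogram or a normal quantile plot of the logged values would be the more usual way to judge whether the result is plausibly normal.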
936 00:51:20,940 --> 00:51:26,080 And calprevslig is an artifact. 937 00:51:26,080 --> 00:51:28,430 It exists in every record. 938 00:51:28,430 --> 00:51:31,600 And it's some calibration something or other 939 00:51:31,600 --> 00:51:33,130 that doesn't mean anything. 940 00:51:33,130 --> 00:51:36,090 I've never been able to figure out exactly what it is. 941 00:51:36,090 --> 00:51:41,160 SPO2 is the partial pressure of oxygen in your blood. 942 00:51:41,160 --> 00:51:44,770 If you use a pulse oximeter, that's what that's measuring. 943 00:51:44,770 --> 00:51:49,050 Respiratory rate, heart rhythm, ectopy type, dot, dot, dot. 944 00:51:49,050 --> 00:51:51,855 Now, you might be troubled by the fact that here 945 00:51:51,855 --> 00:51:55,830 is heart rate again, right? 946 00:51:55,830 --> 00:52:01,200 But I've already shown you this, that heart rate in CareVue 947 00:52:01,200 --> 00:52:03,730 and heart rate in MetaVision were 948 00:52:03,730 --> 00:52:08,340 coded under different codes in the joint system 949 00:52:08,340 --> 00:52:11,940 that we created out of those two databases. 950 00:52:11,940 --> 00:52:16,470 And so you have to take care of figuring out 951 00:52:16,470 --> 00:52:20,160 what's what if you're trying to analyze this data. 952 00:52:20,160 --> 00:52:24,570 Not only do we have that problem of different age distributions 953 00:52:24,570 --> 00:52:27,040 across the two different data sets, 954 00:52:27,040 --> 00:52:29,820 but we also just have the mechanical problem 955 00:52:29,820 --> 00:52:33,540 that there will be things with the same label that may or may 956 00:52:33,540 --> 00:52:37,380 not represent the same measurement at different times 957 00:52:37,380 --> 00:52:39,300 in the system. 958 00:52:39,300 --> 00:52:46,870 OK, this is the number of chart entries per admission, again, 959 00:52:46,870 --> 00:52:47,960 on a log scale. 
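One way to handle the two heart rate codes is to keep an explicit mapping from item code to source system and pool only the codes you have checked. The sketch below assumes rows arrive as dictionaries with `itemid` and `valuenum` keys, and it assumes that 211 and 220045 are the CareVue and MetaVision heart rate codes; both the row shape and the IDs should be verified against the database's item dictionary before being relied on.

```python
# Assumed item IDs for "Heart Rate" in the two source systems.
# These are assumptions for illustration; confirm them in the item dictionary.
HEART_RATE_ITEMIDS = {211: "carevue", 220045: "metavision"}

def heart_rates(chart_rows):
    """Return (source, value) pairs for rows carrying a known heart rate code.

    Rows with an unrecognized itemid or a missing numeric value are skipped.
    """
    out = []
    for row in chart_rows:
        source = HEART_RATE_ITEMIDS.get(row["itemid"])
        if source is not None and row["valuenum"] is not None:
            out.append((source, row["valuenum"]))
    return out

# Hypothetical chart rows (not real patient data).
rows = [
    {"itemid": 211, "valuenum": 82.0},
    {"itemid": 220045, "valuenum": 95.0},
    {"itemid": 646, "valuenum": 98.0},      # some other measurement, ignored
    {"itemid": 220045, "valuenum": None},   # missing value, ignored
]
pooled = heart_rates(rows)
print(pooled)
```

Keeping the source label next to each value makes it possible to plot the two systems separately first, which is how the two-peak problem at the start of the lecture was found.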
960 00:52:47,960 --> 00:52:50,020 So you see that there are about 10 961 00:52:50,020 --> 00:52:57,600 to the 3.5 chart entries per admission, so thousands 962 00:52:57,600 --> 00:53:03,520 of admissions, of chart events per admission. 963 00:53:03,520 --> 00:53:06,160 We also track outputs. 964 00:53:06,160 --> 00:53:10,810 So Foley catheter allows your bladder 965 00:53:10,810 --> 00:53:13,300 to drain without your having consciously 966 00:53:13,300 --> 00:53:16,720 to go to the bathroom, so they collect that information. 967 00:53:16,720 --> 00:53:21,183 There are 1.9 million recordings of how much fluid came out 968 00:53:21,183 --> 00:53:21,850 of your bladder. 969 00:53:24,790 --> 00:53:28,210 Chest tubes will drain stuff out of your chest 970 00:53:28,210 --> 00:53:30,070 if you have congestion. 971 00:53:30,070 --> 00:53:35,620 Urine is if you pee regularly, stool out, et cetera. 972 00:53:35,620 --> 00:53:38,350 And again, I'm not sure I understand 973 00:53:38,350 --> 00:53:45,220 what the difference is between urine out Foley versus Foley. 974 00:53:45,220 --> 00:53:47,650 They may be the same thing but one 975 00:53:47,650 --> 00:53:50,740 from CareVue and one from MetaVision, 976 00:53:50,740 --> 00:53:53,990 so again, typical kinds of problems. 977 00:53:53,990 --> 00:53:59,560 If you look at the number of output events per admission, 978 00:53:59,560 --> 00:54:07,960 you're seeing on the order of 100, roughly. 979 00:54:07,960 --> 00:54:09,520 Well, if you're tracking outputs, 980 00:54:09,520 --> 00:54:13,480 you should also track inputs, and so they do. 981 00:54:13,480 --> 00:54:20,110 And so D5W is this dextrose in water, 0.9 percent 982 00:54:20,110 --> 00:54:21,750 normal saline. 983 00:54:21,750 --> 00:54:24,010 Propofol is an anesthetic. 984 00:54:24,010 --> 00:54:28,300 Insulin, heparin, blood thinner, et cetera. 985 00:54:28,300 --> 00:54:33,190 Fentanyl is, I think, an opioid, if I remember right. 
986 00:54:33,190 --> 00:54:37,090 So these are various things that are given to people. 987 00:54:37,090 --> 00:54:41,020 And they affect the volume of the person. 988 00:54:41,020 --> 00:54:44,170 So this is an attempt to keep the person in balance 989 00:54:44,170 --> 00:54:46,600 and keep track of that. 990 00:54:46,600 --> 00:54:50,920 MetaVision inputs are classified somewhat differently 991 00:54:50,920 --> 00:54:53,770 but they have similar kinds of data. 992 00:54:53,770 --> 00:54:56,190 And if you combine them, you get, again, 993 00:54:56,190 --> 00:54:59,560 a distribution on a log scale that 994 00:54:59,560 --> 00:55:02,050 shows that there are on the order of 10 995 00:55:02,050 --> 00:55:05,590 to the fifth input events, so quite 996 00:55:05,590 --> 00:55:11,780 a few input events, because this is recorded periodically. 997 00:55:11,780 --> 00:55:13,440 Now, the paper that I-- 998 00:55:13,440 --> 00:55:13,940 yeah. 999 00:55:13,940 --> 00:55:15,374 AUDIENCE: What's the input again? 1000 00:55:15,374 --> 00:55:17,832 Is that when you come to the hospital and get admitted or-- 1001 00:55:17,832 --> 00:55:18,950 PROFESSOR: No, no, no. 1002 00:55:18,950 --> 00:55:21,460 It's an input into you. 1003 00:55:21,460 --> 00:55:23,570 So it's like you drink a glass of water, 1004 00:55:23,570 --> 00:55:26,020 the nurse is supposed to record it. 1005 00:55:26,020 --> 00:55:29,570 Although, she doesn't always because she may not notice it. 1006 00:55:29,570 --> 00:55:32,890 But if they hang an IV bag and pour a liter of liquid 1007 00:55:32,890 --> 00:55:38,050 into you, they do record that, OK? 1008 00:55:38,050 --> 00:55:42,850 All right, so I had you read this interesting paper 1009 00:55:42,850 --> 00:55:48,790 and a discussion prior to that paper, 1010 00:55:48,790 --> 00:55:52,810 because one of the authors is a former student of mine. 1011 00:55:52,810 --> 00:55:56,200 And I know one of the other guys pretty well. 
1012 00:55:56,200 --> 00:56:00,100 And the former student, Zak Kohane, 1013 00:56:00,100 --> 00:56:05,050 came back some years ago from a conference in California 1014 00:56:05,050 --> 00:56:07,690 and was explaining to me that he ran into a venture 1015 00:56:07,690 --> 00:56:10,430 capitalist who discovered that there 1016 00:56:10,430 --> 00:56:16,840 is an interesting physiological variation in the abnormality 1017 00:56:16,840 --> 00:56:20,110 of lab tests that are done at night. 1018 00:56:20,110 --> 00:56:23,690 And he suspected that there was a diurnal variation, 1019 00:56:23,690 --> 00:56:27,250 that lab tests actually become more abnormal at night 1020 00:56:27,250 --> 00:56:29,800 than they do during the day. 1021 00:56:29,800 --> 00:56:33,760 And Zak, who is not only a computer science PhD but also 1022 00:56:33,760 --> 00:56:36,670 a practicing doctor, turns to him and says, 1023 00:56:36,670 --> 00:56:39,430 you're an idiot, right? 1024 00:56:39,430 --> 00:56:45,670 Who has their blood drawn at 3 o'clock in the morning? 1025 00:56:45,670 --> 00:56:50,650 It's typically not healthy people, right? 1026 00:56:50,650 --> 00:56:54,910 So this is another of these nice confounding stories 1027 00:56:54,910 --> 00:57:00,040 where, if you have a test done in the middle of the night, 1028 00:57:00,040 --> 00:57:03,290 it probably indicates that you're sicker. 1029 00:57:03,290 --> 00:57:06,340 So he and Griffin recruited their third author 1030 00:57:06,340 --> 00:57:09,010 and went off and did a very large scale 1031 00:57:09,010 --> 00:57:11,890 study of this question, which is what the paper that I 1032 00:57:11,890 --> 00:57:14,860 asked you to read reports on. 1033 00:57:14,860 --> 00:57:18,070 And so I said, well, I wonder if I 1034 00:57:18,070 --> 00:57:22,210 could reproduce that study in the MIMIC database.
1035 00:57:22,210 --> 00:57:24,970 And the answer, just in case you get your hopes up, 1036 00:57:24,970 --> 00:57:30,070 was no, in large part because we just don't 1037 00:57:30,070 --> 00:57:31,390 have the right kind of data. 1038 00:57:31,390 --> 00:57:33,910 So there are not that many white blood 1039 00:57:33,910 --> 00:57:38,510 counts that were measured in the MIMIC database, for example. 1040 00:57:38,510 --> 00:57:41,860 But if you look at the-- 1041 00:57:41,860 --> 00:57:43,660 this is MIMIC data. 1042 00:57:43,660 --> 00:57:46,270 And if you say, what's the fraction 1043 00:57:46,270 --> 00:57:49,480 of abnormal white blood count values by hour-- 1044 00:57:49,480 --> 00:57:53,290 so this is midnight to midnight. 1045 00:57:53,290 --> 00:57:58,000 And each hour, there's some fraction of these test results 1046 00:57:58,000 --> 00:57:59,380 that are abnormal. 1047 00:57:59,380 --> 00:58:02,050 And sure enough, what you see is that, at 5 o'clock 1048 00:58:02,050 --> 00:58:04,930 in the morning, a much higher fraction of them 1049 00:58:04,930 --> 00:58:08,926 is abnormal than at 3 o'clock in the afternoon, 1050 00:58:08,926 --> 00:58:16,210 OK, which is consistent with Zak's peremptory comment 1051 00:58:16,210 --> 00:58:17,620 about the guy being an idiot. 1052 00:58:20,590 --> 00:58:22,960 So once again, I said, well, can we 1053 00:58:22,960 --> 00:58:26,500 build a really simple model that predicts 1054 00:58:26,500 --> 00:58:29,570 who's going to die in the hospital in this case? 1055 00:58:29,570 --> 00:58:31,210 That's the easiest one to predict 1056 00:58:31,210 --> 00:58:33,590 because I have that data. 1057 00:58:33,590 --> 00:58:36,760 We could get three-year survival data, which 1058 00:58:36,760 --> 00:58:38,810 is what they were looking at. 
1059 00:58:38,810 --> 00:58:41,590 But it's harder and it runs into censoring problems 1060 00:58:41,590 --> 00:58:44,620 of what happens if the person was hospitalized 1061 00:58:44,620 --> 00:58:48,100 less than three years before the end of our data collection 1062 00:58:48,100 --> 00:58:49,660 period and so on. 1063 00:58:49,660 --> 00:58:51,520 And so I avoided that. 1064 00:58:51,520 --> 00:58:56,590 But what this is showing you is, for each of the hours, 1065 00:58:56,590 --> 00:59:07,850 zero to 24, what is the number of measurements? 1066 00:59:07,850 --> 00:59:11,390 And for each of those hours, what 1067 00:59:11,390 --> 00:59:13,580 is the fraction of those measurements that's 1068 00:59:13,580 --> 00:59:16,820 abnormal, OK? 1069 00:59:16,820 --> 00:59:18,800 So I said, well, let's just throw it 1070 00:59:18,800 --> 00:59:21,020 into a logistic regression model. 1071 00:59:21,020 --> 00:59:23,420 And what comes out is something really weird, 1072 00:59:23,420 --> 00:59:27,620 which is that a few particular hours are significant, 1073 00:59:27,620 --> 00:59:29,990 but most of them are not. 1074 00:59:29,990 --> 00:59:33,770 And that looks like noise to me, right? 1075 00:59:33,770 --> 00:59:40,160 Because you wouldn't expect that, at 8 o'clock 1076 00:59:40,160 --> 00:59:45,410 in the morning, the fact that you had something measured 1077 00:59:45,410 --> 00:59:46,580 matters. 1078 00:59:46,580 --> 00:59:49,970 But at 9 o'clock in the morning, it doesn't. 1079 00:59:49,970 --> 00:59:52,490 That doesn't seem sensible. 1080 00:59:52,490 --> 00:59:55,910 So I don't think there's enough signal here. 1081 00:59:55,910 --> 01:00:01,520 And in fact, when I looked at the number of white blood count 1082 01:00:01,520 --> 01:00:04,790 measurements at night and related to mortality-- 1083 01:00:04,790 --> 01:00:09,200 so false means people lived and true means they died. 
1084 01:00:09,200 --> 01:00:11,810 But you see that there's not a whole lot of difference 1085 01:00:11,810 --> 01:00:13,730 between the distributions. 1086 01:00:13,730 --> 01:00:16,310 But you also see that the number of white blood counts 1087 01:00:16,310 --> 01:00:18,920 is relatively small in this database. 1088 01:00:18,920 --> 01:00:24,360 And so I think we just don't have enough data to do it. 1089 01:00:24,360 --> 01:00:28,070 On the other hand, if you look at a panel of different drugs, 1090 01:00:28,070 --> 01:00:31,730 you look at mean values of blood urea nitrogen or calcium, 1091 01:00:31,730 --> 01:00:34,910 chloride, CO2, et cetera, you see that there 1092 01:00:34,910 --> 01:00:38,280 is variation across time. 1093 01:00:38,280 --> 01:00:41,900 So there is some sort of variance 1094 01:00:41,900 --> 01:00:45,830 that's either caused by the diurnal physiology 1095 01:00:45,830 --> 01:00:50,300 of the human body or by the routine practice of medicine, 1096 01:00:50,300 --> 01:00:55,190 about when people choose to take lab measurements. 1097 01:00:55,190 --> 01:01:01,670 And in fact, if you look at the fraction of high and low lab 1098 01:01:01,670 --> 01:01:04,580 values, they do vary by hour. 1099 01:01:04,580 --> 01:01:07,970 And in particular, if you look at white blood counts, 1100 01:01:07,970 --> 01:01:15,200 you see that the fraction of high values goes up at night 1101 01:01:15,200 --> 01:01:17,660 and the fraction of low values goes down 1102 01:01:17,660 --> 01:01:20,660 at night, right, which is consistent with what 1103 01:01:20,660 --> 01:01:23,410 they saw as well. 1104 01:01:23,410 --> 01:01:25,250 There is another way to measure it, 1105 01:01:25,250 --> 01:01:30,230 which is, instead of using normal ranges, 1106 01:01:30,230 --> 01:01:33,680 the lab actually gives you a call that says, 1107 01:01:33,680 --> 01:01:36,530 is this value normal, low, or high? 1108 01:01:36,530 --> 01:01:38,630 And we can use that.
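The hourly tabulation described here, the fraction of results in each hour that the lab called abnormal, might be sketched as follows. The input shape is an assumption: each result is a `(timestamp, flag)` pair where the flag is one of the strings `"normal"`, `"low"`, or `"high"`; the function name and the sample rows are hypothetical and do not reflect how any particular database stores its flags.

```python
from collections import defaultdict
from datetime import datetime

def abnormal_fraction_by_hour(results):
    """Map hour of day (0-23) to the fraction of results flagged low or high.

    `results` is an iterable of (datetime, flag) pairs; hours with no
    measurements are simply absent from the returned dict.
    """
    totals = defaultdict(int)
    abnormal = defaultdict(int)
    for charttime, flag in results:
        hour = charttime.hour
        totals[hour] += 1
        if flag != "normal":
            abnormal[hour] += 1
    return {hour: abnormal[hour] / totals[hour] for hour in sorted(totals)}

# Hypothetical white blood count calls (not real patient data).
sample = [
    (datetime(2138, 7, 17, 5, 10), "high"),
    (datetime(2138, 7, 17, 5, 40), "high"),
    (datetime(2138, 7, 17, 5, 55), "normal"),
    (datetime(2138, 7, 17, 15, 5), "normal"),
    (datetime(2138, 7, 17, 15, 30), "low"),
    (datetime(2138, 7, 17, 15, 45), "normal"),
    (datetime(2138, 7, 17, 15, 50), "normal"),
]
fractions = abnormal_fraction_by_hour(sample)
print(fractions)
```

Reporting the per-hour totals alongside the fractions matters in practice, because an hour with very few measurements can show an extreme fraction purely by chance.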
1109 01:01:38,630 --> 01:01:41,240 That's a little bit more subtle because it 1110 01:01:41,240 --> 01:01:44,300 depends on calibration of the equipment 1111 01:01:44,300 --> 01:01:47,220 and is updated as the calibration changes. 1112 01:01:47,220 --> 01:01:49,610 So that's probably a little bit more accurate. 1113 01:01:49,610 --> 01:01:53,330 But you see essentially the same phenomenon here. 1114 01:01:53,330 --> 01:01:59,780 But if you look at the distributions of when 1115 01:01:59,780 --> 01:02:02,450 measurements are done that turn out to be normal 1116 01:02:02,450 --> 01:02:05,310 versus when they turn out to be abnormal, 1117 01:02:05,310 --> 01:02:08,090 there is a lot of similarity between the normal 1118 01:02:08,090 --> 01:02:12,590 and the abnormal curves of when those measurements are taken. 1119 01:02:12,590 --> 01:02:15,770 So we're not seeing that. 1120 01:02:15,770 --> 01:02:18,950 OK, let me race through to the end. 1121 01:02:18,950 --> 01:02:23,240 This is my heartbeat from my watch. 1122 01:02:23,240 --> 01:02:25,520 You can actually download the stuff 1123 01:02:25,520 --> 01:02:28,760 and put it in your favorite analysis engine 1124 01:02:28,760 --> 01:02:30,540 and take a look. 1125 01:02:30,540 --> 01:02:33,650 So here I was running across the Harvard bridge. 1126 01:02:33,650 --> 01:02:37,670 And if you look at my heart rate variability over the 30 seconds 1127 01:02:37,670 --> 01:02:41,390 or so, you see that the interbeat interval 1128 01:02:41,390 --> 01:02:47,240 ranges from about 550 to about 600 1129 01:02:47,240 --> 01:02:50,900 and whatever 20 milliseconds. 1130 01:02:50,900 --> 01:02:53,970 And so you could calculate my heart rate variability, 1131 01:02:53,970 --> 01:02:57,500 which is thought to be an indicator of heart health 1132 01:02:57,500 --> 01:02:58,610 and so on. 
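The arithmetic behind those watch numbers is short enough to write out. The sketch below assumes interbeat intervals in milliseconds; the sample values are invented to fall in the 550 to 620 millisecond range mentioned above. SDNN and RMSSD are two commonly used heart rate variability summaries, chosen here for illustration; the lecture does not say which measure the watch or the speaker had in mind.

```python
import statistics

def mean_heart_rate_bpm(ibi_ms):
    """Mean heart rate in beats per minute from interbeat intervals in ms."""
    return 60000.0 / statistics.mean(ibi_ms)

def sdnn_ms(ibi_ms):
    """Sample standard deviation of the interbeat intervals (SDNN), in ms."""
    return statistics.stdev(ibi_ms)

def rmssd_ms(ibi_ms):
    """Root mean square of successive interval differences (RMSSD), in ms."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return (sum(d * d for d in diffs) / len(diffs)) ** 0.5

# Hypothetical interbeat intervals in milliseconds (illustrative only).
ibi = [550, 580, 600, 620, 590, 560]

bpm = mean_heart_rate_bpm(ibi)
print(f"mean heart rate: {bpm:.1f} bpm")
print(f"SDNN: {sdnn_ms(ibi):.1f} ms, RMSSD: {rmssd_ms(ibi):.1f} ms")
```

An interval of roughly 580 milliseconds corresponds to a little over 100 beats per minute, which matches the figure quoted for the run.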
1133 01:02:58,610 --> 01:03:00,650 You can calculate that I was running 1134 01:03:00,650 --> 01:03:02,930 at a pace of about 100-- 1135 01:03:02,930 --> 01:03:07,320 my heart was beating at a pace of about 100 beats per minute. 1136 01:03:07,320 --> 01:03:09,620 So you know there's all sorts of information 1137 01:03:09,620 --> 01:03:11,570 like that available. 1138 01:03:11,570 --> 01:03:15,530 Now, as I said, I'm not going to get into this today, 1139 01:03:15,530 --> 01:03:19,850 but this was a very successful recently 1140 01:03:19,850 --> 01:03:22,670 published paper where they're able to take 1141 01:03:22,670 --> 01:03:26,460 a look at images of the lung. 1142 01:03:26,460 --> 01:03:31,370 So this is a transverse scan of the lung. 1143 01:03:31,370 --> 01:03:34,100 And they have a deep learning machine 1144 01:03:34,100 --> 01:03:37,880 that is able to identify these two yellow marked things 1145 01:03:37,880 --> 01:03:42,020 as pulmonary emboli as opposed to these other things that 1146 01:03:42,020 --> 01:03:45,270 are just random flecks in the tissue. 1147 01:03:45,270 --> 01:03:47,820 And I can't do that by eyeball. 1148 01:03:47,820 --> 01:03:51,230 Maybe a good radiologist might be able to, 1149 01:03:51,230 --> 01:03:55,340 but this is claimed in the paper to outperform 1150 01:03:55,340 --> 01:03:58,070 decent radiologists already. 1151 01:03:58,070 --> 01:03:59,930 This was one of the articles that 1152 01:03:59,930 --> 01:04:03,710 led Geoff Hinton to make this rather stupid pronouncement 1153 01:04:03,710 --> 01:04:08,130 that said, tell your children not to become radiologists 1154 01:04:08,130 --> 01:04:12,390 because the profession will be over by the time they get fully 1155 01:04:12,390 --> 01:04:15,060 trained, which I don't believe. 1156 01:04:15,060 --> 01:04:18,120 They may do different things, but they won't go away. 
1157 01:04:20,670 --> 01:04:25,170 This was a slide from Ron Kikinis at the Brigham, 1158 01:04:25,170 --> 01:04:27,960 and they're using automated techniques 1159 01:04:27,960 --> 01:04:31,380 of analyzing white matter in order 1160 01:04:31,380 --> 01:04:33,810 to identify lupus lesions. 1161 01:04:33,810 --> 01:04:38,490 So lupus is a bad disease that shows up 1162 01:04:38,490 --> 01:04:44,880 in these magnetic resonance images in certain ways. 1163 01:04:44,880 --> 01:04:48,210 The last thing I want to talk about today is notes. 1164 01:04:48,210 --> 01:04:54,420 So my students did a little exercise last semester 1165 01:04:54,420 --> 01:05:01,320 where we tried to see how good is the average ape, namely 1166 01:05:01,320 --> 01:05:07,390 member of my research group, at predicting mortality? 1167 01:05:07,390 --> 01:05:10,330 And so we took a bunch of cases from the MIMIC data 1168 01:05:10,330 --> 01:05:14,400 set, blinded to the question of whether the person lived 1169 01:05:14,400 --> 01:05:15,390 or died. 1170 01:05:15,390 --> 01:05:18,780 We gave the data to people in a kind of visualization tool, 1171 01:05:18,780 --> 01:05:21,630 sort of like the one that I showed you earlier, 1172 01:05:21,630 --> 01:05:26,610 that summarizes the case, and then also gave people access 1173 01:05:26,610 --> 01:05:30,930 to the notes, the deidentified notes about those cases, 1174 01:05:30,930 --> 01:05:34,560 to see whether people could predict, better than a coin 1175 01:05:34,560 --> 01:05:39,150 flip, whether somebody was going to live or die. 1176 01:05:39,150 --> 01:05:42,870 And the answer is yes, slightly better, OK? 1177 01:05:42,870 --> 01:05:46,170 Not immensely better but slightly better. 
1178 01:05:46,170 --> 01:05:50,380 And furthermore, it looks like, by giving them feedback, 1179 01:05:50,380 --> 01:05:52,890 so as they're looking at these cases 1180 01:05:52,890 --> 01:05:54,765 and trying to make the prediction, 1181 01:05:54,765 --> 01:05:56,640 they make a prediction, you tell them if they 1182 01:05:56,640 --> 01:06:00,000 were right or wrong, we learn. 1183 01:06:00,000 --> 01:06:03,480 And so we get slightly better than slightly better 1184 01:06:03,480 --> 01:06:05,950 than random, right? 1185 01:06:05,950 --> 01:06:07,940 It's kind of interesting. 1186 01:06:07,940 --> 01:06:11,250 OK, so one of the things I discovered 1187 01:06:11,250 --> 01:06:13,440 is that, at least when I was playing 1188 01:06:13,440 --> 01:06:17,160 the monkey in this exercise, I found the notes 1189 01:06:17,160 --> 01:06:20,490 to be immensely useful, much more useful 1190 01:06:20,490 --> 01:06:24,660 than the trend lines of laboratory data. 1191 01:06:24,660 --> 01:06:27,390 Partly, it's because I'm used to reading English. 1192 01:06:27,390 --> 01:06:31,950 I'm not so used to reading graphs of laboratory data. 1193 01:06:31,950 --> 01:06:35,430 But part of it is that there is a level of human understanding 1194 01:06:35,430 --> 01:06:38,670 that is transmitted in the nursing notes 1195 01:06:38,670 --> 01:06:42,120 and in the discharge summaries and so on that you don't get 1196 01:06:42,120 --> 01:06:44,340 from just looking at raw data. 1197 01:06:44,340 --> 01:06:47,310 And so there is very much the sense, which 1198 01:06:47,310 --> 01:06:49,800 we're going to talk about in a couple of weeks, 1199 01:06:49,800 --> 01:06:54,930 of how can we take advantage of that information, 1200 01:06:54,930 --> 01:06:57,600 extract it, and use it in the kinds of modeling 1201 01:06:57,600 --> 01:06:59,140 that we want to do? 
1202 01:06:59,140 --> 01:07:01,710 So in MIMIC, if you look, we have 1203 01:07:01,710 --> 01:07:04,080 nursing notes, and radiology reports, 1204 01:07:04,080 --> 01:07:08,160 and more nursing notes, and electrocardiogram reports, 1205 01:07:08,160 --> 01:07:10,600 and doctor's notes, and discharge summaries, 1206 01:07:10,600 --> 01:07:16,170 and echocardiograms, respiratory, et cetera. 1207 01:07:16,170 --> 01:07:18,630 And if you look at the distribution 1208 01:07:18,630 --> 01:07:22,950 of the lengths of these, these are, unfortunately, 1209 01:07:22,950 --> 01:07:24,780 not on the same scale. 1210 01:07:24,780 --> 01:07:26,790 But the discharge summary is the thing 1211 01:07:26,790 --> 01:07:30,010 that's written at the time you leave the hospital. 1212 01:07:30,010 --> 01:07:32,190 So this is sort of the summary of everything 1213 01:07:32,190 --> 01:07:35,310 that happened to you during your hospitalization. 1214 01:07:35,310 --> 01:07:36,330 And it's long. 1215 01:07:36,330 --> 01:07:41,190 So, you know, it goes up to like 30,000 characters. 1216 01:07:41,190 --> 01:07:48,240 You know, it's a short story, not so short short story. 1217 01:07:48,240 --> 01:07:50,580 Nursing notes tend to be shorter. 1218 01:07:50,580 --> 01:07:53,490 They run up to about 3,000 characters. 1219 01:07:53,490 --> 01:07:55,500 This other set of nursing notes, which 1220 01:07:55,500 --> 01:07:59,040 I think comes from the other system, is a little bit longer. 1221 01:07:59,040 --> 01:08:01,560 It goes up to about 5,000. 1222 01:08:01,560 --> 01:08:03,930 Doctor's notes are a little bit longer yet. 1223 01:08:03,930 --> 01:08:09,030 They go up to about 10,000, 15,000 characters, typically. 1224 01:08:09,030 --> 01:08:11,760 And there are various other kinds of notes. 1225 01:08:11,760 --> 01:08:14,470 So I just wanted to show you a few of these. 1226 01:08:14,470 --> 01:08:16,080 Here's a brief nursing note. 
1227 01:08:16,080 --> 01:08:20,939 So this is a patient who is hypotensive but not in shock. 1228 01:08:20,939 --> 01:08:23,399 Patient remains on this drug drip 1229 01:08:23,399 --> 01:08:27,930 at 0.75 micrograms per kilogram per minute, 1230 01:08:27,930 --> 01:08:30,479 no titration needed at this time. 1231 01:08:30,479 --> 01:08:32,810 Their blood pressure is stable at more than 100. 1232 01:08:32,810 --> 01:08:38,130 Their mean arterial pressure is 65, greater than 65. 1233 01:08:38,130 --> 01:08:42,660 Wean them from this drug presumably if it's tolerated. 1234 01:08:42,660 --> 01:08:47,220 A wound infection, so anterior groin area 1235 01:08:47,220 --> 01:08:52,560 open and oozing moderate amounts of thin, pink-tinged serous 1236 01:08:52,560 --> 01:08:53,479 fluid. 1237 01:08:53,479 --> 01:08:57,660 Patient's stooling with small amounts of stool on something 1238 01:08:57,660 --> 01:09:01,170 and dangerously close to the open wound, et cetera. 1239 01:09:01,170 --> 01:09:04,200 So this is sort of the nurse's snapshot. 1240 01:09:04,200 --> 01:09:07,010 She just went in, saw the patient-- 1241 01:09:07,010 --> 01:09:10,399 by the way, I say she, but probably 1242 01:09:10,399 --> 01:09:14,540 a vast majority of nurses in Boston area hospitals 1243 01:09:14,540 --> 01:09:19,189 really are women, but there are some male nurses-- 1244 01:09:19,189 --> 01:09:22,250 and will record sort of a snapshot of what's 1245 01:09:22,250 --> 01:09:24,149 going on with the patient. 1246 01:09:24,149 --> 01:09:25,729 What are the concerns? 1247 01:09:25,729 --> 01:09:28,970 In principle, this is going to be useful not only 1248 01:09:28,970 --> 01:09:32,210 as a part of the medical record, but also when 1249 01:09:32,210 --> 01:09:36,470 this nurse goes off shift and the next nurse comes on shift. 1250 01:09:36,470 --> 01:09:39,380 Then this is a recording of what the state of the patient 1251 01:09:39,380 --> 01:09:43,130 was the last time they were seen by the nurse. 
1252 01:09:43,130 --> 01:09:46,939 In reality, the nurses tend to tell each other verbally 1253 01:09:46,939 --> 01:09:50,750 rather than relying on the written version. 1254 01:09:50,750 --> 01:09:54,350 I remember one time talking to a nurse in an intensive care 1255 01:09:54,350 --> 01:09:57,080 unit in another part of the country, and I said, 1256 01:09:57,080 --> 01:10:00,350 so who ever reads your notes? And she 1257 01:10:00,350 --> 01:10:08,060 says, quality assurance officers, so the hospital has 1258 01:10:08,060 --> 01:10:10,310 people responsible for trying to assess 1259 01:10:10,310 --> 01:10:12,740 the quality of care they're giving, 1260 01:10:12,740 --> 01:10:15,890 and lawyers when there's a lawsuit. 1261 01:10:15,890 --> 01:10:19,190 And she was very happy because she had saved the hospital 1262 01:10:19,190 --> 01:10:22,910 10 million dollars by having carefully recorded 1263 01:10:22,910 --> 01:10:26,900 that some procedure had been done to a patient who then 1264 01:10:26,900 --> 01:10:31,430 had a bad outcome and was suing the hospital for their neglect 1265 01:10:31,430 --> 01:10:33,050 in not having done this. 1266 01:10:33,050 --> 01:10:35,390 But because it was in the note, that 1267 01:10:35,390 --> 01:10:37,670 was proof that it actually had been done, 1268 01:10:37,670 --> 01:10:40,415 and therefore the hospital wasn't liable. 1269 01:10:43,310 --> 01:10:45,950 But there is a lot of information in here. 1270 01:10:45,950 --> 01:10:50,090 Now, I'm going to show you many pages of a typical discharge 1271 01:10:50,090 --> 01:10:51,020 summary. 1272 01:10:51,020 --> 01:10:54,800 So this is somebody on the surgery service 1273 01:10:54,800 --> 01:10:59,750 who came in complaining of leg pain, redness, 1274 01:10:59,750 --> 01:11:02,630 and swelling secondary to infection 1275 01:11:02,630 --> 01:11:07,610 of the left femoral popliteal bypass.
1276 01:11:07,610 --> 01:11:09,980 So she had surgery-- 1277 01:11:09,980 --> 01:11:10,700 I think she. 1278 01:11:10,700 --> 01:11:11,810 Yeah, female. 1279 01:11:11,810 --> 01:11:15,320 She had surgery which didn't heal well, 1280 01:11:15,320 --> 01:11:19,430 so major surgical or invasive procedure, incision 1281 01:11:19,430 --> 01:11:22,670 and drainage and pulse irrigation of the left groin, 1282 01:11:22,670 --> 01:11:29,150 and left above-knee popliteal site incisions with exploration 1283 01:11:29,150 --> 01:11:34,370 of bypass graft, and excision of the entire left common femoral 1284 01:11:34,370 --> 01:11:37,460 artery to above-knee blah, blah, blah, blah blah, blah. 1285 01:11:37,460 --> 01:11:40,010 So this is what they did. 1286 01:11:40,010 --> 01:11:41,910 History of the present illness-- 1287 01:11:41,910 --> 01:11:44,280 she's a 45-year-old woman who underwent 1288 01:11:44,280 --> 01:11:47,130 the left femoral, a.k.a. 1289 01:11:47,130 --> 01:11:51,410 doctor something or other with PTFE, whatever that 1290 01:11:51,410 --> 01:11:55,130 is, over a month ago on a certain date. 1291 01:11:55,130 --> 01:11:57,800 By the way, these bracketed asterisked things 1292 01:11:57,800 --> 01:12:01,340 are where we've taken out identifying information 1293 01:12:01,340 --> 01:12:04,310 from the record. 1294 01:12:04,310 --> 01:12:06,320 She had been doing well post-operatively 1295 01:12:06,320 --> 01:12:09,780 and was seen in the clinic six days prior to presentation. 1296 01:12:09,780 --> 01:12:12,320 At this time, she acutely developed nausea, vomiting, 1297 01:12:12,320 --> 01:12:14,970 fevers, and progressive redness, swelling, 1298 01:12:14,970 --> 01:12:17,940 pain of her left thigh, et cetera, OK? 1299 01:12:17,940 --> 01:12:20,820 So that's just page one of many pages. 1300 01:12:20,820 --> 01:12:21,320 Yeah. 1301 01:12:21,320 --> 01:12:22,403 AUDIENCE: Just a question. 
1302 01:12:22,403 --> 01:12:27,950 Is this completely [INAUDIBLE] information [INAUDIBLE] 1303 01:12:27,950 --> 01:12:29,300 patient's name or date? 1304 01:12:29,300 --> 01:12:31,310 PROFESSOR: Not in this system. 1305 01:12:31,310 --> 01:12:32,870 There are people-- 1306 01:12:32,870 --> 01:12:35,150 Henry Chueh at Mass General spent 1307 01:12:35,150 --> 01:12:39,890 10 years building a system that had autocomplete and so on. 1308 01:12:39,890 --> 01:12:43,820 And some doctors liked it and some doctors hated it. 1309 01:12:43,820 --> 01:12:48,050 And the MGH threw out all of their old systems 1310 01:12:48,050 --> 01:12:51,090 in order to buy Epic, and so it's gone. 1311 01:12:51,090 --> 01:12:54,390 It was like 10 years of work down the drain. 1312 01:12:54,390 --> 01:12:57,740 But it was not a spectacular success. 1313 01:12:57,740 --> 01:13:00,500 Because whenever you have autocomplete, 1314 01:13:00,500 --> 01:13:03,740 you have to anticipate every possible answer. 1315 01:13:03,740 --> 01:13:05,840 And people are very creative, and they always 1316 01:13:05,840 --> 01:13:09,050 want to type something that you didn't anticipate. 1317 01:13:09,050 --> 01:13:11,392 So it's hard to support it. 1318 01:13:11,392 --> 01:13:12,350 AUDIENCE: What is Epic? 1319 01:13:12,350 --> 01:13:13,250 That's like the new-- 1320 01:13:13,250 --> 01:13:15,290 PROFESSOR: Epic is a big company that 1321 01:13:15,290 --> 01:13:19,190 has been winning all the recent contests for installing 1322 01:13:19,190 --> 01:13:21,620 electronic medical record systems. 1323 01:13:21,620 --> 01:13:23,930 Remember in my last lecture, I showed that we're 1324 01:13:23,930 --> 01:13:26,780 reaching about 100% saturation? 1325 01:13:26,780 --> 01:13:34,340 So they've been winning a lot of the installation deals. 1326 01:13:34,340 --> 01:13:38,360 And they're getting a lot of the subsidy.
1327 01:13:38,360 --> 01:13:41,780 The estimate I heard was that Partners Healthcare, which 1328 01:13:41,780 --> 01:13:45,350 is MGH and the Brigham and a couple of other hospitals, 1329 01:13:45,350 --> 01:13:48,680 spent somewhere on the order of two billion dollars 1330 01:13:48,680 --> 01:13:50,940 installing the system. 1331 01:13:50,940 --> 01:13:53,270 So that included all the customizations 1332 01:13:53,270 --> 01:13:56,330 and all the training and all the administrative stuff 1333 01:13:56,330 --> 01:13:57,560 that went with it. 1334 01:13:57,560 --> 01:14:00,040 But that's a huge amount of money. 1335 01:14:00,040 --> 01:14:01,880 AUDIENCE: I agree. 1336 01:14:01,880 --> 01:14:06,600 PROFESSOR: OK, so we have past medical history-- 1337 01:14:06,600 --> 01:14:10,320 pack a day smoker, abused cocaine 1338 01:14:10,320 --> 01:14:12,660 but says she stopped six months ago, 1339 01:14:12,660 --> 01:14:16,000 has asthma, type 2 diabetes. 1340 01:14:16,000 --> 01:14:17,955 Social history, family history. 1341 01:14:22,890 --> 01:14:24,675 These are the physical exam results. 1342 01:14:27,400 --> 01:14:31,740 So it's giving you a lot of information about the person. 1343 01:14:31,740 --> 01:14:35,250 Description of the wound down at the bottom. 1344 01:14:35,250 --> 01:14:37,030 Pertinent lab results. 1345 01:14:37,030 --> 01:14:39,420 So these are copied out of the laboratory tables. 1346 01:14:39,420 --> 01:14:39,920 Yeah. 1347 01:14:39,920 --> 01:14:41,900 AUDIENCE: Just to double check with the drug results-- 1348 01:14:41,900 --> 01:14:42,510 PROFESSOR: Sorry? 1349 01:14:42,510 --> 01:14:43,870 AUDIENCE: Just to double check with the drug results 1350 01:14:43,870 --> 01:14:44,886 two slides back-- 1351 01:14:44,886 --> 01:14:45,511 PROFESSOR: Yeah. 1352 01:14:45,511 --> 01:14:52,050 AUDIENCE: It said-- so it has the fake dates of 2190 1353 01:14:52,050 --> 01:14:52,845 up there. 1354 01:14:52,845 --> 01:14:53,470 PROFESSOR: Yep.
1355 01:14:53,470 --> 01:14:56,345 AUDIENCE: So the fact that there was a positive test in 2187 1356 01:14:56,345 --> 01:14:57,773 would mean a year ago. 1357 01:14:57,773 --> 01:14:58,440 PROFESSOR: Yeah. 1358 01:14:58,440 --> 01:15:00,440 AUDIENCE: So that's the medication. 1359 01:15:00,440 --> 01:15:03,910 PROFESSOR: Yeah, the deidentification technology here 1360 01:15:03,910 --> 01:15:07,540 maintains the relative dates but not the absolute dates. 1361 01:15:14,510 --> 01:15:17,690 So these are results, again, copied out 1362 01:15:17,690 --> 01:15:22,490 of the laboratory database into the discharge summary. 1363 01:15:22,490 --> 01:15:26,820 Brief hospital course, and then a review of systems, 1364 01:15:26,820 --> 01:15:29,690 so what's going on neurologically, cardiovascular, 1365 01:15:29,690 --> 01:15:34,130 pulmonary, GI, GU, et cetera. 1366 01:15:34,130 --> 01:15:38,930 Infectious disease, endocrine, hematology, prophylaxis. 1367 01:15:38,930 --> 01:15:41,090 And at the time of discharge, the patient 1368 01:15:41,090 --> 01:15:45,170 was doing well, no fever and stable vital signs, 1369 01:15:45,170 --> 01:15:47,630 tolerating a regular diet, ambulating, voiding 1370 01:15:47,630 --> 01:15:51,980 without assistance, and pain was well controlled. 1371 01:15:51,980 --> 01:15:54,260 Medications on admission, so this 1372 01:15:54,260 --> 01:15:57,280 was the medication reconciliation. 1373 01:15:57,280 --> 01:16:01,780 Discharge medication, so this is what she's being sent home on. 1374 01:16:01,780 --> 01:16:05,840 Discharge disposition is to the home with some 1375 01:16:05,840 --> 01:16:11,630 follow up service, and she's going home. 1376 01:16:11,630 --> 01:16:15,410 And the discharge diagnosis is infected 1377 01:16:15,410 --> 01:16:20,180 left femoral popliteal bypass graft and the condition.
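The point made above about deidentification, that relative dates survive while absolute dates do not, can be illustrated with a small date-shifting sketch. This is a minimal illustration under stated assumptions, not the actual procedure used for MIMIC: the function names `make_offset` and `shift_dates`, the 100-to-200-year offset range, and the sample events are all hypothetical.

```python
import datetime as dt
import random


def make_offset(rng, min_years=100, max_years=200):
    """Pick one random shift, in whole days, to be reused for a single patient.

    The year range is an assumption chosen only to push dates far into the future.
    """
    return dt.timedelta(days=rng.randint(min_years * 365, max_years * 365))


def shift_dates(events, offset):
    """Apply the same offset to every (label, date) pair of one patient."""
    return [(label, when + offset) for label, when in events]


# Hypothetical events for one patient (not taken from any real record).
events = [
    ("positive test", dt.date(2015, 3, 1)),
    ("admission", dt.date(2018, 3, 1)),
]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
offset = make_offset(rng)       # one offset per patient
shifted = shift_dates(events, offset)

original_gap = events[1][1] - events[0][1]
shifted_gap = shifted[1][1] - shifted[0][1]

print("original gap:", original_gap.days, "days")
print("shifted gap: ", shifted_gap.days, "days")
```

Because every date belonging to one patient moves by the same offset, the interval between any two events is unchanged, which is why a reader can still reason about how long before admission a test occurred.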
1378 01:16:20,180 --> 01:16:24,242 And these are the instructions to the patient that say, you 1379 01:16:24,242 --> 01:16:26,450 know, here's what you can do, here is when you should 1380 01:16:26,450 --> 01:16:28,950 come back and tell us if something is going wrong, 1381 01:16:28,950 --> 01:16:29,460 et cetera. 1382 01:16:32,330 --> 01:16:37,340 And here's what you should report if it happens. 1383 01:16:37,340 --> 01:16:40,040 You know, if you have sudden severe bleeding or swelling, 1384 01:16:40,040 --> 01:16:41,120 do this. 1385 01:16:41,120 --> 01:16:44,060 Follow up with doctor somebody or other. 1386 01:16:44,060 --> 01:16:49,140 Call his clinic at this number to schedule an appointment 1387 01:16:49,140 --> 01:16:53,840 and then follow up with doctor somebody else in two weeks. 1388 01:16:59,670 --> 01:17:02,040 I think this is the same one. 1389 01:17:02,040 --> 01:17:06,260 So just a couple of final words about standards. 1390 01:17:06,260 --> 01:17:09,650 So you saw in David's introductory lecture 1391 01:17:09,650 --> 01:17:13,790 a reference to OHDSI, which is a standard method of encoding 1392 01:17:13,790 --> 01:17:17,640 the kind of data that we're talking about today. 1393 01:17:17,640 --> 01:17:21,020 There is a likelihood that the next release of the MIMIC 1394 01:17:21,020 --> 01:17:26,620 database will adopt the OHDSI formats rather than the-- 1395 01:17:26,620 --> 01:17:27,740 yeah. 1396 01:17:27,740 --> 01:17:31,250 David's shaking his head, wondering why. 1397 01:17:31,250 --> 01:17:31,750 Me, too. 1398 01:17:34,430 --> 01:17:37,460 AUDIENCE: OHDSI hasn't handled clinical notes very well yet. 1399 01:17:37,460 --> 01:17:40,430 PROFESSOR: Well, so, you know, what always happens 1400 01:17:40,430 --> 01:17:44,090 is you say, I'm going to adopt the standard, asterisk, 1401 01:17:44,090 --> 01:17:46,460 with the following extensions. 1402 01:17:46,460 --> 01:17:48,470 And that's probably what's going to happen.
1403 01:17:48,470 --> 01:17:51,980 But it means that the central tables, you 1404 01:17:51,980 --> 01:17:55,820 know, the ICD-9 code tables and the drug tables, some things 1405 01:17:55,820 --> 01:17:59,270 like that, are likely to wind up adopting the formats 1406 01:17:59,270 --> 01:18:03,260 of the OHDSI database. 1407 01:18:03,260 --> 01:18:07,850 You should also know about this thing called FHIR, F-H-I-R, 1408 01:18:07,850 --> 01:18:11,720 the Fast Healthcare Interoperability Resources. 1409 01:18:11,720 --> 01:18:15,950 So HL7 is the standards organization 1410 01:18:15,950 --> 01:18:21,110 that had a tremendous success in the early 1990s 1411 01:18:21,110 --> 01:18:24,740 in solving the problem of how to allow laboratories to report 1412 01:18:24,740 --> 01:18:27,950 lab data back to the hospitals or the clinics that 1413 01:18:27,950 --> 01:18:29,990 ordered the labs. 1414 01:18:29,990 --> 01:18:33,860 And that character string with the up arrows 1415 01:18:33,860 --> 01:18:36,470 and the vertical bars and so on that I showed you 1416 01:18:36,470 --> 01:18:41,210 before that had LOINC encoded in it is that standard. 1417 01:18:41,210 --> 01:18:44,090 That's called HL7 Version 2. 1418 01:18:44,090 --> 01:18:48,440 It's still in use very widely. They then got ambitious 1419 01:18:48,440 --> 01:18:51,590 and suffered second system syndrome, which 1420 01:18:51,590 --> 01:18:55,440 is they decided to build HL7 Version 3, 1421 01:18:55,440 --> 01:18:59,390 which I used to teach in a class here 10 years ago. 1422 01:18:59,390 --> 01:19:03,740 But one of my friends who works for a company that 1423 01:19:03,740 --> 01:19:10,640 helps hospitals implement that sent me a 38 megabyte PDF file 1424 01:19:10,640 --> 01:19:13,370 that describes what you need to know in order 1425 01:19:13,370 --> 01:19:15,480 to implement that system. 1426 01:19:15,480 --> 01:19:18,440 And as a result, nobody was doing it.
1427 01:19:18,440 --> 01:19:21,800 So FHIR is a gross simplification 1428 01:19:21,800 --> 01:19:24,500 of that that starts off and says, 1429 01:19:24,500 --> 01:19:28,460 if a doctor refers a new patient to you, 1430 01:19:28,460 --> 01:19:31,970 what is the minimum set of data that you need to know in order 1431 01:19:31,970 --> 01:19:33,950 to take care of that person? 1432 01:19:33,950 --> 01:19:39,620 And FHIR tries to provide just that subset of all of the data. 1433 01:19:39,620 --> 01:19:43,730 It has become a standard mainly because, after Congress spent 1434 01:19:43,730 --> 01:19:47,060 $42 billion or so bribing people 1435 01:19:47,060 --> 01:19:49,520 into buying these information systems, 1436 01:19:49,520 --> 01:19:52,700 they got mad that the information systems they bought 1437 01:19:52,700 --> 01:19:54,590 couldn't talk to each other. 1438 01:19:54,590 --> 01:19:57,350 And so they called in, on the carpet, the heads 1439 01:19:57,350 --> 01:20:01,280 of these IT companies, health IT companies, 1440 01:20:01,280 --> 01:20:02,900 and they yelled at them and they made 1441 01:20:02,900 --> 01:20:06,260 them promise that there would be interoperability. 1442 01:20:06,260 --> 01:20:07,220 They promised. 1443 01:20:07,220 --> 01:20:09,800 And out of that came FHIR. 1444 01:20:09,800 --> 01:20:14,210 It was probably simultaneously developed but they adopted it. 1445 01:20:14,210 --> 01:20:18,260 And so now, in principle, it's possible to exchange 1446 01:20:18,260 --> 01:20:21,020 data between different hospitals, at least 1447 01:20:21,020 --> 01:20:26,180 to the level of that degree of harmonization of the data.
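To make the idea of exchanging a minimal patient record concrete, here is a small sketch that reads a FHIR-style Patient resource in XML and re-expresses it as JSON. It is an illustration under assumptions rather than a conforming implementation: the patient shown is invented, only four fields are handled, and `patient_xml_to_dict` is a hypothetical helper written for this example, not part of any FHIR library. It relies only on two facts about the FHIR XML format: elements live in the `http://hl7.org/fhir` namespace, and primitive values are carried in a `value` attribute.

```python
import json
import xml.etree.ElementTree as ET

FHIR_NS = {"f": "http://hl7.org/fhir"}

# Invented example record; the fields shown are a tiny subset of a real Patient.
PATIENT_XML = """
<Patient xmlns="http://hl7.org/fhir">
  <id value="example-1"/>
  <name>
    <family value="Doe"/>
    <given value="Jane"/>
    <given value="Q"/>
  </name>
  <gender value="female"/>
  <birthDate value="2145-03-07"/>
</Patient>
"""


def patient_xml_to_dict(xml_text):
    """Convert the handful of Patient fields above into a JSON-ready dict."""
    root = ET.fromstring(xml_text.strip())

    def value_of(parent, tag):
        # FHIR XML stores primitive values in the "value" attribute.
        node = parent.find(f"f:{tag}", FHIR_NS)
        return None if node is None else node.get("value")

    names = []
    for name in root.findall("f:name", FHIR_NS):
        names.append({
            "family": value_of(name, "family"),
            # "given" may repeat, so it becomes a JSON array.
            "given": [g.get("value") for g in name.findall("f:given", FHIR_NS)],
        })

    return {
        "resourceType": "Patient",
        "id": value_of(root, "id"),
        "name": names,
        "gender": value_of(root, "gender"),
        "birthDate": value_of(root, "birthDate"),
    }


patient = patient_xml_to_dict(PATIENT_XML)
print(json.dumps(patient, indent=2))
```

A general converter would need the FHIR schema to know which elements can repeat; the sketch hard-codes that knowledge for `name` and `given` only.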
1448 01:20:26,180 --> 01:20:29,070 In reality, the companies don't want 1449 01:20:29,070 --> 01:20:33,200 you to do that because they like there to be friction 1450 01:20:33,200 --> 01:20:35,390 in not being able to take all your data 1451 01:20:35,390 --> 01:20:39,170 to a different hospital, because it is more likely to leave you 1452 01:20:39,170 --> 01:20:41,450 at the one that you're at. 1453 01:20:41,450 --> 01:20:44,960 So there are complicated socioeconomic kinds 1454 01:20:44,960 --> 01:20:46,700 of issues in all this. 1455 01:20:46,700 --> 01:20:50,000 But at least the standard exists and is becoming more and more 1456 01:20:50,000 --> 01:20:53,810 widely deployed as long as Congress pays attention. 1457 01:20:53,810 --> 01:20:54,860 It's ugly. 1458 01:20:54,860 --> 01:20:58,100 So here is what a patient looks like, right? 1459 01:20:58,100 --> 01:21:01,790 It's the usual unreadable XML garbage. 1460 01:21:01,790 --> 01:21:03,980 But fortunately, there are parsers 1461 01:21:03,980 --> 01:21:08,340 that can turn it into JSON and simpler representations. 1462 01:21:08,340 --> 01:21:11,370 And so that's pretty common. 1463 01:21:11,370 --> 01:21:17,660 So the terminologies that exist are LOINC, NDC, ICD-9 and 10. 1464 01:21:17,660 --> 01:21:19,520 SNOMED I didn't talk about today. 1465 01:21:19,520 --> 01:21:23,210 DSM-5 is the Diagnostic and Statistical 1466 01:21:23,210 --> 01:21:25,850 Manual for Psychiatrists. 1467 01:21:25,850 --> 01:21:28,220 That's used as a common coding method 1468 01:21:28,220 --> 01:21:31,730 for describing psychiatric disease. 1469 01:21:31,730 --> 01:21:33,800 And there are many more of these. 1470 01:21:33,800 --> 01:21:36,740 There's something called the Unified Medical Language 1471 01:21:36,740 --> 01:21:41,060 System Metathesaurus from the National Library of Medicine 1472 01:21:41,060 --> 01:21:47,660 that integrates about 180 of these different terminologies.
1473 01:21:47,660 --> 01:21:52,400 And so there is a nice one-stop shop 1474 01:21:52,400 --> 01:21:55,550 where you can get all these things from them. 1475 01:21:59,670 --> 01:22:02,860 So takeaway lessons: know your data. 1476 01:22:02,860 --> 01:22:06,060 Remember that first example of the heart rates. That 1477 01:22:06,060 --> 01:22:08,010 comes up over and over again. 1478 01:22:08,010 --> 01:22:12,150 And doing machine learning and analysis on data 1479 01:22:12,150 --> 01:22:15,270 that you don't understand is likely to lead you 1480 01:22:15,270 --> 01:22:18,480 to false conclusions. 1481 01:22:18,480 --> 01:22:21,150 Harmonization is difficult and time consuming. 1482 01:22:21,150 --> 01:22:23,070 And there are lots of things for which we just 1483 01:22:23,070 --> 01:22:25,920 don't have standards, and so everybody develops 1484 01:22:25,920 --> 01:22:27,930 their own representations. 1485 01:22:27,930 --> 01:22:33,270 I had a PhD student about a decade ago who, in his thesis, 1486 01:22:33,270 --> 01:22:37,620 wrote that he spent about half his time cleaning data. 1487 01:22:37,620 --> 01:22:40,800 And I gave that thesis to another student who 1488 01:22:40,800 --> 01:22:43,270 started a few years later who read it, 1489 01:22:43,270 --> 01:22:47,280 and he comes to me just awestruck and he says, what? 1490 01:22:47,280 --> 01:22:51,900 He only spent half his time cleaning? 1491 01:22:51,900 --> 01:22:55,830 Unfortunately, that's roughly where we are in this field. 1492 01:22:55,830 --> 01:22:58,980 So sorry to be a downer, but that's 1493 01:22:58,980 --> 01:23:01,050 the current state of the art. 1494 01:23:01,050 --> 01:23:04,890 And next time, David will start by looking at actually building 1495 01:23:04,890 --> 01:23:08,370 some models with these kinds of data 1496 01:23:08,370 --> 01:23:11,580 and showing you what we can accomplish. 1497 01:23:11,580 --> 01:23:13,370 Thank you.