Machine Learning Meets Genome Editing

Microsoft Research senior researcher Jennifer Listgarten offers insight into how machine learning is helping the genome editing tool CRISPR change the world.

Jennifer Listgarten doesn’t like to be told what to do. As a teenager growing up in Toronto, she was kicked out of several high school classes because she refused to do anything that did not stimulate her intellectually. “My parents thought it was pretty remarkable when one day I ended up with a great job at Microsoft,” says Listgarten, who joined Microsoft Research in 2007 after earning a Ph.D. in computer science from the University of Toronto, where she specialized in machine learning.

Founded in 1991 by the software giant to pursue technological innovations, Microsoft Research has long been one of the few places outside academia where scientists can pursue pure research. As a senior researcher at Microsoft Research New England in Cambridge, Massachusetts, Listgarten has focused her work on the intersection of molecular biology and computer science, applying machine learning techniques to the field of computational biology. One of her principal interests is the genome editing tool CRISPR (pronounced “crisper”), which she became enthralled by three years ago during a talk at the Broad Institute of MIT and Harvard, a nonprofit research organization that has brought together more than 3,000 scientists who are using genomics to try to improve human health. CRISPR, which stands for “Clustered Regularly Interspaced Short Palindromic Repeats,” occurs naturally in the immune system of bacteria, where it helps defend against viruses. To protect the bacteria, CRISPR-associated proteins (a set of enzymes called Cas) can precisely snip the DNA of viruses, preventing them from replicating. Although CRISPR was identified in the early ’90s, it wasn’t until two decades later that scientists successfully demonstrated that it could be adapted for editing the genome, an organism’s complete set of DNA.

This spring, Listgarten was one of three outside experts invited to speak at the Massachusetts Institute of Technology’s first annual Statistics and Data Science Center Day, hosted by the university’s Statistics and Data Science Center, which is part of the MIT Institute for Data, Systems and Society (IDSS). During her presentation Listgarten explained how genetics has become a data science, citing the example of her own research. She recently spoke with WorldQuant Global Head of Content Michael Peltz about CRISPR, machine learning and the combined impact that they could have on genetics, health and medicine.

Have you always been interested in science?

Jennifer Listgarten: Yes, my interest goes way, way back. My father was a doctor, and from the outset I was very interested in the basic sciences, like biology, chemistry, physics, although less so biology because in my mind it required a lot of memorization. In college I took specialist courses in math, physics and chemistry, but I have not taken any biology since high school. Of course, I certainly have learned a lot about it during the course of my work.

How would you define computational biology, and how does it differ from bioinformatics?

I would say that a computational biologist is someone who is deeply interested in biology problems and wants to bring to bear his or her knowledge in computer science to solve those problems. That could be someone like me coming from a machine learning perspective, but it could also be people who come from a more algorithmic or theoretical background and who want to bring to bear those tools.

But then what is bioinformatics? Some might say it’s the same thing as computational biology. Others might say bioinformatics is more data processing, writing Python scripts to get through the data and these kinds of things. And some people insist that the definitions of these two terms are the opposite of what I just suggested. I guess it’s in the eye of the beholder, ultimately. We do what we do, call it what you will.

How did you get interested in applying machine learning to CRISPR technology?

I moved here in June 2014 from Los Angeles. Before then I had started to hear whispers of this thing called CRISPR that was changing the world. At the time, it was still relatively obscure in terms of the popular press.

When I arrived at Microsoft Research in Cambridge, I knew that I wanted to make use of the local community. I also knew that if I was going to be a successful researcher here, I couldn’t simply rely on collaborating with people in my lab; I was going to have to go out and forge new connections in the amazing academic community that is Cambridge, Massachusetts. I got on as many talk lists as I could, and I saw this one about CRISPR. So I waltzed over to the Broad Institute and heard John Doench, who now is a very close collaborator and also a friend, give this wonderful talk. I had thought I was just going to listen and then go home, but as John was speaking it dawned on me that he had this problem that was super important and absolutely perfectly suited to my skill set in machine learning. At the end of the talk, I went up to him and said: “I moved here three weeks ago. I knew nothing about CRISPR until I attended your talk, but I think you need to work with someone just like me.” I gave him my card, and within two weeks he was in my office with one of his colleagues, along with my Microsoft colleague Nicolo Fusi, who joined in from Los Angeles. There was no looking back after that. We now have multiple projects on the go.

What was the original problem?

The issue was what’s called the on-target problem. John and his team had already come up with a solution for it, and we then worked on it with more advanced methods that yielded better results. CRISPR is a tool for gene editing, but the term is sometimes stretched a little: In the conventional sense of the word “editing” you would change a T to an A, but this is perhaps not the main way that CRISPR is currently used. Another way is to “edit” a gene in the sense that you want to disrupt it so as to prevent it from functioning, a so-called gene knockout. John and his team at the Broad Institute do a lot of these gene knockout experiments. His talk was about “I want to deploy CRISPR to knock out a gene. How can I do that in a way that is most effective?”

What does that mean? Say you have one gene and it covers several thousand nucleotides. If you want to deploy CRISPR to knock out that gene, you need to use some genetic scissors to which you attach what’s called a guide RNA. The guide RNA is going to take (i.e., guide) these scissors to the right part of the genome, in this case the gene you’re targeting, where they cut the double-stranded DNA. Then the cell’s innate repair mechanism will kick in and try to fix it. But with some reasonable probability, it will repair it badly and the gene will not function.

The issue is that you can actually deploy these scissors to many, many positions along the genome, even within one gene. And it turns out that some of the resulting DNA cuts are very effective at disabling the gene and some of them are not so effective. So the question is, you’re a biologist, and for whatever reason you want to shut down this gene: Which of these hundreds of places that you could target is most likely to have the knockout effect you want? That’s something that machine learning is particularly well suited to try to answer.

John and his team systematically measured how effective CRISPR was at every possible place for seven genes and then later another eight. Given these data “examples,” we were then set to do machine learning. The examples can tell us what worked and what didn’t, and the machine learning tries to capture the underlying regularities that determine this. Here the input to the machine learning model is an RNA guide sequence of around 30 nucleotides. The output of the model is how much you knocked the gene out (or how much you predict you would if you did the experiment). So the “training” data are these pairs of inputs and outputs. This is what we would call supervised learning. It’s supervised because when you’re training the model you give it supervision in the sense that you’ve measured the actual output you would like to be able to predict.
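
The supervised setup described here can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the model Listgarten and Doench built: the guides are random 30-letter strings, the "measured" knockout score comes from a made-up rule (the fraction of G or C bases), the learner is a plain linear regression, and every function name is hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def active_features(seq):
    """Sparse one-hot encoding: index of the single 'hot' feature at each position."""
    return [4 * pos + NUCLEOTIDES.index(base) for pos, base in enumerate(seq)]

def toy_efficiency(seq):
    """Made-up stand-in for a measured knockout score (NOT real biology)."""
    return sum(base in "GC" for base in seq) / len(seq)

def predict(weights, bias, features):
    """Linear model: bias plus the weight of each active one-hot feature."""
    return bias + sum(weights[i] for i in features)

def fit(examples, n_features, epochs=500, lr=0.02):
    """Linear regression trained by stochastic gradient descent on squared error."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, target in examples:
            error = predict(weights, bias, features) - target
            bias -= lr * error
            for i in features:
                weights[i] -= lr * error
    return weights, bias

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def random_guide(length=30):
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

# Training pairs of (input guide sequence, measured output); both are synthetic here.
training = [(active_features(g), toy_efficiency(g))
            for g in (random_guide() for _ in range(200))]
weights, bias = fit(training, n_features=4 * 30)

new_guide = random_guide()  # a guide the model has never seen
predicted = predict(weights, bias, active_features(new_guide))
print(new_guide, round(predicted, 3), round(toy_efficiency(new_guide), 3))
```

The "supervision" is the second element of each training pair: the model is shown the output it is supposed to learn to predict.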

How does that compare with unsupervised machine learning?

Unsupervised learning is a much more difficult beast that is very hard to handle. A classic unsupervised approach is just a simple clustering. You take data for some genes and cluster them, for example, or take data for some patients and cluster them according to whatever data you have. You choose some metric that defines the similarity between the data points, whatever they are, and based on that metric you aggregate genes or people into [clusters] in some way: Those who look most similar by that metric land in one cluster, and those who are not similar to those people but similar to each other end up in a different cluster. This is unsupervised learning because all you’re doing is giving the model an input, but no notion of an output — there’s no notion of a supervised label. You’re simply looking for patterns in what is basically an ad hoc manner.
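
A simple clustering of the kind described can be sketched as follows. The sketch assumes k-means with Euclidean distance as the similarity metric, which is one common choice rather than anything the interview specifies, and the six two-feature "patients" are invented purely for illustration.

```python
import math

def kmeans(points, initial_centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of the points assigned to it."""
    centers = list(initial_centers)
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centers

# Invented data: each tuple is one "patient" described by two measurements.
patients = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centers = kmeans(patients, initial_centers=[patients[0], patients[3]])
print(labels)
```

Note that the algorithm only ever sees inputs. Nothing in the data says which grouping is "right," which is the sense in which the learning is unsupervised.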

With a predictive model you can always have some holdout benchmark dataset. It could even be a dataset that doesn’t exist yet, and you say, “Well, if you really want to prove you’ve done something important, I’m going to generate a new dataset you couldn’t possibly have looked at, and then I’m going to see how well your model does.” But in an unsupervised model, there is no obvious and unbiased notion of how to evaluate it. So people make up all kinds of ways because the question is essentially ill posed.
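
The holdout idea can be made concrete with a small Python sketch. Everything in it is an assumption chosen for illustration: the data are synthetic (a noisy straight line), the model is a one-variable least-squares fit, and the score is mean squared error compared against a baseline that always predicts the training average.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def mse(predictions, actuals):
    """Mean squared error between predictions and measured values."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
inputs = [rng.uniform(0, 1) for _ in range(100)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in inputs]  # synthetic pairs

# The model only ever sees the training pairs; the holdout pairs are kept aside.
train, holdout = data[:80], data[80:]
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

holdout_actual = [y for _, y in holdout]
holdout_mse = mse([slope * x + intercept for x, _ in holdout], holdout_actual)

# Baseline: ignore the input and always predict the average training output.
train_mean = sum(y for _, y in train) / len(train)
baseline_mse = mse([train_mean] * len(holdout), holdout_actual)
print(round(holdout_mse, 4), round(baseline_mse, 4))
```

Because the holdout pairs played no part in fitting, the comparison gives an unbiased read on how the model would do on data it has not seen.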

That’s not to say unsupervised models aren’t important, but they provide more exploratory ways of looking at the data. Often they’re used interactively: You get some clustering of genes or patients. Maybe then you show it to your clinician friend, who synthesizes this with a bunch of side information in her head and remarks, “Oh, neat, this cluster of patients were actually in this drug trial.” And “Oh, these genes all belong to the same pathway.” But again, it’s hard to evaluate these unsupervised results in an unbiased manner.

How else can you use machine learning with CRISPR?

The first problem we tackled was the on-target problem I mentioned earlier: figuring out which place in the gene one should deploy the scissors in order to knock it out. That work yielded an on-target predictive model that’s now used worldwide. There’s also the flip side of that problem, an off-target effect: Imagine I have decided to cut a gene in a particular place — I need to know if I am also going to accidentally cut the genome in other places and thereby have a potentially bad side effect.

Our current tool gives a rank ordered list of, say, the top three places one should try to cut a given gene in order to make an effective knockout [on target]. Now we’re about to deploy a complementary [off target] tool that is going to tell you, for each of those three places, how badly you might disrupt the rest of the genome if you use it. Then one can sort of redo the ranking of the guides so that it trades off between how efficiently it will knock out the desired gene versus how much it might disrupt somewhere else in the genome. This off-target problem can, again, be cast as a supervised machine learning problem, which is exactly how we’ve done it. However, it is a much harder machine learning problem for several reasons, including the fact that one has to scan the entire genome for accidents rather than focusing only on the area of interest, and also because we now have to model accidental uses of CRISPR in which the guide doesn’t perfectly match the genome but might be active there anyhow.
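
The re-ranking step can be pictured with a short Python sketch. The scoring rule here (on-target score minus a penalty times off-target score) is a hypothetical stand-in, since the interview does not say how the two predictions are combined, and the guide names and numbers are invented.

```python
from dataclasses import dataclass

@dataclass
class Guide:
    name: str
    on_target: float   # predicted knockout efficiency; higher is better
    off_target: float  # predicted disruption elsewhere in the genome; lower is better

def rerank(guides, penalty=1.0):
    """Sort guides by a hypothetical combined score that trades efficiency
    against off-target risk. penalty=0 recovers the pure on-target ranking."""
    return sorted(guides,
                  key=lambda g: g.on_target - penalty * g.off_target,
                  reverse=True)

# Invented candidates for a single gene.
candidates = [
    Guide("guide_A", on_target=0.90, off_target=0.60),
    Guide("guide_B", on_target=0.80, off_target=0.10),
    Guide("guide_C", on_target=0.70, off_target=0.05),
]

on_target_only = [g.name for g in rerank(candidates, penalty=0.0)]
with_trade_off = [g.name for g in rerank(candidates, penalty=1.0)]
print(on_target_only, with_trade_off)
```

In this toy case the guide that looks best on efficiency alone drops to last place once its off-target risk is counted, which is the kind of reordering the complementary tool is meant to enable.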

At the IDSS conference you spoke about how CRISPR could change the world.

I believe it’s already changing the world. Right now it’s changing the world in the sense that it’s a tool people use day to day all over the world in their basic biology labs to advance their science. CRISPR allows them to do things more quickly, more cheaply and at a larger scale than previous technologies, and with much less labor than they used to require.

So the impact on basic biology is already there. The next question is, what about translational and medical uses? That area is not as well developed because it is much harder and comes with more risks. Still, rapid progress is being made on these fronts. In fact, just recently we saw big advances in state-of-the-art embryonic [germline] editing, wherein a version of a gene implicated in heart disease was shown to be successfully edited out, and without the problems of mosaicism [wherein the edit makes it only to some cells] that previous papers had shown. I’m hesitant to predict when this kind of thing will be in actual use; that will also depend on societal comfort and laws that allow it. Earlier this year the National Academies of Sciences, Engineering and Medicine supported the case of editing embryos for medically relevant problems but cautioned against anything considered enhancement. Of course, this could end up being quite a blurry line. Also, that was a scientific panel and did not take into account government persuasions, which are harder to reason about but need to be accounted for all the same.

The machine learning projects that I talked about, predicting the on- and off-target effects, are some of the reasons it’s not yet prime time for prevention and therapeutics. Once you go to a living being, you have to be really careful what you do, because you’re not just playing with some cell line in a lab. If you accidentally disrupt the wrong part of the genome in a living system, you could make things much worse than if you had just left the patient alone. There is also an in-between kind of application wherein you might extract blood cells from a living person, use CRISPR on them to, say, modify their immunology, then screen for successful edits and put only those back into the person. We will see this kind of clinical use much sooner than others.

But there are proof-of-principle experiments for the harder problem out there in model organisms. HIV requires a very particular receptor to get into a cell and infect a person. There are some people who naturally have a genetic mutation such that they don’t express that receptor, and they’re naturally immune to HIV. Researchers think if they could change a person’s receptors to achieve a similar state, then everybody would be immune to HIV. Scientists have done this kind of thing in mice with some success but aren’t yet there in humans. There are many fascinating experiments like this where you can see that we’re getting close to important preventative and therapeutic uses.

What do you think will be the ultimate impact of machine learning on genetics, health and medicine?

People often confuse me with someone who studies health. Often computer scientists think, “Oh, she does something with biology, so she’s a health expert.” To which I say, “No, I am very focused on molecular biology, which has implications for health but is not health.” Still, I think health is a place where we’re already seeing that machine learning is being used and where we will see a major impact sooner than other areas. That’s because in health a lot of the problems are framed as prediction problems. For example, radiology is image-based and has many available expert labels. So this is a place where we can collect quite a bit of structured data that can readily be used with machine learning. Then there are things like wearables, which have many sensors, and of course there are many sensors yielding data in an emergency room, though not always with an associated clean label. I believe the time is coming when we’ll start to see machine learning models deployed in these kinds of settings in a real way, affecting society day in and day out, much as we are already impacted by AI in our smartphones.

Going back to molecular biology, things are a little bit harder there because basic science is more open-ended research, where even the questions we ask evolve as we learn more. It is not a clear-cut goal such as predicting when someone will relapse. Nor is it about predicting who will respond well to a drug. Basic scientific questions are often more amorphous than well-defined clinical tasks.

But machine learning is undoubtedly playing a role in even more basic biology. Preceding my CRISPR work we used machine learning to understand how epigenetics (the study of the process by which genetic information is translated into the substance and behavior of an organism) might be implicated in diseases such as rheumatoid arthritis and cancer. We used machine learning to tease apart structured noise in the data from the signal of interest, so that one can make use of the data in a way that doesn’t lead to the wrong scientific conclusions. Epigenetic marks across the genome are measured for a large set of people with, for example, rheumatoid arthritis. Then we ask which epigenetic marks are implicated in rheumatoid arthritis. The complication is that when you take a sample from someone, you get a heterogeneous mix of cells, and each cell has its own epigenetic marks. It turns out this wreaks havoc with the analysis.

This is a place where machine learning can really excel: It can, in effect, de-convolve [clean up] complex data to extract the meaning from the mess. In this example we can effectively de-convolve the cell types without any knowledge of them or what they look like. In other words, there are things in your data that are mixing everything up and making it incomprehensible. As a machine learning person, you can go in and build models that essentially pull it apart in the right way so that you can actually make sense of the data. This is one of the very powerful ways in which machine learning is currently advancing basic molecular biology.

