Interview: From MS Anti-Spam to Vaccines
INTERVIEW Microsoft Research announced last week that anti-spam technology has been used to create more effective vaccine designs for HIV. Such machine learning techniques could have a broad impact on biological research in the coming years, and BetaNews sat down with Microsoft's Nebojsa Jojic to find out more about his team's work.
BetaNews: To start, tell us a little bit about the project and its goals.
Nebojsa Jojic: At a high level, what we are trying to do is use our machine learning techniques to sift through data that we have from our biologist collaborators and discover patterns in HIV strains. That has consequences for vaccine design. It turns out that several of the algorithms we've built for other applications, including computer vision (the analysis of photographs) and the analysis of textual messages like e-mails, apply directly to these sorts of problems.
Long term, we would like to have a prototypical model for much faster turnaround in sequencing a new pathogen in new patients and getting to a good vaccine. The problem with HIV is that it's so variable, and for variable organisms it will be important to do these things computationally, instead of trying to spot the patterns with the naked eye.
Moreover, we are discovering a lot of interesting things about biology and how evolution actually happens. So more generally, studying ways to model evolution of organisms is interesting to us as well.
BetaNews: Who is involved in the project and how did you personally get involved?
Nebojsa Jojic: There are quite a few people involved in this. Biology projects are quite complex because biology is complex. Our collaborators come from labs in Seattle, in Perth, Australia, and also in Boston.
I've been interested in analysis of natural signals for a while; I looked at things like images and audio recordings and a lot of similar patterns appear in the data. For the type of applications I was looking at, I was interested in whether I could design algorithms to make sense of patterns like that. I was interested in the artificial intelligence aspect - whether we can build computer programs that can, for example, see.
To do this, you have to turn to these machine learning techniques. There has been a lot of machine learning research in recent years that seems quite promising. Having learned about all these techniques and having had a lot of experience with other types of natural signals, naturally I thought there must be some way I could contribute to biological signals as well.
It's an interesting parallel. When I was analyzing images, I would take the images in a collection, break them up randomly into much smaller patches, and try to put these patches back together into a coherent picture. That picture was much smaller than the pictures I started with. The idea was that the result, which I called an "epitome," epitomized the variability in the other pictures. Because it's smaller and yet it has all the patches of the original data, self-similar elements in the data are mapped to the same spot in the epitome.
It turns out that the way the infected cells of a patient fight against a virus is by chopping up the virus into little pieces, called "epitopes," and presenting them on the surface of the cell. The cell then signals to the rest of the immune system about the infection and the immune system then reacts.
In order to design a vaccine, you have to do something similar to what I was describing in the case of images. You have to chop up the virus into appropriate pieces and then piece them together into something that is roughly compact. Compactness is important for vaccine delivery methods to work, so there was a direct parallel to our vision research.
BN: How does this relate to anti-spam technology?
Jojic: The anti-spam technology also depends on these machine learning techniques that I was referring to. They are, by definition, meant to be generally useful techniques for any type of data, and they are meant to discover patterns in the data automatically. So some of these techniques have been used directly to design spam filters. And some of the same software was actually used in this research as well for predicting the epitopes.
BN: Was custom software developed in order to discover these patterns?
Jojic: We have adapted our previous algorithms to work on the vaccine problem. There are two major pieces in what we've been developing. One piece is meant to optimize for the vaccine coverage and create shorter vaccines that have as much diversity of the virus covered as possible.
We also have a piece of software that looks into the virus strains and discovers what likely patterns the immune type of the patient could have recognized, which led to the particular strain that lives there. That leads us to deduce the important pieces of the virus that are needed in the vaccine. After that, we can feed them into the first piece of software that optimizes the vaccine coverage.
BN: How long did it take to build the software, or apply your techniques to biological applications?
Jojic: In this case, because we were completely new to the area and also because the biologists we were talking to were new to machine learning, it took us about a year of discussion before we could really collaborate.
In terms of the software development, that was actually fairly quick. It seems like our collaboration is speeding up now, and we are getting much more efficient in terms of transferring the data and sending our analysis back and forth.
BN: How much is Microsoft involved in the actual vaccine development?
Jojic: There are several important steps in developing a vaccine. One thing is to decide what should be in the vaccine -- what sort of pattern that mimics the viral diversity should you choose -- and there are other issues related to how you actually deliver it into the cell. Our part is actually deciding on the content, or optimizing for the content, of the vaccine.
What we have is a set of techniques that is going to provide the vaccine designer with the optimal pattern to put inside for any given length of the vaccine. Ahead of time we are doing the analysis that can give the effectiveness of the patterns depending on what the requirements are. In other words, we can optimize for content, but the delivery side will have to be optimized by biologists and medical researchers.
BN: How do you go about optimizing the vaccine and making it smaller?
Jojic: There are two ways. First, there are areas of the virus that may not be important for immunity, so you do not have to include those patterns. The other thing, which is even neater, is that you don't necessarily need a cocktail of virus strains, because in a cocktail many areas repeat. These pieces that I talked about, the epitopes, overlap a lot in the virus. By exploiting this overlap, you can create shorter vaccines. Algorithms for computer vision already do this automatically as well.
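The overlap idea can be sketched in a few lines of Python. This is a simple greedy merge of overlapping strings, a standard heuristic for building a short string that contains every input piece. It is not the algorithm the team used, and the toy pieces and function names are assumptions made for illustration.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(pieces):
    """Repeatedly merge the pair with the largest overlap until one string remains."""
    unique = set(pieces)
    # Drop pieces that are already contained inside another piece.
    items = sorted(p for p in unique
                   if not any(p != q and p in q for q in unique))
    while len(items) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        # Glue the two pieces together, writing the shared part only once.
        merged = items[best_i] + items[best_j][best_n:]
        items = [x for idx, x in enumerate(items)
                 if idx not in (best_i, best_j)] + [merged]
    return items[0] if items else ""


# Toy "epitopes" (assumed data) that overlap by two letters each.
pieces = ["ABCD", "CDEF", "EFGH"]
compact = greedy_merge(pieces)
print(compact)  # ABCDEFGH
```

Concatenating the three pieces would take 12 letters; the merged string needs 8 and still contains each piece, which is the kind of saving the overlap makes possible.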
BN: Is this similar to a compression model for digital data, for example image compression?
Jojic: It's quite different. The image compression models are based on expressing every pattern as a sum and they typically ignore overlaps. For example, the usual JPEG algorithms would break things into non-overlapping patches and express each one of them as a sum of components. And the weights of those components would be encoded.
This is quite a bit different, because the compression that you get is meant more to generalize. If you're forcing your representation to be small, that means you have to make things that look the same be considered the same. If you do that, then you are generalizing and you can understand images better. That was our goal.
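The "sum of components" description of block-based compression can be made concrete with a small Python sketch. It assumes a one-dimensional signal and applies the discrete cosine transform to non-overlapping blocks of eight samples. Real JPEG works on two-dimensional 8x8 blocks and adds quantization and entropy coding, which are omitted here. The function names are hypothetical.

```python
import math

BLOCK = 8  # samples per non-overlapping block


def scale(k, n):
    """Normalization factor that makes the cosine components orthonormal."""
    return math.sqrt((1 if k == 0 else 2) / n)


def dct(block):
    """Weights of the cosine components whose sum reproduces the block."""
    n = len(block)
    return [scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                              for i, x in enumerate(block))
            for k in range(n)]


def idct(weights):
    """Rebuild a block as the weighted sum of cosine components."""
    n = len(weights)
    return [sum(scale(k, n) * w * math.cos(math.pi * (i + 0.5) * k / n)
                for k, w in enumerate(weights))
            for i in range(n)]


def encode(signal):
    """Split the signal into non-overlapping blocks and transform each one."""
    return [dct(signal[i:i + BLOCK]) for i in range(0, len(signal), BLOCK)]


def decode(encoded):
    """Invert every block independently and concatenate the results."""
    return [x for weights in encoded for x in idct(weights)]


signal = [float(v) for v in range(16)]
encoded = encode(signal)
restored = decode(encoded)
print(max(abs(a - b) for a, b in zip(signal, restored)))  # tiny rounding error
```

Each block is handled on its own, so a pattern that repeats in two blocks is encoded twice. That is the contrast with the epitome approach, where similar patches map to the same spot.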
BN: Is it strange for Microsoft Research to be involved with medical research?
Jojic: Microsoft Research is one of the top research institutions in computer science. And computer science is starting to look into biology applications, which is not surprising since the amount of data being gathered all the time is larger and larger.
After all, biological systems do store data in the digital format, so it is a digital problem. This is especially true of machine learning; we are interested in these algorithms that apply to many different types of data. And the reason why we are doing it at Microsoft is because many of these types of data are very important for Microsoft products, like images, e-mail messages, video sequences and so on.
Inevitably, when you do a good job in one of these domains, it turns out that it's applicable to other domains - and one of those is biology. The question is why not try to make an impact on the societal level as well.
BN: Microsoft has announced that vaccine designs are already in testing. What will be learned from the work when results come later this year?
Jojic: We are hoping that first of all, our hypothesis will be valid. In addition, we are hoping to get more information about the immune system by running these experiments. We hope to get some new data that should help us narrow down what should be in the vaccine and learn more about the intricacies of the immune system.
The next thing that we are going to do is run the experiments in mice. That will mimic the whole process of the vaccination, where the vaccine gets delivered into a cell of a mouse and then we see the response of the immune system.
BN: Can these techniques be used elsewhere in medical research?
Jojic: First of all, you can apply this work to other pathogens to build vaccines. Additionally, you can use the same techniques in conjunction with evolutionary models and try to understand various forms of variability in evolution.
We have also used these representations for pattern discovery in regulatory mechanisms in the human genome and various other places. The work we are doing can have an application in many, many different places in biology.