Last year an international team of researchers from Canada, Germany, Russia and the U.S. published a paper about macrophages — large white blood cells whose job is to search out, engulf and destroy pathogens in the body. Macrophages, which toggle among three different states depending on the concentration of certain substances in the body, can also play a role in tumor growth when one of those modes is activated. As part of their research, the scientists looked at the Krebs cycle — a metabolic process first identified and described in 1937 — and studied how macrophages switch among states. Notably, the team used a specially designed computer algorithm to combine RNA sequencing with metabolic profiling data. They discovered that the chemical compound known as itaconate plays a crucial role in both the Krebs cycle and macrophage mode switching. One of the hypotheses for future researchers to test is whether itaconate can be used to force macrophages to switch among modes — essentially, employing a computer algorithm to try to “hack” the human immune system to help fight cancer.
The macrophage research was made possible by advances in bioinformatics, the science of using computers and large computational clusters to analyze biological and biochemical data. Just as reverse engineers decode machine-executable instructions to understand what is happening inside a program, scientists studying living cells need to understand all the internal processes at work. Cells consist of dozens of components and contain thousands of proteins and metabolites. Deciphering their interactions is much more complicated than understanding how a smartphone or any other complex modern device works. Biological information is stored on multiple levels: deoxyribonucleic acid (DNA), ribonucleic acid (RNA), proteins and many variations of those biomolecules. Algorithms capable of processing large amounts of varied data thus become a crucial component in understanding that information.
DNA, which stores genetic information, is similar to a computer hard drive. RNA is analogous to a computer’s working memory, where only the fragments currently needed are loaded (in cells this operation is called transcription). Proteins implement the many functions in the cell; they are synthesized from RNA in a process called translation. Proteins can be compared to the software programs executed on a computer. But the computer analogy only goes so far, because understanding all the intracellular processes requires more than simply identifying the genetic code. Those processes depend not only on various objects and substances but on their interactions, which can be quite complex. Moreover, metabolic processes like the Krebs cycle can transform substances, making the situation unstable. In total, the number of various interacting components in the cell can reach the tens of millions, as the composition and concentration of chemicals change with time.
Dutch theoretical biologist Paulien Hogeweg and her colleague Ben Hesper are generally credited with coining the word “bioinformatics” in the early 1970s, using it to describe “the study of informatic processes in biotic systems.” However, it wasn’t until the late 1980s that the term was used to refer to the computational analysis of genomics data. In fact, before the 1980s biology was not a quantitative discipline; it was aimed at describing, classifying and building qualitative models. Experimental data was small, and sophisticated methods were not needed to analyze it.
Bioinformatics would be impossible without the progress in sequencing over the preceding decades, starting in the 1950s with work by British biochemist Frederick Sanger, who determined the amino acid sequence of the protein insulin. Sanger would go on to make major breakthroughs in sequencing RNA molecules, in the 1960s, and the nucleotide order of DNA, in the 1970s; he is one of only two people who have received a Nobel Prize twice in the same category. (He won in chemistry in 1958 and 1980.)
The first sequencing experiments produced kilobytes of data. By the time Sanger won his second Nobel, they were producing hundreds of kilobytes of data thanks to improvements in technology. By April 2003, when an international consortium completed the 13-year project to map and sequence the 23 chromosome pairs that constitute the human genome, DNA sequencing was creating gigabytes of data (six gigabytes in the case of the Human Genome Project, which used the Sanger method for sequencing).
Since then several high-throughput sequencing technologies have emerged. The most notable is next-generation technology introduced by Solexa, a Hayward, California, company that was acquired by San Diego–based Illumina in 2007. These technologies have lowered the cost of sequencing the human genome from around $5 billion in 2003 to just $1,000 in 2017.
Gigabytes of Data
Modern genome sequencing machines produce hundreds of gigabytes of data on a daily basis; processing this data is not feasible without computers. Advances in technology — namely, steady-state profiling methods — allow researchers to take snapshots of what is happening in cells on the RNA, protein and metabolic levels. To process this information, scientists need new algorithms that are capable of combining data from multiple sources in a rigorous and systematic way.
Besides sequencing and steady-state profiling, researchers can turn to databases curated by biology and bioinformatics community members; examples include the National Center for Biotechnology Information (NCBI) and the Kyoto Encyclopedia of Genes and Genomes (KEGG). These databases store information in a structured form around known biological objects (genes, proteins, enzymes, molecules) and their interactions and transformations (pathways, reactions, functional hierarchies, etc.). Unstructured information can be found in research papers, which are also available in online databases, such as PubMed, maintained by the National Institutes of Health (NIH), and in open access journals such as PLOS Biology. Although these databases are generally well maintained, they may not reflect all the available information at any given moment. Therefore, the main goal of unstructured information analysis is to extract relationships among biological objects not yet present in databases. Tools and techniques that can be used to do this include deep learning, latent Dirichlet allocation (LDA), Word2vec and other natural language processing methods.
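As a minimal sketch of this text-mining idea, the snippet below scores candidate relationships by counting how often two entity names co-occur in the same sentence of an abstract. This is a crude stand-in for the LDA, Word2vec and deep-learning methods mentioned above, and the abstracts and entity list are invented for illustration.

```python
# Crude relationship extraction: count sentence-level co-occurrences of
# known biological entity names across a set of abstracts.
# Real pipelines use LDA, Word2vec or deep models; this data is invented.

from itertools import combinations
from collections import Counter

ENTITIES = {"itaconate", "macrophage", "krebs cycle"}

def cooccurrences(abstracts):
    """Return a Counter mapping sorted entity pairs to co-occurrence counts."""
    counts = Counter()
    for text in abstracts:
        for sentence in text.lower().split("."):
            present = {e for e in ENTITIES if e in sentence}
            for pair in combinations(sorted(present), 2):
                counts[pair] += 1
    return counts

abstracts = [
    "Itaconate regulates macrophage function. The Krebs cycle produces itaconate.",
    "Macrophage activation alters the Krebs cycle.",
]
c = cooccurrences(abstracts)
```

Pairs with unusually high counts relative to chance would then be flagged as candidate relationships for a curator or a downstream statistical test.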
However, the amount of data and knowledge available in these databases far exceeds what a researcher can grasp without the aid of computer algorithms. For example, there are more than 20 million genes and more than 10,000 biochemical reactions in the KEGG database, and PubMed contains more than 20 million abstracts. These algorithms need to incorporate both the existing knowledge available in databases and new experimental results from RNA, protein and metabolic profiling. The first step in processing new data is to map it to the existing knowledge base: sequences or networks from databases. The second step is to check whether the new data is consistent with existing data; this usually involves computing relevance scores and solving an optimization problem. When inconsistencies are found, researchers need algorithms to check whether they are statistically significant; if they are, new knowledge has been found.
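The significance-checking step can be illustrated with a standard example: testing whether the overlap between a list of experimentally flagged genes and a pathway's gene set is larger than chance. The sketch below uses a one-sided hypergeometric test; all of the gene counts are made up for illustration.

```python
# One-sided hypergeometric test: probability of seeing at least k pathway
# genes among n experimental hits, drawn from N genes of which K are in
# the pathway. The numbers below are invented for illustration.

from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(overlap >= k) under random sampling without replacement."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Hypothetical numbers: 20,000 genes total, a 50-gene pathway,
# 100 experimental hits, 5 of which fall inside the pathway.
p = hypergeom_pvalue(20000, 50, 100, 5)
```

Here the expected overlap by chance is only 0.25 genes, so an observed overlap of 5 yields a very small p-value and the pathway would be flagged as significantly enriched.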
Genes and Metabolites
The basic building blocks and methods used in these algorithms depend on the type of data with which researchers are dealing. For genome and RNA sequencing data, the main methods are approximate string matching and local alignment, used in tools such as the NIH's Basic Local Alignment Search Tool (BLAST). To deal with networks that describe protein-to-protein interaction, metabolic pathways, gene interactions or chemical reactions, researchers typically use graph algorithms to find connected components and solve optimization problems on subnetworks. Statistical tests are needed to calculate probability values and relevance scores.
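The local alignment problem that BLAST accelerates can be illustrated with the basic Smith-Waterman dynamic program. The scoring parameters below are arbitrary choices for the sketch; BLAST itself uses substitution matrices and seed-and-extend heuristics rather than this exact recurrence.

```python
# A minimal Smith-Waterman local alignment score in pure Python.
# Scoring (match=+2, mismatch=-1, gap=-2) is illustrative only.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of an alignment ending at a[i-1], b[j-1];
    # the 0 in the max() lets an alignment restart anywhere (local, not global).
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

best_score = smith_waterman("TTACGT", "ACG")  # exact 3-base match scores 6
```

Production tools replace this quadratic-time table with indexed heuristics, which is what makes searching billions of bases feasible.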
Consider a web service for integrated transcriptional and metabolic network analysis, developed by several of the researchers who published the 2016 paper on macrophages. Called GAM, which stands for "genes and metabolites," the service was built around an inventive subnetwork search algorithm. As a first step, the researchers loaded the chemical reactions network from the KEGG database. Then they mapped gene expression and metabolic data onto network nodes, keeping only the part of the network likely to be connected with the input data. After that they assigned relevance scores to each node and link in the network. The final step involved solving an optimization problem: finding a connected subnetwork that maximizes the total relevance score.
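That final step is an instance of the maximum-weight connected subgraph problem. The toy greedy heuristic below, run on an invented five-node network with invented relevance scores, shows the basic idea; it is only a sketch, since greedy growth stops at the negative-score node B and never reaches the positive node D behind it, which is why tools such as GAM rely on proper optimization solvers rather than greedy search.

```python
# Toy greedy heuristic for growing a high-scoring connected subnetwork.
# The graph and relevance scores are made up for illustration.

def grow_subnetwork(adj, score, seed):
    """Greedily grow a connected set from `seed`, repeatedly adding the
    neighboring node with the highest score while that score is positive."""
    chosen = {seed}
    while True:
        frontier = {n for v in chosen for n in adj[v]} - chosen
        best = max(frontier, key=lambda n: score[n], default=None)
        if best is None or score[best] <= 0:
            return chosen, sum(score[v] for v in chosen)
        chosen.add(best)

# Hypothetical network: nodes stand for genes/metabolites, edges for
# reactions, and scores for relevance derived from experimental data.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B", "E"}, "E": {"D"}}
score = {"A": 3.0, "B": -1.0, "C": 0.5, "D": 2.0, "E": -0.5}
nodes, total = grow_subnetwork(adj, score, "A")
```

The greedy result {A, C} scores 3.5, but the subnetwork {A, B, D} would score 4.0 by paying the penalty at B to reach D; an exact solver finds such trade-offs automatically.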
As computational power grows, researchers can handle ever-larger optimization problems, taking advantage of recent improvements in convex and distributed optimization methods. Algorithms such as those described in this article can help discover chemical compounds that play important roles in disease pathways. When applied to data related to cancer cells, these algorithms may even help researchers find ways to regulate those cells directly.
Thought Leadership articles are prepared by and are the property of WorldQuant, LLC, and are circulated for informational and educational purposes only. This article is not intended to relate specifically to any investment strategy or product that WorldQuant offers, nor does this article constitute investment advice or convey an offer to sell, or the solicitation of an offer to buy, any securities or other financial products. In addition, the above information is not intended to provide, and should not be relied upon for, investment, accounting, legal or tax advice. Past performance should not be considered indicative of future performance. WorldQuant makes no representations, express or implied, regarding the accuracy or adequacy of this information, and you accept all risks in relying on the above information for any purposes whatsoever. The views expressed herein are solely those of WorldQuant as of the date of this article and are subject to change without notice. No assurances can be given that any aims, assumptions, expectations and/or goals described in this article will be realized or that the activities described in the article did or will continue at all or in the same manner as they were conducted during the period covered by this article. WorldQuant does not undertake to advise you of any changes in the views expressed herein. WorldQuant may have a significant financial interest in one or more of any positions and/or securities or derivatives discussed.