
Challenges of Bioinformatics, Computational Biology, and Big Data Analysis

Let’s start by comparing two commonly confused concepts: bioinformatics and computational biology.

Bioinformatics actually originated as a concept proposed by Paulien Hogeweg in the 1970s, growing out of theoretical biology: a biology that exists in theory (generated by computation) and is matched against experimental results. Today, bioinformatics has a clear positioning: it is a tool-based discipline, specifically the application of information technology (IT) to biological research, a combination of IT and BT (biotechnology). Many newcomers to the field feel quite proud of this pedigree, since it combines two of the world’s hottest technologies. The 20th century was the century of IT, the 21st is the century of BT, so what does IT + BT add up to? Surely a hot topic, right? The truth, however, is that IT + BT still can’t save the young field of bioinformatics. Many graduates of our lab (the Bioinformatics Research Lab at the Medical School of Tsinghua University, PI: Professor Liu Xiangjun) have changed fields: some went to Microsoft, others to Yahoo, some became officials, some switched to medical devices, and some became patent agents. At present only two of us are still struggling on in bioinformatics (or three, if you count a junior colleague in biostatistics). This suggests that bioinformatics as a career direction is not as exciting as the concept makes it sound. Still, with 23andMe recruiting Robert Gentleman at a high salary, major companies competing ever harder for bioinformatics talent, and the number of domestic bioinformatics job ads rising, change seems imminent.

Computational biology, unlike bioinformatics, is a research discipline rather than just a toolkit. Compared to bioinformatics, computational biology is the true heir to the theoretical biology Hogeweg originally proposed. Computational biology undoubtedly uses bioinformatics as a tool, but it is aimed squarely at biological research, with specific scientific questions in mind. Bioinformatics is a broad tent that many scientists with backgrounds in physics and mathematics have entered and thrived in; computational biology, by contrast, demands rigorous biological knowledge and is defined by its focus on biological questions. The two are fundamentally different, yet they overlap heavily. For example, developing the BLAST algorithm was a classic bioinformatics project, not computational biology; deeply mining public data with BLAST from a distinctive biological perspective is computational biology. For labs that have, or will soon have, high-throughput biological data, what is actually needed is computational biology, because that is what data analysis in the service of biological research means.

Big data is naturally one of today’s most popular terms. The English Wikipedia defines big data as data sets too large or complex to be stored, managed, or analyzed with traditional methods. The most familiar kind is social big data, generated by tracking user behavior or whereabouts through large websites, apps, and electronic devices such as smartphones. Analyzing sales records on Taobao lets subsequent ads for related products accurately target audiences; analyzing football team data can predict World Cup outcomes, giving rise to legends like Paul the Octopus; analyzing someone’s mobile phone location data to track their daily movements can predict their occupation, and even private attributes such as personality, hobbies, and diseases. Although social big data is vast and complex, the data types within any given industry are relatively stable. As an emerging profession in the United States, big data analysts can land a job after short-term training, mostly running analyses through databases (without needing to program), so the entry threshold is not especially high. The analysis of textual big data relies mainly on machine learning algorithms; deep learning, a newer and hotter field built on machine learning and closer to artificial intelligence, can learn abstract, multi-level representations of data such as images and sound.

If social big data has exploded because it is cheap to produce, biological big data is not nearly so cheap. The Human Genome Project was budgeted at one dollar per base pair, three billion dollars in total (about 2.7 billion was actually spent); today a genome costs roughly 1,000 dollars. That is still not cheap enough, but the general trend, as technology advances, is downward.

What, then, is biological big data? I looked around online, and the explanations and understandings are quite messy; many miss the essence of big data, namely inductive learning used for prediction and practical guidance. Broadly speaking, any data related to biology can be called biological big data: clinical data and high-throughput laboratory data, for example, can both serve as big data for application-oriented analysis. In terms of analytical behavior, big data analysis is clearly data-driven, using inductive methods to statistically reveal the objective laws behind seemingly chaotic data. So whether the source is clinical records or high-throughput experiments, if the data are used for analysis aimed at practical guidance, the work qualifies as big data research. Among the dozens of sub-disciplines of bioinformatics, one called translational bioinformatics (TBI) is especially relevant to the analysis of biological big data. According to Wikipedia, translational bioinformatics starts from various kinds of biological big data and integrates bioinformatics, biostatistics, statistical genetics, clinical informatics, and so on, to produce knowledge or information usable for clinical diagnosis, treatment, and scientific research. From this perspective, the analysis of biological big data also falls within the scope of bioinformatics.

Apart from not being cheap, biological big data differs from social big data in many ways. Social big data reflects human behavior, while the force driving biological big data is the essence of life itself; the two are not on the same level, and in complexity the latter far exceeds the former. Setting clinical data aside, molecular big data is organized around the central dogma and spans multiple dimensions: DNA, RNA, proteins, and metabolites. Each dimension in turn spawns dozens of platforms measuring different molecular behaviors. The relationships among these data are intricate, and the real world of biological macromolecules behind them is more mysterious still. So far, even the world’s most advanced data analysis can only observe and speculate about simple phenomena; it is hard to predict the behavior of even a single molecule inside a cell. Leaving aside the lack of sufficiently intelligent software, even if such software existed, I seriously doubt that the combined computing power of every computer on Earth would be enough.

Some say that when enough data accumulate, the data will speak for themselves and hand us conclusions. I have no such confidence. First, a large proportion of biological big data is experimental measurement data, much of it indirect (such as gene chip data) and loaded with systematic error. I once saw an example in which a lab ran a transcriptome experiment with 100 tumor samples (T) and 90 normal controls (N). Because of the large sample count, the experiment was split into four batches: 50T + 50T + 45N + 45N. They chose the best Affymetrix whole-genome chip platform of the day, performed laser microdissection on fresh tumor samples to ensure the measured tissue really was tumor, made sure the mRNA had not degraded, ran the reverse transcription promptly, and removed RNase thoroughly. When the data came out and were normalized, principal component analysis (PCA) looked perfect: all tumor samples and all normal samples separated cleanly. The subsequent cluster analysis and differential expression analysis were perfect too. Yet this experiment, nearly flawless in its biological design, produced data that were complete and irreparable garbage. Why? Because it violated the principle of randomization in experimental design, creating an irremovable batch effect (see the sketch below): the analysis results are, in theory, an inseparable mixture of systematic error and biological difference, and in practice the systematic error would completely swamp the biological effect. Feed such data into an analysis pool and they will devastate the scientific validity of whatever comes out.

Will data speak for themselves? No; that notion rests on a scientific premise. Then, for data from a sound experimental design with systematic error in the normal range, will they speak for themselves? Still no. Even social big data come in different distributions that demand matching statistical models or learning algorithms, and biological big data are more complex still. For example, the algorithms for analyzing expression profiles from biological chips and from high-throughput sequencing are completely different, because the underlying data distributions differ, and the two cannot be mixed. Letting the data speak for themselves is a beautiful illusion; a professional bioinformatics scientist is indispensable.
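To make the batch-effect argument concrete, here is a minimal simulation sketch in Python with numpy and scikit-learn (my choice of tools, and all numbers are hypothetical, not the lab’s). It reproduces the confounded 50T + 50T + 45N + 45N design with zero true biological signal; PCA still “separates” tumor from normal, purely through per-batch offsets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_genes = 1000

# Four batches exactly as in the story: 50T + 50T + 45N + 45N,
# so batch and condition are perfectly confounded.
batches = [("tumor", 50), ("tumor", 50), ("normal", 45), ("normal", 45)]

blocks, labels, batch_ids = [], [], []
for batch_id, (condition, n) in enumerate(batches):
    # Each batch gets its own per-gene systematic offset (the batch effect).
    batch_offset = rng.normal(0.0, 1.0, n_genes)
    # Biological signal is deliberately absent: tumor and normal samples
    # are drawn from the SAME distribution, so any separation is artifact.
    blocks.append(rng.normal(0.0, 0.5, (n, n_genes)) + batch_offset)
    labels += [condition] * n
    batch_ids += [batch_id] * n

X = np.vstack(blocks)
labels, batch_ids = np.array(labels), np.array(batch_ids)
pcs = PCA(n_components=2).fit_transform(X)

# Samples form four tight, well-separated clusters -- one per batch.
# Because every batch contains only one condition, "tumor vs. normal"
# looks cleanly separable even though no biological difference exists,
# and no correction can disentangle batch from biology in this design.
for b in range(4):
    m = batch_ids == b
    print(f"batch {b} ({labels[m][0]}): "
          f"PC1 mean = {pcs[m, 0].mean():6.1f}, PC2 mean = {pcs[m, 1].mean():6.1f}")
```

Set the batch offset’s standard deviation to zero and the four clusters collapse into one, which is the whole point: the separation comes from the batches, not the biology. As for the distribution issue mentioned above, microarray intensities are continuous measurements (roughly log-normal after transformation), while sequencing read counts are discrete and are commonly modeled with a negative binomial distribution, which is why analysis tools built for one platform cannot simply be run on the other.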

Speaking of interpreting biological big data, there is a story (I cannot vouch for its truth) about someone who had their genome sequenced by several well-known American health and medical companies. The sequences they received were basically consistent, but the interpretations varied wildly, with some conclusions even contradicting one another. It is said that this was one of the main reasons the US FDA shut the industry down last year. I find this understandable: put simply, there is a serious gap between the underlying technical capability and the marketing pitch. In the rush to capture the market, the expectations companies create in people’s minds outrun what the technology can currently deliver (perhaps it will catch up some year in the future). Just as stock prices never exactly track the current economy, when expectations and actual development drift too far apart, a market hand applies a restoring force to pull prices back into line. I believe a similar market hand operates in the health and medical market.
