Data Science in Biotechnology
By Kat Campise, Data Scientist, Ph.D.
Since all companies are now data and tech companies, regardless of the sector, there is no shortage of the need for job candidates who have demonstrable expertise in math, statistics, and programming. Once you’ve mastered the essential data science skillset, you have the opportunity to apply it to just about every industry, and biotechnology is no exception.
What is Biotechnology?
If we apply the term broadly, human beings have been “biotechnicians” since the moment we began breeding animals to produce specific features and manipulating crops for the same purpose. In a more specific sense, biotechnology is “any technological application that uses biological systems, living organisms, or derivatives thereof, to make or modify products or processes for specific use.” Within the context of the 21st century, our technological tools (math, statistics, computational resources, availability of data sources, etc.) are significantly advanced and continuously improving.
Our biological knowledge has also substantially increased as we know much more about the molecular interactions at the genomic level and, through the use of predictive models, can better determine the likely outcomes of manipulating the cellular realm. Of course, this is not foolproof. It’s difficult to capture every possible outcome based on a set of internal and external features where there is also an array of possible input factors. With crops, predictive accuracy is far more feasible than with more complex biological systems, especially humans. But, the pharmaceutical industry is a salient example of biotechnology as applied to human physiology. The high cost of medications and litany of side effects notwithstanding, pharmaceutical biotech has attained a certain amount of success in improving patient health outcomes.
With that in mind, there are diverse branches of biotechnology:
- Medical: vaccines, stem cell research, pharmaceuticals, gene therapy, etc.
- Agricultural (or plant): similar to medical biotechnology but applied to plants.
- Animal: the goals are parallel to medical biotechnology with the application being non-human animals.
- Industrial: the focus here is on producing items such as detergents, cosmetics, textiles, bio-fuels, etc.
- Environmental: utilizing existing biological systems to help mitigate or remove the damage caused to ecosystems (usually through the other biotech areas, such as pesticides and plastics).
- Marine: the same overarching biotechnology goals and objectives are incorporated within marine biotechnology, but the products are derived from and/or focused on the aquatic ecosystem.
As you can see, there are many overlapping areas between all of the biotechnology subsectors. Some may reduce this list to three, more comprehensive, sectors: medical, industrial, and agricultural. Indeed, we are interacting with biotech products on a daily basis.
Biotechnology and Data Science
Ultimately, biotechnologists are research scientists who apply statistical analyses to the tiny world of molecular biology. If you think that they are basically data scientists within a highly specific sector, you’re correct. Both biotechnologists and data scientists are expected to be experts in research design (true experimental vs. pre-experimental vs. quasi-experimental). Then, there is the ever-present need-to-know trio: math, statistics, and programming. One minor difference arises with regard to stats. Biotechnologists focus on biostatistics, which is a specialty within statistics. But, any data scientist with a hardcore math and stats background can easily shift their focus to biostats.
Biotechnologists are drenched with data. The molecular world and its larger environment are dynamic systems. Each contains massive amounts of quantitative data, and sifting through which factors are more likely to produce a particular effect requires major computational effort. Thus, biotechnologists must use the same, or at least similar, programming tools to carry out their research: Python and R (some employers may also require C++). They need to pull data from databases as well, so SQL can be added to the list of biotech “need to know.”
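To make the Python-plus-SQL pairing concrete, here is a minimal sketch of pulling rows from a database. Everything about the data is an assumption for illustration: the `expression` table, its `gene` and `level` columns, and the sample rows are made up, and an in-memory SQLite database stands in for whatever database an employer actually uses.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expression (gene TEXT, level REAL)")
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?)",
        [("GENE_A", 2.5), ("GENE_B", 0.4), ("GENE_C", 7.1)],
    )
    conn.commit()
    return conn


def fetch_high_expression(conn, threshold):
    """Return (gene, level) rows above the threshold, highest level first."""
    query = (
        "SELECT gene, level FROM expression "
        "WHERE level > ? ORDER BY level DESC"
    )
    # The ? placeholder lets the driver bind the value safely.
    return conn.execute(query, (threshold,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for gene, level in fetch_high_expression(conn, 1.0):
        print(gene, level)
    conn.close()
```

The same pattern (connect, run a parameterized query, fetch the rows) carries over to other database drivers, though connection details will differ.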
However, there isn’t a perfect crossover between a data scientist and a biotechnologist. The missing ingredient, unless the data scientist majored in biotech or another biology focused discipline, is knowledge of activities such as preparing microbial seed vials, creating and monitoring cell banks, filtration, extraction, chromatography, etc. You will also need to be familiar with specific software programs used within whichever biotech sector you’re entering. Examples include, but are not limited to, the Genome Analysis Toolkit, Burrows-Wheeler Aligner, and the Protein Variation Effect Analyzer (PROVEAN).
Another differential surfaces when we review the starting salary of a biotechnologist vs. a data scientist. According to the Bureau of Labor Statistics, biotechnicians have a median pay of $43,800 per year and the expected job growth is only 10% through the year 2026. Comparing this to the average salary of a data scientist, $139,840 according to Glassdoor, may bring some would-be biotechnologists to the conclusion that entering data science would be far more lucrative.
If we look a little more closely, however, biotechnologists don’t have the same academic expectations; a bachelor’s degree and maybe a year of lab experience is the minimum for job entry. Meanwhile, the current minimum for data scientists is a master’s or Ph.D. in a STEM field (computer science, math, engineering, physics, biology, biostats, etc.) and five or more years of data science experience. Some employers set the academic bar at a STEM bachelor’s degree, but with an even higher number of years of experience.
If you research biotechnology, biotechnologist, and data scientist through Indeed or Glassdoor, you may notice a trend: data scientists are in higher demand, or employers are advertising data science positions for jobs formerly considered to be the realm of biotechnologists. So, there is a crossover occurring between data scientists and biotechnologists. It favors candidates who have the essential biotech knowledge, though they’ll likely be referred to as data scientists.
How Data Science is Impacting Biotechnology
Are you interested in helping to cure cancer or any of the other nefarious diseases that continue to infiltrate human health? Do you want to help create products that are safer for the environment or help to remove dangerous chemicals from our oceans, land, and air? While data science continues down its own evolutionary path, the science of data doesn’t stray from the fundamentals of:
- Generating a question (or series of questions)
- Collecting data
- Prepping data
- Choosing a relevant model
- Testing the model
- Fine-tuning the model
- Launching the model into a larger production environment
- Continuously monitoring and optimizing the model
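The first several steps above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the data is simulated (a made-up linear relationship with noise and some missing values), the "model" is a simple least-squares line, and the function names are hypothetical. Deployment and monitoring are left out.

```python
import random
import statistics


def collect(n, seed=0):
    """Collect: simulate n (x, y) pairs with y about 3x + 2; ~10% of y values are missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3 * x + 2 + rng.gauss(0, 1)
        if rng.random() < 0.1:
            y = None  # simulated missing measurement
        rows.append((x, y))
    return rows


def prep(rows):
    """Prep: drop any row with a missing value."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_line(rows):
    """Fit: ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean([x for x, _ in rows])
    mean_y = statistics.fmean([y for _, y in rows])
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, rows):
    """Test: average absolute prediction error on the given rows."""
    slope, intercept = model
    return statistics.fmean([abs(y - (slope * x + intercept)) for x, y in rows])


def run_pipeline(n=200, seed=0):
    """Run collect -> prep -> fit on 80% of rows -> test on the held-out 20%."""
    rows = prep(collect(n, seed))
    split = int(0.8 * len(rows))
    train, test = rows[:split], rows[split:]
    model = fit_line(train)
    return model, mean_abs_error(model, test)


if __name__ == "__main__":
    model, error = run_pipeline()
    print("slope, intercept:", model)
    print("held-out mean absolute error:", error)
```

Holding back part of the data for testing is the key design choice here: the error is measured on rows the model never saw during fitting.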
But, data science is now permeating industries where those predictive models can have a life-or-death impact on humanity. For example, there have been tragedies with driverless cars: Tesla and Uber self-driving vehicles have killed two people, and numerous non-lethal crashes (which aren’t always reported) have damaged property and placed humans at risk. Transfer this possibility to pharmaceuticals, robotic surgeons, and our food supply, and it’s a clarion call that data scientists are beginning to shoulder an immense responsibility. Algorithms are only as accurate as those who are creating them. We must move forward with caution, perpetual self-audits, and careful analysis of how much power we are lending algorithms and their larger “artificially intelligent” systems.
Data Science and DNA
To say that the human genome is complex is an exponential understatement. Nature and nurture are mitigating factors that can activate a particular gene or influence its dormancy. The increase in computational power, which includes gathering and processing zettabytes (if not yottabytes) of data, has helped move the process of gene sequencing forward at a faster pace, and it can now be used to sequence genomes at the individual level.
Data scientists, as the gurus of Big Data, can facilitate major advances in medical biotechnology by utilizing the individualized genome sequencing as a factor in predicting the onset of a disease. A unique health and wellness profile can be generated based on both genomic and lifestyle data. If this is tied to a health and wellness app where the user can be alerted if certain foods or activities increase their risk of a particular disease, then early detection may reduce medical costs.
Certainly, this assumes that the user has a high rate of adherence, e.g., turning on their app to record the number of steps or accurately entering their meals consumed throughout the day. Alternatively, these health and wellness activities can be recorded automatically or require minimal effort on the user’s part. Data scientists are helpful here as well. If recording a meal is as easy as taking a picture (AI image recognition falls within the data science realm), then user adherence is likely to increase (but we also need data scientists to test whether this is true).
Data Science and Climate Change
The fact of the matter is, earth’s climate does shift over time. However, our earlier stages of biotechnology are a factor in increasing environmental toxins. We overfish our oceans, pump waste all over the planet, and the population isn’t decreasing; we’ll definitely need more agricultural land for plants and animals. Flooding, fires, tornados, earthquakes, and hurricanes are relentless threats. But, due to the explosion in IoT devices, we have “eyes,” “ears,” and electronic proprioceptors ingesting data about everything from the local and global temperatures to the activities of our sun (and beyond). Therefore, data scientists are the perfect types of scientists to help make sense of all that data.
As a matter of fact, data scientists of all levels (novice to expert) are currently working on certain predictive facets that are directly or indirectly related to climate change (and our impact on earth’s ecosystems):
- Hurricane Florence — Building a Simple Storm Track Prediction Model
- Data for Climate Change Challenge
- How Artificial Intelligence Can Fight Air Pollution in China
- Using Data to Better Understand Climate Change
The crucial thread throughout all of the articles is data and deciphering its meaning, if it has any at all. The aforementioned IoT devices being deployed as environmental sensors still need improving and, as the saying goes, garbage in, garbage out. If the sensors are not relaying high-quality (no nulls or missing data) and reliable data, then a large inferential gap is left open. Data scientists can build the initial predictive models based on the current data (after cleaning and prep). Then, engineers can refine the sensors to extract better quality data. The cycle repeats, and will repeat, until the predictive accuracy reaches a certain threshold.
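As a rough illustration of the "garbage in, garbage out" check described above, the sketch below measures how much of a sensor feed is missing and drops readings that are absent or physically implausible. The readings, the plausible range, and the 20% missing-data cutoff are all made-up assumptions, not values from any real sensor network.

```python
from statistics import fmean


def missing_rate(readings):
    """Fraction of readings that are None (missing)."""
    if not readings:
        return 0.0
    return sum(r is None for r in readings) / len(readings)


def clean(readings, low, high):
    """Keep readings that are present and inside the plausible range [low, high]."""
    return [r for r in readings if r is not None and low <= r <= high]


def summarize(readings, low, high, max_missing=0.2):
    """Report data quality; max_missing is a hypothetical cutoff for a usable feed."""
    rate = missing_rate(readings)
    kept = clean(readings, low, high)
    return {
        "missing_rate": rate,
        "kept": len(kept),
        "mean": fmean(kept) if kept else None,
        "usable": rate <= max_missing and bool(kept),
    }


if __name__ == "__main__":
    # Made-up temperature readings in Celsius: one missing, one implausible spike.
    feed = [21.0, None, 22.0, 999.0, 23.0]
    print(summarize(feed, low=-40.0, high=60.0))
```

A report like this gives engineers something concrete to act on: a feed that fails the check points to a sensor worth refining before its data goes into a model.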
The IBM project for predicting China’s air quality is a perfect example of building, implementation, and adjustment. While the model is better than prior tools, work is still underway to raise its accuracy and to both shorten and extend the air quality prediction window (from minutes to possibly months or years in advance). Moreover, we could use better predictive models for ways to reduce carbon emissions in addition to forecasting air quality.
Becoming a Data Scientist in Biotechnology
We’ve already discussed the main expertise you’ll need as both a data scientist and a biotechnologist. If you have zero knowledge of biotech but are an aspiring or professional data scientist who wants an intro to the topic, you might take a look at Coursera’s selection of courses:
- Industrial Biotechnology
- Genes and the Human Condition (From Behavior to Biotechnology)
- Systems Biology and Biotechnology Specialization
It’s free to audit individual courses (you won’t have access to assignments or exams). Or if you want to pursue a certification, the per class fee varies between $49 and $79. Another alternative is to take a course from edX; they too have biotech education available:
- Industrial Biotechnology
- Statistical Analysis in Bioinformatics
- Biological Engineering: Cellular Design Principles
Most of edX’s courses are free to take if you don’t want a certification. The Statistical Analysis in Bioinformatics course, however, is a part of a larger academic program and currently costs $249. For those of you who are just now entering data science, we recommend that you establish your core data science skills first, especially math and statistics. As you gain more experience with data science processes, layering in biotech coursework (whether formally or via a MOOC) at a later time helps reduce the cognitive load. It’s wise to begin practicing on smaller datasets with fewer features and then graduate, at your own pace, to larger and more complex datasets; doing so will provide a smoother path towards your career in data science.