What is the Difference Between Data Science and Statistics
Numerous conflicting opinions exist within the scientific community regarding the difference between Data Science and Statistics. From a distance, the fields often appear interchangeable, with a shared goal of utilizing data to solve problems. One of the most recognized voices in statistics, FiveThirtyEight founder Nate Silver, asserted that data science is merely a rebranding of statistics. However, leading academics including Vasant Dhar of NYU, Andrew Gelman of Columbia University, and David Donoho of Stanford describe Data Science as an applied branch of statistics resulting from the emergence of computer science. While data science and statistics do share several commonalities, a closer look at each field of specialty reveals fundamental differences.
What is Statistics?
Statistics is a field of study rooted in mathematics, providing programmatic tools and methods (such as variance analysis, mean, median, and frequency analysis) to collect data, design experiments, and perform analysis on a given set of figures to measure an attribute or determine values for a particular question. Statistical methods are used in all fields that require decision making.
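As a rough illustration of the tools named above, the following sketch computes the mean, median, variance, and a frequency count using only Python's standard library. The `describe` helper and the `purchases` figures are hypothetical and exist only for this example.

```python
from collections import Counter
from statistics import mean, median, pvariance


def describe(values):
    """Summarize a list of numbers with basic descriptive statistics."""
    return {
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),  # population variance
        "frequency": Counter(values),   # how often each value occurs
    }


# Hypothetical sample: items bought by ten customers in one visit.
purchases = [2, 3, 3, 5, 7, 3, 2, 8, 5, 2]
summary = describe(purchases)

for name, value in summary.items():
    print(f"{name}: {value}")
```

The population variance is used here on the assumption that the ten values are the whole group of interest; for a sample drawn from a larger group, `statistics.variance` would be the usual choice.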
What is Data Science?
Renowned scientist Jim Gray said, “everything about science is changing because of the impact of information technology,” and referred to Data Science as the “fourth paradigm” of science. The field of Data Science can be described as the crossroad at which machine learning, traditional research, and software development meet. A more wide-ranging multi-disciplinary field, data science goes beyond exploratory analysis, using scientific methods, algorithms and mathematical formulas to extract, evaluate, and visualize structured and unstructured data. Data science can be broken down further into data mining, machine learning, and big data.
Data Mining: A process of extracting and discovering patterns in large data sets.
Machine Learning: Used to build predictive models, Machine Learning is the study of computer algorithms that improve automatically through experience and by the use of data.
Big Data: Gartner, a leading research and advisory company, defines big data as “high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” In other words, big data consists of data sets so large or complex that traditional data processing software is unable to manage them. This is where Data Science comes into play, applying multiple disciplines and scientific methods to massive volumes of data to address questions and problems, and subsequently offer solutions or next steps.
Skills: Statistics
Strong mathematical skills are the foundation for anyone going into a career in statistics. Utilizing graphs, charts, and tables, statisticians are able to interpret and give structure to large amounts of quantitative information and present a comprehensive and digestible overview (often in layman’s terms) to colleagues, shareholders, and external clients.
In addition to technical skills, statisticians must possess business acumen, problem-solving abilities, objective reasoning, and critical thinking, as well as the strong interpersonal communication skills needed to convey information effectively.
Sought-after qualifications include:
- Degree in statistics, mathematics or economics with coursework in calculus, linear algebra, experimental design, survey methodology, statistical theory, and probability.
- Theoretical and Applied Statistical expertise.
- Proficiency* in computer programming languages including but not limited to Python, R, SAS, SQL, Julia, Scala, MATLAB, C/C++, Java, and Perl.
- Proficiency in Hadoop-based analytics for big data, including Hive and Pig.
- Knowledge of databases (such as NoSQL) and cloud computing.
Skills: Data Science (and the Role of Statistics)
A data scientist has the unique ability to interpret large amounts of data, examine the data from a number of perspectives, extract key elements and translate the findings into useful content that can then drive decision making or make recommendations that an organization can use to improve some aspect of its business.
Statistics is an essential arrow in every data scientist’s quiver. Brad Schumitsch, current Engineering Manager at Facebook, confirmed that “statistics is a crucial component of data science.” Data scientists should be fluent in concepts of descriptive statistics (measures of frequency, measures of central tendency, measures of dispersion or variation, and measures of position), probability theory (mathematical formulas used to measure the likelihood of a certain event occurring), as well as probability distribution, dimensionality reduction, over- and under-sampling, and Bayesian analysis. Statistical tools, including the mean, mode, and bias, are frequently the first methods data scientists learn in order to explore data.
Breaking these methods down further:
- Regression, causal effects analysis, time series analysis, latent variable analysis, and survey design may be used to understand user engagement by analyzing conversion, retention, and leads. These tools can be used to explain why sales have increased or decreased in a certain time period, and ultimately help a business owner understand what is driving sales, forecast future sales, and predict potential future trends.
- Clustering, dimensionality reduction, and latent variable analysis are used to transform big data into actionable, big-picture recommendations for businesses. For example, grocery stores label and group their customers to help determine and understand buying habits. By creating groups like senior citizens, family-focused shoppers, and single millennials, management can create more targeted marketing campaigns. In an online business model, predictive modeling, latent variable analysis, dimensionality reduction, collaborative filtering, and clustering are applied to suggest what online users want next based upon their digital interactions (purchases, reviews, clicks) within a specific website.
- Using Bayesian data analysis, data scientists can intelligently estimate outcomes for web data, such as click-through rate. By combining global data with prior knowledge, data scientists will be able to determine a desirable estimate, communicate the properties of that estimate, and provide a summary of what the estimate means.
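The Bayesian click-through example above can be sketched with a Beta-Binomial model, one common way to combine prior knowledge with new data. This is a minimal sketch under stated assumptions: the `posterior_ctr` function, the Beta(2, 98) prior standing in for site-wide history, and the click counts are all hypothetical.

```python
def posterior_ctr(clicks, impressions, prior_clicks=1.0, prior_misses=1.0):
    """Update a Beta prior on click-through rate with observed click data.

    Assumes a Beta(prior_clicks, prior_misses) prior and binomial clicks.
    Returns the posterior mean and the posterior Beta parameters.
    """
    if clicks < 0 or impressions < clicks:
        raise ValueError("need 0 <= clicks <= impressions")
    alpha = prior_clicks + clicks
    beta = prior_misses + (impressions - clicks)
    return alpha / (alpha + beta), alpha, beta


# Hypothetical prior: site-wide history suggests roughly a 2% click-through rate.
# Hypothetical new ad: 3 clicks in its first 20 impressions.
estimate, alpha, beta = posterior_ctr(3, 20, prior_clicks=2, prior_misses=98)
raw_rate = 3 / 20

print(f"raw rate: {raw_rate:.3f}, Bayesian estimate: {estimate:.3f}")
```

With so few impressions, the estimate stays much closer to the prior than the raw rate does; as more impressions arrive, the observed data would increasingly dominate.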
According to Towards Data Science, a strong statistical foundation is supported by technical expertise in programming, multivariable calculus, linear algebra, and optimization methods.
A seemingly unrelated set of skills that recruiters in data science look for is strong communication, presentation ability, and a capacity to collaborate. Data scientists rarely work alone, partnering with analysts, engineers, business intelligence specialists, and architects to develop strategies. As the intermediary between this group of specialists and the C-suite, data scientists are tasked with translating data into a digestible story, often in layman’s terms, that can then be utilized to achieve business goals. Therefore, strong communication skills and an understanding of the industry are essential.
Additional sought-after qualifications include:
- Expertise in Data Wrangling – the process of transforming and mapping raw data from one form to another to prepare the data for further analysis.
- Aptitude for machine learning and DataRobot.
- Advanced knowledge of data mining, cleaning (data wrangling), visualization, and the technical programs associated with each.
- Proficiency* in computer programming languages, data platforms, and machine learning libraries including but not limited to Python, Tableau, Hadoop, Apache Spark, R, SQL, Java, Julia, Scala, MATLAB, TensorFlow, Apache Storm, Flink, Hive, and sklearn.
- Experience developing automated models.
*Data Science Central currently lists Python, R, SAS, and SQL as the most dominant programming languages for both statistics and data science.
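Data wrangling, listed among the qualifications above, can be illustrated with a small sketch that maps raw text records into typed, consistent ones. The `raw_rows` data and the `wrangle` function are hypothetical, and dropping rows with a missing value is just one possible policy.

```python
from datetime import datetime

# Hypothetical raw export: stray whitespace, mixed casing, numbers stored
# as text, and one missing value.
raw_rows = [
    {"customer": "  Alice ", "signup": "2021-03-01", "spend": "120.50"},
    {"customer": "BOB", "signup": "2021-03-04", "spend": ""},
    {"customer": "carol", "signup": "2021-03-09", "spend": "75"},
]


def wrangle(rows):
    """Map raw string records to typed records, skipping rows with no spend."""
    cleaned = []
    for row in rows:
        if not row["spend"].strip():
            continue  # missing spend value: skip rather than guess
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "signup": datetime.strptime(row["signup"], "%Y-%m-%d").date(),
            "spend": float(row["spend"]),
        })
    return cleaned


tidy = wrangle(raw_rows)
for record in tidy:
    print(record)
```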
Career Outlook, Opportunities, and Growth: Statistics
Over the next decade, as businesses grow their presence online and utilize social media and mobile devices to conduct their daily operations, the volume of data available is expected to increase. A 2019 study conducted by KPMG ranked improving planning and forecasting capabilities as a top priority. High-quality data and analysis are required to reach this goal, and companies in all industries, regardless of geographic area and size, will continue to search for new ways to collect, interpret, and capitalize on the abundance of information available. As a result, candidates with statistical expertise will be among the most sought after. This is supported by a BLS report that projects employment in the field of statistics to grow 35 percent between 2019 and 2029, adding nearly 15,000 new jobs, a rate faster than the average for other occupations.
Statisticians work with data from start to finish: determining the problem to be solved and what data is needed to solve it, choosing the appropriate method(s) of data acquisition, designing collection tools, carrying out data collection, preparing data for analysis, and finally interpreting and identifying trends in the data.
A number of industries already rely on statistical modeling, including but not limited to:
Government: Statisticians working in government are tasked with collecting national data by developing and analyzing surveys. The topics of focus span a wide array of areas, from the level of pesticides in drinking water and the number of remaining members of each endangered species by geographic region to unemployment rates, national wage averages, illness and fatality tracking, and more. A recent real-world example is the work of statisticians tasked with tracking COVID-19. Data is collected through a myriad of methods including surveys, questionnaires, experiments, and opinion polls.
Healthcare: In healthcare, statisticians are known as biostatisticians or biometricians. Employed by pharmaceutical companies, public health agencies, or hospitals, these individuals are tasked with designing studies surrounding the success or failure of specific drugs or tracking the origins of certain illnesses. Healthcare in particular is an industry where a statistician must also have solid background knowledge of the area in which they are analyzing data. This is essential in understanding trends in the field.
Research and development: From analyzing surveys for price sensitivity to studying consumer behavior across geographies or analyzing customer sentiment, companies worldwide use consumer data in myriad ways. Some real-world examples include:
- Product Development: Statisticians design experiments to test products throughout their development as well as analyze consumer data to assist in developing marketing strategies and price points for goods. This is another area where knowledge of the field is essential.
- Quality Testing: Statistics is often used in quality testing as companies that make thousands of products try to ensure the quality of their goods. To perform these tests, statisticians use sampling, testing a small group of products to determine the quality of the whole.
- Supply Chain Management: The retail space relies heavily on statistics, as companies track everything sold in stores and online and use the data to determine quantity and which products to ship. For example, during hurricane season large grocery chains and stores such as Target and Wal-Mart would analyze store purchasing information and determine that additional supplies of water, toilet paper, and dry goods are needed during storm preparations in susceptible areas.
- Consumer Behavior Analysis: Consumer data is one of a company’s most powerful tools in understanding its customers. For example, in 2018 Equifax Inc. partnered with students at Cornell University to determine how customers prioritized paying bills, determining whether someone is more likely to pay a mortgage, car, or cell phone bill first.
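The sampling idea behind the quality testing example above can be sketched as follows: draw a simple random sample from a batch and estimate the defect rate, with an approximate 95% margin of error. The `estimate_defect_rate` function and the batch figures are hypothetical, and the margin uses the normal approximation without a finite-population correction.

```python
import math
import random


def estimate_defect_rate(batch, sample_size, seed=0):
    """Estimate a batch's defect rate from a simple random sample.

    `batch` is a list of booleans (True = defective). Returns the estimated
    rate and an approximate 95% margin of error (normal approximation).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(batch, sample_size)
    rate = sum(sample) / sample_size
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, margin


# Hypothetical batch: 10,000 items, of which 300 (3%) are defective.
batch = [True] * 300 + [False] * 9700
rate, margin = estimate_defect_rate(batch, sample_size=500)

print(f"estimated defect rate: {rate:.3f} +/- {margin:.3f}")
```

Testing 500 items instead of 10,000 trades a small, quantifiable amount of uncertainty for a large saving in effort, which is the core of the sampling approach.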
Colleges and universities: In an academic setting, the role of a statistician is to study abstract concepts, research, and explore new theories with the goal of expanding knowledge of the field. Opportunities for these positions exist at both the graduate and undergraduate level.
In addition to academic research, statisticians employed by a college or university are essential to the continued growth of existing schools and programs. In this role they are likely to look at changes in student body demographics over time, changes in interest in courses and programs, and year-over-year enrollment and retention.
Career Outlook, Opportunities and Growth: Data Science
Nearly a decade ago Harvard Business Review referred to the data scientist as the “sexiest job of the 21st century.” Fast forward and careers in the field of data science now represent one of the fastest growing and most profitable career paths. This is due in part to the fact that the amount of data we as a society generate is growing exponentially, at an estimated current rate of 2.5 quintillion bytes every day, and companies globally are looking to capitalize on that wealth of information. In fact, worldwide revenues for the artificial intelligence (AI) market are expected to grow 16.4% year over year in 2021 to $327.5 billion, and the market is expected to break the $500 billion mark by 2024.
Employment in the field of data science is expected to grow 30.9 percent between 2019 and 2029 and is listed among BLS’s 30 fastest growing occupations. According to a 2019 survey conducted by KPMG, investing in data and analytics ranked as a top priority across all industries, geographies, and company sizes. A study by the Business-Higher Education Forum and PricewaterhouseCoopers projected the number of new job postings looking for experience in data science and analytics in 2020 to reach 2.72 million. Despite this incredible demand, there is a notable shortage of qualified data scientists. In 2020 alone, there was a shortage of 250,000 data science professionals.
This shortage is twofold. First and foremost, demand has simply exceeded the number of candidates seeking roles in the field. The second cause touches on one key word: qualification. Part economist, part physicist, part mathematician, data scientists can be described as number crunchers with a background in engineering. This unique amalgamation of skills can be incredibly difficult to find. In addition, technical skills will only get a candidate so far in the application process, as data science roles often also require industry knowledge to be able to guide business decisions.
As companies of all sizes continue to place further emphasis on digital growth, integrating digital technologies into all aspects of a business requires a range of new roles based in data science, including data architect, applications architect, enterprise architect, infrastructure architect, business intelligence engineer, data engineer, database administrator, and machine learning specialists (including NLP engineer and computer vision specialist).
According to Springboard, the top fields for data science professionals include finance, tech, healthcare, and general professional services.
Tech: It’s unsurprising that the tech industry uses data to drive product development. Data is used to track and forecast trends in user behavior, helping companies understand the wants and needs of their consumers. From social media to e-commerce, nearly all online entities – from Facebook and Google to Netflix and Amazon – rely heavily on machine learning and artificial intelligence to improve user experience, constantly updating algorithms to better serve their users and ultimately collect even more data.
One of the clearest examples of the difference between the roles of statistician and data scientist lies within the tech industry and is broken down in a 2016 article on The Signal by MixPanel. The piece highlights the intuitive algorithms that support apps like Facebook and Instagram. These unique and confounding algorithms are constantly evolving with consumer behavior, tracking every click, like, and post to determine what a user sees when they open the app.
Companies like Facebook and other app developers need statisticians to collect and analyze the immense amount of historical user data available, as it can be used both to understand current user behavior and to help drive ad sales and placements. However, when it comes to producing the algorithm that can influence future behavior, or in Facebook’s case determine what a user will see in their newsfeed each time they open the app, this is where a data scientist comes into play.
Finance: Banks, investment firms, insurance agencies, and real estate specialists all rely on data science to determine risk, prevent loss, and predict market activity. Data scientists are also tasked with creating algorithms to identify fraud, identity theft, and scams by focusing on finding anomalies in customer behavior.
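One simple way to look for anomalies in customer behavior is to flag transactions that sit far from a customer's typical amount. The sketch below uses a modified z-score based on the median and the median absolute deviation (MAD). It is an illustrative technique chosen for this example, not a description of how any particular institution detects fraud, and the `flag_anomalies` function, the 3.5 threshold, and the amounts are assumptions.

```python
from statistics import median


def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold.

    Uses the median and median absolute deviation (MAD), which are less
    distorted by extreme values than the mean and standard deviation.
    """
    center = median(amounts)
    mad = median(abs(x - center) for x in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [x for x in amounts if 0.6745 * abs(x - center) / mad > threshold]


# Hypothetical card transactions for one customer, in dollars.
transactions = [42.0, 38.5, 51.2, 45.0, 39.9, 47.3, 980.0, 44.1]
suspicious = flag_anomalies(transactions)

print(suspicious)
```

Production systems generally weigh many more signals (location, merchant, timing), but the underlying idea of measuring distance from normal behavior is the same.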
According to a report in the Financial Times, big data revealed new ways for insurers to analyze and develop insurance policies. For example, in 2015 Liberty Mutual Insurance and American Family Insurance both partnered with Nest Labs to offer Nest Protect, a smoke alarm and carbon monoxide monitor, complimentary to their customers. Customers in turn also received a discount on their insurance premiums for having the devices installed in their homes, and the insurance companies received data from those devices. Delineating the mutual benefits of this data exchange, the article describes, “From weather patterns to social media, new sources of data could help [insurers] streamline costs, be more targeted with the risks they want to underwrite, identify new customers, predict fraud, or identify which claims have the potential to become very expensive.”
Professional Services Industry: This term encompasses all industries in which data can be used to optimize daily operations and facilitate growth. From retail to agriculture, companies are investing in data analytics to improve productivity and increase sales.
In the agricultural industry, for example, the use of sensor technology is expected to generate an average of 4.1 million data points per day by 2050, compared to 190,000 per day in 2019. This technology supplies data on crops, soil, weather, temperature, and moisture conditions and can be used to support healthier livestock, detect disease, and manage harvesting systems, all of which will aid farmers in producing more products and more food in the most efficient way.
For retailers, customer satisfaction is key, and big data plays an important role in identifying what consumers want so that they become repeat customers. Designing a bespoke shopping experience, whether in store or online, is essential to producing the best customer experience.
Healthcare: A broad field with numerous niche areas, data science in healthcare could be applied to diagnostics, therapeutics, pharmaceuticals, medical technology, and more.
Since early 2020, the world has followed along as data scientists tracked and predicted the spread of COVID-19. Data was used to calculate anticipated fatality rates, predict spikes in infection, and target new hot spots and vulnerable regions. In addition, data science is of great importance in the development of vaccines. This includes tracking patient trials and side effects and organizing the supply chain for a global vaccination rollout.
Hospitals also utilize data to help improve patient care. The results can shorten wait times, increase emergency room efficiency, or decrease overcrowding. Throughout the COVID-19 pandemic, hospitals have used patient data to track admissions, discharges, fatalities, recoveries, and bed and supply shortages.
Another example would be wearable trackers such as heart monitors. These devices constantly monitor a patient’s activity, sending data back to physicians in real-time. This information is essential for doctors to provide optimal, personalized care for each patient.
Whether it’s to improve diagnostic accuracy, find cures for diseases, provide better patient care or help prevent the spread of viruses, data science unquestionably plays a large role in the healthcare industry.
Conclusion: Key Differences in the Fields of Data Science and Statistics
Both data science and statistics support decision making, but in different ways. Data science uses scientific methods to discover and understand patterns, performance, and trends, often comparing numerous models to produce the best outcome. Meanwhile, statistics focuses on mathematical formulas and concepts to provide data analysis. Statistical analysis typically begins with a simple model (often linear regression). Data is then checked against that model to assess its accuracy, and the model is refined to best fit the data. Statistics also deals with quantifying uncertainty, or trying to determine how likely an outcome is when there are unknown factors. This step often receives less emphasis in data science.
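The statistical workflow described above, fitting a simple linear model and then checking the data against it, can be sketched with ordinary least squares. The `fit_line` function and the ad spend and sales figures are hypothetical. The R-squared value and the residual standard error serve here as two basic measures of fit and uncertainty.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared, residual_standard_error).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Check the data against the model: how far is each point from the line?
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    residual_se = math.sqrt(ss_res / (n - 2))  # typical size of a miss
    return slope, intercept, r_squared, residual_se


# Hypothetical data: monthly ad spend and sales, both in thousands of dollars.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 21, 26]
slope, intercept, r_squared, residual_se = fit_line(ad_spend, sales)

print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
print(f"R^2 = {r_squared:.3f}, residual standard error = {residual_se:.3f}")
```

A data science workflow might instead fit several candidate models and keep whichever predicts best on held-out data, which is the contrast the paragraph above draws.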
Data scientists use statistics, among other disciplines, to tell a story with the data, explaining their insights in a way that is easily understood without sacrificing the integrity of the data. Strong communication skills are vital because the role of a data scientist is to be a translator or ambassador between the data and a company or client, conveying the meaning of the data and what actionable insights make the data important to the company with the ultimate goal of guiding corporate action.
The ability to turn big data into the big picture for a company is a key factor in distinguishing a data scientist from a statistician. Following this line of thinking, it is likely safe to conclude that all data scientists are statisticians, but not all statisticians are data scientists.
According to Matt Przybyla, Sr. Data Scientist and Top Writer in Technology and Education at Towards Data Science, “If you want to focus on significance, testing, experimental design, normality distribution, and diagnostic plotting, then become a Statistician. If you want to practice more software-engineering like coding and automation of machine learning models, then become a Data Scientist.”
Both disciplines will continue to exist within the job market for the foreseeable future and will likely have a big overlap in skills. If a company can successfully merge these two fields of expertise, however, it can create a powerful corporate tool.