DiscoverDataScience.org

What is the Difference Between Data Mining and Machine Learning?

Data mining is the exploration of existing datasets to identify patterns and anomalies. Machine learning is the process by which machines (that is, computers) learn from varied data in a way loosely modeled on human learning. Together, the two make it possible both to characterize past data and to predict future data.

There are Many Lenses of Data Mining

The purpose of data mining is to identify patterns in data, and patterns can be identified in many different ways depending on what information is needed.

1) Data mining is used to classify data.
Classifying data is something we do every day, such as when we sort laundry into shirts, pants, socks, and so on. With big data, sorting becomes far more complicated. For example, a credit check assesses a person’s financial history. After integrating data on existing debt, income, and late payment history, loan applicants are classified as either “eligible” or “ineligible”.
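
As a rough illustration of the loan example, the sketch below classifies applicants with a hand-written rule. The field names and thresholds (`existing_debt`, `late_payments`, a 0.4 debt-to-income ratio, two late payments) are hypothetical values chosen for illustration, not real lending criteria. In practice, a data mining tool would derive such rules from historical data.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    name: str
    annual_income: float   # yearly income in dollars
    existing_debt: float   # total outstanding debt in dollars
    late_payments: int     # number of late payments on record


def classify(applicant: Applicant,
             max_debt_ratio: float = 0.4,
             max_late_payments: int = 2) -> str:
    """Return "eligible" or "ineligible" using two illustrative thresholds."""
    if applicant.annual_income <= 0:
        return "ineligible"
    debt_ratio = applicant.existing_debt / applicant.annual_income
    if debt_ratio <= max_debt_ratio and applicant.late_payments <= max_late_payments:
        return "eligible"
    return "ineligible"


# Made-up applicants for demonstration only.
applicants = [
    Applicant("A", annual_income=60_000, existing_debt=12_000, late_payments=0),
    Applicant("B", annual_income=40_000, existing_debt=30_000, late_payments=1),
    Applicant("C", annual_income=80_000, existing_debt=10_000, late_payments=5),
]
labels = {a.name: classify(a) for a in applicants}
print(labels)
```

Applicant B fails on the debt ratio and applicant C on late payments, so only applicant A is labeled “eligible”.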

2) Data mining is used to identify associations in data.
For example, consider a grocery store that sets up an online shopping system with a virtual shopping cart. Once data has been collected from thousands of customers, it would probably show that people who buy hot dogs often buy buns and ketchup as well, or that people who add pasta noodles to their carts often buy pasta sauce. Sometimes associations are completely beyond what anyone would anticipate, such as the often-cited Pop-Tarts story, in which a retailer reportedly found that Pop-Tarts sales rose ahead of hurricanes.
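
The shopping cart association above can be sketched by counting how often items appear together. The carts below are invented for illustration, and "confidence" is one common association measure (the fraction of carts containing one item that also contain another); this is a minimal sketch, not a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts, one set of items per customer order.
carts = [
    {"hot dogs", "buns", "ketchup"},
    {"hot dogs", "buns"},
    {"pasta", "pasta sauce"},
    {"pasta", "pasta sauce", "ketchup"},
    {"hot dogs", "buns", "pasta"},
]

item_counts = Counter()
pair_counts = Counter()
for cart in carts:
    item_counts.update(cart)
    # sorted() gives every pair a single canonical order.
    pair_counts.update(combinations(sorted(cart), 2))


def confidence(antecedent: str, consequent: str) -> float:
    """Fraction of carts containing `antecedent` that also contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(confidence("hot dogs", "buns"))
print(confidence("pasta", "pasta sauce"))
```

In this toy data every cart with hot dogs also has buns, while two of the three carts with pasta also have pasta sauce.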

As another example, consider an application that collects cell phone GPS location data from its users. Using data mining, analysts can deduce that a few people, call them Rachel, Ross, Joey, Chandler, and Monica, gather every day at about the same time at a coffee shop called Central Perk (those of you who watched “Friends” know what this is about). From that, they can infer that Rachel, Ross, Joey, Chandler, and Monica are friends.

3) Data mining is used to identify outliers and anomalies.
Identifying unusual data can be very useful. One example is a fraud detection system run by a credit card company. If high-ticket purchases are suddenly made from an individual’s account, and those purchases are made outside the cardholder’s home state, security programs will isolate the incident and raise an alert that something unusual is happening and warrants a response, such as a freeze on the account and a phone call to the customer. Another example, returning to the Central Perk scenario above, would be the observation that Chandler and Monica stopped coming to Central Perk altogether after being regular customers for many years. A broken trend suggests that something has changed, and in this case it had: Chandler and Monica got married and moved to the suburbs.
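
A minimal sketch of the fraud example follows, assuming a simple z-score rule: a purchase is flagged when it is far above the account's usual spending and also made outside the home state. The purchase history, the home state `"OH"`, and the threshold of 3 standard deviations are all hypothetical; real fraud systems use far richer signals.

```python
from statistics import mean, stdev

home_state = "OH"  # hypothetical home state of the cardholder
history = [42.0, 18.5, 63.0, 25.0, 37.5, 51.0, 29.0, 44.0]  # past purchase amounts

mu = mean(history)
sigma = stdev(history)


def is_suspicious(amount: float, state: str, z_threshold: float = 3.0) -> bool:
    """Flag a purchase that is both unusually large and made out of state."""
    z_score = (amount - mu) / sigma
    return z_score > z_threshold and state != home_state


# New purchases as (amount, state) pairs.
new_purchases = [(55.0, "OH"), (48.0, "NY"), (1899.0, "NV")]
flagged = [p for p in new_purchases if is_suspicious(*p)]
print(flagged)
```

Only the large out-of-state purchase is flagged; the ordinary out-of-state purchase is not, because its amount is in line with the account history.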

4) Data mining is used to group data.
Cluster analysis groups items together based on shared properties. For example, if biologists are given the DNA sequences of 1,000 different species, algorithms that compare the sequences might cluster the species into five general groups, which on closer investigation turn out to be mammals, reptiles, amphibians, birds, and fish.
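
To make the idea of grouping by sequence similarity concrete, here is a small sketch that links sequences whose Hamming distance (number of differing positions) is at most a chosen threshold. The sequences, species names, and threshold are invented, and this simple linkage rule is only one of many clustering approaches; real sequence comparison would use alignment-based methods.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster(sequences: dict[str, str], max_distance: int) -> list[set[str]]:
    """Group names whose sequences are linked by chains of close matches."""
    clusters: list[set[str]] = []
    for name, seq in sequences.items():
        # Find every existing cluster that contains a close neighbour.
        close = [c for c in clusters
                 if any(hamming(seq, sequences[other]) <= max_distance for other in c)]
        # Merge those clusters together with the new name.
        merged = {name}.union(*close)
        clusters = [c for c in clusters if c not in close] + [merged]
    return clusters


# Toy sequences, not real genomic data.
sequences = {
    "species_1": "ACGTACGT",
    "species_2": "ACGTACGA",
    "species_3": "TTGACCAA",
    "species_4": "TTGACCAT",
    "species_5": "GGGGCCCC",
}
groups = cluster(sequences, max_distance=2)
print(groups)
```

Note that the algorithm only produces unnamed groups. Deciding that a group corresponds to, say, mammals is still left to the analyst.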

5) Data mining is used to perform regression analysis and generate prediction models.
Regression analysis examines the relationship between quantitative variables. Estimating residential real estate values is a good example. Residential prices are influenced by many factors, including square footage, number of bedrooms and bathrooms, city population, and distance to schools. If data from hundreds of recently sold properties is collected and analyzed, data mining can estimate how much each factor contributes to the purchase price. Using that information, real estate investors can then predict values and trends. Both real estate investors and insurance companies rely heavily on such predictive models.
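
The sketch below fits a straight line to a handful of made-up sales using ordinary least squares, with square footage as the only factor to keep it short. The sale prices are hypothetical; a realistic model would include the other factors mentioned above and far more data.

```python
from statistics import mean

# Hypothetical recent sales: (square footage, sale price in dollars).
sales = [(1000, 200_000), (1500, 275_000), (2000, 350_000), (2500, 425_000)]

xs = [x for x, _ in sales]
ys = [y for _, y in sales]
x_bar, y_bar = mean(xs), mean(ys)

# Ordinary least squares for one variable:
# slope = covariance(x, y) / variance(x), intercept = y_bar - slope * x_bar
slope = (sum((x - x_bar) * (y - y_bar) for x, y in sales)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar


def predict(square_feet: float) -> float:
    """Predicted sale price for a home of the given size."""
    return intercept + slope * square_feet


print(slope, intercept)
print(predict(1800))
```

Here the slope is the estimated contribution of each additional square foot to the price, which is the kind of per-factor contribution the paragraph above describes.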

No matter the type of data mining, all data mining strategies have the ultimate goal of extracting patterns from data.

Data scientists are not interested only in characterizing existing data, although that is a large part of their job. They are equally interested in predicting future data and accurately characterizing unknown data. Machine learning takes the output of data mining and uses it to build tools that can be applied to novel data.

The Machine Learning Toolbox: Advanced Algorithms

The main purpose of machine learning is to produce algorithms that can “learn” from data. An algorithm is a step-by-step procedure that solves a problem in a finite number of steps. In a machine learning algorithm, each piece of data run through the pipeline influences the outcome. For example, if one spam message is run through the algorithm, the machine learns what one spam message looks like. If thousands of spam messages are run through it, the machine can identify what they have in common and better define what spam looks like. The goal is an algorithm that can operate independently and be applied to novel data. In this example, that means an algorithm that can accurately classify an email as “spam” or “legitimate”.
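
One way to see how each message influences the outcome is a tiny naive Bayes classifier, sketched below. Naive Bayes is a swapped-in choice for illustration, since the article does not name a specific algorithm, and the four training messages are invented. Every training message updates the word counts, and those counts decide how new messages are labeled.

```python
import math
from collections import Counter

# Hypothetical labelled training messages.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting moved to monday", "legitimate"),
    ("lunch on monday with the team", "legitimate"),
]

word_counts = {"spam": Counter(), "legitimate": Counter()}
message_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())  # each message changes the counts
    message_counts[label] += 1

vocabulary = set(word_counts["spam"]) | set(word_counts["legitimate"])


def log_score(text: str, label: str) -> float:
    """Log of P(label) times the product of P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(message_counts[label] / sum(message_counts.values()))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1)
                          / (total_words + len(vocabulary)))
    return score


def classify(text: str) -> str:
    """Pick the label with the higher score for this message."""
    return max(("spam", "legitimate"), key=lambda label: log_score(text, label))


print(classify("claim your free prize"))
print(classify("team meeting on monday"))
```

With only four training messages the classifier is fragile; adding more labeled messages refines the word counts, which is the sense in which the machine "learns" from more data.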

In supervised learning, data that has already been accurately labeled is divided into “training” and “test” sets. The training set is often about 80% of the data, and the test set is the remainder. In our example, the emails have been classified as “spam” or “legitimate” by human experts. The machine learning algorithm is developed using the training set. Once the whole training set has been run through the pipeline and the algorithm has been optimized, it is evaluated on the test set to determine its accuracy, measured as the fraction of test set items it classifies correctly. Ideally, an algorithm would classify data correctly 100% of the time, but because there are always outliers, that is not realistic. As a rough rule of thumb, a classification accuracy above 90% is often considered acceptable, although the appropriate threshold depends on the application.
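
The split-then-evaluate workflow can be sketched as follows. The helper names (`train_test_split`, `accuracy`, `make_classifier`), the single "number of links" feature, and the labeling rule used to generate the data are all assumptions made for this example. The "training" step here simply picks the link-count threshold that scores best on the training set.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(classifier, labelled_records):
    """Fraction of records whose predicted label matches the expert label."""
    correct = sum(classifier(features) == label
                  for features, label in labelled_records)
    return correct / len(labelled_records)


def make_classifier(threshold):
    """Classifier that calls an email spam when it has at least `threshold` links."""
    return lambda links: "spam" if links >= threshold else "legitimate"


# Hypothetical expert-labelled emails: (number of links, label).
emails = [(links, "spam" if links >= 4 else "legitimate")
          for links in range(10)] * 5

train_set, test_set = train_test_split(emails)

# Fit on the training set only, then measure accuracy on the held-out test set.
best_threshold = max(range(11),
                     key=lambda t: accuracy(make_classifier(t), train_set))
classifier = make_classifier(best_threshold)

print(len(train_set), len(test_set))
print(accuracy(classifier, test_set))
```

Keeping the test set out of the fitting step is the point of the design: accuracy measured on data the algorithm has already seen would overstate how well it handles novel data.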

In unsupervised learning, the classes are not known. The machine learning algorithm would infer patterns and properties based on input comparisons and cluster data into different groups. For the email example, after running thousands of unclassified emails through the algorithm, the algorithm might group them into three different categories. Human experts would then examine random samples from the three clusters of emails, and upon examination, may label them as “spam”, “personal”, and “retail”. Or perhaps four clusters of emails would be generated by the algorithm. In that case, human experts would analyze examples in each cluster and assign cluster labels such as “spam”, “personal”, “work”, and “retail”. Note that unsupervised learning output requires expert analysis in order to assign meaning.
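
As an illustration of clustering without labels, here is a minimal k-means sketch. K-means is one common clustering algorithm, chosen here as an assumption since the article names none. Each email is reduced to two hypothetical features, and the starting centroids are passed in explicitly to keep the example deterministic; real implementations typically choose them randomly.

```python
from math import dist  # available in Python 3.8+


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical unlabelled emails as (number of links, number of exclamation marks).
emails = [(0, 0), (1, 0), (0, 1),
          (8, 9), (9, 8), (9, 9),
          (0, 9), (1, 8), (0, 8)]

clusters, centroids = k_means(emails, centroids=[emails[0], emails[3], emails[6]])
for group in clusters:
    print(group)
```

The output is three unnamed groups of points. As the paragraph above notes, a person still has to inspect samples from each group to decide which one is “spam”, “personal”, or “retail”.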

Data Scientists are Master Programmers

The job of data scientists is to examine data and make predictions, and they cannot do that job without both data mining and machine learning. They must perform data mining to characterize data, and they must integrate machine learning algorithms to make predictions. Both processes require a great deal of programming, so data scientists should be fluent in programming languages such as R, Python, or MATLAB. Data scientists must also be able to write and modify these complicated algorithms.

Our site does not feature every educational option available on the market. We encourage you to perform your own independent research before making any education decisions. Many listings are from partners who compensate us, which may influence which programs we write about. Learn more about us

© Copyright 2022 | https://www.discoverdatascience.org | All Rights Reserved
