DiscoverDataScience.org

What is the Difference Between Data Mining and Machine Learning?

Data mining is the exploration of existing datasets to identify patterns and anomalies. Machine learning is the process by which machines (that is, computers) learn from data, improving at a task through exposure to examples in a way loosely analogous to human learning. Together, the two make it possible both to characterize past data and to predict future data.

There are Many Lenses of Data Mining

The purpose of data mining is to identify patterns in data, and patterns can be identified in many different ways depending on what information is needed.

1) Data mining is used to classify data.
Classifying data is something we do every day, as when we sort laundry into shirts, pants, and socks. With large datasets, sorting becomes far more complicated. For example, a credit check assesses a person’s financial history: after data on existing debt, income, and late payments is integrated, loan applicants are classified as either “eligible” or “ineligible”.
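
As a rough illustration of the loan example, the sketch below classifies applicants with two hand-written rules. The `Applicant` fields, the threshold values, and the `classify` function are all hypothetical and chosen only for illustration; a real lender would derive its criteria from historical data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    annual_income: float
    existing_debt: float
    late_payments: int


# Hypothetical thresholds, for illustration only.
MAX_DEBT_TO_INCOME = 0.4
MAX_LATE_PAYMENTS = 2


def classify(applicant: Applicant) -> str:
    """Sort an applicant into one of two classes using simple rules."""
    if applicant.annual_income <= 0:
        return "ineligible"
    debt_to_income = applicant.existing_debt / applicant.annual_income
    if (debt_to_income <= MAX_DEBT_TO_INCOME
            and applicant.late_payments <= MAX_LATE_PAYMENTS):
        return "eligible"
    return "ineligible"


# Low debt relative to income and no late payments.
print(classify(Applicant(annual_income=50_000, existing_debt=10_000, late_payments=0)))
# Debt exceeds the assumed debt-to-income limit.
print(classify(Applicant(annual_income=50_000, existing_debt=30_000, late_payments=0)))
```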

2) Data mining is used to identify associations in data.
For example, consider a grocery store that sets up an online shopping system with a virtual shopping cart. Once data has been collected from thousands of customers, it would probably show that people who buy hot dogs often buy buns and ketchup as well, or that people who add pasta noodles to their carts often buy pasta sauce. Sometimes associations are completely beyond what anyone would anticipate, as in the often-told Pop-Tart story.
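
The grocery-cart idea can be sketched by counting how often pairs of items appear in the same cart. The cart contents below are made up, and `pair_counts` and `confidence` are hypothetical helper names; this is a minimal co-occurrence count, not a full association-rule miner.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping carts, one set of items per customer order.
carts = [
    {"hot dogs", "buns", "ketchup"},
    {"hot dogs", "buns"},
    {"pasta", "pasta sauce"},
    {"pasta", "pasta sauce", "buns"},
    {"hot dogs", "ketchup"},
]


def pair_counts(carts):
    """Count how many carts contain each pair of items."""
    counts = Counter()
    for cart in carts:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(cart), 2):
            counts[pair] += 1
    return counts


def confidence(carts, antecedent, consequent):
    """Fraction of carts containing `antecedent` that also contain `consequent`."""
    with_antecedent = [cart for cart in carts if antecedent in cart]
    if not with_antecedent:
        return 0.0
    return sum(consequent in cart for cart in with_antecedent) / len(with_antecedent)


print(pair_counts(carts).most_common(3))
print(confidence(carts, "pasta", "pasta sauce"))
```

Pairs with high counts and high confidence are candidates for associations worth a closer look.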

As another example, consider an application that collects cell phone GPS location data from its users.
Using data mining, analysts can deduce that a few people, call them Rachel, Ross, Joey, Chandler, and Monica, gather every day at about the same time at a coffee shop called Central Perk (those of you who watched “Friends” know what this is about). From that, they can infer that Rachel, Ross, Joey, Chandler, and Monica are friends.

3) Data mining is used to identify outliers and anomalies.
Identifying unusual data can be very useful. An example would be a fraud detection system run by a credit card company. If high-ticket purchases are suddenly made from an individual’s account, and those purchases are made outside his or her home state, security programs will isolate the incident and ring virtual alarm bells to indicate that something unusual is happening that warrants a response, such as a freeze on the account and a phone call to the customer. Another example, returning to the Central Perk scenario above, would be if Chandler and Monica were observed to stop coming to Central Perk altogether after being faithful regulars for many years. A broken trend suggests that something has changed, which is in fact the case: Chandler and Monica got married and moved to the suburbs.
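
One simple way to sketch the fraud check is to compare a new purchase against the account's own history using a z-score (how many standard deviations the amount lies above the historical mean) and to combine that with the home-state test. The purchase amounts, state codes, the cutoff of 3, and the `is_suspicious` function are all assumptions made for illustration; production systems use far richer signals.

```python
from statistics import mean, stdev


def is_suspicious(amount, state, past_amounts, home_state, z_threshold=3.0):
    """Flag a purchase that is unusually large AND made outside the home state.

    `past_amounts` must contain at least two values so that a standard
    deviation can be computed.
    """
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        # Every past purchase was identical, so any larger amount is unusual.
        unusual_amount = amount > mu
    else:
        unusual_amount = (amount - mu) / sigma > z_threshold
    return unusual_amount and state != home_state


# Made-up purchase history for one cardholder whose home state is "OH".
history = [20, 35, 25, 30, 40, 22, 28]

print(is_suspicious(2500, "NV", history, home_state="OH"))  # large and out of state
print(is_suspicious(45, "NV", history, home_state="OH"))    # out of state, ordinary amount
```

The baseline is computed from past purchases only, so a single extreme value cannot inflate the standard deviation it is being measured against.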

4) Data mining is used to group data.
Cluster analysis groups items together based on shared properties. For example, if biologists are given the DNA sequences of 1,000 different species, algorithms that compare the sequences might cluster the species into five general groups that, upon investigation, are identified as mammals, reptiles, amphibians, birds, and fish.
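
A toy version of sequence-based grouping is sketched below. It assumes short, equal-length, already-aligned sequences, measures dissimilarity with the Hamming distance (the number of positions that differ), and places two sequences in the same group whenever a chain of close neighbors connects them. The sequences, species names, and distance cutoff are invented; real genomic comparison relies on alignment and much more careful distance measures.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster(sequences, max_distance):
    """Group names whose sequences are linked by distances <= max_distance."""
    clusters = []
    for name, seq in sequences.items():
        merged = {name}
        remaining = []
        for group in clusters:
            # Merge every existing group that has a member close to this sequence.
            if any(hamming(seq, sequences[m]) <= max_distance for m in group):
                merged |= group
            else:
                remaining.append(group)
        clusters = remaining + [merged]
    return clusters


# Invented toy sequences, for illustration only.
toy_sequences = {
    "species_a": "AAAAAAAA",
    "species_b": "AAAAAAAT",
    "species_c": "CCCCCCCC",
    "species_d": "CCCCCCGC",
    "species_e": "GGTTGGTT",
}

for group in cluster(toy_sequences, max_distance=2):
    print(sorted(group))
```

Note that the algorithm is never told what the groups mean; naming them is left to the analyst.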

5) Data mining is used to perform regression analysis and generate prediction models.
Regression analysis examines the relationship between quantitative variables. Estimating residential real estate values is a good example. Home prices are influenced by many different factors, including square footage, number of beds and baths, city population, and distance to schools. If data from hundreds of recently sold properties is collected and analyzed, data mining can estimate how much each factor contributes to the purchase price. Using that information, real estate investors can then predict values and trends. Both real estate investors and insurance companies rely heavily on such predictive models.
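
To make the idea concrete, here is a minimal least-squares sketch that uses a single factor, square footage, instead of the many factors listed above. The sale prices are made up (and deliberately lie on a straight line), and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted value of y for a new x."""
    return slope * x + intercept


# Made-up recent sales: square footage and sale price in dollars.
square_feet = [1000, 1500, 2000, 2500]
sale_price = [200_000, 275_000, 350_000, 425_000]

slope, intercept = fit_line(square_feet, sale_price)
print(f"each extra square foot adds about ${slope:,.0f}")
print(f"predicted price for 1,800 sq ft: ${predict(slope, intercept, 1800):,.0f}")
```

The slope is the estimated contribution of one factor to the price; with several factors, multiple regression estimates one such coefficient per factor.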

No matter the type of data mining, all data mining strategies have the ultimate goal of extracting patterns from data.

Data scientists are not merely interested in characterizing existing data, although that is a huge part of their job. They are equally interested in predicting future data and accurately characterizing unknown data. Machine learning is one way the output of data mining is turned into tools that can be applied to new data.

The Machine Learning Toolbox: Advanced Algorithms

The main purpose of machine learning is to generate algorithms that can “learn” from data. An algorithm is a sequential process that solves a problem in a finite number of steps. In a machine learning algorithm, each piece of data that is run through the pipeline influences the outcome. For example, if one spam message is run through the algorithm, the machine learns what one spam message looks like. If thousands of spam messages are run through it, the machine can identify what they have in common and better define what spam looks like. The goal of machine learning is to develop an algorithm that can operate independently and be applied to new data. In this example, it would be an algorithm that can accurately classify an email as “spam” or “legitimate”.

In supervised learning, accurately labeled data is divided into “training” and “test” sets. The training set is typically about 80% of the data, and the test set is the remainder. In our example, we have emails that human experts have classified as “spam” or “legitimate”. The machine learning algorithm is developed using the training set, a portion of the emails that have already been labeled. Once the whole training set has been run through the pipeline and the algorithm has been optimized, it is evaluated on the test set to determine its accuracy. Accuracy is the fraction of test set items the algorithm characterizes correctly. Ideally, an algorithm would classify data correctly 100% of the time, but because there are always outliers, that is not realistic. What counts as acceptable depends on the problem: accuracy above 90% is often treated as a reasonable target, but when one class is rare, as fraud usually is, a high accuracy figure can still hide poor performance on the cases that matter.
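
The train-and-test workflow described above can be sketched with a tiny word-count classifier (a naive Bayes model with add-one smoothing, swapped in here purely as an example technique). The ten labeled emails are invented, and `train`, `predict`, and `accuracy` are hypothetical helper names. A dataset this small says nothing about real-world performance; it only shows the mechanics of an 80/20 split.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(model, text):
    """Return the label with the higher naive Bayes log-score."""
    word_counts, label_counts = model
    vocabulary = set(word_counts["spam"]) | set(word_counts["legitimate"])
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Assumes every label appears at least once in the training set.
        score = math.log(label_counts[label] / total)
        n_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (n_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(model, t) == y for t, y in examples) / len(examples)


# Invented emails, already labeled by a (hypothetical) human expert.
labeled = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("claim your free prize", "spam"),
    ("cheap pills free shipping", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch with the team monday", "legitimate"),
    ("project report attached", "legitimate"),
    ("agenda for the project review", "legitimate"),
    ("free prize click now", "spam"),
    ("team meeting monday", "legitimate"),
]

split_at = int(len(labeled) * 0.8)  # 80% training, 20% test
training_set, test_set = labeled[:split_at], labeled[split_at:]

model = train(training_set)
print("test accuracy:", accuracy(model, test_set))
```

In practice the data would be shuffled before splitting so that the test set is a representative sample; the fixed order here just keeps the example reproducible.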

In unsupervised learning, the classes are not known. The machine learning algorithm would infer patterns and properties based on input comparisons and cluster data into different groups. For the email example, after running thousands of unclassified emails through the algorithm, the algorithm might group them into three different categories. Human experts would then examine random samples from the three clusters of emails, and upon examination, may label them as “spam”, “personal”, and “retail”. Or perhaps four clusters of emails would be generated by the algorithm. In that case, human experts would analyze examples in each cluster and assign cluster labels such as “spam”, “personal”, “work”, and “retail”. Note that unsupervised learning output requires expert analysis in order to assign meaning.
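
As a sketch of the unsupervised case, the code below runs a bare-bones k-means clustering on unlabeled emails. Each email is reduced to two invented numeric features (number of links, number of exclamation marks), the number of clusters is fixed at three, and the starting centroids are simply the first three points so that the run is repeatable. All of these are assumptions for illustration; `kmeans` and `nearest` are hypothetical helper names.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=10):
    """Return a cluster index for each point."""
    centroids = list(points[:k])  # deterministic start: the first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(group) for dim in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return [nearest(point, centroids) for point in points]


# Invented features per email: (number of links, number of exclamation marks).
emails = [(9, 8), (0, 1), (4, 0), (10, 9), (1, 0), (5, 1), (8, 10), (0, 0), (4, 1)]

labels = kmeans(emails, k=3)
for cluster_id in range(3):
    members = [e for e, label in zip(emails, labels) if label == cluster_id]
    print(cluster_id, members)
```

The output is only a set of numbered groups. Deciding that one group is “spam” and another is “personal” still takes a person looking at samples from each cluster.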

Data Scientists are Master Programmers

The job of data scientists is to examine data to make predictions, and they cannot do that job without both data mining and machine learning. They must perform data mining to characterize data, and they must integrate machine learning algorithms in order to make predictions. Both processes require a great deal of programming, so data scientists should be fluent in programming languages such as R, Python, or MATLAB. Data scientists also must be able to write and modify these complicated algorithms.

© Copyright 2025 | https://www.discoverdatascience.org | All Rights Reserved
