Data and Its Role in Machine Learning

Data is the foundation of every Machine Learning system. Without data, no model can learn anything. An algorithm is just a set of instructions — data gives it something to work with. The quality and quantity of data directly determine how well a model performs.

What is Data in Machine Learning?

Data in Machine Learning is a structured collection of observations or records. Each record represents one real-world event, person, object, or measurement. These records get organized into rows and columns, similar to a spreadsheet.

Example Dataset — Student Exam Performance:

┌────────────┬──────────────┬───────────┬───────────┬────────┐
│ Student ID │ Study Hours  │ Sleep Hrs │ Prev Score│ Result │
├────────────┼──────────────┼───────────┼───────────┼────────┤
│ 1          │ 5            │ 7         │ 72        │ Pass   │
│ 2          │ 2            │ 5         │ 50        │ Fail   │
│ 3          │ 8            │ 8         │ 88        │ Pass   │
│ 4          │ 1            │ 4         │ 40        │ Fail   │
└────────────┴──────────────┴───────────┴───────────┴────────┘

Columns = Features (inputs) + Label (output = Result)
Rows    = Individual records (one per student)
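The table above can be sketched as plain Python records. The identifiers below (study_hours, sleep_hours, prev_score, result) are hypothetical names chosen for illustration:

```python
# The student dataset as a list of records (one dict per row).
dataset = [
    {"study_hours": 5, "sleep_hours": 7, "prev_score": 72, "result": "Pass"},
    {"study_hours": 2, "sleep_hours": 5, "prev_score": 50, "result": "Fail"},
    {"study_hours": 8, "sleep_hours": 8, "prev_score": 88, "result": "Pass"},
    {"study_hours": 1, "sleep_hours": 4, "prev_score": 40, "result": "Fail"},
]

# Separate each record into features (inputs) and a label (output).
features = [(r["study_hours"], r["sleep_hours"], r["prev_score"]) for r in dataset]
labels = [r["result"] for r in dataset]
```

This features/labels separation is exactly what most supervised-learning libraries expect as their two inputs.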

Key Data Terminology

Features (Input Variables)

Features are the input columns that help the model make predictions. In the student example, Study Hours, Sleep Hours, and Previous Score are all features. Choosing the right features is one of the most important parts of building a good model.

Label (Target Variable)

The label is the column the model tries to predict. In the example, "Result" (Pass or Fail) is the label. Supervised learning requires labels. Unsupervised learning does not.

Instance / Sample / Record

Each row in the dataset is one instance. It represents a single observation — one student, one transaction, one image.

Dimensionality

The number of features in a dataset is its dimensionality. A dataset with 3 features is 3-dimensional. Very high-dimensional data (hundreds of features) introduces special challenges discussed in later topics.
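As a minimal sketch, dimensionality is just the number of feature columns in each row (the values below are made up):

```python
# Two samples, each with three features -> a 3-dimensional dataset.
samples = [
    [5, 7, 72],
    [2, 5, 50],
]
n_samples = len(samples)          # number of rows (instances)
dimensionality = len(samples[0])  # number of columns (features)
```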

Types of Data

Data
 │
 ├──► Structured Data    (organized in tables with rows and columns)
 │
 ├──► Unstructured Data  (images, audio, video, text documents)
 │
 └──► Semi-structured    (JSON, XML — has some structure, but not a fixed table schema)

Structured Data

Structured data lives in tables. Every row and column has a clear meaning. It is the easiest type for classic Machine Learning algorithms to work with.

  • Sales records in a spreadsheet
  • Bank transaction logs
  • Weather measurements in a database

Unstructured Data

Unstructured data does not fit neatly into rows and columns. Deep Learning models handle this type well by converting raw inputs (like pixels or audio frequencies) into numerical representations.

  • Photos and X-rays
  • Voice recordings
  • Email text and social media posts

Data Types Within Columns

Column Data Types:
┌────────────────────┬──────────────────────────────────────────┐
│ Type               │ Examples                                 │
├────────────────────┼──────────────────────────────────────────┤
│ Numerical          │ Age=25, Price=499, Temperature=36.5      │
│   - Continuous     │ Height, Weight, Speed (any decimal)      │
│   - Discrete       │ Number of rooms, Count of orders         │
│ Categorical        │ Color=Red, Gender=Male, City=Delhi       │
│   - Nominal        │ No order (colors, city names)            │
│   - Ordinal        │ Has order (Low/Medium/High, ratings 1–5) │
│ Text               │ Product review, email body               │
│ Date/Time          │ 2024-01-15, 09:30 AM                     │
└────────────────────┴──────────────────────────────────────────┘
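These column types map directly onto data types in common tooling. As a hedged sketch using pandas (a common choice for tabular data; all column names and values below are invented):

```python
import pandas as pd

# A hypothetical table mixing the column types from the table above.
df = pd.DataFrame({
    "age": [25, 31, 47],                      # numerical, discrete
    "temperature": [36.5, 37.1, 36.8],        # numerical, continuous
    "city": ["Delhi", "Mumbai", "Delhi"],     # categorical, nominal (no order)
    "rating": pd.Categorical(
        ["Low", "High", "Medium"],
        categories=["Low", "Medium", "High"],
        ordered=True,                         # categorical, ordinal (has order)
    ),
    "signup": pd.to_datetime(["2024-01-15", "2024-02-01", "2024-03-10"]),
})
```

Declaring the ordinal column as an ordered categorical lets later steps (sorting, encoding) respect the Low < Medium < High order.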

The Role of Data Quality

A model trained on bad data produces bad predictions. This principle has a well-known phrase in computing: "Garbage in, garbage out." Data quality covers several dimensions:

Completeness

Are there missing values? A dataset where 40% of the "Age" column is empty creates problems. Models either skip those rows or need special handling to fill the gaps.
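One common form of that special handling, shown here as a sketch with pandas, is to fill the gaps with a summary statistic such as the median (the ages below are invented):

```python
import pandas as pd

# An "age" column with two missing values.
ages = pd.Series([25, None, 40, None, 30], name="age")

# Median imputation: fill each gap with the median of the observed values.
filled = ages.fillna(ages.median())
```

The median is often preferred over the mean here because it is less distorted by outliers.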

Accuracy

Does the data reflect reality? If a sensor records a temperature of 500°C in a room, that value is wrong. Inaccurate records corrupt what the model learns.

Consistency

The same information should appear the same way throughout. If some rows record "Male" and others record "M" for the same attribute, the model treats them as two different categories.
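A simple fix is to map every spelling variant to one canonical value before training. The mapping below is a hypothetical sketch:

```python
# Map inconsistent spellings to a single canonical category.
canonical = {"M": "Male", "Male": "Male", "F": "Female", "Female": "Female"}

raw = ["Male", "M", "F", "Female", "M"]
cleaned = [canonical[v] for v in raw]
```

After cleaning, the column contains exactly two categories instead of four.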

Relevance

Every feature should relate to the prediction goal. Including someone's shoe size to predict loan repayment adds noise and weakens the model.

How Much Data is Enough?

The amount of data needed depends on the problem complexity. A general rule of thumb:

Problem Complexity vs Data Need:

Simple problem (few features, clear patterns):
  Hundreds to low thousands of records are enough

Medium problem (mixed data, moderate patterns):
  Thousands to tens of thousands of records

Complex problem (images, text, speech):
  Tens of thousands to millions of records

Rule: More data → Better generalization (up to a point)

Data Split: Training, Validation, and Test Sets

A dataset is never fed entirely to the model during training. It gets divided into three parts:

Full Dataset (100%)
  │
  ├──► Training Set   (~70%) — Model learns from this
  │
  ├──► Validation Set (~15%) — Used to tune model settings
  │
  └──► Test Set       (~15%) — Final check on unseen data

Example with 1000 records:
  Training:   700 records  → Model trains here
  Validation: 150 records  → Settings get adjusted here
  Test:       150 records  → Final performance measured here
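The 70/15/15 split above can be sketched in plain Python. Shuffling first avoids any ordering bias in the original data; the fixed seed here is only to make this illustration reproducible:

```python
import random

# Stand-ins for 1000 dataset rows (indices, for simplicity).
records = list(range(1000))
random.Random(42).shuffle(records)  # shuffle before splitting

train = records[:700]         # ~70% — model learns from this
validation = records[700:850] # ~15% — used to tune model settings
test = records[850:]          # ~15% — final check on unseen data
```

Libraries such as scikit-learn provide helpers for this (e.g. train_test_split), but the idea is the same: disjoint subsets, with the test set kept aside.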

The test set must remain untouched until evaluation. If the model ever sees test data during training, the evaluation results become misleading.

Data Sources

Data can come from many places depending on the industry and problem:

  • Internal databases: customer records, transaction logs, sensor readings
  • Public datasets: government open data portals, Kaggle, UCI Machine Learning Repository
  • APIs: weather data, financial market feeds, social media platforms
  • Web scraping: collecting structured data from websites programmatically
  • Surveys and forms: manually collected responses

Data in the Machine Learning Pipeline

Real World Event
      │
      ▼
Data Collection (gather raw observations)
      │
      ▼
Data Storage (database, CSV, cloud storage)
      │
      ▼
Data Preparation (clean, transform, structure)
      │
      ▼
Model Training (algorithm learns from prepared data)
      │
      ▼
Prediction on New Real-World Events

Every stage in this pipeline depends on the data being correct and well-organized. Weak data at any stage weakens the final model.
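The stages above can be sketched as a chain of plain functions. Everything here (the hard-coded records, the threshold "model") is a deliberately trivial stand-in for illustration, not a real training procedure:

```python
def collect():
    # Data Collection: gather raw observations (hard-coded here).
    return [{"study_hours": 5, "result": "Pass"},
            {"study_hours": 1, "result": "Fail"}]

def prepare(raw):
    # Data Preparation: drop incomplete records.
    return [r for r in raw if r.get("study_hours") is not None]

def train(prepared):
    # Model Training: a trivial "model" that learns an average threshold.
    threshold = sum(r["study_hours"] for r in prepared) / len(prepared)
    return lambda hours: "Pass" if hours >= threshold else "Fail"

# Prediction on new events: chain the stages together.
model = train(prepare(collect()))
```

The point is the shape of the flow, not the model: each stage consumes the previous stage's output, so a defect anywhere upstream propagates into the predictions.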
