SQL Normalization

Normalization is the process of organizing a database into a clean, structured design that eliminates redundant (repeated) data and ensures that data is stored logically. The goal is to build a database where each piece of information is stored in only one place, and every table has a clear, single purpose.

Think of normalization like organising a wardrobe. Instead of dumping everything in one pile, clothes are separated into drawers and sections — shirts in one place, trousers in another, accessories separately. Everything has a home, nothing is duplicated, and finding things becomes easy.

Why Normalize a Database?

Reduces redundancy — The same data is not stored in multiple places.
Prevents update anomalies — Changing one piece of data does not require updating it in dozens of rows.
Prevents insertion anomalies — A record cannot be added only when other data is available.
Prevents deletion anomalies — Deleting one record does not accidentally remove other unrelated information.
Improves data integrity — Consistent, reliable data.

The Three Anomalies (Problems Normalization Solves)

The Problem Table (Unnormalised)

StudentID	StudentName	CourseName	TeacherName	City
1	Ravi Sharma	Mathematics	Prof. Singh	Delhi
1	Ravi Sharma	Physics	Prof. Gupta	Delhi
2	Priya Mehta	Mathematics	Prof. Singh	Mumbai

Update anomaly: If Prof. Singh's name changes, it must be updated in every row where Mathematics is listed.
Insertion anomaly: A new course cannot be added without also adding a student.
Deletion anomaly: If Priya's record is deleted, information about the Mathematics + Prof. Singh relationship is partially lost.

Normal Forms

Normalization is achieved in stages called Normal Forms (NF). Each level builds upon the previous one. The most commonly applied levels in practice are 1NF, 2NF, and 3NF.

First Normal Form (1NF)

A table is in First Normal Form when:

Every column contains atomic (indivisible) values — no lists or multiple values in a single cell.
Every column contains values of a single type.
Each row is unique (there is a primary key).

Violates 1NF (Multiple values in one cell)

StudentID	StudentName	Courses
1	Ravi Sharma	Maths, Physics
2	Priya Mehta	Maths

The Courses column contains multiple values in one cell — this violates 1NF.

Converted to 1NF (One value per cell)

StudentID	StudentName	Course
1	Ravi Sharma	Maths
1	Ravi Sharma	Physics
2	Priya Mehta	Maths

Each cell now has exactly one value. A composite primary key (StudentID + Course) uniquely identifies each row.

Second Normal Form (2NF)

A table is in Second Normal Form when:

It is already in 1NF.
Every non-key column is fully dependent on the entire primary key — not just part of it.

This applies only when the table has a composite primary key. If a non-key column depends on only part of the composite key (called a partial dependency), it violates 2NF.

Violates 2NF

Assume the composite primary key is (StudentID, CourseID).

StudentID	CourseID	StudentName	CourseName	Grade
1	101	Ravi Sharma	Mathematics	A
1	102	Ravi Sharma	Physics	B
2	101	Priya Mehta	Mathematics	A+

Partial dependencies:

StudentName depends only on StudentID (not on CourseID).
CourseName depends only on CourseID (not on StudentID).
Only Grade depends on the full composite key (StudentID + CourseID).

Converted to 2NF (Split into separate tables)

Students table:

StudentID (PK)	StudentName
1	Ravi Sharma
2	Priya Mehta

Courses table:

CourseID (PK)	CourseName
101	Mathematics
102	Physics

Enrollments table (only fully dependent data):

StudentID (FK)	CourseID (FK)	Grade
1	101	A
1	102	B
2	101	A+

Third Normal Form (3NF)

A table is in Third Normal Form when:

It is already in 2NF.
There are no transitive dependencies — a non-key column should not depend on another non-key column.

A transitive dependency exists when Column C depends on Column B, which depends on Column A (the primary key). Column C should be in a separate table.

Violates 3NF (Transitive Dependency)

StudentID (PK)	StudentName	ZipCode	City
1	Ravi Sharma	110001	Delhi
2	Priya Mehta	400001	Mumbai
3	Arjun Nair	110001	Delhi

Here, City depends on ZipCode (not directly on StudentID). ZipCode → City is a transitive dependency. Delhi is stored twice because ZipCode 110001 is the same — if the city name changes, it must be updated in every row with that zip code.

Converted to 3NF

Students table:

StudentID (PK)	StudentName	ZipCode (FK)
1	Ravi Sharma	110001
2	Priya Mehta	400001
3	Arjun Nair	110001

ZipCodes table:

ZipCode (PK)	City
110001	Delhi
400001	Mumbai

Now, city is stored only once per zip code. If a city name changes, only one row in the ZipCodes table needs to be updated.

Summary of Normal Forms

Normal Form	Rule	Problem It Solves
1NF	Atomic values only; each cell has one value; has a primary key	Multi-valued cells, repeating groups
2NF	In 1NF + no partial dependencies (non-key columns depend on the full key)	Redundancy caused by partial composite key dependencies
3NF	In 2NF + no transitive dependencies (non-key columns depend only on the primary key)	Redundancy caused by non-key columns depending on other non-key columns

De-normalization

In some cases — especially for read-heavy applications and reporting databases — de-normalization is intentionally applied. This means deliberately adding some redundancy back into the database to reduce the number of JOINs needed and improve read performance. De-normalization is a deliberate design trade-off, not a mistake.

Key Points to Remember

Normalization organises data to reduce redundancy and improve data integrity.
1NF — each cell holds one atomic value; primary key must exist.
2NF — eliminate partial dependencies (when a non-key column depends on only part of a composite key).
3NF — eliminate transitive dependencies (non-key columns depending on other non-key columns).
Each normal form includes all the rules of the forms before it.
For most business applications, achieving 3NF is sufficient.

Previous lessons

Back to courses