Table of Contents#
- Introduction
- What is Database Normalization?
- Why Normalize Your Database?
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Summary of the First Three Normal Forms
- When to Stop Normalizing?
- Conclusion
- References
What is Database Normalization?#
Database normalization is a systematic process of organizing data in a relational database to:
- Eliminate redundancy (duplicate data).
- Reduce data anomalies (errors during insert, update, or delete operations).
- Improve data integrity (consistency and accuracy).
Normalization was first introduced by Edgar F. Codd in 1970, and it defines a set of "normal forms" (NF)—rules that a database schema must follow to be considered "normalized." We’ll focus on the first three, as they address most common data issues.
Why Normalize Your Database?#
Unnormalized databases suffer from three critical "anomalies":
1. Update Anomalies#
If data is duplicated, updating it in one place but forgetting others leads to inconsistencies. For example, if BookTitle = "1984" has Price = $12 in one row and $15 in another, which is correct?
2. Insert Anomalies#
You may be unable to insert new data because it depends on unrelated data. For example, you can’t add a new Author to your database unless they have a book in an order.
3. Delete Anomalies#
Deleting a row may accidentally remove unrelated data. For example, deleting an order for "1984" might also delete the Author (George Orwell) from your records entirely.
Normalization eliminates these anomalies by breaking large tables into smaller, focused tables with well-defined relationships.
First Normal Form (1NF)#
1NF is the most basic level of normalization. Its goal is to ensure data is stored in a "flat" structure with atomic (indivisible) values.
Rules of 1NF#
A table is in 1NF if:
- All columns contain atomic values (no multi-valued or composite values).
- No repeating groups (no columns with lists or arrays).
- Column names are unique (no duplicate column headers).
- Row order does not matter (data is unordered; use queries to sort).
Example: Converting to 1NF#
Suppose we start with this unnormalized "Sales" table for a bookstore:
| OrderID | CustomerName | BooksPurchased | Quantity |
|---|---|---|---|
| 101 | Alice Smith | 1984, Animal Farm | 2, 1 |
| 102 | Bob Jones | To Kill a Mockingbird | 1 |
Problems with this table:#
BooksPurchasedandQuantitycontain multi-valued data (e.g., "1984, Animal Farm" is a list).- Repeating groups make it hard to query (e.g., "How many copies of 1984 were sold?").
Step 1: Break Multi-Valued Columns into Atomic Values#
Split BooksPurchased and Quantity into individual rows, one per book:
| OrderID | CustomerName | BookTitle | Quantity |
|---|---|---|---|
| 101 | Alice Smith | 1984 | 2 |
| 101 | Alice Smith | Animal Farm | 1 |
| 102 | Bob Jones | To Kill a Mockingbird | 1 |
This table now satisfies 1NF: all values are atomic, no repeating groups, and column names are unique.
Second Normal Form (2NF)#
2NF builds on 1NF and addresses partial dependencies. A table is in 2NF if:
- It is in 1NF.
- Every non-key attribute is fully functionally dependent on the entire primary key (PK), not just part of it.
What is Functional Dependency?#
A functional dependency A → B means: "If you know the value of A, you can uniquely determine the value of B." For example, BookID → BookTitle (a book’s ID determines its title).
Rules of 2NF#
- The table must have a composite primary key (a PK made of multiple columns) to have partial dependencies.
- Non-key attributes (columns not in the PK) must depend on the entire PK, not just one column in the PK.
Example: Converting to 2NF#
Let’s expand our 1NF "Sales" table to include more details. Assume the PK is (OrderID, BookID) (composite key, since one order can have multiple books):
| OrderID (PK) | BookID (PK) | CustomerName | BookTitle | Author | Price | Quantity |
|---|---|---|---|---|---|---|
| 101 | B100 | Alice Smith | 1984 | Orwell | 12 | 2 |
| 101 | B101 | Alice Smith | Animal Farm | Orwell | 10 | 1 |
| 102 | B200 | Bob Jones | To Kill a Mockingbird | Lee | 15 | 1 |
Problems with this table:#
- Partial dependencies: Non-key attributes like
BookTitle,Author, andPricedepend only onBookID(not the entire PK(OrderID, BookID)). For example,BookID = B100always maps toBookTitle = 1984, regardless ofOrderID. - Redundancy:
Author = Orwellis repeated for both books in Order 101.
Step 1: Identify Partial Dependencies#
BookID → BookTitle, Author, Price(depend only onBookID).OrderID → CustomerName(depend only onOrderID).(OrderID, BookID) → Quantity(depends on the entire PK).
Step 2: Split into Smaller Tables#
Create three tables to eliminate partial dependencies:
-
Orders (PK:
OrderID): Stores order-specific data.OrderID CustomerName 101 Alice Smith 102 Bob Jones -
Books (PK:
BookID): Stores book-specific data.BookID BookTitle Author Price B100 1984 Orwell 12 B101 Animal Farm Orwell 10 B200 To Kill a Mockingbird Lee 15 -
OrderDetails (PK:
(OrderID, BookID)): Links orders to books (junction table).OrderID BookID Quantity 101 B100 2 101 B101 1 102 B200 1
Now all non-key attributes depend on the entire PK of their respective tables. This is 2NF!
Third Normal Form (3NF)#
3NF builds on 2NF and eliminates transitive dependencies. A table is in 3NF if:
- It is in 2NF.
- No non-key attribute depends on another non-key attribute (i.e., no transitive dependencies).
What is a Transitive Dependency?#
A transitive dependency occurs when A → B and B → C, so indirectly A → C. Here, C depends on B (a non-key attribute), not directly on A (the PK).
Rules of 3NF#
- All non-key attributes must depend directly on the PK, not through another non-key attribute.
Example: Converting to 3NF#
Let’s take the Orders table from 2NF and add CustomerEmail and CustomerCity:
| OrderID (PK) | CustomerName | CustomerEmail | CustomerCity |
|---|---|---|---|
| 101 | Alice Smith | [email protected] | New York |
| 102 | Bob Jones | [email protected] | London |
Problem: Transitive Dependency#
Suppose we later add a CustomerZipCode column. Now:
CustomerName → CustomerCity(e.g., "Alice Smith" lives in "New York").CustomerCity → CustomerZipCode(e.g., "New York" has zip code "10001").
Thus, CustomerZipCode depends on CustomerCity (a non-key), creating a transitive dependency: OrderID → CustomerName → CustomerCity → CustomerZipCode.
Step 1: Identify Transitive Dependencies#
CustomerName → CustomerEmail, CustomerCity(non-key attributes depend onCustomerName, another non-key).
Step 2: Split into Smaller Tables#
Create a Customers table to store customer-specific data, and link it to Orders via a foreign key (CustomerID):
-
Customers (PK:
CustomerID):CustomerID CustomerName CustomerEmail CustomerCity C001 Alice Smith [email protected] New York C002 Bob Jones [email protected] London -
Orders (PK:
OrderID, FK:CustomerID):OrderID CustomerID 101 C001 102 C002
Now, Orders only contains data directly dependent on OrderID, and Customers handles all customer-related attributes. This eliminates transitive dependencies and achieves 3NF!
Summary of the First Three Normal Forms#
| Normal Form | Key Requirement | Goal |
|---|---|---|
| 1NF | Atomic values, no repeating groups. | Eliminate multi-valued data. |
| 2NF | All non-key attributes depend on the entire PK (no partial dependencies). | Eliminate redundancy in composite keys. |
| 3NF | No transitive dependencies (non-keys don’t depend on other non-keys). | Ensure all attributes depend directly on the PK. |
When to Stop Normalizing?#
Normalization improves data integrity, but over-normalization can harm performance. For example:
- Too many small tables require complex joins, slowing down queries.
- Read-heavy applications (e.g., reporting) may benefit from denormalization (adding controlled redundancy) to speed up reads.
Rule of Thumb: Stop at 3NF for most applications. Higher normal forms (4NF, 5NF) exist but are rarely needed unless dealing with highly complex data relationships.
Conclusion#
Database normalization is a cornerstone of robust database design. By following the first three normal forms, you can eliminate redundancy, reduce errors, and ensure data consistency. Start with 1NF to make data atomic, move to 2NF to fix partial dependencies, and finish with 3NF to remove transitive dependencies. Remember: normalization is a balance—optimize for your application’s needs, not perfection.
References#
- Codd, E. F. (1970). "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM.
- Silberschatz, A., Korth, H. F., & Sudarshan, S. (2019). Database System Concepts (7th ed.). McGraw-Hill.
- W3Schools. "Database Normalization." https://www.w3schools.com/sql/sql_normalization.asp
- IBM. "Database Normalization Basics." https://www.ibm.com/docs/en/db2-for-zos