CPSC 304

Schema

The schema of a database is the logical structure that defines the organization of data.

Schema Vs Instance

The schema defines the structure (sometimes called the intention) of the database. It is relatively static and changes only through explicit schema modification. The actual data stored at any point in time is the database instance (or extension), which must conform to the schema.

Database Instance

A database instance (or extension) describes the content stored in a database at a given point in time. If the database has a fixed schema, every database instance must conform to it — that is, every tuple in the instance must respect the domains, key constraints, and functional dependencies defined by the schema.

Properties

The instance changes as data is inserted, updated, or deleted. The schema typically remains fixed.
At any point in time, the instance must satisfy all integrity constraints declared in the schema (e.g. every super key must uniquely identify tuples, no primary key value may be null).

Entity-Relationship Diagram

An entity-relationship diagram is a method of conceptually describing the schema of a database in a human viewable format.

Components

Entity

The entities of an entity-relationship diagram describes a "thing" that is capable of an independent existence that can be uniquely identified. An entity is a thing that exists either physically or logically.

An entity may be a physical object such as a house or a car (they exist physically), an event such as a house sale or a car service, or a concept such as a customer transaction or order (they exist logically as a concept).

Entities can typically be thought of as nouns. Examples include a computer, an employee, a song, or a mathematical theorem.

Relationship

The relationships of an entity-relationship diagram captures how entities are related to each other. Relationships can be thought of as verbs linking two or more nouns.

Examples include an owns relationship between a company and a computer, a supervises relationship between an employee and a department, a performs relationship between an artist and a song, and a proves relationship between a mathematician and a conjecture.

Cardinality

The cardinality of a relationship describes the number of relationships that an entity can participate in. There are generally 4 types of cardinality:

one-to-one - The entity A is associated with at most one entity B and vice versa
- We could model an employees computer using a one-to-one relationship as every employee only has one computer and every computer is used by only one employee.
one-to-many - The entity A is associated with any number of entities B but each B is only associated with at most one entity A
- We can model a teacher and the classes they teach with a one-to-many relationship as a class can only have a single teacher but one teacher can teach many classes.
many-to-one - The entity A is associated with at most one of the entities B but each B can be associated with any number of entities A (this is the inverse of one-to-many).
- This can just model the inverse of one-to-many for example we can model a class having a single teacher but every teacher teaching multiple classes.
many-to-many - The entity A is associated with any number of entities B and vice versa
- We can model the relationships between students and classes using many-to-many relationships as each student is taking multiple classes and each class has multiple students enrolled in it.

Participation

The participation of a relationship describes whether an entity must be involved in that relationship in order to exist. There are generally 2 types of participation:

total participation - Every instance of the entity must participate in at least one instance of the relationship.
- We could model an employee belonging to a department using total participation as every employee must be assigned to a department.
partial participation - Some instances of the entity may not participate in the relationship.
- We could model an employee managing a department using partial participation as not every employee is a manager.

Notation

When designing an entity-relationship diagram, the entities are drawn as squares with their descriptive name (typically a noun) in the centre, these are then connected to the attributes using a line which are drawn as circles with their respective attribute keys in the middle.

Relationships are drawn as a vertical rhombus (resembling a kite) with a descriptive name in the centre (typically a verb) in the centre. These relationships are then linked to their constituting entities.

Line Notation

The lines to relationships can be "plain" or have arrows, these combinations of arrows an plane lines describes the cardinality of the relationship.

Cardinality

One-to-one

Two plane lines indicate a many-to-many relationship

Example of a many-to-many relationship between actors who work on movies.

One-to-many and Many-to-one

One plane line and an arrow indicate a one-to-many or a many-to-one relationship

Example of a one-to-many relationship between directors who direct movies.

One-to-one

Two arrows indicate a one-to-one relationship

Example of a one-to-one relationship between a country and their capital.

Participation

When an entity requires total participation it's line to the relationship is drawn bolded.

Example of a one-to-many relationship between directors who direct movies, where movies are required to have a relationship to a director.

Weak Entity

A weak entity is an entity that cannot be uniquely identified by it's own attributes alone. It instead relies on a relationship with another entity (known as it's owner entity), to uniquely identify it.

Weak entities must participate in a one-to-many relationship with it's owner entity, called it's identifying relationship (typically a belongs to relationship). Weak entities must have total participation in their identifying relationships sets meaning every weak entity must be related to an owner.

The identifying key of a weak entity is a combination of it's own partial key (discriminator) and the primary key of the owner entity. In a entity-relationship diagram the identifying key is underlined with a dashed line, and weak entities are drawn with thick lines and the same with their identifying relationship.

Generalization

In the same way as object oriented languages represent inheritance attributes can be inherited through generalization. This is a done through a IsA relationship between two or more entities.

In a entity-relationship diagram we show this be drawing a relationship with a triangle with the label IsA and then connecting it to the relevant entities. On corner (the "top") is connected to the "parent" entity (the one which we inherit attributes from) and then the opposite face is connected to any entities which inherit it's attributes.

Specialization

Specialization is a method of dividing entities into their sub-entities based on it's characteristics. This is a done through a IsA relationship between two or more entities.

In a entity-relationship diagram we show this be drawing a relationship with an upside down triangle with the label IsA and then connecting it to the relevant entities. On the upper face (the "top") is connected to the "parent" entity (the one which we specialize from) and then the opposite faces are connected to any entities which are specializations of that entity.

Aggregation

Aggregation is a method of modelling relationships between other relationships which is typically not allowed in entity-relationship diagrams. This is most useful for adding descriptive attributes or relationships to existing associations.

We show aggregation in a entity-relationship diagram by surrounding the entities to be aggregated in a dashed box and then drawing a line from the relationships.

Structured Query Language

SQL is a language for representing operations and relationships in a relational database. SQL was original based on relational algebra and can be broken down into two sub languages DDL and DML.

Sub Languages

Data Definition Language

DDL is a sub language of SQL which describes the syntax of the handling of schemas.

Creating Tables

Tables are created using the CREATE TABLE directive, and then specifying the name of the table as well as the schema of it's fields.

CREATE TABLE person (
	person_id SMALLINT UNSIGNED PRIMARY KEY,

	first_name VARCHAR(20) NOT NULL,
	last_name VARCHAR(20) NOT NULL
);

Create a table person with a primary key and first and last name fields.

Modifying Tables

Tables can be modified using the ALTER TABLE directive with the table name and then a set of operations:

ADD <column_name> <datatype> - Add a new column with the name column_name and of type datatype.
DROP <column_name> - Removes the column of column_name from the table.

ALTER TABLE person
ADD middle_name VARCHAR(20) NOT NULL;

Deleting Tables

Tables can be removed using the DROP TABLE or DROP TABLE IF EXISTS directives the IF EXISTS variant prevents an error if the table doesn't exist.

DROP TABLE IF EXISTS person;

Remove the table person if it was created already.

Constraints

Table constraints are used to specify rules for certain columns in a table they are placed after the column datatype when creating or modifying a table.

CREATE TABLE person (
	person_id SMALLINT UNSIGNED PRIMARY KEY,
	--                          ^–------––– Constraint

	first_name VARCHAR(20) NOT NULL,
	--                     ^–––––––– Constraint
	last_name VARCHAR(20) NOT NULL
	--                    ^–––––––– Constraint
);

List of Constraints

NOT NULL - Ensures the column is not empty (null value).
UNIQUE - All values in the column are different.
PRIMARY KEY - Specifies that it's the unique identifier for each row in a table and is a combination of NOT NULL and UNIQUE.
FOREIGN KEY - Prevents actions that would destroy the links between tables (must reference a primary key of another row - typically from another table)
CHECK - Ensures the column satisfies a given boolean expression (for example CHECK (age >= 18))
DEFAULT - Provides a default value if no value is specified.

Some constraints like PRIMARY KEY, FOREIGN KEY can be declared out of line and can accept a combination of multiple columns.

CREATE TABLE enrollment (
    student_id INT UNSIGNED NOT NULL,
    course_id  INT UNSIGNED NOT NULL,
    enrolled_on DATE NOT NULL,

    PRIMARY KEY (student_id, course_id),
    
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (course_id)  REFERENCES course(course_id)
);

Views

Views are virtual tables created from queries on one or more base tables. A view does not store data; instead, it displays data stored in other tables dynamically. Views are useful for simplifying complex queries and restricting access to specific data.

Creating Views

Views are created using the CREATE VIEW directive, followed by the view name and the AS keyword with a SELECT query.

CREATE VIEW person_details AS
SELECT 
	person_id,
	first_name,
	last_name
FROM person
WHERE person_id > 0;

Create a view person_details that displays only active person records.

Dropping Views

Views can be removed using the DROP VIEW or DROP VIEW IF EXISTS directives. The IF EXISTS variant prevents an error if the view doesn't exist.

DROP VIEW IF EXISTS person_details;

Remove the view person_details if it exists.

Modifying Views

To change a view's definition, use the CREATE OR REPLACE VIEW directive, which will either create the view if it doesn't exist or replace its definition.

CREATE OR REPLACE VIEW person_details AS
SELECT 
	person_id,
	first_name,
	last_name,
	middle_name
FROM person
WHERE person_id > 0;

Replace the person_details view with an updated definition that includes the middle name field.

Data Manipulation Language

DML is a sub language of SQL which describes the syntax of the retrieving and manipulating of database records.

Statements

DML consists of several core statements for working with data:

SELECT

Retrieve data from tables. The SELECT statement includes:

Basic querying and column selection
DISTINCT values
Filtering with WHERE clause
Subqueries and set membership (IN, NOT IN, EXISTS, NOT EXISTS)
Set comparison operators (ANY, ALL)
Ordering results with ORDER BY
Limiting results with LIMIT

INSERT

Add new records to tables.

UPDATE

Modify existing records in tables.

DELETE

Remove records from tables.

Advanced Operations

JOIN

Combine rows from multiple tables based on related columns. Includes:

INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL OUTER JOIN

Set Operations

Combine results from multiple queries using Set theory operations:

UNION (corresponds to Union)
INTERSECT (corresponds to Intersection)
EXCEPT/MINUS (corresponds to Set Difference)

Aggregate Functions

Perform calculations on multiple rows:

COUNT, SUM, AVG, MIN, MAX
GROUP BY clause
HAVING clause for filtering grouped results

Functional Dependence

Functional dependence describes a relationship between attributes of a relation. If the attribute Y is functionally dependent on X (X Y) it holds that for every pair of tuples with attributes X and Y, if the X values are equal then the Y values must also be equal.

For example consider a relation with the attributes PostalCode, City, and Province. If PostalCode City, Province, then for all tuples in the relation, if they have equal PostalCode values, they must have the same City and Province.

The symbol can be read as determines. For example X Y can be read as either X functionally determines Y or X determines Y.

Functional dependence can indicate to us that we're storing redundant data and can help us decompose relations without losing information.

Deriving Additional Functional Dependences

Given a set of functional dependences for a relation we can derive additional functional dependences through the following methods:

Axioms

The following axioms (known as Armstrong's Axioms) are sound and complete — they can derive all and only the functional dependencies that are logically implied by a given set. A set of axioms can be used to transform functional dependences

Reflexivity: If Y X, then X Y (ie. City, Major City).
Augmentation: If X Y, then X, Z Y, Z for any Z (ie.SID City, then SID, Major City, Major).
Transivity: If X Y and Y Z, then X Z (ie. SID City and City AreaCode, then SID AreaCode)

These axioms can then be used to derive a few additional rules for ease of use

Union: If X Y, and X Z then X Y, Z (eg. SID AreaCode and SID City, then SID AreaCode, City)
Decomposition: If X Y, Z then X Y and X Z (eg. SID AreaCode, City then SID AreaCode and SID City)

Closure Method

Alternatively we can use the "closure method", given a set of attributes X we denote the closure of it as X. If X is a super key X contains all attributes of the schema.

The algorithm is expressed as

Let Closure = X Until Closure doesn't change do if a, ..., a C is a functional dependence and {a, ..., } Closure then add C to Closure

Example

Consider the schema SupplierPart(supplierName, city, status, partNumber, partName, quantity) and the functional dependences

supplierName city
city status
partNumber partName
supplierName, partNumber quantity

Find the closure of supplierName, partNumber, supplierName, and partNumber.

(supplierName, partNumber) = {supplierName, partNumber}

supplierName city is a functional dependence and {supplierName} is a subset of (supplierName, partNumber) so we add city to (supplierName, partNumber).

(supplierName, partNumber) = {supplierName, partNumber, city}

city status is a functional dependence and {city} is a subset of (supplierName, partNumber) so we add status to (supplierName, partNumber).

(supplierName, partNumber) = {supplierName, partNumber, city, status}

partNumber partName is a functional dependence and {partNumber} is a subset of (supplierName, partNumber) so we add partName to (supplierName, partNumber).

(supplierName, partNumber) = {supplierName, partNumber, city, status, partName}

supplierName, partNumber quantity is a functional dependence and {supplierName, partNumber} is a subset of (supplierName, partNumber) so we add quantity to (supplierName, partNumber).

(supplierName, partNumber) = {supplierName, partNumber, city, status, partName, quantity}

We've exhausted all function dependences so this is our final closure. This final closure contains all attributes of the schema so we know (supplierName, partNumber) is a super key.

supplierName = {supplierName}

supplierName city is a functional dependence and {supplierName} is a subset of supplierName so we add city to supplierName.

supplierName = {supplierName, city}

city status is a functional dependence and {city} is a subset of supplierName so we add status to supplierName.

supplierName = {supplierName, city, status}

We've exhausted all functional dependencies which are contained in the closure so this is our final closure.

partNumber = {partNumber}

partNumber partName is a functional dependence and {partNumber} is a subset of partNumber so we add partName to partNumber.

partNumber = {partNumber, partName}

We've exhausted all functional dependencies which are contained in the closure so this is our final closure.

Using those three closures we can see that supplierName, partNumber is a candidate key as reducing either supplierName or partNumber means the key is no longer a super key.

Super Key

A super key is a set of one or more attributes in a schema that uniquely identifies a tuple.

Formally, a set of attributes is a super key for a relation schema if for every valid instance :

Equivalently, is a super key if holds and the closure contains all attributes of the schema.

Every relation always contains at least one super key: the trivial key, which is the set of all its attributes.

Example

Consider a table Employee with the following attributes

EmployeeID: Unique ID assigned by the company
SSN: Social Security Number
Name: Name of the employee
Department: The department the employee works at

Some examples of super keys are the following sets:

{EmployeeID}: Is unique for every Employee
{SSN}: Like the ID is unique for every Employee
{EmployeeID, Name}: Via reflexivity as Name is redundant.
{EmployeeID, SSN, Name, Department}: The trivial super key which already contains all columns.

Candidate Key

A candidate key is any super key which cannot be further reduced by removing attributes — that is, no proper subset of it is also a super key.

Formally, a set of attributes is a candidate key for a relation schema if:

is a super key
No proper subset is a super key ( is minimal).

Every relation has at least one candidate key. All candidate keys of a relation have the same property of uniquely identifying tuples — the choice of which one to designate as the primary key is a design decision.

Example

Consider a table Employee with the following attributes

EmployeeID: Unique ID assigned by the company
SSN: Social Security Number
Name: Name of the employee
Department: The department the employee works at

Some examples of it's candidate keys are the following sets:

{EmployeeID}: Is unique for every Employee
{SSN}: Like the ID is unique for every Employee

Conversely non-candidate keys which are still super keys are the following

{EmployeeID, SSN}: The SSN or EmployeeID are redundant keys to identify a record.
{EmployeeID, Name}: The Name attribute is redundant.

Primary Key

A primary key is a designated candidate key that is primarily used to identify records.

Example

Consider a table Employee with the following attributes

EmployeeID: Unique ID assigned by the company
SSN: Social Security Number
Name: Name of the employee
Department: The department the employee works at

Some examples of it's candidate keys are the following sets:

{EmployeeID}: Is unique for every Employee
{SSN}: Like the ID is unique for every `Employee

Given this list of candidate keys we can choose the primary key to be either EmployeeID or SSN. Typically we would choose EmployeeID to be our primary key as we would probably be generating it whereas SSN would be entered by the user.

Normal Form

A normal form is a property of a relation schema that characterizes it in terms of the redundancy it permits. Normal forms are defined with respect to functional dependencies (and in higher forms, multivalued and join dependencies) that hold on the schema.

Normal forms are ordered by strictness. Each successive normal form imposes additional constraints that eliminate further classes of redundancy:

That is, every BCNF relation is in 3NF, every 3NF relation is in 2NF, and every 2NF relation is in 1NF.

Summary

Normal Form	Condition
1NF	All attributes are atomic
2NF	1NF + no partial dependencies on the primary key
3NF	2NF + no transitive dependencies on the primary key
BCNF	Every non-trivial FD has a super key as its determinant

The goal of applying successively stricter normal forms is to minimize update anomalies (insertion, deletion, and modification anomalies) that arise from redundant data storage without losing information. The process of restructuring a schema to satisfy a normal form is called database normalization.

First Normal Form

First normal form is the most basic normal form. A relation is in first normal form if and only if every attribute contains only atomic (indivisible) values — that is, no attribute holds a set, list, or nested relation as its value.

Definition

A relation schema is in first normal form if for every valid instance :

The domain of each attribute contains only atomic values.
Each tuple assigns exactly one value from the domain to each attribute.

This is essentially a requirement of the relational model itself. E.F. Codd's original definition of a relation requires that all attribute values are atomic, so any valid relation is in 1NF by definition.

Violations

A relation violates 1NF when an attribute contains:

Multi-valued fields — e.g. a PhoneNumbers attribute storing {555-1234, 555-5678}.
Composite fields — e.g. a FullName attribute storing a structured value with sub-components FirstName and LastName.
Nested relations — e.g. an Orders attribute containing an entire table of order records within a single tuple.

Resolving Violations

To bring a relation into 1NF:

Multi-valued attributes are resolved by creating a separate tuple for each value, or by decomposing the multi-valued attribute into a separate relation.
Composite attributes are resolved by replacing them with their atomic components.

Note

Most modern relational database systems enforce atomicity at the storage level, making 1NF violations uncommon in practice. However, the theoretical distinction remains important as 1NF is the foundation upon which all higher normal forms are built.

Second Normal Form

Second normal form is a normal form used in database normalization. Second normal form requires that a schema is in 1NF and that every non-key attribute is fully functionally dependent on the entire primary key, meaning no non-key attribute depends on only part of a composite key.

Definition

A schema is in second normal form if:

It is in 1NF.
There are no partial functional dependencies, that is no non-key attribute X is dependent on a proper subset of the primary key.

Example

Consider a table Toothbrush with the following attributes

Manufacturer
`Model
ManufacturerCountry

and the following functional dependencies

Manufacturer, Model ManufacturerCountry
Manufacturer ManufacturerCountry

Is Toothbrush in 2NF? Justify why or why not and if it isn't decompose it such that it is.

Toothbrush is not in 2NF because ManufacturerCountry is only dependent on Manufacturer not on Model. To reconcile this we decompose it into two new tables

Toothbrush with the following attributes

Manufacturer
`Model

and Manufacturer

Manufacturer
ManufacturerCountry

Third Normal Form

Third normal form is a normal form used in database normalization. Third normal form eliminates transitive dependencies of non-key attributes on the primary key.

3NF is strictly weaker than BCNF. The difference is that 3NF permits a non-trivial FD when is prime, even if is not a super key. BCNF does not allow this exception.

The advantage of 3NF over BCNF is that it is always possible to find a decomposition into 3NF that is both loseless and dependency-preserving, which is not always achievable for BCNF.

Definition

A schema is in third normal form if for every non-trivial functional dependency (where is a single attribute) at least one of the following holds:

is a super key.
is a prime attribute (i.e. is a member of some candidate key).

Equivalently, a schema is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on any candidate key.

A transitive dependency exists when and , where is not a super key and is a non-prime attribute. In this case is a transitive dependency through .

Example

Consider a schema Student(StudentID, Department, DepartmentHead) with the following functional dependencies:

StudentID Department
Department DepartmentHead

The only candidate key is {StudentID}.

This schema is in 2NF (the key is a single attribute, so partial dependencies are impossible), but it is not in 3NF because DepartmentHead is transitively dependent on StudentID through Department:

Here Department DepartmentHead violates 3NF because Department is not a super key and DepartmentHead is not a prime attribute.

To bring it into 3NF, we decompose into:

Student(StudentID, Department)
Department(Department, DepartmentHead)

Boyce-Codd Normal Form

Boyce-Codd normal form is a normal form used in database normalization and is a slightly stricter version of 3NF. Boyce-Codd normal form removes all redundancies based on functional dependence.

Definition

If a schema is in Boyce-Codd normal form all of it's functional dependencies X Y satisfy at least one of the following conditions:

X Y satisfies reflexivity (Y X).
X is a super key for the schema.

If a relational schema is in BCNF, then it is automatically also in 3NF because BCNF is a stricter form of 3NF.

Example

Consider a table Address with the following attributes

HouseNumber
Street
City
Province
PostalCode

and the following functional dependencies

HouseNumber, Street, PostalCode → City
HouseNumber, Street, PostalCode → Province
PostalCode → City
PostalCode → Province

Is Address in BCNF? Justify why or why not and if it isn't decompose it such that it is.

Address is not in BCNF.

The candidate key is {HouseNumber, Street, PostalCode} (its closure yields all attributes). The dependencies PostalCode → City and PostalCode → Province violate BCNF because PostalCode alone is not a super key for Address.

To bring it into BCNF, we decompose using the violating dependency PostalCode → City, Province:

PostalArea(PostalCode, City, Province): Here PostalCode is a super key, so all functional dependencies satisfied.
Address(HouseNumber, Street, PostalCode): No non-trivial functional dependencies remain that violate BCNF.

This decomposition is loseless because the common attribute PostalCode is a super key for PostalArea.

Relational Algebra

Relational algebra is a procedural query language that defines a set of operations on relations. Each operation takes one or more relations as input and produces a new relation as output. This closure property means operations can be composed arbitrarily. Relational algebra serves as the theoretical foundation for query languages such as Structured Query Language.

Fundamental Operations

The six fundamental operations are sufficient to express any relational algebra query. All other operations can be derived from these.

Selection ()

Selects tuples from a relation that satisfy a given predicate.

where is a boolean predicate over attributes.

Example: returns all students older than 21.

Projection ()

Selects specified attributes from a relation, removing duplicates from the result.

Example: returns only the name and department of each employee.

Union ()

Returns all tuples that appear in either of two union-compatible relations.

Requires and to have the same arity and domain-compatible attributes.

Set Difference ()

Returns all tuples in the first relation that are not in the second.

Cartesian Product ()

Combines every tuple of one relation with every tuple of another.

where denotes the concatenation of the two tuples.

Rename ()

Renames either the relation or its attributes.

renames relation to with attributes renamed to .

Derived Operations

These operations can be expressed in terms of the fundamental operations but are used so frequently that they have their own notation.

Intersection ()

Returns tuples present in both relations.

Natural Join ()

Combines tuples from two relations that agree on all common attributes, then projects out the duplicate columns. The natural join is central to the theory of lossless-join decomposition.

Division ()

Returns tuples in that are associated with every tuple in . Useful for "for all" type queries.

Assignment ()

Assigns the result of a relational algebra expression to a temporary relation variable. This is purely notational convenience for breaking complex expressions into steps.

Properties

Property	Operations
Commutative	, , ,
Associative	, , ,
Not commutative	,

Selection is idempotent: .

Selections can be cascaded: .

Projections can be cascaded: when .

Structured Query Language

SQL is a language for representing operations and relationships in a relational database. SQL was original based on relational algebra and can be broken down into two sub languages DDL and DML.

Sub Languages

Data Definition Language

DDL is a sub language of SQL which describes the syntax of the handling of schemas.

Creating Tables

Tables are created using the CREATE TABLE directive, and then specifying the name of the table as well as the schema of it's fields.

CREATE TABLE person (
	person_id SMALLINT UNSIGNED PRIMARY KEY,

	first_name VARCHAR(20) NOT NULL,
	last_name VARCHAR(20) NOT NULL
);

Create a table person with a primary key and first and last name fields.

Modifying Tables

Tables can be modified using the ALTER TABLE directive with the table name and then a set of operations:

ADD <column_name> <datatype> - Add a new column with the name column_name and of type datatype.
DROP <column_name> - Removes the column of column_name from the table.

ALTER TABLE person
ADD middle_name VARCHAR(20) NOT NULL;

Deleting Tables

Tables can be removed using the DROP TABLE or DROP TABLE IF EXISTS directives the IF EXISTS variant prevents an error if the table doesn't exist.

DROP TABLE IF EXISTS person;

Remove the table person if it was created already.

Constraints

Table constraints are used to specify rules for certain columns in a table they are placed after the column datatype when creating or modifying a table.

CREATE TABLE person (
	person_id SMALLINT UNSIGNED PRIMARY KEY,
	--                          ^–------––– Constraint

	first_name VARCHAR(20) NOT NULL,
	--                     ^–––––––– Constraint
	last_name VARCHAR(20) NOT NULL
	--                    ^–––––––– Constraint
);

List of Constraints

NOT NULL - Ensures the column is not empty (null value).
UNIQUE - All values in the column are different.
PRIMARY KEY - Specifies that it's the unique identifier for each row in a table and is a combination of NOT NULL and UNIQUE.
FOREIGN KEY - Prevents actions that would destroy the links between tables (must reference a primary key of another row - typically from another table)
CHECK - Ensures the column satisfies a given boolean expression (for example CHECK (age >= 18))
DEFAULT - Provides a default value if no value is specified.

Some constraints like PRIMARY KEY, FOREIGN KEY can be declared out of line and can accept a combination of multiple columns.

CREATE TABLE enrollment (
    student_id INT UNSIGNED NOT NULL,
    course_id  INT UNSIGNED NOT NULL,
    enrolled_on DATE NOT NULL,

    PRIMARY KEY (student_id, course_id),
    
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (course_id)  REFERENCES course(course_id)
);

Views

Creating Views

Views are created using the CREATE VIEW directive, followed by the view name and the AS keyword with a SELECT query.

CREATE VIEW person_details AS
SELECT 
	person_id,
	first_name,
	last_name
FROM person
WHERE person_id > 0;

Create a view person_details that displays only active person records.

Dropping Views

Views can be removed using the DROP VIEW or DROP VIEW IF EXISTS directives. The IF EXISTS variant prevents an error if the view doesn't exist.

DROP VIEW IF EXISTS person_details;

Remove the view person_details if it exists.

Modifying Views

To change a view's definition, use the CREATE OR REPLACE VIEW directive, which will either create the view if it doesn't exist or replace its definition.

CREATE OR REPLACE VIEW person_details AS
SELECT 
	person_id,
	first_name,
	last_name,
	middle_name
FROM person
WHERE person_id > 0;

Replace the person_details view with an updated definition that includes the middle name field.

Data Manipulation Language

DML is a sub language of SQL which describes the syntax of the retrieving and manipulating of database records.

Statements

DML consists of several core statements for working with data:

SELECT

Retrieve data from tables. The SELECT statement includes:

Basic querying and column selection
DISTINCT values
Filtering with WHERE clause
Subqueries and set membership (IN, NOT IN, EXISTS, NOT EXISTS)
Set comparison operators (ANY, ALL)
Ordering results with ORDER BY
Limiting results with LIMIT

INSERT

Add new records to tables.

UPDATE

Modify existing records in tables.

DELETE

Remove records from tables.

Advanced Operations

JOIN

Combine rows from multiple tables based on related columns. Includes:

INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL OUTER JOIN

Set Operations

Combine results from multiple queries using Set theory operations:

UNION (corresponds to Union)
INTERSECT (corresponds to Intersection)
EXCEPT/MINUS (corresponds to Set Difference)

Aggregate Functions

Perform calculations on multiple rows:

COUNT, SUM, AVG, MIN, MAX
GROUP BY clause
HAVING clause for filtering grouped results

SELECT Statement (DML)

The SELECT statement is used to retrieve data from a table, allowing you to query one or more columns.

Basic Syntax

SELECT first_name, last_name
FROM person;

Select the first and last names from the person table.

To select all columns, use the * wildcard:

SELECT * FROM person;

Selecting Distinct Values

Use DISTINCT to return only unique values, eliminating duplicate rows:

SELECT DISTINCT last_name
FROM person;

Select all unique last names from the person table.

You can use DISTINCT with multiple columns:

SELECT DISTINCT first_name, last_name
FROM person;

Returns unique combinations of first and last names.

Filtering Results

Use the WHERE clause to filter rows based on conditions:

SELECT first_name, last_name
FROM person
WHERE person_id = 5;

Subqueries and Set Membership

IN Operator

The IN operator checks if a value matches any value in a list or subquery result:

SELECT first_name, last_name
FROM person
WHERE person_id IN (1, 2, 3);

Select persons with IDs 1, 2, or 3.

Using IN with a subquery:

SELECT name
FROM student
WHERE student_id IN (
    SELECT student_id
    FROM enrollment
    WHERE course_id = 101
);

Select students enrolled in course 101.

NOT IN Operator

The NOT IN operator checks if a value does not match any value in a list or subquery result:

SELECT first_name, last_name
FROM person
WHERE person_id NOT IN (1, 2, 3);

Select all persons except those with IDs 1, 2, or 3.

Using NOT IN with a subquery:

SELECT name
FROM student
WHERE student_id NOT IN (
    SELECT student_id
    FROM enrollment
);

Select students who are not enrolled in any course.

Nested Queries

Subqueries can be nested within other queries in various clauses:

In SELECT clause:

SELECT first_name, last_name,
    (SELECT COUNT(*) 
     FROM enrollment 
     WHERE enrollment.student_id = student.student_id) AS course_count
FROM student;

Select students with their enrollment count.

In WHERE clause with comparison operators:

SELECT name
FROM student
WHERE student_id = (
    SELECT student_id
    FROM enrollment
    WHERE course_id = 101
    LIMIT 1
);

Using EXISTS:

SELECT name
FROM student
WHERE EXISTS (
    SELECT 1
    FROM enrollment
    WHERE enrollment.student_id = student.student_id
    AND course_id = 101
);

Select students who are enrolled in course 101.

Using NOT EXISTS:

SELECT name
FROM student
WHERE NOT EXISTS (
    SELECT 1
    FROM enrollment
    WHERE enrollment.student_id = student.student_id
);

Select students who have no enrollments.

Set Comparison Operators

Set comparison operators allow you to compare a value against a Set of values returned by a subquery.

ANY Operator

The ANY operator compares a value to any value in a Set. The condition is true if the comparison is true for at least one value in the Set.

SELECT name, grade
FROM student
WHERE grade > ANY (
    SELECT grade
    FROM student
    WHERE course_id = 101
);

Select students whose grade is greater than at least one grade in course 101.

Common uses of ANY:

= ANY is equivalent to IN
> ANY means greater than the minimum value
< ANY means less than the maximum value

ALL Operator

The ALL operator compares a value to all values in a Set. The condition is true only if the comparison is true for every value in the Set.

SELECT name, grade
FROM student
WHERE grade > ALL (
    SELECT grade
    FROM student
    WHERE course_id = 101
);

Select students whose grade is greater than all grades in course 101.

Common uses of ALL:

!= ALL (or <> ALL) is equivalent to NOT IN
> ALL means greater than the maximum value
< ALL means less than the minimum value

Example comparing ANY and ALL:

-- Students with salary higher than ANY engineer (at least one)
SELECT name, salary
FROM employee
WHERE salary > ANY (
    SELECT salary FROM employee WHERE department = 'Engineering'
);

-- Students with salary higher than ALL engineers (every one)
SELECT name, salary
FROM employee
WHERE salary > ALL (
    SELECT salary FROM employee WHERE department = 'Engineering'
);

Ordering Results

Use ORDER BY to sort results in ascending (ASC) or descending (DESC) order:

SELECT first_name, last_name
FROM person
ORDER BY last_name ASC, first_name ASC;

Limiting Results

Use LIMIT to restrict the number of rows returned:

SELECT first_name, last_name
FROM person
LIMIT 10;

INSERT Statement (DML)

The INSERT INTO statement is used to add new records to a table.

Basic Syntax

INSERT INTO person (person_id, first_name, last_name)
VALUES (1, 'John', 'Doe');

Insert a new person with ID 1.

Inserting Multiple Rows

Multiple rows can be inserted at once:

INSERT INTO person (person_id, first_name, last_name)
VALUES 
    (2, 'Jane', 'Smith'),
    (3, 'Bob', 'Johnson');

Insert multiple records in a single statement.

UPDATE Statement (DML)

The UPDATE statement is used to modify existing records in a table.

Basic Syntax

UPDATE person
SET first_name = 'Jonathan'
WHERE person_id = 1;

Update the first name of person with ID 1.

Updating Multiple Columns

Multiple columns can be updated simultaneously:

UPDATE person
SET first_name = 'Jane', last_name = 'Doe'
WHERE person_id = 2;

Update multiple fields in a single statement.

DELETE Statement (DML)

The DELETE FROM statement is used to remove records from a table.

Basic Syntax

DELETE FROM person
WHERE person_id = 3;

Delete the person with ID 3.

Deleting All Rows

To delete all rows from a table (use with caution):

DELETE FROM person;

This removes all records from the table while keeping the table structure intact.

JOIN Clause (DML)

The JOIN clause combines rows from multiple tables based on related columns.

Basic Syntax

SELECT student.name, course.title
FROM enrollment
INNER JOIN student ON enrollment.student_id = student.student_id
INNER JOIN course ON enrollment.course_id = course.course_id;

Join three tables together to get student names and their enrolled courses.

Types of Joins

Inner Join

INNER JOIN returns only matching rows from both tables.

SELECT orders.order_id, customers.customer_name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.customer_id;

Returns only orders that have a matching customer.

Left Join

LEFT JOIN returns all rows from the left table and matching rows from the right.

SELECT customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;

Returns all customers, including those without orders.

Right Join

RIGHT JOIN returns all rows from the right table and matching rows from the left.

SELECT customers.customer_name, orders.order_id
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;

Returns all orders, including those without a matching customer.

Full Outer Join

FULL OUTER JOIN returns all rows when there's a match in either table.

SELECT customers.customer_name, orders.order_id
FROM customers
FULL OUTER JOIN orders ON customers.customer_id = orders.customer_id;

Returns all customers and all orders, matching them where possible.

Set Operations (DML)

Set operations combine the results of two or more queries into a single result Set. The queries must have compatible column structures (same number of columns with compatible data types).

Union

The UNION operator combines the results of two queries, returning all unique rows from both result sets. This corresponds to the Union operation in set theory.

SELECT first_name, last_name FROM person_table_1
UNION
SELECT first_name, last_name FROM person_table_2;

Returns all unique names from both tables.

Use UNION ALL to include duplicate rows:

SELECT first_name, last_name FROM person_table_1
UNION ALL
SELECT first_name, last_name FROM person_table_2;

Returns all names from both tables, including duplicates.

Intersection

The INTERSECT operator returns only the rows that appear in both query results. This corresponds to the Intersection operation in set theory.

SELECT first_name, last_name FROM person_table_1
INTERSECT
SELECT first_name, last_name FROM person_table_2;

Returns names that appear in both tables.

Difference

The EXCEPT operator (or MINUS in some databases) returns rows from the first query that do not appear in the second query. This corresponds to the Set Difference operation in set theory.

SELECT first_name, last_name FROM person_table_1
EXCEPT
SELECT first_name, last_name FROM person_table_2;

Returns names that are in the first table but not in the second.

Aggregate Functions (DML)

Aggregate functions perform calculations on multiple rows and return a single value.

Basic Syntax

SELECT COUNT(*) FROM person;

Count the total number of rows in the person table.

Common Aggregate Functions

COUNT() - Counts the number of rows.
SUM() - Calculates the sum of values.
AVG() - Calculates the average value.
MIN() - Finds the minimum value.
MAX() - Finds the maximum value.

Examples

SELECT COUNT(*), AVG(grade), MAX(grade), MIN(grade)
FROM student;

Get count, average, maximum, and minimum grades from the student table.

SELECT SUM(salary)
FROM employee
WHERE department = 'Engineering';

Calculate total salary for engineering department.

Grouping Results

Use GROUP BY to group rows and apply aggregate functions to each group:

SELECT course_id, COUNT(*) AS enrollment_count
FROM enrollment
GROUP BY course_id;

Count enrollments for each course.

Having Clause

Use HAVING to filter grouped results (similar to WHERE but for aggregated data):

SELECT course_id, COUNT(*) AS enrollment_count
FROM enrollment
GROUP BY course_id
HAVING COUNT(*) > 10;

Show only courses with more than 10 enrollments.

Difference between WHERE and HAVING:

WHERE filters rows before grouping
HAVING filters groups after aggregation