Understanding MySQL | Yuhang's Blog

This article is a review of what I learned in the Coursera course Managing Big Data with MySQL.

Admittedly, not a small part of the emphasis (and difficulty) of this course is on knowledge beyond MySQL itself, that is, the ability to apply the MySQL skills to real-life business problems. Nevertheless, this article will mainly focus on MySQL itself and my understanding of this rational database manage system.

If you want to learn MySQL from scratch by reading this article, it assumes you have intermediate experience in programming, especially in object-oriented programming, and only serves as a tutorial that gives you the basic (and hopefully intuitive) understanding of the concept of MySQL and the usage of several frequently used MySQL queries. If you are curious about more details or have any problems when implementing your queries, you are encouraged to refer to MySQL official documents and other tutorials.

Database and Table

What is a database? A database is a collection of one or more interconnected tables. Then what is a table? If you have experience with object-oriented programming, you can understand the concept of a table quite with ease. (If you don’t, just remember a class is a kind of things, and an object is a thing of this kind.) Suppose you have a class, where an object has several attributes (but does not have methods). When you generate multiple objects of this class and store one object on a row, with a column being an attribute of this class, you get a table. Honestly, this is just how some Python database frameworks (for example flask-sqlalchemy) consider a rational database table. To put it simply, a table is multiple objects of the same class.

You may find a relational database is somewhat like a spreadsheet. In fact, according what we’ve said, any relational database can be stored in a spreadsheet, but the reverse is certainly not true.

Keys

Primary Keys

When in real object-oriented programming, an object would usually be represented as a variable. It goes without saying that in (almost?) any programming language, within one naming scope, any variable should have a unique name to prevent confusion with other variables. It is the same in MySQL. An object, or a row, in the table should be able to be distinguished from other objects (or rows) so that many further operations could be applied.

That is why we have what is called a ‘primary key’ (key is a fancy synonym to attribute, or column). A primary key is the attribute that distinguish one object from others. Clearly, if a column is chosen as the primary key, no duplicate rows should appear in this column. Usually you can use an id number as the primary key, just as in other programming languages you use an index to refer to elements in an array or list.

However, sometimes you can use multiple keys together as the primary key. For example, if your class is a room which can be located by a building number and a room number in the building, the building number and the room number can be used together as the primary key. There may be multiple rooms in the same building, and there may be rooms sharing the same room number in different buildings, but typically no two rooms in the same building have the same room number.

Unique Keys

After being familiar with the concept of primary key, it would be easy to guess what a unique key is. Yes, it is just what it literally means. Clearly a primary key must be a unique key. Apart from the primary key, you can set any key (or combination of multiple keys) in your table as a unique key. This can act as an restriction on your data, preventing potential mistakes.

Other Constraints

In real implementation, other constraints may be imposed on keys when creating a table, which you can look up here. To name but a few frequently used ones:

A key should be of a certain type, for example int, float, char, varchar, date, etc., like you may have seen in some programming languages, that a variable should be assigned with an unchangeable type specified.
A key can have a default value.
You can explicitly forbid a key from having NULL values.
An AUTO_INCREMENT key has auto increment values, usually used as id numbers.

Installation

MySQL document has detailed instructions on how to install a MySQL service on your computer. If you have any trouble, ask Google.

If you are using Arch Linux like I do, please note that you should install mariadb instead of mysql:

1	sudo pacman -S mariadb

For other configurations on Arch Linux, you can refer to this article.

Running MySQL in Command Line

I hope you are familiar with the command line, where you can launch your MySQL service and start querying. If you don’t, Google.

MySQL Queries

Creating Databases

Creating a database is easy. You give it a name, and it’s done. A database is like a container of tables. Just remember to use the database when you want to operate on the tables in it.

1 2	CREATE DATABASE school; USE school;

In real life you may have to configure the permission to use the database for others if you are the MySQL administrator.

To display all the databases we have:

1	SHOW DATABASES;

+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| school             |
+--------------------+
4 rows in set (0.000 sec)

Apart from the newly created database, MySQL has three databases by default.

Creating Tables

A table, as mentioned, is a list of objects of the same class. When creating a table, you have to name the table, usually after the name of the class. Then you list all the keys you are interested in and impose relevant constraints on each of them.

In our example, let’s first create a students table using the snippet below:

CREATE TABLE students (
	student_id int NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    gender varchar(10) NOT NULL,
    class_id int NOT NULL,
    age int NOT NULL,
    score int,
    PRIMARY KEY (student_id)
);

To display all the tables we have:

1	SHOW TABLES;

+------------------+
| Tables_in_school |
+------------------+
| students         |
+------------------+
1 row in set (0.001 sec)

To display the settings of a table:

1	DESCRIBE students;

+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| student_id | int(11)      | NO   | PRI | NULL    | auto_increment |
| name       | varchar(255) | NO   |     | NULL    |                |
| gender     | varchar(10)  | NO   |     | NULL    |                |
| class_id   | int(11)      | NO   |     | NULL    |                |
| age        | int(11)      | NO   |     | NULL    |                |
| score      | int(11)      | YES  |     | NULL    |                |
+------------+--------------+------+-----+---------+----------------+
6 rows in set (0.001 sec)

You can alter the settings using ALTER.

Loading Data

Now you can load data into the table, or in other words, creating objects of the class. For each object, you have specify the values for the attributes that you listed when creating the table. When you leave out certain attributes, the entries in the table would be NULL (which we’ll discuss later), or the default value that you set in advance.

INSERT INTO students (name, class_id, gender, age, score)
VALUES 
    ('David', 2, 'male', 17, 85),
    ('Nancy', 1, 'female', 18, 90),
    ('Simon', 1, 'male', 18, 97),
    ('Ronnie', 2, 'male', 20, 59),
    ('Alice', 2, 'female', 19, 75),
    ('Julia', 1, 'female', 16, 80);
INSERT INTO students (name, class_id, gender, age)
VALUES ('Dannie', 2, 'male', 20);

You may also insert some other (imaginary) students into this table to fulfill it, if you want to. Now this database only contains seven rows in order to facilitate your understanding of basic MySQL queries. In real-life situations, you may get tons of data in your database.

You can also DELETE and UPDATE values in a table.

SELECT

For a table loaded with data, what can be more exciting than to select data from it? That accounts for the fact that each and every query below starts with the key word SELECT.

When you want to see some of the columns of your table, you can list the column names after the SELECT key word, separated by commas. The key word FROM is used to inform MySQL which table you are talking about, which is also a must for all queries you are going to write.

1 2	SELECT student_id, name, gender FROM students;

+------------+--------+--------+
| student_id | name   | gender |
+------------+--------+--------+
|          1 | David  | male   |
|          2 | Nancy  | female |
|          3 | Simon  | male   |
|          4 | Ronnie | male   |
|          5 | Alice  | female |
|          6 | Julia  | female |
|          7 | Dannie | male   |
+------------+--------+--------+
7 rows in set (0.001 sec)

Most of the time you want to include a unique key in the columns you select so that you can distinguish all the objects from each other.

When you want to select all the columns in the table, use SELECT *:

1 2	SELECT * FROM students;

+------------+--------+--------+----------+-----+-------+
| student_id | name   | gender | class_id | age | score |
+------------+--------+--------+----------+-----+-------+
|          1 | David  | male   |        2 |  17 |    85 |
|          2 | Nancy  | female |        1 |  18 |    90 |
|          3 | Simon  | male   |        1 |  18 |    97 |
|          4 | Ronnie | male   |        2 |  20 |    59 |
|          5 | Alice  | female |        2 |  19 |    75 |
|          6 | Julia  | female |        1 |  16 |    80 |
|          7 | Dannie | male   |        2 |  20 |  NULL |
+------------+--------+--------+----------+-----+-------+
7 rows in set (0.001 sec)

WHERE

To filter the rows you want to display, a WHERE clause can be utilized. After the WHERE key word is a logical expression, stating the condition for a object to be satisfied if that row is to be displayed. You can compare it with the if clause in most programming languages.

1
2
3

SELECT *
FROM students
WHERE student_id=2;

+------------+-------+--------+----------+-----+-------+
| student_id | name  | gender | class_id | age | score |
+------------+-------+--------+----------+-----+-------+
|          2 | Nancy | female |        1 |  18 |    90 |
+------------+-------+--------+----------+-----+-------+
1 row in set (0.001 sec)

Logical operations in MySQL:

=
<>
>
<
<=
>=

Please pay attention that unlike other programming languages, in MySQL you don’t use == to represent the equality of two values, but simply =. The representations of those logical operations may vary among different database dialects (for example <> vs !=), but the key idea of setting conditions is the same.

It is not surprising that AND, OR, and NOT are supported to link multiple logical operations. Have a check for the priority of these operations, or use parentheses if you feel unsure.

Here list several useful special logical expressions in MySQL:

WHERE age BETWEEN 17 AND 19 means WHERE age >= 17 AND age <= 19.
WHERE age IN (16, 20) means WHERE age = 16 OR x = 20. The (16, 20) here is called an “expression list”, somewhat resembling a list in Python.
WHERE name LIKE '%nnie' means finding name ending with nnie. This syntax is to judge if a string fits a certain pattern, where % means zero, one or multiple characters, and _ means exactly one character. Learn more: Pattern Matching.
WHERE score IS NOT NULL is to judge if a value is NULL. The handling of NULL values should always be paid close attention to. See here and here for detailed information.

Try these syntaxes (combined with SELECT * FROM students, certainly) with your database.

ORDER BY

Considering its similarity with a spreadsheet, it is not surprising that MySQL supports some operations that you would otherwise do in a spreadsheet. Ordering rows is one of them.

ORDER BY does this stuff. You can sort the rows according to any column. You can specify whether you would like it to be in ascending or descending order. You can order the table according to multiple columns, the priority of each column given by the order you type them. These rules are really easy and intuitive.

1
2
3

SELECT student_id, name, age
FROM students
ORDER BY age DESC, student_id ASC;

+------------+--------+-----+
| student_id | name   | age |
+------------+--------+-----+
|          4 | Ronnie |  20 |
|          7 | Dannie |  20 |
|          5 | Alice  |  19 |
|          2 | Nancy  |  18 |
|          3 | Simon  |  18 |
|          1 | David  |  17 |
|          6 | Julia  |  16 |
+------------+--------+-----+
7 rows in set (0.001 sec)

What if you want a random order? Try ORDER BY Rand().

LIMIT

In real-life situations, the databases you encounter may have tens of thousands rows. It is usually useful to limit your output table to only a few rows to prevent a waste of system resources and, more importantly, your time. Add a LIMIT key word at the end of your query can achieve this effect, which can “cut” your output table to only some number of rows at the top.

1
2
3

SELECT * 
FROM students
LIMIT 3;

+------------+-------+--------+----------+-----+-------+
| student_id | name  | gender | class_id | age | score |
+------------+-------+--------+----------+-----+-------+
|          1 | David | male   |        2 |  17 |    85 |
|          2 | Nancy | female |        1 |  18 |    90 |
|          3 | Simon | male   |        1 |  18 |    97 |
+------------+-------+--------+----------+-----+-------+
3 rows in set (0.001 sec)

We can also specify the order so as to retrieve the three oldest students like this:

SELECT *
FROM students
ORDER BY age DESC
LIMIT 3;

+------------+--------+--------+----------+-----+-------+
| student_id | name   | gender | class_id | age | score |
+------------+--------+--------+----------+-----+-------+
|          4 | Ronnie | male   |        2 |  20 |    59 |
|          7 | Dannie | male   |        2 |  20 |  NULL |
|          5 | Alice  | female |        2 |  19 |    75 |
+------------+--------+--------+----------+-----+-------+
3 rows in set (0.001 sec)

To retrieve three random students (three samples of the table):

SELECT *  
FROM students 
ORDER BY Rand() 
LIMIT 3;

+------------+--------+--------+----------+-----+-------+
| student_id | name   | gender | class_id | age | score |
+------------+--------+--------+----------+-----+-------+
|          7 | Dannie | male   |        2 |  20 |  NULL |
|          1 | David  | male   |        2 |  17 |    85 |
|          6 | Julia  | female |        1 |  16 |    80 |
+------------+--------+--------+----------+-----+-------+
3 rows in set (0.001 sec)

UNION

To stack two queries together:

(SELECT * 
FROM students
ORDER BY score DESC
LIMIT 2)

UNION

(SELECT * 
FROM students
ORDER BY score ASC
LIMIT 2);

+------------+--------+--------+----------+-----+-------+
| student_id | name   | gender | class_id | age | score |
+------------+--------+--------+----------+-----+-------+
|          3 | Simon  | male   |        1 |  18 |    97 |
|          2 | Nancy  | female |        1 |  18 |    90 |
|          7 | Dannie | male   |        2 |  20 |  NULL |
|          4 | Ronnie | male   |        2 |  20 |    59 |
+------------+--------+--------+----------+-----+-------+
4 rows in set (0.000 sec)

Notice that duplicate rows in the result of a union are automatically eliminated.

Data Aggregation

So far, the output of our queries are only subsets of the initial table. We use SELECT to filter columns, WHERE to filter rows, ORDER BY to change the order. This means in our output table, each row still represents a single object that we named the initial table after.

However, there are occasions when we need to investigate the properties or attributes of a group of objects. For example, what is the total number of the objects in this table? This question is irrelevant to every single object, but is related to the group comprising all the objects in the table. This is why we introduce data aggregation, which that enables us to look at properties of a group of objects.

During data aggregation, what should be taken into consideration is two steps:

How do you divide the objects (rows) into different groups?
How do you calculate the aggregated value for each group? Or, you should find a function that turns a set of objects to a value.

In the aforementioned problem, the answer for question 1 is simple: all the rows go into one group. As for question 2, we count the number of objects in this group. Mathematically, we are calculating the size of the set consisting of these objects.

Let me give you another example. What if we want to know the average age for the two genders respectively? Then, the two steps would be:

Group students by gender.
For each group, calculate the average age value for all the objects.

We write the query like this:

1
2
3

SELECT AVG(age), gender
FROM students
GROUP BY gender;

+----------+--------+
| AVG(age) | gender |
+----------+--------+
|  17.6667 | female |
|  18.7500 | male   |
+----------+--------+
2 rows in set (0.000 sec)

You can easily detect the necessity of data aggregation in the example above, because the output table is concerned with the “average” property of gender groups of the students. The objects in the table are no longer “students” but “groups of students”.

On most occasions, we group the rows according to the values of one or more columns, which means in each group, those columns share the same value. Thus, it makes sense to say “the gender of this group of the students” in the last example. It is also reasonable to say “the average age of this group of students”, since “average” is a function applied on a group of objects, reflecting some properties of the group via a single value. However, “the age of this group of students” looks pretty absurd. Students in this group has different ages, and unless an aggregation function is provided, no one knows what should be the age of this entire group!

Another question: will you group by a unique key? Probably not. It would be useless since every group would only have one object. Meanwhile, the output table definitely has a unique key (or a primary key): those columns we group the table by. This may give you a better understanding of data aggregation: it turns non-unique columns to a unique key.

Therefore, the output table of data aggregation usually includes all the columns following GROUP BY, rows in each group having the same value on these columns. For any other columns that should appear in the table, an aggregation function must be provided.

Common aggregation functions include:

AVG(age): average
MAX(age): maximum
MIN(age): minimum
STDDEV(age): standard deviation
COUNT(age): number of values
COUNT(DISTINCT age): number of distinct values
COUNT(*): number of values, including NULL

Notice that except for COUNT(*), every function listed above excludes NULL values when computing.

DISTINCT

Literally, DISTINCT means removing duplicate rows in a column. However, you can understand this key word from another perspective, since, really,

1 2	SELECT DISTINCT gender, class_id FROM students;

+--------+----------+
| gender | class_id |
+--------+----------+
| male   |        2 |
| female |        1 |
| male   |        1 |
| female |        2 |
+--------+----------+
4 rows in set (0.001 sec)

is (almost) equivalent to

1
2
3

SELECT gender, class_id
FROM students
GROUP BY gender, class_id;

+--------+----------+
| gender | class_id |
+--------+----------+
| female |        1 |
| female |        2 |
| male   |        1 |
| male   |        2 |
+--------+----------+
4 rows in set (0.000 sec)

If you don’t care about the order. In fact, as you may find, GROUP BY automatically orders the results according to the columns, but DISTINCT doesn’t.

HAVING

After data aggregation, if you want to filter the groups, you can use HAVING. You can understand this as a group version of WHERE.

SELECT gender, class_id, AVG(age) 
FROM students 
GROUP BY gender, class_id 
HAVING AVG(age)>=18;

+--------+----------+----------+
| gender | class_id | AVG(age) |
+--------+----------+----------+
| female |        2 |  19.0000 |
| male   |        1 |  18.0000 |
| male   |        2 |  19.0000 |
+--------+----------+----------+
3 rows in set (0.001 sec)

Try to remove the HAVING clause in the last query and compare the different results.

Derived Columns and Alias

You can derive new columns from existing ones by operations including basic arithmetic operations and many MySQL built-in functions. You can name the derived columns using the AS key word.

1 2	SELECT student_id, name, age+2 AS age_two_years_later FROM students;

+------------+--------+---------------------+
| student_id | name   | age_two_years_later |
+------------+--------+---------------------+
|          1 | David  |                  19 |
|          2 | Nancy  |                  20 |
|          3 | Simon  |                  20 |
|          4 | Ronnie |                  22 |
|          5 | Alice  |                  21 |
|          6 | Julia  |                  18 |
|          7 | Dannie |                  22 |
+------------+--------+---------------------+
7 rows in set (0.001 sec)

The following example uses the built-in function SUBSTRING. Find more information here.

1 2	SELECT student_id, name, SUBSTRING(name, 1, 1) AS initial FROM students;

+------------+--------+---------+
| student_id | name   | initial |
+------------+--------+---------+
|          1 | David  | D       |
|          2 | Nancy  | N       |
|          3 | Simon  | S       |
|          4 | Ronnie | R       |
|          5 | Alice  | A       |
|          6 | Julia  | J       |
|          7 | Dannie | D       |
+------------+--------+---------+
7 rows in set (0.001 sec)

Notice that the new columns can still be seen as attributes of the original objects.

IF and CASE

When deriving columns, two operations deserve specially mentioning, which are IF and CASE.

1 2	SELECT student_id, score, IF(score >= 60, 'passed', 'failed') AS exam_result FROM students;

+------------+-------+-------------+
| student_id | score | exam_result |
+------------+-------+-------------+
|          1 |    85 | passed      |
|          2 |    90 | passed      |
|          3 |    97 | passed      |
|          4 |    59 | failed      |
|          5 |    75 | passed      |
|          6 |    80 | passed      |
|          7 |  NULL | failed      |
+------------+-------+-------------+
7 rows in set (0.001 sec)

SELECT student_id, score,
    CASE 
        WHEN score >= 90 THEN 'excellent'
        WHEN score >= 60 AND score <= 90 THEN 'passed'
        ELSE 'failed'
    END AS exam_result
FROM students;

+------------+-------+-------------+
| student_id | score | exam_result |
+------------+-------+-------------+
|          1 |    85 | passed      |
|          2 |    90 | excellent   |
|          3 |    97 | excellent   |
|          4 |    59 | failed      |
|          5 |    75 | passed      |
|          6 |    80 | passed      |
|          7 |  NULL | failed      |
+------------+-------+-------------+
7 rows in set (0.001 sec)

If you have programming experience, understanding this two functions would not be hard.

Remember that GROUP BY can only group rows according to the value of certain columns? If you directly use GROUP BY score, since the scores spread over a relatively large interval and can take many values, the result table will not usually of much help. However, with the use of IF and CASE, you can derive a new column which only contains a small number of distinct values, and thus achieve the effect of what is called data binning.

SELECT COUNT(*) AS number_of_students,
    CASE 
        WHEN score >= 90 THEN 'excellent'
        WHEN score >= 60 AND score <= 90 THEN 'passed'
        ELSE 'failed'
    END AS exam_result
FROM students
GROUP BY exam_result;

+--------------------+-------------+
| number_of_students | exam_result |
+--------------------+-------------+
|                  2 | excellent   |
|                  2 | failed      |
|                  3 | passed      |
+--------------------+-------------+
3 rows in set (0.001 sec)

As an example, this query retrieves the number of students having each exam results respectively.

Relationships between Tables and Foreign Keys

When we talk about relationships between tables, actually we are talking about relationships between the objects that the tables represent. These relationships are usually not without realistic meanings. For example, considering the enrollment of a student in a class, a relationship is thus formed between the students table and the class table.

There can be three kinds of relationships between tables, and accordingly different ways to represent them in the database:

One-one

That would be the easiest situation, but it is rare. You can use a column in either of the tables to represent objects in another table. This column becomes a foreign key.

Many-one

If a student can enroll in only one class, but a class can have many students, that would be the case. You can use a column in a student table to represent the class they are in (that would be a foreign key), but you can’t do it the other way around. You can’t use a column in the class table to represent the students enrolling in the class, because that would lead to duplicate class rows in the table, which conflicts with our definition of a primary key. You may be tempted to think that problems can be solved if the new column is an array-like object (for example a list in Python or a vector in C++), but it doesn’t work here in MySQL.

Now consider our students table, which already has a class_id column. Suppose now we give each class a name. To store this piece of information, we could add another column in the students table, but it would be redundant since a class_id appears multiple times in the students table. It worth noting that the very “class” of objects we are considering has been changed from students to classes, and hence it is necessary we create another table named classes.

CREATE TABLE classes (
	class_id int NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    PRIMARY KEY (class_id)
)

Load data into the new table:

INSERT INTO classes (name)
VALUES 
    ('Class A'),
    ('Class B'),
    ('Class C');

We’ll talk about how to join these tables together later.

Many-many

Now let’s suppose the school also has several organizations. A student can enroll in as many organizations as he wants, and of course an organization can have multiple students. That fits what is called a many-many relationship. In this case, you can’t simply add a column in either of the tables. Instead, you have to create a new table, probably named org_enrollment, to represent the relationship between students and organizations. In this table, the primary key would be the combination of student_id and organization_id, which are both foreign keys.

Can that org_enrollment table have other columns other than the primary keys? Yes. Suppose in each organization a student has a particular position of either “manager” or “participant”, this piece of data, being an attribute of an enrollment, should be stored in that org_enrollment table.

Let’s firstly create an orgs table and load data into it:

CREATE TABLE orgs (
	org_id INT NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    PRIMARY KEY (org_id)
);

INSERT INTO orgs (name)
VALUES
    ("Students' Union"),
    ('Youth Volunteer Org'),
    ('Enactus');

Then the org_enrollment table:

CREATE TABLE org_enrollment (
	org_id INT NOT NULL,
    student_id INT NOT NULL,
    position varchar(255) DEFAULT "participant",
    PRIMARY KEY (org_id, student_id)
);

INSERT INTO 
    org_enrollment (org_id, student_id, position)
VALUES
    (1, 1, "participant"),
    (2, 1, 'manager'),
    (1, 2, "participant"),
    (3, 2, 'manager'),
    (2, 3, "participant"),
    (1, 4, 'manager'),
    (2, 4, "participant"),
    (3, 5, "participant");

Clearly, you can regard one-one and many-one as special cases of many-many, and create a new table for one-one or many-one relationships. That would usually be unnecessary, unless you have good reasons, for example when the relationship has many its own attributes or the isolation of this table can help prevent confusion. Additionally, the examples above help explain the concept of a foreign key: a column in one table to represent objects in another table. Notice that to represent an object in another table without confusion, a foreign key is always the primary key in the other table. A foreign key can, but does not have to be declared when creating the table. It only comes into effect when you are performing the JOIN operation.

JOIN

With the foreign key, a shared column of two tables, being the “bridge”, two related tables can be joined to assemble data together.

For example, if we want to know the name of the class that each student is in, we can write:

1
2
3

SELECT s.student_id, s.name AS student_name, s.class_id, c.name AS class_name
FROM students s JOIN classes c
    ON s.class_id=c.class_id;

+------------+--------------+----------+------------+
| student_id | student_name | class_id | class_name |
+------------+--------------+----------+------------+
|          1 | David        |        2 | Class B    |
|          2 | Nancy        |        1 | Class A    |
|          3 | Simon        |        1 | Class A    |
|          4 | Ronnie       |        2 | Class B    |
|          5 | Alice        |        2 | Class B    |
|          6 | Julia        |        1 | Class A    |
|          7 | Dannie       |        2 | Class B    |
+------------+--------------+----------+------------+
7 rows in set (0.000 sec)

Notice that we give an alias to the two tables so that we can refer to their columns later. After ON we point out the condition for the join to be satisfied, that is, the value of the foreign key (class_id) in each table should be equal (thus representing the same class).

To see clearly what I mean, try to remove the ON clause and tweak the query a little bit, like this:

1 2	SELECT s.student_id, s.class_id AS s_class_id, c.class_id, c.name AS class_name FROM students s JOIN classes c;

+------------+------------+----------+------------+
| student_id | s_class_id | class_id | class_name |
+------------+------------+----------+------------+
|          1 |          2 |        1 | Class A    |
|          1 |          2 |        2 | Class B    |
|          1 |          2 |        3 | Class C    |
|          2 |          1 |        1 | Class A    |
|          2 |          1 |        2 | Class B    |
|          2 |          1 |        3 | Class C    |
|          3 |          1 |        1 | Class A    |
|          3 |          1 |        2 | Class B    |
|          3 |          1 |        3 | Class C    |
|          4 |          2 |        1 | Class A    |
|          4 |          2 |        2 | Class B    |
|          4 |          2 |        3 | Class C    |
|          5 |          2 |        1 | Class A    |
|          5 |          2 |        2 | Class B    |
|          5 |          2 |        3 | Class C    |
|          6 |          1 |        1 | Class A    |
|          6 |          1 |        2 | Class B    |
|          6 |          1 |        3 | Class C    |
|          7 |          2 |        1 | Class A    |
|          7 |          2 |        2 | Class B    |
|          7 |          2 |        3 | Class C    |
+------------+------------+----------+------------+
21 rows in set (0.000 sec)

You get a Cartesian product of these two tables, that is, every possible pair of rows from the two tables. This table doesn’t make much sense, because of the lack of realistic meaning of the row pairs whose class_ids from the students table and the classes table are not equal. This explains the necessity of the ON clause.

You can see from the last example that a foreign key is somewhat like a seed in a table. When joining two tables, the seed grows into a complete plant, showing all the information of the objects stored in the other table.

Inner, Left and Right Join

JOIN by default in MySQL means inner join. That is, the result table would only contain values that both tables have on that shared column (the foreign key). What if we want to see all the classes in the result even if some classes don’t have a student? Use a left join with classes table placing on the left…

1
2
3

SELECT c.name AS class_name, c.class_id, s.student_id, s.name AS student_name
FROM classes c LEFT JOIN students s
ON c.class_id=s.class_id;

+------------+----------+------------+--------------+
| class_name | class_id | student_id | student_name |
+------------+----------+------------+--------------+
| Class B    |        2 |          1 | David        |
| Class A    |        1 |          2 | Nancy        |
| Class A    |        1 |          3 | Simon        |
| Class B    |        2 |          4 | Ronnie       |
| Class B    |        2 |          5 | Alice        |
| Class A    |        1 |          6 | Julia        |
| Class B    |        2 |          7 | Dannie       |
| Class C    |        3 |       NULL | NULL         |
+------------+----------+------------+--------------+
8 rows in set (0.001 sec)

… Or use a right join with classes table placing on the right.

1
2
3

SELECT c.name AS class_name, c.class_id, s.student_id, s.name AS student_name
FROM students s RIGHT JOIN classes c
ON c.class_id=s.class_id;

The result is the same.

Join Multiple Tables

We have at least three tables to handle a many-many relationship. This is an example of how to join them together.

SELECT o.name as org_name, s.name as student_name, oe.position
FROM org_enrollment oe
JOIN students s ON oe.student_id=s.student_id
JOIN orgs o ON oe.org_id=o.org_id;

+---------------------+--------------+-------------+
| org_name            | student_name | position    |
+---------------------+--------------+-------------+
| Students' Union     | David        | participant |
| Students' Union     | Nancy        | participant |
| Students' Union     | Ronnie       | manager     |
| Youth Volunteer Org | David        | manager     |
| Youth Volunteer Org | Simon        | participant |
| Youth Volunteer Org | Ronnie       | participant |
| Enactus             | Nancy        | manager     |
| Enactus             | Alice        | participant |
+---------------------+--------------+-------------+
8 rows in set (0.001 sec)

This query is quite self-explanatory. Just pay attention to the syntax.

Subquery

You may want to use the result of one query in another query. In that case, the first query can be used as a subquery of the second one.

Single Numeric Value

Sometimes the subquery gives you a single numeric value. For example, what if you want to know the number of the students who are older than the average age of all? You may retrieve the average age first, like this:

1 2	SELECT AVG(age) FROM students;

+----------+
| AVG(age) |
+----------+
|  18.2857 |
+----------+
1 row in set (0.000 sec)

Then to count students older than average, we would write:

1
2
3

SELECT COUNT(*)
FROM students
WHERE age>

> what? The result of the first query. Don’t you copy and paste the value 18.2857 after the >. This could work, but that query would not be reusable if the students table undergoes modification in the future.

If you regard the first query as that single value and embed it into the latter query, you’ll get something like this:

SELECT COUNT(*)
FROM students
WHERE age>(
	SELECT AVG(age)
    FROM students
);

+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+
1 row in set (0.004 sec)

That’s how you write a subquery. Do enclose the subquery with parentheses.

Derived Table

Sometimes the result of the subquery is a table, called a derived table.

For instance, try to calculate the average score of students who are participants in at least one organization. We do this in three steps.

First, we ask: which students are participants in at least one organization?

1
2
3

SELECT DISTINCT student_id
FROM org_enrollment
WHERE position='participant';

+------------+
| student_id |
+------------+
|          1 |
|          2 |
|          3 |
|          4 |
|          5 |
+------------+
5 rows in set (0.001 sec)

Then join this derived table with the students table:

SELECT s.student_id, s.score
FROM (
    SELECT DISTINCT student_id
    FROM org_enrollment
    WHERE position='participant'
) AS p JOIN students s
ON p.student_id = s.student_id;

+------------+-------+
| student_id | score |
+------------+-------+
|          1 |    85 |
|          2 |    90 |
|          3 |    97 |
|          4 |    59 |
|          5 |    75 |
+------------+-------+
5 rows in set (0.001 sec)

Notice that it is mandatory to give the derived table an alias. You have to keep it in mind what class of objects the derived table represents. In the example above, it is participants of organizations (a special group of students). You could name the derived table as participants, but here we use p for short.

The last step is data aggregation:

SELECT AVG(s.score)
FROM (
    SELECT DISTINCT student_id
    FROM org_enrollment
    WHERE position='participant'
) AS p JOIN students s
ON p.student_id = s.student_id;

+--------------+
| AVG(s.score) |
+--------------+
|      81.2000 |
+--------------+
1 row in set (0.001 sec)

IN and EXISTS

Special operators that can be used in combination with a subquery include IN and EXISTS.

You can place a subquery whose result table has only one column after IN.

One join query we used earlier can be paraphrased to:

SELECT student_id, score
FROM students
WHERE student_id IN (
    SELECT DISTINCT student_id
    FROM org_enrollment
    WHERE position='participant'
);

As for EXISTS, it tests whether the derived table is not empty, or has rows. Take a look at this example:

SELECT *
FROM classes c
WHERE NOT EXISTS (
    SELECT *
    FROM students s
    WHERE s.class_id=c.class_id
);

+----------+---------+
| class_id | name    |
+----------+---------+
|        3 | Class C |
+----------+---------+
1 row in set (0.001 sec)

To understand this query: for every row in classes table, try to select rows in students table that have the same class_id. These selected rows form a derived table. If the derived table does not exist, then this row in classes table satisfies the condition.

You may already find something unusual. The subquery here is not independent of the main query, since c, not declared in the subquery, is somehow used. This is called a “correlated subquery”.

Try to remove the NOT in the last query and see the result.

Conclusion

Believe it or not, that’s almost everything you need to understand to write MySQL queries to retrieve data. To fill the gap indicated by the word “almost”, MySQL documentation is always reliable. Congrats, but don’t you forget that understanding is one thing, yet practice is another.