This article is a review of what I learned in the Coursera course Managing Big Data with MySQL.
Admittedly, not a small part of the emphasis (and difficulty) of this course is on knowledge beyond MySQL itself, that is, the ability to apply the MySQL skills to real-life business problems. Nevertheless, this article will mainly focus on MySQL itself and my understanding of this rational database manage system.
If you want to learn MySQL from scratch by reading this article, it assumes you have intermediate experience in programming, especially in object-oriented programming, and only serves as a tutorial that gives you the basic (and hopefully intuitive) understanding of the concept of MySQL and the usage of several frequently used MySQL queries. If you are curious about more details or have any problems when implementing your queries, you are encouraged to refer to MySQL official documents and other tutorials.
Database and Table
What is a database? A database is a collection of one or more interconnected tables. Then what is a table? If you have experience with object-oriented programming, you can understand the concept of a table quite with ease. (If you don’t, just remember a class is a kind of things, and an object is a thing of this kind.) Suppose you have a class, where an object has several attributes (but does not have methods). When you generate multiple objects of this class and store one object on a row, with a column being an attribute of this class, you get a table. Honestly, this is just how some Python database frameworks (for example flask-sqlalchemy
) consider a rational database table. To put it simply, a table is multiple objects of the same class.
You may find a relational database is somewhat like a spreadsheet. In fact, according what we’ve said, any relational database can be stored in a spreadsheet, but the reverse is certainly not true.
Keys
Primary Keys
When in real object-oriented programming, an object would usually be represented as a variable. It goes without saying that in (almost?) any programming language, within one naming scope, any variable should have a unique name to prevent confusion with other variables. It is the same in MySQL. An object, or a row, in the table should be able to be distinguished from other objects (or rows) so that many further operations could be applied.
That is why we have what is called a ‘primary key’ (key is a fancy synonym to attribute, or column). A primary key is the attribute that distinguish one object from others. Clearly, if a column is chosen as the primary key, no duplicate rows should appear in this column. Usually you can use an id number as the primary key, just as in other programming languages you use an index to refer to elements in an array or list.
However, sometimes you can use multiple keys together as the primary key. For example, if your class is a room which can be located by a building number and a room number in the building, the building number and the room number can be used together as the primary key. There may be multiple rooms in the same building, and there may be rooms sharing the same room number in different buildings, but typically no two rooms in the same building have the same room number.
Unique Keys
After being familiar with the concept of primary key, it would be easy to guess what a unique key is. Yes, it is just what it literally means. Clearly a primary key must be a unique key. Apart from the primary key, you can set any key (or combination of multiple keys) in your table as a unique key. This can act as an restriction on your data, preventing potential mistakes.
Other Constraints
In real implementation, other constraints may be imposed on keys when creating a table, which you can look up here. To name but a few frequently used ones:
- A key should be of a certain type, for example
int
,float
,char
,varchar
,date
, etc., like you may have seen in some programming languages, that a variable should be assigned with an unchangeable type specified. - A key can have a default value.
- You can explicitly forbid a key from having
NULL
values. - An
AUTO_INCREMENT
key has auto increment values, usually used as id numbers.
Installation
MySQL document has detailed instructions on how to install a MySQL service on your computer. If you have any trouble, ask Google.
If you are using Arch Linux like I do, please note that you should install mariadb
instead of mysql
:
1 | sudo pacman -S mariadb |
For other configurations on Arch Linux, you can refer to this article.
Running MySQL in Command Line
I hope you are familiar with the command line, where you can launch your MySQL service and start querying. If you don’t, Google.
MySQL Queries
Creating Databases
Creating a database is easy. You give it a name, and it’s done. A database is like a container of tables. Just remember to use the database when you want to operate on the tables in it.
1 | CREATE DATABASE school; |
In real life you may have to configure the permission to use the database for others if you are the MySQL administrator.
To display all the databases we have:
1 | SHOW DATABASES; |
1 | +--------------------+ |
Apart from the newly created database, MySQL has three databases by default.
Creating Tables
A table, as mentioned, is a list of objects of the same class. When creating a table, you have to name the table, usually after the name of the class. Then you list all the keys you are interested in and impose relevant constraints on each of them.
In our example, let’s first create a students
table using the snippet below:
1 | CREATE TABLE students ( |
To display all the tables we have:
1 | SHOW TABLES; |
1 | +------------------+ |
To display the settings of a table:
1 | DESCRIBE students; |
1 | +------------+--------------+------+-----+---------+----------------+ |
You can alter the settings using ALTER
.
Loading Data
Now you can load data into the table, or in other words, creating objects of the class. For each object, you have specify the values for the attributes that you listed when creating the table. When you leave out certain attributes, the entries in the table would be NULL
(which we’ll discuss later), or the default value that you set in advance.
1 | INSERT INTO students (name, class_id, gender, age, score) |
You may also insert some other (imaginary) students into this table to fulfill it, if you want to. Now this database only contains seven rows in order to facilitate your understanding of basic MySQL queries. In real-life situations, you may get tons of data in your database.
You can also DELETE and UPDATE values in a table.
SELECT
For a table loaded with data, what can be more exciting than to select data from it? That accounts for the fact that each and every query below starts with the key word SELECT
.
When you want to see some of the columns of your table, you can list the column names after the SELECT key word, separated by commas. The key word FROM
is used to inform MySQL which table you are talking about, which is also a must for all queries you are going to write.
1 | SELECT student_id, name, gender |
1 | +------------+--------+--------+ |
Most of the time you want to include a unique key in the columns you select so that you can distinguish all the objects from each other.
When you want to select all the columns in the table, use SELECT *
:
1 | SELECT * |
1 | +------------+--------+--------+----------+-----+-------+ |
WHERE
To filter the rows you want to display, a WHERE
clause can be utilized. After the WHERE
key word is a logical expression, stating the condition for a object to be satisfied if that row is to be displayed. You can compare it with the if
clause in most programming languages.
1 | SELECT * |
1 | +------------+-------+--------+----------+-----+-------+ |
Logical operations in MySQL:
=
<>
>
<
<=
>=
Please pay attention that unlike other programming languages, in MySQL you don’t use ==
to represent the equality of two values, but simply =
. The representations of those logical operations may vary among different database dialects (for example <>
vs !=
), but the key idea of setting conditions is the same.
It is not surprising that AND
, OR
, and NOT
are supported to link multiple logical operations. Have a check for the priority of these operations, or use parentheses if you feel unsure.
Here list several useful special logical expressions in MySQL:
WHERE age BETWEEN 17 AND 19
meansWHERE age >= 17 AND age <= 19
.WHERE age IN (16, 20)
meansWHERE age = 16 OR x = 20
. The(16, 20)
here is called an “expression list”, somewhat resembling a list in Python.WHERE name LIKE '%nnie'
means finding name ending withnnie
. This syntax is to judge if a string fits a certain pattern, where%
means zero, one or multiple characters, and_
means exactly one character. Learn more: Pattern Matching.WHERE score IS NOT NULL
is to judge if a value is NULL. The handling ofNULL
values should always be paid close attention to. See here and here for detailed information.
Try these syntaxes (combined with SELECT * FROM students
, certainly) with your database.
ORDER BY
Considering its similarity with a spreadsheet, it is not surprising that MySQL supports some operations that you would otherwise do in a spreadsheet. Ordering rows is one of them.
ORDER BY
does this stuff. You can sort the rows according to any column. You can specify whether you would like it to be in ascending or descending order. You can order the table according to multiple columns, the priority of each column given by the order you type them. These rules are really easy and intuitive.
1 | SELECT student_id, name, age |
1 | +------------+--------+-----+ |
What if you want a random order? Try ORDER BY Rand()
.
LIMIT
In real-life situations, the databases you encounter may have tens of thousands rows. It is usually useful to limit your output table to only a few rows to prevent a waste of system resources and, more importantly, your time. Add a LIMIT
key word at the end of your query can achieve this effect, which can “cut” your output table to only some number of rows at the top.
1 | SELECT * |
1 | +------------+-------+--------+----------+-----+-------+ |
We can also specify the order so as to retrieve the three oldest students like this:
1 | SELECT * |
1 | +------------+--------+--------+----------+-----+-------+ |
To retrieve three random students (three samples of the table):
1 | SELECT * |
1 | +------------+--------+--------+----------+-----+-------+ |
UNION
To stack two queries together:
1 | (SELECT * |
1 | +------------+--------+--------+----------+-----+-------+ |
Notice that duplicate rows in the result of a union are automatically eliminated.
Data Aggregation
So far, the output of our queries are only subsets of the initial table. We use SELECT to filter columns, WHERE to filter rows, ORDER BY to change the order. This means in our output table, each row still represents a single object that we named the initial table after.
However, there are occasions when we need to investigate the properties or attributes of a group of objects. For example, what is the total number of the objects in this table? This question is irrelevant to every single object, but is related to the group comprising all the objects in the table. This is why we introduce data aggregation, which that enables us to look at properties of a group of objects.
During data aggregation, what should be taken into consideration is two steps:
- How do you divide the objects (rows) into different groups?
- How do you calculate the aggregated value for each group? Or, you should find a function that turns a set of objects to a value.
In the aforementioned problem, the answer for question 1 is simple: all the rows go into one group. As for question 2, we count the number of objects in this group. Mathematically, we are calculating the size of the set consisting of these objects.
Let me give you another example. What if we want to know the average age for the two genders respectively? Then, the two steps would be:
- Group students by gender.
- For each group, calculate the average age value for all the objects.
We write the query like this:
1 | SELECT AVG(age), gender |
1 | +----------+--------+ |
You can easily detect the necessity of data aggregation in the example above, because the output table is concerned with the “average” property of gender groups of the students. The objects in the table are no longer “students” but “groups of students”.
On most occasions, we group the rows according to the values of one or more columns, which means in each group, those columns share the same value. Thus, it makes sense to say “the gender of this group of the students” in the last example. It is also reasonable to say “the average age of this group of students”, since “average” is a function applied on a group of objects, reflecting some properties of the group via a single value. However, “the age of this group of students” looks pretty absurd. Students in this group has different ages, and unless an aggregation function is provided, no one knows what should be the age of this entire group!
Another question: will you group by a unique key? Probably not. It would be useless since every group would only have one object. Meanwhile, the output table definitely has a unique key (or a primary key): those columns we group the table by. This may give you a better understanding of data aggregation: it turns non-unique columns to a unique key.
Therefore, the output table of data aggregation usually includes all the columns following GROUP BY
, rows in each group having the same value on these columns. For any other columns that should appear in the table, an aggregation function must be provided.
Common aggregation functions include:
AVG(age)
: averageMAX(age)
: maximumMIN(age)
: minimumSTDDEV(age)
: standard deviationCOUNT(age)
: number of valuesCOUNT(DISTINCT age)
: number of distinct valuesCOUNT(*)
: number of values, includingNULL
Notice that except for COUNT(*)
, every function listed above excludes NULL
values when computing.
DISTINCT
Literally, DISTINCT
means removing duplicate rows in a column. However, you can understand this key word from another perspective, since, really,
1 | SELECT DISTINCT gender, class_id |
1 | +--------+----------+ |
is (almost) equivalent to
1 | SELECT gender, class_id |
1 | +--------+----------+ |
If you don’t care about the order. In fact, as you may find, GROUP BY
automatically orders the results according to the columns, but DISTINCT
doesn’t.
HAVING
After data aggregation, if you want to filter the groups, you can use HAVING. You can understand this as a group version of WHERE.
1 | SELECT gender, class_id, AVG(age) |
1 | +--------+----------+----------+ |
Try to remove the HAVING
clause in the last query and compare the different results.
Derived Columns and Alias
You can derive new columns from existing ones by operations including basic arithmetic operations and many MySQL built-in functions. You can name the derived columns using the AS
key word.
1 | SELECT student_id, name, age+2 AS age_two_years_later |
1 | +------------+--------+---------------------+ |
The following example uses the built-in function SUBSTRING. Find more information here.
1 | SELECT student_id, name, SUBSTRING(name, 1, 1) AS initial |
1 | +------------+--------+---------+ |
Notice that the new columns can still be seen as attributes of the original objects.
IF and CASE
When deriving columns, two operations deserve specially mentioning, which are IF and CASE.
1 | SELECT student_id, score, IF(score >= 60, 'passed', 'failed') AS exam_result |
1 | +------------+-------+-------------+ |
1 | SELECT student_id, score, |
1 | +------------+-------+-------------+ |
If you have programming experience, understanding this two functions would not be hard.
Remember that GROUP BY
can only group rows according to the value of certain columns? If you directly use GROUP BY score
, since the scores spread over a relatively large interval and can take many values, the result table will not usually of much help. However, with the use of IF
and CASE
, you can derive a new column which only contains a small number of distinct values, and thus achieve the effect of what is called data binning.
1 | SELECT COUNT(*) AS number_of_students, |
1 | +--------------------+-------------+ |
As an example, this query retrieves the number of students having each exam results respectively.
Relationships between Tables and Foreign Keys
When we talk about relationships between tables, actually we are talking about relationships between the objects that the tables represent. These relationships are usually not without realistic meanings. For example, considering the enrollment of a student in a class, a relationship is thus formed between the students table and the class table.
There can be three kinds of relationships between tables, and accordingly different ways to represent them in the database:
One-one
That would be the easiest situation, but it is rare. You can use a column in either of the tables to represent objects in another table. This column becomes a foreign key.
Many-one
If a student can enroll in only one class, but a class can have many students, that would be the case. You can use a column in a student table to represent the class they are in (that would be a foreign key), but you can’t do it the other way around. You can’t use a column in the class table to represent the students enrolling in the class, because that would lead to duplicate class rows in the table, which conflicts with our definition of a primary key. You may be tempted to think that problems can be solved if the new column is an array-like object (for example a list
in Python or a vector
in C++), but it doesn’t work here in MySQL.
Now consider our students
table, which already has a class_id
column. Suppose now we give each class a name. To store this piece of information, we could add another column in the students
table, but it would be redundant since a class_id
appears multiple times in the students
table. It worth noting that the very “class” of objects we are considering has been changed from students to classes, and hence it is necessary we create another table named classes
.
1 | CREATE TABLE classes ( |
Load data into the new table:
1 | INSERT INTO classes (name) |
We’ll talk about how to join these tables together later.
Many-many
Now let’s suppose the school also has several organizations. A student can enroll in as many organizations as he wants, and of course an organization can have multiple students. That fits what is called a many-many relationship. In this case, you can’t simply add a column in either of the tables. Instead, you have to create a new table, probably named org_enrollment
, to represent the relationship between students and organizations. In this table, the primary key would be the combination of student_id
and organization_id
, which are both foreign keys.
Can that org_enrollment
table have other columns other than the primary keys? Yes. Suppose in each organization a student has a particular position of either “manager” or “participant”, this piece of data, being an attribute of an enrollment, should be stored in that org_enrollment
table.
Let’s firstly create an orgs
table and load data into it:
1 | CREATE TABLE orgs ( |
Then the org_enrollment
table:
1 | CREATE TABLE org_enrollment ( |
Clearly, you can regard one-one and many-one as special cases of many-many, and create a new table for one-one or many-one relationships. That would usually be unnecessary, unless you have good reasons, for example when the relationship has many its own attributes or the isolation of this table can help prevent confusion. Additionally, the examples above help explain the concept of a foreign key: a column in one table to represent objects in another table. Notice that to represent an object in another table without confusion, a foreign key is always the primary key in the other table. A foreign key can, but does not have to be declared when creating the table. It only comes into effect when you are performing the JOIN operation.
JOIN
With the foreign key, a shared column of two tables, being the “bridge”, two related tables can be joined to assemble data together.
For example, if we want to know the name of the class that each student is in, we can write:
1 | SELECT s.student_id, s.name AS student_name, s.class_id, c.name AS class_name |
1 | +------------+--------------+----------+------------+ |
Notice that we give an alias to the two tables so that we can refer to their columns later. After ON
we point out the condition for the join to be satisfied, that is, the value of the foreign key (class_id) in each table should be equal (thus representing the same class).
To see clearly what I mean, try to remove the ON
clause and tweak the query a little bit, like this:
1 | SELECT s.student_id, s.class_id AS s_class_id, c.class_id, c.name AS class_name |
1 | +------------+------------+----------+------------+ |
You get a Cartesian product of these two tables, that is, every possible pair of rows from the two tables. This table doesn’t make much sense, because of the lack of realistic meaning of the row pairs whose class_id
s from the students
table and the classes
table are not equal. This explains the necessity of the ON
clause.
You can see from the last example that a foreign key is somewhat like a seed in a table. When joining two tables, the seed grows into a complete plant, showing all the information of the objects stored in the other table.
Inner, Left and Right Join
JOIN
by default in MySQL means inner join. That is, the result table would only contain values that both tables have on that shared column (the foreign key). What if we want to see all the classes in the result even if some classes don’t have a student? Use a left join with classes table placing on the left…
1 | SELECT c.name AS class_name, c.class_id, s.student_id, s.name AS student_name |
1 | +------------+----------+------------+--------------+ |
… Or use a right join with classes table placing on the right.
1 | SELECT c.name AS class_name, c.class_id, s.student_id, s.name AS student_name |
The result is the same.
Join Multiple Tables
We have at least three tables to handle a many-many relationship. This is an example of how to join them together.
1 | SELECT o.name as org_name, s.name as student_name, oe.position |
1 | +---------------------+--------------+-------------+ |
This query is quite self-explanatory. Just pay attention to the syntax.
Subquery
You may want to use the result of one query in another query. In that case, the first query can be used as a subquery of the second one.
Single Numeric Value
Sometimes the subquery gives you a single numeric value. For example, what if you want to know the number of the students who are older than the average age of all? You may retrieve the average age first, like this:
1 | SELECT AVG(age) |
1 | +----------+ |
Then to count students older than average, we would write:
1 | SELECT COUNT(*) |
>
what? The result of the first query. Don’t you copy and paste the value 18.2857 after the >
. This could work, but that query would not be reusable if the students table undergoes modification in the future.
If you regard the first query as that single value and embed it into the latter query, you’ll get something like this:
1 | SELECT COUNT(*) |
1 | +----------+ |
That’s how you write a subquery. Do enclose the subquery with parentheses.
Derived Table
Sometimes the result of the subquery is a table, called a derived table.
For instance, try to calculate the average score of students who are participants in at least one organization. We do this in three steps.
First, we ask: which students are participants in at least one organization?
1 | SELECT DISTINCT student_id |
1 | +------------+ |
Then join this derived table with the students table:
1 | SELECT s.student_id, s.score |
1 | +------------+-------+ |
Notice that it is mandatory to give the derived table an alias. You have to keep it in mind what class of objects the derived table represents. In the example above, it is participants of organizations (a special group of students). You could name the derived table as participants
, but here we use p
for short.
The last step is data aggregation:
1 | SELECT AVG(s.score) |
1 | +--------------+ |
IN and EXISTS
Special operators that can be used in combination with a subquery include IN
and EXISTS
.
You can place a subquery whose result table has only one column after IN
.
One join query we used earlier can be paraphrased to:
1 | SELECT student_id, score |
As for EXISTS
, it tests whether the derived table is not empty, or has rows. Take a look at this example:
1 | SELECT * |
1 | +----------+---------+ |
To understand this query: for every row in classes
table, try to select rows in students
table that have the same class_id. These selected rows form a derived table. If the derived table does not exist, then this row in classes
table satisfies the condition.
You may already find something unusual. The subquery here is not independent of the main query, since c
, not declared in the subquery, is somehow used. This is called a “correlated subquery”.
Try to remove the NOT
in the last query and see the result.
Conclusion
Believe it or not, that’s almost everything you need to understand to write MySQL queries to retrieve data. To fill the gap indicated by the word “almost”, MySQL documentation is always reliable. Congrats, but don’t you forget that understanding is one thing, yet practice is another.