Basic & Intermediate SQL for Data Analysis
Structured Query Language (SQL) is a programming language that is used to interact with the relational database management system (RDBMS). SQL is used on all of the major RDBMS but certain instructions might work on one RDBMS and they might not work on another. You might need to do things a little bit differently depending on the RDBMS you used. However, it’s all basically the same for most of the parts but the implementation may vary just a little bit different.
For demonstration, I will use the Sakila database which is readily available once on MySQL. It is a schema modeling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals. If you are not using MySQL, you can download it directly from the following link: https://downloads.mysql.com/docs/sakila-db.zip.
Query 1 (Retrieve all rows that satisfy multiple conditions): Find all commentaries or deleted scenes that require a replacement cost of more than $25, have a length of more than 150 minutes, and allow a rental duration of 7 days.
SELECT
→ allows us to specify the columns we want to select. Using the character*
in SQL will return every column for the table specified.FROM
→ allows us to specify the table that we want to retrieve the data.WHERE
→ only returns rows that fulfill the specified conditions; usually used withAND
,OR
,BETWEEN
,IN
,LIKE
, parentheses, and other common operators such as =, <, >, <=, ≥=, !, and <>. The presence of parentheses causes conditions within them to be evaluated together.
Query 2 (Referencing an aliased column & limiting the number of rows returned): Select 5 random records of deleted scenes and only return the columns of ‘title’, ‘rating’, and ‘special_features’ but rename these columns to ‘film_title’, ‘film_rating’ and ‘film_types’ respectively.
Placing the query that gives aliases in the FROM
clause allows us to reference the aliased columns in the WHERE
clause because the FROM
clause is evaluated before the WHERE
clause. The results from the query in the FROM
clause are generated first, so the WHERE
clause “senses” the alias names. The inline view in this solution is aliased x.
When you specify a RAND()
function in the ORDER BY
clause, the sort is performed on the result from the RAND()
function as it is evaluated for each row in the table. The LIMIT
clause picks the first N row in the result set sorted randomly.
Query 3 (Concatenating column values & returning query results in a specified order): Find the ‘full_name’ for each actor and its respective frequency of occurrence. Sort the result by the new column ‘frequency’ descendingly, and ‘last_name’ ascendingly. Remember to include the ‘first_name’ and ‘last_name’ in the result. Only shows the first 5 rows. Finally, find out the ‘actor_id’ for the actor(s) whose name appears more than once.
In this case, we used the CONCAT()
function to add ‘first_name’ and ‘last_name’ together, but we separated them with a space character. The order of precedence in ORDER BY is from left to right. By default, the ORDER BY
clause will sort the rows of your result set in ascending order, so the ASC
clause is therefore optional. Alternatively, specify DESC
to sort in descending order.
In conclusion, there are two Susan Davis entries in the list, distinguished by actor_id 101 and 110, respectively. 👯♀️
Query 4 (Using conditional logic & searching for patterns): Produce a result set such that if a film has a length of 120 minutes or more, a message of “LONG” is returned; otherwise, return “NORMAL”. However, only return the result in which the title of the film begins with “VIR” or the title contains “TAPE” somewhere.
The CASE
expression allows us to write a series of cases and provide an alias on values returned by a query to reach a more understandable result. The WHEN
statement allows us to set different conditions and the THEN
statement connects the outputs accordingly. We can have as many WHEN
statements as possible given the number of conditions we are filtering by. The ELSE
clause is optional. The CASE
expression will return NULL
for any row that does not satisfy the test condition if there is no ELSE
clause (see images below👇). The END
statement will finalize and create a new column name based on the different cases that classified the different outputs.
To find a film title that begins with “VIR” or a film title that contains “TAPE” somewhere, we can use the LIKE
statement and the percent (%) operator to match any sequence of characters. The percent (%) operator is the wildcard character and it can be placed at the front, middle, or end of a value.
VIR%
means it will look for strings that start with “VIR”. If the requirement is to search for all titles ending with “VIR,” then we can use %VIR
. To find any string that contains “TAPE” at any position, we can enclose the search pattern “TAPE” with % operators.
Query 5 (Joining multiple tables & computing average): Compute the average rental duration, average rental rate, average length, average replacement cost, and the number of films for each film category. Sort the result by average rating in descending order.
Originally, no category was found in the film; hence, we need to join the film, film_category, and category tables with film_id and category_id as the foreign keys. If you are interested in learning more about other types of joins in SQL, such as LEFT JOIN
, RIGHT JOIN,
and FULL JOIN
, check out SQL Joins for more details.
For the sake of convenience, a view/ virtual table is created. To compute the average and count of the required attributes for each film category, use the AVG
and COUNT
function with theGROUP BY
clause.
Query 6: Find the top 3popular actors based on the number of movies they were involved in. Shows the latest 5 movies (based on the ‘film_id’) for each of these 3 actors.
- The innermost subquery (
SELECT actor_id, first_name, last_name, full_name, COUNT(full_name) as frequency FROM actor_movie GROUP BY full_name ORDER BY frequency DESC LIMIT 5
) finds the top 5 actors based on the frequency of their full names. - The middle subquery (
SELECT actor_id FROM (...) fn
) selects the actor IDs from the result of the innermost subquery. - The outer subquery (
SELECT am.actor_id, am.film_id, am.title, am.release_year, am.full_name, ROW_NUMBER() OVER (PARTITION BY am.actor_id ORDER BY am.film_id DESC) AS row_num FROM actor_movie AS am
) retrieves information about actors and films, including a row number assigned to each film for each actor based on the descending order of film IDs. - The main query selects information from the outer subquery (
AS rm
), filtering the results to include only rows whererow_num
is less than or equal to 5 and theactor_id
is in the set of top 5 actors. - The final result is ordered by
actor_id
.
Happy learning!✌Hope you have gained some basic understanding of how SQL can be used in data analytics. If you are interested in knowing more about the practical ways to apply intermediate and advanced SQL to explore the Sakila database, do check out my article. 😀