Basic & Intermediate SQL for Data Analysis

Jada Ng Pooi Ling
7 min readAug 21, 2022

--

Structured Query Language (SQL) is a programming language that is used to interact with the relational database management system (RDBMS). SQL is used on all of the major RDBMS but certain instructions might work on one RDBMS and they might not work on another. You might need to do things a little bit differently depending on the RDBMS you used. However, it’s all basically the same for most of the parts but the implementation may vary just a little bit different.

Photo by saeed mhmdi on Unsplash

For demonstration, I will use the Sakila database which is readily available once on MySQL. It is a schema modeling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals. If you are not using MySQL, you can download it directly from the following link: https://downloads.mysql.com/docs/sakila-db.zip.

Query 1 (Retrieve all rows that satisfy multiple conditions): Find all commentaries or deleted scenes that require a replacement cost of more than $25, have a length of more than 150 minutes, and allow a rental duration of 7 days.

SQL code for query 1
Result of query 1
  • SELECT→ allows us to specify the columns we want to select. Using the character * in SQL will return every column for the table specified.
  • FROM → allows us to specify the table that we want to retrieve the data.
  • WHERE → only returns rows that fulfill the specified conditions; usually used with AND, OR, BETWEEN, IN, LIKE, parentheses, and other common operators such as =, <, >, <=, ≥=, !, and <>. The presence of parentheses causes conditions within them to be evaluated together.

Query 2 (Referencing an aliased column & limiting the number of rows returned): Select 5 random records of deleted scenes and only return the columns of ‘title’, ‘rating’, and ‘special_features’ but rename these columns to ‘film_title’, ‘film_rating’ and ‘film_types’ respectively.

SQL code for query 2
Result of query 2

Placing the query that gives aliases in the FROM clause allows us to reference the aliased columns in the WHERE clause because the FROM clause is evaluated before the WHEREclause. The results from the query in the FROM clause are generated first, so the WHERE clause “senses” the alias names. The inline view in this solution is aliased x.

When you specify a RAND() function in the ORDER BY clause, the sort is performed on the result from the RAND()function as it is evaluated for each row in the table. The LIMITclause picks the first N row in the result set sorted randomly.

Query 3 (Concatenating column values & returning query results in a specified order): Find the ‘full_name’ for each actor and its respective frequency of occurrence. Sort the result by the new column ‘frequency’ descendingly, and ‘last_name’ ascendingly. Remember to include the ‘first_name’ and ‘last_name’ in the result. Only shows the first 5 rows. Finally, find out the ‘actor_id’ for the actor(s) whose name appears more than once.

SQL code for query 3 (part 1)
Result of query 3 (part 1): Only Susan Davis appears twice in the list

In this case, we used the CONCAT() function to add ‘first_name’ and ‘last_name’ together, but we separated them with a space character. The order of precedence in ORDER BY is from left to right. By default, the ORDER BY clause will sort the rows of your result set in ascending order, so the ASC clause is therefore optional. Alternatively, specify DESC to sort in descending order.

SQL code for query 3 (part 2)
Result of query 3 (part 2)

In conclusion, there are two Susan Davis entries in the list, distinguished by actor_id 101 and 110, respectively. 👯‍♀️

Query 4 (Using conditional logic & searching for patterns): Produce a result set such that if a film has a length of 120 minutes or more, a message of “LONG” is returned; otherwise, return “NORMAL”. However, only return the result in which the title of the film begins with “VIR” or the title contains “TAPE” somewhere.

SQL code for query 4
Result of query 4

The CASE expression allows us to write a series of cases and provide an alias on values returned by a query to reach a more understandable result. The WHEN statement allows us to set different conditions and the THEN statement connects the outputs accordingly. We can have as many WHEN statements as possible given the number of conditions we are filtering by. The ELSE clause is optional. The CASE expression will return NULL for any row that does not satisfy the test condition if there is no ELSEclause (see images below👇). The END statement will finalize and create a new column name based on the different cases that classified the different outputs.

SQL code for query 4 (alternative)
Result of query 4 (alternative)

To find a film title that begins with “VIR” or a film title that contains “TAPE” somewhere, we can use the LIKE statement and the percent (%) operator to match any sequence of characters. The percent (%) operator is the wildcard character and it can be placed at the front, middle, or end of a value.

VIR% means it will look for strings that start with “VIR”. If the requirement is to search for all titles ending with “VIR,” then we can use %VIR. To find any string that contains “TAPE” at any position, we can enclose the search pattern “TAPE” with % operators.

Query 5 (Joining multiple tables & computing average): Compute the average rental duration, average rental rate, average length, average replacement cost, and the number of films for each film category. Sort the result by average rating in descending order.

EER diagram
SQL code for query 5
Result of query 5

Originally, no category was found in the film; hence, we need to join the film, film_category, and category tables with film_id and category_id as the foreign keys. If you are interested in learning more about other types of joins in SQL, such as LEFT JOIN, RIGHT JOIN, and FULL JOIN, check out SQL Joins for more details.

For the sake of convenience, a view/ virtual table is created. To compute the average and count of the required attributes for each film category, use the AVG and COUNT function with theGROUP BY clause.

Query 6: Find the top 3popular actors based on the number of movies they were involved in. Shows the latest 5 movies (based on the ‘film_id’) for each of these 3 actors.

SQL code for query 6
Result of query 6
  • The innermost subquery (SELECT actor_id, first_name, last_name, full_name, COUNT(full_name) as frequency FROM actor_movie GROUP BY full_name ORDER BY frequency DESC LIMIT 5) finds the top 5 actors based on the frequency of their full names.
  • The middle subquery (SELECT actor_id FROM (...) fn) selects the actor IDs from the result of the innermost subquery.
  • The outer subquery (SELECT am.actor_id, am.film_id, am.title, am.release_year, am.full_name, ROW_NUMBER() OVER (PARTITION BY am.actor_id ORDER BY am.film_id DESC) AS row_num FROM actor_movie AS am) retrieves information about actors and films, including a row number assigned to each film for each actor based on the descending order of film IDs.
  • The main query selects information from the outer subquery (AS rm), filtering the results to include only rows where row_num is less than or equal to 5 and the actor_id is in the set of top 5 actors.
  • The final result is ordered by actor_id.

Happy learning!✌Hope you have gained some basic understanding of how SQL can be used in data analytics. If you are interested in knowing more about the practical ways to apply intermediate and advanced SQL to explore the Sakila database, do check out my article. 😀

--

--

Jada Ng Pooi Ling
Jada Ng Pooi Ling

Written by Jada Ng Pooi Ling

Stay Constantly Curious🧐 I LOVE data and writing! 😍 Hope my articles could bring some inspiration to you on your learning journey. 😄

Responses (2)