Intermediate & Advanced SQL For Data Analysis

Jada Ng Pooi Ling
6 min readAug 27, 2022

--

It is unavoidable for an analyst to deal with strings, numbers, and dates in their daily job. Below are some examples of how to use SQL to perform analysis on data containing strings, numbers, and dates. For demonstration, I would use the Sakila database on MySQL.

Photo by ANIRUDH on Unsplash

Query 1 (Parsing an email address & extracting initials from the names): Extract the customers’ first name and last name from the emails (only those first names that appear twice or more). After that, convert the extracted first names and last names to proper cases. Convert the extracted first and last names into initials.

SQL code for query 1
Result of query 1

I believe you have realized that I have embedded queries in other queries in my SQL code above. Yes, they are called subqueries or nested queries, and they allow us to feed the result of one query to another query. Subqueries have complete query components and they are enclosed in parentheses in order to be executed first. Assigning aliases is important for syntax and readability. 🤓

Nested query/ Subquery

There are 3 types of subqueries:

  • Type 1: Embed a query within the parentheses after the FROM statement so that we can use the results of this inner query as a table for the FROM statement of the outer query.
  • Type 2: Embed a query within the parentheses after the IN operator. It is useful when we want to filter the results of the outer query by the results of the inner query, whereby the inner query often provides a list of items in the form of a single column. In the example above, the inner query retrieves a list of the first name that appears more than one time, whereas the outer query directly looks for the first name provided by the inner query, instead of running through the entire table. It is important to note that the HAVING clause allows us to filter the aggregated values after they have been grouped.
  • Type 3: Embed a query within the parentheses after a comparison operator. The inner query retrieves the single value of another column from a different table. The outer query then takes the inner query and uses it with a comparison operator to filter a list of values. Check out more examples from SQL Nested subqueries.

Meanwhile, parsing an email address becomes an easy task with the SUBSTRING_INDEX() function as it can return a substring of a string before a specified number of delimiter occurs. It requires us to specify the (1) string, (2) delimiter, and (3) the number of times to search for the delimiter. The third parameter can be both a positive or negative number. If it is a positive number, this function returns all to the left of the delimiter. If it is a negative number, this function returns all to the right of the delimiter.

To create an initial for each customer, you simply need to extract the first letter of their first name and the first letter of their last name using the SUBSTR() function. After that, you can combine the extraction with the period using the CONCAT() function.

Similarly, to convert the name to the proper case, you simply need to extract the first letter of their first & last name, change the other letters of their first & last name to lower case using the LOWER() function, and finally combine the first letter and other lowercased letters using the CONCAT() function.

Query 2 (Counting values in a column + Generating a running total + Determining the percentage of a total): Find the number, running total, and the percentage of active & inactive customers of each store.

SQL code for query 2
Result of query 2

Below are some examples of aggregation functions:

  • COUNT: Returns the number of values in a particular column.
  • COUNT DISTINCT: Returns the number of unique values in a particular column. For example, we can use this function to find the number of distinct actors’ last names (see the query and its result below👇).
example 1
result of example 1
  • SUM: Adds together all the values in a particular column.
  • AVG: Computes the average of a group of selected values.
  • MIN/MAX: Returns the lowest and highest values in a particular column respectively.

The column “running_total2” shows the problem that you might face with duplicates. The values in “running_total2” for inactive customers of stores 1 & 2 are incorrect because it only considers the difference in the store without considering the difference in the status of the customers. Hence, we need to ensure the correct progression of the running total by specifying a combination of columns that cannot result in duplicate values.

On the other hand, to calculate a percentage against a total in SQL, you only need to simply divide, then multiply. In this example, simply find the number of customers for each store by its status, and then divide by the total number of customers for the table. Finally, multiply by 100 to return a value that represents a percent.

Query 3 (Making good use of wildcard): Find out the customers whose last name starts with H or J and ends with ‘SON’ or ‘SEN’. Display the result with “customer_id”, “first_name”, “last_name”, “active”, “address”, “district”, “city”, and “country”.

EER diagram of sakila database
SQL code for query 3
Result of query 3
  • %: Wildcard for more than 1 character.
  • _: Wildcard for a single character.

Query 4 (Joining multiple tables & filtering query results): Find the top 5 most popular film title and their respective categories.

SQL code for query 4
Result of query 4

Subqueries and join are both used to combine data from different tables. However, they have differences as shown below.

Differences between subquery and join

Query 5 (Determining the Number of Days Between Two Dates): Find the difference in days between the “rental_date” and “return_date”. Then, find the sports film which has returned late by more than 9 days.

SQL code for query 5
Result of query 5

I created another view called “rent_table” from the view of “combined_table” created for query 4. To find the number of days between two dates, we can use the DATEDIFF() function by passing the two dates we want to find the difference in days between as parameters. The larger of the two dates should be passed first to avoid negative values.

Hope you have gained some ideas on how to use SQL practically to perform analysis.😀Happy learning✌️

--

--

Jada Ng Pooi Ling
Jada Ng Pooi Ling

Written by Jada Ng Pooi Ling

Stay Constantly Curious🧐 I LOVE data and writing! 😍 Hope my articles could bring some inspiration to you on your learning journey. 😄

Responses (2)