Analyzing HR Metrics With SQL To Measure Diversity & Inclusion

10 min readSep 24, 2022

A workforce of diverse talents and backgrounds could help companies to succeed in this complex world. To unlock the potential of workplace diversity, companies need to eliminate systemic bias (e.g. gender bias or sexual discrimination) from their policies and practices, so that the employee lifecycle could be fair and robust from application to exit.👩‍💼

This article explains how I measure diversity and inclusion using the Kaggle dataset containing the employees’ details. Before going into the analysis, let’s understand the difference between diversity and inclusion.

Diversity: the traits or characteristics of an individual. It can be their age, race, gender, marital status, job level, service period, etc.
Inclusion: the behavior that makes the team feels welcomed. Employees feel included if they are satisfied with their job and have good relationships with other employees. A survey could be conducted to understand the feeling of the employees about their teams and job.

For your information, I performed the analysis using MySQL workbench. Before importing the CSV file into the MySQL workbench, I made some amendments using Excel. For example, fill in the blank cell in the “DateofTermination” column with “NA”. I have created a PDF showing the steps to create a new schema and import datasets to the MySQL workbench. You may refer to this PDF if you are not sure how to perform these steps.🧐

Below are the hypothetical questions I created on my own, my solutions, and the results. 🤓

Query 1: Create a view called “eng_hr_dataset” that includes all the columns of the dataset, except the “HispanicLatino” column and columns with “ID” in their names. Then, calculate the age, job level, and service period of all employees. Finally, determine generational diversity in the workplace, as well as the average (trimmed) service period of each generation.

A view could be used easily to query data just as though we were querying tables directly. Views do not involve data storage; hence, we do not need to worry about our disk space being filled up by views. We could use the DESCRIBE command to examine the columns of the view.

In this case, “Level”, “generation”, and “serv_period” columns were created by specifying the conditions and the values to be returned in the CASE expressions. It is conditional logic because the CASE expression will only return the corresponding value when the condition in a WHEN clause evaluates to be true. If none of the conditions evaluates to be true, then the default value as specified in the ELSE clause would be returned. It is optional whether or not you want to specify the ELSE clause. This is because NULL would be returned for any row that does not satisfy the test condition if there is no ELSE clause.

Before calculating the age and service period using the DATEDIFF() function, “DOB”, “DateofHire” and “DateofTermination” were converted from text format to date format using the STR_TO_DATE() function. Meanwhile, the YEAR() function was used to extract the year from the dates of birth when determining the generation of the employees.

A workforce with generational diversity allows innovation and creativity in problem-solving because people of different ages could bring different viewpoints. Besides, it can help companies to develop a better understanding of their customer base as most brands serve an audience of all ages. In other words, it could improve marketing, product development, and customer service.

However, generational diversity could bring issues to the company if it has not been handled properly. This is because each generation has its characteristics, expectations, and priorities. In this case, most of the current employees are Generation Y (102), followed by Generation X (87), and finally Baby Boomers (10). Generally, the average service period of Generation X is longer than that of Generation Y (also known as millennials).

To gain advantages of having a multigenerational workforce, the management of the company needs to avoid stereotypes based on age. Open communication and learning about employees’ individual preferences and working styles could also help to build team engagement and ensure efficient teamwork.

Query 2: Find out the number of existing employees by department and gender.

This case involves the use of pivoting. First, the number of existing employees of different genders in each department was retrieved using the GROUP BY clause in the unpivoted query which is aliased as “ds”. Then, the CASE expressions translate the rows returned by this query to columns. The MAX() functions remove the rows that contain zero as their value. Then, the subquery “sd” was combined with the subquery that returns the number of employees by gender (shown as the last column in the result) using an inner join. Finally, the ROLLUP extension of GROUP BY was used to generate each column’s subtotal.

The number of current female employees is higher than the number of male employees in the company. 111 females and 88 males are currently working in the company. Females mostly work in production, software engineering, admin, and executive office. On the other hand, a higher number of male employees work in IT/ IS and sales departments.

Query 3: Calculate the original number of employees, the number of current employees, and the percentage of change in the number of employees by job level and gender.

The query that returns the original number of employees was combined with the query that returns the number of current employees using an inner join.

Originally, the gender distribution in the top management and the senior executive levels are quite balanced, until one of the male senior executives left the company. There are fewer female managers as compared to male managers. Resigned employees are mostly female executives.

Query 4: Calculate the original total number of employees, the number of current employees, and the number of employees who left the company by department, job level, gender, and race. Then, determine the respective turnover rate of the employees. Only show the rows that involve resigned employees.

A LEFT JOIN was used in this case because we want to retain all records from the left table c before matching records with the right tables a and b.

It is important to note that all male Asian managers in production and all male Black or African American executives in the software engineering department (refer to the last 2 rows of the result) had left the company. This might be due to the lack of promotion opportunities or their jobs do not allow them to grow professionally. If the company values its employees but it could not offer them a promotion or a salary increment at this point, the company could communicate openly to them to avoid the employees from having resentment and wrong hopes.

The last column shows the employee turnover rate, the metric that measures the number of employees who are leaving a company. However, we need to evaluate employee turnover contextually. It is good for the company if the problematic (e.g. absent from work, having poor performance, or violating workplace policies) employees leave the company. Otherwise, they would affect the productivity and the work attitude of the other employees. Since this result involves the terminated employees; hence, we shall not make any judgment based on this result. This result simply gives a picture of the number of resigned employees based on the specified characteristics.

Query 5: Calculate the number and the running total of terminated employees as well as their respective termination reasons.

The SUM OVER() function helps us to generate a running total. To obtain the correct progression of the running total and avoid duplicate values, unique combinations of columns need to be specified. In this case, we want to consider the difference in the “TermReason” and “EmploymentStatus”. Hence, we need to specify these 2 column names after the ORDER BY clause when calculating the running total.

If an employee willingly leaves a company, it is considered voluntary turnover, and it is usually because the employee goes to work in another company. In contrast, involuntary turnover happens when employees are terminated due to poor job performance, absenteeism, or violation of workplace policies. Employee turnover costs a lot of money because the company needs to replace the resigned or terminated employees with new hires. Then, the company has to waste time onboarding and training the new hires until they are fully productive.

The result showed that there were 104 resigned or terminated employees. 4 employees left the company due to performance issues, and 3 out of these 4 low performers are terminated involuntarily. Besides, 7 employees left the company due to attendance issues, whereby one of them resigned voluntarily. Moreover, the result showed that most of the voluntarily resigned employees left the company to look for another position, more money, or they are unhappy working in their team. This provides us some hints to perform checking on the leadership pattern, as well as pay and benefit across the organizational level.

Query 6: Find out the managers’ names and the number of employees within his/ her team who resigned due to unhappiness, but fully met or exceeds their performance assessment criteria.

It is found that there are 3 employees under Amy Dunn’s team who resigned due to unhappiness. A deep investigation would be needed to discover if there is any workplace bullying within Amy’s team, or if her team members are overworking due to an insufficient workforce.

Query 7: Calculate the overall turnover rate (in the context of the whole organization) as well as the turnover rate of the employees by gender. Only considers those employees who voluntarily resigned and not due to their attendance or performance (fully met or exceeds their performance assessment criteria) issues.

In this case, a UNION operator was used to combine the turnover rate by gender (first 2 rows in the result) with the overall turnover rate (last row of the result). UNION operator can sort the combined set and removes duplicates. In contrast, the UNION ALL always returns the sum of the number of rows in the sets being combined without checking for overlapping data. This case does not involve any duplicated rows; hence, the result would be the same regardless of whether we use the UNION operator or UNION ALL operator. 🤭

In this case, 78 employees who have no performance or attendance issues left the company and this made up the overall turnover rate of 31.08%. The individual turnover rate contributed by the female employees is higher than the individual turnover rate contributed by the male employees.

Query 8: Find the median pay of each gender.

After counting the rows per gender and adding them to the table as another column, we need to add two new variables called “row_number” and “median_group”. Row number increases by 1 for each row until we reach the other gender. The CASE expression detects if there has been a change of gender and resets the counter accordingly.

Then, we need to define a search range and effectively select the middle of each group’s running count by dividing the count of the group by 2 and doing the same by adding 1. Lastly, we need to apply AVG() function because there might not always be only one row that gets selected for each group.

It is shown that the salary for female employees is lower than for male employees. Other than monitoring starting salaries for new staff, it would be good if the company could create an equal pay group to constantly review the equal pay (including basic salary, non-discretionary bonuses, and other contractual benefits) and take action where necessary.

Query 9: Determine the best recruitment platform to use if the company would like to ensure diversity. Only consider those employees who are not Caucasian, but who fully meet or exceed the performance assessment criteria.

It appears that Indeed brought the most diverse and performing people to this company. It is suggested that the company should continue using this platform if it wants to ensure workplace diversity.

This marks the end of my SQL project on the HR dataset. Hope this article can shed some light on how to use SQL to measure workplace diversity and inclusion on real-world datasets. Happy learning.😁

Analyzing HR Metrics With SQL To Measure Diversity & Inclusion

Written by Jada Ng Pooi Ling

Responses (3)