Advanced SQL Functions

SELECT {column} FROM {table} WHERE {other_column} >= {number}. For those with a working knowledge of Excel, SQL seems straight-forward. Its usually when a very specific question is asked regarding data that one will need to learn and understand the concepts below. This blog touches on how to go about learning how to use different approaches to SQL problems.

Self-Join

Self-join is very powerful way to look at a table because all the information is already in one place. A self-join is simply a regular join, but a table is joined to itself. We want to use some information in a table twice to find relationships between rows. A good sign that a self-join should be used is when a table has a foreign key that references its own primary key. Ex. an employees table that has employee_id and manager_id, with the manager_id simply being the employee_id.

Examples:
– Match customers that are from the same city
– Match managers to their employees

Ex. 1: Match customers from the same city

-- First approach
SELECT A.CustomerName AS CustomerName1, B.CustomerName AS CustomerName2, A.City
FROM Customers A, Customers B
WHERE A.CustomerID <> B.CustomerID
AND A.City = B.City 
ORDER BY A.City;

-- same way as above but with join statement
SELECT A.CustomerName AS CustomerName1, 
    B.CustomerName AS CustomerName2, 
    A.City
FROM Customers A
INNER JOIN Customers B
ON A.City =B.City AND A.CustomerID <> B.CustomerID
ORDER BY A.City;

-- using (>) operator instead of (<>) gives us just 1 pair of matches as opposed to the same match but in different columns

In the example above, we see that we are using the Customers table twice (by calling it and assigning different names: ‘FROM customers A, customers B’). In the select statement, we use the CustomerName column twice (one from each table) and the City column twice. We then find where customer1 and customer2 are from the same city, but not the same person. In the 2nd approach we use > operator instead of the <> operator, which ensures that reverse cases do not get included.

Ex. 2: Managers and Employees

SELECT e.first_name + ' ' + e.last_name employee,
    m.first_name + ' ' + m.last_name manager
FROM sales.staffs e
INNER JOIN sales.staffs m ON m.staff_id = e.manager_id
ORDER BY manager;
Sales Database and staffs table

The code above corresponds to the example of managers and employees. We simply use the manager_id column to find the name of the employee’s manager. Visually we are keeping the information from staff. The JOIN statement joins the staff_id to manager_id. The row will contain the staff information, followed by the manager’s information. It’s important to know that manager_id should be the same as staff_id, otherwise it won’t match.

Subquery (Inner Query or Nested Query)

Subqueries are great when we the information is straight forward. For example, we want to know what movies an actor has been in. In the graph above, we see that actor name is in its own table and movie names and cast are in a different table. We simply write a query to find the actor.id using LIKE ‘Ryan Reynolds’, then we create another query using the actor.id from our initial query.

There are some constraints:

  • embedded within the WHERE clause
  • only one column in the select clause
  • ORDER BY cannot be used in a subquery
  • Uses the IN function to call the subquery
  • Can also be used as an aggregate function for a column

Ex: Find customers with salaries over 4500

SELECT *
FROM CUSTOMERS
WHERE id IN (SELECT id
             FROM customers
             WHERE salary > 4500);

Ex: List all customers with their total number of orders

SELECT firstName, 
       last_name, 
       orderCount = (SELECT COUNT(O.ID)
                     FROM order O
                     WHERE O.CustomerId = C.id)
FROM Customer C
https://www.dofactory.com/sql/subquery

The great thing about subqueries is that we can easily read that we are trying to find the id for Deadpool actor named ‘Ryan Reynolds’. Once we know his id, instead of remembering it or writing it down, we simply use the subquery results as the input for our main query.

Temporary Tables

Can be used using several queries and inserts and using cursors to loop through the temporary data updates. Another simple way to store info is temporary tables. If we needed to call on a query, it’s best to just make a temporary table.

  • Use the INTO command #{temp_table_name}
  • Call the temporary table using a query FROM #{temp_table_name}
  • Temporary tables are automatically deleted when the connection is closed
#mysql syntax
CREATE TEMPORARY TABLE MaleStudents
SELECT *
FROM student
WHERE gender = 'Male';

#postgresql syntax
SELECT name, age, gender
INTO #MaleStudents
FROM student
WHERE gender = 'Male';

SELECT * FROM #MaleStudents;

DROP TABLE #MaleStudents

Using temporary tables can be taxing on performance. We’ll go over better approaches.

Derived Tables (Inline Views)

Sometimes, we need to do multiple queries on a specific subquery. One way to help with performance is the use of derived tables. A derived table is a virtual table that is created within the scope of a query. This is especially useful when you are starting to use multiple aggregation functions.

  • Select statement and given an alias
  • In the FROM clause, we introduce our query aliased as a table
  • Derived tables are not indexed
  • Once a query is completed, the derived table is not usable
  • Different from a temporary table because we don’t need to create and drop (using INSERT and UPDATE statements)
  • Gets rid of the need to use cursors and temporary tables
  • Like a temporary view (more on this below)
  • The subqueries always need to have an alias!

Examples:
– Get the salesperson’s stats in terms of the territory comparison
– Get the top 5 products in terms of sales (derived table is a table joining products and sales)

Ex.1 Join two tables that we’ve created

SELECT Area,
    COUNT(Value) AS NumAreaSales,
    SUM(Value) AS TotalAreaSales
FROM Sales
GROUP BY Area
http://www.blackwasp.co.uk/SQLDerivedTables_2.aspx
SELECT Salesperson,
    Area,
    COUNT(Value) AS NumSalesPersonSales,
    SUM(Value) AS TotalSalesPersonSales
FROM Sales
GROUP BY SalesPerson, Area

Derived tables are great as a way to combine two queried tables into one. In the above examples, we created two tables using the Sales table: the first looks at sales grouped by Area, the second table is grouped by the SalesPerson and Area. Since we are trying to find the differences in the area we use the JOIN function ON the first table’s area and the 2nd table’s area. The result gets us each salesperson’s numbers: number of sales, area, % of the number of sales in the area, a total of their sales, a total of area’s sales, and their market share of the area (how much their sales was a percentage of the area).

SELECT Salesperson,
    NumSalesPersonSales,
    NumAreaSales,
    100 * NumSalesPersonSales / NumAreaSales AS NumSalesPercent,
    TotalSalesPersonSales,
    TotalAreaSales,
    100 * TotalSalesPersonSales / TotalAreaSales AS NumSalesPercent
FROM
(   SELECT Area,
        COUNT(Value) AS NumAreaSales,
        SUM(Value) AS TotalAreaSales
    FROM Sales
    GROUP BY Area) AS DerivedAreaSales
INNER JOIN
(   SELECT Salesperson,
        Area,
        COUNT(Value) AS NumSalesPersonSales,
        SUM(Value) AS TotalSalesPersonSales
    FROM Sales
    GROUP BY SalesPerson, Area) AS DerivedSalesPersonSales
ON DerivedAreaSales.Area = DerivedSalesPersonSales.Area

Ex 2: Join Two Tables and Use the results for query


-- Query to get sales #s for product 
SELECT productCode, 
    ROUND(SUM(quantityOrdered * priceEach)) sales
FROM orderdetails
INNER JOIN orders USING (orderNumber)
WHERE YEAR(shippedDate) = 2003
GROUP BY productCode
ORDER BY sales DESC
LIMIT 5;

-- Derived table using the above aliased AS top5products2003
SELECT productName, sales
FROM
    (SELECT 
        productCode, 
        ROUND(SUM(quantityOrdered * priceEach)) sales
    FROM orderdetails
    INNER JOIN orders USING (orderNumber)
    WHERE YEAR(shippedDate) = 2003
    GROUP BY productCode
    ORDER BY sales DESC
    LIMIT 5) top5products2003
INNER JOIN products USING (productCode);
The result of the derived table.
The result of joining the derived table with products table

We are using the product code and the total of sales from our derived table and joining to show just the product name and total sales.

Ex. 2: Use CASE to group by volume of orders

CASE is a great way to bin a column using rules.

  • USE (CASE WHEN {function} THEN ‘{case1}’
    WHEN {function2} THEN ‘{case2}’
    END) {alias}
-- this creates table with customerGroup column with their metal group
SELECT customerNumber,
    ROUND(SUM(quantityOrdered * priceEach)) sales,
    (CASE
        WHEN SUM(quantityOrdered * priceEach) < 10000 THEN 'Silver'
        WHEN SUM(quantityOrdered * priceEach) BETWEEN 10000 AND 100000 THEN 'Gold'
        WHEN SUM(quantityOrdered * priceEach) > 100000 THEN 'Platinum'
    END) customerGroup
FROM orderdetails
INNER JOIN orders 
USING (orderNumber)
WHERE YEAR(shippedDate) = 2003
GROUP BY customerNumber;
The result of using CASE to group by sales numbers
--Count how many people belong to each metal
SELECT customerGroup, 
       COUNT(cg.customerGroup) AS groupCount
FROM
    (SELECT customerNumber,
            ROUND(SUM(quantityOrdered * priceEach)) sales,
            (CASE
                WHEN SUM(quantityOrdered * priceEach) < 10000 
                     THEN 'Silver'
                WHEN SUM(quantityOrdered * priceEach) BETWEEN 10000 AND 100000 
                     THEN 'Gold'
                WHEN SUM(quantityOrdered * priceEach) > 100000 
                     THEN 'Platinum'
            END) customerGroup
    FROM orderdetails
    INNER JOIN orders USING (orderNumber)
    WHERE
        YEAR(shippedDate) = 2003
    GROUP BY customerNumber) cg
GROUP BY cg.customerGroup;    

CTEs (Common Table Expressions)

CTEs are extremely useful. As you can see in the previous examples, the code starts to get very large and is not good to scale. CTEs are great for readability and can be used for a substitute if we do not have user access to create VIEWS. CTEs can be used for window function or analytic functions and is very flexible. They are similar to a temporary table, but have some major differences:

  • CTEs are not physically created on database
  • Temp tables are created and can have constraints and are indexed
  • CTE can’t be indexed and are created in working memory
-- Basic CTE creation
WITH expression_name[col_name]
AS (CTE_definition)
SQL_statement;

-- Call CTE
SELECT *
FROM expression_name

How to create a CTE

  • Uses the WITH clause.
  • Creates a temporary named result set that is temporarily available to use with SELECT, INSERT, UPDATE, DELETE, or MERGE
  • We name the CTE as {expression_name}. This is what we are going to call in future queries.
  • List comma-separated columns that the table will keep (corresponds to the columns that we are going to call in definition)

There are plenty of reasons to use CTEs:

  • CTEs are more readable than subqueries
  • CTEs are great for window functions or analytic functions
  • Can do recursive queries: queries that can call themselves (hierarchical data like org charts)
  • Can use rankings

Examples:
– Get sales amounts of sales staff in 2018
– Get the average number of sales orders for 2018 for all sales staff

Ex: Get sales amounts of sales staff in 2018

WITH cte_sales_amounts (staff, sales, year) AS (
    SELECT first_name + ' ' + last_name, 
        SUM(quantity * list_price * (1 - discount)),
        YEAR(order_date)
    FROM sales.orders o
    INNER JOIN sales.order_items i ON i.order_id = o.order_id
    INNER JOIN sales.staffs s ON s.staff_id = o.staff_id
    GROUP BY first_name + ' ' + last_name,
        year(order_date)
)

SELECT staff, 
    sales
FROM cte_sales_amounts
WHERE year = 2018;

Explanation of code. The WITH statement names the cte. There will be 3 columns in the cte with (staff,sales,year). The AS statement uses the definition query to build the cte. Once the cte is created, we can call the information we need using the FROM cte_name.

ex: Get average number of sales orders in 2018 for all sales staffs.

WITH cte_sales AS (
    SELECT staff_id, 
        COUNT(*) order_count  
    FROM sales.orders
    WHERE YEAR(order_date) = 2018
    GROUP BY staff_id
)

SELECT AVG(order_count) average_orders_by_staff
FROM cte_sales;

Explanation: we use the CTE to create a table where we group by staff_id and the number of orders they received in 2018. We then get the average for all our staff. A good application would be to group further by top producers and not top producers.

Window Functions

Window function performs a calculation similar to the GROUP BY function but does not cause the rows to become grouped. Instead, each row keeps its identity but gets a new column to show grouping.

  • Running total is a good example
  • Rows retain their separate identities
  • the OVER statement does the function over the full table
  • PARTITION BY groups the rows by values in column
  • Can’t use window function and GROUP BY in the same query
  • ROW_NUMBER() displays the number of the row according to order in window statement
  • RANK() similar to ROW_NUMBER() but ties display as the same rank
  • DENSE_RANK() skips the numbers where there are ties. (If there’s a tie for 2nd place: 1, 2, 2, 4… etc. Third place is not shown because of the tie at 2nd place)
  • NTILE() is a percentile function
  • LAG() compare preceding rows (sliding window)
  • LEAD(): compare the following rows (sliding window)

examples:
– running total (order by start_time)
-ROW_NUMB

Example: Get running total, running count, and average time of table

-- Simple running total
SELECT duration_seconds, 
       SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total
FROM table       

-- Realistic running total
SELECT start_terminal,
       duration_seconds,
       SUM(duration_seconds) OVER
         (PARTITION BY start_terminal ORDER BY start_time)
         AS running_total,
       COUNT(duration_seconds) OVER
         (PARTITION BY start_terminal ORDER BY start_time)
         AS running_count,
       AVG(duration_seconds) OVER
         (PARTITION BY start_terminal ORDER BY start_time)
         AS running_avg
FROM tutorial.dc_bikeshare_q1_2012;

Example: Get the second ride of each station

SELECT start_terminal,
       start_time,
       duration_seconds,
       ROW_NUMBER() OVER (PARTITION BY start_terminal
                          ORDER BY start_time)
                    AS row_number
FROM tutorial.dc_bikeshare_q1_2012
WHERE row_number = 2

Window functions are very powerful because they create a calculation and keep it in the table. The 2nd example shows how we can easily choose the 1st person to get a bike from the station, but now we have the ability to choose the 2nd, 3rd, or any number. This flexibility allows the user to dive deeper into the data without changing the data itself.

I hope you enjoyed my notes on some advanced approaches to SQL problems. A lot of problems can be solved with these tools as starting blocks. The best thing about learning a new coding language is the discovery of better approaches and making life easier in the future. One of my favorite activities is writing and journaling and its great to see the progress I’ve made from feeling like I can’t learn an idea to creating a blog post to make it easier for others that have gone through the same process.

Reference