Why would you use Common Table Expressions?

Introduction

In this article I assume that you already have a good understanding of SQL. I will introduce some concepts very briefly before moving on to Common Table Expressions (CTEs).

Below you can find the relevant database diagram of the database that I will use in this article:

How is SQL processed by SQL Server?

When we look at a basic SQL statement, the general structure looks like this:

SELECT <field list>
FROM <table list>
WHERE <row predicates>
GROUP BY <group by list>
HAVING <aggregate predicates>
ORDER BY <field list>

As a mental picture we see the order of execution as:

First determine where the data will come from. This is indicated in the <table list>. This list can contain zero or more tables. When there are many tables, they can be joined using inner or outer join operators, and possibly also cross join operators. At this stage we consider the Cartesian product of all the rows in all the tables.

select count(*) from [HR].[Employees]                    -- 9
select count(*) from [Sales].[Orders]                    -- 831
select count(*) from [HR].[Employees], [Sales].[Orders]  -- 7479
select 9 * 831                                           -- 7479

In the third query we combine the tables, without a join operator. The result will be all the combinations of employees with orders, which explains the 7479 rows. This can escalate quickly.

As a side remark: this is valid SQL, but when I encounter it in a code review it makes me suspicious. One way to make clear that you want all these combinations is the CROSS JOIN operator:

select count(*) from [HR].[Employees] cross join [Sales].[Orders]    -- 7479

This will be handled exactly the same as query 3, but now I know that this is on purpose.


Once we know which data we are talking about, we can filter using the <row predicates> in the where clause. This makes sure that the number of rows is limited early in the process. Most join operators also carry a condition (inner join T1 on <join condition>), which is applied at this stage as well, again limiting the number of rows.

select count(*) 
from [HR].[Employees] E 
inner join [Sales].[Orders]    O on E.empid = O.empid    -- 831

The predicate E.empid = O.empid will make sure that only the relevant combinations are returned.

If there is a group by clause, that happens next, followed by the filtering on aggregated values.

Then finally SQL looks at the <field list> to determine which fields / expressions / aggregates to make available, and then the order by clause is applied.
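To make this order concrete, here is a sketch of a query in which each clause is numbered in the order it is logically processed. It assumes that [Sales].[Orders] has an orderid column (the article only shows empid for that table), and the threshold values are arbitrary.

```sql
-- The numbers show the logical processing order,
-- which differs from the order in which the clauses are written.
select E.empid, count(*) as OrderCount                  -- 5. select list
from [HR].[Employees] E                                 -- 1. from / join
inner join [Sales].[Orders] O on E.empid = O.empid
where O.orderid >= 10500                                -- 2. where (orderid on Orders is an assumption)
group by E.empid                                        -- 3. group by
having count(*) > 50                                    -- 4. having
order by OrderCount desc                                -- 6. order by
```

One consequence of this order: the alias OrderCount can be used in the order by clause, because order by is processed after the select list, but it cannot be used in the where or having clause.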

Of course this is all just a mental picture

Imagine a join between 3 tables, each containing 1000 rows. The resulting virtual table would contain 1,000,000,000 rows (1000 × 1000 × 1000), from which SQL Server would then have to select the right ones. In practice, indexes allow SQL Server to obtain only the relevant row combinations. Each DBMS (Database Management System) contains a query optimizer that will intelligently use indexes to obtain the rows in the <table list>, combined with the <row predicates> from the where clause, and so on. So, if the right indexes are created in the database, only the necessary data pages will be retrieved.
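If you want to see this for yourself, the following sketch creates an index on the join column and asks SQL Server to report the number of pages it reads. The index name idx_Orders_empid_demo is hypothetical, and your database may already have an index on this column; whether the optimizer actually uses the index depends on the data and the query.

```sql
-- Hypothetical index name, used only for this experiment.
create nonclustered index idx_Orders_empid_demo
on [Sales].[Orders] (empid);

set statistics io on;    -- report the logical reads per table in the Messages tab

select count(*)
from [HR].[Employees] E
inner join [Sales].[Orders] O on E.empid = O.empid;

set statistics io off;

-- Clean up the experiment.
drop index idx_Orders_empid_demo on [Sales].[Orders];
```

Comparing the reported reads with and without the index gives a rough idea of how many pages the chosen plan touches.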

Inner queries

The table list can also contain the result of another SQL statement. The following is a useless example of this:

select count(*) 
from (select * from [HR].[Employees]) E

This example will first create a virtual table named E as the result of the inner query, and then select from that table. We can now use E as a normal table that can be joined with other tables (or inner queries).

Tip: It is mandatory to give the inner select statement an alias, otherwise it will be impossible to work with it. Even if this is the only data source that you use, an alias is still needed.
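A slightly less useless sketch joins an inner query with a regular table. The alias OC and the column name OrderCount are made up for this example; note that a calculated column such as count(*) also needs a name inside the inner query.

```sql
select E.empid, E.lastname, OC.OrderCount
from [HR].[Employees] E
inner join
(
    -- inner query: one row per employee with the number of orders
    select empid, count(*) as OrderCount    -- the expression needs a column name
    from [Sales].[Orders]
    group by empid
) OC on OC.empid = E.empid                  -- the inner query needs an alias (OC)
```

Removing the alias OC, or the column name OrderCount, makes SQL Server reject the statement.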

As an example I want to know the details of the 3 orders that gave me the highest revenue. To start with, I first find those 3 orders:

select top 3 [orderid], [unitprice] * [qty] as LineTotal
from [Sales].[OrderDetails]
order by LineTotal desc

This gives us the 3 orders with the biggest order lines (LineTotal is calculated per order detail row, not per order):

orderid	LineTotal
10865	15810,00
10981	15810,00
10353	10540,00

Now I can use these results in a query like

select *
from [Sales].[OrderDetails]
where orderid in (10865, 10981, 10353)

which will give the order details for these 3 orders, at this point in time. I can use the result of the previous query in the where condition to make the query work at any point in time:

select *
from [Sales].[OrderDetails]
where orderid in 
(
    select top 3 [orderid]
    from [Sales].[OrderDetails]
    order by [unitprice] * [qty] desc 
)

This query will give me the correct results. I just had to adapt some things from the initial query: the subquery used with IN may return only one column (the [orderid]), so LineTotal disappears from the select list and the order by clause needs to use the full expression. Don’t worry, no more calculations than needed will be done. Trust the optimizer!

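To further evolve this query we can now use an inner join instead of WHERE … IN. The resulting execution plan will be the same again, and the results too. The results match here because the 3 order ids are distinct; if one order owned two of the top 3 lines, the join would return its detail rows twice, whereas IN would not.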

select *
from [Sales].[OrderDetails] SOD
inner join (select top 3 [orderid]
    from [Sales].[OrderDetails]
    order by [unitprice] * [qty] desc) SO 
on SO.orderid = SOD.orderid

Common Table Expressions

With all this we have gently worked toward CTEs. A first use would be to separate the inner query from the outer query, making the SQL statement more readable. Let’s start with another senseless example to make the idea of CTEs clearer:

;with cte as
(
    select top 3 [orderid]
    from [Sales].[OrderDetails]
    order by [unitprice] * [qty] desc
)
select * from cte

What this does is to create a (virtual) table called cte, that can then be used in the following query as a normal data source.

Tip: the semicolon at the front of the statement is not needed if you just execute this statement. If the “with” statement follows another SQL statement then both must be separated by a semicolon. Putting the semicolon in front of the CTE makes sure you never have to search for this problem.
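A minimal sketch of the situation the tip describes: the first statement is not terminated, so the leading semicolon is what keeps the batch valid.

```sql
select count(*) from [HR].[Employees]    -- previous statement, no semicolon at the end

;with cte as                             -- the leading semicolon terminates the previous statement
(
    select [empid], [mgrid]
    from [HR].[Employees]
)
select * from cte
```

If you remove the leading semicolon, SQL Server reports a syntax error near the keyword with.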

The CTE is NOT a temporary table that you can use. It is part of the statement that it belongs to, and it is local to that statement. So later in the script you can’t refer to the CTE again. Given that the CTE is part of this statement, the optimizer will use the whole statement to make an efficient execution plan. SQL is a declarative language: you define WHAT you want, and the optimizer decides HOW to do this. The CTE will not necessarily be executed first; that depends on the query plan.
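The following sketch shows the scope of a CTE. The second select is commented out because it fails: by then the name cte no longer exists.

```sql
;with cte as
(
    select top 3 [orderid]
    from [Sales].[OrderDetails]
    order by [unitprice] * [qty] desc
)
select * from cte;      -- works: this select is part of the statement that defines cte

-- select * from cte;   -- fails with "Invalid object name 'cte'": this is a new statement
```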

Let’s make this example more useful:

;with cte as
(
    select top 3 [orderid]
    from [Sales].[OrderDetails]
    order by [unitprice] * [qty] desc
)
select *
from [Sales].[OrderDetails] SOD 
inner join cte on SOD.orderid = cte.orderid

Now, for us humans, we have split the query into 2 parts: we first calculate the 3 best orders, then we use the results of that to select their order details. This way we can show the intent of our query.

In this case we use the CTE only once, but if you used it multiple times in the same query it would become even more useful.
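As a sketch of using one CTE twice, the query below lists the employees who handled more orders than the average employee. The CTE name OrderCounts and the column name OrderCount are made up for this example.

```sql
;with OrderCounts as
(
    select empid, count(*) as OrderCount
    from [Sales].[Orders]
    group by empid
)
select OC.empid, OC.OrderCount
from OrderCounts OC                                   -- first reference
where OC.OrderCount >
      (select avg(OrderCount) from OrderCounts)       -- second reference
```

With an inner query instead of a CTE, the group by query would have to be written out twice.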

Hierarchical queries

In the [HR].[Employees] table we see a field empid and a field mgrid. (Almost) every employee has a manager, who can in turn have a manager, and so on. So clearly we have a recursive structure.

This kind of structure often occurs with

Compositions
Categories with an unlimited level of subcategories
Folder structures
etc.

So let’s see how things are organized:

select [empid], [firstname], [title], [mgrid]
from [HR].[Employees]

Gives us the following 9 rows:

We can see here that Don Funk has Sara Davis as a manager.

If we want to make this more apparent, we can join the Employees table with itself to obtain the manager info (self-join):

select E.[empid], E.[lastname], E.[firstname], 
       E.[title], E.[mgrid],
       M.[empid], M.[lastname], M.[firstname]
from [HR].[Employees] E
left join [HR].[Employees] M on E.mgrid = M.empid

Notice that a LEFT join operator is needed because otherwise the CEO (who doesn’t have a manager) would be excluded.

We could continue this with another level until the end of the hierarchy. But if a new level is added, or a level is removed, this query wouldn’t be correct anymore. So let’s use a hierarchical CTE:

;with cte_Emp as
(
    -- anchor member: the top of the hierarchy (no manager)
    select [empid], [lastname] as lname, [firstname], [title], 
           [mgrid], 0 as [level]
    from [HR].[Employees]
    where [mgrid] is null

    union all

    -- recursive member: employees whose manager is already in cte_Emp
    select E.[empid], E.[lastname], E.[firstname], E.[title], 
           E.[mgrid], M.[level] + 1 
    from [HR].[Employees] E 
    inner join cte_Emp M on E.mgrid = M.empid
)
select *
from cte_Emp

I’ll first give the result before explaining what is going on:

As explained before we start with a semicolon, to avoid frustrations later.

We then obtain the highest level of the hierarchy

select [empid], [lastname], [firstname], [title], 
       [mgrid], 0 as [level]
from [HR].[Employees]
where [mgrid] is null

This is our starting point for the recursion. Using UNION ALL we now obtain all the employees that have Sara as a manager. This is added to our result set, and then for each row that is added, we do the same, effectively implementing the recursion.

To make this more visual I added the [level] field, so you can see how things are executed. Row 1 has level 0, because it comes from the first part of the query (0 as [level]). Then, for each pass through the recursive part, the level is incremented. This shows how the query is executed.
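The same technique can also carry other information down the hierarchy. The sketch below builds a path of first names from the top of the hierarchy to each employee; the [path] column is made up for this example. Both parts of the union cast the path to the same type, because SQL Server requires the column types of the two parts to match exactly.

```sql
;with cte_Emp as
(
    -- anchor member: the path starts with the employee without a manager
    select [empid], [firstname], [mgrid], 0 as [level],
           cast([firstname] as nvarchar(4000)) as [path]
    from [HR].[Employees]
    where [mgrid] is null

    union all

    -- recursive member: append the employee to the path of the manager
    select E.[empid], E.[firstname], E.[mgrid], M.[level] + 1,
           cast(M.[path] + N' > ' + E.[firstname] as nvarchar(4000))
    from [HR].[Employees] E
    inner join cte_Emp M on E.mgrid = M.empid
)
select [level], [path]
from cte_Emp
order by [path]
option (maxrecursion 100)    -- 100 is the default limit, written out here for visibility
```

The maxrecursion hint limits the number of recursion levels; if the data contains a cycle (an employee who is indirectly their own manager), the query stops with an error instead of running forever.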

Conclusion

Common Table Expressions are one of the more advanced query mechanisms in T-SQL. They can make your queries more readable, or perform queries that would otherwise be impossible, such as outputting a hierarchical list. In this case the real power is that a CTE can reference itself, making it possible to handle recursive structures.

Reference

https://docs.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql
