OLAP Queries in SQL

Posted on October 18, 2020

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!

Contributed by: Bart Baesens, Seppe vanden Broucke

The term business intelligence (BI) is often referred to as the set of activities, techniques and tools aimed at understanding patterns in past data and predicting the future. In other words, BI applications are an essential component for making better business decisions through data-driven insight. These applications can both be mission critical or occasionally used to answer a specific business question. On-Line Analytical Processing (OLAP) provides a advanced set of BI techniques to analyze your data. More specifically, OLAP allows you interactively analyze the data, summarize it and visualize it in various ways. The term on-line refers to the fact that the reports can be updated with data almost immediately after they have been designed (or with negligible delay). The goal of OLAP is to provide the business-user with a powerful tool for ad-hoc querying.

To facilitate the execution of OLAP queries and data aggregation, SQL-99 introduced three extensions to the GROUP BY statement: the CUBE, ROLLUP and GROUPING SETS operator.

The CUBE operator computes a union of GROUP BY’s on every subset of the specified attribute types. Its result set represents a multidimensional cube based upon the source table. Consider the following SALESTABLE:

PRODUCT	QUARTER	REGION	SALES
A	Q1	Europe	10
A	Q1	America	20
A	Q2	Europe	20
A	Q2	America	50
A	Q3	America	20
A	Q4	Europe	10
A	Q4	America	30
B	Q1	Europe	40
B	Q1	America	60
B	Q2	Europe	20
B	Q2	America	10
B	Q3	America	20
B	Q4	Europe	10
B	Q4	America	40

We can now formulate the following SQL query:

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

Basically, this query computes the union of 2² = 4 groupings of the SALESTABLE being: {(quarter,region), (quarter), (region), ()}, where () denotes an empty group list representing the total aggregate across the entire SALESTABLE. In other words, since quarter has 4 values and region 2 values, the resulting multiset will have 4*2+4*1+1*2+1 or 15 tuples as you can see illustrated in the figure below. NULL values have been added in the dimension columns Quarter and Region to indicate the aggregation that took place. They can be easily replaced by the more meaningful ‘ALL’ if desired. More specifically, we can add two CASE clauses as follows

SELECT CASE WHEN grouping(QUARTER) = 1 
THEN 'All' 
ELSE QUARTER END AS QUARTER, 
CASE WHEN grouping(REGION) = 1 
THEN 'All' ELSE REGION END AS REGION, 
SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

The grouping() function returns 1 in case a NULL value is generated during the aggregation and 0 otherwise. This distinguishes the generated NULLs and the possible real NULLs stemming from the data. We will not add this to the subsequent OLAP queries so as to not unnecessarily complicate them.

Also, observe the NULL value for Sales in the fifth row. This represents an attribute combination which is not present in the original SALESTABLE since apparently no products were sold in Q3 in Europe. Remark that besides SUM() also other SQL aggregator functions such as MIN(), MAX(), COUNT() and AVG() can be used in the SELECT statement.

The result looks as follows:

QUARTER	REGION	SALES
Q1	Europe	50
Q1	America	80
Q2	Europe	40
Q2	America	60
Q3	Europe	NULL
Q3	America	40
Q4	Europe	20
Q4	America	80
Q1	NULL	130
Q2	NULL	100
Q3	NULL	40
Q4	NULL	90
NULL	Europe	110
NULL	America	250
NULL	NULL	360

Next, the ROLLUP operator computes the union on every prefix of the list of specified attribute types, from the most detailed up to the grand total. It is especially useful to generate reports containing both subtotals and totals. The key difference between the ROLLUP and CUBE operator is that the former generates a result set showing the aggregates for a hierarchy of values of the specified attribute types, whereas the latter generates a result set showing the aggregates for all combinations of values of the selected attribute types. Hence, the order in which the attribute types are mentioned is important for the ROLLUP but not for the CUBE operator. Consider the following query:

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)

This query generates the union of three groupings {(quarter,region), (quarter}, ()} where () again represents the full aggregation. The resulting multiset will thus have 4*2+4+1 or 13 rows and is displayed in the figure below. You can see that the region dimension is first rolled up followed by the quarter dimension. Note the two rows which have been left out when compared to the result of the CUBE operator:

QUARTER	REGION	SALES
Q1	Europe	50
Q1	America	80
Q2	Europe	40
Q2	America	60
Q3	Europe	NULL
Q3	America	40
Q4	Europe	20
Q4	America	80
Q1	NULL	130
Q2	NULL	100
Q3	NULL	40
Q4	NULL	90
NULL	NULL	360

Whereas the previous example applied the GROUP BY ROLLUP construct to two completely independent dimensions, it can also be applied to attribute types that represent different aggregation levels (and hence different levels of detail) along the same dimension. For example, suppose the SALESTABLE tuples represented more detailed sales data at the individual city level and that the table contained three location related columns: City, Country and Region. We could then formulate the following ROLLUP query, yielding sales totals respectively per city, per country, per region and the grand total:

SELECT REGION, COUNTRY, CITY, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (REGION, COUNTRY, CITY)

Note that in that case the SALESTABLE would include the attribute types City, Country and Region in a single table. Since the three attribute types represent different levels of detail in the same dimension, they are transitively dependent on one another, illustrating the fact that these data warehouse data are indeed denormalized.

The GROUPING SETS operator generates a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses. Consider the following example:

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER), (REGION))

This query is equivalent to:

SELECT QUARTER, NULL, SUM(SALES)
FROM SALESTABLE
GROUP BY QUARTER
UNION ALL
SELECT NULL, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY REGION

The result is given below.

QUARTER	REGION	SALES
Q1	NULL	130
Q2	NULL	100
Q3	NULL	40
Q4	NULL	90
NULL	Europe	110
NULL	America	250

Multiple CUBE, ROLLUP and GROUPING SETS statements can be used in a single SQL query. Different combinations of CUBE, ROLLUP and GROUPING SETS can generate equivalent result sets. Consider the following query

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)

This query is equivalent to:

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER), (REGION), ())

Likewise, the following query

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)

is identical to:

SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION), (QUARTER),())

SQL2003 introduced additional analytical support for two types of frequently encountered OLAP activities: ranking and windowing. Ranking should always be done in combination with an SQL ORDER BY clause. Assume we have the following table:

PRODUCT	SALES
A	50
B	20
C	10
D	45
E	40
F	30
G	60
H	20
I	15
J	25

Various ranking measures can now be calculated by using the following SQL query:

SELECT PRODUCT, SALES,
RANK() OVER (ORDER BY SALES ASC) as RANK_SALES, 
DENSE_RANK() OVER (ORDER BY SALES ASC) as DENSE_RANK_SALES, 
PERCENT_RANK() OVER (ORDER BY SALES ASC) as PERC_RANK_SALES, 
CUM_DIST() OVER (ORDER BY SALES ASC) as CUM_DIST_SALES,
FROM SALES
ORDER BY RANK_SALES ASC

The result of this query is depicted below. The RANK() function assigns a rank based upon the ordered sales value whereby similar sales values are assigned the same rank. Contrary to the RANK() function, the DENSE_RANK() function does not leave gaps between the ranks. The PERCENT_RANK() function calculates the percentage of values less than the current value, excluding the highest value. It is calculated as (RANK() – 1) / (Number of Rows – 1). The CUM_DIST() function calculates the cumulative distribution or the percentage of values less than or equal to the current value.

Product	Sales	RANK_SALES	DENSE_RANK_SALES	PERC_RANK_SALES	CUM_DIST_SALES
C	10	1	1	0	0.1
I	15	2	2	1/9=0.11	0.2
B	20	3	3	2/9=0,22	0.4
H	20	3	3	2/9=0,22	0.4
J	25	5	4	4/9=0,44	0.5
F	30	6	5	5/9=0,55	0.6
E	40	7	6	6/9=0,66	0.7
D	45	8	7	7/9=0,77	0.8
A	50	9	8	8/9=0,88	0.9
G	60	10	9	9/9=1	1

All these measures can also be computed for selected partitions of the data. The measures depicted can also be computed for each region separately. Assuming the source table SALES now also includes a REGION attribute type, the query would then become

SELECT PRODUCT, SALES,
RANK() OVER (PARTITION BY REGION ORDER BY SALES ASC) as RANK_SALES, 
DENSE_RANK() OVER (PARTITION BY REGION ORDER BY SALES ASC) as DENSE_RANK_SALES, 
PERCENT_RANK() OVER (PARTITION BY REGION ORDER BY SALES ASC) as PERC_RANK_SALES, 
CUM_DIST OVER (PARTITION BY REGION ORDER BY SALES ASC) as CUM_DIST_SALES,
FROM SALES
ORDER BY RANK_SALES ASC

Windowing allows calculating cumulative totals or running averages based on a specified window of values. In other words, windowing allows getting access to more than one row of a table without requiring a self-join. Consider the table depicted below.

QUARTER	REGION	SALES
1	America	10
2	America	20
3	America	10
4	America	30
1	Europe	10
2	Europe	20
3	Europe	10
4	Europe	20

The following query calculates the average sales for each region and quarter on the basis of the current, previous and next quarter.

SELECT QUARTER, REGION, SALES,
AVG(SALES) OVER 
  (PARTITION BY REGION ORDER BY QUARTER ROWS 
   BETWEEN 1 PRECEDING AND 1 FOLLOWING) 
AS SALES_AVG
FROM SALES
ORDER BY REGION, QUARTER, SALES_AVG

The result is as such:

QUARTER	REGION	SALES	SALES_AVG
1	America	10	15
2	America	20	13,33
3	America	10	20
4	America	30	20
1	Europe	10	15
2	Europe	20	13,33
3	Europe	10	16,67
4	Europe	20	15

The PARTITION BY REGION statement subdivides the rows into partitions, similar to a GROUP BY clause. It enforces that the windows do not reach across partition boundaries. In other words, the SALES_AVG values will always be computed within a particular region. As an example, the SALES_AVG value for quarter 2 in America will be calculated as (10+20+10)/3=13,33 whereas for quarter 4 in America it is calculated as: (10+30)/2=20.

These are just a few examples of ranking and windowing facilities available in SQL. It is highly recommended to check the manual of the RDBMS vendor for more information. Furthermore, note that not all RDBMS vendors support these extensions. The ones that do support them usually also provide a user-friendly and graphical environment to construct OLAP reports using point-and-click which are then automatically translated by the tool into the corresponding SQL statements.

Given the amount of data to be aggregated and retrieved, OLAP SQL queries may get very time consuming. One way to speed up performance is by turning some of these OLAP queries into materialized views. For example, an SQL query with a CUBE operator can be used to precompute aggregations on a selection of dimensions of which the results can then be stored as a materialized view. A disadvantage of view materialization is that extra efforts are needed to regularly refresh these materialized views, although it can be noted that usually companies are fine with a close to current version of the data such that the synchronization can be done overnight or at fixed time intervals.