Using Cross-Tabulation
Sclera supports standard SQL cross-tabulation function PIVOT
and its inverse UNPIVOT
. The semantics of these functions is the same as in Oracle 11g and MS SQL Server 2008, but with a slightly modified (in our opinion, simplified) syntax.
PIVOT
◄
The PIVOT
operator creates a contingency table from raw input data.
Examples◄
For instance, consider the table vclicks
containing (rather simplified) visitor clicks data.
> vclicks;
-----------+-----------+----------
VISITORID | VISITTIME | PAGETYPE
-----------+-----------+----------
1 | 10:21:03 | login
1 | 10:24:39 | prodview
1 | 10:27:14 | logout
2 | 10:21:04 | login
2 | 10:22:10 | search
2 | 10:27:15 | logout
2 | 11:01:22 | login
1 | 11:01:23 | login
2 | 11:02:33 | prodview
2 | 11:04:10 | search
2 | 11:05:47 | prodview
1 | 11:05:48 | prodview
2 | 11:07:19 | checkout
2 | 11:09:52 | prodview
2 | 11:13:21 | logout
1 | 11:13:22 | logout
-----------+-----------+----------
(16 rows)
Lets say we want to see the counts of the visits for page types "search", "checkout" and "prodview" for each visitor. One way to compute these aggregates would be using GROUP BY
aggregation:
> SELECT visitorid, pagetype, COUNT(*)
FROM vclicks
WHERE pagetype IN ("search", "checkout", "prodview")
GROUP BY visitorid, pagetype
ORDER BY visitorid;
-----------+----------+-------
VISITORID | PAGETYPE | COUNT
-----------+----------+-------
1 | prodview | 2
2 | prodview | 3
2 | checkout | 1
2 | search | 2
-----------+----------+-------
(4 rows)
This is good, but for a better view, we might want all the aggregates for a visitor to appear in the same row. This can be achieved using embedded CASE
statements.
> SELECT visitorid,
COUNT(CASE pagetype WHEN "search" THEN 1 END) AS search,
COUNT(CASE pagetype WHEN "checkout" THEN 1 END) AS checkout,
COUNT(CASE pagetype WHEN "prodview" THEN 1 END) AS prodview
FROM vclicks
GROUP BY visitorid
ORDER BY visitorid;
-----------+--------+----------+----------
VISITORID | SEARCH | CHECKOUT | PRODVIEW
-----------+--------+----------+----------
1 | 0 | 0 | 2
2 | 2 | 1 | 3
-----------+--------+----------+----------
(2 rows)
With PIVOT
, we can say, instead:
> vclicks PARTITION BY visitorid
PIVOT COUNT(*) FOR pagetype IN ("search", "checkout", "prodview")
ORDER BY visitorid;
-----------+--------+----------+----------
VISITORID | SEARCH | CHECKOUT | PRODVIEW
-----------+--------+----------+----------
1 | 0 | 0 | 2
2 | 2 | 1 | 3
-----------+--------+----------+----------
(2 rows)
If we want counts across all the visitors, we do not need the PARTITION BY
clause.
> vclicks PIVOT COUNT(*) FOR pagetype IN ("search", "checkout", "prodview");
--------+----------+----------
SEARCH | CHECKOUT | PRODVIEW
--------+----------+----------
2 | 1 | 5
--------+----------+----------
(1 row)
We can use any aggregate in place of COUNT
. To get the last visit time for the visit to the pages instead, we can say:
> vclicks PARTITION BY visitorid
PIVOT MAX(visittime) FOR pagetype IN ("search", "checkout", "prodview")
ORDER BY visitorid;
-----------+----------+----------+----------
VISITORID | SEARCH | CHECKOUT | PRODVIEW
-----------+----------+----------+----------
1 | | | 11:05:48
2 | 11:04:10 | 11:07:19 | 11:09:52
-----------+----------+----------+----------
(2 rows)
Syntax◄
The syntax of the operator is:
table_expression [ PARTITION BY ( partn_columns ) ]
PIVOT aggr_func ( aggr_params ) FOR target_column IN ( target_value [ AS alias ] [, ...] )
where:
table_expression
is an arbitrary table expressionpartn_columns
is an optional comma-separated list of columns in the result oftable_expression
. When specified:- The result of
table_expression
is partitioned on this set of columns; the aggregation happens independently on the rows within each partition. - These columns will be included in each output row, alongside the aggregates for the corresponding partition.
- The result of
aggr_func
is an aggregate functionaggr_params
is a comma-separated list of scalar expressions, all of whose column references are contained in the result oftable_alias
. These are the parameters of the aggregate functionaggr_func
target_column
is theGROUP BY
column, but the values are restricted to thetarget_value
list specified nexttarget_value
are values oftarget_column
, these are the values on which the grouping of rows happens (within a partition).- The respective aggregates will be included in the output rows as separate columns.
- The name of the column for
target_value
will be the associatedalias
, if present, or a string representation oftarget_value
.
UNPIVOT
◄
The UNPIVOT
operator converts columns into rows.
Examples◄
Consider the result of the example above. We assume that the result is in a table pagecounts
.
> pagecounts;
-----------+--------+----------+----------
VISITORID | SEARCH | CHECKOUT | PRODVIEW
-----------+--------+----------+----------
1 | 0 | 0 | 2
2 | 2 | 1 | 3
-----------+--------+----------+----------
This table has all the counts for a visitor accumulated in a single row for the visitor.
This might not always be convenient, and lets say we need a separate row for SEARCH, CHECKOUT and PRODVIEW counts for each visitor. Each row in the result needs to have three columns: a column visitorid
containing the the visitor id, a column pagetype
containg one of the values "Search", "Checkout" or "ProdView" indicating what count the row is for, and a column pagecount
containing the correspoding count.
The UNPIVOT
operator does this very simply:
> pagecounts UNPIVOT pagecount FOR pagetype IN (
search AS "Search",
checkout AS "Checkout",
prodview AS "ProdView"
);
The result is:
-----------+----------+----------
VISITORID | PAGETYPE | PAGECOUNT
-----------+----------+----------
1 | Search | 0
1 | Checkout | 0
1 | ProdView | 2
1 | Search | 2
1 | Checkout | 1
1 | ProdView | 3
-----------+----------+----------
Note that pagetype
and pagecount
are new columns. The values in pagetype
correspond to the columns names in the input table, as specified in the IN
clause.
Syntax◄
The syntax of the operator is:
table_expression
UNPIVOT value_column FOR label_column IN ( label_value_column [ AS label ] [, ...] )
where:
table_expression
is an arbitrary table expression.label_value_column
is a column in the output oftable_expression
label
is the string-valued label associated with the columnlabel_value_column
; if unspecified, the name oflabel_value_column
is taken as the labelvalue_column
is the column in the result that will contain the unpivoted valueslabel_column
is the column in the result that will contain the label of the columnlabel_value_column
whose value is placed invalue_column
Each of the columns label_value_column
are assumed to be of the same type, and this common type is the type of the column value_column
in the result.
For each row in the output of the input table_expression
, the operator will generate a row for each specified column label_value_column
, consisting of:
- the value of
label_value_column
in the input row, placed in the result columnvalue_column
- the label of
label_value_column
, placed in the result columnlabel_column
- a copy of all columns in the input row except any of the columns
label_value_column
specified in theIN
clause
Overhead◄
The PIVOT
and UNPIVOT
operators in Sclera are evaluated in a single pass over the input. The memory consumption is independent of the number of input rows.