Selecting the Optimal
While computing the optimal (MAX
or MIN
) of an expression is straightforward in SQL, retrieving the rows that contain the optimal value is remarkably tedious and inefficient.
As a simple example, consider the table table vhclicks
containing (simplified) clickstream data for an e-commerce site. It contains a row for each page visited, containing the visitor identifier visitorid
, time of visit visittime
, page type pagetype
, and the number of "hovers" by the visitor on the page hcount
.
> vhclicks;
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
1 | 10:21:03 | login | 1
1 | 10:24:39 | prodview | 10
1 | 10:27:14 | logout | 1
2 | 10:21:04 | login | 1
2 | 10:22:10 | search | 5
2 | 10:27:15 | logout | 1
2 | 11:01:22 | login | 1
1 | 11:01:23 | login | 1
2 | 11:02:33 | prodview | 7
2 | 11:04:10 | search | 10
2 | 11:05:47 | prodview | 5
1 | 11:05:48 | prodview | 9
2 | 11:07:19 | checkout | 3
2 | 11:09:52 | prodview | 10
2 | 11:13:21 | logout | 1
1 | 11:13:22 | logout | 1
-----------+-----------+----------+--------
(16 rows)
Now, suppose we want to retrieve the rows for pages with the greatest hcount
visited after 11:00
. Here is how we perform this simple task in standard SQL.
> SELECT *
FROM vhclicks
WHERE hcount = (
SELECT MAX(hcount) FROM vhclicks WHERE visittime > '11:00:00'::time
) AND visittime > '11:00:00'::time;
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
2 | 11:04:10 | search | 10
2 | 11:09:52 | prodview | 10
-----------+-----------+----------+--------
(2 rows)
Notice the redundancy -- we need to specify the condition twice, which could be painful and error-prone for a complex WHERE
clause. Moreover, this takes two passes over the data, and the WHERE
clause is evaluated twice. This is clearly undesirable for large datasets.
Sclera provides a construct, called ARG
, that eliminates the redundancy and inefficiencies in syntax as well as execution of the above query. In Sclera, we can frame the same query as:
> (vhclicks WHERE visittime > '11:00:00'::time) ARG MAX(hcount)
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
2 | 11:04:10 | search | 10
2 | 11:09:52 | prodview | 10
-----------+-----------+----------+--------
(2 rows)
Unlike the earlier query, this query is evaluated in a single pass over the data.
We can think of ARG
as a special "filter", similar to WHERE
and HAVING
. However, while WHERE
and HAVING
compute their output by applying a condition one row at a time, the output rows of ARG
is computed based on the entire input (or the input partition).
We can get the rows containing optimal values for multiple aggregates as well -- the following query returns the row with the earliest visittime
past 11:00
, in addition to the rows above.
> (vhclicks WHERE visittime > '11:00:00'::time) ARG (MAX(hcount), MIN(visittime));
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
2 | 11:04:10 | search | 10
2 | 11:09:52 | prodview | 10
2 | 11:01:22 | login | 1
-----------+-----------+----------+--------
(3 rows)
It is easy to do a cascaded computation. For instance, to find the row with the earliest visittime
among those with the greatest hcount
across pages visited after 11:00
, we say:
> (vhclicks WHERE visittime > '11:00:00'::time) ARG MAX(hcount) ARG MIN(visittime);
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
2 | 11:04:10 | search | 10
-----------+-----------+----------+--------
(1 row)
We can use ARG
with the MATCH
operator as well. For instance, the following query finds the product views on which visitor id 1
hovered the most in a session:
> (vhclicks WHERE visitorid = 1 ORDER BY visittime) ARG prodview.MAX(hcount)
OVER MATCH "login.(prodview | search | checkout)*.logout" ON pagetype;
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
1 | 10:24:39 | prodview | 10
1 | 11:05:48 | prodview | 9
-----------+-----------+----------+--------
(2 rows)
A set of rows will be output for every match of the MATCH
regular expression. Note that the order of the input is relevant for the above query.
ARG
on Partitioned Input◄
We can use ARG
on partitioned input. The following query find the pages with the highest hcount
for each visitor after 11:00
.
> (vhclicks WHERE visittime > '11:00:00'::time) PARTITION BY visitorid ARG MAX(hcount);
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
2 | 11:04:10 | search | 10
2 | 11:09:52 | prodview | 10
1 | 11:05:48 | prodview | 9
-----------+-----------+----------+--------
(3 rows)
Similarly, the following query finds, for each visitor, the product views on which the visitor hovered the most in a session:
> (vhclicks ORDER BY visittime) PARTITION BY visitorid ARG prodview.MAX(hcount)
OVER MATCH "login.(prodview | search | checkout)*.logout" ON pagetype;
-----------+-----------+----------+--------
VISITORID | VISITTIME | PAGETYPE | HCOUNT
-----------+-----------+----------+--------
1 | 10:24:39 | prodview | 10
2 | 11:09:52 | prodview | 10
1 | 11:05:48 | prodview | 9
-----------+-----------+----------+--------
(3 rows)
ARG
Syntax◄
This section introduced a new table expression with the following syntax:
table_expression [ PARTITION BY ( partn_columns ) ]
ARG ( [ label . ] aggr_func ( aggr_params ) [, ...] )
[ [ OVER ] match_expression ]
where:
table_expression
is an arbitrary table expression.partn_columns
is an optional comma-separated list of columns in the result oftable_expression
. When specified:- The result of
table_expression
is partitioned on this set of columns; the aggregation happens independently on the rows within each partition.
- The result of
aggr_func
is an aggregate functionaggr_params
is a comma-separated list of scalar expressions, all of whose column references are contained in the result oftable_alias
. These are the parameters of the aggregate functionaggr_func
.match_expression
is an optionalMATCH
expressionlabel
is optional. When specified, it can be:- When
MATCH
is not present, it is thetable_alias
for thetable_expression
, or - When
MATCH
is present, it is a label identifying a subsequence in theMATCH
regular expression.
- When