Visualization
Sclera's visualization component, ScleraViz, enables quick and easy visualization of your query results. ScleraViz is integrated with ScleraSQL; this means a few lines of ScleraSQL can fetch, clean, analyze and visualize your data in a single sweep.
ScleraViz is inspired by Grammar of Graphics, specifically R's ggplot2 -- but is implemented as an extension to ScleraSQL and uses D3 as the rendering engine. Moreover, unlike ggplot2, ScleraViz can clean, analyze and plot streaming data.
An online preview, with a number of examples with code, is available at https://scleraviz.herokuapp.com.
In this document, we first describe how to set up Sclera to use ScleraViz, then illustrate ScleraViz by example, and finally give the detailed syntax of ScleraSQL visualization queries.
Command Line Setup◄
In the following, <sclera-root>
is the directory where Sclera is (or, is to be) installed.
To run ScleraViz queries, you need to:
-
Install Sclera Command Line using
scleradmin
(detailed instructions here)$ scleradmin --install --root <sclera-root>
-
Install the Sclera Web Display plugin
$ scleradmin --add sclera-plugin-webdisplay --root <sclera-root>
-
If you need to read CSV data, install the CSV plugin as well
$ scleradmin --add sclera-plugin-csv --root <sclera-root>
-
Start the Sclera Command Line Shell
$ <sclera-root>/bin/sclera Welcome to Sclera 4.0 > _
-
In the Sclera Command Line Shell, start the display web server:
Welcome to Sclera 4.0 > display start; HTTP server started at port 7070 Please point your browser to http://localhost:7070
To start the server at a port other than 7070, you need to specify the port number explicitly at the end. For example,
display start 8080;
starts the server atlocalhost:8080
. -
Start a web browser and open the URL http://localhost:7070
Running a ScleraViz query (described in the rest of this document) on the Sclera Command Line will display the visualization in the browser. In addition, to display the contents of a query as a table in the browser, you can say:
> display <query>;
For example, the following displays the result as a two column, single row table in the browser:
> display select 'Hello' as greeting1, 'World' as greeting2;
Leaving the keyword display
out displays the result in the command line shell, as usual.
You can also display arbitrary text (say, comments) as well:
> display "The plot above is amazing!";
When done, you can close the display web server by saying:
> display stop;
ScleraViz by Example◄
This section very briefly introduces the rudiments of the plot specification. This is to quickly get you started on plotting your own graphs. A complete and formal description of the syntax appear in the next section.
Example Dataset
In the following, we will be using the "tipping" dataset as our running example. The data set contains information about tips received by a waiter in a restaurant over a period of time. The information includes the tip in dollars, total bill in dollars, gender of the bill payer, whether there were smokers in the party, day of the week, time of day, and size of the party. Further details on the data can be found here.
To use the dataset, set up the tips
table as follows:
CREATE TEMP TABLE tips as
EXTERNAL CSV("http://scleraviz.herokuapp.com/assets/data/tips.csv")
TYPEINFER(LIMIT 1);
Notation: In the above and the following examples, the ScleraSQL keywords will be capitalized. This is for illustrative reasons only; ScleraSQL is case-insensitive.
A Simple Plot◄
Putting a PLOT
afer the query plots the column y
in the query result against the column x
in the query result as a scatter plot. So, to plot the tip
against the total_bill
in the tipping dataset, we say:
SELECT total_bill as x, tip AS y FROM tips
PLOT;
The above statement works on a bunch of defaults -- it creates a scatter-plot, which takes column x
in the input as the default for the X coordinate, and column y
in the input as the default for the y
coordinate, and also takes default values for aesthetics such as point fill and stroke color, size, opacity (alpha), etc.
In the discussion that follows, we show how to explicitly override these defaults, and even vary them as a function of the data.
Specifying the Geometry◄
A "geometry" refers to the manner in which a data point maps to the chart. In the scatterplot example above, the geometry is a POINT
.
We can explicitly specify the geometry as follows:
SELECT total_bill as x, tip AS y FROM tips
PLOT(GEOM=POINT)
The location of a geometry is determined by a set of parameters. By default, these parameters are mapped to appropriately named columns in the input. For instance, each point above is placed according to the values in the x
and y
columns.
We can override this default by explicitly specifing custom columns for these parameters. So, another way to write the example above is:
tips PLOT(GEOM=POINT(x=total_bill, y=tip))
Here, the parameters x
and y
are mapped to input columns total_bill
and tip
respectively; in general, they can be mapped to arbitrary expressions.
Sclera supports a number of alternative geometries.
For instance, to plot the average bill for a given party size against the party size as a histogram, we can say:
SELECT `size`, AVG(total_bill) as avg_bill
FROM tips
GROUP BY `size`
PLOT(GEOM=BAR(x=`size`, y=avg_bill))
size
needs to be in backquotes as it is a keyword. This is a parser requirement; we understand that it is cumbersome and are working on workarounds. Meanwhile, it is always a good idea to put the column names in backquotes.)
Similarly, to plot the same data as a line, we use GEOM=LINE
:
SELECT `size`, AVG(total_bill) as avg_bill
FROM tips
GROUP BY `size`
PLOT(GEOM=LINE(x=`size`, y=avg_bill))
A full list of the supported geometries appears as a part of the detailed documentation.
Aesthetics and Legends◄
The aesthetics parameters specify the "look and feel" of a chart. For a scatterplot, for instance, the parameters specify the shape of the point (circle, square, triangle, and so on), the color to be used to fill the point, the color used for the boundary, and so on.
The set of aesthetics parameters varies with the geometry. For example, a SHAPE
makes sense for a point, but does not make sense for a line. The list of aesthetics parameters associated with respective geometries appears as a part of the detailed documentation.
An aesthetics parameter can either be a constant, or an expression. If the parameter is a constant, e.g. FILL="black"
, then the value of the aesthetics is literally taken as the constant. If the parameter is an expression, then distinct values of the columns are mapped to distinct values of the aesthetic.
For instance, if we specify FILL=foo
where foo
is an input column, then by default, each distinct value of foo
is mapped to a distinct value of the color that is actually used for the fill. The mapping from each distict value of the column foo
to the corresponding color is automatically maintained, and can be output as a legend, by saying FILL=foo LEGEND
.
The following example shows a scatterplot with expressions for FILL
and SHAPE
aesthetics.
tips PLOT(
GEOM=POINT(x=total_bill, y=tip),
FILL=day LEGEND(TITLE="Day"),
SHAPE=`time` LEGEND(TITLE="Time")
)
A full description of the aesthetics parameters appears in the detailed documentation.
Group◄
The optional GROUP
directive specifies an expression on the input data. The input data points are first partitioned based on the values of this expression (similar to the SQL GROUP BY
), and each partition is plotted separately in the same layer.
For instance, adding a GROUP
directive to a LINE
will plot a line for each distinct value of the GROUP
expression.
SELECT `size`, gender, AVG(total_bill) as avg_bill
FROM tips
GROUP BY `size`, gender
ORDER BY `size`
PLOT(GEOM=LINE(x=`size`, y=avg_bill), GROUP=gender, STROKE=gender LEGEND)
Key◄
By default, each row in the input corresponds to a unique data point. The optional KEY
directive enables associating multiple rows with the same data point -- successive rows with the same key value render as the same data point, overriding the previous rendering.
This is helpful to display, say, a dynamic bar chart showing a running average for tips collected on each day of week. We specify KEY=day
and, for every input row, generate a row containing the updated running average of tip
for the associated value of day
.
With KEY=day
, the first rows for a value create a bar, and the subsequent roes for the same value update the existing bar.
SELECT day, T.avg(tip) as running_avg
FROM tips T PARTITION BY day
PLOT(GEOM=BAR(x=day, y=running_avg), KEY=day)
Without KEY=day
, a new bar will be generated for each update.
SELECT day, T.avg(tip) as running_avg
FROM tips T PARTITION BY day
PLOT(GEOM=BAR(x=day, y=running_avg), ALPHA=0.1)
Position Adjustments◄
Following the grammar of graphics philosophy, the input data is assumed to have one row per data point. When two data points are rendered one over the other, they are said to "collide".
Colliding data points can be repositioned by specifying a POSITION
directive, which can take values dodge
, stack
, or jitter
.
Specifying POSITION=DODGE
positions the data points in a row, one after other, around the common x
position on the X axis. When the geometry is BAR
, this gives the familiar "grouped" bar chart:
SELECT day, gender, count(*) AS count
FROM tips
GROUP BY day, gender
ORDER BY day, gender
PLOT(GEOM=BAR(x=day, y=count), POSITION=DODGE, FILL=gender LEGEND)
Specifying POSITION=STACK
positions the data points in a column, one over other, at the common x
position on the X axis. When the geometry is BAR
, this gives the familiar "stacked" bar chart:
SELECT day, gender, count(*) AS count
FROM tips
GROUP BY day, gender
ORDER BY day, gender
PLOT(GEOM=BAR(x=day, y=count), POSITION=STACK, FILL=gender LEGEND)
Specifying POSITION=jitter
adds a random offset to both the x
and y
positions of the data point. This is useful when we have overlapping points, so that the top points obscure the ones below -- the jitter randomly scatters the obscured points within a small ball around their position, making them visible.
Stat◄
The STAT
directive specifies a statistical operation over the input data points. Supported operations include computing local regression smoothing, computing density histograms, and computing heat-maps.
By default, the results are computation are overlayed over the base plot.
For instance, consider the scatter-plot of average tips against the total bill. Including STAT=loess
plots a line representing local regression smoothing over these data points. This curve brings out the trend in the data -- how the tips vary with increasing total_bill
-- that may not be apparent from just the scatter plot.
SELECT total_bill, avg(tip) as avg_tip
FROM tips
GROUP BY total_bill
PLOT(
GEOM=POINT(x=total_bill, y=avg_tip),
STAT=LOESS(STROKE="orange", STROKE_WIDTH="5px")
)
The computed result is plotted in its own layer (see the next section), using a geometry that is determined by the operation, and using aesthetic parameters that can be provided along with the operation, as in the example above.
Mark◄
The MARK
directive marks out regions where a specified predicate is true. This is done in real time, as the data arrives, and so can be very useful in monitoring for events.
For example, the following query marks regions in a stock ticker where the current high is greater than the running average high with green, and regions where the current low is less than the running average low with red.
EXTERNAL CSV("http://scleraviz.herokuapp.com/assets/data/infy.csv") TYPEINFER(LIMIT 1) AS infy
PLOT(
GEOM=LINE(x=ts, y=close),
MARK XAXIS(
high > infy.AVG(high),
FILL="green"
),
MARK XAXIS(
low < infy.AVG(low),
FILL="red"
)
)
Multi-Layered Plots◄
A layer represents a unit of rendering in the plot. The examples so far, except the stats, included a single layer -- which is the default.
A layer can only support one kind of geometry, and the associated set of aesthetic parameters and position specifications. If you want to include, say, a scatter plot and a line in the same plot, you need to have separate layers for the points and the line.
You can explicitly specify layers using the LAYER
keyword. The initial scatterplot can be equivalently be specified as:
tips PLOT(LAYER(GEOM=point(x=total_bill, y=tip)))
We can specify multiple layers in the same plot. The following plots the average bill for a party size, along with a one-standard deviation interval in a separate layer:
SELECT `size`, avg(total_bill) as avg_bill, stddev(total_bill) as sdev
FROM tips
GROUP BY `size`
ORDER BY `size`
PLOT(
LAYER(GEOM=LINE(x=`size`, y=avg_bill)),
LAYER(GEOM=RIBBON(x=`size`, ymin=avg_bill - sdev, ymax=avg_bill + sdev))
)
Another example on the same query, this time with bars and error lines:
SELECT `size`, avg(total_bill) as avg_bill, stddev(total_bill) as sdev
FROM tips
GROUP BY `size`
ORDER BY `size`
PLOT(
LAYER(GEOM=BAR(x=`size`, y=avg_bill)),
LAYER(
GEOM=POINTRANGEY(
x=`size`, y=avg_bill,
ymin=avg_bill - sdev, ymax=avg_bill + sdev
)
)
)
The STAT
directive mentioned in the previous section implicitly adds a new layer to the plot, with the result of the associated operation.
Multi-Plots◄
We can have multiple plots on the same input data; these plots are plotted simultaneously in a single scan of the input.
For instance, the following specification renders two separate plots on the same input:
tips
PLOT(GEOM=POINT(x=total_bill, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=total_bill), ALPHA=0.3)
PLOT(GEOM=POINT(x=total_bill, y=total_bill), ALPHA=0.3)
The plots are plotted one after the other, in separate rows. As the plots are rendered, the Y-axes of the first two plots, which plot the same column tip
, and the X-axes of the last two plots, which plot the same column size
are kept synchronized.
Automatic Layouts◄
ScalerViz provides an aligned layout mode that places the plots so that plots sharing the same X-axis columns are aligned vertically, and the plots sharing the same Y-axis columns are aligned horizontally. We try this mode, activated using the LAYOUT ALIGNED
directive on the example above, to get:
tips
PLOT(GEOM=POINT(x=total_bill, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=total_bill), ALPHA=0.3)
PLOT(GEOM=POINT(x=total_bill, y=total_bill), ALPHA=0.3)
LAYOUT ALIGNED
Weighted Axes◄
In the multi-plot examples above, all axes have the same size. However, we may sometimes want to emphasize some columns over the others.
In ScleraViz, we can assign weights to axes. By default, all axes are assigned a weight of 1. Assigning a weight of, say 0.5, makes the size of an axis half of an axis with weight 1. Similarly, assigning a weight of 2.0 makes the size of an axis twice that of an axis with weight 1. The actual axis lengths are assigned respecting these relative size constrainst and the constraint that all the plots need to fit in the given width and height.
tips
PLOT(GEOM=POINT(x=total_bill, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=tip), ALPHA=0.3)
PLOT(GEOM=POINT(x=`size`, y=total_bill), ALPHA=0.3)
PLOT(GEOM=POINT(x=total_bill, y=total_bill), ALPHA=0.3)
AXIS `size`(WEIGHT=0.25, TICKS=3)
AXIS tip(WEIGHT=2)
LAYOUT ALIGNED
Cross Plots◄
It is common to plot each of a set of columns against each of another set of columns. In ScleraViz, we can plot such "cross-plots" by using the single-layer PLOT
syntax, and specifying a list of columns for the parameters of the geometry.
tips
PLOT(
GEOM=POINT(x=(smoker, gender, day, `time`), y=(total_bill, tip)),
ALPHA=0.3
)
LAYOUT ALIGNED
Faceted Plots◄
Faceting a plot on an expression involves displaying separate plots for datapoints partitioned by distinct values of the expression.
ScleraViz can facet along columns (meaning that the plots corresponding to distinct values are layed out horizontally, in separate columns in a grid), or have faceting along rows (meaning that the plots corresponding to distinct values are layed out vertically, in separate rows in a grid), or both.
For instance, the following plot separates out the tip and total bill data based on whether smokers were present (yes/no, along rows), and the gender of the bill-payer (male/female, along columns):
tips
PLOT(GEOM=POINT(x=total_bill, y=tip), ALPHA=0.3)
FACET(ROWS=smoker, COLUMNS=gender)
Detailed Syntax◄
This section gives the detailed formal syntax of the ScleraViz specification language.
Formally, the ScleraViz specification is as follows (recall that |
means alternatives, []
means optional, and ...
means repetitions):
query plot_spec [ , ... ] [ facet_spec ] [ axis_spec ] [ ... ] [ layout_spec ] [ grid_spec ] [ display_spec ]
where the query
is a ScleraSQL query. The plot specification plot_spec
can appear one or more times, each plot_spec
defining a new plot, followed by an optional facet specification facet_spec
, zero or more axis_spec
if we need to change the axis defaults. A layout_spec
then gives the layout preferences. This is followed by a grid_spec
if we need to change the grid defaults, and a display_spec
to change the display area defaults.
Plot / Layer Specification◄
The plot_spec
specifies a plot, and has two variants:
PLOT ( layer_params ) | PLOT ( LAYER ( layer_params ) [, ...] )
The first variant is an shortened form of the second, and can be used when the plot has only one layer. The second variant is needed when the plot has multiple layers.
Each layer_param
specifies a layer in the plot. It contains several parts, each of them except geometry being optional, which can be specified in any order. We describe the sybtax of each part in the next few sections.
Geometry◄
The geometry specifies the manner in which the datapoint will be represented in the chart. The syntax is as follows:
GEOM = geom_type [ ( geom_params ) ]
The geom_type
is the type of the geometry, and geom_params
is a comma-separated list of key = value
entries. The keys for each geometry are given the following table; the values for the keys can be a numeric scalar expression on the underlying table (or a constant).
geom_type |
geom_params keys |
---|---|
POINT or SCATTER | X, Y |
LINE | X, Y |
POINTRANGEY | X, Y, YMIN, YMAX |
ERRORBAR or RANGEY | X, YMIN, YMAX, [WIDTH] |
ERRORBARH or RANGEX | XMIN, XMAX, Y, [HEIGHT] |
RIBBON | X, YMIN, YMAX |
ABLINE | YINTERCEPT, SLOPE |
HLINE | Y |
VLINE | X |
SEGMENT | X, XEND, Y, YEND |
ARROW | X, XEND, Y, YEND |
AREA | X, Y |
RECT | XMIN, XMAX, YMIN, YMAX |
HISTOGRAM or HIST | XMIN, XMAX, FREQ |
BAR | X, Y |
REGIONX | XMIN, XMAX, YMIN, YMAX |
REGIONY | XMIN, XMAX, YMIN, YMAX |
OHLC | TS, OPEN, HIGH, LOW, CLOSE, [WIDTH] |
CANDLESTICK | TS, OPEN, HIGH, LOW, CLOSE, [WIDTH] |
In the above table, keys enclosed in []
are optional, others are required.
The geom_params
are optional, and may only specify the values for a subset of the required keys for the geometry. When a required key needed by a geometry does not appear in the geom_params
, it is mapped to a column of the same name in the input.
The GEOM
specification itself is optional -- if omitted, it defauts to POINT
, with the keys X
and Y
mapped to input columns of the same name.
Cross-Plots◄
In the single-layer variant, the geom_param
keys can have a list of expressions instead of the single expression allowed above. This is considered a shortcut, and expands to a separate plot for each value in the list. If multiple geom_param
keys have a list of expressions, then the specification expands to a separate plot for each combination of the geom_param
values across these lists.
Apart from the geom_param
values, each of the plots so generated have the directives and properies as specified in the rest of the PLOT
specification.
Aesthetics, Scales, and Legends◄
The aesthetics directives change the default look (colors, shape, size, etc.) of the data points, and also assign legends. The syntax is as follows:
aes_key = aes_value [ SCALE = scale [ ( scale_params ) ] ] [ ON NULL null_aes_const ] [ LEGEND [ ( legend_params ) ] ]
The aes_key
is an attribute such as FILL
, STROKE
, etc. (see the table below), and aes_value
has the syntax:
aes_const | aes_expr | IDENTITY ( aes_identity_expr )
aes_const
is a constant of the appropriate type as specified in the table below. For example,FILL = "red"
.aes_expr
is a scalar expression of the appropriate type as specified in the table below. A scale maps distinct values of this expression to the appropriate value for theaes_key
. For instance, specifyingFILL = foo
maps distinct values in the columnfoo
to distinct colors from a default palette. The palette can be explicitly specied using ascale
(see below).aes_identity_expr
is an expression that evaluates to the literal value of an ordinal type. For instance, if a columnsfoo
contains valid color names such as "red", "blue", "black", etc., then we can specifyIDENTITY(foo)
as theaes_value
.
Each geometry has an associated list of aes_key
alternatives that it can interpret; aes_key
specifications that are not associated with the specified geometry are ignored.
The keys aes_key
, types of aes_value
and associated geometries are listed in the table below.
aes_key |
Type | Geometry | Description |
---|---|---|---|
SIZE |
Numeric | POINT, POINTRANGEY | Specifies the size of the point |
SHAPE |
Ordinal (valid shape values) | POINT, POINTRANGEY | Specifies the shape of the point |
FILL |
Ordinal (valid color values) | POINT, POINTRANGEY, AREA, BAR, RIBBON, OHLC | Color used to fill the shape |
ALPHA |
Numeric (between 0 and 1) | all geometries | Opacity of the rendering |
STROKE |
Ordinal (valid color values) | all geometries | Color used for the line |
STROKE-WIDTH |
Numeric | all geometries | Specifies the width of the line |
INTERPOLATE |
Ordinal (valid interpolate values) | LINE, AREA, RIBBON | Sets the path interpolation mode |
TENSION |
Numeric (between 0 and 1) | LINE, AREA, RIBBON | Sets the path tension |
Scale◄
Using a non-constant expression aes_expr
for the aes_value
makes the aesthetics a function of the data. The mapping from the distinct values of the expression (e.g. day of the week) to the aesthetics specification (e.g. colors "red", "blue", "green") is called a "scale".
The scale
is numeric if it outputs numeric values; examples of numeric scales are: LINEAR
, LOG
, and SQRT
. Further information on numeric scales, their defaults and capabilities appears in the D3 documentation.
The scale
is ordinal if it outputs ordinal values; examples of ordinal scales are ORDINAL
, COLOR
, and SYMBOL
.Further information on the ordinal scales, their defaults and capabilities appears in the D3 documentation.
To explicitly specify the scale, we specify the scale
(LINEAR
, LOG
, SQRT
, ORDINAL
, COLOR
or SYMBOL
), and optionally, the associated parameters scale_params
to override the defaults.
The following table lists the default scales for each aesthetics parameter:
aes_key |
Default Scale | Default Domain | Default Range |
---|---|---|---|
SIZE |
LINEAR | [0, 1] | [0, 1] |
SHAPE |
SYMBOL | Dynamic | Symbol types supported in D3 |
FILL |
COLOR | Dynamic | Palette "category20c" |
ALPHA |
LINEAR | [0, 1] | [0, 1] |
STROKE |
COLOR | Dynamic | Palette "category20c" |
STROKE-WIDTH |
LINEAR | [0, 1] | [0.5, 2.5] |
A "dynamic" domain means that the domain is updated on the fly and consists of all the unique value for the associated expression seen so far.
The following lists the syntax of the parameters scale_params
associated with the respective scale
s:
LINEAR ( DOMAIN = ( dom_min , dom_max ) , RANGE = ( range_min, range_max ) )
- Explicit specification of domain and range
( dom_min, dom_max )
and( range_min, range_max )
are numeric intervals.
LINEAR ( dom_min, dom_max )
- Explicit specification of only the domain, range at default [0, 1].
LOG ( DOMAIN = ( dom_min , dom_max ) , RANGE = ( range_min, range_max ) )
- Explicit specification of domain and range.
( dom_min, dom_max )
and( range_min, range_max )
are numeric intervals.
LOG ( dom_min, dom_max )
- Explicit specification of only the domain, range at default [0, 1].
SQRT ( DOMAIN = ( dom_min , dom_max ) , RANGE = ( range_min, range_max ) )
- Explicit specification of domain and range.
( dom_min, dom_max )
and( range_min, range_max )
are numeric intervals.
SQRT ( dom_min, dom_max )
- Explicit specification of only the domain, range at default [0, 1].
ORDINAL ( DOMAIN = ( dom_value [ , ... ] ) , RANGE = ( range_value [ , ... ] ) )
- Explicit specification of domain and range.
dom_value
andrange_value
are ordinal.
ORDINAL ( range_value [ , ... ] )
- Explicit specification of range.
- Domain is dynamic -- that is, updated on the fly and consists of all the unique value for the associated expression seen so far.
COLOR ( DOMAIN = ( dom_value [ , ... ] ) , RANGE = ( range_color [ , ... ] ) )
- Explicit specification of domain and range.
dom_value
is ordinal.range_color
is a valid color name or specification.
COLOR ( range_color_palette )
- Explicit specification of range as a palette.
range_color_palette
is color palette name supported by D3. It can be one of "category10", "category20", "category20b", or "category20c"- Domain is dynamic -- that is, updated on the fly and consists of all the unique value for the associated expression seen so far.
COLOR ( range_color , range_color [ , ... ] )
- Explicit specification of range as a list of colors. At least two colors must be present.
range_color
is a valid color name or specification.- Domain is dynamic -- that is, updated on the fly and consists of all the unique value for the associated expression seen so far.
-
COLOR ( dom_value AS range_color [ , ... ] )
- Explicit specification of domain and range as a mapping from values to colors.
dom_value
is ordinal.range_color
is a valid color name or specification.
-
SYMBOL ( DOMAIN = ( dom_value [ , ... ] ) , RANGE = ( range_symbol [ , ... ] ) )
- Explicit specification of domain and range.
dom_value
is ordinal.range_symbol
is a valid symbol name in D3.
SYMBOL ( range_symbol , [ , ... ] )
- Explicit specification of range as a list of symbols.
range_symbol
is a valid symbol name in D3.- Domain is dynamic -- that is, updated on the fly and consists of all the unique value for the associated expression seen so far.
SYMBOL ( dom_value AS range_symbol [ , ... ] )
- Explicit specification of domain and range as a mapping from values to symbols.
dom_value
is ordinal.range_symbol
is a valid symbol name in D3.
In the above, ORDINAL
scale is a generalization of COLOR
and SYMBOL
, where range_value
is constrained to colors and symbols respectively.
Handling NULLs◄
The scale maps the values to aesthetic specifications. By default, the NULLs are mapped to a default aesthetic specification, and are ignored in legends.
One way to map NULLs as aesthetic specifications is to replace them with an actial value using COALESCE(...)
in the ScleraSQL query. This is not recommended as this changes the data for the sake of visualization.
A better alternative is to use the ON NULL
modifier to specify the aesthetic specification (constant) to be used for NULLs (recall the syntax above). For instance, FILL = foo ON NULL "red"
will colour all datapoints where column foo
is NULL as red.
Legend◄
The mapping from the aesthetics specification (e.g. colors "red", "blue", "green") to distinct values of the expression (e.g. day of the week) -- that is, the inverse of the scale -- is documented as a legend.
To add a legend for an aes_key
, add the keyword LEGEND
after the specification (recall the syntax).
The LEGEND
modifier takes an optional set of parameters, labeled legend_params
in the syntax, and can include one or more of the following in a comma-separated list:
ORIENTATION = orientation_param
- The
orientation_param
can beVERTICAL
orHORIZONTAL
and defined the manner in which the legend will be rendered; the default isVERTICAL
. - Since the allocated space for legends is at the right of the chart,
VERTICAL
orientation is a better fit, and is recommended.
- The
TITLE = title_string
- Specifies the title of the legend. If omitted, the title is not rendered.
LABELS = ( label_string [ , ... ] )
- Use the specified labels instead of the values provided by the associated scale mapping.
LABELS = ( value AS label_string [ , ... ] )
- Specifies an explicit mapping of the labels to use instead of the values provided by the associated scale mapping.
Group◄
The optional GROUP
expression groups data points; each group is then rendered separately. The syntax is:
GROUP = scalar_expr
where scalar_expr
is a ScleraSQL scalar expression on the input.
Key◄
The optional KEY
expression identifies a rendered data point (a point in a scatter-plot, a bar in a bar-chart, etc.). All input rows with the same value of the KEY
expression map to the same rendered data point; the first such row creates the data point, and the subsequent rows modify the same.
The syntax is:
KEY = scalar_expr
where scalar_expr
is a ScleraSQL scalar expression on the input.
If KEY
is not specified, a new datapoint is created for each input row.
Position◄
The optional POSITION
directive specifies how to reposition "colliding" data points -- that is, data points that are rendered one over the other as they have the same value of the relevant geom_param
parameters (e.g. x
expression, y
expression, or both). The syntax is:
POSITION = { DODGE | STACK | JITTER } [ ( pos_params ) ]
POSITION = DODGE
repositions the colliding data points one after the other, around the common x
value; applied to a bar chart, this is a "grouped" bar chart.
POSITION = STACK
repositions the colliding data points one over the other, at the common x
value; applied to a bar chart, this is a "stacked" bar chart.
POSITION = JITTER
repositions the colliding data points randomly around the common x
and y
value, spreading a set of colliding scatterplot points, for instance, randomly into a circular region.
The POSITION
specification can also include pos_params
that are passed as arguments to the renderer. The arguments are dependent on the associated geometry of the underlying data points, and have the form pos_key = pos_value
as listed in the table below:
POSITION | GEOM | Parameter | Value |
---|---|---|---|
DODGE | BAR | PADDING | Padding between bars |
DODGE | BAR | OUTERPADDING | Padding at the start and end |
STACK | BAR | PADDING | Paddding between bars |
JITTER | POINT | x |
Maximum jitter along 'X' axis |
JITTER | POINT | y |
Maximum jitter along 'Y' axis |
Stat◄
The STAT
directive specifies a statistical operation over the input data points. The syntax is:
STAT = { LOESS | BIN | BIN2D } [ ( stat_params [ , aes_params ] ) ]
Applies on a scatter plot, i.e. GEOM = POINT(x = ..., y = ...)
, STAT = LOESS
computes local regression smoothing, STAT = BIN
computes density histograms over the x
expression values, and STAT = BIN2D
computes two-dimensional histograms (heat-maps) over the x
and y
expression values.
The result is a rendering in another layer. STAT = LOESS
renders a line, STAT = BIN
renders a histogram, and STAT = BIN2D
renders a two-dimensional tiling representing the bins with an opacity proportional to the number of points covered under the bin.
The STAT
specification also includes stat_params
that are passed as arguments to the operation, and aes_params
that override the default aesthetics parameters for the layer containing the rendered result.
The stat_params
for each STAT
type is listed in the table below:
STAT | Parameter | Required / Optional | Description |
---|---|---|---|
LOESS | BANDWIDTH | Optional | When computing the loess fit at a particular point, this fraction of source points closest to the current point is taken into account for computing a least-squares regression. A sensible value is usually 0.25 to 0.5, the default value is 0.3. See Apache Commons Math Documentation |
LOESS | ITERS | Optional | This many robustness iterations are done. A sensible value is usually 0 (just the initial fit without any robustness iterations) to 4, the default value is 2. See Apache Commons Math Documentation |
LOESS | ACCURACY | Optional | If the median residual at a certain robustness iteration is less than this amount, no more iterations are done. If the median residual at a certain robustness iteration is less than this amount, no more iterations are done. See Apache Commons Math Documentation |
LOESS | WEIGHT | Optional | Expression giving the coefficients by which the robustness weight of a point is multiplied. Default: 1. See Apache Commons Math Documentation |
BIN | BINWIDTH | Required | Width of a histogram bin. |
BIN | MIN | Optional | Minimum X-value. If omitted, is determined by pre-scanning the data |
BIN2D | XBINWIDTH | Either this or BINWIDTH required | Width of a histogram bin |
BIN2D | YBINWIDTH | Either this or BINWIDTH required | Height of a histogram bin |
BIN2D | BINWIDTH | Required when either XBINWIDTH or YBINWIDTH omitted | Width / height of a histogram bin, used when XBINWIDTH and YBINWIDTH are equal |
BIN2D | XMIN | Optional | Minimum X-value. If omitted, is determined by pre-scanning the data |
BIN2D | YMIN | Optional | Minimum Y-value. If omitted, is determined by pre-scanning the data |
Mark◄
The MARK
directive marks out regions where a specified predicate is true. The syntax is:
MARK { XAXIS | YAXIS } ( mark_predicate [ , aes_params [ , ... ] ] )
XAXIS
refers to the marking along the X
axis, and leads to vertical marked regions. Similarly, YAXIS
refers to marking along the Y
axis, and leads to horizontal marked regions.
The mark_predicate
refers to the predicate (that is, boolean-values expression) on the input; the marked region includes exactly the points where this predicate evaluates to true.
The MARK
specification also aes_params
that override the default aesthetics parameters for the layer containing the marked regions.
Tooltip◄
We can attach a tooltip to each data point rendered, so that when a viewer hovers a mouse cursor over the data point, this tooltip is displayed.
The syntax is, simply:
TOOLTIP = tooltip_expr
Where tooltip
is a string-valued expression. The tooltip message is generated by evaluating this expression over the row associated with the data point being hovered over.
Hiding the Layer◄
Stat and mark directives discussed earlier generate derivative layers by doing computation over the rows associated with data points in a given layer. After these derivative layers are generated, the default is to retain the original (base) layer.
This is not always desirable, especially if the derivative layer and the derived layer have different ranges of values. An example is a scatter-plot and its density (computed using STAT=BIN(...)
-- both share the 'X-axis', but the Y-axis
for the scatter-plot could be very different from the Y-axis
for the histogram which displays the relevant counts of the points. Displaying both these together in the same plot is not feasible.
ScleraViz therefore enables hiding of the original (base) layer, by including the directive HIDDEN
in the layer.
In the scatter-plot example, specifying HIDDEN
in the scatterplot layer alongside STAT=BIN(...)
hides the scatter-plot, and displays only the histogram.
Layer Display Order◄
Layers of a plot are overlayed one over the other. This makes it important to be able to explicitly specify the order in which the layers are rendered; this is done using the syntax:
DISPLAYORDER = display_order
where display_order
is an arbitrary numeric value. This can be specified for the base layers as well as for the derived layers (those added as results of STAT
and MARK
operations).
At runtime, the renderer sorts the available layers on increasing display_order
, and rennders the layers in the resulting order.
Facet◄
Faceting a plot on an expression involves displaying separate plots for datapoints partitioned by distinct values of the expression. ScleraViz allows faceting on a single expression, or a pair of expression; the resulting plots can be layed out in a row or columns or a row/column grid.
The syntax has the following alternatives (scalar_expr
is a ScleraSQL scalar expression on the input):
FACET ( ROWS = scalar_expr )
- Generates a plot for each distinct value of the given expression. The resulting plots are layed out in a row, their widths adjusted to fit the available total width.
FACET ( COLUMNS = scalar_expr )
- Generates a plot for each distinct value of the given expression. The resulting plots are layed out in a column, their heights adjusted to fit the available total height.
FACET ( ROWS = scalar_expr, COLUMNS = scalar_expr )
- Generates a plot for each distinct pair of values of the respective expressions. The resulting plots are layed out in a grid, in the row and column order according as specified, their widths and heights adjusted to fit the available total width and total height.
Axis◄
SclerViz identifies axes with expressions. Axes associated with the same expression across plots are actually renderings of the same internal representation; this keeps the axes renderings in sync as new data is received and the plots are updated.
The axis specification changes the default behaviour and appearance of an axis. The syntax is:
AXIS axis_expr ( axis_param [, ...] )
where axis_expr
is the expression mapped by the axis in the plots, and axis_param
can be:
LABEL ( label_string )
- Assigns the label
label_string
to all renderings of the axis. If the axis withaxis_expr
appears in two different plots, the label appears in both the axis renderings.
- Assigns the label
SCALE = { TIME | LINEAR | LOG | SQRT } [ ( min_value, max_value ) ]
- Assigns a continuous scale to the axis. The scale maps the values of the
axis_expr
to coordinates along the axis. - The optional interval
(min_value, max_value)
explicitly specifies the min and max values for the axis. If not specified, the interval is computed dynamically, on the fly as data comes in. Either ofmin_value
andmax_value
can beNULL
, in which case it is considered unspecified, and is computed dynamically.
- Assigns a continuous scale to the axis. The scale maps the values of the
SCALE = ORDINAL
- Assigns a scale to the axis. The scale maps the values of the
axis_expr
, assumed discrete, to evenly placed coordinates along the axis.
- Assigns a scale to the axis. The scale maps the values of the
INCREASING = { TRUE | FALSE }
- Declares whether the data is increasing along the axis. This extra knowledge of the input order helps speed up the computations while adjusting the scale when new data arrives.
WINDOW = window_size
- The data points are displayed in a sliding window of size
window_size
. The points are accumulated tillwindow_size
points are displayed, and then the plot slides as additional data arrives to display only the latestwindow_size
points.
- The data points are displayed in a sliding window of size
TICKFORMAT = format
- Formats the tick labels. For
LINEAR
,LOG
andSQRT
axis types, theformat
is specified in D3's format specification language. ForTIME
axis, the alternative time format specification language is used instead.
- Formats the tick labels. For
TICKS = num_ticks
- Sets a limit on the number of ticks displayed on this axis.
WEIGHT = axis_weight
- Sets the size of this axis relative to other axes.
- By default, all axes are assigned a weight of 1. Assigning a weight of, say 0.5, makes the size of an axis half of an axis with weight 1. Similarly, assigning a weight of 2.0 makes the size of an axis twice that of an axis with weight 1.
- The actual axis lengths are assigned respecting these relative size constrainst and the constraint that all the plots need to fit in the given width and height.
Layout◄
When multiple plots are specified, the default layout is to display them in a column, with the heights adjusted to fit the total height.
The optional LAYOUT ALIGNED
places the multiple plots so that the X-axes on the same expression and the Y-axes on the same expression are aligned in a column or row, respectively.
Grid Specification◄
Sets the aesthetics of the displayed grid and the axes. The syntax is:
GRID ( grid_param [ , ... ] )
where grid_param
can be:
COLOR = color
- Sets the background color.
color
is a valid color name or specification.- Default is "#F5F5F5"
axis_id( axis_aes [ , ... ] )
axis_id
isXAXIS
orYAXIS
XAXIS
sets the aesthetics for all theX
axes across all the plots.YAXIS
sets the aesthetics for all theY
axes across all the plots.
axis_aes
can be:COLOR = color
- Sets the color of the axis.
color
is a valid color name or specification.- Default is "#000"
TICKS = color
- Sets the color of the ticks.
- Ticks are the vertical grid lines for
XAXIS
, and the horizontal grid lines forYAXIS
. color
is a valid color name or specification.- Default is "#FFF".
TICKS = none
removes the ticks.
Display Specification◄
The display specification sets the dimensions of the plot and legend display areas. The syntax is:
DISPLAY ( display_param [ , ... ] )
where display_param
can be:
WIDTH = width
- Sets the width of the plot diplay area, in pixels.
- Default is 800 pixels.
HEIGHT = height
- Sets the height of the plot diplay area, in pixels.
- Default is 450 pixels.
MARGIN ( margin_spec )
- Sets the size of the four margins.
margin_spec
can be:TOP = top_margin
- Sets the top margin, in pixels.
- Default is 20 pixels.
RIGHT = right_margin
- Sets the right margin, in pixels.
- Default is 20 pixels.
BOTTOM = bottom_margin
- Sets the bottom margin, in pixels.
- Default is 30 pixels.
LEFT = left_margin
- Sets the bottom margin, in pixels.
- Default is 40 pixels.
LEGEND ( legend_spec )
- Sets the dimensions of the legend display area.
legend_spec
can be:PADDING = legend_padding
- Sets the padding between the plot and the legend, in pixels.
- Default is 20 pixels.
WIDTH = legend_width
- Sets the width of the legend display area, in pixels.
- Default is 100 pixels.
Acknowledgements◄
ScleraViz is inspired by Grammar of Graphics, specifically R's ggplot2. The plots are rendered using D3, and the legends are rendered using D3-Legend.