Data Access Plugins

In this document, we show how to build custom connectors to any data source. These connectors enable ingestion of data from arbitrary sources in a ScleraSQL query. You only need to format the data as rows of a table, and Sclera will take care of evaluating your SQL queries on the same -- these queries can include transforming, filtering and aggregating this data, as well as joining this data with data ingested from other connectors, or with data in tables stored in other data stores.

Sclera - CSV Connector and Sclera - Text Files Connector are built using this SDK. For examples of how these connectors are used in Sclera, please refer to the SQL documentation for external data access.

Building Data Access Connectors◄

To build a custom data access connector, you need to provide implementations of the following abstract classes in the SDK:

ExternalSourceService (API Link)
- Provides the external data source as a service to Sclera.
- Contains an id that identifies this service.
- Contains a method createSource that is used to create a new ExternalSource instance for this service.
ExternalSource (API Link)
- Represents the external data source.
- Provides the schema and other metadata about the sourced data to Sclera at compile-time.
- Provides the sourced data to Sclera at runtime.
TableResult (API Link)
- Represents the data sourced from the external data source.
- Provides an iterator over the rows containing the sourced data.

Example: Building a CSV File Connector◄

This section shows how to implement a lightweight clone of the CSV file connector using the Sclera Extensions SDK. The referenced code is available on GitHub.

This connector enables using data from a CSV file within a SQL query. The CSV file can be local or remote as long as it is accessible through an URL.

A simple SQL query that uses the connector is as follows:

SELECT * FROM EXTERNAL CSVLITE("http://scleraviz.herokuapp.com/assets/data/tips.csv")

This query retrieves the CSV data in the file at the specified URL and incorporates it as a virtual table for further processing. A more complicated query could have filters, joins and aggregates on this virtual table.

While processing this query, Sclera comes across the EXTERNAL keyword, and accordingly identifies "CSVLITE" as an external datasource service. Consulting the service provider specification file, it finds the service provider as an object of the class CSVSourceService (source), which implements the abstract class ExternalSourceService. (For details, see the Java Service Provider Interface documentation.)

The class CSVSourceService implements the method createSource, as required by the abstract class ExternalSourceService. This method takes a list of datasource parameters specified in the SQL query -- in the query above, the list consists of only one parameter, the URL, of type CharConst. The implementation of the method parses and validates the parameter, and uses them to create an object of the class TickerSource, which implements the abstract class ExternalSource mentioned above.

The service identifier is provided by the id attribute. For CSVSourceService, the identifier is "CSVLITE".

For the query above, Sclera calls the method createSource of the CSVSourceService object, with the URL as the parameter, and gets back a CSVSource object.

The class CSVSource implements the following methods and attributes, as required by the abstract class ExternalSource:

The attribute name provides a user-interpretable name to this data source. In our implementation, this is taken to be the same as the service identifier.
The attribute columns provides Sclera with the schema of the output that this data source will emit at runtime. The schema contains the name of each column, and its type. This must be consistent with that provided by CSVResult (see below). In our implementation, the CSVResult object is used to provide the columns in CSVSource -- but there may be cases where this coupling might not be possible (for instance, the CSVResult object might get computed later at runtime), hence the redundancy.
The method result is called by Sclera at runtime, and returns an object of the class CSVResult, which implements the abstract class TableResult mentioned above.
The method toString provides a printable string, which will be used for this datasource when EXPLAIN is run on a query using this data source.

Sclera plans the query mentioned above taking into account the schema and result sort order provided by the CSVSource object. When the plan is evaluated, the object's result method returns the object CSVResult, which retrieves the data from the URL, as discussed in a moment.

The class CSVResult implements the following methods, as required by the abstract class TableResult:

The attribute columns provides Sclera with the schema of the data that this data source will emit at runtime. The schema contains the name of each column, and its type.
The method rows returns an iterator over TableRow objects containing the data. A TableRow object can be constructed from a mapping of column names to column values. The column names and data types must be consistent with that provided the attribute columns above.
The attribute resultOrder tells Sclera how the emitted data is sorted. This information is optional, but helps in eliminating redundant sorts on the emitted result. In our implementation, the order of rows in the source CSV data is not known, hence this attribute is an empty list.

For the query above, Apache Commons CSV Library is used to retrieve the CSV data from the specified URL. This data is converted into an iterator of TableRow instances which is returned as a part of CSVResult, as described above.

Packaging and Deploying the Connector◄

The implementation uses sbt for building the connector (installation details). This is not a requirement -- any other build tool can be used instead.

Dependencies◄

The implementation has a dependency on:

the Apache commons-csv library.
the "sclera-core" and "sclera-config" core components. Note that these dependencies is annotated "provided" since these libraries will already be available in the CLASSPATH when this connector is run with Sclera.
(optional) the test framework scalatest for running the tests.

see the included sbt build file for details.

Deployment Steps◄

The connector can be deployed simply by having its jar and all its dependencies in the CLASSPATH.

Alternatively, for a managed deployment, follow the following steps:

First, publish the implementation as a package, locally or in a public artifact repository. In sbt, you can publish locally by running sbt publishLocal.
Use scleradmin to install the component and its dependencies:
```
> scleradmin --add <plugin> --root <sclera-root>
```

where <plugin> is the artifact identifier for your published package, and <sclera-root> is the directory where Sclera is installed.

For example, having published the "CSVLITE" plugin described above as com.example:sclera-plugin-csv-lite:1.0-SNAPSHOT, we can deploy it as:

> scleradmin --add "com.example:sclera-plugin-csv-lite:1.0-SNAPSHOT" --root /path/to/sclera

The connector should now be visible to Sclera, and can be used in the queries.

Note: Please ensure that the identifier you assign to your connector is unique -- that is, it does not conflict with the identifier of any other available ExternalSourceService instance.