External Data Access
In this section, we describe the extensions that enable access to data in CSV (Comma-Separated Value) files, free-form text files and web services within a SQL query. This data can be used just like a relational base table within the SQL query, and can be joined with data from relational/non-relational data stores (see the Data Platform Connection Reference document for details).
These extensions are developed using the Sclera Connector Development SDK, the sources for each of these extensions is available on GitHub for reference.
Accessing CSV Data◄
Sclera can dynamically load CSV files and present the data as a table to a SQL query. Recall that a CSV file is a text file, structured into lines such that the first line contains the column names separated by commas, and the remaining lines contain the corresponding values separated by commas.
This feature introduces a new table expression with the following syntax:
EXTERNAL CSV( path [ , format [ , header_flag [ , path_col ] ] ] )
[ ORDERED BY ( expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ) ]
path
is one of the following:- a URL that responds with CSV data (e.g.
http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=sl1d1t1c1ohgv&e=.csv
) - the path to a CSV file
- the path to a directory containing multiple CSV files; each of these files is expected to have the same format and identical headers (if present).
- a URL that responds with CSV data (e.g.
-
format
is the CSV format, as specified by the Apache Commons CSV library. It can be one of (see the Apache Commons CSV documentation for details):- DEFAULT, which specifies
','
as the delimiter,'"'
as the quote,'\r\n'
as the record separator, and ignores empty lines in the input. - RFC4180, which specifies
','
as the delimiter,'"'
as the quote,'\r\n'
as the record separator, and does not ignore empty lines in the input. - EXCEL, which is the same as
RFC4180
except that it allows missing column names. - TDF, which specifies
'\t'
(tab) as the delimiter,'"'
as the quote,'\r\n'
as the record separator, and ignores spaces surrounding the values. - MYSQL, which specifies
'\t'
(tab) as the delimiter, does not quote the values (but escapes special characters with'\'
), specifies'\n'
as the record separator, and does not ignore empty lines in the input.
- DEFAULT, which specifies
header_flag
can beHEADER
orNOHEADER
indicating whether or not a header is present in the input CSV. If not specified, it defaults toHEADER
path_col
the name of an optional output column. If specified, a new column of the specified name is added to each row, and populated with the URL or file name from which that row has been read. This is useful when thepath
is a directory -- in this case, the specified column will contain the path to the relevant file under that directory.- The
ORDERED BY
clause declares the sort order of the rows in the CSV file; note that this is an unverified declaration, and Sclera blindly relies on the same while planning further evaluation on the rows read.
As an example, consider a CSV file "/path/to/custinfo.csv"
containing columns email
, location
and age
for each customer.
You can view the data as a table in Sclera by saying:
> EXTERNAL CSV("/path/to/custinfo.csv");
This can be used in SQL queries just like a regular base table or view. The following query lists, for each distinct location, the number customers in that location.
> SELECT location, COUNT(*)
FROM EXTERNAL CSV("/path/to/custinfo.csv")
GROUP BY location;
The following query joins this table with a MySQL (connected as location myloc
) table defaulters
, and computes the number of defaulters in that region:
> SELECT location, COUNT(*)
FROM EXTERNAL CSV("/path/to/custinfo.csv") JOIN myloc.defaulters USING (email)
GROUP BY location;
Note that since the CSV format does not include the type of the columns, each column in the table returned by CSV(_)
has type VARCHAR
. To enable automatic type inferencing, you can use the TYPEINFER operator.
Exporting Query Results as CSV Files◄
The EXTERNAL CSV
can also be used to export the result of a query into a CSV file, using the following syntax:
CREATE EXTERNAL CSV( file_path [ , format ] ) AS table_expression
file_path
is the path to the file to be created. If the file exists, it will be overwritten.format
is as described earlier.table_expression
is a table expression.
The syntax is similar to the standard CREATE TABLE ... AS
statement, except the TABLE
is replaced by the EXTERNAL CSV(...)
.
For example, the following statement creates a CSV file custdefault.csv
containing the JOIN
of the data in custinfo.csv
and the table defaulters
in the location myloc
:
> CREATE EXTERNAL CSV("/path/to/custdefault.csv") AS
EXTERNAL CSV("/path/to/custinfo.csv") JOIN myloc.defaulters USING (email);
Similarly, the following statement creates a CSV file locdefault.csv
containing the number of defaulters by location:
> CREATE EXTERNAL CSV("/path/to/locdefault.csv") AS
SELECT location, COUNT(*) as defaulter_count
FROM EXTERNAL CSV("/path/to/custinfo.csv") JOIN myloc.defaulters USING (email)
GROUP BY location;
Accessing Text Files◄
Sclera can dynamically load raw text data from text files and present the same as a table to a SQL query.
This feature introduces a new table expression with the following syntax:
EXTERNAL TEXTFILES( filedir_path )
where filedir_path
is the path to file to be loaded, or to a directory containing the files to be loaded.
The resulting virtual table contains a row for each file, with two columns of type VARCHAR
. The first column, called file
contains the canonical path of the file, and the second column contents
contains the textual contents of the file.
For instance, the following query returns the path and contents of all files under the directory "/tmp/mydir"
:
> SELECT file, contents
FROM TEXTFILES("/tmp/mydir");
The resulting table can also be aggregated over, joined with other base or virtual tables, and so on, just like a base table or a view.