Sclera - Extensible Data Processing
Streaming data processor with a simple plugin framework and a modernized SQL interface
Use SQL to build end-to-end workflows with multi-source data integration, streaming analytics, visualization, and more
Frequently Asked Questions
Why do I need Sclera?
Sclera helps you exploit the power of sophisticated machine learning and text processing libraries, incorporate external data from web-services, perform complex event processing and stream analytics, and more — all using familiar SQL. Moreover, it works on your existing database systems, with no need to move your data.
Sclera provides a highly customizable end-to-end stack for analytics, enabling quick experimentation and iteration. Sclera's modular architecture significantly reduces the maintenance complexity and simplifies upgrades — new technologies (database systems, analytics libraries) can be incorporated by simply adding appropriate plugins, with minimal change to the application.
Sclera gives you a standardized interface to run your algorithms within extended SQL, and a clean API (in Sclera's Extensions SDK) that you can use to integrate your algorithms in just a few lines of code. You can experiment and iterate quickly and focus on your analytics tasks, while Sclera takes care of the legwork.
What is the performance and productivity impact?
Sclera saves hundreds of lines of code in building analytics applications by abstracting away the complex implementation details. This helps you focus on your analytics tasks, and quickly experiment and iterate over alternatives.
Moreover, you can start with your existing infrastructure, identify the bottlenecks, and add more resources intelligently as and when needed — all without modifying your applications.
Sclera's SQL engine has three components: the query compiler, the embedded streaming SQL processor, and the embedded analytics evaluator.
The query compiler compiles the input query into a plan. This happens once per query, before evaluation, and the compilation time is negligible compared to the evaluation time. If the entire query is pushed to an underlying database system, the overhead added by Sclera is thus effectively zero.
The embedded streaming SQL processor is used to evaluate SQL (relational) operators on streaming data. This evaluation proceeds in a pipeline, in a single pass, with minimal memory overheads, and in the same JVM as your application.
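To make "in a pipeline, in a single pass, with minimal memory overheads" concrete, here is a minimal sketch using Python generators. It is not Sclera's implementation or API; the operator names (`scan`, `select`, `project`, `running_sum`) and the sample rows are hypothetical and chosen only for illustration. Each operator pulls one row at a time from the operator before it, so no intermediate result is materialized.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source operator: hands rows downstream one at a time."""
    for row in rows:
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Relational selection: keeps only rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Relational projection: keeps only the named columns."""
    for row in rows:
        yield {c: row[c] for c in columns}


def running_sum(rows: Iterable[Row], column: str) -> Iterator[Row]:
    """Streaming aggregate: the only state held is one running total."""
    total = 0
    for row in rows:
        total += row[column]
        yield {**row, "running_total": total}


# Hypothetical input stream; in practice this could be an unbounded source.
readings = [
    {"sensor": "a", "temp": 21, "ok": True},
    {"sensor": "b", "temp": 35, "ok": False},
    {"sensor": "a", "temp": 23, "ok": True},
]

# Operators are chained lazily; nothing runs until the result is consumed.
pipeline = running_sum(
    project(select(scan(readings), lambda r: r["ok"]), ["sensor", "temp"]),
    "temp",
)
result = list(pipeline)
```

Because every stage is a generator, memory use depends on the per-operator state (here, a single number) and not on the length of the stream.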
The embedded analytics evaluator, likewise, proceeds in a pipeline, in a single pass, and in the same JVM as your application. A handcoded program would have identical overheads when it uses the same library.
Sclera also includes a query optimizer that optimizes the workflow before each run. These optimizations can potentially speed up the evaluation in ways that a handcoded application cannot.
Sclera aggressively pushes down query computations to the underlying database systems, and uses external analytics libraries for analytics evaluation where needed. Sclera's performance is thus largely determined by the performance of the underlying database systems and analytics libraries.
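The effect of pushing computation down can be sketched with Python's built-in `sqlite3` module standing in for an underlying database system. This is an illustration of the general idea only, not of how Sclera plans queries; the `orders` table and both helper functions are hypothetical. The first function fetches every row and computes in the application, while the second sends the filter and the aggregate to the database and receives a single value.

```python
import sqlite3

# An in-memory SQLite database plays the role of the underlying system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10), ("west", 25), ("east", 30), ("west", 5)],
)


def total_without_pushdown(conn: sqlite3.Connection, region: str) -> int:
    """Fetch all rows, then filter and aggregate in the application."""
    rows = conn.execute("SELECT region, amount FROM orders").fetchall()
    return sum(amount for r, amount in rows if r == region)


def total_with_pushdown(conn: sqlite3.Connection, region: str) -> int:
    """Let the database filter and aggregate; only one row comes back."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
        (region,),
    ).fetchone()
    return total


east_local = total_without_pushdown(conn, "east")
east_pushed = total_with_pushdown(conn, "east")
```

Both functions return the same answer, but the pushed-down version moves far less data and leaves the work to the database, which is why overall performance then tracks the performance of that database.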
How is Sclera different from other platforms?
To the user, Sclera is just like a relational database system — with SQL as the interface language, and JDBC as the access mechanism from the application programs. However:
Sclera does not store data. It works on data from the connected database systems and/or external data sources (on-disk file, web-service, etc.) specified in your query. Sclera queries can work on data across multiple database systems and external data sources.
Sclera natively supports analytics. Analytics operations (such as classification) are provided as SQL language extensions, and analytics objects (such as classifiers) are first-class objects, on par with SQL tables.
Sclera includes an embedded SQL processor, but also pushes SQL computation to the underlying database systems wherever possible. Sclera's optimizer understands the capabilities of the underlying database systems and the data stored therein, and intelligently decides where to locate the computations.
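As a rough picture of a query that spans a database system and an external data source, the sketch below joins a table held in SQLite with records read from CSV text. In Sclera this would be expressed in SQL; the Python here only illustrates the idea of combining sources in place, and the table, the CSV contents, and the function `join_csv_with_db` are all hypothetical.

```python
import csv
import io
import sqlite3
from typing import Dict

# Source 1: a table that lives in a database system (SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Source 2: an external file, represented here as CSV text.
CSV_TEXT = "customer_id,amount\n1,10\n2,25\n1,30\n"


def join_csv_with_db(db: sqlite3.Connection, csv_text: str) -> Dict[str, int]:
    """Join CSV records with the customers table and total amounts by name."""
    # Read the (small) lookup side from the database.
    names = dict(db.execute("SELECT id, name FROM customers"))
    totals: Dict[str, int] = {}
    # Stream the CSV side one record at a time.
    for record in csv.DictReader(io.StringIO(csv_text)):
        name = names.get(int(record["customer_id"]))
        if name is not None:  # inner join: skip records with no matching customer
            totals[name] = totals.get(name, 0) + int(record["amount"])
    return totals


totals = join_csv_with_db(db, CSV_TEXT)
```

Neither source is copied into the other: the customer names stay in the database, the amounts stay in the file, and only the join result is assembled in the application.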
Sclera does not replace your database systems; it complements them. Sclera works with your database systems, and extends their capability to perform advanced analytics.
Apache Drill is a data virtualization solution, enabling standard SQL on Hadoop distributions, NoSQL datastores, cloud storage and local files with a variety of data formats. For data virtualization, it is thus more versatile than Sclera.
Unlike Sclera, however, Apache Drill does not provide the ability to plug in your own data processing extensions. Because it supports standard SQL only, it also does not provide stream pattern matching, machine learning, text analytics, data cleaning, visualization, and other capabilities that are built into Sclera.
In the near term, we plan to provide a plugin based on Apache Drill that brings its extensive virtualization capabilities to Sclera.
Presto is a parallel query processing engine that runs on a cluster of machines. Like Sclera and Apache Drill mentioned above, Presto can ingest data from a variety of external sources.
Unlike Sclera, however, Presto does not provide the ability to plug in your own data processing extensions. Because it supports standard SQL only, it also does not provide stream pattern matching, machine learning, text analytics, data cleaning, visualization, and other capabilities that are built into Sclera.