The last few months, I've been collaborating on a project with Chris Wensel, the author of Cascading. Last week we announced Lingual, an open source project that puts a SQL interface on top of Cascading.

Architecturally, Lingual combines the Cascading engine with my own Optiq framework. Optiq provides the SQL interface (including JDBC), reads table and column definitions from Cascading's metadata store, and few custom Optiq rules target relational operations (project, filter, join and so forth) onto a Cascading operator graph. The queries are executed, on top of Hadoop, using Cascading's existing runtime engine.

Not everyone has heard of Cascading, so let me explain what it is, and why I think it fits well with Optiq. Cascading is a Java API for defining data flows. You write a Java program to build data flows using constructs such as pipes, filters, and grouping operators, Cascading converts that data flow to a MapReduce job, and runs it on Hadoop. Cascading was established early, picked the right level of abstraction to be simple and useful, and has grown to industry strength as it matured.

As a result, companies who are doing really serious things with Hadoop often use Cascading. Some of the very smartest Hadoop users are easily smart enough to have built their own Hadoop query language, but they did something even smarter — they layered DSLs such as Scalding and Cascalog on top of Cascading. In a sense, Optiq-powered SQL is just another DSL for Cascading. I'm proud to be in such illustrious company.

Newbies always ask, "What is Hadoop?" and then a few moments later, "Is Hadoop a database?". (The answer to the second question is easy. Many people would love Hadoop to be an "open source Teradata", but wanting it doesn't make it so. No Virginia, Hadoop is not a database.)

A cottage industry has sprung up of bad analogies for Hadoop, so forgive me if I make another one: Hadoop is, in some sense, an operating system for the compute cluster. After mainframes, minicomputers, and PCs, the next generation of hardware is the compute cluster. Hadoop is the OS, and MapReduce is the assembly language for that hardware — all powerful, but difficult to write and debug. UNIX came about to serve the then-new minicomputers, and crucial to its success was the C programming language. C allowed developers to be productive while writing code almost as efficient as assembler, and it allowed UNIX to move beyond its original PDP-7 hardware.

Cascading is the C of the Hadoop ecosystem. Sparse, elegantly composable, low-level enough to get the job done, but it abstracts away the nasty stuff unless you really want to roll up your sleeves.

It makes a lot of sense to put SQL on top of Cascading. There has been a lot of buzz recently about SQL on Hadoop, but we're not getting caught up in the hype. We are not claiming that Lingual will give speed-of-thought response times (Hadoop isn't a database, remember?), nor will it make advanced predictive analytics will be easy to write (Lingual is not magic). But Hadoop is really good at storing, processing, cleaning and exporting data at immense scale. Lingual brings that good stuff to a new audience.

A large part of that SQL-speaking audience is machines. I'd guess that 80% of the world's SQL statements are generated by tools. Machine-generated SQL is pretty dumb, so it essential that you have an optimizer. (As author of a tool that speaks SQL — Mondrian — and several SQL engines — Broadbase, LucidDB, SQLstream — I have been on both sides of this problem.) Once you have an optimizer, you can start doing clever stuff like re-organizing your data to make the queries run faster. Maybe the optimizer will even help.

Lingual is not a "SQL-like language". Because it is based on Optiq, Lingual is a mature implementation of ANSI/ISO standard SQL. This is especially important for those SQL-generating tools, which cannot rephrase a query to work around a bug or missing feature. As part of our test suite, we ran Mondrian on PostgreSQL, and captured the SQL queries it issued and the results the database gave. Then we replayed those queries — over 6,200 of them — to Lingual and checked that Lingual gave the same results. (By the way, putting Optiq and Cascading together was surprisingly easy. The biggest challenge we had was removing the Postgres-isms from thousands of generated queries.)

Lingual is not the only thing I've been working on. (You can tell when I'm busy by the deafening silence on this blog.) I've also been working on Apache Drill, using Optiq to extend SQL for JSON-shaped data, and I'll blog about this shortly. Also, as Optiq is integrated with more data formats and engines, the number of possibilities increases. If you happen to be at Strata conference tomorrow (Wednesday), drop me a line on twitter and we can meet up and discuss. Probably in the bar.

More...