Elasticsearch, Kettle and the CTools
I'm not much into the SQL vs. NoSQL discussion. I have enough years in BI to know that the important thing is to choose the right tool for the job. And that requires a lot of tools!
Here's one more for our set: ElasticSearch. ElasticSearch is an open source (Apache 2), distributed, RESTful search engine built on top of Lucene.
It may not be obvious, but there are tons of reasons why a search engine is a great choice as a BI data source, and they go far beyond simple free-form text search.
Due to the intrinsic nature of NoSQL and its schema-less approach, we can store virtually anything. Thanks to the clustering abilities of ElasticSearch, scalability is not even an issue. The query syntax gives us a powerful way to get the data out. And it's blazing fast!
We initially used ElasticSearch for the Twitter dashboard at Mozilla, described in a previous blog post. Everyone was very happy with the results, and we're betting quite a lot on ElasticSearch at Mountain View.
So we made an effort to bring ElasticSearch closer to the CTools and Pentaho. The first thing we did was to add an ElasticSearch Bulk Loader to Kettle.
In Kettle 4.2 you'll find this new step. Here's a sample transformation using it. As simple as it gets:
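For readers who want a feel for what a bulk load looks like on the wire, here is a minimal Python sketch of the newline-delimited body that the ElasticSearch `_bulk` REST endpoint accepts. It is not how the Kettle step is implemented; the index name `tweets`, the sample rows and the helper `build_bulk_body` are all made up for illustration.

```python
import json


def build_bulk_body(index, docs, id_field=None):
    """Serialize docs into the newline-delimited body of a _bulk request.

    Hypothetical helper: each document becomes two lines, an action line
    followed by the document source.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        if id_field is not None and id_field in doc:
            # Reuse a field of the row as the document id (an assumption).
            action["index"]["_id"] = str(doc[id_field])
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk format requires a newline after the last line too.
    return "\n".join(lines) + "\n"


# Invented sample rows, standing in for the rows flowing through a transformation.
rows = [
    {"id": 1, "user": "alice", "text": "hello firefox"},
    {"id": 2, "user": "bob", "text": "elasticsearch is fast"},
]
bulk_body = build_bulk_body("tweets", rows, id_field="id")
print(bulk_body)
```

The resulting string would be POSTed to the `_bulk` endpoint of a node. Older ElasticSearch releases also expected a `_type` entry in each action line, so adjust the action dictionary to match the version you run.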
There are a few things I'd like to highlight:
- It's simple - in 5 minutes you can have an ElasticSearch engine with data in it
- It's fast - around 20k rows per second on these sample docs (200k docs indexed in 9 seconds, using 60 MB of storage)
- It's versatile - we can index either individual fields or full JSON documents
From this point on we can just query for documents in ElasticSearch:
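As a rough sketch of what querying the index over REST can look like, the Python snippet below builds a search request using the `query_string` query. The host `http://localhost:9200`, the index name `tweets`, the query text and the helper `build_search_request` are assumptions for illustration; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, index, text, size=10):
    """Build (but do not send) a POST request to the index's _search endpoint."""
    url = "{}/{}/_search".format(base_url.rstrip("/"), index)
    # query_string accepts Lucene-style syntax such as field:value AND term.
    body = {"query": {"query_string": {"query": text}}, "size": size}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical node address, index and query text.
search_request = build_search_request(
    "http://localhost:9200", "tweets", "user:alice AND firefox"
)
print(search_request.full_url)

# To run it against a live node:
# with urllib.request.urlopen(search_request) as resp:
#     result = json.load(resp)
```

Keeping the request construction separate from the network call makes it easy to inspect the JSON body before pointing it at a real cluster.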
Now, what to do with this? What's really interesting is being able to use this from within CDA. That way we can use ElasticSearch as a datasource not only for dashboards but also for reports. Using Kettle as the bridge between ES and our frontend tools guarantees a great degree of isolation and security. Here's a sample transformation:
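The core of such a bridge is turning the JSON search response into plain rows that a tabular tool can consume. The Python sketch below shows that flattening step under stated assumptions: the response follows the usual `hits.hits[*]._source` layout, and both the hand-written `sample_response` and the helper `hits_to_rows` are hypothetical. It is not a description of the transformation itself.

```python
def hits_to_rows(response, fields):
    """Flatten a search response into a list of rows, one per hit.

    Missing fields become None so that every row has the same width.
    """
    rows = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        rows.append([source.get(name) for name in fields])
    return rows


# Hand-written example response, not real output from a cluster.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"user": "alice", "text": "hello firefox"}},
            {"_id": "2", "_source": {"user": "bob"}},
        ]
    }
}

table = hits_to_rows(sample_response, ["user", "text"])
for row in table:
    print(row)
```

Fixing the column list up front, rather than deriving it from each document, keeps the output rectangular even though the documents themselves are schema-less.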
Now we can tie this to CDA, and then use it with CDE for our dashboards. Here's the result:
Note: in order to run this from the Pentaho BI server, both jsonpath.jar and json_simple.jar have to be added to the lib dir of the application server.
With all this we can quickly build any dashboard that uses these resources. As a very rough demonstration I built this one: