Pentaho 7.1 is out

Remember when I said at the time of the previous release that Pentaho 7.0 was the best release ever? Well, that was true till today! But not any more, as 7.1 is even better! :p

Why do I say that? It's a big step forward in the direction we've been aiming for - consolidating and simplifying our stack instead of passing the complexity on to the end user.

These are the main features in the release:

  • Visual Data Experience
    • Data Exploration (PDI)
      • Drill Down
      • New Viz's: Geo Map, Sunburst, Heat Grid
      • Tab Persistency
      • Several other improvements including performance

    • Viz API 3.0 (Beta)
      • Viz API 3.0, with documentation
      • Rollout of consistent visualizations between Analyzer, PDI and Ctools

  • Enterprise Platform
    • VCS-friendly features
      • File / repository abstraction
      • PDI files properly indented
      • Repository performance improvements

    • Reintroducing Ops Mart
    • New default theme on User Console
    • Pentaho Mobile deprecation

  • Big Data Innovation
    • AEL - Adaptive Execution Layer (via Spark)
    • Hadoop Security
      • Kerberos Impersonation (for Hortonworks)
      • Ranger support

    • Microsoft Azure HD Insights shim

I'm getting tired just listing all this stuff... Now into a bit more detail. I'll jump back and forth between these topics, ordering by the ones that... well, that I like the most :p

Adaptive Execution with Spark

This is huge; we've decoupled the execution engine from PDI so we can plug in other engines. For now we have two:

  • Pentaho - the classic pentaho engine
  • Spark - you've guessed it...

What's the goal of this? Making sure we treat our ETL development with a pay-as-you-go approach: first we worry about the logic, then we select the engine that makes most sense.

AEL execution of Spark
On other tools (and even on some of our own - that's why I don't like our own approach to Pentaho Map Reduce), you have to think from the start about the engine and technology you're going to use. But this makes little sense.

Scale as you go

Pentaho’s message is one of future-proofing the IT architecture, leveraging the best of what different technologies have to offer without imposing a certain configuration or persona as the starting point. The market is moving towards a demand for BA/DI to come together in a single platform. Pentaho has an advantage here: we have seen with our customers that BI and DI are better together, and that's what sets us apart from the competition. Gartner predicts that BI and Discovery tool vendors will partner to accomplish this, while larger, proprietary vendors will attempt to build these platforms themselves. Either way, Pentaho has a unique and early lead in delivering this platform.

A good example is the story we can tell about governed blending. We don’t need to impose any pre-determined configuration on customers; we can start with the simple use of data services and unmaterialized data sets. If it’s fast enough, we’re done. If not, we can materialize the data into a database or even an enterprise data warehouse. If it’s fast enough, we’re done. If not, we can resort to other technologies – NoSQL, Lucene-based engines, etc. If it’s fast enough, we’re done. If everything else fails, we can set up an SDR blueprint, which is the ultimate scalability solution. And throughout this entire journey we never let go of the governed blending message.
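The escalation logic above is essentially a loop: try the simplest tier, and only step up when it isn't fast enough. A minimal sketch in Python - the tier names come from the text, but the latency threshold and measured timings are hypothetical placeholders, not anything Pentaho ships:

```python
# A sketch of the "scale as you go" decision loop described above.
# The tier names mirror the options in the text; the latency threshold
# and measured timings are made-up illustration values.

TIERS = [
    "unmaterialized data service",
    "materialized database / EDW",
    "NoSQL or Lucene-based engine",
    "SDR blueprint",
]

def pick_tier(measured_latency_s, threshold_s=2.0):
    """Walk the tiers in order and stop at the first one that is fast enough."""
    for tier, latency in zip(TIERS, measured_latency_s):
        if latency <= threshold_s:
            return tier   # fast enough - we're done
    return TIERS[-1]      # everything else failed: the SDR blueprint

# Example: the data service is too slow, but a materialized table is fine.
print(pick_tier([5.0, 1.2, 0.8, 0.5]))  # -> materialized database / EDW
```

The point of the sketch is simply that complexity is only added when a measurement forces it, never as the starting point.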

This is an insanely powerful and differentiated message; we allow our customers to start simple, and only go down the more complex routes when needed. When going down a given path, users know, accept and see the value in the extra complexity needed to address scalability.

Adaptive Execution Layer

The strategy described for the “Logical Data Warehouse” is exactly the one we need for the execution environment. A lot of times customers get hung up on a certain technology without even understanding whether they actually need it. Countless times we’ve seen customers asking for Spark without a use case that justifies it. We have to challenge that.

We need to move towards a scenario where the customer doesn’t have to think about technology first. We’ll offer one single approach and ways to scale as needed. If a data integration job works on a single Pentaho Server, why bother with other stacks? If it’s not enough, then making the jump to something like Map Reduce or Spark has to be a linear move.

The following diagram shows the Adaptive Execution Layer approach just described:

AEL conceptual diagram

Implementation in 7.1 - Spark

For 7.1 we chose Spark as the first engine to implement for AEL. It has seen a lot of adoption, and the fact that it's not restricted to a map-reduce paradigm makes it a good candidate for separating business logic from execution.

How to make it work? This high definition conceptual diagram should help me explain it:

An architectural diagram so beautiful it should almost be roughly correct

We start by generating a PDI Driver for Spark from our own PDI instance. This is a very important starting point because this methodology ensures that any plugins we may have developed / installed will work when we run the transformation - we couldn't let go of Pentaho's extensibility capabilities.

That driver will be installed on an edge node of the cluster, and that's what will be responsible for executing the transformation. Note that by using Spark we're leveraging all its characteristics: we don't even need a cluster, as we can select whether to use Spark in standalone or YARN mode, even though I suspect the majority of users will be on YARN mode, leveraging the clustering capabilities.

Runtime flow

One of the main capabilities of AEL is that we don't need to think about adapting the business logic to the engine; we develop the transformation first and then select where we want to execute it. This is how it will work from within Spoon:

Creating and selecting a Spark run configuration

We created the concept of a Run Configuration. Once we select a run configuration set up to use Spark as the engine, PDI will send the transformation to the edge node and the driver will then execute it.

All transformation steps in PDI will run in AEL-Spark! This was the thought from the start. To understand how this works, there are two fundamental concepts:

  • Some steps are safe to run in parallel, while others are not parallelizable or not recommended for clustered engines such as Spark. All the steps that take one row as input and one row as output (Calculator, Filter Rows, Select Values, etc.) are parallelizable. Steps that require access to other rows, or that depend on the position and order of the row set, still run on Spark, but have to run on the edge node, which implies a collect of the RDDs (Spark's datasets) from the nodes. It is what it is. And how do we know which is which? We simply tell PDI which steps are safe to run in parallel, and which are not.
  • Some steps can leverage Spark's native APIs for performance and optimization. When that's the case, we can pass to PDI a native implementation of the step, greatly increasing the scalability at possible bottleneck points. Examples of these steps are the Hadoop file inputs, HBase lookups, and many more.
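The first bullet can be sketched with a toy model - this is plain Python, not AEL code, and the function names are only analogies: row-wise steps can be applied to each partition independently, while order-dependent steps force all partitions to be gathered in one place first (the "collect" to the edge node mentioned above):

```python
# A toy model (not Pentaho code) of the two step categories described above.

from functools import reduce

partitions = [[3, 1], [4, 1], [5, 9]]  # rows spread across cluster nodes

def run_parallel_step(parts, fn):
    """A one-row-in / one-row-out step (calculator, filter, select values):
    safe to apply independently on every partition - stays distributed."""
    return [[fn(row) for row in part] for part in parts]

def run_collected_step(parts, fn):
    """A step that needs the whole row set (e.g. depends on row order):
    gather every partition into a single list first, then apply the step."""
    collected = reduce(lambda a, b: a + b, parts, [])
    return fn(collected)

doubled = run_parallel_step(partitions, lambda r: r * 2)  # no data movement
ordered = run_collected_step(partitions, sorted)          # forces a "collect"
print(doubled)  # [[6, 2], [8, 2], [10, 18]]
print(ordered)  # [1, 1, 3, 4, 5, 9]
```

The cost difference is the whole point: the first kind scales with the cluster, the second funnels everything through one node, exactly as the text says happens on the edge node.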

Feedback please!

Even though running on secured clusters (and leveraging impersonation) is an EE-only capability, AEL is also available in CE. The reason is that we want help from the community in testing, hardening, nativizing more steps and even writing more engines for AEL. So go kick the tires on this thing! (And I'll surely do a blog post on this alone.)

Visual Data Experience (PDI) Improvements

This is one of my favorite projects. You may be wondering what's the real value of this improved data experience in PDI, and why it's all that exciting... Let me tell you why: this is the first materialization of something we hope becomes the way to handle data in Pentaho, regardless of where we are. So this thing we're building in PDI will eventually make its way to the server... I'd like to throw away all the technicalities we expose in our server (Analyzer for OLAP, PIR for metadata, PRD for reports...) in favor of a single content-driven approach and usability experience. This is surely starting to sound confusing, so I'd better stop here :p

In the 7.1 release, Pentaho provides new Data Explorer capabilities to further support the following key use cases more completely:

  • Data Inspection: During the process of cleansing, preparing, and onboarding data, organizations often need to validate the quality and consistency of data across sources. Data Explorer enables easier identification of these issues, informing how PDI transformations can be adjusted to deliver clean data.
  • BI Prototyping: As customers deliver analytics-ready data to business analysts, Data Explorer reduces the iterations between business and IT. Specifically, it enables the validation of the metadata models that are required for using Pentaho BA. Models can be created in PDI and tested in Data Explorer, ensuring data sources are analytics-ready when published to BA.

And how? By adding these improvements:
New visualization: Heatgrid

This chart can display 2 measures (metrics) and 2 attributes (categories) at once. Attributes are displayed on the axes and measures are represented by the size and color of the points on the grid. It is most useful for comparing metrics at the ‘intersection’ of 2 dimensions, as seen below in the comparison of quantity and price across combinations of different territories and years (did I just define what a heatgrid is?! No wonder it's taking me hours to write this post!):
Look at all those squares!
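To make the heatgrid's encoding concrete, here is a hypothetical sketch of the mapping it implies - two attributes position each cell, two measures drive size and color. The field names are made up for illustration; this is not Data Explorer's API:

```python
# Illustrative only: the heatgrid data contract described above,
# as a plain mapping from rows to visual encodings.

rows = [
    {"territory": "EMEA", "year": 2016, "quantity": 120, "price": 9.5},
    {"territory": "APAC", "year": 2016, "quantity": 80,  "price": 7.0},
]

def to_heatgrid_cells(rows):
    return [{
        "x": r["territory"],   # attribute 1 -> horizontal axis
        "y": r["year"],        # attribute 2 -> vertical axis
        "size": r["quantity"], # measure 1 -> point size
        "color": r["price"],   # measure 2 -> point color
    } for r in rows]

print(to_heatgrid_cells(rows)[0])
# {'x': 'EMEA', 'y': 2016, 'size': 120, 'color': 9.5}
```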

New visualization: Sunburst

A pie chart on steroids that can show hierarchies. Less useless than a normal pie chart!
Circles are also pretty!
New visualization: Geo Maps

The geo map uses the same auto-geocoding as Analyzer, with out of box ability to plot latitude and longitude pairs, all countries, all country subdivisions (state/province), major cities in select countries, as well as United States counties and postal codes.
Geo Map visualization
Drill down capabilities

When using dimensions in Data Explorer charts or pivot tables, users can now expand hierarchies in order to see the next level of data. This is done by double clicking a level in the visualization (for instance, double click a ‘country’ bar in a bar chart to drill down to ‘city’ data).

Drill down in the visualizations...

This can be done through the visualizations or through the labels / axes. Once again, look at this as the beginning of a coherent way to handle data exploration!
... or from where it makes more sense
And this is only the first of a new set of actions we'll introduce here...
Analysis persistency

In 7.0 these capabilities were a one-time inspection only. Now we've taken it a step further - the analyses get persisted with the transformations. You can now use them to validate the data, get insights right on the spot, and make sure everything is lined up to show to the business users.
Analysis persistency indicator
Viz Api 3.0

Every old timer knows how much disparity we've had throughout the stack in terms of offering consistent visualizations. This is not an easy challenge to solve - they are different because different parts of our stack were created at completely different times and places, so a lot of different technologies were used. An immediate consequence is that we can't just add a new viz and expect it to be available in several places of the stack.

We've been working on a visualization layer, codenamed VizAPI (for a while, actually, but now we've reached a point where we can make it available in beta form), that brings this much-needed consistency and consolidation.

Viz API compatible containers

In order to make this effort worthwhile, we needed to solve things in the following order:

  1. Define the VizAPI structure
  2. Implement the VizAPI in several parts of the product
  3. Document and allow users to extend it

And... we did it. We re-implemented all the visualizations in this new VizAPI structure and adapted 3 containers - Analyzer, Ctools and DET (Data Exploration) in PDI - and, as a consequence, the look and feel of the visualizations is the same.

Analyzer visualizations are now much better looking _and_ usable

One important note though - migrating users will still default to the "old" VizAPI (yeah, we called it the same as well, isn't that smart :/ ) so as not to risk interfering with existing installations. In order to test an existing project with the new visualizations you need to change the VizAPI version number. New installs will default to the new ones.

In order to allow people to include their own visualizations and promote more contributions to Pentaho (I'd love to start seeing more contributions to the marketplace with new and shiny Viz's), we need to make it really easy for people to know how to create them.

And I think we did! Even though this deserves its own blog post, just take a look at the documentation the team prepared for this:
Instructions for how to add new visualizations

You'll see this documentation has "beta" written on it. The reason is simple - we decided to put it out there, collect feedback from the community and implement any changes / fine-tuning / etc. before the 8.0 timeframe, when we'll lock this down, guaranteeing long-term support for new visualizations.

MS HD Insights

HD Insights (HDI) is a hosted Hadoop cluster that is part of Microsoft’s Azure cloud offering. HDI is based on the Hortonworks Data Platform (HDP). One of the major differences between the standard HDP release and HDI’s offering is the storage layer: HDI connects to local cluster storage via HDFS, or to Azure Blob Storage (ABS) via the WASB protocol.

We now have a shim that allows us to leverage this cloud offering, something we've been seeing get more and more interest in the marketplace.

Hortonworks security support

This is a continuation of the previous release, available in the Enterprise Edition (EE).
Added support for Hadoop user impersonation
Earlier releases of PDI introduced enterprise security for Cloudera, specifically, Kerberos Impersonation for authentication and integration with Apache Sentry for authorization.

This release of PDI extends these enterprise-level security features to Hortonworks’s Hadoop distribution as well. Kerberos Impersonation is now supported on Hortonworks’s HDP. For authorization, PDI integrates with Apache Ranger, an alternative OSS component included in the HDP security platform.

Data Processing-Enhanced Spark Submit and SparkSQL JDBC

Earlier PDI and BA/Reporting releases broadened access to Spark for querying and preparing data through a dedicated transformation step, Spark Submit, and through SparkSQL JDBC.

This release extends these existing features to additional vendors so that they can be used more widely. Apart from the additional vendors, these features are now certified with the more up-to-date Spark 2.0.

Additional big data infrastructure vendors supported for these functionalities apart from Cloudera and Hortonworks:

  1. Amazon EMR
  2. MapR
  3. Azure HD Insights

VCS Improvements

Repository agnostic transformations and jobs

Until now, some specific step interfaces (the sub-transformation one being the most impactful) forced the ETL developer to choose, upfront, between using a file on the file system or the repository. This prevented us from abstracting the environment we're working in, so checking things out from git/svn and just importing them was a no-go.

Here's an example of a step that used this:

The classic way to reference dependent objects
In general, we need to abstract the linkage to other artifacts (sub-jobs and sub-transformations), making it independent of the repository or file system used.

The linkage needs to work in all environments, whether it is a repository (Pentaho, Database, File) or a file-based system (kjb and ktr files).

The linkage needs to work independently of the execution system: on the Pentaho Server, on a Carte server (with a repository or a file-based system), in Map Reduce, and in future execution engines as part of the Adaptive Execution Layer (AEL).

So we turned this into something much simpler:

The current approach to define dependencies
We just define where the transformation lives. This may seem like a "what, just this??" moment, but now we can work locally or remotely, check into a repository, and even automate promotion and control the lifecycle across different installation environments. I'm absolutely sure existing users will value this a lot (and we can finally deprecate the stupid file-based repository).

KTR / KJB XML format

We did something very simple (in concept) but very useful. While we absolutely don't recommend playing around with the job and transformation files (they are plain old XML files), we now guarantee that they are properly indented. Why? Because when you use a version control system (git / svn, I don't care which as long as you USE one!), you can easily identify what changed from version to version.
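To see why indentation helps diffs, here is a small Python illustration - the element names are invented for the example and are not necessarily the real KTR schema:

```python
# Why indentation matters for version control: pretty-print a flat XML
# document so each element lands on its own line (illustrative schema only).

import xml.dom.minidom

flat = "<transformation><step><name>Filter rows</name></step></transformation>"

pretty = xml.dom.minidom.parseString(flat).toprettyxml(indent="  ")
print(pretty)
# With one element per line, renaming a single step shows up in `git diff`
# as a one-line change instead of one giant modified line.
```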

Repository performance improvements

We want you to use the Pentaho Repository. And till now, performance while browsing that repository from Spoon was crap (there's no other way to say it!). We addressed that - it's now about 100x faster to browse and open files from the repository.

Operations Mart Updates

Also known as the ops mart, available in EE. It used to work. Then it stopped working. Now it's working again. Yay :/

I'll skip this one. I hate it. We're working on a different way to handle monitoring on our product, and at scale

Other Data Integration Improvements

Apart from all the big new features above, there are some smaller data integration enhancements added to the product that make building data pipelines with Pentaho easier.

Metadata Injection Enhancement

Metadata Injection enables creating generalized ETL transformations whose behavior can be changed at run-time, significantly improving data integration developer agility and productivity.
In this release, a new option for injecting constants has been added to Metadata Injection, which helps make steps more dynamic.
This functionality was also extended to the Analytic Query and Dimension Lookup/Update steps, making them highly dynamic as well. This improves the Data Warehouse & Customer 360 blueprints and similar analytic data pipelines.
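As a loose analogy (plain Python, not PDI's actual API), metadata injection is like a template whose concrete behavior is supplied at run time - here, which fields to keep and which constants to add. All names below are made up for illustration:

```python
# An analogy for Metadata Injection: a generic "template" transformation
# whose concrete behavior is injected when it runs, not when it is designed.

def run_template(rows, metadata):
    """Apply injected metadata: keep only the listed fields, then add the
    injected constants (mirroring the new 'constant' option mentioned above)."""
    out = []
    for row in rows:
        projected = {k: row[k] for k in metadata["fields"]}
        projected.update(metadata["constants"])
        out.append(projected)
    return out

rows = [{"id": 1, "name": "a", "junk": "x"},
        {"id": 2, "name": "b", "junk": "y"}]
meta = {"fields": ["id", "name"], "constants": {"source": "crm"}}

print(run_template(rows, meta))
# [{'id': 1, 'name': 'a', 'source': 'crm'}, {'id': 2, 'name': 'b', 'source': 'crm'}]
```

The same template runs against any source: only the injected `meta` changes, which is exactly the agility claim in the text.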

Lineage Collection Enhancement

Customers can now configure the location of the lineage output, including the ability to write to a VFS location. This helps customers maintain lineage in clustered / transient-node environments, such as Pentaho MapReduce. Lineage information helps with customers' data compliance and security needs.

XML Input Step Enhancement

The XML Input Stream (StAX) step has been updated to receive XML from a previous step. This makes it easier to develop XML processing in a data pipeline when you are working with XML data.
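Conceptually, this means the step can parse XML arriving as data rather than as a file. A rough Python analogue, stream-parsing XML held in memory instead of on disk (the element names are illustrative only, not a Pentaho format):

```python
# Stream-parse XML that arrived as a field value rather than a file,
# analogous to the StAX step now accepting XML from a previous step.

import io
import xml.etree.ElementTree as ET

xml_from_previous_step = "<orders><order id='1'/><order id='2'/></orders>"

# iterparse accepts any file-like object, so an in-memory string works;
# elements are processed one at a time instead of loading a whole DOM.
ids = [elem.get("id")
       for _, elem in ET.iterparse(io.StringIO(xml_from_previous_step))
       if elem.tag == "order"]

print(ids)  # ['1', '2']
```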

New Mobile approach (and the deprecation of Pentaho Mobile)

We used to have a mobile specific plugin, introduced in a previous Pentaho release, that enabled touch gestures to work with analyzer.

But while it sounded good, it didn't work as we'd expected. The fact that we had to develop and maintain a completely separate access path to information caused that mobile plugin to become very outdated.

To complement that, the maturity of mobile browsers and the increased strength of tablets make it possible for Pentaho reports and analytic views to be accessed directly, without any specialized mobile interface. Thus, we are deprecating the Pentaho mobile plug-in and investing in the responsive capabilities of the interface.

Sounds bad? Actually, it's not - just use your tablet to access your EE Pentaho; it looks great :)

Pentaho User Console Updates

Sapphire theme in PUC

Starting in Pentaho 7.1, Onyx is deprecated and removed from the list of available themes in PUC. The “Sapphire” theme, introduced in 7.0, is now PUC’s default theme, with Crystal as the available alternative.

Moreover, Pentaho 7.1 includes a newly refreshed log-in screen, based on the Sapphire theme introduced in Pentaho 7.0. This was already in 7.0 CE and is now the default for EE as well.


This is a spectacular release! I should be celebrating! But instead, it's 8pm, I'm stuck in the office writing this blog post, and already very very stressed because I have all my 8.0 work stuff already piling up on my inbox... :(

I'm out, have fun!