Pentaho 8.1 is available

The team has once again over-delivered on a dot release! Below are what I think are the main highlights of Pentaho 8.1, as well as a long list of additional updates.


If you don’t have time to read to the end of my very long blog, just save some time and download it now. Go get your Enterprise Edition or trial version from the usual places.

For CE, you can find it on the community home!



Cloud

One of the biggest themes of the release: increased support for Cloud. A lot of vendors are fighting to become the best provider, and what we try to do is make sure Pentaho users can watch all that while sitting comfortably in their chairs, having a glass of wine, and really not caring about the outcome. As in a lot of areas, we want to be agnostic (which is not to say that we won’t leverage the best of each) and really focus on logic and execution.


It’s hard to do this as a one-time effort, so we’ve been adding support as needed (and by “as needed” I really mean based on the prioritization given by the market and our customers). A big focus of this release was Google and AWS:



Google Storage (EE)

Google Cloud Storage is a RESTful unified storage service for storing and accessing data on Google's infrastructure. PDI support for importing and exporting data to/from Cloud Storage is now provided through a new VFS driver (gs://). You can use it in the several steps that support VFS, and you can also browse its contents.


These are the roles required on Google Storage for this to work:
● Storage Admin
● Storage Object Admin
● Storage Object Creator
● Storage Object Viewer


In terms of authentication, you’ll need the following environment variable defined:


GOOGLE_APPLICATION_CREDENTIALS="/opt/Pentaho81BigQuery.json"


From this point on, just treat it as a normal VFS source.
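Before launching Spoon, it can save some head scratching to confirm that the variable really points at a readable key file. The following is only a minimal sketch of such a check, not something that ships with Pentaho: the function name `check_google_credentials` is hypothetical, and the only thing taken from the setup above is the `GOOGLE_APPLICATION_CREDENTIALS` variable itself.

```python
import json
import os


def check_google_credentials(env=None):
    """Return the parsed key file named by GOOGLE_APPLICATION_CREDENTIALS.

    Hypothetical helper: raises a descriptive error when the variable is
    missing or the file it names does not exist.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Key file not found: {path}")
    # The service account key is expected to be a JSON document.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `check_google_credentials()` with no arguments reads the real environment; passing a dictionary makes it easy to try out without touching your shell.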


Google BigQuery – JDBC Support (EE/CE)

BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse. Fancy name for a database, and that’s how we treat it.


In order to connect to it, we first need the appropriate drivers. The steps here are pretty simple:


1. Download free driver: https://cloud.google.com/bigquery/partners/simba-drivers/
2. Copy google*.* files from Simba driver to /pentaho/design-tools/data-integration/libs folder


Host Name will default to https://www.googleapis.com/bigquery/v2 but your mileage may vary.


Unlike the previous item, authentication here doesn’t use the environment variable that Google VFS relies on. Instead, it is done at the JDBC driver level, through a driver option, OAuthPvtKeyPath, set in the Database Connection options. You need to point that option to the Google Storage certificate in the P12 key format.
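To make the driver-level options a bit more concrete, here is a rough sketch that assembles a connection URL. Only the host name and `OAuthPvtKeyPath` come from this post; the `jdbc:bigquery://` prefix and the `ProjectId`, `OAuthType` and `OAuthServiceAcctEmail` option names are my assumptions about the Simba driver's conventions, and the project, email and path values are made up, so check the driver documentation before relying on any of it.

```python
def bigquery_jdbc_url(project_id, service_account_email, p12_key_path,
                      host="https://www.googleapis.com/bigquery/v2", port=443):
    """Build a semicolon-separated JDBC URL (option names are assumptions)."""
    options = {
        "ProjectId": project_id,                         # assumed option name
        "OAuthType": 0,                                  # assumed: 0 = service account
        "OAuthServiceAcctEmail": service_account_email,  # assumed option name
        "OAuthPvtKeyPath": p12_key_path,                 # option named in this post
    }
    tail = ";".join(f"{name}={value}" for name, value in options.items())
    return f"jdbc:bigquery://{host}:{port};{tail}"


# Hypothetical values, for illustration only.
print(bigquery_jdbc_url("my-project",
                        "etl@my-project.iam.gserviceaccount.com",
                        "/opt/keys/pentaho.p12"))
```

In Spoon you would normally enter these as separate entries in the connection's options rather than as one long URL; the sketch just shows how the pieces relate.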


The following Google BigQuery roles are required:


1. BigQuery Data Viewer
2. BigQuery User


Google BigQuery – Bulk Loader (EE)

While you can use a regular Table Output step to insert data into BigQuery, that’s going to be slow as hell (who said hell was slow? This expression makes no sense at all!). So we’ve added a step for that: Google BigQuery Loader.


This step leverages Google’s loading abilities, and the load is processed on Google's side, not in PDI. So the data, which has to be in Avro, JSON or CSV format, must first be copied to Google Storage. From that point on, it is pretty straightforward. Authentication is done via the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to the Google JSON file.






Google Drive (EE/CE)


While Google Storage will probably be seen more frequently in production scenarios, we also added support for Google Drive, a file storage and synchronization service that allows users to store files on Google's servers, synchronize files across devices, and share files.


This is also done through a VFS driver, but given that authentication is per user, a few steps need to be completed to leverage this support:


● Copy your Google client_secret.json file into the credentials directory listed below (the Google Drive option will not appear as a Location until you copy the client_secret.json file into the credentials directory and restart):
o Spoon: data-integration/plugins/pentaho-googledrive-vfs/credentials directory, and restart Spoon.
o Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory, and restart the server.
● Select Google Drive as your Location. You are prompted to log in to your Google account.
● Once you have logged in, the Google Drive permission screen displays.
● Click Allow to access your Google Drive Resources.
● A new file called StoredCredential will be added to the same place where you put the client_secret.json file. This file needs to be added to the Pentaho Server credentials location, and that authentication will then be used.


Analytics over BigQuery (EE/CE, depending on the tool used)

This JDBC connectivity to Google BigQuery, as defined previously for Spoon, can also be used throughout all the other Business Analytics browser and client tools – Analyzer, CTools, PIR, PRD, modeling tools, etc. Some care has to be taken here, though, as BigQuery’s pricing is related to 2 factors:


● Data stored
● Data queried


While the first one is relatively straightforward, the second one is harder to control, as you’re charged according to the total data processed in the columns selected. For instance, a ‘select *’ query should be avoided if only specific columns are needed. To be absolutely clear, this has nothing to do with Pentaho; these are Google BigQuery pricing rules.
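A bit of arithmetic shows why column selection matters so much. The sketch below is purely illustrative: the table, its per-column sizes and the price per terabyte are all made-up numbers (check Google's current pricing page for real ones), and `estimated_query_cost` is a hypothetical helper, not a BigQuery API.

```python
def estimated_query_cost(column_sizes_gb, selected_columns, price_per_tb):
    """Estimate the cost of a query from the size of the columns it reads."""
    scanned_gb = sum(column_sizes_gb[name] for name in selected_columns)
    return scanned_gb / 1024 * price_per_tb


# Hypothetical table: one wide free-text column dominates the storage.
sizes_gb = {"state": 2, "city": 6, "year": 1, "sales": 3, "notes": 500}
PRICE_PER_TB = 5.0  # assumed price, for illustration only

select_star = estimated_query_cost(sizes_gb, sizes_gb.keys(), PRICE_PER_TB)
two_columns = estimated_query_cost(sizes_gb, ["state", "sales"], PRICE_PER_TB)
print(f"select *            : {select_star:.4f}")
print(f"select state, sales : {two_columns:.4f}")
```

With these invented numbers, reading only `state` and `sales` scans 5 GB instead of 512 GB, so the narrow query costs roughly a hundredth of the ‘select *’ one.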


So ultimately, and a bit like we need to do with all databases / data warehouses, we need to be smart and work around the constraints (usually speed and volume, in this case price as well) to best leverage what these technologies have to offer. Some examples are given here:


● By default, there is BigQuery caching and cached queries are free. For instance, if you run a report in Analyzer, clear the Mondrian cache, and then reload the report, you will not be charged (thanks to the BigQuery caching)
● Analyzer: Turn off auto refresh, so that you can design your report layout first, including calculations and filtering, without querying the database automatically after each change
● Analyzer: Drag in filters before levels to reduce data queried (i.e. filter on state = California BEFORE dragging city, year, sales, etc. onto canvas)
● Pre-aggregate data in BigQuery tables so they are smaller in size where possible (to avoid queries across all raw data)
● GBQ administrators can set query volume limits by user, project, etc. (quotas)




AWS S3 Security Improvements (IAM) (EE/CE)

PDI is now able to get IAM security keys from the following places (in this order):


1. Environment Variables
2. Machine’s home directory
3. EC2 instance profile


This added flexibility helps accommodate different AWS security scenarios, such as integration with S3 data via federated SSO from a local workstation, by providing secure PDI read/write access to S3 without making the user provide hardcoded credentials.


The IAM user secret key and access key can be stored in one place so they can be leveraged by PDI without repeated hardcoding in Spoon. These are the environment variables that point to them:


AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
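The lookup order listed above can be pictured as a simple fallback chain. This is only a sketch of the idea, not PDI's implementation: I am assuming that "machine’s home directory" means the conventional `~/.aws/credentials` file with a `[default]` profile, and the function name `resolve_aws_credentials` is hypothetical. The EC2 instance profile branch is left as a placeholder, since it needs a call to the instance metadata service.

```python
import configparser
import os


def resolve_aws_credentials(env, home_dir):
    """Return (source, access_key, secret_key), trying each place in order."""
    # 1. Environment variables
    access = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if access and secret:
        return "environment", access, secret

    # 2. Home directory (assumption: the conventional ~/.aws/credentials file)
    parser = configparser.ConfigParser()
    path = os.path.join(home_dir, ".aws", "credentials")
    if parser.read(path) and parser.has_section("default"):
        profile = parser["default"]
        if "aws_access_key_id" in profile and "aws_secret_access_key" in profile:
            return ("home directory",
                    profile["aws_access_key_id"],
                    profile["aws_secret_access_key"])

    # 3. EC2 instance profile: keys would come from the instance metadata
    #    service, which this sketch does not call.
    return "instance profile", None, None
```

For example, `resolve_aws_credentials(os.environ, os.path.expanduser("~"))` tells you which of the three places would win on your machine.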






Big Data / Adaptive Execution Layer (AEL) Improvements





Bigger and Better (EE/CE)

AEL provides spectacular scale out capabilities (or is it scale up? I can’t cope with these terminologies…) by seamlessly allowing a very big transformation to leverage a clustered processing engine.


Currently we have support for Spark through the AEL layer, and throughout the latest releases we’ve been improving it in 3 distinct areas:


● Performance and resource optimizations
o Added Spark Context Reuse which, under certain circumstances, can speed up startup in the range of 5x, proving especially useful under development conditions
o Spark History Server integration, providing centralized administration, auditing and performance reviews of the transformations executed in Spark
o Ability to pass customized Spark properties down to the cluster, allowing finer-grained control of the execution process
● Increased support for native steps (e.g., leveraging the Spark-specific group by instead of the PDI engine one)
● Adding support for more cloud vendors – and we just did that for EMR 5.9 and MapR 5.2


This is the current support matrix for Cloud Vendors:






Sub Transformation support (EE/CE)

This one is big, as it was the result of a big and important refactor of the Kettle engine. AEL now supports executing sub transformations through the Transformation Executor step, a long-standing request since the times of good old PMR (Pentaho Map Reduce).

Big Data formats: Added support for Orc (EE/CE)

This is not directly related to AEL, but in most of the use cases where we want AEL execution, we’ll need to read data in a big-data-specific format. In previous releases we added support for Parquet and Avro, and we have now added support for ORC (Optimized Row Columnar), a format favored by Hortonworks.


Like the others, ORC will be handled natively when transformations are executed in AEL.


Worker Nodes (EE)





Jumping from scale-out to scale-up (or the opposite, like I mentioned, I never know), we continue to do lots of improvements on the Worker Nodes project. This is an extremely strategic project for us as we integrate with the larger Hitachi Vantara portfolio.


Worker nodes allow you to execute Pentaho work items, such as PDI jobs and transformations, with parallel processing and dynamic scalability with load balancing in a clustered environment. It operates easily and securely across an elastic architecture, which uses additional machine resources as they are required for processing, operating on premise or in the cloud.


It uses the Hitachi Vantara Foundry project, which leverages popular technologies under the hood such as Docker (Container Platform), Chronos (Scheduler) and Mesos/Marathon (Container Orchestration).


For 8.1 there are several other improvements:


● Improvements in monitoring, with accurate propagation of Work Item status
● Performance improvements by optimizing the startup times for executing the work items
● Customizations are now externalized from the Docker build process
● Job clean up functionality








Streaming





In Pentaho 8.0 we introduced a new paradigm to handle streaming datasources. The fact that it’s a permanently running transformation required a different approach: the new streaming steps define the windowing mode and point to a sub transformation that is then executed in micro batches.


That works not only for ETL within the Kettle engine but also in AEL, enabling Spark transformations to feed from Kafka sources.
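For those who haven't met micro batching before, the idea is to cut a never-ending stream into small chunks and hand each chunk to the sub transformation. Here is a minimal sketch of that idea, assuming a batch closes when it reaches a row count or when a duration has passed, whichever comes first; `micro_batches` and its parameters are hypothetical names of mine, not the Kettle implementation.

```python
def micro_batches(records, max_rows, max_millis):
    """Group (timestamp_ms, row) pairs into batches by row count or duration."""
    batch, started = [], None
    for timestamp, row in records:
        # Close the open batch if this record arrives after the time limit.
        if batch and timestamp - started >= max_millis:
            yield batch
            batch, started = [], None
        if started is None:
            started = timestamp
        batch.append(row)
        # Close the batch as soon as it holds enough rows.
        if len(batch) >= max_rows:
            yield batch
            batch, started = [], None
    if batch:
        yield batch  # flush whatever is left when the source ends


events = [(0, "a"), (10, "b"), (20, "c"), (500, "d"), (510, "e")]
for batch in micro_batches(events, max_rows=2, max_millis=100):
    print(batch)  # each batch would be handed to the sub transformation
```

A real streaming step would also close a batch on a timer even when no new record arrives; the sketch only reacts to incoming records to stay short.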


New Streaming Datasources: MQTT, and JMS (Active MQ / IBM MQ) (EE/CE)

Leveraging the new streaming approach, there are 2 new steps available – well, one new and one (two, actually) refreshed.


The new one is MQTT – Message Queuing Telemetry Transport – an ISO standard publish-subscribe-based messaging protocol that works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. Alternative IoT centric protocols include AMQP, STOMP, XMPP, DDS, OPC UA and WAMP.




There are 2 new steps – MQTT Input and MQTT Output, that connect with the broker for consuming and publishing back the results.


Other than this new, IoT centered streaming source, there are 2 new steps, JMS Input and JMS Output. These steps replace the old JMS Consumer/Producer and the IBM Websphere MQ steps, supporting, in the new mode, the following message queue platforms:


● ActiveMQ
● IBM MQ


Safe Stop (EE/CE)


This new paradigm to handle streaming sources introduced a new challenge that we never had to face. Usually, when we triggered jobs and transformations, they had a well-defined start and end; our stop functionality was used when we basically wanted to kill a running process because something was not going well.


However, on these streaming use cases, a transformation may never finish. So stopping a transformation the way we’ve always done – by stopping all steps at the same time – could have unwanted results.


So we implemented a different approach: we added a new option to safe stop a transformation, implemented within Spoon, Carte and the Abort step. Instead of killing all the step threads, it stops the input steps and lets the other steps gracefully finish their processing, so no records currently being processed are lost.
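The difference from a hard stop is easier to see in a toy model. The sketch below is not Kettle code: it uses two plain threads and a queue, with hypothetical names throughout, to show an input step that stops reading when asked while the downstream step keeps going until every row already in flight has been processed.

```python
import itertools
import queue
import threading

_END = object()  # sentinel: tells the downstream step no more rows will come


def run_with_safe_stop(source_rows, stop_after):
    """Toy model: request a safe stop once `stop_after` rows have been read."""
    rows = queue.Queue()
    stop_requested = threading.Event()
    emitted, processed = [], []

    def input_step():
        for row in source_rows:
            if stop_requested.is_set():
                break  # safe stop: only the input step stops reading
            rows.put(row)
            emitted.append(row)
            if len(emitted) == stop_after:
                stop_requested.set()  # simulates the user asking for a safe stop
        rows.put(_END)

    def downstream_step():
        while True:
            row = rows.get()
            if row is _END:
                break  # finishes only after draining every in-flight row
            processed.append(row * 2)

    threads = [threading.Thread(target=input_step),
               threading.Thread(target=downstream_step)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return emitted, processed


# itertools.count() stands in for a stream that never ends on its own.
emitted, processed = run_with_safe_stop(itertools.count(), stop_after=5)
print(emitted, processed)
```

Every row that the input step read shows up in the processed list; a hard stop that killed both threads at once could have left rows sitting in the queue.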






This is especially useful in real-time scenarios (for example reading from a message bus). It’s one of those things that when we look back seems pretty dumb that it wasn’t there from the start. It actually makes a lot of sense, so we went ahead and made this the default behavior.


Streaming results (EE/CE)

When we launched streaming in Pentaho 8.0 we focused on the processing piece. We could launch the sub transformation but we could not get results back. Now we have the ability to define which step on the sub-transformation will send back the results to follow the rest of the flow.






Why is this important? Because of what comes next…


Streaming Dataservices (EE/CE)


There’s a new option to run a data service in streaming mode. This will allow the consumers (in this case CTools Dashboards) to get streaming data from this dataservice.






Once defined, we can test these options within the test dataservices page and see the results as they come.






This screen exposes the functionality as it would be called from a client. It’s important to know that the windows that we define here are not the same as the ones we defined for the micro batching service. The window properties are the following:


● Window Size – The number of rows that a window will have (row based), or the time frame that we want to capture new rows to a window (time based).
● Every – Number of rows (row based), or milliseconds (time based) that should elapse before creating a new window.
● Limit – Maximum number of milliseconds (row based) or rows (time based) which will be used to wait for a new window to be generated.
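To get a feel for how Window Size and Every interact in row based mode, here is a small sketch. It reflects my reading of the two properties above (a window holds the last Window Size rows, and a new one is produced each time Every new rows have arrived); the Limit property is left out, and `row_windows` is a hypothetical name, not the data service API.

```python
from collections import deque


def row_windows(rows, window_size, every):
    """Yield the last `window_size` rows each time `every` new rows arrive."""
    buffer = deque(maxlen=window_size)  # old rows fall off automatically
    since_last_window = 0
    for row in rows:
        buffer.append(row)
        since_last_window += 1
        if len(buffer) == window_size and since_last_window >= every:
            yield list(buffer)
            since_last_window = 0


for window in row_windows(range(1, 8), window_size=3, every=2):
    print(window)
```

With a window size of 3 and a new window every 2 rows, consecutive windows overlap by one row; setting Every equal to Window Size would give back-to-back windows with no overlap.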


CTools and Streaming Visualizations (EE/CE)

We took a holistic approach to this feature. We want to make sure we can have a real time / streaming dashboard leveraging what was set up before. And this is where the CTools come in. There’s a new datasource in CDE available to connect to streaming dataservices:






Then the configuration of the component selects the kind of query we want: time based or number-of-records based, window size, frequency and limit. This gives us good control for a lot of use cases.






This will allow us to then connect to a component the usual way. While this will probably be more relevant for components like tables and charts, ultimately all of them will work.


It is possible to achieve a level of multi-tenancy by passing a user name parameter from the PUC session (via CDE) to the transformation as a data services push-down parameter. This enables restricting the data viewed on a user-by-user basis.


One important note is that the CTools streaming visualizations do not yet operate on a ‘push’ paradigm – this is on the current roadmap. In 8.1, the visualizations poll the streaming data service on a constant interval, which has a lower refresh limit of 1 second. But then again… if you’re doing a dashboard of this type and need a refresh of 1 second, you’re definitely doing something wrong…
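Polling with a lower limit boils down to something like the loop below. This is a sketch in Python rather than actual CTools code, with hypothetical names; the only fact taken from above is the 1 second floor on the refresh interval.

```python
import time

MIN_REFRESH_SECONDS = 1.0  # lower refresh limit mentioned above


def effective_refresh_interval(requested_seconds):
    """Never poll more often than the lower limit allows."""
    return max(MIN_REFRESH_SECONDS, float(requested_seconds))


def poll(fetch, requested_seconds, rounds, sleep=time.sleep):
    """Call `fetch` a fixed number of times, waiting between calls."""
    interval = effective_refresh_interval(requested_seconds)
    results = []
    for _ in range(rounds):
        results.append(fetch())  # e.g. query the streaming data service
        sleep(interval)
    return results
```

Asking for a refresh every 0.2 seconds still results in one call per second, while asking for 5 seconds is honored as is.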


Time Series Visualizations (EE/CE)

One of the biggest use cases for streaming, from a visualization perspective, is time series. We improved CCC's support for time series line charts, so now data trends over time will be shown without needing workarounds.


This applies not only to CTools but also to Analyzer.








Data Exploration Tool Updates (EE)

We’re keeping on our path of improving our Data Exploration Tool. It’s no secret that we want to make it feature complete so that it can become the standard data analysis tool for the entire portfolio.


This time we worked on adding filters to the Stream view.




We’ll keep improving this. Next on the queue, hopefully, will be filters on the model view and date filters!


Additional Updates

As usual, there were several additional updates that did not make it to my highlights above. So for the sake of your time and not creating a 100 page blog – here are even more updates in Pentaho 8.1.


Additional updates:


● Salesforce connector API update (API version 41)
● Splunk connection updated to version 7
● Mongo version updated to 3.6.3 driver (supporting 3.4 and 3.6)
● Cassandra version updated to support version 3.1 and Datastax 5.1
● PDI repository browser performance updates, including lazy loading
● Improvements on the Text and Hadoop file outputs, including limit and control file handling
● Improved logging by removing auto-refresh from the kettle logging servlet
● Admin can empty trash folder of other users on PUC
● Clear button in PDI step search in spoon
● Override JDBC driver class and URL for a connection
● Suppressed the Pentaho ‘session expired’ pop-up on SSO scenarios, redirecting to the proper login page
● Included the possibility to schedule generation of reports with a timestamp to avoid overwriting content


In summary (and wearing my marketing hat), with Pentaho 8.1 you can:
● Deploy in hybrid and multi-cloud environments with comprehensive support for Google Cloud Platform, Microsoft Azure and AWS for both data integration and analytics
● Connect, process and visualize streaming data from MQTT, JMS, and IBM MQ message queues, and gain insights from time series visualizations
● Get better platform performance and increase user productivity with improved logging, additional lineage information, and faster repository access



Download it

Go get your Enterprise Edition or trial version from the usual places

For CE, you can find it on the community home!



Pedro


