View Full Version : Big Data features open-sourced?

07-23-2012, 08:50 PM
I've been reading that Pentaho open-sourced the Big Data features. I checked out the Kettle project, but I was unable to find the classes implementing the Big Data steps and entries. Are they somewhere else, or are they actually not yet available?
All the other steps I found under org.pentaho.di.trans.steps and org.pentaho.di.job.entries.

Please let me know if I'm looking in the wrong direction, or what the problem is.

I also downloaded Kettle 4.3 from SourceForge, but still found no Big Data code.


07-23-2012, 09:08 PM
The Big Data features are a plug-in to Kettle.

Wiki: http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home
Source code: https://github.com/pentaho/big-data-plugin

07-23-2012, 09:19 PM
Alright, but I still cannot locate the source files. They are not in src-plugins. Maybe I'm looking in the wrong places again; if you know where I can find them, please let me know.
One more question: do they extend and implement the same classes and interfaces as all the other steps/entries? If they follow the guidelines they should, but I just want to check.

07-23-2012, 09:21 PM
Oh, so sorry, I totally missed the links you posted. But are they actually included inside the Kettle project?
I mean: is there no release with them already included in the project? Or should I do it myself?

07-23-2012, 10:58 PM
The information you need is in the wiki, including links to download binary builds.


07-24-2012, 08:43 AM
You can find the code on Github: https://github.com/pentaho/big-data-plugin

07-24-2012, 02:03 PM
OK, from everything I've read I got a bit confused. On the one hand, everything says that as of Kettle 4.3 Pentaho open-sourced the Big Data functionality, but the code is actually not in the Kettle project; it is in the plugin project you linked to.
Now, if I want to use it, I would somehow need to incorporate this plugin into the Kettle 4.3 project. (Please let me know if this does not make sense; I'm a bit of a rookie at all this.) There are some instructions about loading plugins into Kettle: http://wiki.pentaho.com/display/COM/PDI+Plugin+Loading. So I was wondering: should I follow those instructions and do it myself, or is there a Kettle 4.3 release with the Big Data plugin already incorporated?
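To make sure I understand the plugin-loading route, here is what I think the manual step would look like. This is only my guess: the install folder name and the plugin zip name are assumptions, not taken from the docs.

```shell
# My guess at the manual route: unzip a built plugin package into
# Kettle's plugins folder so it is discovered at startup.
# "data-integration" and the zip name below are assumptions.
KETTLE_HOME="$PWD/data-integration"   # assumed Kettle 4.3 install location
mkdir -p "$KETTLE_HOME/plugins"
# unzip pentaho-big-data-plugin-*.zip -d "$KETTLE_HOME/plugins"
ls -d "$KETTLE_HOME/plugins"
```

Is that roughly the idea, or is there more to it?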

Also, I found this in the readme.txt of the src-plugins package in the Kettle project:

Core Kettle Plugin Documentation

The following folders are considered core Kettle plugins: plugins that are
distributed with Kettle's core distribution but are useful to have as plugins for architectural and
dependency reasons.

To add a core plugin:

- create a folder under src-plugins with the name of the plugin

- create src, test, lib, and res subfolders for the various files that will be included in your plugin

- add your plugin folder name to the plugins property in build.properties

- if you would like your plugin's jar and zip to get published to Artifactory, update
build-res/publish.properties with your plugin folder.

An ivy.xml file must be located within the plugin's root folder. When creating a new plugin,
the ivy.xml file from an existing plugin can be copied. No editing is needed.

All core plugins get built as part of the core dist. You can also build the plugins standalone by using
the "-standalone" Ant targets related to the plugins. If you'd like to build just a single plugin,
you can do that by overriding the plugins property to reference only your plugin.

To have core plugins function in eclipse, you'll need to add the plugin's dependencies to your
.classpath file and set the property -DKETTLE_PLUGIN_CLASSES to the full name of your plugin class names.

Here is the current core Kettle plugins Eclipse flag:


Would this be the way to go with this?
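Following the readme literally, I think the steps would look something like this. The plugin name "myplugin" is just a placeholder, and I'm assuming this is run from the root of the Kettle source checkout:

```shell
# Sketch of the readme steps above; "myplugin" is a placeholder name.
# Run from the root of a Kettle source checkout (assumed layout).
mkdir -p src-plugins/myplugin/src
mkdir -p src-plugins/myplugin/test
mkdir -p src-plugins/myplugin/lib
mkdir -p src-plugins/myplugin/res
# copy an ivy.xml from an existing plugin (the readme says no editing is needed):
# cp src-plugins/<existing-plugin>/ivy.xml src-plugins/myplugin/
# then add "myplugin" to the "plugins" property in build.properties, and
# build just this plugin by overriding that property:
# ant -Dplugins=myplugin
ls src-plugins/myplugin
```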

07-24-2012, 02:14 PM
Have you taken a look at the getting started page for Java Developers? http://wiki.pentaho.com/display/BAD/Getting+Started+for+Java+Developers. That will show you how to get started with the Pentaho Big Data Plugin project with Eclipse.

If you want to debug Kettle code and the Pentaho Big Data Plugin at the same time you should check out this introduction: http://wiki.pentaho.com/display/EAI/How+to+debug+a+Kettle+4+plugin.

If you want to be able to step through the code at runtime, outside of a unit test, I believe most of our developers use a working version of Kettle to deploy into (via the big data plugin build script: `ant install-plugin`). We execute Spoon with remote debugging enabled and connect up to it through our IDE to debug the code at runtime.
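As a rough sketch, that workflow looks like the following. The environment variable name and port 5005 are assumptions, not the one true setup; on Kettle 4.x you may instead need to add the flags to the OPT line in spoon.sh directly.

```shell
# Deploy the plugin into a working Kettle (from the plugin checkout):
# cd big-data-plugin && ant install-plugin
# Then start Spoon with the standard JDWP remote-debugging flags and
# attach your IDE's remote debugger to port 5005:
PENTAHO_DI_JAVA_OPTIONS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
export PENTAHO_DI_JAVA_OPTIONS
echo "$PENTAHO_DI_JAVA_OPTIONS"
# sh spoon.sh
```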

07-24-2012, 02:23 PM
If you want to debug Kettle code and the Pentaho Big Data Plugin at the same time you should check out this introduction: http://wiki.pentaho.com/display/EAI/How+to+debug+a+Kettle+4+plugin.
This seems the thing I was searching for. Very useful link. Thanks very much! :)

07-26-2012, 01:18 PM
I've been following the guidelines for debugging a plugin inside Eclipse.

I successfully imported all the plugins with the environment-variable option, but now when I start, for example, a Pentaho MapReduce job, I get a NullPointerException. I debugged it a bit, and it seems that the plugin directory for this job is set to null.
Have I missed some step in the configuration?
Where is this directory path actually set?
Is there some XML configuration file where this needs to be set?

pluginFolder in the PluginInterface instantiation is set to null.