View Full Version : Integrating Cassandra with Pentaho MapReduce in 4.3

02-24-2012, 07:16 PM
(adding some clarification ...)

Hi all

When using the Pentaho MapReduce functionality in the 4.3 pre-release with Cassandra 1.0.x, is there an easy way to extract the column values in the MapReduce input when using the Cassandra ColumnFamily input format?

When using a Java-based MapReduce job, one possible approach is to use the Cassandra ColumnFamilyInputFormat, which returns a key and a map representation of all columns returned for each row.

However, to apply this to a Pentaho-based MapReduce job, you would need to separate this out into individual fields. One thought was to use some Java code to extract these fields, but the catch is that Janino (the underlying engine for the User Defined Java Class step) does not support generics, so this would need to be written as a separate jar using Eclipse or some other IDE.
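To illustrate the Janino constraint mentioned above, here is a minimal sketch of the raw-type style such extraction code would have to use: no type parameters on the collections and explicit casts when reading values. The class and column names are hypothetical, and plain strings stand in for Cassandra's actual column objects.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper showing the raw-type style Janino requires:
// no generics on SortedMap, explicit casts instead of type parameters.
public class ColumnExtractor {
    // Pull a single column value out of the per-row column map.
    public static String extract(SortedMap columns, String columnName) {
        Object value = columns.get(columnName); // raw get, no type parameter
        return value == null ? null : (String) value; // explicit cast
    }

    public static void main(String[] args) {
        SortedMap row = new TreeMap(); // raw type, as Janino would need
        row.put("first_name", "Ada");
        row.put("last_name", "Lovelace");
        System.out.println(extract(row, "first_name")); // prints Ada
    }
}
```

Code written with generics (e.g. `SortedMap<String, String>`) would compile in a normal IDE build but not in the User Defined Java Class step, which is why a separate jar is the usual workaround.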

The other aspect of using the ColumnFamilyInputFormat is that the slice predicate (used to indicate which rows and/or column families to select) needs to be written out as a serialized string representation into the job configuration. Again, for a Pentaho MapReduce job, this could be worked around in Java by writing a custom derivative of the ColumnFamilyInputFormat.
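The general pattern behind that "serialized string in the configuration" step can be sketched in plain Java. In this sketch, `java.util.Properties` stands in for the Hadoop job Configuration and a simple `Serializable` class stands in for the Thrift SlicePredicate; both substitutions are assumptions for illustration, not the actual Cassandra API.

```java
import java.io.*;
import java.util.Base64;
import java.util.Properties;

// Sketch of the "serialize the predicate into the configuration" pattern.
// SimplePredicate is a placeholder for Cassandra's Thrift SlicePredicate;
// Properties is a placeholder for the Hadoop job Configuration.
public class PredicateConfig {
    public static class SimplePredicate implements Serializable {
        public final String startColumn, endColumn;
        public SimplePredicate(String start, String end) {
            startColumn = start;
            endColumn = end;
        }
    }

    // Serialize the object and store it base64-encoded under a string key.
    public static void store(Properties conf, String key, Serializable value)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        conf.setProperty(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // The input format would decode and deserialize it on the cluster side.
    public static Object load(Properties conf, String key)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(conf.getProperty(key));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        store(conf, "cassandra.input.predicate", new SimplePredicate("a", "z"));
        SimplePredicate p = (SimplePredicate) load(conf, "cassandra.input.predicate");
        System.out.println(p.startColumn + ".." + p.endColumn); // prints a..z
    }
}
```

A custom derivative of the input format would do the same round-trip against the real Configuration object, which is why it has to be built and deployed as a jar rather than written inline in the job.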

Are there any direct support features for this in the PDI 4.3 pre-release (for example, custom properties to set on the MapReduce job to simplify this)? The how-tos outline the process of integrating a custom input format, but I am wondering if there is some other approach that I may have overlooked.


02-28-2012, 05:50 PM
Hi all

I am also trying to use the Cassandra output step from within a Pentaho MapReduce job. It fails silently: the MapReduce job reports no issues, but no output reaches Cassandra. I see that for HBase output the Hadoop classpath needs to point to the HBase libraries; is there a similar requirement for using the Cassandra output step?

I've tried various classpath settings, but with no effect.

Any advice would be appreciated


02-29-2012, 01:36 AM
You’re right on the money with those workarounds and for the time being they are required to get the functionality you’re looking for. The required changes have not been implemented yet and likely will not be included in the 4.3.0 release.

We’re constantly looking to improve our Cassandra support (and big data support in general), and if you're interested in helping out you can get started here: http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home

02-29-2012, 09:56 PM
Hi Ronans,

Can you elaborate a bit on your setup as far as using the Cassandra output step in an MR job? I ran a quick test earlier today with Cassandra output in the word count example's reducer transformation, running on Apache Hadoop 0.20.2. It did halt due to a null key value (there was plenty of feedback in Spoon's log, though), but once that was sorted out it worked fine.


02-29-2012, 10:07 PM
Sending you details of our setup via private message

03-03-2012, 12:01 AM
Adding the Cassandra, Thrift, and other dependent libraries to the Hadoop classpath solved the problem. It is also necessary to add the JDBC driver libraries from the Pentaho PDI libext\jdbc directory to the Hadoop classpath to get logging to MySQL to work correctly; these are not set up by default as part of the /opt/pentaho/pentaho-mapreduce/lib set of jars.
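For anyone hitting the same issue, a sketch of what that classpath change might look like in hadoop-env.sh. The directory locations and the CASSANDRA_HOME / PDI_HOME variables are illustrative assumptions; adjust them to wherever the jars actually live in your installation.

```shell
# Illustrative additions to hadoop-env.sh -- actual jar locations vary by install.
# CASSANDRA_HOME and PDI_HOME are placeholder names, not standard variables.
CASSANDRA_HOME=${CASSANDRA_HOME:-/opt/cassandra}
PDI_HOME=${PDI_HOME:-/opt/pentaho/data-integration}

# Cassandra + Thrift jars so the Cassandra output step can reach the cluster,
# plus the PDI JDBC drivers so job logging to MySQL works.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$CASSANDRA_HOME/lib/*:$PDI_HOME/libext/JDBC/*"
```

After editing hadoop-env.sh, the TaskTracker/JobTracker daemons need to be restarted for the new classpath to take effect.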