View Full Version : Cassandra Output Testing Feedback - SNAPSHOT pentaho-big-data-plugin build 150

09-11-2012, 01:31 AM
The latest plug-in has some interesting enhancements to the Cassandra steps, as recommended in http://forums.pentaho.com/showthread.php?92840-Cassandra-input-and-output-enhancements-suggestions-etc
It adds options to use Thrift or CQL while processing the data.

I am currently testing the SNAPSHOT build, and from my test results, throughput has actually decreased significantly with this build. I see the speed (r/s) gradually decreasing to a very small number while using the Cassandra output step, and I see the same pattern after selecting the 'Thrift' checkbox as well. I am not sure if it is associated with GC issues, but the new build is actually processing fewer rows.

09-12-2012, 06:48 AM
Hi Satjo,

The enhancements to Cassandra output include error handling and robustness. What settings are you using for batch size, socket timeouts and batch timeouts? It could be that, as more data is inserted, Cassandra is not able to complete each batch insert within the timeout limits. When a timeout occurs, the step splits the batch into smaller sub-batches and inserts each of these separately. This is tried recursively down to single-row sub-batches; if those still fail, the rows are written to the error stream. Naturally, this process takes more time than if the originally sized batch had been inserted successfully. If you set the logging level to "Detailed" you should be able to see whether this is occurring, as the step logs the size of each batch it attempts.
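[Editor's note] The split-on-timeout policy described above can be sketched roughly as follows. This is an illustrative sketch only, not the plugin's actual code; all names here (`BatchWriter`, `insertWithSplit`) are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the retry policy described above: try the whole
// batch, split into sub-batches on a timeout, fall back to single rows,
// and route rows that still fail to the error stream.
public class BatchRetrySketch {

    /** Stand-in for the transport call; throws to simulate a timeout. */
    interface BatchWriter {
        void write(List<String> rows) throws Exception;
    }

    static void insertWithSplit(BatchWriter writer, List<String> rows,
                                int subBatchSize, List<String> errorStream) {
        try {
            writer.write(rows);                        // try the batch as-is
        } catch (Exception timeout) {
            if (rows.size() == 1) {
                errorStream.addAll(rows);              // single row still fails -> error stream
                return;
            }
            int step = Math.max(1, Math.min(subBatchSize, rows.size() - 1));
            for (int i = 0; i < rows.size(); i += step) {
                List<String> sub = rows.subList(i, Math.min(i + step, rows.size()));
                insertWithSplit(writer, sub, 1, errorStream); // next level: single rows
            }
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) rows.add("row-" + i);
        List<String> ok = new ArrayList<>();
        List<String> err = new ArrayList<>();
        // Simulated server that times out on any batch bigger than 5 rows.
        BatchWriter writer = batch -> {
            if (batch.size() > 5) throw new Exception("timeout");
            ok.addAll(batch);
        };
        insertWithSplit(writer, rows, 10, err);
        System.out.println("written=" + ok.size() + " errors=" + err.size());
    }
}
```

With the simulated 5-row limit, the 100-row batch fails, each 10-row sub-batch fails, and every row finally goes through individually, so nothing reaches the error stream, which matches the behaviour described in the post.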


09-12-2012, 05:35 PM

Thanks for your comments. I think the slow performance is due to the batch being split into smaller sub-batches, as you mentioned, although in some cases I do not see timeout errors even when the batch is split into sub-batches.
First, I noticed that 'commitThriftBatch' was not really being executed when I selected the 'Use Thrift I/O' checkbox. I modified the code to hard-code this option so that I could see the performance numbers for Thrift. I then saw that it was using 'm_batchSplitFactor' instead of the batch size I defined on the Cassandra output screen, i.e., I did not see any socket timeout error before m_batchSplitFactor was used.

But at least I see that performance is much better after I increased 'm_batchSplitFactor' to a larger number. I am still trying to understand why it uses this 'm_batchSplitFactor' number for the Thrift output option. I will post again after further testing.

09-13-2012, 12:26 AM

Odd. I can't see any problem with these settings (at least not from a quick look at the code). m_batchSplitFactor is used by both the CQL and Thrift modes. The "Commit batch size" is the desired size of the batches you'd like to see go through for each commit. The "Sub batch size" (m_batchSplitFactor) is how many rows should be in each sub-batch (created from the original batch) if a timeout occurs. So if the original batch size is 100 and the sub batch size is 10, then 10 sub-batches of 10 rows each will be tried when a timeout occurs. If a sub-batch fails, the step drops back to trying individual rows. This process is used for both CQL and Thrift batch inserts.
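[Editor's note] The sub-batch arithmetic in the example above (batch of 100, sub batch size 10, giving 10 sub-batches) is just ceiling division, sketched here with a hypothetical helper (not plugin code):

```java
// Illustrative arithmetic for the split described above; not plugin code.
public class SubBatchMath {
    /** Sub-batches created when batchSize rows are split into chunks of subBatchSize. */
    static int subBatchCount(int batchSize, int subBatchSize) {
        return (batchSize + subBatchSize - 1) / subBatchSize; // ceiling division
    }

    public static void main(String[] args) {
        // The example from the post: batch of 100, sub batch size 10 -> 10 sub-batches.
        System.out.println(subBatchCount(100, 10)); // prints 10
    }
}
```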


09-16-2012, 11:38 PM
Thanks, Mark! After further testing, I found that my initial observation about 'm_batchSplitFactor' was wrong.
With further logging, I could see that m_batchSplitFactor is only picked up after a timeout exception.

I was trying to get higher throughput for my application, so I tried various combinations of batch size and sub batch size. The application reads data from an Oracle table (about 20 Strings and 50 BigNumbers per row) and inserts it into Cassandra using the output step. I used both Thrift and CQL. The results show that the Thrift option did not give any better numbers than CQL;
in both cases I am getting about 800 to 1000 rows per second.

I added some print statements to 'commitThriftBatch' and 'commitCQLBatch' to see the time taken to write the data for my application.
I write to a remote Cassandra instance, and here are some sample numbers:

Writing with the Thrift option:
Thrift batch commit time 7051 ms for batch size 4000
Thrift batch commit time 6730 ms for batch size 4000
Thrift batch commit time 8391 ms for batch size 4000
Thrift batch commit time 7062 ms for batch size 4000
Thrift batch commit time 6680 ms for batch size 4000
Thrift batch commit time 9611 ms for batch size 4000

Writing the same data with the CQL option:
CQL batch commit time 5731 ms for batch size 4000
CQL batch commit time 6209 ms for batch size 4000
CQL batch commit time 5657 ms for batch size 4000
CQL batch commit time 4975 ms for batch size 4000
CQL batch commit time 6516 ms for batch size 4000
CQL batch commit time 6289 ms for batch size 4000

I am wondering what options I have to reduce the batch commit time. I am using Cassandra 1.0.5.
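[Editor's note] The commit timings above can be converted into an implied per-commit throughput with simple arithmetic. This is only the rate of the commit call itself, ignoring read and data-conversion overhead; the helper below is illustrative, not plugin code:

```java
// Rough arithmetic sketch: rows/sec implied by a single batch commit.
public class CommitThroughput {
    static double rowsPerSecond(int batchSize, long commitMillis) {
        return batchSize * 1000.0 / commitMillis;
    }

    public static void main(String[] args) {
        // Using two of the timings posted above (4000-row batches).
        System.out.printf("Thrift: %.0f rows/s%n", rowsPerSecond(4000, 7051)); // ~567
        System.out.printf("CQL:    %.0f rows/s%n", rowsPerSecond(4000, 5731)); // ~698
    }
}
```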

09-19-2012, 12:29 AM
Hi Satjo,

I'm not too sure what advice to give now. How is your Cassandra cluster configured? Is the network fast? There is a page on tuning at:


It is interesting that CQL mode looks to be slightly faster in your case. I guess that could be down to compression on the CQL batch inserts. I can't seem to find out whether there is any equivalent ability to compress using the Thrift layer (or perhaps it's done automatically?).

There might be a small amount of scope for improving efficiency in the Cassandra steps themselves. Data conversion and (de)serialization is unavoidable though, and we use Cassandra's own serialization classes for this.