ronans
03-26-2012, 06:08 PM
Hi all. Based on some feedback, it was suggested that I post an overview of some proposed Cassandra enhancements here for people to comment on and review.
My scenario is that I am using Cassandra to read and write medium to large numbers of documents and document fragments as part of some data integration processes involving multiple map/reduce jobs. The number of rows written or read in each run is 6 million or more, and the jobs may read from or write to multiple column families per run (some for auditing purposes, some for base data I/O).
From examining the source code of the Cassandra input and output plugins, I note a couple of potential areas for improvement, both in general capabilities and in performance.
In general, the Cassandra output writer writes via CQL inserts and the Cassandra input reader reads via CQL select statements. For large reads and writes, it is also necessary to alter the timeout for the socket connections.
I have experimented with using direct Thrift reads and writes in conjunction with the User Defined Java Class step and found that the Thrift-based APIs provide far better performance and availability (at least with the pre-release version I am using). As the User Defined Java Class does not support generics, I have been using a wrapper class that I created to abstract out the basic services for batch mutate without the use of generics in the exposed class.
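To make the wrapper idea concrete, here is a rough sketch of what a generics-free batching facade could look like. It is not my actual wrapper: the names MutationSink and BatchMutateBuffer are made up for illustration, and the sink is only a stand-in for whatever would really issue the Thrift batch mutate call. The point is just that the exposed methods take plain strings and arrays, so code that cannot use generics can still call them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the component that would issue the real batch mutate call. */
interface MutationSink {
    // Each entry is {rowKey, columnName, value}.
    void writeBatch(String[][] mutations);
}

/** Buffers column writes and hands them to the sink in fixed-size batches. */
class BatchMutateBuffer {
    private final MutationSink sink;
    private final int batchSize;
    // Generics are used only internally; nothing generic appears in the exposed methods.
    private final List<String[]> pending = new ArrayList<String[]>();
    private int batchesFlushed = 0;

    BatchMutateBuffer(MutationSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Queues one column write and flushes once the batch is full. */
    void add(String rowKey, String columnName, String value) {
        pending.add(new String[] {rowKey, columnName, value});
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is queued; call once more at end of input for the partial batch. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.writeBatch(pending.toArray(new String[0][]));
        pending.clear();
        batchesFlushed++;
    }

    int getBatchesFlushed() {
        return batchesFlushed;
    }
}
```

In a transformation, the step would call add() once per incoming row and flush() when the input is exhausted. The batch size would need tuning against row size and timeouts.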
My proposal is that the Cassandra input and output steps could be modified to use the Thrift APIs directly, without use of the CQL engine, when a certain option (e.g. a checkbox such as "Use Thrift I/O") is selected in the UI. For the reader step, there would also need to be a way to specify which columns to retrieve. This could still be via CQL (i.e. use CQL for getting metadata, use Thrift for I/O) or via a simple column list.
In addition, I would propose adding a timeout option that would allow overriding the connection timeouts without having to reconfigure the cassandra.yaml files. I have implementations of this for UTF-8 string fields using the User Defined Java Class and some custom utility libraries. Using the Thrift APIs or other low-level APIs would also make it possible to support range slice queries.
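As an illustration of the kind of timeout option I mean, here is a minimal sketch of opening a connection with client-side connect and read timeouts. It uses plain java.net.Socket rather than the Thrift transport the steps would actually go through, and the class name TimeoutSocketFactory and its parameters are invented for the example. In the step itself the two values would come from fields in the dialog.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: opens a TCP connection with explicit client-side timeouts. */
class TimeoutSocketFactory {
    /**
     * connectTimeoutMs bounds how long the connect may take;
     * readTimeoutMs bounds how long a single blocking read may wait for data.
     */
    static Socket open(String host, int port, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            socket.setSoTimeout(readTimeoutMs);
            return socket;
        } catch (IOException e) {
            // Do not leak the socket if connect or configuration fails.
            socket.close();
            throw e;
        }
    }
}
```

With a read timeout set, a read that waits longer than readTimeoutMs fails with a java.net.SocketTimeoutException instead of blocking indefinitely, so large reads and writes can be given a longer limit per step without touching the server configuration.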
Here are some other alternatives that would also address some of the issues:
1) For the Cassandra reader, don't execute the CQL only in the first row fetch.
2) A timeout option could be added independently of support for Thrift.
3) Support column input and output formats for Hadoop jobs without requiring custom code.
Comments, suggestions, anyone?