I am trying to use Kettle to move data from HDFS into a relational database (PostgreSQL), and the performance I am getting is not great. I start 4 copies of the load step from HDFS into PostgreSQL, and it loaded 4 files totalling 2.9GB of compressed data (37GB uncompressed, which corresponds to 37GB of data in PostgreSQL) in about 6.5 hours.

I am running the ETL job on Server A, which gets the gzipped files from the Hadoop cluster and loads the data into PostgreSQL on Server B. There is a 1 Gbit connection between the servers.

The files on HDFS are gzipped, and I am wondering how they move around. Are they transferred in gzipped form and then unzipped on Server B when the data is ready to be loaded, or does Pentaho transfer the data unzipped?
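For context on the gzip question: a gzip stream can be decompressed incrementally on the reader's side, so a tool reading from HDFS would typically pull the compressed bytes over the wire and inflate them locally on the ETL host, with the rows then travelling uncompressed to the database over JDBC. A minimal Python sketch of that streaming pattern (using an in-memory buffer as a stand-in for the HDFS input stream; the function name and chunk size are just illustrative):

```python
import gzip
import io


def stream_decompress(compressed_stream, chunk_size=64 * 1024):
    """Yield uncompressed chunks from a gzip stream without
    materializing the whole file, the way a streaming reader would."""
    with gzip.GzipFile(fileobj=compressed_stream) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Stand-in for a gzipped part file pulled from HDFS.
raw = b"col1\tcol2\n" * 100_000
compressed = io.BytesIO(gzip.compress(raw))

restored = b"".join(stream_decompress(compressed))
assert restored == raw
print(len(compressed.getvalue()), "compressed bytes ->",
      len(restored), "uncompressed bytes")
```

The practical implication is that only the compressed bytes need to cross the network from the Hadoop cluster; the decompression cost is paid on the machine running the read step.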

Generally, I would like to know: what is the best way to use Kettle to get good performance here?

Loading data from NFS into PostgreSQL might be faster than this, and I am wondering whether that is a better approach than loading from HDFS directly.
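Before switching the source, it may be worth isolating where the time goes: the HDFS read + decompress path, or the insert into PostgreSQL. A rough sketch of the idea, demonstrated here with a local file as a stand-in (for the real check, replace the local file with `hadoop fs -cat <path>`; the table and host names in the comment are placeholders):

```shell
# Build a stand-in for a gzipped part file.
seq 1 100000 > /tmp/part.txt
gzip -c /tmp/part.txt > /tmp/part.txt.gz

# Read + decompress only: isolates source throughput from the DB load.
# With HDFS this would be: hadoop fs -cat /path/part.gz | gunzip | wc -l
gunzip -c /tmp/part.txt.gz | wc -l

# The full path would pipe the same stream into PostgreSQL with a bulk
# COPY rather than row-by-row inserts, e.g. (placeholder names):
#   gunzip -c part.gz | psql -h serverB -d mydb -c "\copy mytable FROM STDIN"
```

Timing the two halves separately should show whether the bottleneck is pulling and unzipping the files or the insert step into PostgreSQL; if it is the latter, a bulk `COPY` based load tends to be much faster than individual inserts.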

Any help will be appreciated!