
Thread: Postgresql Bulk Loader issues

  1. #1
    Join Date
    Jun 2007
    Posts
    232

    Question Postgresql Bulk Loader issues

    Hi Everyone,

    Playing with PDI 3.2 and the PostgreSQL bulk loader. I came across a problem that might be my fault and might be a bug - not sure which. It seems to be a simple behavioural problem with the bulk loader, where it does not pass the necessary information to the psql command for it to work properly.

    The scenario is that I am trying to move a table from one machine to another as quickly as possible. The source database is PostgreSQL 8.3 on a Windows box, and the target is PostgreSQL 8.3 on an Ubuntu 8.10 server. The target table was created via pgAdmin III and the SQL scripts it generates for DDL, and appears to be an exact copy of the source table's structure.

    We have about 20 million rows of data to move across, so the bulk loader makes sense. psql works from the command line on the Windows box and can dump the data to the Ubuntu box, so we know that psql itself works. The security settings on the Ubuntu box are set to trust the (source) Windows box, and indeed you don't need a password to log in with psql from the source Windows box to the target Ubuntu box. This fits with the instructions in the Wiki.
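    For reference, the trust entry in pg_hba.conf on the Ubuntu box looks something like the line below (the address here is just an example - yours will differ):

    Code:
    # pg_hba.conf on the target (Ubuntu) box - example entry only
    # TYPE  DATABASE  USER  CIDR-ADDRESS      METHOD
    host    all       all   192.168.1.10/32   trust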

    In theory this should now work, but it doesn't. Here is what happens:
    1/ The PostgreSQL Bulk Loader step does not appear to provide a username to the psql command, so you end up having to place it in the path field of the step (e.g. c:\psql.exe -U postgres). This is easy to solve but annoying. Maybe I am just doing something wrong......

    2/ The step asks for delimiter and enclosure parameters. These two controls don't appear to do anything at all, and in fact if you save the transformation, close it and re-load it, the information in those two controls will be empty.

    Even after manually filling this information in, the parameters are not passed through to the psql COPY command, which will raise a complaint about needing a single ASCII character as the delimiter. What is actually passed through is 'null' for the delimiter and 'null' for the quote character (see the COPY example at the end of this list for what I would expect).

    I thought that perhaps I needed to add these to the source data coming from my source table, so I altered the SQL SELECT statement to add the required quoting and delimiter characters. The resulting stream data looks, to all intents and purposes, like a CSV file with everything quoted and delimited. The PostgreSQL Bulk Loader step then suddenly seems to find the parameters in the delimiter and enclosure controls, and you end up with double-quoted fields and double-delimited fields - BUT the COPY command it issues to psql still does not specify them, and you are left with 'null' as before. Needless to say, it didn't work.
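    For comparison, here is roughly what I would expect the step to hand over in the COPY command (8.3 syntax; the table name and ';' delimiter are just examples):

    Code:
    -- COPY with the delimiter and quote character spelled out explicitly
    COPY mytable FROM STDIN WITH DELIMITER AS ';' CSV QUOTE AS '"';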

    Now maybe I am doing something wrong, and maybe I am not. Could someone throw me a bone on this one? I know the step is experimental, and that integrating 3rd-party tools like this with Postgres is difficult, but maybe it's just me mucking things up. I have attached a copy of the transformation for anyone who wants to eyeball it - it's pretty simple.

    Hoping someone has some luck with this one

    Cheers

    The Frog
    Attached Files
    Last edited by TheFrog; 10-14-2009 at 09:29 AM. Reason: left off the transformation.....
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  2. #2
    Join Date
    Mar 2008
    Posts
    101

    Default

    TheFrog,

    I've also been experimenting with this experimental step and have come across the following issues, in case you weren't aware of them:

    http://jira.pentaho.com/browse/PDI-1901
    http://jira.pentaho.com/browse/PDI-2614
    http://jira.pentaho.com/browse/PDI-2726

    In regards to your #1, that is already known. Not sure about your #2, or whether it is tied to one of the above JIRAs.

    Peter

  3. #3
    Join Date
    Jun 2007
    Posts
    232

    Question Thanks for the feedback

    Hi Peter,

    Thanks for the feedback. It seems that the use of psql as the means of bulk loading is actually pretty poor. I am not sure why it is being done this way when the COPY command is actually implemented directly in the PostgreSQL Type 4 driver: http://jdbc.postgresql.org/documenta...kage-tree.html

    It is possible to use a JDBC connection and write data directly to STDIN (or read it from STDOUT). This would eliminate all of the psql-related issues if it can be implemented. I have not used this in code before, but it seems like a pretty simple object to work with.
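    From the driver's documentation, using it would presumably look something like the sketch below. This is untested - the connection details, table and file names are all made up for illustration:

    Code:
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyManagerSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details - adjust host, database and user
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://target-host:5432/mydb", "postgres", "");
            try {
                // The Type 4 driver exposes the COPY protocol via CopyManager
                CopyManager copier = ((PGConnection) conn).getCopyAPI();

                // Stream a CSV file straight into COPY ... FROM STDIN,
                // no psql binary involved
                FileReader reader = new FileReader("mytable.csv");
                try {
                    long rows = copier.copyIn(
                            "COPY mytable FROM STDIN WITH DELIMITER AS ';' CSV QUOTE AS '\"'",
                            reader);
                    System.out.println(rows + " rows loaded");
                } finally {
                    reader.close();
                }
            } finally {
                conn.close();
            }
        }
    }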

    Has anyone 'out there' actually tried working with the CopyManager in the PostgreSQL Type 4 driver?

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  4. #4
    Join Date
    Nov 1999
    Posts
    9,688

    Default

    As the poor b*st*rd that actually tried to get that code to run, I'm not too impressed with that "feature".
    You know, between this kind of work and the horrible, horrible transactional model, I became a radical PostgreSQL hater. Strangely, I don't have this with any other database. I guess that's because PG is open source; I think the PG team really should do a better job.
    Matt Casters, Chief Data Integration
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters@pentaho.org

    Author of the book Pentaho Kettle Solutions by Wiley. Also available as e-Book and on the Kindle reading applications (iPhone, iPad, Android, Kindle devices, ...)

    Join us on IRC server Freenode.net, channel ##pentaho

  5. #5
    Join Date
    Sep 2008
    Posts
    166

    Default

    Hi,

    I also have the pleasure of working with PDI and Postgres, and have already asked Matt a couple of times, "Why doesn't it work as it is supposed to?"

    My experience from my project with Postgres & PDI: If you can use any other DBMS than Postgres for your project then go for it.

    Benjamin

  6. #6
    Join Date
    Jun 2007
    Posts
    232

    Talking So then, PostgreSQL is the issue?

    Thanks once again for the feedback. Matt, if I understand you correctly, you have already tried to use the driver's object model to implement the bulk load process? If I may ask two things:
    1/ Can you tell me (if you remember) where the problem arises between the driver and PDI?
    2/ What would you suggest as an alternative to PostgreSQL that supports multiple schemas and can integrate the R statistical language (or similar)?

    I am only using PostgreSQL as a 'staging' area for development, with production being done on Oracle 8i. I am hoping to implement a column-based DB for the final version of this design process, but I am not there yet.

    I need to implement the statistical capabilities of R because of time series modelling, and this can be done directly in Postgres (amongst other things). I would be interested in helping to solve this issue if I can, or in finding a work-around for the time series modelling and schemas and ditching Postgres completely.

    I must admit that I have become a fan of Postgres after using it for several years now, but I am not totally dependent on it. As I said, production environments for me will be on Oracle 8i.

    I appreciate the feedback and help

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  7. #7
    Join Date
    Nov 1999
    Posts
    9,688

    Default

    There are a few problems with the bulk loader.
    First of all, the authentication is a major problem. PG is, to my knowledge, the only RDBMS vendor that forces you to authenticate in this backward way. Sheesh folks, what's wrong with a username and password? Sheesh!

    Second, the COPY command over JDBC is documented, but if you actually try to use it, it doesn't work. Jumping on the forums, #postgresql IRC, etc. revealed the vaporware nature of this "feature".
    Matt Casters, Chief Data Integration
    Pentaho, Open Source Business Intelligence
    http://www.pentaho.org -- mcasters@pentaho.org

    Author of the book Pentaho Kettle Solutions by Wiley. Also available as e-Book and on the Kindle reading applications (iPhone, iPad, Android, Kindle devices, ...)

    Join us on IRC server Freenode.net, channel ##pentaho

  8. #8
    Join Date
    Jun 2007
    Posts
    232

    Talking I Understand

    Hi Matt,

    So basically the JDBC Type 4 driver implementation of the COPY command is simply a dud - if I understand correctly - thus forcing the use of psql, which is awkward and probably not really meant for this purpose.

    Unfortunately the position you have had to take seems the only viable one at the moment. Such a shame given the integrated statistics with PL/R.

    Thanks once again for the feedback

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  9. #9
    Join Date
    Feb 2011
    Posts
    12

    Default

    I just tried the bulk loader in 4.1.0 stable, and even though I set the delimiter and the enclosure, they don't get used - instead I get an error saying that the delimiter has to be a single character.

    It then goes on to start bulk loading anyway (nothing goes into the tables)...

    So I think it should throw an exception instead of continuing.

    -joe
