Hitachi Vantara Pentaho Community Forums

Thread: bulk loaders, teradata, "staged Text Files"

  1. #1
    Join Date
    Dec 2008
    Posts
    11

    Default bulk loaders, teradata, "staged Text Files"

    Hi,

    I'm putting together a plan and trying to understand what it would take to use Pentaho Data Integration in place of an Informatica system.

    The good news is my environment is very simple in terms of the transformations. I primarily migrate data, without converting/transforming all that much.

    My only concern is the lack of a bulk loader for Teradata; would it be best to export to a flat file and then kick off a post-step that runs an external load script against the system? This is really what Informatica does under the covers (especially when you "stage" the data in text files and don't use named pipes).

    Is anyone bulk loading (through FastLoad, MultiLoad, or TPump) into a Teradata system?

    I'm really only using Informatica for the "auto-generation" of these bulk load scripts...


    Also, just a note about coming from Informatica PowerCenter: the first thing I noticed is that Informatica thinks of it as a "workflow containing transformations", whereas in PDI it looks like a "transformation containing design steps". Does this make sense? This was initially confusing to me, coming from an Informatica background. All in all, I'm excited to use more of Pentaho; hopefully I can get this bulk load issue resolved.


    any suggestions appreciated, and thanks in advance.

  2. #2
    Join Date
    Dec 2008
    Posts
    11

    Default shell script - something other than cmd.exe?

    Hi,

    I'm trying to use a shell script in my job to run a Teradata FastLoad; the syntax is something like this:

    fastload.exe << EOF

    .logon

    ...commands for fast load

    .logoff

    EOF


    Is there any way to kick off the "Shell Script" step in a job with something other than cmd.exe (on Windows)?
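    (For reference, a sketch of those commands as a self-contained FastLoad control file; the table, columns, and paths below are all placeholders:)

    Code:
    .LOGON tdhost/tduser,tdpassword;
    DATABASE jrochette;
    SET RECORD VARTEXT "|";
    DEFINE id (VARCHAR(10)),
           name (VARCHAR(50))
    FILE = C:\td\lmpkp.dat;
    BEGIN LOADING jrochette.lmpkp ERRORFILES jrochette.lmpkp_e1, jrochette.lmpkp_e2;
    INSERT INTO jrochette.lmpkp (id, name) VALUES (:id, :name);
    END LOADING;
    .LOGOFF;

    Since fastload reads its commands from standard input, a file like this can be fed to it with plain input redirection instead of a heredoc.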


    Any tips appreciated.

  3. #3
    Join Date
    Dec 2008
    Posts
    11

    Default another try

    I've realized that the problem with <<EOF lies in cmd.exe; heredocs are shell functionality, not something Teradata's fastload.exe provides.

    Next I'm trying another way: I'd like to issue a "command" to cmd.exe rather than kick off a shell script.

    I saw this functionality in the "Output File" transformation step, but I don't see it at the job level.

    Do I have to create an additional transformation and include it in the parent job? I'm looking for the cleanest way to do this, as I have hundreds of Teradata tables (each probably its own job) to create.
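    (In the meantime, the raw command the job needs to run is just input redirection, which cmd.exe does support; both paths here are placeholders:)

    Code:
    cmd.exe /c "fastload < C:\td\load_lmpkp.fld > C:\td\load_lmpkp.log"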

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The "Text File Output" step can pipe data to a shell script that may contain for example the fastload command.

    In the next release of PDI we'll most likely have a Teradata bulk loader step and job entry as well.

  5. #5
    pstoellberger Guest

    Default

    yeah i think so too

  6. #6
    Join Date
    Dec 2008
    Posts
    11

    Default

    Thanks for the responses,

    I'm not much of a Java guy (yet), but if there is any way I can help out I'd love to give it a shot.

    I'll see if I can pull down the source tonight and make any sense of it, as well as read up on the TD JDBC driver (I believe v12+ has an API into the loaders).

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Look at the code for the bulk loader steps and job entries. Usually all you need to do is replace the loader utility and change the parameters.
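    (A rough sketch in Java of what such a step or job entry boils down to: start the loader utility, feed it a control script on stdin, and check the exit code. The commands and host below are placeholders, not the actual PDI code:)

    Code:
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;

    public class FastloadWrapper {
        public static void main(String[] args) throws Exception {
            // Build the command line for the loader utility and start it.
            ProcessBuilder pb = new ProcessBuilder("fastload");
            pb.redirectErrorStream(true); // fold stderr into stdout
            Process p = pb.start();

            // Feed the control script on stdin, like the heredoc would.
            OutputStream stdin = p.getOutputStream();
            stdin.write((".LOGON tdhost/tduser,tdpassword;\n"
                    + "/* DEFINE / BEGIN LOADING / INSERT go here */\n"
                    + ".LOGOFF;\n").getBytes());
            stdin.close();

            // Echo the utility's log and wait for its exit code.
            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            for (String line; (line = out.readLine()) != null; ) {
                System.out.println(line);
            }
            int rc = p.waitFor(); // non-zero means the load failed
            System.out.println("fastload exited with " + rc);
        }
    }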

  8. #8
    Join Date
    Dec 2008
    Posts
    11

    Default bulk/fastloading into teradata

    Ok, so I talked with pstoellberger today on IRC and he gave me some tips that let me make some progress for loading into Teradata:

    I have tried the following:

    1.) Added a new Teradata Database connection for my test called "TeradataFastload".

    2.) On the Database Connection->Options screen, I added two parameters: "CHARSET" with the value "UTF8" and "TYPE" with the value "FASTLOAD". I also added a parameter for the default Teradata port: "DBS_PORT" with the value "1025". (See the URL sketch just after this list.)

    3.) Next I created a test transformation to read from a system and write to this teradata fastload test connection.
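    (For reference, those Options entries map to the Teradata JDBC driver's URL parameters; if PDI appends them to the connection URL, the resulting raw JDBC URL would look like this, with a placeholder host:)

    Code:
    jdbc:teradata://tdhost/CHARSET=UTF8,TYPE=FASTLOAD,DBS_PORT=1025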

    Notes about Teradata FastLoad:
    a.) You cannot load a table that already has data in it; you must delete all the data from the table first (see the one-line SQL sketch after this list).
    b.) You cannot load a table with secondary indexes, unique or non-unique (no NUSIs or USIs).
    c.) FastLoad does _not_ load 100% identical rows; it loads exactly a set of distinct rows. (This has nothing to do with primary key/index constraints...)
    d.) You can't access the table in any way, shape, or form during the load; it's entirely locked.
    e.) The good part is that it is pretty speedy with big volumes; I've gotten 300k rows a second with some tables, depending on my primary index/distribution key and other things.
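    (As mentioned in note (a), the target has to be emptied first; in Teradata SQL that's a one-liner, using the table from this test:)

    Code:
    DELETE FROM jrochette.lmpkp ALL;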

    I have had some interesting locking problems with my test transformation so far; I think Teradata is sort of strange about locking. It's got different modes for different types of transactions, and defaults to "Teradata mode" transactions. I am going to mess around with it a bit more and figure out anything else.

    If I check "Truncate table" in the Table Output step, it looks like it never gets anywhere; I think it tries to truncate the table as the fastload occurs. I pulled up the TD performance monitor (TD DBAs use this to analyze performance/locking) and saw two sessions that weren't letting each other get any work done. One was from my IP and was issuing the SQL SELECT 1 FROM "jrochette"."lmpkp" WHERE 1=2 (a zero-row probe, presumably just checking that the table exists). I think this might be something Pentaho issues as the last step of the "Truncate table" checkbox?


    Anyway, this didn't allow me to load any data, so I did the workaround listed in http://jira.pentaho.com/browse/PDI-1018.

    Now my test transformation looks like this:

    Code:
    "Execute SQL" (delete from table all) ->
    "Dummy Blocker Step" (I want to finsih my delete to avoid Locking issues with fastload) ->
    "Table Input"  (Just a simple select from another host with a fetch first 1000 rows only) -> 
    "Table Output" to the fastload test connection, without truncate data, and a commit size of 30 for testing.


    Now what happens is interesting. One of the rules of TD FastLoad is that your target table has to be empty, but fetching 1000 rows and committing every 30 means there are already 30 rows in the table when the second batch starts to commit. I don't know how I'm going to get around this (without spooling everything... definitely not an option with my volumes in production).

    Here's something from the Teradata JDBC help page at http://www.teradata.com/DownloadCent...403-98-1.aspx:

    Question

    I have questions about JDBC FastLoad scalability. My understanding about PreparedStatement addBatch is that the rows accumulate on the client until executeBatch sends them to the Teradata Database. I'm worried about overflowing the JVM heap, so I want to call executeBatch periodically to send rows to the database. But I also know that FastLoad can only insert rows into an empty table, so I am worried that the second executeBatch will hit an error. What are the best practices for using JDBC FastLoad?

    Answer

    We recommend that autocommit be set to false (meaning turned off) for a JDBC FastLoad Connection, then JDBC FastLoad PreparedStatement executeBatch can be executed multiple times. JDBC FastLoad PreparedStatement addBatch saves the data in memory on the heap, and JDBC FastLoad PreparedStatement executeBatch transmits the data to the Teradata Database where it is placed in a temporary holding area. With autocommit set to false, JDBC FastLoad PreparedStatement executeBatch won't transfer the data to the actual table. The data can accumulate in the holding area until JDBC FastLoad Connection commit instructs the Teradata Database to transfer the data from the holding area to the actual table.

    The Teradata JDBC Driver User Guide currently contains an incorrect recommendation -- it recommends a batch size of 500 to 1000 for JDBC FastLoad. This is too small. Our new recommendation is that an application should set autocommit to false for JDBC FastLoad, and use as large a batch size as possible with respect to their JVM heap size. A future revision of the JDBC Driver User Guide will recommend a batch size of 50,000 to 100,000 for JDBC FastLoad. If an application uses a very large batch size, then JDBC FastLoad performance can be comparable to standalone FastLoad utility performance.

    So it sounds like the TD answer is to "executeBatch multiple times"... I have no idea how to do this with PDI yet.
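    (Outside of PDI, the pattern the answer describes is plain JDBC; here's a minimal standalone sketch, with placeholder host, credentials, table, and columns:)

    Code:
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class FastloadBatchDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("com.teradata.jdbc.TeraDriver");
            // TYPE=FASTLOAD selects the JDBC FastLoad protocol for this connection.
            Connection con = DriverManager.getConnection(
                    "jdbc:teradata://tdhost/CHARSET=UTF8,TYPE=FASTLOAD",
                    "tduser", "tdpassword");
            con.setAutoCommit(false); // off, so executeBatch can run more than once
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO jrochette.lmpkp (id, name) VALUES (?, ?)");
            final int BATCH_SIZE = 50000; // large batches, per the answer above
            for (int i = 1; i <= 200000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "row " + i);
                ps.addBatch(); // rows accumulate on the client heap
                if (i % BATCH_SIZE == 0) {
                    ps.executeBatch(); // ships rows to the holding area, not the table
                }
            }
            ps.executeBatch(); // flush any remainder
            con.commit();      // moves everything from the holding area into the table
            ps.close();
            con.close();
        }
    }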

    If I make any additional progress I'll post it here, and any tips appreciated.

    -Joe



  9. #9
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Joe, from what I heard on IRC, Paul already tried the "Table Output" route and it wasn't really faster. We really need to create a fastload wrapper.
    Creating the optimal situation (like truncating the table) before loading is part of the job, of course. ;-)

  10. #10
    pstoellberger Guest

    Default

    As I told yeoj yesterday on IRC, we are currently working on this plugin. I just added the "truncate table" flag option to the plugin requirements.

    Development progress is quite good; I hope I can give you something to play with by next week.
