Hitachi Vantara Pentaho Community Forums

Thread: Dynamically extracting data from many tables with the same structure

  1. #1
    Join Date
    Oct 2007
    Posts
    255

    Default Dynamically extracting data from many tables with the same structure

    One of the Kettle example scripts demonstrates how you might set up a combination of jobs/transformations that works essentially like this:
    1. Main job
      1. First transformation gets table names, sets rows to result
      2. First sub-job is executed, with option execute once for every input row checked
        1. A transformation is executed that puts the table name into an environment variable
        2. Another transformation is executed that makes use of the table name in a query
      3. A final transformation is executed that saves the filenames generated by the sub-job in step 2
    This would be an adequate (albeit convoluted) solution if I only needed to handle at most a couple hundred tables, but in this case I'm handling thousands of tables. It doesn't scale well for a few reasons:
    • The solution doesn't lend itself to parallel processing very well, and without the ability to multi-thread it I end up capped at ~15,000 rows/sec
    • The solution isn't very compact at all
    • The log gets spammed with a ton of information -- granted, the logging level can be raised, but when I do need detailed information it's very awkward to sift through all that spam
    There may be more, but those are good enough reasons to look for a better solution.

    The solution I'm using now is essentially:
    1. Add a new stored procedure to the source database (MySQL in this case) that builds a prepared statement from a concatenated string, executes the prepared statement, then deallocates it to free up resources (a rough sketch of such a procedure follows this list)
    2. In a transformation:
      1. Get table names using the information_schema, and pass these to another table-input step
      2. Set the 2nd table-input step to run multiple instances, and add the single statement:
        Code:
        call my_stored_procedure(?, ${ETL_DATETIME})
        ... where the 2nd parameter is there just to get it into the result set without extra steps (there's a backstory to that I won't go into)
      3. Process the rows in steps following the 2nd table-input step as normal
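
    To give a rough idea, a procedure like that might look something like the sketch below. The body, the parameter names, and the SELECT it builds are illustrative, not my exact code; the key parts are the PREPARE/EXECUTE/DEALLOCATE around a concatenated query string, since a table name can't be bound as an ordinary parameter.

    Code:
    -- Illustrative sketch only: names and the generated SELECT are assumptions.
    DELIMITER //
    CREATE PROCEDURE my_stored_procedure(IN p_table_name VARCHAR(64), IN p_etl_datetime DATETIME)
    BEGIN
      -- Splice the table name into the query text and prepare it dynamically;
      -- the ETL timestamp is echoed back as a column so it lands in the result set.
      SET @sql = CONCAT('SELECT t.*, ? AS etl_datetime FROM ', p_table_name, ' t');
      SET @dt  = p_etl_datetime;
      PREPARE stmt FROM @sql;
      EXECUTE stmt USING @dt;
      -- Release the prepared statement to free up server resources.
      DEALLOCATE PREPARE stmt;
    END //
    DELIMITER ;

    Kettle then just sees whatever result set the procedure returns.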
    This approach works great. I've managed to get throughput as high as 55,000-65,000 rows/sec, and the only reason it isn't higher is bandwidth limitations. However, this approach has a few drawbacks of its own:
    • The MySQL driver that comes with Kettle is 3.1.4, or something thereabouts. When this driver is used, the result set is corrupted: values in the range of 2^70 or higher show up where values in 0..65536 are expected, among other weirdness. I was able to resolve this by using the latest MySQL driver, 5.something.
    • Steps after the table-input step don't get field names propagated to them. This seems to be fine for most steps; it's just annoying when I want to preview data (I can't), or for steps that complain about a field not existing (join steps and others). The steps still work fine when I run the transformation, though.
    • The JavaScript step cannot be used at all. The missing (but not really missing) fields cause JavaScript steps to fail when you run the transformation
    I am fully cognizant of the fact that Kettle wasn't designed to be used with stored procedures. I'm fairly certain I logged a bug about this back when I originally ran into these issues, so hopefully a future version will have better support for them. In the meantime, I'm hoping some Kettle master has a good suggestion, even if it's a completely different approach.

    -Brian
    Last edited by Phantal; 09-09-2008 at 02:48 PM.

  2. #2
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    I may have found a workaround for the missing-fields issue. I added a Select Values step and populated its Metadata tab with information about all the fields before they reach the JavaScript step. On my first attempt I only populated field names and lengths, and the JavaScript step still errored out. On the second attempt I also populated the Type column on the Metadata tab, and it seems to be functioning all right.

    Before I get fingers wagged at me, this is not going into a production environment; it's simply for testing.

    If anyone has better suggestions, I'd still love to hear them.

    -Brian

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's an interesting problem, I'm going to have to think about this one.
    Reading thousands of tables in parallel is surely going to be sub-optimal.
    At the same time, as you stated, reading them one at a time is not ideal either.
    As such, you would want some kind of way to parameterize this degree of parallelism.

    I think the feature request came up before on this forum to have a checkbox in the job/trans job entries to launch in parallel for all result rows.
    The additional feature you're asking about is the ability to limit this degree of parallelism, right?

    I'm now wondering if we shouldn't allow the user to define the field-to-variable mapping inside these job entries as well. I think that might be kinda cool to simplify the jobs a lot.

    Do you think that ideas like that would set us on the right track?

    Cheers,
    Matt

  4. #4
    Join Date
    Oct 2007
    Posts
    255

    Default

    Quote Originally Posted by MattCasters View Post
    It's an interesting problem, I'm going to have to think about this one.
    Reading thousands of tables in parallel is surely going to be sub-optimal.
    At the same time, as you stated, reading them one at a time is not ideal either.
    As such, you would want some kind of way to parameterize this degree of parallelism.
    I've found I can max out a 100Base-T link (with this dataset) using only 4 threads of execution, where the connection is made over a compressed SSH tunnel (which, in case you're curious, improves throughput by about 4x over an uncompressed SSH tunnel).

    Each of the large tables typically has anywhere from 1 to 300,000 rows, some more, but in aggregate there are (for this server) about 32 million rows. Executing 4 in parallel, I'm able to extract all of this in about 520 seconds.

    In other words, the two bottlenecks are 1) the data rate of a single Table Input step (I'm not sure why that is limited), and 2) the hiccups that serial execution adds when each table has a relatively low number of rows. Just adding 3 extra threads seemed to do the trick.

    I think the feature request came up before on this forum to have a checkbox in the job/trans job entries to launch in parallel for all result rows.
    The additional feature you're asking about is the ability to limit this degree of parallelism, right?

    ...

    I'm now wondering if we shouldn't allow the user to define the field-to-variable mapping inside these job entries as well. I think that might be kinda cool to simplify the jobs a lot.
    For this one use case, the simplest feature would be the ability to parameterize the table name. Barring that, if you allowed variables set within a transformation to be immediately available to the steps following it, I could see that being a problem in that there's one variable but multiple Table Input steps. I'm not sure how to work around that one, but if that were the fix, I'm sure we could find a way around the problem.

    -Brian
    Last edited by Phantal; 09-09-2008 at 04:16 PM.

  5. #5
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    Another alternative would be to add a new Manually Partitioned Table Input step, or whatever you might call it. The step I use to get the table names just queries the information schema, using a regular expression, for table names that match some criteria (right now I'm just getting them all, but later I'll add logic to get only the ones I'm missing). The tables are named by year_month_day_hour. If there's something in JDBC that lets you get the names of tables, and get data from the tables matching a regular expression the user provided ... A rough sketch of the kind of name query I mean is below.
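
    For illustration only, the name-listing query is roughly along these lines; the schema name and the exact pattern are placeholders, not my real criteria:

    Code:
    -- Illustrative: schema name and regex are placeholders for the real criteria.
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'source_db'
      AND table_name REGEXP '^[0-9]{4}_[A-Za-z]{3}_[0-9]{2}_[0-9]{2}$';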

    This creates another problem, in that the criteria to select the appropriate tables might require some logic. For example, if the last time I did ETL I grabbed everything up to and including 2008_Jun_01_00, I now need everything from 2008_Jun_01_01 onward, even if the current date is, say, August 25th 2009 ... and that I don't have a good answer for, other than allowing the table to be parameterized by rows from a previous step in either a normal Table Input step or this new step type I'm envisioning. (One way the date cutoff could be expressed in SQL is sketched below.)
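
    As a rough sketch of that cutoff logic -- again, the schema name and the hard-coded watermark are placeholders; the watermark would really come from a variable or a previous step:

    Code:
    -- Illustrative sketch: parse the table-name suffix into a datetime and keep
    -- only tables newer than the last one already extracted.
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'source_db'
      AND table_name REGEXP '^[0-9]{4}_[A-Za-z]{3}_[0-9]{2}_[0-9]{2}$'
      AND STR_TO_DATE(table_name, '%Y_%b_%d_%H')
          > STR_TO_DATE('2008_Jun_01_00', '%Y_%b_%d_%H')
    ORDER BY STR_TO_DATE(table_name, '%Y_%b_%d_%H');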

    -Brian

  6. #6
    DEinspanjer Guest

    Default

    Quote Originally Posted by MattCasters View Post
    I think the feature request came up before on this forum to have a checkbox in the job/trans job entries to launch in parallel for all result rows.
    The additional feature you're asking about is the ability to limit this degree of parallelism, right?
    I want to jump in on this thread a bit and say that I feel the parallel-for-multiple-result-rows feature could be a simple text box that takes an integer: the number of rows to process in parallel. It could default to 1 (giving the current functionality), and 0 could indicate running them all in parallel (similar to how the cache size field in many DB steps works).


    Quote Originally Posted by MattCasters View Post
    I'm now wondering if we shouldn't allow the user to define the field-to-variable mapping inside these job entries as well. I think that might be kinda cool to simplify the jobs a lot.
    As far as this goes, I would love to see any shortcut that would make it simpler to turn a field into a variable. Right now I've got so many two-step transformations doing this that it's quite cluttered.

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Well Phantal, it seems like you have quite a nice challenge on your hands :-)
    I thought creating tables like that wasn't done anymore, but now that you have the problem...

    > If there's something in JDBC that lets you get names of tables, and get data from tables matching a regular expression the user provided ...

    That metadata is available through JDBC, but it's pretty unreliable. You're probably better off querying your database's catalog information yourself.
    For MySQL, for example, the command to run in "Table Input" is "Show tables".
    Looking up those tables against another table of "already processed" tables wouldn't be that hard, I would imagine; see the sketch below.
    I don't think that was the challenge we were facing here, though.
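
    Just to illustrate the idea (the tracking-table name and schema here are made up):

    Code:
    -- Illustrative: etl_processed_tables is a hypothetical tracking table that
    -- records the table names already extracted; only new tables are returned.
    SELECT t.table_name
    FROM information_schema.tables t
    LEFT JOIN source_db.etl_processed_tables p
           ON p.table_name = t.table_name
    WHERE t.table_schema = 'source_db'
      AND p.table_name IS NULL;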

    For now, perhaps we can try the "parallel" checkbox and variable mapping.

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

  9. #9
    Join Date
    Oct 2007
    Posts
    255

    Default asdf

    Matt,

    Is there any reason we couldn't have some ability to parameterize the table used in a query directly, using rows from a previous step in the same transformation? That would allow something like this to be done within a single transformation.

    Parallelizing jobs is, I think, a great idea in general, so I wouldn't want to abort that line of thinking; I'm just wondering whether there might be an alternative way to construct the queries used in Table Input steps.

    -Brian

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It should be possible but a bit slower due to the dynamic nature of the query.
