Hitachi Vantara Pentaho Community Forums

Thread: Partition data by number of rows?

  1. #1
    Join Date
    Feb 2017
    Posts
    2

    Question Partition data by number of rows?

    Hello fellows!

    I've been fighting with some stuff for quite a long time.
    Is there a way to divide input data by number of rows?
    The idea is to process a pretty big dataset (billions of rows) that has no logical key to partition on.
    Right now this is done with a PostgreSQL function and an intermediate table, but in the future it won't be possible to create any objects in the source database, so I'm searching for another solution.

    P.S. I've searched for documentation on the internal.step.* variables with no success; maybe I was looking in the wrong place?

    Thanks in advance!

  2. #2
    Join Date
    Jan 2015
    Posts
    107

    Default

    Do you mean partitioning during a batch for parallel processing or between batches so you can resume processing where you left off?

  3. #3
    Join Date
    Feb 2017
    Posts
    2

    Default

    I meant some point from which I can resume.
    For example, we have a query that returns a result of 1.000.000 rows. I want to cycle through it: fetch the first 100.000 rows and process them, then fetch the second part, and so on.
    Maybe it can be done with some Java code? (I've never done anything in Java.)
    I'm sorry, I'm new to Pentaho. I remember doing similar stuff in DataStage, and I still can't get used to the absence of the Transformer stage.

  4. #4
    Join Date
    Jan 2015
    Posts
    107

    Default

    With PDI, you can do all kinds of stuff with row numbers. For example, you can insert variables into a query to specify WHERE clauses or LIMIT and OFFSET, paging through the result in subsequent runs. If your transformation mostly acts at row level and doesn't have operations that need to keep all rows in memory, you might also be able to just let it stream through in one go.
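
    For instance, the paging query in a Table Input step could look roughly like the sketch below. The table, column, and variable names (big_table, some_column, ${BATCH_SIZE}, ${BATCH_OFFSET}) are placeholders, not anything from this thread; a parent job would set and increment the variables between runs, with variable substitution enabled on the step.

        -- Hypothetical LIMIT/OFFSET paging query for a PDI Table Input step.
        -- ${BATCH_SIZE} and ${BATCH_OFFSET} are placeholder variables maintained
        -- by a parent job and substituted into the SQL before it runs.
        SELECT *
        FROM big_table
        ORDER BY some_column    -- an explicit ORDER BY is needed for stable paging
        LIMIT ${BATCH_SIZE}
        OFFSET ${BATCH_OFFSET};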

    The tricky part is getting consistent results from your database if there are no keys or indexes you can order by. Your "first" 100.000 rows might not be the same the next time you run the query if there is no explicit ordering, regardless of what you try in PDI or any other ETL program.

    If the table is only getting inserts (very likely if there is no key or unique index) but there is a monotonically increasing field (ideally a timestamp), you can still attempt to use that in your WHERE clause and get batches without overlap. Performance may be an issue if it does a full table scan every time.
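
    As a sketch of that idea (the column created_at and the variable ${LAST_TS} are placeholders, not columns from the poster's table):

        -- Hypothetical batch query keyed on a monotonically increasing timestamp.
        -- ${LAST_TS} is a placeholder variable the job would store after each run
        -- (e.g. the MAX(created_at) of the previous batch) so the next run resumes
        -- where the last one stopped, without overlap.
        SELECT *
        FROM big_table
        WHERE created_at > '${LAST_TS}'
        ORDER BY created_at
        LIMIT 100000;

    Without an index on that column, every batch still scans the table, which is where the full-table-scan caveat above comes from.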

    You could also cheat by using the Postgres system column "ctid". That column contains the physical location of the row, and selecting by it can be very fast. But this ONLY works if rows are not getting updated and the table is not reorganized/moved in the timeframe of your job. Let me repeat that in bold red: Rows get a different ctid when being updated or when table blocks get moved around!
    I would not recommend this, especially with the database not being under your administration, but if you are desperate and can complete the whole job within a few hours, it might work.
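
    If you do go down that road, a ctid-based slice might look roughly like the sketch below. The table name and block numbers are placeholders, and whether the server accepts range comparisons on ctid at all, and can execute them with a fast TID-based scan rather than a full scan, depends on the PostgreSQL version, so check that it parses and look at the plan on your version first.

        -- Hypothetical slice of a table by physical location using the ctid
        -- system column. A ctid is (block_number, tuple_offset); this grabs all
        -- rows stored in blocks 0 through 49999. ONLY safe if nothing updates the
        -- rows or rewrites the table (VACUUM FULL, CLUSTER, ...) while the job runs.
        SELECT *
        FROM big_table
        WHERE ctid >= '(0,0)'::tid
          AND ctid <  '(50000,0)'::tid;

    Since rows are distributed unevenly across blocks, the number of rows per slice will vary; the block range only controls how much of the physical table each batch covers.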

    If the table is also being updated somehow while your job runs, I have no idea how to do this without keys.
    Last edited by Isha Lamboo; 03-02-2017 at 08:27 AM.
