Hitachi Vantara Pentaho Community Forums

Thread: Process many files at once (addition error)

  1. #1
    Join Date
    Sep 2011
    Posts
    171

    Default Process many files at once (addition error)

    Hello Everyone,

    Here is the structure of my data:

    File1.csv

    Date, Fund Name, MarketValue
    11/12/2010, Fund A, $100
    11/12/2010, Fund B, $100

    File2.csv

    Date, Fund Name, MarketValue
    11/13/2010, Fund A, $110
    11/13/2010, Fund B, $130

    Here is what I would like to have in a database (Columns "Date" and "Fund Name" are set as Primary keys):

    Date, Fund Name, MarketValue
    11/12/2010, Fund A, $100
    11/12/2010, Fund B, $100
    11/13/2010, Fund A, $110
    11/13/2010, Fund B, $130


    However, after processing two files with PDI I got the following:

    Date, Fund Name, MarketValue
    11/12/2010, Fund A, $210
    11/12/2010, Fund B, $230


    i.e. the values were added together. It seems like a bug to me. I guess the variables used within the transformation got accumulated.

    Is there any way around this?

    Regards,
    Alex

    P.S. I attached the transformation (it is fairly big).
    Attached Files

  2. #2
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You specifically ask PDI to SUM the Base Market Value for each manager name, so this isn't a bug.

    Start simple, and work up from there.
    You have a number of text files, which you are reading and then want to load to the DB.

    So you use a Text Input followed by Table Output.
    But you have a couple of calculations you want to do first (SSGATreasFund + BaseMarketValue and Cash/BaseMarketValue).
    So put a Calculator between the Text Input and the Table Output.

    But... SSGATreasFund isn't in your Text Input.
    It's equal to If( [Security Description]="SSGA TREASURY FUND";[Base Market Value] ;0)
    So you put a Formula step before the Calculator.

    In the end, you could do your transform as:
    Text File Input
    Formula
    Calculator
    Table Output
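
    If it helps to see that same row-by-row logic outside of Spoon, here is a rough Python sketch of what the Formula and Calculator steps are doing (the column headers and the derived names treas_plus_base and cash_ratio are placeholders I'm assuming, not names taken from the attached transformation):

    Code:
    import csv

    def process_row(row):
        base = float(row["Base Market Value"])      # assumed header names
        cash = float(row["Cash"])
        # Formula step: If([Security Description]="SSGA TREASURY FUND"; [Base Market Value]; 0)
        ssga_treas_fund = base if row["Security Description"] == "SSGA TREASURY FUND" else 0.0
        # Calculator step: the two derived values mentioned above
        row["SSGATreasFund"] = ssga_treas_fund
        row["treas_plus_base"] = ssga_treas_fund + base   # SSGATreasFund + BaseMarketValue
        row["cash_ratio"] = cash / base                   # Cash / BaseMarketValue
        return row

    with open("File1.csv", newline="") as f:              # one input file, for illustration
        rows = [process_row(r) for r in csv.DictReader(f)]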
    Last edited by gutlez; 09-27-2011 at 12:30 PM.

  3. #3
    Join Date
    Sep 2011
    Posts
    171

    Default

    But the summing should occur within one file, not across multiple dates (files).

    I guess I do not correctly understand the concept of processing many files at once as provided by the "Text file input" step.

    I would like to process files individually (an individual file does not know anything about the other files). Is this possible in PDI?

    Regards,
    Alex

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    If you MUST sum within the files, then include the filename in your data stream, and group on it.

    Think of a Transform like a river.
    Multiple files being opened by one Text File Input would be like several springs feeding the same river.

    The individual lines of data are like drops of water in the river - they will flow through the length of the river all together, and working on an individual line takes specific planning. Working with one drop will not stop the other drops from continuing downstream.

    So if you want to sum the values within a file, you must plan accordingly.

    From what you had earlier, you want to sum within each file before any further processing...

    Text File Input (Turn on "Include Filename" - this option exists in TFI, but not CSV Input)
    Group By (Filename, Manager; Sum on Base Market Value)
    Formula
    Calculator
    Table Output
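
    If you want to see the same per-file grouping outside of PDI, here is a rough Python sketch (the column names "Manager Name" and "Base Market Value" are assumptions based on this thread, not taken from the attached transformation):

    Code:
    import csv
    from collections import defaultdict

    # Sum Base Market Value per (filename, manager), so totals never
    # cross file boundaries - the same idea as the Group By step above.
    totals = defaultdict(float)
    for filename in ["File1.csv", "File2.csv"]:
        with open(filename, newline="") as f:
            for row in csv.DictReader(f):
                key = (filename, row["Manager Name"])
                totals[key] += float(row["Base Market Value"])

    for (filename, manager), total in sorted(totals.items()):
        print(filename, manager, total)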

    Your other alternative would be to figure out how you want to process one file, and then put them into a job.
    Continuing the river example, Jobs would be like Locks in a canal - they deal with one bunch first, and then deal with the next bunch.
    Last edited by gutlez; 09-27-2011 at 12:42 PM.

  5. #5
    Join Date
    Sep 2011
    Posts
    171

    Default

    Great, Thank you very much!

    I wonder if it would be possible to use a job to apply the transformation to each file individually?

    Regards,
    Alex

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    It's definitely possible; however, the drawback to a job is that it runs serially - it does one file first, then the next, then the next...

    You could do:
    Job 1 (Master)
    - Start
    - Transform 1 (Get List of Files to Process)
    - Job 2 (Per File Loop) Make sure to set Execute Once for Each Row
    - - Transform 2 (Set Variables)
    - - Transform 3 (Process Files)

    It might be faster to do within a single transform, but you can control the order that the files get processed if you put them into a job.
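
    In plain code terms, that job layout is nothing more than a serial loop over the file list - roughly like this Python sketch (the folder name and process_one_file are placeholders for illustration, not parts of PDI):

    Code:
    import glob

    def process_one_file(path):
        # stand-in for "Transform 3 (Process Files)"
        print("processing", path)

    # Transform 1: build the list of files to process
    file_list = sorted(glob.glob("input/*.csv"))   # assumed input folder

    # Job 2 with "execute once for each row": a plain serial loop,
    # one file at a time, in the order the list was built
    for path in file_list:
        process_one_file(path)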

  7. #7
    Join Date
    Sep 2011
    Posts
    171

    Default

    Thank you very much!

    You answered two of my posts with one reply. Now I probably know how to process files newer than a particular date.

    Here is also what I found.

    http://www.cloud2land.com/2011/06/pe...es/#comment-88

  8. #8
    Join Date
    Sep 2011
    Posts
    171

    Default

    Hi gutlez,


    I tried using your approach to filter out files older than a particular date. I attached the transformation, and I inserted it in Job 2 right after the Set Variables transformation and before the main transformation. However, it looks like the filtering does not work, so I wonder if the "Filter rows" step is able to compare dates and filter accordingly.

    Basically, is the following going to work:

    <field=LASTMODIFIEDTIME> > <value = 9/26/2012>

    Regards,
    Alex
    Attached Files

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    One of the great things, and simultaneously bad things, about PDI is that you can do things in so many different ways.

    I personally would deal with all the parts about the file list in the first step:
    - Transform 1 (Get List of Files to Process)

    Get list of files
    Filter Rows (Tip: If you do not define your True step, you don't need to define the False either!)
    Sort Rows on LAST MODIFIED ASC
    Copy Rows to Result
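
    Roughly, that first transformation does the same thing as this little Python sketch (the folder name and the hard-coded cutoff date are assumptions just for illustration):

    Code:
    import os
    from datetime import datetime

    folder = "input"                  # assumed input folder
    cutoff = datetime(2012, 9, 26)    # assumed cutoff date

    # Get list of files, together with their last-modified time
    files = []
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        files.append((datetime.fromtimestamp(os.path.getmtime(path)), path))

    # Filter Rows equivalent: keep only files modified after the cutoff
    files = [(mtime, path) for mtime, path in files if mtime > cutoff]

    # Sort Rows equivalent: LAST MODIFIED ascending
    files.sort()

    for mtime, path in files:
        print(mtime, path)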


    Pro Tip #2: Use Job logging. This will define the last time your job ran successfully, and you can use that to specify which files to process!
    Last edited by gutlez; 09-27-2011 at 04:08 PM.

  10. #10
    Join Date
    Sep 2011
    Posts
    171

    Default

    Thank you again!

    Somehow your setup worked!


    Quote Originally Posted by gutlez View Post
    Sort Rows on LAST MODIFIED ASC
    you are sorting just to be careful, right?

    Quote Originally Posted by gutlez View Post
    Filter Rows (Tip: If you do not define your True step, you don't need to define the False either!)
    impressive!

    Quote Originally Posted by gutlez View Post
    Pro Tip #2: Use Job logging. This will define the last time your job ran successfully, and you can use that to specify which files to process!
    I am afraid to write anything on this forum, you can read my mind!


    I do not remember if I asked this or not.

    I have 2 types of files in the folder, which I can differentiate based on "last modified time". I wonder if I can use your setup to process them with two different transformations? And what would be the steps?

  11. #11
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by august_month View Post
    you are sorting just to be careful, right?
    Mostly... If you sort them before you pass them on to the Copy Rows to Results, then they *should* (I can't promise that they will) be processed in the order sorted.

    Quote Originally Posted by august_month View Post
    I have 2 types of files in the folder, which I can differentiate based on "last modified time". I wonder if I can use your setup to process them with two different transformations? And what would be the steps?
    It would likely be very difficult to sort out the first time, since there isn't a simple Switch/Case at the Job level, and the switch/case at the transformation level doesn't behave in the way that most programmers would expect (until they get familiar with the River analogy).

    HTH,

    GL.

  12. #12

    Default

    I'm sorry I'm replying to this old post, but you said that if you did not define your True step, you didn't need to define the False either!
    I was wondering what happens in that case? To which step does the true flow go? Does it go to the step defined in the order part?

    Very thankful for any kind of answer and advice.

  13. #13
    Join Date
    Sep 2011
    Posts
    171

    Default

    As far as I understand, you have to drag the filter step onto an existing hop and PDI will ask you to incorporate the filter. Then true will go where the hop goes and false will go nowhere.

  14. #14

    Default

    That is also how I understood it, and it actually suits me, so I will try.
    Thanks.
    If anyone else has an opinion on this, please share.

  15. #15
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    If you define neither true nor false, then the true values continue on the default output for the filter step, and the false values are discarded.
