Hitachi Vantara Pentaho Community Forums
Results 1 to 6 of 6

Thread: Can Pentaho PDI handle? Several normalized files merged to a single positional file.

  1. #1
    Join Date
    Sep 2010

    Default Can Pentaho PDI handle? Several normalized files merged to a single positional file.

    Evaluating Pentaho as a solution to taking multiple files in a normalized form and merging them to a single positional structured format. New to PDI. Is this something it is capable of doing? Detail below example if need. Really appreciate your thoughts!

    Input files

    Demographic file

    Acct No, Date, Name, Address, City, State, Zip, DOB, Sex, etc….
    123456, 1/5/2010, Bob Smith, 2 Elm St, Taylors, SC, 29687,5/5/1975,M
    3859392,3/25/2010, Susan Jones, 54 Maple Dr, Taylors, SC, 29687,4/9/1960,F

    Diagnosis Code File

    Acct No, Date, Diag Code
    123456, 1/5/2010, 495.5
    3859392,3/25/2010, 410.91

    Charge File

    Acct No, Date, Charge Code, Amount
    123456, 1/5/2010, 01587, 87.26
    3859392,3/25/2010, 99214, 125.69

    The resultant file that would be imported needs to look something like this….

    ENCTRHDR,123456 (about to read a new patient)
    ENCTR, 123456, 1/5/2010, Bob Smith
    DEMO, 2 Elm St, Taylors, SC, 29687,5/5/1975,M
    DIAG, 495.5
    CHARGE, 01587, 87.26
    ENCTRHDR, 3859392 (new patient )
    ENCTR, 3859392,3/25/2010, Susan Jones
    DEMO, 54 Maple Dr, Taylors, SC, 29687,4/9/1960,F
    DIAG, 410.91
    CHARGE, 99214, 125.69

    I am trying to take several source files and turn it into a single file that would be imported into a decision support database (DSS) for healthcare. The import file is structured where various information (demographics, charges, payments, diagnoses codes, physicians) related to a patient’s visit is done in a positionally ordered sequence. Header recorders are used to inform the DSS database the type of data that follows the header ie…payment data, demographics, etc…For each patient’s visit there is an Encounter header that tells the database that it is reading a new patient visit. All the data following the Encounter Header would be data that is unique to that patient’s visit until a new Encounter Header is read which would indicate a new patient visit. Each of the records related to the various types a patient data are varied in the number of fields…The first field of every record has a record type value that lets the DSS database know the type of record that it is. Below is and basic example of the source data and then the desired result. I was hoping someone could let me know if Pentaho PDI is capable of handling this before I travel down the road of learning Pentaho. The basic steps are taking several files with different types of data that all have fields that would allow them to be joined together, sorted in the format described above. Really appreciate any thoughts, suggestions, or other avenues to take.

  2. #2
    Join Date
    Sep 2007


    Of course PDI can handle this.
    In this particular example, you can read the text files, join the files on the first two columns (Acct No, Date) and after doing some tricks you can have the desired output file.

  3. #3
    Join Date
    Sep 2009

    Default Example Transformation

    Hi there,

    I've created a sample transformation that processes your sample data. Just open it in PDI and execute. It will create an output.txt file next to the .ktr file.

    Hope that helps in evaluating PDI's possibilities



    edit: I added another example that would generate the entire record as a single string instead of generating each output row separately. Should execute a little faster, choose whatever you like best
    Attached Files Attached Files
    Last edited by slawomir.chodnicki; 09-05-2010 at 06:16 AM.

  4. #4
    Join Date
    Sep 2010


    Thanks Slawo. The logic that you provided looks like it will work fine with my live data....

    Couple additional questions...

    1. When I opened the code in the Javascript step there were a couple items that had a red font...Your example worked but thought that red font indicated an error in the code.... Do I have a reference to Javascript messed up? Hopefully the red shows below...

    2. For some of the patient records, there may not be an associated "child" record like a diagnosis code to match in the outer join. Will the code below skip if there is no record or would i need to add some code to check for Null... something like if Diag_Code==null { skip} else {do next code (which would also have to be evaluated for null)}... sorry about my limited Javascript ability (trying to work my way through a 1000 page Javascript book as well as Maria's PDI book)

    3. Would the code / job be able to handle 750,000 to 1MM rows in each of the tables that would be joined?

    Really appreciate any help...


    var row = createRowCopy(getOutputRowMeta().size());
    // find the index of the first field appended by this step

    var idx = getInputRowMeta().size();
    // fill the field

    row[idx++] = value;
    // output the row



    // using this as a template
    // ENCTRHDR,123456 (about to read a new patient)
    // ENCTR, 123456, 1/5/2010, Bob Smith
    // DEMO, 2 Elm St, Taylors, SC, 29687,5/5/1975,M
    // DIAG, 495.5
    // CHARGE, 01587, 87.26


    trans_Status = SKIP_TRANSFORMATION;

  5. #5
    Join Date
    Sep 2009


    Hi there,

    1. the functions show orange, which means that these functions are recognized by the formatter as Kettle specific functions that are made available by the plugin. Javascript itself has no getInputRowMeta function, obviously. The plugin formatter is just being nice

    2. In this solution, if you don't want the some rows to be generated, you'd have to do the logic yourself and compare for null.

    Something like this might do the trick...

    if (Diag_Code != null){

    You may also consider generating a special Diag Code "N/A" before the field arrives at the script. The "if field value is null" step is used to do that.

    3. 750.000 - 1MM rows should be no problem. If you want to speed things up, there's always the possibility to rewrite that using the User Defined Java Class Step which may give a processing boost up to 10x



  6. #6
    Join Date
    Sep 2010


    Thanks Slawo,

    I'll look into the User Defined Java Class Step... I assume the JavaScript The Definitive Guide book would be a good reference and possibly some examples for the that?? Will probably need the performance boost.

    I'll also need to consider that the parent / child table relationship is a one-to-many where I would have multiple diagnosis codes for the same patient...

    Thanks again for your help,

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.