Hitachi Vantara Pentaho Community Forums

Thread: Grouping XML data through STAX Parser

  1. #1
    Join Date
    Nov 2015
    Posts
    10

    Grouping XML data through STAX Parser

    Hello,

    I'm new to Pentaho Kettle and have hit an obstacle in my project. Is there any way to use the StAX parser and group the elements dynamically, depending on the file that arrives each day? For example, if today's XML comes with 50 tags, I build my parser for 50 elements; but if tomorrow's file comes with only 40 tags, the transformation throws an error because it was designed for 50 tags. I'm not sure how to order and group the data from the XML dynamically. I would really appreciate a response. Thank you.

    Abinash

  2. #2
    Join Date
    Aug 2011
    Posts
    360

    Hi,

    We need a sample XML, and we need to know what you are trying to do and what you have done so far.

    Cheers

  3. #3
    Join Date
    Nov 2015
    Posts
    10

    Hi Mathias,

    Thank you for your response. Because of data integrity and security concerns I can't share the exact XML, but here is a sample that I created myself:

    <reporting:type="ABC">
    <reporting:Rate>
    <name>John </name>
    <id> 345 </id>
    <personal details>
    <phone>67890364 </phone>
    <address>post lane </address>
    </personal details>
    <interest> 2</interest>
    </reporting:Rate>
    <reporting:Rate>
    <name>Freddy </name>
    <id> 678 </id>
    <personal details>
    <phone>67986573 </phone>
    </personal details>
    </reporting:Rate>
    <reporting:Rate>
    <name>Sam </name>
    <id>987 </id>
    </reporting:Rate>

    As you can see, there are three rates in the XML: the first rate has 6 tags, the second has 4, and the third has 2. The table has 6 fields, matching the first rate; name and id are mandatory and the rest are optional. What I'm trying to do is group the data into those 6 fields using the Row Denormaliser and Calculator steps. However, the grouping doesn't come out right. Please find the attached ktr. I assign a sequence number to every row coming out of the step so that I can group on it using a Calculator step. Let me know if I need to provide any other details. Thank you.

    Abinash




  4. #4
    Join Date
    Nov 2015
    Posts
    10

    Can anyone please help on this?

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Why are you not using the "Get Data from XML" step?

    Also, the snippet that you posted is not valid XML, so we can't try to build a valid transform to show you how you could work this problem.

  6. #6
    Join Date
    Nov 2015
    Posts
    10

    Hi gutlez,

    We have very large files, using Get Data from XML slows the process.

    Abinash

  7. #7
    Join Date
    Nov 2015
    Posts
    10

    Hi,

    I have added a sample xml. Let me know if that helps.

    Abinash

  8. #8
    Join Date
    Apr 2008
    Posts
    4,696

    I would still attack it with get data from XML.

    I was able to extract your sample using:
    Get Data From XML
    If NULL
    Get Data From XML
    Select Values

    Giving me:
    Code:
    name    id   interest  phone     address
    John    345  2         67890364  post lane
    Freddy  678            67986573
    Sam     987
    Last edited by gutlez; 12-08-2015 at 01:33 PM.
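
For readers who want to see the DOM-style idea outside Spoon, here is a minimal Python sketch: load the whole document, loop over each Rate, and read every optional field by relative path so that a missing element simply comes back empty. It is not the Get Data From XML step itself. The well-formed sample and the element names `root`, `Rate` and `personaldetails` are assumptions, since the attached file is not shown in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the sample; element names are assumed.
SAMPLE = """<root>
  <Rate>
    <name>John</name><id>345</id>
    <personaldetails><phone>67890364</phone><address>post lane</address></personaldetails>
    <interest>2</interest>
  </Rate>
  <Rate>
    <name>Freddy</name><id>678</id>
    <personaldetails><phone>67986573</phone></personaldetails>
  </Rate>
  <Rate>
    <name>Sam</name><id>987</id>
  </Rate>
</root>"""

# Output column -> path relative to each <Rate> element.
FIELDS = {
    "name": "name",
    "id": "id",
    "interest": "interest",
    "phone": "personaldetails/phone",
    "address": "personaldetails/address",
}


def extract_rows(xml_text):
    """Parse the whole document into memory and emit one dict per <Rate>."""
    root = ET.fromstring(xml_text)
    rows = []
    for rate in root.findall("Rate"):
        # findtext returns None when an optional element is absent.
        rows.append({col: rate.findtext(path) for col, path in FIELDS.items()})
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the column list is fixed and every lookup tolerates a missing element, a file with fewer tags yields empty values rather than an error. The cost is that the whole tree is held in memory.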

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Quote: Originally Posted by abinashsa
    We have very large files, using Get Data from XML slows the process.

    A slow process that works is better than a fast one that doesn't work.
    At least as far as I learned...

  10. #10
    Join Date
    Nov 2015
    Posts
    10

    I may be wrong, but as far as I understand, "Get Data from XML" uses DOM parsing, and parsing XML files larger than 500 MB doesn't work with it. I'm looking for an efficient StAX parsing solution.
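
The DOM versus streaming trade-off mentioned here can be illustrated with a small Python sketch. It uses the standard library's `iterparse` as a stand-in for a StAX-style pull parser, emits one row per Rate as soon as its end tag is seen, and then discards the finished record so memory use does not grow with file size. This is an analogy, not PDI code; the sample and the element names `root`, `Rate` and `personaldetails` are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; element names are assumed.
SAMPLE = b"""<root>
  <Rate>
    <name>John </name><id> 345 </id>
    <personaldetails><phone>67890364 </phone><address>post lane </address></personaldetails>
    <interest> 2</interest>
  </Rate>
  <Rate>
    <name>Freddy </name><id> 678 </id>
    <personaldetails><phone>67986573 </phone></personaldetails>
  </Rate>
  <Rate>
    <name>Sam </name><id>987 </id>
  </Rate>
</root>"""

LEAF_FIELDS = ("name", "id", "interest", "phone", "address")


def stream_rates(source):
    """Yield one dict per <Rate>, keeping only the current record in memory."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Rate":
            row = dict.fromkeys(LEAF_FIELDS)  # every column defaults to None
            for child in elem.iter():
                if child.tag in row and child.text is not None:
                    row[child.tag] = child.text.strip()
            yield row
            root.clear()  # drop the finished record from the partial tree


rows = list(stream_rates(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

Grouping happens per enclosing Rate element rather than per tag count, so records with 2, 4 or 6 tags all produce the same fixed set of columns.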

  11. #11
    Join Date
    Apr 2008
    Posts
    4,696

    Yes, it uses DOM.
    I can't find it now, but a long time ago there was a demonstration of the throughput.

    The question I would ask: have you actually tried it?
    Tell you what: you try building and running your file through Get Data From XML, and I'll try to build a version with StAX.

  12. #12
    Join Date
    Nov 2015
    Posts
    10

    Yes gutlez, I have already tried "Get Data From XML", and it works fine. Given the volume of data, I would like to move to the StAX parser, which gets the job done in seconds. Thank you for your kind response. Looking forward to seeing the StAX version.

  13. #13
    Join Date
    Apr 2008
    Posts
    4,696

    So, consider the prior response: A solution that works is better than a solution that doesn't work.

  14. #14
    Join Date
    Apr 2008
    Posts
    4,696

    I have no idea what the performance of this will be (and I know marabu can do this better than I can!)

    StAX
    **Copy output to two Dummy steps**
    Stream Lookup (input from both Dummy steps): for each row in one stream, find the row in the other stream whose xml_element_id matches this row's xml_parent_element_id, and retrieve that row's xml_parent_element_id as xml_grandparent_element_id (NOTE: This will mean you have all rows in memory!)
    Filter Rows to eliminate the "START_" and "END_" rows
    Switch / Case on xml_parent_path -- "/root/Rate" goes to path one, "/root/Rate/personaldetails" goes to path two

    Path1:
    Sort Rows on xml_parent_element_id
    Denormalize. Group on xml_parent_element_id and put id, name, and interest into columns
    Select Values
    Feed into Stream Lookup2

    Path2:
    Sort Rows on xml_parent_element_id
    Denormalize. Group on xml_parent_element_id and put phone and address into columns
    Select Values (rename xml_grandparent_element_id to xml_parent_element_id)
    Feed into Stream Lookup2

    Stream Lookup2: the stream to look up is from Path2; look up where xml_parent_element_id = xml_parent_element_id and return all columns from the lookup

    Final Output:
    Code:
    PersonID  PersonParentID  Name    id   interest  phone     address
    8         2               John    345  2         67890364  post lane
    11        9               Freddy  678  <null>    67986573  <null>
    16        14              Sam     987  <null>    <null>    <null>
    Last edited by gutlez; 12-08-2015 at 07:02 PM.
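
The split, denormalise and re-join part of this recipe can be sketched in Python to make the data flow easier to follow. This is only an illustration of the logic, not a transformation. The `Row` layout, the `field` and `value` names and the id numbers are made up; only the path strings and the parent/grandparent idea come from the post. It assumes the grandparent id has already been looked up for each row.

```python
from collections import namedtuple

# Hypothetical layout for the text rows that survive the Filter Rows step.
Row = namedtuple("Row", "parent_id grandparent_id parent_path field value")

ROWS = [
    Row(1, 0, "/root/Rate", "name", "John"),
    Row(1, 0, "/root/Rate", "id", "345"),
    Row(4, 1, "/root/Rate/personaldetails", "phone", "67890364"),
    Row(4, 1, "/root/Rate/personaldetails", "address", "post lane"),
    Row(1, 0, "/root/Rate", "interest", "2"),
    Row(8, 0, "/root/Rate", "name", "Freddy"),
    Row(8, 0, "/root/Rate", "id", "678"),
    Row(11, 8, "/root/Rate/personaldetails", "phone", "67986573"),
    Row(13, 0, "/root/Rate", "name", "Sam"),
    Row(13, 0, "/root/Rate", "id", "987"),
]


def denormalise(rows, key, columns):
    """Group rows on key(row) and pivot field/value pairs into fixed columns."""
    out = {}
    for r in rows:
        rec = out.setdefault(key(r), dict.fromkeys(columns))  # None by default
        if r.field in rec:
            rec[r.field] = r.value
    return out


def join_paths(rows):
    """Path 1 keys on the Rate id; path 2 keys on the grandparent (also the Rate id)."""
    rate_rows = [r for r in rows if r.parent_path == "/root/Rate"]
    detail_rows = [r for r in rows if r.parent_path == "/root/Rate/personaldetails"]
    path1 = denormalise(rate_rows, lambda r: r.parent_id, ("name", "id", "interest"))
    path2 = denormalise(detail_rows, lambda r: r.grandparent_id, ("phone", "address"))
    empty = dict.fromkeys(("phone", "address"))
    # Left join: every Rate is kept even when it has no personaldetails.
    return [{**rec, **path2.get(key, empty)} for key, rec in path1.items()]


result = join_paths(ROWS)
for rec in result:
    print(rec)
```

A Python dict groups without needing sorted input, whereas the recipe above sorts before each Denormalize step, so the sort has no counterpart in this sketch.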

  15. #15
    Join Date
    Nov 2015
    Posts
    10

    Hi gutlez,

    Thank you so much. Could you share a screenshot or the sample ktr? That would be a great help. I really appreciate the solution.

    Abinash

  16. #16
    Join Date
    Nov 2015
    Posts
    10

    Hi gutlez,

    Can you please explain this step:

    Stream Lookup (input from both dummy steps), retrieve xml_parent_id from one stream where xml_parent_id of the other stream matches the xml_element_id as xml_grandparent_id (NOTE: This will mean you have all rows in memory!)

    Abinash

  17. #17
    Join Date
    Apr 2008
    Posts
    4,696

    Lookup box:
    xml_parent_element_id = xml_element_id

    Retrieve box:
    xml_parent_element_id, renamed to xml_grandparent_element_id, column type Integer

    You'll need this for the personaldetails sections, as it will give you a single element for joining the data back up again.
    Last edited by gutlez; 12-08-2015 at 07:33 PM.
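
The lookup described in this post is a self-join of the element rows: each row's parent id is matched against the element ids of the same data set, and the matched row's own parent id is returned as the grandparent. A minimal Python sketch, with made-up element rows and ids, might look like this:

```python
# Hypothetical element rows: (xml_element_id, xml_parent_element_id, element name).
ELEMENTS = [
    (0, None, "root"),
    (1, 0, "Rate"),
    (2, 1, "name"),
    (3, 1, "id"),
    (4, 1, "personaldetails"),
    (5, 4, "phone"),
    (6, 4, "address"),
]


def add_grandparent(elements):
    """Append the grandparent id to every row via an in-memory lookup table."""
    # The lookup side: element id -> that element's parent id (all rows in memory).
    parent_of = {elem_id: parent_id for elem_id, parent_id, _ in elements}
    return [
        (elem_id, parent_id, parent_of.get(parent_id), name)
        for elem_id, parent_id, name in elements
    ]


with_grandparent = add_grandparent(ELEMENTS)
grandparent_by_name = {name: gp for _, _, gp, name in with_grandparent}
print(grandparent_by_name)
```

For `phone` and `address` the grandparent is the Rate element, which is the single key that lets the personaldetails values be joined back to the rest of the record.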

  18. #18
    Join Date
    Nov 2015
    Posts
    10

    Hi Gutlez,

    I tried it this way, but the method doesn't seem to work well when there are multiple levels. Is there any other approach you could suggest? Thank you.

    Abinash

  19. #19
    Join Date
    Apr 2008
    Posts
    4,696

    If you are going to use multiple levels, you need to develop a way to join each row to its parent yourself. The DOM parser takes care of this for you.

    Those are your choices.
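
One possible way to "develop the join yourself" for arbitrary depth is to walk up the parent chain until the record element is reached, instead of stopping at the grandparent. The Python sketch below is a suggestion under assumptions, not something proposed in the thread: the element rows, the ids, the extra `city` level and the helper name `nearest_ancestor` are all hypothetical, and the parent table is assumed to fit in memory.

```python
# Hypothetical element rows: (xml_element_id, xml_parent_element_id, element name).
ELEMENTS = [
    (0, None, "root"),
    (1, 0, "Rate"),
    (2, 1, "personaldetails"),
    (3, 2, "address"),
    (4, 3, "city"),  # an extra nesting level, invented for illustration
    (5, 1, "name"),
]

parent_of = {elem_id: parent_id for elem_id, parent_id, _ in ELEMENTS}
name_of = {elem_id: name for elem_id, _, name in ELEMENTS}


def nearest_ancestor(elem_id, parent_of, name_of, target="Rate"):
    """Walk up the parent chain and return the id of the closest `target` element."""
    current = parent_of.get(elem_id)
    while current is not None:
        if name_of[current] == target:
            return current
        current = parent_of.get(current)
    return None  # no enclosing element with that name


record_of_city = nearest_ancestor(4, parent_of, name_of)
record_of_name = nearest_ancestor(5, parent_of, name_of)
record_of_rate = nearest_ancestor(1, parent_of, name_of)
print(record_of_city, record_of_name, record_of_rate)
```

Every leaf value then carries the id of its enclosing record, whatever its depth, and a single group-by on that id replaces the per-level lookups.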


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.