Hitachi Vantara Pentaho Community Forums
Results 1 to 14 of 14

Thread: How can I read a multi-level XML file?

  1. #1
    Join Date
    Aug 2007
    Posts
    3

    Default How can I read a multi-level XML file?

    I am currently in the process of building a data store, which to a large extent will be populated by input in XML format. To do this I write transformations using the XMLInput step. This works fine when my XML document (or repeating part of it) is just one level deep and there is a known number of subordinate elements, such as this:

    <repeatingElement>
    <subElement1>value1</subElement1>
    <subElement2>value2</subElement2>
    <subElement3>value3</subElement3>
    </repeatingElement>

    However, when there are several levels under the repeating element, and the number of such elements can vary, I am running into trouble. Assume this XML Input structure:

    <repeatingElement>
    <subElement1>value1</subElement1>
    <subCompositeElement1>
    <subSubElement1>svalue1</subSubElement1>
    <subSubElement2>svalue2</subSubElement2>
    <subSubElement3>svalue3</subSubElement3>
    </subCompositeElement1>
    <subCompositeElement2>
    <subSubElement1>svalue1</subSubElement1>
    <subSubElement2>svalue2</subSubElement2>
    <subSubElement3>svalue3</subSubElement3>
    </subCompositeElement2>
    ...
    <subCompositeElementn>
    <subSubElement1>svalue1</subSubElement1>
    <subSubElement2>svalue2</subSubElement2>
    <subSubElement3>svalue3</subSubElement3>
    </subCompositeElement1>
    <subElement3>value3</subElement3>
    </repeatingElement>

    In this case, I want to insert records into a table using the subElement1 value and the values in each of the subCompositeElements to complete one row of output. If the number of subCompositeElements were fixed it would be possible to just have the output stream list all of them in the same row, and then try and transform them into one row per subCompositeElement. However, since the number of such elements is variable I cannot use this method, nor does it seem that I can define the subCompositeElement as my repeating element, since I need the subElement1 value in my output too.

    I have been trying to solve this scenario by merging multiple XML Inputs, one with the repeatingElement as my repeating element, and one with the subCompositeElement as my repeating element, but with no luck (there is no element in the subCompositeElement that can be used to join those to their parent repeatingElement, they just belong to a particular repeatingElement by virtue of their place in the XML document.)

    Does anyone have any examples or hints regarding how to read this kind of complex XML file, where output needs to come from several levels in it?

    Thanks in advance for any insight!

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    Well, I usually pre-process those kind of files with perl

    The kind of XML inputs Kettle currently supports is "limited". The limitation is mostly from the fact that you have to be able to make rows from the input (which always have to be of the same format).

    Regards,
    Sven
    Last edited by sboden; 09-14-2007 at 09:17 AM.

  3. #3
    Join Date
    Aug 2007
    Posts
    3

    Default

    Ah, that is indeed a bummer.

    Is there a way to include the external preprocessing in a step in the Spoon transformation (such as using Java calling XSLT tranformations to flatten the file), or do you typically do that as a completely separate process (executing your Perl scripts that is), before the resulting flattened XML files even are visible to your Spoon tranformations?

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    In a job you can run an XLST transformation. You could some pre-processing from a cmd job entry.

    Personally I do it usually in a pre-pre formatting step. The ETL is not even aware it's touching XML files, just plain ASCII text files. Same with multi level XML output, I like to think it already saved me a couple of serious headaches

    Regards,
    Sven
    Last edited by sboden; 09-14-2007 at 01:36 PM.

  5. #5
    Join Date
    Aug 2007
    Posts
    3

    Default

    Thanks for the tip. I was already going down the XSLT path; the process that is providing Spoon with the XML files can do XSLT transforms too, so my fallback plan is for it to do that so that Spoon always will see only flattened XML files.

    Thanks again!

  6. #6
    Join Date
    Jun 2007
    Posts
    1

    Default Update to Kettle for complex XML

    Are there plans afoot for version 3.0 to support the ingestion of more complex XML using the Input XML component. We have 18,000 XML files with many levels of nested hierarchies (basically huge volumes of data per file), which need to be loaded on a daily basis and this would sure make life easier

  7. #7
    Join Date
    May 2006
    Posts
    4,882

    Default

    Not that I'm aware of right now. The problem is that the possible files being processed are limited by the fact that you need to output rows on the output side and that these rows need to be of the same format.

    Not that it's an excuse but all of the other ETL tools have the same problem... XML is easy to generate, but problematic to process when deeply nested with optional components, ....

    Regards.
    Sven

  8. #8
    Join Date
    Jul 2007
    Posts
    2,498

    Default

    There would be a simple funcionality that could improve a lot xml processing: atm we can only extract values that are on the same level node.

    If it was possible to extract some constant value from a parent node, it would increase xml handling by a large margin. I'm not sure if I'm making myself clear, so I'll put an example:

    Code:
    <header>
        <id>1234</id>
    </header>
    <line>
       <number>1</number>
       <value>10</value>
    </line>
    <line>
        <number>2</number>
        <value>20</value>
    </line>
    <line>
        <number>3</number>
        <value>30</value>
    </line>
    Currently, we can only setup kettle (afaik) to get 1|10, 2|20, 3|30. If there was a way to extract the header id and get 1234|1|10, 1234|2|20, 1234|3|30. I don't mind having to do multiple passes over the same xml, but most of the times I don't have at the same level the "keys" to relate the tables; I have to mess with system variables and joins just to do what I need.

    Just my 2 cents
    Pedro Alves
    Meet us on ##pentaho, a FreeNode irc channel

  9. #9
    Join Date
    May 2006
    Posts
    4,882

    Default

    Well, raise a JIRA tracker and in a very detailed way describe what you want

    Regards,
    Sven

  10. #10
    Join Date
    Jul 2007
    Posts
    2,498

    Default

    Pedro Alves
    Meet us on ##pentaho, a FreeNode irc channel

  11. #11

    Default

    Sven,
    Do you remember,i wrote someday a XML inputPath step....
    This step was a plugin now and should cover this situation.
    We have also added some extra features:
    - Possibility to use a XML stream as input (rather that file)
    - Extract repeating values at same level (new row for each value...)
    ...

    I have already sent the plugin to jens (refer to ew plugin space).
    I will soon sent the impoved one.

    Rgds

  12. #12

    Default

    We deal with "large" XML files at work.
    Most the the time i use XML Input path plugin..and XSLT step (ported from job).
    The XSLT step receive a stream in input and output another stream..(row by row)
    so possibilities are many.

    In PDI 3.0 allow the use of XSLT 2(saxon)

    Rgds

    Samatar

  13. #13
    Join Date
    Oct 2007
    Posts
    16

    Default

    Hi,

    I was just wandering if you ever resolved this. I have exactly the same problem. It seems that someone was developing a plug-in step for this, but there is no further update. is this step (XML inputPath or similar) availible for use and if so where can I find it?

    Thanks

    Magi

  14. #14

    Default Multi Level XML

    Dear Sir,

    Did any find the solution for the problem reported . I have a multi level xml file which i like to take into table . But unable to do so . Please guide me .on this

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.