Hitachi Vantara Pentaho Community Forums

Thread: getXMLData vs. Streaming XML Input

  1. #1
    Join Date
    Sep 2008
    Posts
    3

    getXMLData vs. Streaming XML Input

    Hi, I'm new to Kettle. Please help me find answers to the following two questions:
    1. The user guide says that Streaming XML Input uses SAX parsing. Does getXMLData use SAX as well?
    2. Do the SAX-parser-based steps (Streaming XML Input/getXMLData) pass all parsed data to the next step only after processing has finished, or do they pass data on in chunks during execution?

  2. #2

    Hi,
    1. getXMLData uses DOM4J.

    2. getXMLData parses the input file and outputs rows based on the loop XPath you specified.
    At this time, I would not recommend it for big files (it loads the whole document into memory). For that case, use the Streaming XML Input step.


    Samatar
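
To make the memory point concrete, here is a minimal sketch of the "load everything, then loop over a path" approach. It is written in Python with `xml.etree.ElementTree` rather than Kettle's Java and DOM4J, so it only illustrates the idea and is not how the step is implemented. The function name `rows_from_tree`, the sample document, and the simple path syntax are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, small enough to hold in memory.
SAMPLE = (b"<root><things>"
          b"<thing><name>a</name><size>1</size></thing>"
          b"<thing><name>b</name><size>2</size></thing>"
          b"</things></root>")

def rows_from_tree(source, loop_path, fields):
    """Build the full tree first, then emit one row per node matched by loop_path."""
    tree = ET.parse(source)  # the entire document is in memory after this call
    return [[node.findtext(field) for field in fields]
            for node in tree.getroot().findall(loop_path)]

rows = rows_from_tree(io.BytesIO(SAMPLE), "things/thing", ["name", "size"])
print(rows)  # [['a', '1'], ['b', '2']]
```

No row is available until the parse call has returned, which is why this style becomes a problem once the file no longer fits in memory.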

  3. #3
    Join Date
    May 2006
    Posts
    4,882

    If possible, use "get XML data"; otherwise use "streaming xml".

    "get XML data" offers the most flexibility for extracting data.

    Also, streaming xml sometimes behaves strangely, and unless I'm mistaken that hasn't been fixed yet: sometimes it eats part of the input data for no apparent reason.

    Regards,
    Sven

  4. #4
    Join Date
    Sep 2008
    Posts
    3

    Thanks for the answer, but I still need clarification about "Streaming XML Input". Is the parsed data first stored in memory and then passed to the next step, or is it passed on while parsing is still running? Can I define the size of the data chunks passed to the next step?


    Thanks in advance.

  5. #5
    DEinspanjer Guest

    Streaming XML Input is exactly as advertised. It uses SAX parsing methods and can read a large XML file without reading the entire structure into memory. The current implementation is rather limited in terms of functionality though. There are many types of XML schemas that it cannot properly parse.
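
As a rough illustration of what SAX-style parsing means for row delivery, the sketch below uses Python's `xml.sax` (not Kettle's Java code) to hand each record to a callback as soon as its closing tag is seen, without building a tree. The class `RowHandler`, the helper `collect_rows`, and the assumption that records are flat (one level of child elements) are all invented for the example.

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Call on_row(dict) each time a record element closes."""

    def __init__(self, record_tag, on_row):
        super().__init__()
        self.record_tag = record_tag
        self.on_row = on_row  # invoked during parsing, once per record
        self.row = None       # dict for the record being read, else None
        self.field = None     # name of the child element being read
        self.text = []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.row = dict(attrs.items())  # attributes become fields
        elif self.row is not None:
            self.field, self.text = name, []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)  # text may arrive in several pieces

    def endElement(self, name):
        if name == self.record_tag:
            self.on_row(self.row)  # the row leaves here, mid-parse
            self.row = None
        elif name == self.field:
            self.row[name] = "".join(self.text)
            self.field = None

def collect_rows(xml_bytes, record_tag):
    """Convenience wrapper for small inputs: gather the emitted rows in a list."""
    rows = []
    xml.sax.parseString(xml_bytes, RowHandler(record_tag, rows.append))
    return rows
```

In a real pipeline the callback would forward each row downstream instead of appending to a list, so memory use depends on the size of one record rather than on the size of the file.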

  6. #6

    Streaming XML Input Documentation

    Is there any documentation of the problems observed with Streaming XML Input? For that matter, is there any documentation about how to use it effectively? No combination of content locations and fields elements/defining attributes seems to define the metadata needed to parse my data file properly. I've exhausted Google looking for useful hints and kinks.

    FWIW, my data is akin to:

    <root>
    <things>
    <thing id="1234">
    <thing_elem_1>value</thing_elem_1>
    <thing_elem_2>value</thing_elem_2>
    </thing>
    <thing id="5678">
    <thing_elem_1>value</thing_elem_1>
    <thing_elem_2>value</thing_elem_2>
    </thing>
    [...]
    </things>
    </root>

    and I must pull out the thing id and each thing elem value as a row instance.

    Thanks for any help you can provide, folks!

    Rod
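
For the layout described above, one row per `<thing>` with its id and element values, a streaming extraction can be sketched as follows. This uses Python's `xml.etree.ElementTree.iterparse` as a stand-in and says nothing about how to configure the Kettle step. The function name `iter_thing_rows` and the sample are assumptions, and the sample quotes the `id` values, which XML requires for the document to be well-formed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape described in the post, with quoted ids.
SAMPLE = b"""<root>
<things>
<thing id="1234">
<thing_elem_1>a</thing_elem_1>
<thing_elem_2>b</thing_elem_2>
</thing>
<thing id="5678">
<thing_elem_1>c</thing_elem_1>
<thing_elem_2>d</thing_elem_2>
</thing>
</things>
</root>"""

def iter_thing_rows(source):
    """Yield (id, thing_elem_1, thing_elem_2) for each <thing> as it closes."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "thing":
            yield (elem.get("id"),
                   elem.findtext("thing_elem_1"),
                   elem.findtext("thing_elem_2"))
            # Drop this record's children and text. The emptied element
            # itself stays attached to its parent until parsing ends.
            elem.clear()

for row in iter_thing_rows(io.BytesIO(SAMPLE)):
    print(row)
```

Because the function is a generator, each row can be consumed before the rest of the file has been read.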

  7. #7
    Join Date
    May 2006
    Posts
    4,882

    As for the problems: sometimes streaming XML would "eat" data. It would seem to go blind for a block of your XML file and would mostly choke on it. As I recall, we never found the root cause (it was something in the XML libraries), but maybe it was fixed afterwards by upgrading the XML libraries.

    For more information on the step, there are some examples available. But to get all the data out of your XML file you will probably need getXMLData.

    Regards,
    Sven

  8. #8

    Thanks for your reply, Sven. Since one of my input files is 1.2 GB and getXMLData builds an in-memory DOM, that solution won't work on my 32-bit machines. =)

    Transforming the XML to CSV using STX is the next option to try. I might try transforming XML attributes to elements, and reading in the file using StreamingXMLInput. At least we'll find out if it's still dropping chunks of data.

    Rod
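
The attribute-to-element idea can also be tried as a streaming rewrite. The sketch below is not STX; it swaps in Python's `xml.sax` with `XMLGenerator` to re-emit the document while turning every attribute into a child element of the same name. The class `AttrsToElements` and the function `attrs_to_elements` are hypothetical names, and the sketch assumes no attribute shares a name with an existing child element.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class AttrsToElements(XMLGenerator):
    """Re-emit the SAX event stream, moving each attribute into a child element."""

    def startElement(self, name, attrs):
        super().startElement(name, {})  # write the tag without attributes
        for attr_name, value in attrs.items():
            super().startElement(attr_name, {})
            self.characters(value)      # escapes the value as element text
            self.endElement(attr_name)

def attrs_to_elements(xml_bytes):
    """Small-input helper: return the rewritten document as a string."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, AttrsToElements(out, encoding="utf-8"))
    return out.getvalue()

print(attrs_to_elements(b'<root><thing id="1234"><a>v</a></thing></root>'))
```

For a large file the same handler could be given an open text file as its output and driven with `xml.sax.parse` on the input path, so that neither the input nor the output is held in memory as a whole.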

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.