Hitachi Vantara Pentaho Community Forums

Thread: Can PDI/Kettle parse the following xml ?

  1. #1
    Join Date
    Oct 2008
    Posts
    7

    Question Can PDI/Kettle parse the following xml ?

    Dear fellows:

    I have read some of the Kettle documentation and am trying to parse some XML using the "Get Data From XML" step. After reading the online documentation, wiki, etc., I still cannot figure out the following:

    Q1. If there are multiple tags at the same level (same XPath) and I want only the last one, how can I ignore the earlier ones IF AND ONLY IF a later one exists?

    Example:

    <ddd:update>
    <ddd:name>ignore me</ddd:name>
    <ddd:rem>
    <ddd:ns>ignore me, please</ddd:ns>
    </ddd:rem>
    </ddd:update>

    <ddd:update>
    <ddd:name>This is my name</ddd:name>
    <ddd:add>
    <ddd:ns>add me , please</ddd:ns>
    </ddd:add>
    </ddd:update>

    Q2. How do I iterate over repeated elements that have the same attributes but different values, and take an action for each element? We do NOT know in advance how many there will be:

    <tag:s>
    <ddd:status s="S1" lang="en">Status 2</ddd:status>
    <ddd:status s="S2" lang="en">Status 4</ddd:status>
    <ddd:status s="S3" lang="en">Status 6</ddd:status>
    <!-- could be more like above -->
    </tag:s>

    If I do not know how many of the above will be repeated, how do I loop through them and get all the existing attributes and values?

    Thank you very much,

    Frank

  2. #2

    Default

    Hi,

    I don't understand your point 2, but for the first, use the XPath predicate [position()=last()].
    For example (in the Loop XPath):


    ../update[position()=last()]
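
A minimal sketch of what the last() predicate selects, using Python's standard xml.etree.ElementTree rather than PDI itself. The posted fragment does not declare the ddd prefix, so the wrapper <root> element and the namespace URI urn:example:ddd are assumptions made only to get a parseable document. ElementTree spells the predicate [last()], which is equivalent to [position()=last()].

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the original fragment never declares "ddd".
NS = {"ddd": "urn:example:ddd"}

XML = """
<root xmlns:ddd="urn:example:ddd">
  <ddd:update>
    <ddd:name>ignore me</ddd:name>
    <ddd:rem><ddd:ns>ignore me, please</ddd:ns></ddd:rem>
  </ddd:update>
  <ddd:update>
    <ddd:name>This is my name</ddd:name>
    <ddd:add><ddd:ns>add me , please</ddd:ns></ddd:add>
  </ddd:update>
</root>
"""

root = ET.fromstring(XML)

# [last()] keeps only the final ddd:update child of root.
# If only one update exists, it is also the last one, so it is still kept.
last_updates = root.findall("ddd:update[last()]", NS)
last_name = last_updates[0].find("ddd:name", NS).text
print(last_name)
```

Because a lone element is both first and last, the same expression also covers the "only if a later one exists" condition from Q1.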



    Rgds

    Samatar

  3. #3
    Join Date
    Oct 2008
    Posts
    7

    Default

    Thank you, Samatar.

    For the second one, what I am trying to ask (with a simpler example) is how to get the values 1, 2, 3 from the XML below:

    <a>
    <b>1</b>
    <b>2</b>
    <b>3</b>
    </a>

    In the "Get XML Data" step, there is a column called "Repeat" with the values 'Y' and 'N'. I cannot figure out why 'Y' does not work.

    Anyway,
    Thank you
    Frank

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    "Repeat" repeats the last non-null value when the value in question is null.
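
To illustrate that description, here is a rough Python sketch of the fill-down behavior. The function name apply_repeat is hypothetical and this is not PDI code; it only mirrors the rule stated above, and it does not cover how the step treats empty strings.

```python
from typing import List, Optional


def apply_repeat(values: List[Optional[str]]) -> List[Optional[str]]:
    """Replace each null with the most recent non-null value seen so far."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        # A null before any non-null value stays null.
        filled.append(last_seen)
    return filled


column = ["1", None, None, "2", None]
print(apply_repeat(column))  # ['1', '1', '1', '2', '2']
```

So "Repeat" fills gaps in a column; it does not make the step loop over repeated elements.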

  5. #5
    Join Date
    Oct 2008
    Posts
    7

    Default

    Thank you Matt.

    What I am trying to ask is how to get the values 1, 2, 3 from:

    <a>
    <b>1</b>
    <b>2</b>
    <b>3</b>
    </a>

    It looks easy, but I am not sure how to do it in Kettle.

    Thanks,
    Frank

  6. #6
    Join Date
    Nov 1999
    Posts
    9,729

  7. #7
    Join Date
    Oct 2008
    Posts
    7

    Default

    Thank you Matt.

    Actually, the wiki does not show the case I mentioned. The big difference is that in all the XML examples in the wiki, the deepest level always has exactly one element.

    In my example, the deepest level has multiple elements.

    I would appreciate it if you could take a closer look so you can see the difference. Best of all would be if you could shed some light on how to do it.

    Thanks,
    Frank

  8. #8

    Default

    Same problem for me. Did you solve your problem?

    <RoomTypes RoomCount="38">
    <RoomType Code="SB">Camere singole</RoomType>
    <RoomType Code="DB">Camere matrimoniali</RoomType>
    <RoomType Code="TB">Camere doppie</RoomType>
    <RoomType Code="TR">Camere triple</RoomType>
    <RoomType Code="Q">Camere quadruple</RoomType>
    <RoomType Code="NS">Camere non fumatori</RoomType>
    </RoomTypes>

    I want to get all RoomType nodes, but with "Read Data from Xml File" I can get only the first one. I could get the fields one by one, but the number of these fields is variable.


  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    With the sample XML you posted, the "Get Data from XML" step works and produces this list:

    38|SB|Camere singole
    38|DB|Camere matrimoniali
    38|TB|Camere doppie
    38|TR|Camere triple
    38|Q|Camere quadruple
    38|NS|Camere non fumatori

    This works if you specify your input, loop, and fields correctly.

    TIP: You can use "." and ".." as an XPath.
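
As a rough illustration of the tip, the Python sketch below reproduces the list above with the standard xml.etree.ElementTree module. It is not PDI configuration: the comments map each value to an XPath that could plausibly be used as a field, and "../@RoomCount" in particular is an assumption, not something stated in this thread.

```python
import xml.etree.ElementTree as ET

XML = """
<RoomTypes RoomCount="38">
  <RoomType Code="SB">Camere singole</RoomType>
  <RoomType Code="DB">Camere matrimoniali</RoomType>
  <RoomType Code="TB">Camere doppie</RoomType>
  <RoomType Code="TR">Camere triple</RoomType>
  <RoomType Code="Q">Camere quadruple</RoomType>
  <RoomType Code="NS">Camere non fumatori</RoomType>
</RoomTypes>
"""

root = ET.fromstring(XML)

# ElementTree elements have no parent pointer, so build a
# child -> parent map to emulate the ".." step.
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
for room in root.iter("RoomType"):         # one output row per RoomType node
    rows.append("|".join([
        parent_of[room].get("RoomCount"),  # ".." then an attribute (assumed ../@RoomCount)
        room.get("Code"),                  # attribute of the current node (@Code)
        room.text,                         # "." is the current node's own text
    ]))

print("\n".join(rows))
```

The point is that the loop runs over the repeated element, and every field is then resolved relative to the node the loop is currently on.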

    Side note: I don't use the "Get Data from XML" step in any of my transformations; I have only read through the information on the forum. If I can learn how to make it work correctly from the forum, you can too.
    Last edited by gutlez; 08-11-2009 at 01:58 PM.

  10. #10
    Join Date
    Jun 2008
    Posts
    3

    Default get data from xml

    I have tried the same thing with an xml as below:

    <lcx:Type tGUID="guid3669EC21-8E41-438A-AA1A-26B477C15BE0" abstract="true">
    <TypeName>Entity</TypeName>
    <ImageFile>vlv_general</ImageFile>
    <LNMapping locale="US">
    <IconFile>General</IconFile>
    <TypeName>Entity</TypeName>
    </LNMapping>
    <Documentation>
    <lcx:Synonym>Abstract Entity</lcx:Synonym>
    <lcx:Synonym>Object</lcx:Synonym>
    <lcx:Synonym>Thing</lcx:Synonym>
    <Description>General semantic type that refers to any object or thing. </Description>
    </Documentation>
    </lcx:Type>
    <lcx:Type tGUID="guid8062D547-671B-4119-B130-500C5DC111CB" kindOf="guid3669EC21-8E41-438A-AA1A-26B477C15BE0">
    <TypeName>Annotation</TypeName>
    <ImageFile>vlv_annotation</ImageFile>
    <LNMapping locale="US">
    <IconFile>General</IconFile>
    <TypeName>Annotation</TypeName>
    </LNMapping>
    <Documentation>
    <lcx:Synonym>Comment</lcx:Synonym>
    <lcx:Synonym>Explanation</lcx:Synonym>
    <Description>A comment or explanation. Appropriate for a body of text that stands alone. </Description>
    </Documentation>
    </lcx:Type>

    I have tried many different variations of the definition but could not get the XML parsed correctly.

    The closest configuration I found is:

    Loop XPath:
    /lcx:LibraryCatalogue/lcx:Type/Documentation/lcx:Synonym

    | Name | XPath | Element | Type |
    | Synonym | ../lcx:Synonym | Node | String |

    This way I get the following result:

    Abstract Entity
    Abstract Entity
    Abstract Entity
    Comment
    Comment

    Instead, it should be:

    Abstract Entity
    Object
    Thing
    Comment
    Explanation

    How should I define the loop XPath and the field XPaths?
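
One plausible reading of the output above, not confirmed by anyone in this thread: with the loop already on each lcx:Synonym, the field XPath ../lcx:Synonym climbs to Documentation and matches all of its Synonym children, and only the first match ends up in the row. The Python sketch below contrasts that with reading the current node ("."). The trimmed document, the LibraryCatalogue root (taken from the loop XPath), and the namespace URI urn:example:lcx are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the posted fragment never declares "lcx".
NS = {"lcx": "urn:example:lcx"}

XML = """
<lcx:LibraryCatalogue xmlns:lcx="urn:example:lcx">
  <lcx:Type>
    <Documentation>
      <lcx:Synonym>Abstract Entity</lcx:Synonym>
      <lcx:Synonym>Object</lcx:Synonym>
      <lcx:Synonym>Thing</lcx:Synonym>
    </Documentation>
  </lcx:Type>
  <lcx:Type>
    <Documentation>
      <lcx:Synonym>Comment</lcx:Synonym>
      <lcx:Synonym>Explanation</lcx:Synonym>
    </Documentation>
  </lcx:Type>
</lcx:LibraryCatalogue>
"""

root = ET.fromstring(XML)
parent_of = {child: parent for parent in root.iter() for child in parent}

# Nodes the loop visits, one per output row.
loop_nodes = root.findall("lcx:Type/Documentation/lcx:Synonym", NS)

# "../lcx:Synonym": go up to Documentation, then take its FIRST Synonym child.
via_parent = [parent_of[n].find("lcx:Synonym", NS).text for n in loop_nodes]

# ".": read the text of the node the loop is currently on.
via_self = [n.text for n in loop_nodes]

print(via_parent)
print(via_self)
```

Under that assumption, keeping the same loop XPath and changing the field XPath to "." would be the thing to try.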

  11. #11

    Default

    In the Get data from XML step, set Loop XPath to /RoomTypes/RoomType

    Then set fields as follows:
    | Name | XPath | Element | Type | ... |
    | Code | @Code | Node | String | ... |
    | Name | . | Node | String | ... |


  12. #12
    Join Date
    Nov 2010
    Posts
    23

    Arrow Can I parse this XML using the Get data from XML

    <content>
    <row>
    <field name="Result ID"><![CDATA[5417151]]></field>
    <field name="Media type"><![CDATA[Wiski]]></field>
    <field name="Content "><![CDATA[Neutral]]></field>
    <field name="Author name" />
    <field name="Permalink"><![CDATA[http://pmwiki.org/pmwiki.php]]></field>
    <field name="URL"><![CDATA[http://pes.org]]></field>
    <field name="position"><![CDATA[1]]></field>
    </row>

    I'm having the same issue as the original poster, and I'm not sure how to write the XPath so that Kettle loops through the <row> node and extracts the actual values stored inside the CDATA sections, i.e. 5417151, Wiski, etc.

    I only managed to extract the name attribute of each <field> (Result ID, Media type, etc.) and the value of the first field, "Result ID". I can never get past that one; I can't get values for "Author name" and the rest.

    Please let me know of any ideas you may have

    Thanks,

  13. #13
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    harraz,

    PMs are generally frowned upon on this board for clarifying threads; everyone here wants to help everyone else.

    If you make your XML well-formed (you need to close the content element) and use a Loop XPath of /content/row/field, you can use the following fields:
    Code:
    Name    XPath    Element    Type    Format    Length    Precision    Currency    Decimal    Group    Trim type    Repeat
    name    @name    Node    String                            none    N
    Value    .    Node    String                            none    N
    ID    ../field[@name="Result ID"]    Node    String                            none    N
    Then denormalize on the Result ID.
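
For readers who want to see what those three fields yield, here is a small Python sketch using the standard xml.etree.ElementTree module. It is not PDI itself; it only evaluates the same relative paths against the posted XML, with the content element closed.

```python
import xml.etree.ElementTree as ET

XML = """
<content>
  <row>
    <field name="Result ID"><![CDATA[5417151]]></field>
    <field name="Media type"><![CDATA[Wiski]]></field>
    <field name="Content "><![CDATA[Neutral]]></field>
    <field name="Author name" />
    <field name="Permalink"><![CDATA[http://pmwiki.org/pmwiki.php]]></field>
    <field name="URL"><![CDATA[http://pes.org]]></field>
    <field name="position"><![CDATA[1]]></field>
  </row>
</content>
"""

root = ET.fromstring(XML)

rows = []
for row in root.findall("row"):
    # Same node that ../field[@name="Result ID"] reaches from any field of this row.
    result_id = row.find("field[@name='Result ID']").text
    for field in row.findall("field"):   # loop path /content/row/field
        rows.append((
            field.get("name"),           # @name
            field.text or "",            # "." (the parser unwraps CDATA; empty element gives "")
            result_id,                   # ID shared by every field of the row
        ))

for name, value, rid in rows:
    print(f"{name}|{value}|{rid}")
```

Each field becomes its own row, and the shared ID column is what later allows the rows to be grouped back into one record.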
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  14. #14
    Join Date
    Nov 2010
    Posts
    23

    Default

    Thanks, that worked!

    I didn't have to use denormalize; I just removed the repeating or redundant rows using the Unique components.

    BTW: I didn't mean to get an answer to the question when I sent you a PM; I meant something else.

  15. #15
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Denormalize will change:

    Key|Value|ID
    Result ID|5417151|5417151
    Media type|Wiski|5417151
    Content|Neutral|5417151
    Author name||5417151
    Permalink|http://pmwiki.org/pmwiki.php|5417151
    URL|http://pes.org|5417151
    position|1|5417151

    Into:
    Result ID|Media type|Content|Author name|Permalink|URL|position|
    5417151|Wiski|Neutral||http://pmwiki.org/pmwiki.php|http://pes.org|1|

    Though you should be able to do the same in the XML input too...
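
The pivot described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how PDI's denormalizer is implemented; the function name denormalize and the tuple layout (key, value, id) are assumptions chosen to match the table in this post.

```python
def denormalize(rows):
    """Pivot (key, value, id) rows into one dict per id, keyed by field name."""
    wide = {}
    for key, value, group_id in rows:
        # Rows sharing an id are folded into the same record.
        wide.setdefault(group_id, {})[key] = value
    return wide


rows = [
    ("Result ID", "5417151", "5417151"),
    ("Media type", "Wiski", "5417151"),
    ("Content", "Neutral", "5417151"),
    ("Author name", "", "5417151"),
    ("Permalink", "http://pmwiki.org/pmwiki.php", "5417151"),
    ("URL", "http://pes.org", "5417151"),
    ("position", "1", "5417151"),
]

wide = denormalize(rows)
record = wide["5417151"]
print("|".join(record.keys()))
print("|".join(record.values()))
```

Seven key/value rows with the same ID collapse into a single wide record, which is why removing duplicates alone gives a different result from denormalizing.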


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.