Hitachi Vantara Pentaho Community Forums

Thread: Get Data XML parsing with nested nodes

  1. #1
    Join Date
    Jul 2011
    Posts
    22

    Question Get Data XML parsing with nested nodes

    Hello guys!
    I'm trying to solve a parsing problem in PDI. My sample is attached: getMesh.zip
    This is my input:


    <?xml version="1.0"?>
    <DescriptorRecordSet LanguageCode = "eng">
    <DescriptorRecord DescriptorClass = "1">
    <DescriptorUI>D000008</DescriptorUI>
    <DescriptorName>
    <String>Abdominal Neoplasms</String>
    </DescriptorName>
    <DateCreated>
    <Year>1999</Year>
    <Month>01</Month>
    <Day>01</Day>
    </DateCreated>
    <DateRevised>
    <Year>1995</Year>
    <Month>06</Month>
    <Day>08</Day>
    </DateRevised>
    <AllowableQualifiersList>
    <AllowableQualifier>
    <QualifierReferredTo>
    <QualifierUI>Q000737</QualifierUI>
    <QualifierName>
    <String>chemistry</String>
    </QualifierName>
    </QualifierReferredTo>
    <Abbreviation>CH</Abbreviation>
    </AllowableQualifier>
    <AllowableQualifier>
    <QualifierReferredTo>
    <QualifierUI>Q000821</QualifierUI>
    <QualifierName>
    <String>virology</String>
    </QualifierName>
    </QualifierReferredTo>
    <Abbreviation>VI</Abbreviation>
    </AllowableQualifier>
    </AllowableQualifiersList>
    <Annotation>general term for neopl of organs in the abdom cavity; prefer specific organ/neopl terms; /blood supply /chem /second /secret /ultrastruct permitted; coord IM with histol type of neopl if given (IM)
    </Annotation>
    <SeeRelatedList>
    <SeeRelatedDescriptor>
    <DescriptorReferredTo>
    <DescriptorUI>D034861</DescriptorUI>
    <DescriptorName>
    <String>Abdominal Wall</String>
    </DescriptorName>
    </DescriptorReferredTo>
    </SeeRelatedDescriptor>
    </SeeRelatedList>
    <TreeNumberList>
    <TreeNumber>1</TreeNumber>
    <TreeNumber>2</TreeNumber>
    </TreeNumberList>
    <ConceptList>
    <Concept PreferredConceptYN="Y">
    <ConceptUI>M0000008</ConceptUI>
    <ConceptName>
    <String>Abdominal Neoplasms</String>
    </ConceptName>
    <ConceptUMLSUI>C0000735</ConceptUMLSUI>
    <SemanticTypeList>
    <SemanticType>
    <SemanticTypeUI>T191</SemanticTypeUI>
    <SemanticTypeName>Neoplastic Process</SemanticTypeName>
    </SemanticType>
    </SemanticTypeList>
    <TermList>
    <Term ConceptPreferredTermYN="Y" IsPermutedTermYN="N" LexicalTag="NON" PrintFlagYN="Y" RecordPreferredTermYN="Y">
    <TermUI>T000016</TermUI>
    <String>Abdominal Neoplasms</String>
    <DateCreated>
    <Year>1999</Year>
    <Month>01</Month>
    <Day>01</Day>
    </DateCreated>
    <EntryVersion>ABDOMINAL NEOPL</EntryVersion>
    <ThesaurusIDlist>
    <ThesaurusID>NLM (1966)</ThesaurusID>
    </ThesaurusIDlist>
    </Term>
    <Term ConceptPreferredTermYN="N" IsPermutedTermYN="Y" LexicalTag="NON" PrintFlagYN="N" RecordPreferredTermYN="N">
    <TermUI>T000016</TermUI>
    <String>Abdominal Neoplasm</String>
    </Term>
    <Term ConceptPreferredTermYN="N" IsPermutedTermYN="Y" LexicalTag="NON" PrintFlagYN="N" RecordPreferredTermYN="N">
    <TermUI>T000016</TermUI>
    <String>Neoplasm, Abdominal</String>
    </Term>
    <Term ConceptPreferredTermYN="N" IsPermutedTermYN="Y" LexicalTag="NON" PrintFlagYN="N" RecordPreferredTermYN="N">
    <TermUI>T000016</TermUI>
    <String>Neoplasms, Abdominal</String>
    </Term>
    </TermList>
    </Concept>
    </ConceptList>
    </DescriptorRecord>
    </DescriptorRecordSet>
    I need to map it to a table, and I am looking for a way to produce that result.
    Here is the mapping between each column and its XPath query:

    mh_ui /DescriptorRecordSet/DescriptorRecord/DescriptorUI
    mh_name /DescriptorRecordSet/DescriptorRecord/DescriptorName/String
    mh_year /DescriptorRecordSet/DescriptorRecord/DateCreated/Year
    mh_subheadings /DescriptorRecordSet/DescriptorRecord/AllowableQualifiersList/AllowableQualifier/QualifierReferredTo/QualifierName/String Here I have to match all entries, not only the first: every Qualifier found should produce a new row in the stream, and the other fields must repeat the values for that record.
    mh_reference /DescriptorRecordSet/DescriptorRecord/SeeRelatedList/SeeRelatedDescriptor/DescriptorReferredTo/DescriptorName/String (same problem as mh_subheadings)
    mh_description /DescriptorRecordSet/DescriptorRecord/ConceptList/Concept/ScopeNote (same problem as mh_subheadings)
    mh_sinonimous /DescriptorRecordSet/DescriptorRecord/ConceptList/Concept/TermList/Term/String (same problem as mh_subheadings, but in this case I should also concatenate the strings, separated by a pipe)
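
To make the intended mapping concrete outside PDI, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. It is only an illustration, not a PDI transformation: the `SAMPLE` string is a trimmed-down subset of the record above, and the helper name `flat_fields` is hypothetical. It extracts the single-valued columns and joins all `Term/String` values with a pipe.

```python
import xml.etree.ElementTree as ET

# Trimmed-down subset of the sample record from the post (assumption:
# only the elements needed for this illustration are kept).
SAMPLE = """<DescriptorRecordSet LanguageCode="eng">
  <DescriptorRecord DescriptorClass="1">
    <DescriptorUI>D000008</DescriptorUI>
    <DescriptorName><String>Abdominal Neoplasms</String></DescriptorName>
    <DateCreated><Year>1999</Year></DateCreated>
    <ConceptList>
      <Concept PreferredConceptYN="Y">
        <TermList>
          <Term><String>Abdominal Neoplasms</String></Term>
          <Term><String>Abdominal Neoplasm</String></Term>
          <Term><String>Neoplasm, Abdominal</String></Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def flat_fields(record):
    """Return the single-valued columns plus the pipe-joined synonyms."""
    synonyms = [
        s.text for s in record.findall("ConceptList/Concept/TermList/Term/String")
    ]
    return {
        "mh_ui": record.findtext("DescriptorUI"),
        "mh_name": record.findtext("DescriptorName/String"),
        "mh_year": record.findtext("DateCreated/Year"),
        "mh_sinonimous": "|".join(synonyms),
    }


root = ET.fromstring(SAMPLE)
# One dictionary per DescriptorRecord; paths are relative to the record.
rows = [flat_fields(record) for record in root.findall("DescriptorRecord")]
print(rows[0])
```

The paths are written relative to each `DescriptorRecord`, which is the same idea as choosing a loop element and addressing fields relative to it.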

    So with the "Get data from XML" step I can parse the simple fields, but not the lists. How can I solve this? I attached my sample step together with my chunk.xml.
    The original file is about 300 MB in size.

    Thank you so much.
    Any help is appreciated!


  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Your Loop XPath must point at the most deeply nested repeating element of interest; in your case that is //AllowableQualifier, I guess.
    You must then change the XPath for mh_ui so that it is relative to that element: ../../DescriptorUI
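
The effect of looping on the repeating element can be illustrated outside PDI with a short Python sketch. This is an analogy, not PDI's implementation: `xml.etree.ElementTree` elements have no parent pointer, so a hypothetical `parent_of` map stands in for the `../..` step, and `SAMPLE` is a trimmed-down subset of the posted record.

```python
import xml.etree.ElementTree as ET

# Trimmed-down subset of the posted record (assumption: two qualifiers only).
SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000008</DescriptorUI>
    <AllowableQualifiersList>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierName><String>chemistry</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierName><String>virology</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
    </AllowableQualifiersList>
  </DescriptorRecord>
</DescriptorRecordSet>"""

root = ET.fromstring(SAMPLE)

# Map every element to its parent so we can walk upwards like "..".
parent_of = {child: parent for parent in root.iter() for child in parent}

rows = []
# "Loop XPath" analogue: one output row per AllowableQualifier.
for qualifier in root.iter("AllowableQualifier"):
    # Two steps up: AllowableQualifiersList, then DescriptorRecord ("../..").
    record = parent_of[parent_of[qualifier]]
    rows.append(
        (
            record.findtext("DescriptorUI"),
            qualifier.findtext("QualifierReferredTo/QualifierName/String"),
        )
    )

for row in rows:
    print(row)
```

Each qualifier yields its own row, and the record-level value is repeated on every row because it is looked up relative to the loop element.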

    However, with a 300 MB XML document you should consider using the streaming parser.
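
The general idea behind a streaming parser, independent of PDI, can be sketched in Python with `xml.etree.ElementTree.iterparse`: each record is handled as soon as its closing tag is read and is then cleared, so the whole document is never held as one fully populated tree. The function name `stream_records`, the constant `QUALIFIER_PATH`, and the in-memory `SAMPLE` are assumptions made for this illustration; a real run would pass a file path or an open file instead.

```python
import io
import xml.etree.ElementTree as ET

# Small in-memory stand-in for the large file (assumption for the example).
SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000008</DescriptorUI>
    <AllowableQualifiersList>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierName><String>chemistry</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierName><String>virology</String></QualifierName>
        </QualifierReferredTo>
      </AllowableQualifier>
    </AllowableQualifiersList>
  </DescriptorRecord>
</DescriptorRecordSet>"""

QUALIFIER_PATH = (
    "AllowableQualifiersList/AllowableQualifier/"
    "QualifierReferredTo/QualifierName/String"
)


def stream_records(source):
    """Yield (DescriptorUI, [qualifier names]) for each finished record."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DescriptorRecord":
            continue
        ui = elem.findtext("DescriptorUI")
        qualifiers = [s.text for s in elem.findall(QUALIFIER_PATH)]
        # Drop the record's children and text once its values are extracted.
        elem.clear()
        yield ui, qualifiers


records = list(stream_records(io.BytesIO(SAMPLE.encode("utf-8"))))
print(records)
```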
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Jul 2011
    Posts
    22

    Default

    I will try this approach. However, I would also like to understand how to use the streaming parser: I tried it, but I couldn't arrange the data so as to produce several rows per record, one for each item of the longest list in that record.
    If you could give me an example I would really appreciate it!
    Tomorrow morning I will try changing the Loop XPath; I hope this will create more rows.
    However, the number of rows created should be the maximum of the lengths of the list nodes inside a single record. I hope I have explained this clearly.
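
The row count described here (the length of the longest list in a record) can be sketched in Python with `itertools.zip_longest`. This only illustrates the intended output shape; it is not a PDI step, the helper name `expand` is hypothetical, and filling the shorter lists with `None` is an assumption about how missing values should appear.

```python
from itertools import zip_longest


def expand(record_fields, *lists):
    """Repeat the record-level fields once per position of the longest list.

    Shorter lists are padded with None (assumed representation of "no value").
    """
    return [
        record_fields + tuple(values)
        for values in zip_longest(*lists, fillvalue=None)
    ]


# Values taken from the sample record: two qualifiers, one related descriptor.
rows = expand(
    ("D000008", "Abdominal Neoplasms"),
    ["chemistry", "virology"],   # mh_subheadings
    ["Abdominal Wall"],          # mh_reference
)

for row in rows:
    print(row)
```

With two qualifiers and one related descriptor, this produces two rows, and the second row has an empty mh_reference.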
