Hitachi Vantara Pentaho Community Forums
Results 1 to 6 of 6

Thread: Need Suggest for performing ETL operation on XML files.

  1. #1
    Join Date
    Dec 2013
    Posts
    4

    Arrow Need Suggest for performing ETL operation on XML files.

    Hello all,

    I am working on Pentaho Data Integration (Kettle) and trying to perform some ETL operation on XML files.

    Following is the XML file:

    <host id="100">
    <ip>10.10.16.1</ip>
    <hostname>Name not in DNS</hostname>
    <netbiosname/>
    <hscore>346</hscore>
    <howner>Network</howner>
    <assetvalue>3</assetvalue>
    <os>Cisco</os>
    <nbtshares/>
    <fndv id="1191" port="22" proto="tcp"/>
    <fndv id="6266" port="22" proto="tcp"/>
    <fndv id="7290" port="22" proto="tcp"/>
    <fndv id="45208" port="22" proto="tcp"/>
    <fndv id="45211" port="22" proto="tcp"/>
    <fndv id="45212" port="22" proto="tcp"/>
    <fndv id="45213" port="22" proto="tcp"/>
    <fndv id="71471" port="22" proto="tcp"/>
    <fndv id="71473" port="22" proto="tcp"/>
    <fndv id="78682" port="22" proto="tcp"/>
    <fndv id="78685" port="22" proto="tcp"/>
    <fndv id="83288" port="22" proto="tcp"/>
    <fndapp id="1619" port="22" proto="tcp"/>
    <fndapp id="7135" port="22" proto="tcp"/>
    <fndapp id="5092" port="123" proto="udp"/>
    <fndapp id="16157" port="123" proto="udp"/>
    <fndapp id="297" port="161" proto="udp"/>
    <fndapp id="1408" port="500" proto="udp"/>
    <fndapp id="8037" port="0" proto="ip"/>
    </host>

    I kept loopXpath as: /host and able to get fields. After previewing rows, only ip , hscore, howner, assetvalue, os fields data is shown. Is it possible to read row value such as "<fndv id="78682" port="22" proto="tcp"/>" using pentaho ?

    I need to extract data from XML file and write into a table such as IP address will be same (10.10.16.1) and fndv values along with them.

    IP fndv id port
    10.10.16.1 1191 22
    10.10.16.1 6266 22
    10.10.16.1 7290 22
    10.10.16.1 45208 22

    Looking forward to suggestion, Thanks in advance.

  2. #2
    Join Date
    Nov 2009
    Posts
    688

    Default

    You can read the data with the step "XML Input Stream" (StAX).

  3. #3
    Join Date
    Nov 2007
    Posts
    2

    Default

    If you use files, use 'Get XML Data' Input step

    Tab File: browse to File or Map > Add..

    Tab Content: Loop over Xpath /host/fndv

    Tab Fields:
    Name XPath Element Result type Type
    Ip_Id ../ip Node Value of String
    Fndv_id id Attribute Value of String
    Fndv_Port port Attribute Value of String

    Preview Rows shows:
    Ip_Id Fndv_id Fndv_Port
    10.10.16.1 1191 22
    10.10.16.1 6266 22
    10.10.16.1 7290 22
    10.10.16.1 45208 22
    10.10.16.1 45211 22
    10.10.16.1 45212 22
    10.10.16.1 45213 22
    10.10.16.1 71471 22
    10.10.16.1 71473 22
    10.10.16.1 78682 22
    10.10.16.1 78685 22
    10.10.16.1 83288 22
    Last edited by Fred125; 01-06-2014 at 08:40 AM.

  4. #4
    Join Date
    Dec 2013
    Posts
    4

    Default

    Thanks Fred125, It works on the sample data. As i tired same approach on replicated data where IP address is changes, its not showing IP address where fndv values not exits.


    Following sample data I used for testing, and trying to retrieve IP and fndv :


    <device>
    <host id="100">
    <ip>10.10.16.1</ip>
    <hostname>Name not in DNS</hostname>
    <netbiosname/>
    <hscore>346</hscore>
    <howner>Network</howner>
    <assetvalue>3</assetvalue>
    <os>Cisco</os>
    <nbtshares/>
    <fndv id="1191" port="22" proto="tcp"/>
    <fndv id="6266" port="22" proto="tcp"/>
    <fndv id="7290" port="22" proto="tcp"/>
    <fndv id="45208" port="22" proto="tcp"/>
    <fndv id="45211" port="22" proto="tcp"/>
    <fndv id="45212" port="22" proto="tcp"/>
    <fndv id="45213" port="22" proto="tcp"/>
    <fndv id="71471" port="22" proto="tcp"/>
    <fndv id="71473" port="22" proto="tcp"/>
    <fndv id="78682" port="22" proto="tcp"/>
    <fndv id="78685" port="22" proto="tcp"/>
    <fndv id="83288" port="22" proto="tcp"/>
    <fndapp id="1619" port="22" proto="tcp"/>
    <fndapp id="7135" port="22" proto="tcp"/>
    <fndapp id="5092" port="123" proto="udp"/>
    <fndapp id="16157" port="123" proto="udp"/>
    <fndapp id="297" port="161" proto="udp"/>
    <fndapp id="1408" port="500" proto="udp"/>
    <fndapp id="8037" port="0" proto="ip"/>
    </host>
    <device>
    <host id="104">
    <ip>10.10.16.2</ip>
    <hostname>Name not in DNS</hostname>
    <netbiosname/>
    <hscore>346</hscore>
    <howner>Network</howner>
    <assetvalue>3</assetvalue>
    <os>Cisco</os>
    <nbtshares/>
    <fndapp id="1619" port="22" proto="tcp"/>
    <fndapp id="7135" port="22" proto="tcp"/>
    <fndapp id="5092" port="123" proto="udp"/>
    <fndapp id="16157" port="123" proto="udp"/>
    <fndapp id="297" port="161" proto="udp"/>
    <fndapp id="1408" port="500" proto="udp"/>
    <fndapp id="8037" port="0" proto="ip"/>
    </host>
    <host id="101">
    <ip>10.10.16.3</ip>
    <hostname>Name not in DNS</hostname>
    <netbiosname/>
    <hscore>346</hscore>
    <howner>Network</howner>
    <assetvalue>3</assetvalue>
    <os>Cisco</os>
    <nbtshares/>
    <fndv id="1191" port="22" proto="tcp"/>
    <fndv id="6266" port="22" proto="tcp"/>
    <fndv id="7290" port="22" proto="tcp"/>
    <fndv id="45208" port="22" proto="tcp"/>
    <fndv id="45211" port="22" proto="tcp"/>
    <fndv id="45212" port="22" proto="tcp"/>
    <fndv id="45213" port="22" proto="tcp"/>
    <fndv id="71471" port="22" proto="tcp"/>
    <fndv id="71473" port="22" proto="tcp"/>
    <fndv id="78682" port="22" proto="tcp"/>
    <fndv id="78685" port="22" proto="tcp"/>
    <fndv id="83288" port="22" proto="tcp"/>
    <fndapp id="1619" port="22" proto="tcp"/>
    <fndapp id="7135" port="22" proto="tcp"/>
    <fndapp id="5092" port="123" proto="udp"/>
    <fndapp id="16157" port="123" proto="udp"/>
    <fndapp id="297" port="161" proto="udp"/>
    <fndapp id="1408" port="500" proto="udp"/>
    <fndapp id="8037" port="0" proto="ip"/>
    </host>
    </device>


    Output:


    ip_id fndv_id
    10.10.16.1 1191
    10.10.16.1 6266
    10.10.16.1 7290
    10.10.16.1 45208
    10.10.16.1 45211
    10.10.16.1 45212
    10.10.16.1 45213
    10.10.16.1 71471
    10.10.16.1 71473
    10.10.16.1 78682
    10.10.16.1 78685
    10.10.16.1 83288
    10.10.16.3 1191
    10.10.16.3 6266
    10.10.16.3 7290
    10.10.16.3 45208
    10.10.16.3 45211
    10.10.16.3 45212
    10.10.16.3 45213
    10.10.16.3 71471
    10.10.16.3 71473
    10.10.16.3 78682
    10.10.16.3 78685
    10.10.16.3 83288


    I am not able to understand why ip "10.10.16.2" is not shown in output with fndv value "Null"


    It would be really helpful for your kind suggestion.

    Quote Originally Posted by Fred125 View Post
    If you use files, use 'Get XML Data' Input step

    Tab File: browse to File or Map > Add..

    Tab Content: Loop over Xpath /host/fndv

    Tab Fields:
    Name XPath Element Result type Type
    Ip_Id ../ip Node Value of String
    Fndv_id id Attribute Value of String
    Fndv_Port port Attribute Value of String

    Preview Rows shows:
    Ip_Id Fndv_id Fndv_Port
    10.10.16.1 1191 22
    10.10.16.1 6266 22
    10.10.16.1 7290 22
    10.10.16.1 45208 22
    10.10.16.1 45211 22
    10.10.16.1 45212 22
    10.10.16.1 45213 22
    10.10.16.1 71471 22
    10.10.16.1 71473 22
    10.10.16.1 78682 22
    10.10.16.1 78685 22
    10.10.16.1 83288 22
    Last edited by manishvrm21; 01-07-2014 at 03:11 AM.

  5. #5
    Join Date
    Nov 2007
    Posts
    2

    Default

    Basically because there's no data to be extracted (occurs bottom-up).
    No tag, no data.

    In case there is always one or more 'fndv' and/or 'fndapp' tag(s),
    change XPath loop in /device/host/*[name()= 'fndv' or name()= 'fndapp'],
    and the mapping in:
    ip ../ip Node Value of String
    id id Attribute Value of String
    port port Attribute Value of String
    proto proto Attribute Value of String

    (partial) Preview:

    ip id port proto
    ..
    10.10.16.1 1619 22 tcp
    10.10.16.1 7135 22 tcp
    10.10.16.1 5092 123 udp
    10.10.16.1 16157 123 udp
    10.10.16.1 297 161 udp
    10.10.16.1 1408 500 udp
    10.10.16.1 8037 0 ip
    10.10.16.2 1619 22 tcp
    10.10.16.2 7135 22 tcp
    10.10.16.2 5092 123 udp
    ..

    In case you really want to include rows with only ip's (witch feel wrong to me, but I don't know your case), change XPATH to /device/host/*, use same mapping.
    Make in a next step all rows unique.

    That's it..




  6. #6
    Join Date
    Dec 2013
    Posts
    4

    Default

    Thanks Fred for great explanation, your suggestions works like a charm. Well I need to take "IP address" along with "fndv id" values only. If I will keep XPath as /device/host/*[name()='fndv'] and tag fields ../id and id, it will show the records along with 10.10.16.1, 10.10.16.3 but not 10.10.16.2 (as there is no data). Basically I need to retrieve those records also where fndv attribute doesn't exist, so "NULL" value store along with IP address.

    Expected output:

    ip_id fndv_id
    10.10.16.1 1191
    10.10.16.1 6266
    ---
    ---

    10.10.16.2 NULL

    10.10.16.3 1191
    10.10.16.3 6266
    ---

    Thank you

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.