Hitachi Vantara Pentaho Community Forums

Thread: Need Help

  1. #1
    Join Date: Mar 2013 | Posts: 148

    Need Help

    Hi,

    How can I parse an HTML page? I want to get all rows from the <div class="row"> elements. Is it possible to do this using PDI, and if so, how?

    Sample is attached below.
    Attached Files

  2. #2
    Join Date: Mar 2013 | Posts: 148

    Hi Marabu,

    I am designing an ETL based on your previous posts, but I am not able to get the content I want in the output.txt format. I have attached the ktr file. Please suggest how to get the complete rows that are inside the div section. The result has to look like the sample output mentioned below. Any suggestions?

    Attached Files
    Last edited by pentaho2013; 12-19-2014 at 08:41 AM. Reason: typo

  3. #3
    Join Date: Jun 2012 | Posts: 5,534

    The trick is to find the single enveloping element of all the elements you are interested in.
    After proper cleansing you would select the corresponding markup using method outerHtml().
    From there on "Get Data From XML" is your friend.

    BTW: Element tbody isn't even occurring in your sample input...
    So long, and thanks for all the fish.

  4. #4
    Join Date: Mar 2013 | Posts: 148

    Thanks for your reply, but I am not able to do it. Actually, the data doesn't contain tbody. Can you please help me get the sample output that is attached below?

  5. #5
    Join Date: Jun 2012 | Posts: 5,534

    If my previous post wasn't helpful, then let's hope somebody else can do better.

  6. #6
    Join Date: Mar 2013 | Posts: 148

    Sorry Marabu, I started working on this based on your previous post alone. Your post was very helpful to me, but I am not able to figure out how it can be done for this kind of input file.

  7. #7
    Join Date: Mar 2013 | Posts: 148

    Hi Marabu,

    I have tried adding additional code like this:

    //var title = doc.outputSettings(os).select("a[href]").text();
    //var links = doc.outputSettings(os).select("a").attr("href");

    With that I got a result set that combines multiple titles and hrefs into two single columns. How do I get each div value into its own row?
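
    On the one-row-per-div question: in jsoup, calling text() on a whole selection returns the combined text of every matched element, and attr() on a selection reads only the first match that has the attribute, which would explain the merged columns. The usual way out is to loop over the matched elements and build one row per element. The sketch below shows only that per-element shape in plain JavaScript, without jsoup or any PDI API. The sample markup, the function names splitRowDivs and toRows, and the title/href field names are all assumptions, since the attached files are not part of this thread. It is a simple string scanner, not an HTML parser.

```javascript
// Hypothetical sample markup; the real attachment is not available here.
var sampleHtml =
  '<div id="list">' +
  '<div class="row"><a href="/item/1">First <b>item</b></a></div>' +
  '<div class="row"><div class="inner"><a href="/item/2">Second item</a></div></div>' +
  '</div>';

// Return the outer markup of every <div class="row">, one string per div.
// Nested <div>s are tracked so the matching closing tag is found.
// Assumption: the class attribute is exactly "row" (no extra class names).
function splitRowDivs(html) {
  var blocks = [];
  var tagRe = /<div\b[^>]*>|<\/div\s*>/gi;
  var depth = 0;   // 0 means we are outside any row div
  var start = -1;  // index where the current row div begins
  var m;
  while ((m = tagRe.exec(html)) !== null) {
    var tag = m[0];
    var isClose = tag.charAt(1) === '/';
    if (depth === 0) {
      if (!isClose && /\bclass\s*=\s*["']row["']/i.test(tag)) {
        start = m.index;
        depth = 1;
      }
    } else if (isClose) {
      depth--;
      if (depth === 0) {
        blocks.push(html.substring(start, m.index + tag.length));
      }
    } else {
      depth++;
    }
  }
  return blocks;
}

// Build one {title, href} object per row div instead of two merged columns.
function toRows(html) {
  var rows = [];
  var blocks = splitRowDivs(html);
  for (var i = 0; i < blocks.length; i++) {
    // First link inside this div only; null fields if the div has no link.
    var link = /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a>/i.exec(blocks[i]);
    rows.push({
      href: link ? link[1] : null,
      title: link ? link[2].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim() : null
    });
  }
  return rows;
}
```

    Each object returned by toRows(sampleHtml) corresponds to one div, so a title stays paired with its own href. How those objects are turned into PDI output rows is left out of this sketch.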

    In another attempt I tried to get the rows from XML, but it throws an error every time it hits <img> or nbsp. I have attached the source file and the ETL script. Please let me know what I am doing wrong. For reference, I am attaching the whole file content (Scrapper4.txt) and the parsed file content (stext.txt).
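
    The XML errors are most likely because HTML is not well-formed XML: a void tag such as <img> is never closed, and &nbsp; is an HTML named entity that XML does not define. jsoup can serialize a document in XML syntax through its output settings, which is the sturdier route. The plain JavaScript sketch below only illustrates what that kind of cleansing amounts to. The function name toXmlFriendly, the list of void tags, and the sample input are assumptions, and the regular expressions cover only simple markup (for example, no ">" inside attribute values, and no entities other than &nbsp;).

```javascript
// Tags that HTML leaves unclosed (a subset; extend as needed).
var VOID_TAGS = 'area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr';

// Make a fragment of HTML more acceptable to an XML parser:
//  1. replace the HTML-only entity &nbsp; with its numeric form,
//  2. rewrite void tags such as <img ...> as self-closing <img .../>.
function toXmlFriendly(html) {
  var voidRe = new RegExp('<(' + VOID_TAGS + ')\\b([^>]*?)\\s*/?>', 'gi');
  return html
    .replace(/&nbsp;/g, '&#160;')
    .replace(voidRe, '<$1$2/>');
}

// Hypothetical input resembling one row div.
var cleaned = toXmlFriendly('<div class="row"><img src="/a/b.png">A&nbsp;B<br></div>');
```

    Tags that are already self-closing pass through in the same normalized form, so running the function twice does not change the result.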

    Attached Files
    Last edited by pentaho2013; 12-20-2014 at 01:39 AM. Reason: adding scripts
