Hitachi Vantara Pentaho Community Forums

Thread: How to read a Webpage and save it, locally (by step "Modified Java Script Value")?

  1. #1
    Join Date: Oct 2012 | Posts: 6

    How to read a Webpage and save it, locally (by step "Modified Java Script Value")?

    I am very new to Pentaho and JavaScript.
    Could someone please give me an example?
    Simply, I need to parse a Webpage daily and take the necessary data from it.

    The idea is to save the page to a file in a first step and then parse it for data in a second step.

  2. #2
    Join Date: Nov 2008 | Posts: 777

    What webpage? Do you have an example?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  3. #3
    Join Date: Jun 2012 | Posts: 5,534

    Show what you already achieved and tell what makes you unhappy with that.

    From your post I can't tell if it's Kettle you don't understand or if you are out to hire a JavaScript developer.

    Try to give as much information as you can, so that I don't have to get back to you with more questions than you had in the first place.
    So long, and thanks for all the fish.

  4. #4

    I'll try to answer based on the basic information you gave:

    - you can use wget to download the web page and its related data (there are some posts in this forum about wget and Kettle);
    - you can use Selenium (a testing tool): write a small script that grabs the data via XPath expressions and launch it from the console in Kettle.

    That's my idea for the moment; I hope it helps.
    "The intuitive mind is a sacred gift and the rational mind is a faithful servant. We have created a society that honors the servant and has forgotten the gift." (A. Einstein)

  5. #5

    Some links that may interest you:
    http://rpbouman.blogspot.ch/2011/05/...ages-with.html
    https://github.com/gkfabs/Kettle-jsoup

    One personal remark about "Simply, I need to parse a Webpage daily":
    Parsing "freetext" content is never simple. Call me if you find a perfect solution for parsing partial data out of any HTML without hard layout dependencies.

  6. #6
    Join Date: Nov 2008 | Posts: 777

    There's actually a very easy-to-use web service that implements the "Tidy" function (referenced on Roland Bouman's blog) at http://services.w3.org/tidy/tidy. This can be used to clean up HTML and convert it to XML.

    All you have to do is pass the web service the URL of the web page you want to scrape as a parameter, along with the parameters forceXML and indent, both of which should be set to "on".
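
    For illustration, here is a minimal Java sketch of how such a request URL could be assembled before handing it to the HTTP Client Step. The class TidyUrl is hypothetical, and the parameter name docAddr is an assumption that should be checked against the service's own form; only forceXML and indent come from the description above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of Kettle): builds the request URL for the Tidy service.
class TidyUrl {
    static final String SERVICE = "http://services.w3.org/tidy/tidy";

    static String build(String pageUrl) throws UnsupportedEncodingException {
        // "docAddr" is an ASSUMED parameter name for the page to clean up.
        // The page URL must be percent-encoded because it is passed inside a query string.
        return SERVICE
            + "?docAddr=" + URLEncoder.encode(pageUrl, "UTF-8")
            + "&forceXML=on&indent=on";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("http://www.w3.org/Protocols/HTTP/HTRESP.html"));
    }
}
```

    The resulting string is what would go into the URL field of the HTTP Client Step.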

    Below and attached is an example that calls this web service with the HTTP Client Step on URL http://www.w3.org/Protocols/HTTP/HTRESP.html. Then it uses the Get Data From XML Step to grab the values of all the <h3> tags. Finally, the Select Values Step selects just the h3 field which contains a list of the HTTP Status Codes and their descriptions as they appear on that web page.

    Note, however, that this is a very simple web page without any scripts. More complex pages are going to be much more difficult to parse this way.
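
    Outside of Kettle, what the Get Data From XML Step does with the <h3> tags can be sketched in plain Java roughly as follows. This is only an approximation under assumptions: the class H3Extractor and the inline sample document are made up for the example, and the local-name() test is used because tidied output may carry an XHTML namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every <h3> element in a well-formed XML string.
class H3Extractor {
    static List<String> headings(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // local-name() matches h3 whether or not the element sits in a namespace.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//*[local-name()='h3']", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input standing in for the tidied page.
        String sample = "<html><body><h3>Success 2xx</h3><h3>Error 4xx</h3></body></html>";
        for (String heading : headings(sample)) {
            System.out.println(heading);
        }
    }
}
```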

    [Attachment: tidy.png (16.0 KB)]
    Last edited by darrell.nelson; 10-22-2012 at 08:14 PM. Reason: grammar

  7. #7
    Join Date: Oct 2012 | Posts: 6

    Thanks for the replies, but ...
    Hey, guys, the problem is much easier than you think))))
    I need to read a web page and pass its source code as a field to another table, or save it to a text file.
    Chain:
    1. "Generate Row" - planned to init structure. Then
    2. "HTTP Client" - planned to read page. Then
    3. "Get data from XML" - planned to parse page. Then
    4. Output in text file or excel sheet
    Where am I wrong?
    ┌────────────────┐ From Russia with love.
    │░░░░░░░░░░░░░░░░│ Sorry for my English.
    ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
    ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
    └────────────────┘

  8. #8
    Join Date: Oct 2012 | Posts: 6

    Originally posted by darrell.nelson:
    What webpage? Do you have an example?
    For example, I need to get the currency rates from http://www.ussurybank.ru/

  9. #9
    Join Date: Jun 2012 | Posts: 5,534

    Originally posted by amur27:
    Hey, guys, the problem is much easier than you think
    You wouldn't have asked if it were that easy for you ...

    Originally posted by amur27:
    Chain:
    1. "Generate Row" - planned to init structure. Then
    2. "HTTP Client" - planned to read page. Then
    3. "Get data from XML" - planned to parse page. Then
    4. Output in text file or excel sheet
    Where am I wrong?
    In most cases your flow will abort in step 3, since the markup will choke the DOM parser - but Darrell already told you that.
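
    To make the point concrete, here is a small Java sketch (the class name StrictParseCheck and the sample strings are invented for the example) showing that a strict XML parser rejects ordinary HTML such as an unclosed <br>. Presumably this is the same kind of failure the step would run into on raw page source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical check: does a strict XML parser accept the given markup?
class StrictParseCheck {
    static boolean isWellFormedXml(String markup) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Typical HTML (unclosed tags, bare attributes) ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML, but not well-formed XML: <br> is never closed.
        System.out.println(isWellFormedXml("<p>USD<br>EUR</p>"));
        // The same content with a self-closing tag parses fine.
        System.out.println(isWellFormedXml("<p>USD<br/>EUR</p>"));
    }
}
```

    The parser may also log the error to standard error; the return value is what matters here.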

    With the help of jsoup I'm able to extract the exchange rates from the ussurybank landing page, thanks to Jonathan Hedley.

    http://jsoup.org/download (278 kB)

    Place the jar where Kettle can find it.
    Attached Files

  10. #10
    Join Date: Nov 2008 | Posts: 777

    Originally posted by amur27:
    Thanks for the replies, but ...
    Hey, guys, the problem is much easier than you think))))
    I need to read a web page and pass its source code as a field to another table, or save it to a text file.
    Chain:
    1. "Generate Row" - planned to init structure. Then
    2. "HTTP Client" - planned to read page. Then
    3. "Get data from XML" - planned to parse page. Then
    4. Output in text file or excel sheet
    Where am I wrong?
    How much easier can it get?

    My example provided exactly what your "Chain:" describes. All you had to do was change the URL and the XPath expressions...and add a Text File Output Step.
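
    For comparison, the bare "read a page and save it locally" part, without any parsing, could look like the following Java sketch outside of Kettle. The class PageSaver and its command-line arguments are assumptions made for the example, not something taken from the attached transformation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a page's raw bytes into a local file.
class PageSaver {
    // Writes everything from the stream to target, overwriting an existing file.
    // Returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PageSaver <page-url> <output-file>");
            return;
        }
        // args[0] is the page URL, args[1] the local file to write.
        try (InputStream in = new URL(args[0]).openStream()) {
            System.out.println(save(in, Paths.get(args[1])) + " bytes written");
        }
    }
}
```

    This stores the bytes exactly as the server sent them, so any character-set handling is left to whatever reads the file afterwards.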
    Attached Files


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.