Hitachi Vantara Pentaho Community Forums

Thread: Kettle scraping data from weblinks and URLs

  1. #1
    Join Date
    Apr 2012

    Kettle scraping data from weblinks and URLs

    Hi Team,

    I have been using Kettle for a long time and want to go one step further and explore new functionality of this awesome tool. I have been trying to download HTML files from a URL such as "", where I plan to download pages 1 to 7 (to capture all the reviews) and then extract the data fields from them; another approach would be some sort of web crawling. So far I have been using RapidMiner to crawl through the review pages and extract data from the individual reviews, but I would love to show the customer that the same can be achieved with Kettle. I have not been very successful in my quest, though: I was able to download the files, but I am not able to extract the relevant data from the reviews, e.g.:

    Name of the reviewer, type, the ratings given by the reviewer, the site's overall rating for the product, the review text and comments, etc. Is there a way to get this done? If anyone has an idea, please guide me. Thank you very much in advance!

    - Kaustubh
    Last edited by Kausty; 09-29-2013 at 05:42 PM. Reason: Typo

  2. #2
    Join Date
    Jun 2012


    While Kettle provides all the necessary steps to implement a workflow, web scraping itself is not the focus of Kettle.
    But yes, it can be done - here are some links:

    Easy way to scrape web data

    How to load content from a HTML document

    Cleaning webpages with Pentaho Data Integration and JTidy
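
    As a rough sketch of the extraction logic (not taken from the linked posts), the idea is: build one URL per review page, clean each downloaded page into well-formed XHTML, then pull the fields out with XPath. In Kettle that would roughly correspond to an HTTP Client step followed by a Get data from XML step. The Python below only illustrates the logic; the URL pattern, the sample markup, the class names, and the helper names page_urls and extract_reviews are all assumptions made for the example, since the real site is not given in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical URL pattern; the real review URL is not given in the thread.
URL_TEMPLATE = "https://example.com/product/reviews?page={page}"


def page_urls(first=1, last=7):
    """Build one URL per review page (pages 1 to 7 in the question)."""
    return [URL_TEMPLATE.format(page=n) for n in range(first, last + 1)]


# Hypothetical markup, assumed to be already cleaned into well-formed XHTML
# (the kind of output a tidy pass is meant to produce).
SAMPLE_PAGE = """
<html><body>
  <div class="review">
    <span class="reviewer">Alice</span>
    <span class="rating">4</span>
    <p class="comment">Works well.</p>
  </div>
  <div class="review">
    <span class="reviewer">Bob</span>
    <span class="rating">2</span>
    <p class="comment">Stopped working after a week.</p>
  </div>
</body></html>
"""


def extract_reviews(xhtml):
    """Return one dict per review block, with one key per extracted field."""
    root = ET.fromstring(xhtml.strip())
    rows = []
    # The "loop" XPath selects each review; the field paths are relative to it.
    for review in root.findall(".//div[@class='review']"):
        rows.append({
            "reviewer": review.findtext("span[@class='reviewer']", default="").strip(),
            "rating": int(review.findtext("span[@class='rating']", default="0")),
            "comment": review.findtext("p[@class='comment']", default="").strip(),
        })
    return rows


if __name__ == "__main__":
    for url in page_urls():
        print(url)
    for row in extract_reviews(SAMPLE_PAGE):
        print(row)
```

    The split into a loop path (one match per review) and relative field paths is the part that carries over: each review becomes one row, and each field path becomes one column. The actual paths have to be worked out against the real page source.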
    So long, and thanks for all the fish.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.