Hitachi Vantara Pentaho Community Forums

Thread: xpath

  1. #1
    Join Date
    Mar 2013
    Posts
    148


    How can I extract the title, date and href values from the content below?

    I am attaching the ETL script I am trying on my system for reference. Please correct me if I am doing anything wrong.


    In the end I want two rows in the output, in this format:
    title | href | date
    SANS Honors People Who Made a Difference in Cybersecurity in 2014 | http://www.prnewswire.com/news-releases/sans-honors-people-who-made-a-difference-in-cybersecurity-in-2014-300011928.html | Dec 18, 2014, 11:00 ET
    Microsemi's Ultra-secure SmartFusion2 SoC FPGAs and IGLOO2 FPGAs Recognized on EDN's List of Hot 100 Products of 2014 | http://www.prnewswire.com/news-releases/microsemis-ultra-secure-smartfusion2-soc-fpgas-and-igloo2-fpgas-recognized-on-edns-list-of-hot-100-products-of-2014-300010576.html | Dec 17, 2014, 07:00 ET
    Attached Files
    Last edited by pentaho2013; 12-21-2014 at 03:37 AM.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534


    The way you are using Jsoup.clean(), you lose the container of all the content you're interested in.
    If you don't know what you're doing, try single stepping and inspect the intermediate output.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Mar 2013
    Posts
    148


    Originally posted by marabu:
    The way you are using Jsoup.clean(), you lose the container of all the content you're interested in.
    If you don't know what you're doing, try single stepping and inspect the intermediate output.

    Yes, you are right, marabu. I am now using the content below as input.

    I have tried your earlier suggestion, i.e.:

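    // Java classes made available to the script
    var Jsoup = org.jsoup.Jsoup;
    var Whitelist = org.jsoup.safety.Whitelist;
    var OutputSettings = org.jsoup.nodes.Document.OutputSettings;
    var EscapeMode = org.jsoup.nodes.Entities.EscapeMode;

    // Parse the page, then sanitize the body, keeping id and class on div elements
    var doc = Jsoup.parse(html);
    doc.body().html(Jsoup.clean(doc.body().html(), Whitelist.relaxed().addAttributes("div", "id", "class")));

    // Take the first div whose class attribute is exactly "section"
    var xhtml = doc.select("div[class=section]").first().outerHtml();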

    Is there a way to get the above output with a single doc.select() call?
    Attached Files

  4. #4
    Join Date
    Mar 2013
    Posts
    148


    I have done some more cleansing and applied the XPath expressions below to get the individual columns (the final result should be 20 rows for this input):

    title = /div/div/div/div/div/div/ul/li[1]/a/@title
    href = /div/div/div/div/div/div/ul/li[1]/a/@href
    date = /div/div/div/div/div/div/p

    I have verified these expressions with the tester at http://www.freeformatter.com/xpath-t...html#ad-output.

    How can I use these XPath expressions in the XML component? It gives empty results.
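
    One possible reason for the empty results is sketched here. It is an assumption, since the attached transformation is not visible. If the XML component is the "Get data from XML" step, it expects a Loop XPath that selects one node per output row, and the field XPaths are normally written relative to that node rather than as absolute paths from the document root. The sketch below reproduces that pattern in plain Java with javax.xml.xpath. The sample markup, the /div/div loop path, and the class and method names are hypothetical; the real page nests more deeply.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RowXPathSketch {
    // Hypothetical, simplified stand-in for the cleaned XHTML:
    // one inner div per news item, holding a ul (link, description) and a p (date).
    static final String SAMPLE =
        "<div>"
      + "<div><ul><li><a title=\"Title A\" href=\"http://example.com/a\">Title A</a></li>"
      + "<li>Description A</li></ul><p>Dec 18, 2014, 11:00 ET</p></div>"
      + "<div><ul><li><a title=\"Title B\" href=\"http://example.com/b\">Title B</a></li>"
      + "<li>Description B</li></ul><p>Dec 17, 2014, 07:00 ET</p></div>"
      + "</div>";

    // Returns one {title, href, date} array per node matched by the loop path.
    static List<String[]> extractRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // "Loop XPath": selects one node per output row (assumed path).
        NodeList rows = (NodeList) xpath.evaluate("/div/div", doc, XPathConstants.NODESET);

        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Field XPaths: relative to the current row node, no leading slash.
            out.add(new String[] {
                xpath.evaluate("ul/li[1]/a/@title", row),
                xpath.evaluate("ul/li[1]/a/@href", row),
                xpath.evaluate("p", row)
            });
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String[] r : extractRows(SAMPLE)) {
            System.out.println(String.join(" | ", r));
        }
    }
}
```

    With the real input, the loop path would be the part of the absolute expressions that leads to the repeating element, and the three fields would keep only the remainder (for example ul/li[1]/a/@title).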
    Attached Files

  5. #5
    Join Date
    Jun 2012
    Posts
    5,534


    The developer tools of my browser suggest this to me:

    Code:
    doc = Jsoup.parse(html);
    doc.select("img, script, form").remove();
    
    
    var xhtml = doc.select("main div div.section div.row div.col-sm-9.col-sm-pull-3 div.section:eq(1)").outerHtml();

  6. #6
    Join Date
    Mar 2013
    Posts
    148


    Thanks a lot, marabu. But after replacing my code with yours, the XPath expressions still do not return the expected result. I think I am doing something wrong. Please correct me if anything in the script is wrong.
    Attached Files

  7. #7
    Join Date
    Jun 2012
    Posts
    5,534


    To make a long story short ...
    Attached Files

  8. #8
    Join Date
    Mar 2013
    Posts
    148


    Thank you so much, marabu. You made the solution very easy. Is there any way to get the description of the href content as well?

  9. #9
    Join Date
    Jun 2012
    Posts
    5,534


    You certainly know how to use XPath, right?

    .//li[2]

    Why is it I feel like I'm just doing your job?
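
    For readers following along, a minimal sketch of why the leading dot in .//li[2] matters: the expression is evaluated relative to the current row node, whereas //li[2] always searches from the document root. The sample markup and the class and method names are hypothetical, and the sketch uses the JDK's javax.xml.xpath rather than the Pentaho step itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RelativeXPathSketch {
    // Hypothetical markup: each inner div is one row; the second li holds the description.
    static final String SAMPLE =
        "<div>"
      + "<div><ul><li>Title A</li><li>Description A</li></ul></div>"
      + "<div><ul><li>Title B</li><li>Description B</li></ul></div>"
      + "</div>";

    // Evaluates both expressions with the SECOND row as the context node.
    // Returns {result of ".//li[2]", result of "//li[2]"}.
    static String[] compare(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node secondRow = (Node) xpath.evaluate("/div/div[2]", doc, XPathConstants.NODE);

        // Relative: searches only below the context node.
        String relative = xpath.evaluate(".//li[2]", secondRow);
        // Absolute: searches the whole document; the string result is the first match.
        String absolute = xpath.evaluate("//li[2]", secondRow);
        return new String[] { relative, absolute };
    }

    public static void main(String[] args) throws Exception {
        String[] r = compare(SAMPLE);
        System.out.println("relative: " + r[0]);
        System.out.println("absolute: " + r[1]);
    }
}
```

    In this sample the relative form yields the description of the current row, while the absolute form yields the description of the first row regardless of which row is current.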
