Hitachi Vantara Pentaho Community Forums

Thread: Problems with extracting data with jsoup

  1. #1

    Problems with extracting data with jsoup

    I'm back with another problem... After seeing the result from JTidy, I tried to extract the data with Jsoup instead, and this is the code I'm using:

    Code:
    java;

    // Short aliases for the jsoup classes
    var Jsoup = org.jsoup.Jsoup;
    var Whitelist = org.jsoup.safety.Whitelist;

    // html holds the page source
    var doc = Jsoup.parse(html);

    // Commented out: with this line enabled I get a blank page
    //doc.body().html(Jsoup.clean(doc.body().html(), Whitelist.relaxed()));

    doc.head().remove();

    var xhtml = doc.select("resultDataRow1").outerHtml();

    I had to comment out the Jsoup.clean line, otherwise I get a blank page back. Now, this is the data I have to extract:

    <div id="resultDataRow1" class="docMain">
    <div class="dataCol1">
    <span class="custom-checkbox">
    <input name="selectedEIDs" value="2-s2.0-84884994817" onclick="return selectDeselectResult(document.SearchResultsForm, this);" id="eid_2-s2.0-84884994817" type="checkbox">
    <span class="box"><span class="tick"></span></span>
    </span>
    <br>
    <label for="eid_2-s2.0-84884994817">
    <span class="hidden-label">
    result
    2</span>
    </label>
    </div>
    <div class="dataCol2">
    <label class="hidden-label">Document</label>
    <span class="docTitle">
    <a href="http://www.scopus.com/record/display.url?eid=2-s2.0-84884994817&amp;origin=resultslist&amp;sort=plf-f&amp;src=s&amp;st1=flesca&amp;sid=CB6B7BF29B4C406FC28B0A56BC998655.WXhD7YyTQ6A7Pvk9AlA%3a20&amp;sot=b&amp;sdt=b&amp;sl=19&amp;s=AUTHOR-NAME%28flesca%29&amp;relpos=1&amp;relpos=1&amp;citeCnt=0&amp;searchTerm=AUTHOR-NAME%28flesca%29" title="Show document details" onclick="javascript:submitRecord('2-s2.0-84884994817','1','0');">Efficiently estimating the probability of extensions in abstract argumentation</a>
    </span>
    </div>
    <div class="dataCol3">
    <label class="hidden-label">Authors of Document</label>
    <span class="">
    <a href="http://www.scopus.com/authid/detail.url?origin=resultslist&amp;authorId=15131404100&amp;zone=" title="Show author details">Fazzinga, B.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=resultslist&amp;authorId=55926108000&amp;zone=" title="Show author details">Flesca, S.</a>, <a href="http://www.scopus.com/authid/detail.url?origin=resultslist&amp;authorId=36020331900&amp;zone=" title="Show author details">Parisi, F.</a>
    </span>
    </div>
    <div class="dataCol4">
    <label class="hidden-label">Year the Document was Publish</label>
    <span class="">
    2013
    </span>
    </div>
    <div class="dataCol5">
    <label class="hidden-label">Source of the Document</label>
    <span class="">
    <a href="http://www.scopus.com/source/sourceInfo.url?sourceId=25674&amp;origin=resultslist" title="Show source title details">Lecture
    Notes in Computer Science (including subseries Lecture Notes in
    Artificial Intelligence and Lecture Notes in Bioinformatics)</a>
    </span>
    </div>
    <div class="dataCol6">
    <label class="hidden-label">Number of Documents that reference this Document</label>
    0
    <br>
    <span class="showCitedBy visibleHidden">Cited <br> by</span>
    </div>
    But this is all I get as output:
    "<div id=""resultDataRow0"" class=""docMain""></div>"
    "<div id=""resultDataRow1"" class=""docMain""></div>"
    .
    .
    "<div id=""resultDataRow10"" class=""docMain"">
    <div class=""dataCol1""></div>
    </div>"

    Any suggestions?

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    One suggestion would be to read the Jsoup API documentation more carefully ...

    Removal of the HEAD section is optional.
    There is no element named resultDataRow1, so no wonder you don't get any rows.
    You must address the element by its id, using a CSS selector like div#resultDataRow1 - see the code example below.
    Cleaning will remove most of the attributes, though.
    You must declare attributes safe to keep them intact - that's done in the first line of code:

    Code:
    doc.body().html(Jsoup.clean(doc.body().html(), Whitelist.relaxed().addAttributes("div", "id", "class")));
    
    var xhtml = doc.select("div#resultDataRow1").first().outerHtml();
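
    The two points above (tag selector vs. id selector, and whitelisting attributes before cleaning) can be sketched as plain Java. This is only an illustration under assumptions: the class name SelectorSketch, its helper methods, and the shortened HTML sample are made up for the example, and jsoup must be on the classpath. It uses the same Whitelist class as this thread; as far as I know, newer jsoup releases call it Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class SelectorSketch {
    // Hypothetical sample, shortened from the markup in the first post.
    static final String HTML =
        "<html><head><title>t</title></head><body>"
        + "<div id=\"resultDataRow1\" class=\"docMain\">"
        + "<div class=\"dataCol4\"><span class=\"\">2013</span></div>"
        + "</div></body></html>";

    // Number of elements matched by a CSS query.
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Clean the body while keeping id and class on div elements,
    // then select the row by its id. Returns null if nothing matches.
    static String cleanAndSelect(String html) {
        Document doc = Jsoup.parse(html);
        Whitelist keepDivIds = Whitelist.relaxed().addAttributes("div", "id", "class");
        doc.body().html(Jsoup.clean(doc.body().html(), keepDivIds));
        Element row = doc.select("div#resultDataRow1").first();
        return row == null ? null : row.outerHtml();
    }

    public static void main(String[] args) {
        // "resultDataRow1" alone is a tag selector: it looks for <resultDataRow1> elements.
        System.out.println(countMatches(HTML, "resultDataRow1"));
        // "div#resultDataRow1" looks for a div whose id is resultDataRow1.
        System.out.println(countMatches(HTML, "div#resultDataRow1"));
        System.out.println(cleanAndSelect(HTML));
    }
}
```

    Checking first() for null before calling outerHtml() avoids a NullPointerException when the selector matches nothing, for example when cleaning has stripped the id attribute.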
    Last edited by marabu; 04-17-2014 at 01:55 AM. Reason: typo
    So long, and thanks for all the fish.

  3. #3

    Originally posted by marabu:
    Code:
    doc.body().html(Jsoup.clean(doc.body().html(), Whitelist.relaxed().addAttributes("div", "id", "class")));
    I had missed that I could add attributes to the whitelist ^^... Anyway, after reading the documentation about three times I found the right way. I had split the HTML into several strings (with CR as the line separator), and that was the error: jsoup wants all of the HTML in a single row ^^; once I removed the row splitter, the problem was solved O.O
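
    A possible reading of why the split caused trouble, sketched in Java under assumptions (the class name JoinRowsSketch, the selectRow helper, and the three sample rows are hypothetical; jsoup must be on the classpath): when each line is parsed as its own document, jsoup closes the unfinished div by itself, which would produce empty elements like the ones in the first post. Parsing the joined text keeps the children.

```java
import java.util.Arrays;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JoinRowsSketch {
    // Hypothetical rows, as if the page had been split on line breaks.
    static final List<String> ROWS = Arrays.asList(
        "<div id=\"resultDataRow1\" class=\"docMain\">",
        "<div class=\"dataCol4\"><span>2013</span></div>",
        "</div>");

    // Parse the given text as one document and return the row's outer HTML,
    // or null if the row is not present.
    static String selectRow(String html) {
        Element row = Jsoup.parse(html).select("div#resultDataRow1").first();
        return row == null ? null : row.outerHtml();
    }

    public static void main(String[] args) {
        // First row alone: the opening div is parsed without any children.
        System.out.println(selectRow(ROWS.get(0)));
        // All rows joined back into one string: the nested content is kept.
        System.out.println(selectRow(String.join("\n", ROWS)));
    }
}
```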
