Hitachi Vantara Pentaho Community Forums

Thread: Parsing string in XML output

  1. #1

    Parsing string in XML output

    I have selected some data from HTML with JSoup, and the output is now four strings: title, authors, date, and source. These strings are formed like this:

    title=  title1 + "\r\n" + title2 + "\r\n" + ...... + titlen + "\r\n"
    authors= authors1 + "\r\n" + authors2 + "\r\n" + .......... + authorsn + "\r\n"
    Date and source are formed the same way as title and authors. If I try to write them with the XML Output step, I get something like

     <title> title1
    <authors> authors1
    If I split the strings with "Split field to rows" steps (one for title, one for authors, etc.), Kettle gives me something like

    What should I do to get a single book with title1, authors1, date1, source1; then title2, authors2, date2, source2; and so on?
    I have looked through all the available steps, but I can't find anything that lets me combine data of that form into XML. If I use "Split field to rows" four times (once per string), I get something like 160k rows.
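A row count of 160k would be consistent with the four splits being cross-joined rather than matched up (160,000 is exactly 20^4, although the thread does not say how many results were on the page). The usual way around this is to pair the entries by position: the i-th title goes with the i-th authors, date, and source. What follows is a minimal sketch of that idea in plain Java, assuming all four strings use "\r\n" as separator and contain the same number of entries. `RecordZipper`, `lines`, and `zip` are hypothetical names for illustration, not part of Kettle or JSoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not a Kettle or JSoup class.
final class RecordZipper {

    /** Splits on CRLF (or bare LF), keeping empty entries so positions stay aligned. */
    static List<String> lines(String joined) {
        // Limit -1 keeps trailing empty strings instead of silently dropping them.
        List<String> parts = new ArrayList<>(Arrays.asList(joined.split("\r?\n", -1)));
        // Each source string ends with a separator, which leaves one empty tail element.
        if (!parts.isEmpty() && parts.get(parts.size() - 1).isEmpty()) {
            parts.remove(parts.size() - 1);
        }
        return parts;
    }

    /** Pairs the i-th entry of every field into one row: {title, authors, date, source}. */
    static List<String[]> zip(String titles, String authors, String dates, String sources) {
        List<String> t = lines(titles);
        List<String> a = lines(authors);
        List<String> d = lines(dates);
        List<String> s = lines(sources);
        if (a.size() != t.size() || d.size() != t.size() || s.size() != t.size()) {
            throw new IllegalArgumentException("fields have different entry counts");
        }
        List<String[]> books = new ArrayList<>();
        for (int i = 0; i < t.size(); i++) {
            books.add(new String[] {
                t.get(i).trim(), a.get(i).trim(), d.get(i).trim(), s.get(i).trim()
            });
        }
        return books;
    }

    public static void main(String[] args) {
        List<String[]> books = zip(
            "title1\r\ntitle2\r\n", "authors1\r\nauthors2\r\n",
            "date1\r\ndate2\r\n", "source1\r\nsource2\r\n");
        for (String[] book : books) {
            System.out.println(String.join(" | ", book));
        }
    }
}
```

This produces n rows for n books rather than n^4. Inside a transformation, the same effect could presumably be had by carrying a row index through each split and joining the four streams on that index, but that is an assumption about the setup and not something the thread confirms.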

  2. #2

    Perhaps the bigger part of your problem stems from the way you read the formatted text from the Scopus response?
    You should provide sample input to play with and a precise description of the XML output you need.

  3. #3


    For example, I need to extract from this:

    <div class="dataCol2"><label class="hidden-label">Document</label>
    <span class="docTitle">	     
    <a href=";origin=resultslist&amp;sort=plf-f&amp;src=s&amp;st1=flesca&amp;sid=CB6B7BF29B4C406FC28B0A56BC998655.WXhD7YyTQ6A7Pvk9AlA%3a20&amp;sot=b&amp;sdt=b&amp;sl=19&amp;s=AUTHOR-NAME%28flesca%29&amp;relpos=0&amp;relpos=0&amp;citeCnt=0&amp;searchTerm=AUTHOR-NAME%28flesca%29" title="Show document details" onclick="javascript:submitRecord('2-s2.0-84896062472','0','0');">On the complexity of probabilistic abstract: Argumentation</a>
    <div class="dataCol3">
    <label class="hidden-label">Authors of Document</label>
    <span class="">
    <a href=";authorId=15131404100&amp;zone=" title="Show author details">Fazzinga, B.</a>, <a href=";authorId=55926108000&amp;zone=" title="Show author details">Flesca, S.</a>, <a href=";authorId=36020331900&amp;zone=" title="Show author details">Parisi, F.</a>
    <div class="dataCol4">
    <label class="hidden-label">Year the Document was Publish</label>
    <span class="">
    <div class="dataCol5">
    <label class="hidden-label">Source of the Document</label>
    <span class="">
    IJCAI International Joint Conference on Artificial Intelligence
    I should get this kind of XML:

    <title>On the complexity of probabilistic abstract: Argumentation</title>
    <author>Fazzinga B.</author><author>Flesca S.</author><author>Parisi F.</author>
    <source>IJCAI International Joint Conference on Artificial Intelligence</source>
    Now I can easily select the parts with JSoup, format the output as {title}_{authors}_{year}_{source}, and split it with "Split fields" to get four strings: Title, Authors, Year, Source. The last problem is how to split the individual authors and get them into the XML.
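One thing to watch with the authors string: in the sample, the names themselves contain a comma ("Fazzinga, B."), so splitting on ", " alone would cut each name in half. The sketch below splits only on commas that directly follow a period, then writes one `<author>` element per name with the JDK's `XMLStreamWriter`. It assumes every name ends with an initial and a period, as in the sample. The `<book>` wrapper, the optional `<year>` element, and the names `BookXml`, `splitAuthors`, and `toXml` are hypothetical additions for illustration. Selecting each author link separately in JSoup, before the names are joined into one string, would probably be more robust than any text split.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration; not a Kettle or JSoup class.
final class BookXml {

    /** Splits "Fazzinga, B., Flesca, S." into names, assuming each name ends with a period. */
    static List<String> splitAuthors(String authors) {
        List<String> names = new ArrayList<>();
        // The lookbehind matches only commas preceded by '.', i.e. the ones between names.
        for (String name : authors.trim().split("(?<=\\.),\\s*")) {
            if (!name.isEmpty()) {
                // Drop the comma inside the name to match the requested "Fazzinga B." form.
                names.add(name.replace(",", "").trim());
            }
        }
        return names;
    }

    /** Builds one <book> element; <year> is written only when a year is present. */
    static String toXml(String title, String authors, String year, String source) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("book"); // wrapper element is an assumption
            element(xml, "title", title);
            for (String author : splitAuthors(authors)) {
                element(xml, "author", author);
            }
            if (!year.trim().isEmpty()) {
                element(xml, "year", year);
            }
            element(xml, "source", source);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toString();
    }

    private static void element(XMLStreamWriter xml, String name, String text)
            throws XMLStreamException {
        xml.writeStartElement(name);
        xml.writeCharacters(text.trim()); // escapes &, < and > in the text
        xml.writeEndElement();
    }

    public static void main(String[] args) {
        System.out.println(toXml(
            "On the complexity of probabilistic abstract: Argumentation",
            "Fazzinga, B., Flesca, S., Parisi, F.",
            "",
            "IJCAI International Joint Conference on Artificial Intelligence"));
    }
}
```

Using a stream writer instead of string concatenation means characters such as `&` in a title are escaped automatically. Names that do not end in a period (for example a group author) would not be separated by this pattern and would need different handling.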

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.