Hitachi Vantara Pentaho Community Forums
Results 1 to 11 of 11

Thread: Converting HTML Content From JSON to Text Field

  1. #1

    Default Converting HTML Content From JSON to Text Field

    Hello Friends,
    I am really very new to Pentaho, basically from SAP BI Background. Now on my new team I am working on some integrations.

    I am extracting data from REST Client(API) --> JSON Input (Filtering Using Json Filters) --> Select Values --> Push into DB

    O/P From Rest Call:

    URL :https://graph.microsoft.com/v1.0/groups/88f582fb-e966-4799-8120-40bb1e13109c/threads/AAQkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MAMkABAAh4YjinKSC0ONWCNgG2fChRAAh4YjinKSC0ONWCNgG2fChQ==/posts?$top=999

    {
    "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#groups('88f582fb-e966-4799-8120-40bb1e13109c')/threads('AAQkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MAMkABAAh4YjinKSC0ONWCNgG2fChRAAh4YjinKSC0ONWCNgG2fChQ%3D%3D')/posts",
    "value": [
    {
    "@odata.etag": "W/"CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAjmp1V"",
    "id": "AAMkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MABGAAAAAABkarwC4SYzTITEicF8ni-qBwBajS0wTsC5S5t6HizsaEaPAAAAAAEMAABajS0wTsC5S5t6HizsaEaPAAAjmMzpAAA=",
    "createdDateTime": "2017-11-02T21:43:09Z",
    "lastModifiedDateTime": "2017-11-02T21:43:10Z",
    "changeKey": "CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAjmp1V",
    "categories": [],
    "receivedDateTime": "2017-11-02T21:43:10Z",
    "hasAttachments": false,
    "body": {
    "contentType": "html",
    "content": "<html><body><div>\r\n<div>\r\n<div>\r\n<div>\r\n<div>\r\n<div><font size="2"><span style="font-size:10pt;">Task </span></font><a href="https://tasks.office.com/monsanto365.onmicrosoft.com/en-us/Home/Task/dKZ9zdDYikq7YqmqMwLMhGQABny6"><font size="2"><span style="font-size:10pt;">Generating tactical intelligence products\r\nand assisting with Strategic/Operational products and CIRT analysis</span></font></a><font size="2"><span style="font-size:10pt;"> assigned to xxxxxxxxxxxxxxxxxxx]</span></font></div>\r\n</div>\r\n</div>\r\n</div>\r\n</div>\r\n</div>\r\n</body></html>"
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxx]",
    "address": "xxxxxxx"
    }
    },
    "sender": {
    "emailAddress": {
    "name": "xxxxxxxxxxx]",
    "address": "xxxxxxxxxx"
    }
    }
    }


    I am trying to convert HTML content which I got above to text and pass it to a text field in DB. My expectation is to get the highlighted comments from HTML content. I have attached the .ktr file.

    Can anyone please guide me in right direction to achieve my goal. I have gone through couple blogs and was not able to understand exactly what I need to do.

    Please guide me the right steps that I need to perform. Its really very very urgent for me.

    Regards,
    Prakash.
    Attached Files Attached Files

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Let's assume you are able to retrieve a HTML mail message for a given URI using REST-Client and JSON-Input.
    Your sample message is well-formed, so you can use Get-Data-From-XML to extract all you need.
    So long, and thanks for all the fish.

  3. #3

    Default

    Hi Marabu,
    I have attached what I have done so fat in Pentaho Job with output what I have got and what I am trying to achieve. Please let me know steps what I need to perform for getting the output I need.

    Regards,
    Prakash.
    Attached Files Attached Files

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Get-Data-From-XML needs an XPath expression (Loop XPath) returning a node list.
    Each node will be subject to field XPath expressions, then.
    If you can give a formal or semi-formal description of what you want to extract, we can easily translate that into the necessary XPath expressions.

    Just pointing out some text is no usable description, though.

    If the messages are composed by individuals and not generated by a program, try //body as your Loop XPath and . as your sole field XPath.

  5. #5
    Join Date
    Jun 2016
    Posts
    181

    Default

    I see you are receiving not XML but JSON (what you posted here is not valid JSON).

    If you have JSON as response, you can use JSON-input to read field(s) and then Table-output to insert into table.
    In JSON-input, Fields tab define correct path to read field you want.

    See my example (writes one filed into the file).
    Attached Files Attached Files

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by Gosforth View Post
    I see you are receiving not XML but JSON
    Actually, they are getting HTML wrapped in JSON, and since HTML can be processed as XML, the flow is exactly as Marabu suggested:
    JSON-Input -> Get Data from XML

  7. #7
    Join Date
    Jun 2016
    Posts
    181

    Default

    You're right.
    I understood problem was to read correct field from result.
    Anyway here goes ready example.

    XPath tuto:

    https://www.w3schools.com/xml/xpath_syntax.asp


    Quote Originally Posted by gutlez View Post
    Actually, they are getting HTML wrapped in JSON, and since HTML can be processed as XML, the flow is exactly as Marabu suggested:
    JSON-Input -> Get Data from XML
    Attached Files Attached Files
    Last edited by Gosforth; 12-16-2017 at 09:54 AM.

  8. #8

    Default REST Client Output from URL Call

    Guys,
    I have tried the Xpaths that you have suggested, and tried couple other options as well. But still nothing helps. I am sending my complete JSON output, now let me know if anyone can give the right directions on this.

    From JSON Input Below are the fields that I have extracted. I am sending the below field values in SQL Server Table.

    Fields JSON Path
    ID $..id
    CreatedDate $..createdDateTime
    ModifiedDate $..lastModifiedDateTime
    Comments $..content (HTML Format)

    content is the one which I am getting in HTML format, I just need the body out of the entire content. I have highlighted what I exactly need.

    {
    "@odata.context": "https://graph.microsoft.com/v1.0/$metadata#groups('88f582fb-e966-4799-8120-40bb1e13109c')/conversations('AAQkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MAAQADR7nbr4z7lIhWO96mM1f6Y%3D')/threads('AAQkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MAMkABAANHuduvjPuUiFY73qYzV-phAANHuduvjPuUiFY73qYzV-pg%3D%3D')/posts",
    "value": [
    {
    "@odata.etag": "W/"CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAAABbj"",
    "id": "AQMkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MABGAAADZGq8AuEmM0yExInBfJ4v6gcAWo0tME7AuUubeh4s7GhGjwAAAgEMAAAAWo0tME7AuUubeh4s7GhGjwAAAhJRAAAA",
    "createdDateTime": "2017-09-09T00:47:29Z",
    "lastModifiedDateTime": "2017-09-09T00:47:29Z",
    "changeKey": "CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAAABbj",
    "categories": [],
    "receivedDateTime": "2017-09-09T00:47:29Z",
    "hasAttachments": false,
    "body": {
    "contentType": "html",
    "content": "<html><body><div>\r\n<div>\r\n<div>\r\n<div>\r\n<div>\r\n<div><font size="2"><span style="font-size:10pt;">Task </span></font><a href="https://tasks.office.com/monsanto365.onmicrosoft.com/en-us/Home/Task/fe_1vVA_Vki791Vc-Wuv32QAEUeT"><font size="2"><span style="font-size:10pt;">Enhance post incident review and establish\r\neffective measurements</span></font></a><font size="2"><span style="font-size:10pt;"> assigned to COLLINS, DANIEL [AG/1000]</span></font></div>\r\n</div>\r\n</div>\r\n</div>\r\n</div>\r\n</div>\r\n</body></html>"
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    }
    },
    {
    "@odata.etag": "W/"CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAPo5tH"",
    "id": "AAMkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MABGAAAAAABkarwC4SYzTITEicF8ni-qBwBajS0wTsC5S5t6HizsaEaPAAAAAAEMAABajS0wTsC5S5t6HizsaEaPAAAPopbjAAA=",
    "createdDateTime": "2017-10-05T05:03:25Z",
    "lastModifiedDateTime": "2017-10-05T05:03:25Z",
    "changeKey": "CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAPo5tH",
    "categories": [],
    "receivedDateTime": "2017-10-05T05:03:25Z",
    "hasAttachments": false,
    "body": {
    "contentType": "html",
    "content": "<html><body><div>\r\n<div>\r\n<div>\r\n<div>No significant progress. </div>\r\n</div>\r\n</div>\r\n</div>\r\n</body></html>"
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    }
    },
    {
    "@odata.etag": "W/"CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAmrMXU"",
    "id": "AAMkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MABGAAAAAABkarwC4SYzTITEicF8ni-qBwBajS0wTsC5S5t6HizsaEaPAAAAAAEMAABajS0wTsC5S5t6HizsaEaPAAAmqpJuAAA=",
    "createdDateTime": "2017-11-07T23:36:05Z",
    "lastModifiedDateTime": "2017-11-07T23:36:05Z",
    "changeKey": "CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAAmrMXU",
    "categories": [],
    "receivedDateTime": "2017-11-07T23:36:05Z",
    "hasAttachments": false,
    "body": {
    "contentType": "html",
    "content": "<html><body><div>\r\n<div>\r\n<div>The CIRT is partnering with Business Continuity Management and consulting with Incident Management to identify a common post-incident/after-action tracking and reporting framework. </div>\r\n</div>\r\n</div>\r\n</body></html>"
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    }
    },
    {
    "@odata.etag": "W/"CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAA5Ugh1"",
    "id": "AAMkADcwZDk2MjM2LTE2ZTItNDZiMS1hZmI1LWVkM2ViMDYzYTc0MABGAAAAAABkarwC4SYzTITEicF8ni-qBwBajS0wTsC5S5t6HizsaEaPAAAAAAEMAABajS0wTsC5S5t6HizsaEaPAAA5TlBRAAA=",
    "createdDateTime": "2017-12-04T20:35:50Z",
    "lastModifiedDateTime": "2017-12-04T20:35:50Z",
    "changeKey": "CQAAABYAAABajS0wTsC5S5t6HizsaEaPAAA5Ugh1",
    "categories": [],
    "receivedDateTime": "2017-12-04T20:35:50Z",
    "hasAttachments": false,
    "body": {
    "contentType": "html",
    "content": "<html><body><div>\r\n<div>Progress: Collaboration with Business Continuity and Incident Management is ongoing. Pushing this milestone into Q2 to better leverage these partnerships to ensure consistent post-incident/after-action processes across the enterprise. </div>\r\n</div>\r\n</body></html>"
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    },
    "from": {
    "emailAddress": {
    "name": "xxxxxxxxxxxxxxxxx",
    "address": "xxxxxxxxxx"

    }
    }
    }
    ]
    }


    Regards,
    Prakash.

  9. #9
    Join Date
    Jun 2016
    Posts
    181

    Default

    This is not valid JSON. And I'm not talking about html inside.
    Check the result here (or any other json parser): http://jsoneditoronline.org/

    If you cannot parse it, this is not valid JSON. Simple like that.

  10. #10

    Default

    I am able to Parse the JSON. Attaching the screenshot. I am trying the get the data from the Content which is highlighted in my screenshot.
    Attached Files Attached Files

  11. #11
    Join Date
    Jun 2016
    Posts
    181

    Default

    Do not attach picture (who will waist time to retype it...) - attach text.

    Quote Originally Posted by praku4you@gmail.com View Post
    I am able to Parse the JSON. Attaching the screenshot. I am trying the get the data from the Content which is highlighted in my screenshot.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.