Hitachi Vantara Pentaho Community Forums

Thread: Read data from PDF file and need to load into table

  1. #1
    Join Date
    Aug 2015

    Read data from PDF file and need to load into table


    We have a requirement to load data from PDF files: we need to read the data from a PDF file and load it into a table. I am not able to process the data as rows and columns because, when we extract the data from the PDF, it comes out in an unstructured way.

    The column headers and their respective rows do not come out in order.

    Could you please tell me how we can read the data so that the column headers line up with their respective data?

    Thank you

  2. #2
    Join Date
    Apr 2008


    I feel like you have asked this before.

    The hint is in what PDF is: Portable Document Format.
    It is a page description format. It records where text is drawn on a page, not which row or column the text belongs to.
    It doesn't have standard columns and rows, so very few tools will actually process it the way you want. PDI is not one of those tools. If you pre-process it into another format, then PDI might be able to help.
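
    To make the pre-processing idea concrete, here is a minimal sketch, not a definitive implementation. It assumes some external extraction tool has already given you text fragments with page coordinates; the `Fragment` shape, the sample values, the `y_tolerance` parameter and the assumption that y grows down the page are all hypothetical. It groups fragments into rows by y position, orders each row by x, and writes CSV, which is a format PDI can read.

```python
import csv
import io
from typing import NamedTuple


class Fragment(NamedTuple):
    """One piece of text and where it was drawn (hypothetical shape)."""
    x: float
    y: float  # assumed to grow from the top of the page downward
    text: str


def group_into_rows(fragments, y_tolerance=2.0):
    """Group fragments into rows, then order each row left to right.

    A fragment joins the current row when its y position is within
    y_tolerance of the first fragment in that row.
    """
    rows = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up fragments, deliberately out of order, as a PDF might store them.
fragments = [
    Fragment(200.0, 50.3, "amount"),
    Fragment(40.0, 70.1, "1"),
    Fragment(40.0, 50.0, "id"),
    Fragment(200.0, 69.8, "12.50"),
    Fragment(110.0, 50.1, "name"),
    Fragment(110.0, 70.0, "Widget"),
]

table = group_into_rows(fragments)
csv_text = rows_to_csv(table)
print(csv_text)
```

    This only works for simple, regular tables. Empty cells, wrapped text and merged columns would all need extra handling, for example by matching each fragment's x position against the header positions.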

  3. #3
    Join Date
    May 2016


    I think there are some Java tools to do this that perhaps you can integrate into PDI. I remember reading about someone doing it somewhere else, but I don't know the details; I was just shocked by the nightmare of having to read structured data from a PDF :-) A quick Google search has come up with this:
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.