Hitachi Vantara Pentaho Community Forums

Thread: PDF file Data Loading

  1. #1

    Question PDF file Data Loading

    Can we load data from a PDF file directly into a database?

    If so, how?

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    No, not possible yet.

  3. #3
    Join Date
    Jan 2008
    Posts
    18

    Default PDF reader step

    Has anybody tried since then to read PDF files with PDI? Does anybody know of existing steps that can read PDF files?

    Ronny

  4. #4
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    It's not a structured file format, what kind of data would you extract?

  5. #5

    Default

    This is, to be blunt, far outside the scope of PDI. That said, it is possible to leverage PDI for this, assuming you have a controlled use case.

    use case 1 (doable): You want to read the 'header' information of the PDF (the author, title, etc). This should be doable, and requires writing a custom PDI plugin to read this information. I know some of the Document Management solutions out there (Liberty?) support custom index information on the PDF, not sure if those are 'header' data fields or not.

    use case 2 (eh..not PDI): If you want to OCR the information inside the PDF, look at a true Document Management solution that supports indexing on the fly (Apache Jackrabbit on the open-source/free side; otherwise there are dozens of commercial ones out there). This is not a default feature in Document Management solutions, but most offer it as an add-on to OCR/index/search the information inside the PDF.

    use case 3 (doable): You want to store the PDF file as a binary blob in the DB for streaming from a different application, and have no interest in the content of the PDF. Although I haven't done this, it should be doable, but it is not really what PDI is designed for. You could mix this with use case 1 to have index and blob/file storage, but again, this really isn't what PDI is designed for. However, version 4 with the JCR/CMIS-style content management should be better able to handle this use case, assuming an RDBMS isn't your true target and you are instead targeting JCR-style repositories.
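
    To make the "index plus blob" combination of use cases 1 and 3 concrete, here is a minimal sketch in plain Python rather than a PDI plugin. Everything in it is an assumption for illustration: the `pdf_store` table, the function names, the in-memory SQLite target, and especially the regex scan, which only finds header fields written as plain, unencrypted literal strings. A real PDF library would be needed for anything beyond that.

```python
import re
import sqlite3

# Naive pattern for literal-string entries such as "/Title (My report)".
# Assumption: the Info dictionary is uncompressed, unencrypted, and uses
# plain parenthesised strings without nested or escaped parentheses.
INFO_FIELD = re.compile(rb"/(Title|Author|Subject|Creator)\s*\(([^()]*)\)")


def read_info_fields(pdf_bytes):
    """Return a dict of header fields found by a raw byte scan."""
    fields = {}
    for key, value in INFO_FIELD.findall(pdf_bytes):
        # Keep the first occurrence of each key.
        fields.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return fields


def store_pdf(conn, name, pdf_bytes):
    """Insert the raw file as a BLOB next to its index fields."""
    info = read_info_fields(pdf_bytes)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_store ("
        "name TEXT PRIMARY KEY, title TEXT, author TEXT, content BLOB)"
    )
    conn.execute(
        "INSERT INTO pdf_store (name, title, author, content) "
        "VALUES (?, ?, ?, ?)",
        (name, info.get("Title"), info.get("Author"), sqlite3.Binary(pdf_bytes)),
    )
    conn.commit()
    return info


# Hypothetical stand-in for a real file's bytes; not a complete, valid PDF.
SAMPLE_PDF = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Title (Quarterly Report) /Author (Example Author) >>\nendobj\n"
    b"trailer\n<< /Info 1 0 R >>\n%%EOF\n"
)

conn = sqlite3.connect(":memory:")
stored_info = store_pdf(conn, "report.pdf", SAMPLE_PDF)
```

    The blob is stored untouched, so another application can stream it back byte for byte, while the title and author columns act as the index.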
    Last edited by dhartford; 03-05-2010 at 10:08 AM.

  6. #6
    Join Date
    Jun 2006
    Posts
    282

    Default Embedded Tables?

    There is a fourth use case:
    Extracting embedded tables that have uniformity. Perhaps this could be a candidate for an ETL step? We currently handle this using yet another "3rd party utility". It would be nice to be able to bring this into our automated PDI solutions. :P
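
    As a rough illustration of what such a step would have to do once the text is out of the PDF, here is a small Python sketch. It assumes a separate tool has already produced plain text and that the table columns are separated by two or more spaces. The function name, the sample text, and the column-gap rule are all hypothetical and would need tuning per document layout.

```python
import re

# Columns in the extracted text are assumed to be separated by 2+ spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_uniform_table(text, expected_columns):
    """Turn uniform table lines into dicts keyed by the header row.

    Lines that do not split into exactly `expected_columns` cells
    (titles, blank lines, page footers) are skipped.
    """
    rows = []
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        if len(cells) == expected_columns:
            rows.append(cells)
    if not rows:
        return []
    # The first matching line is treated as the header.
    header, data = rows[0], rows[1:]
    return [dict(zip(header, cells)) for cells in data]


# Hypothetical text as it might come out of a PDF-to-text tool.
EXTRACTED_TEXT = """\
Monthly Sales Summary

Region    Units    Revenue
North     120      4500.00
South     95       3100.50

Page 1 of 1
"""

records = parse_uniform_table(EXTRACTED_TEXT, expected_columns=3)
```

    Each resulting record maps header names to cell strings, which is roughly the row shape a downstream ETL step would consume. Type conversion and multi-page tables are left out on purpose.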


    Darrin
    Last edited by microdisney; 03-05-2010 at 11:02 AM.
    "If you want to increase your success rate, double your failure rate."
    Thomas Watson, Sr (former president of IBM)

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I propose to write an artificially intelligent PDI plugin that can parse any sort of content, including PDF.

  8. #8
    Join Date
    Jun 2006
    Posts
    282

    Default

    Yes, but it must be green with zero carbon footprint!!
    "If you want to increase your success rate, double your failure rate."
    Thomas Watson, Sr (former president of IBM)

  9. #9
    Join Date
    Jun 2007
    Posts
    233

    Exclamation

    I propose to write an artificially intelligent PDI plugin that can parse any sort of content, including PDF
    Damn Matt, that's a hell of an undertaking. If you manage this you should do a white paper or something. I wait with bated breath and anticipation.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    If I manage to do this, they will give me the Nobel Prize for computer science or something like that.

  11. #11
    Join Date
    Jun 2007
    Posts
    233

    Talking I would hope so!

    At a minimum. I would hope there is significant uptake of such a technology. Google would love you.

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  12. #12
    Join Date
    Jan 2008
    Posts
    18

    Default

    Have a look at Apache Tika (http://tika.apache.org), which can:
    - Extract metadata
    - Extract text

    It does this for many file formats like PDF and MS Office. See also: http://tika.apache.org/0.7/formats.html

    It's a top-level Apache project (formerly a Lucene subproject), licensed under the Apache License 2.0.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.