Hitachi Vantara Pentaho Community Forums

Thread: integration of pdf or scanned file

  1. #1
    Join Date
    Jun 2008
    Posts
    9

    Default integration of pdf or scanned file

    Hi,

    I'm new to Pentaho and am planning a project to warehouse organizational data, but one thing is confusing me.

    Quite a lot of our data is in scanned images. What would be a reasonable approach for integrating this source of data, assuming the data is structured to a certain extent and the scanned images have been passed through OCR?

    Do you have any ideas or experience you could share?



    Thanks, Dove

  2. #2
    Join Date
    May 2006
    Posts
    4,882

    Default

    I have not seen an image in a datawarehouse yet. Most of the descriptions on the internet would place the image outside of the database/datawarehouse and possibly only store a pointer to the image.

    What do you think you can do with an image in a datawarehouse situation?

    Regards,
    Sven

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Actually Sven, I added the Image data type in Pentaho Metadata to make it possible in the future to add images that come from a database. That way you can add product images etc on reports to make them clearer. Image vs Product code etc.
    I have no problem in seeing images, sounds and video as just another attribute of products, people, etc. I'm sure they can even be considered facts in some cases. Most certainly, the reporting aspects become intriguing but from a coolness perspective, I kinda like the idea.

    Dove, look in the samples/transformations directory.
    There is a sample there called "General - Load images and store into database table.ktr". Maybe you can work with this.

    All the best,

    Matt

  4. #4
    Join Date
    May 2006
    Posts
    4,882

    Default

    I know it's possible, I just haven't seen any real use of it.

    The closest I've seen are reports on which the user could click to see the original scanned document behind that detail line in the report. And the way that worked is that the report tool would generate a URL based on the detail line and it would call an "off-the-shelf" document retrieval application with it to fetch the scanned document.

    So @bcss02 I'm curious, what are you going to use images/documents for in a datawarehouse?

    Regards,
    Sven

  5. #5
    Join Date
    Jun 2008
    Posts
    9

    Default

    Sven,

    Yeah, I'm not really set on putting scanned images inside my data warehouse; it's just one of the options I have in mind. I agree with you that keeping a URL for the files looks more flexible, without overloading the database or complicating the storage.

    I think there are pros and cons. If I keep the URLs in the database and the files outside, I'm just wondering how this structure can best be maintained.


    Rgds, Dove

  6. #6
    Join Date
    Jun 2008
    Posts
    9

    Default

    Matt,

    Thanks for the idea, let me spend some time digesting it.


    Rgds, Dove

  7. #7
    Join Date
    Jun 2008
    Posts
    9

    Default

    Actually, I quite agree with you that a pointer looks like the better approach. But how is the link established between the pointer in the database and the file location?

    Do you have any detailed material I could read?


    Rgds, Dove

  8. #8
    Join Date
    Jan 2007
    Posts
    32

    Default

    1) In most of my projects I've always used pointers to images and other non-textual data. However, I do remember that this created some problems during upgrades, server moves, etc. You have to keep making sure the pointer stays valid. From that perspective, I believe there is something to be said for having the image right there with all the rest of the data warehouse data.

    2) Another element to consider is the reporting engine: can it handle images stored in the database? I haven't tried this with Pentaho Reporting yet. Maybe someone else can answer that. Most commercial tools work with pointers anyway, so they wouldn't benefit from having the image in the database.

    3) Towards the future, I believe we'll handle images, video and sound just like we handle structured data right now. The difference between structured and unstructured data just depends on how much information you can extract automatically from the object you are looking at. So this again would be in favor of being able to load non-text data into the DWH.
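
    To make point 1 a bit more concrete, here is a minimal sketch of a pointer check that could run after an upgrade or a server move. It assumes the warehouse stores a path relative to a configurable base directory; the function name, document keys and file names are made up for illustration and are not part of Pentaho.

```python
import os
import tempfile


def find_broken_pointers(pointers, base_dir):
    """Return the document keys whose file no longer exists under base_dir.

    pointers maps a (hypothetical) document key to a path relative to base_dir.
    """
    broken = []
    for doc_key, relative_path in pointers.items():
        if not os.path.isfile(os.path.join(base_dir, relative_path)):
            broken.append(doc_key)
    return sorted(broken)


def demo():
    # Build a throwaway document store with one file present and one missing.
    with tempfile.TemporaryDirectory() as base_dir:
        with open(os.path.join(base_dir, "invoice_001.pdf"), "wb") as handle:
            handle.write(b"%PDF-1.4 placeholder")
        pointers = {
            "INV-001": "invoice_001.pdf",  # exists
            "INV-002": "invoice_002.pdf",  # never written, so it is broken
        }
        return find_broken_pointers(pointers, base_dir)


if __name__ == "__main__":
    print(demo())
```

    Keeping only relative paths in the database means a server move should only require changing the base directory, not rewriting every pointer.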

    Just my ideas, ... seem to have too much time on my hands ...

  9. #9
    Join Date
    May 2006
    Posts
    4,882

    Default

    For the "image pointers" ... it depends what you use to store the images. One system had e.g. a very expensive off-the-shelf storage application and you could access it with a URL supplying a keyword/value system. So the pointer was just the key/value to get at the original and the reporting system would create the actual URL for the image.
    You can use anything as pointer as long as you can get at the image, and as Jan mentioned you have to make sure everything keeps pointing correctly.
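
    As a rough sketch of the key/value idea: the warehouse only holds the key and the value, and the reporting layer turns them into a URL when needed. The base URL and the parameter name below are hypothetical, not the interface of any real document retrieval product.

```python
from urllib.parse import urlencode

# Hypothetical endpoint of a document retrieval application.
BASE_URL = "https://docs.example.com/retrieve"


def document_url(key, value, base_url=BASE_URL):
    """Build the retrieval URL from the key/value pair stored in the warehouse."""
    # urlencode escapes spaces, slashes, etc. so the value is safe in a query string.
    return base_url + "?" + urlencode({key: value})


if __name__ == "__main__":
    print(document_url("invoice_no", "INV 2008/0042"))
```

    Because the URL is built at report time, moving the document store would only mean changing the base URL in one place.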

    Whichever way you look at it, I would first do a disk space requirement analysis. Some projects just fail over the nice-to-have things. E.g. suppose you get 2 million fact rows daily, half of them refer to the same PDF, and a PDF is 50 KB: that would be 50 GB to load daily.
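
    The estimate above can be written down as a small calculation. This sketch assumes one reading of the example: about 1 million distinct PDFs arrive per day (half of the 2 million rows), each 50 KB, with decimal units (1 GB = 10^9 bytes). The function name is made up for illustration.

```python
def daily_load_gb(pdf_count, pdf_size_kb):
    """Estimate the daily load in GB, using decimal units (1 KB = 1000 bytes)."""
    total_bytes = pdf_count * pdf_size_kb * 1000
    return total_bytes / 1e9


if __name__ == "__main__":
    fact_rows_per_day = 2_000_000
    distinct_pdfs = fact_rows_per_day // 2  # assumption: one PDF per two rows
    print(daily_load_gb(distinct_pdfs, 50))      # PDFs stored once each
    print(daily_load_gb(fact_rows_per_day, 50))  # a copy stored on every row
```

    Under these assumptions the first figure is the 50 GB from the example, and storing a copy on every fact row would double it.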

    Regards,
    Sven

  10. #10
    Join Date
    Jan 2007
    Posts
    32

    Default Space

    Sven,
    Nice example, but in that case your PDF would be contained in your dimension, not in your fact table.
    Jan
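
    A minimal sketch of what that could look like, using SQLite from Python purely for illustration: the PDF lives once in a document dimension and the fact rows only carry its key. The table and column names are assumptions, not taken from Pentaho or from the sample transformation mentioned earlier in the thread.

```python
import sqlite3


def build_demo():
    """Create an in-memory warehouse with one 50 KB document shared by two facts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_document (
            document_key INTEGER PRIMARY KEY,
            file_name    TEXT NOT NULL,
            content      BLOB NOT NULL
        );
        CREATE TABLE fact_transaction (
            transaction_id INTEGER PRIMARY KEY,
            document_key   INTEGER NOT NULL REFERENCES dim_document(document_key),
            amount         REAL NOT NULL
        );
        """
    )
    pdf_bytes = b"x" * 50_000  # stand-in for a 50 KB scanned PDF
    conn.execute(
        "INSERT INTO dim_document VALUES (1, 'scan_0001.pdf', ?)", (pdf_bytes,)
    )
    # Both fact rows point at the same dimension row instead of holding the PDF.
    conn.executemany(
        "INSERT INTO fact_transaction VALUES (?, 1, ?)", [(1, 10.0), (2, 25.5)]
    )
    conn.commit()
    return conn


def stored_document_bytes(conn):
    """Total bytes of document content held in the dimension."""
    query = "SELECT COALESCE(SUM(LENGTH(content)), 0) FROM dim_document"
    return conn.execute(query).fetchone()[0]


def fact_row_count(conn):
    """Number of fact rows, regardless of how many documents they share."""
    return conn.execute("SELECT COUNT(*) FROM fact_transaction").fetchone()[0]


if __name__ == "__main__":
    demo_conn = build_demo()
    print(fact_row_count(demo_conn), stored_document_bytes(demo_conn))
```

    With this layout the stored volume grows with the number of distinct documents, not with the number of fact rows that refer to them.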
