Hitachi Vantara Pentaho Community Forums

Thread: Handling multiple SQLite database files

  1. #1
    Join Date
    Feb 2017
    Posts
    18

    Handling multiple SQLite database files

    Hello everyone

    I'm totally new to Pentaho DI and data integration. During my internship I was tasked with creating a system that could centrally store multiple distributed databases. These databases were SQLite files and all had the same structure.
    I went with the following solution:
    1st: I downloaded the files via rsync to my local machine.
    2nd: I created a CSV file containing the location on the file system, the database name and the hostname.
    3rd: I created a transformation that reads this file and copies the rows to result.
    4th: I created a database connection that takes this information as parameters, and I used this transformation in each data load job.
    Now I wonder: is there a better solution? The jobs that run these transformations are awfully slow. I also face a memory leak issue right now: loading 1 million+ rows and doing 5 consecutive dimension lookups simply fails with an "Out of Memory" exception. Once I close Pentaho DI and restart it, the job usually runs fine again. If anyone thinks it would be worth it, I can create a demo of my example without the names and data from the original source systems.
    I believe handling it this way is not a good idea, as I have to read more than one thousand database files multiple times (once in each data extraction job). I would welcome any input on how to solve such a task in a cleaner way, or whether there is even a built-in functionality that can solve it.
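
    To make step 2 a bit more concrete, this is roughly the kind of script I use to build the CSV index. The folder layout, file extension and column names here are placeholders, not my real setup:

    Code:
    import csv
    import os

    SYNC_DIR = "/data/sqlite_sync"      # placeholder: local rsync target folder
    INDEX_CSV = "/data/db_index.csv"    # placeholder: CSV read by the PDI transformation

    with open(INDEX_CSV, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file_path", "database_name", "hostname"])
        for root, _dirs, files in os.walk(SYNC_DIR):
            for name in files:
                if not name.endswith(".sqlite"):   # placeholder extension
                    continue
                # assumption: one sub-folder per source host, file name = database name
                writer.writerow([
                    os.path.join(root, name),
                    os.path.splitext(name)[0],
                    os.path.basename(root),
                ])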

  2. #2
    Join Date
    Feb 2017
    Posts
    18

    Attachments

    I will add the attachment here. It does not contain sample databases, but you can create them yourself if you like. I would really like to know if there is a better way. One thing I did not mention: the naming scheme follows the date of the file's creation.
    Attached Files

  3. #3
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    When downloading you most likely have a special folder where you store the sqlite files.
    You don't need to prepare a CSV file with the filenames, then.
    There's a Get-File-Names step you can use to gather all filenames in that folder.
    I don't see why you must read all the rows into memory just to feed them to your last transformation.
    Why not let the last transformation read from the sqlite file directly?
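    To illustrate what "reading directly" means outside of PDI terms: the transformation (or any client) can open each SQLite file by its path and query it, one file at a time; roughly like this in Python (the table name is made up, just for illustration):

    Code:
    import sqlite3

    def read_rows(db_file):
        # db_file comes from Get-File-Names (or any file listing), one call per file
        conn = sqlite3.connect(db_file)
        try:
            # 'measurements' is a made-up table name for illustration
            for row in conn.execute("SELECT * FROM measurements"):
                yield row
        finally:
            conn.close()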
    So long, and thanks for all the fish.

  4. #4
    Join Date
    Feb 2017
    Posts
    18

    Default

    Quote Originally Posted by marabu View Post
    When downloading you most likely have a special folder where you store the sqlite files.
    You don't need to prepare a CSV file with the filenames, then.
    There's a Get-File-Names step you can use to gather all filenames in that folder.
    Okay, I will try that step (though I don't know how it works yet).
    Quote Originally Posted by marabu View Post
    I don't see why you must read all the rows into memory just to feed them to your last transformation.
    Why not let the last transformation read from the sqlite file directly?
    I have no idea what you mean by "let the last transformation read from the sqlite file directly". An SQLite file needs a database connection that has to be set up via a parameter in the Database Connection Wizard. As far as I understand it, there is no way to read from a binary file directly anyway. Do you have an example of how to actually feed the result of Get File Names into such a database connection, or can you explain a bit more what you mean by "reading from it directly"?

  5. #5
    Join Date
    Feb 2017
    Posts
    18

    Default

    Sorry for giving information about the actual problem in such small pieces. As I noticed, some of the databases are also malformed. As far as I know there is no way in Pentaho to check the consistency of a database file like SQLite; if I just used the "Check Db connections" step in my job, it would report that the check was successful. In my solution I run over that CSV file with a Python script and do an integrity check on each database. If the check fails, I delete the line and that database is not loaded into my staging area.

    There is also another problem that lies in the design of the source system. One table sits in an entirely separate database (file): db1 and db2 join two tables together over a common key. The reason is that the table in db2 becomes so large that it would bloat db1, which contains the most important data in my source system, so the designers decided to use a separate database for that single table (yet I need that information as well). I cannot see how that could be automated in Pentaho Data Integration. I would have to write another script that deletes malformed database files before my transformations start.
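
    For reference, this is roughly what my Python pre-check does. The CSV path and column name are placeholders, but the PRAGMA integrity_check is the SQLite consistency check I rely on:

    Code:
    import csv
    import sqlite3

    INDEX_CSV = "/data/db_index.csv"        # placeholder: CSV listing all database files
    CLEAN_CSV = "/data/db_index_clean.csv"  # placeholder: only databases that pass the check

    def is_intact(db_file):
        """Run SQLite's built-in consistency check on one database file."""
        try:
            conn = sqlite3.connect(db_file)
            try:
                row = conn.execute("PRAGMA integrity_check").fetchone()
                return row is not None and row[0] == "ok"
            finally:
                conn.close()
        except sqlite3.DatabaseError:
            return False  # not a valid SQLite file at all

    with open(INDEX_CSV, newline="") as src, open(CLEAN_CSV, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for line in reader:
            if is_intact(line["file_path"]):    # placeholder column name
                writer.writerow(line)
            else:
                print("skipping malformed database:", line["file_path"])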

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Sorry for the misunderstanding. I saw you reading SQLite data rows and copying them to the result row set (Copy-Rows-To-Result), when in fact you were reading that CSV file.

    Get-File-Names is easy to understand.
    You just specify a folder and a regex and watch the output fields.
    You don't even have to use this step.
    Why not create a shell script to call kitchen (or pan) with a filename parameter?
    That way a fresh JVM is started, no worries about memory leaks.

    I'm not convinced there's a memory leak in Kettle / SQLite / JDBC, just because your Kettle job is eating up memory.
    Maybe you can estimate the overall memory consumed by your transformation?
    Sometimes, it helps to reduce buffering in several places, so your process has a smaller memory footprint.
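
    A minimal sketch of that per-file approach (in Python rather than a shell script, but the loop is the same idea; the kitchen path, job file and parameter name are placeholders):

    Code:
    import glob
    import subprocess

    KITCHEN = "/opt/pentaho/data-integration/kitchen.sh"  # placeholder install path
    JOB_FILE = "/etl/jobs/load_one_sqlite.kjb"            # placeholder job expecting a DB_FILE parameter

    for db_file in glob.glob("/data/sqlite_sync/**/*.sqlite", recursive=True):
        # one kitchen call per database file -> every run gets a fresh JVM
        subprocess.run(
            [KITCHEN, "-file=" + JOB_FILE, "-param:DB_FILE=" + db_file],
            check=True,
        )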

  7. #7
    Join Date
    Feb 2017
    Posts
    18

    Default

    Quote Originally Posted by marabu View Post
    Sorry for the misunderstanding. I saw you reading SQLite data rows and copying them to the result row set (Copy-Rows-To-Result), when in fact you were reading that CSV file.

    Get-File-Names is easy to understand.
    You just specify a folder and a regex and watch the output fields.
    You don't even have to use this step.
    Why not create a shell script to call kitchen (or pan) with a filename parameter?
    That way a fresh JVM is started, no worries about memory leaks.

    I'm not convinced there's a memory leak in Kettle / SQLite / JDBC, just because your Kettle job is eating up memory.
    Maybe you can estimate the overall memory consumed by your transformation?
    Sometimes, it helps to reduce buffering in several places, so your process has a smaller memory footprint.
    Thank you for your kind reply. Indeed, I reduced the amount of buffering and the commit size. I also used "Select Values" to remove as many unneeded columns as possible before any lookups happen. I also saw that I was pushing all results as parameters to the next job; I'm not sure if that had any influence, but the jobs seem to run faster now that I unticked that option under the "Parameters" tab.

    I understand how the "Get File Names" step works by now, thank you for that information. For now I will stay with Copy Rows to Result, because I have already written a Python script that prevents Pentaho DI from choking on malformed database files. (I log all the filenames in my script so the ETL administrator can later find them and inform the R&D team about potentially hazardous customer machines.) And yes, I think I will call the command-line tools with the filename parameter once I have a plan for deployment. Right now I am trying to bulk-load a base dataset covering half a year.
