Hitachi Vantara Pentaho Community Forums

Thread: Transformation in SQL Select vs Java Expressions Performance

  1. #1

    Default Transformation in SQL Select vs Java Expressions Performance

    Hi -

    In one of our applications we use Oracle SQL functions and PL/SQL functions directly in SELECT statements to do all kinds of data transformation and manipulation, and even lookups.

    I want to understand what would happen if I moved all of that transformation logic into the Java Expressions provided in the Kettle transformation flow, and only selected plain columns from the database tables without doing anything fancy. Which option would be better in terms of maintainability and performance?
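
    To make it concrete, the contrast I have in mind is roughly the following (the table and column names are only illustrative, not our real schema):

        -- Option A (today): transformations and lookups inside the source SELECT
        SELECT UPPER(TRIM(c.cust_name)),
               TO_CHAR(c.created_dt, 'YYYY-MM-DD'),
               (SELECT r.region_name FROM regions r WHERE r.region_id = c.region_id)
        FROM   customers c;

        -- Option B (considered): plain SELECT only; the formatting and the region
        -- lookup would be done afterwards in Kettle steps (e.g. a User Defined
        -- Java Expression step and a lookup step)
        SELECT c.cust_name, c.created_dt, c.region_id
        FROM   customers c;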

    Performance is key, since we have around 8-10 billion records in total, spread over more than 150 tables, that we need to move from one Oracle database to another.

    Any suggestions? I want to understand how the Kettle engine handles transformations compared to functions in SQL SELECT queries.

    Regards

  2. #2

    Default

    Hi,

    That's a big question you ask!

    First, it all depends on what you're trying to achieve, so please provide more details on the following:
    - Do you want to migrate one database, i.e. just a one-shot migration, possibly with some model changes?
    - If not migrating, what type of architecture and model are your source system and target system? OLTP -> OLTP, OLTP -> EDW (staging/data vault/dimensional model, etc.)?
    - If not migrating, is the volume of 8-10 billion records just the initial load, or your regular load in the batch window (I hope not, that is big!)?
    - What infrastructure do you have? Number of DB and ETL nodes, CPUs, maybe different physical locations (like source DB in one city, target DB in another), etc.?

    Your question on the performance of full SQL versus full ETL (in PDI) really depends on these things.

    However, based on my experience, I can provide some observations already:

    - The first thing I can say is that, for the same infrastructure in DB and ETL (like one DB node on 4 CPUs versus one DB node on 4 CPUs plus one ETL node on 4 CPUs), you generally cannot beat full SQL, at least for transformations of simple to medium complexity. The reason is simple: for ETL transforms, you have to pull the data out of the database, do the transform, then put it back into the database. This can be quite an overhead (the network in between, the conversion from DB data to Java objects and back to DB data), especially when you put the data back into the DB (selects are cheap, but inserts/updates can be really expensive). So small transformations in SQL are really fast, as the data is computed "in place", and DB engines are generally more optimised for this kind of work than any ETL tool (they are typically C/C++ engines, heavily optimised for computation and I/O and working very close to the bytes; you cannot beat that with a Java ETL). See the small sketch after these observations.

    - For the same infrastructure, for complex transformations involving complex business logic, the full SQL solution is not always the fastest: the SQL can become quite nasty (and hard to maintain!), and it is not always the most performant, depending on the DB engine. It all depends on the capabilities of your DB (admittedly, Oracle has quite a good set of analytic functions). In general, complex logic can be broken down into simpler consecutive steps, maybe with intermediate tables, so it also depends on how it is done. These complex cases are a good place for ETL, since you can choose exactly which data you need from multiple sources and do really complex work (even use a Java library to handle the business logic, joins, sorts, aggregations in memory, etc.). This can be much simpler to develop and maintain than pure PL/SQL scripts.

    - In PDI, when using the dedicated steps (like a table input, some table lookups, some calculators, then a table output/update), the idea is not to generate some "bulk" SQL and apply it to the DB, but to stream data from one step to another. It is fully multithreaded: each step has its own thread, and thus its own (share of) CPU, so while you are reading from your source database you are at the same time doing lookups and putting data into the target output, as soon as rows of data arrive. Things can therefore be parallelised easily and you can use the full power of all your CPUs. With this parallelisation, if things are done well and you have the infrastructure, you can beat the full SQL solution and scale pretty well.
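
    To illustrate the "in place" point above with a minimal sketch (the table and column names are made up, not from any real system), a small transformation like this never leaves the database:

        -- the concatenation and date truncation are computed by the DB engine;
        -- no rows travel out to an ETL process and back
        INSERT INTO target_table (cust_key, full_name, load_date)
        SELECT cust_id,
               first_name || ' ' || last_name,
               TRUNC(created_dt)
        FROM   source_table;

    The equivalent PDI flow (table input --> calculator or Java expression --> table output) has to move every row over the network twice, which is where the overhead comes from.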

    Based on these considerations, and supposing you are migrating data OR implementing an EDW load system from OLTP to EDW, I would recommend the following:
    1. Create a staging area in the target system.
    2. Use whatever is fastest (PDI table input --> table output/bulk load, or Oracle dump --> move file --> Oracle bulk load) to move your data from the source system to the staging area, without any change to the data.
    3. Use a combination of SQL scripts and ETL transformations between your staging area and the final target system, along these lines:
    3.1 Use PDI to handle the overall workflow. If you have an SQL script somewhere, just call it from PDI in a job; the same goes for a shell script, etc.
    You can then maintain the whole workflow easily in the same tool.
    3.2 Don't pull data out of the DB if you don't need to. Use scripts like "insert into target select ... from source" to do bulk transformations. If you need a simple transformation like "insert into target select concat(fieldA, fieldB) from source", just do that in SQL. The DB will treat all the data in one shot, faster than any PDI table input --> calculator --> table output. But you can partition the data to process from the source system with a WHERE condition and a PDI variable (for example, process only one year of data), so that you can parallelize your script with PDI; see the sketch after this list.
    3.3 For complex logic:
    - If you can, use several bulk scripts with intermediate tables, like "insert into temp01 select ... from source", then "insert into temp02 select ... from temp01", and finally "insert into target select ... from temp02". Use PDI/SQL to create (or truncate) the intermediate tables and to handle the flow and errors.
    - OR use a PDI transformation to do the logic. If you need to synchronize the result with a target table, first output the result into an empty table, then use bulk SQL scripts to synchronize the tables.
    3.4 Use PDI scripting steps and metadata-driven transformations to generate generic SQL and load patterns. For example, with the metadata of the source DB and some rules, you can generate the whole staging area and create a small set of PDI jobs/transformations to load ALL tables from source to target. The same applies to generic transformations from staging to EDW (in particular if your EDW is a data vault).
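
    As a sketch of the partitioned bulk pattern from 3.2, assuming hypothetical table names and a PDI variable called ${LOAD_YEAR} (the script being called from a PDI job or transformation with variable substitution enabled), it could look like this:

        -- hypothetical partitioned bulk load: the join, the concatenation and the
        -- arithmetic all stay inside Oracle; ${LOAD_YEAR} is substituted by PDI
        -- before the statement is sent to the database
        INSERT INTO target_orders (order_key, customer_name, order_total)
        SELECT o.order_id,
               c.first_name || ' ' || c.last_name,
               o.amount * o.exchange_rate
        FROM   stg_orders o
               JOIN stg_customers c ON c.customer_id = o.customer_id
        WHERE  o.order_date >= TO_DATE('${LOAD_YEAR}-01-01', 'YYYY-MM-DD')
        AND    o.order_date <  ADD_MONTHS(TO_DATE('${LOAD_YEAR}-01-01', 'YYYY-MM-DD'), 12);

    Launching several copies of the job in parallel, each with a different ${LOAD_YEAR}, gives you the parallelism without pulling a single row out of the database.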

    I would say that the overall performance of your system does not come down to things like "is a CONCAT(...) in SQL faster than a concat in Java", but really to your architecture and your loading patterns. So there is no answer like "full SQL is faster than full ETL". The only true thing is: full SQL is generally more cryptic and harder to maintain (and you need developers competent in SQL), and full ETL is generally not well done, because it makes it too easy to build complex but badly designed things.

    So before rewriting the world in PDI, rethink the big picture, and leave alone the parts that already work well!
    Finally, before taking a decision, I strongly recommend building some prototypes on a small set of data, with your PL/SQL, with PDI, and/or with a mixed flavour of the two, and running performance benchmarks for each major type of problem you have to solve. You will quickly see what works best with each implementation.
    Last edited by Mathias.CH; 04-17-2016 at 12:19 PM.

  3. #3

    Default

    Thanks again for the detailed response, really appreciated.

    Yes, we are also thinking of using a hybrid approach, and we will be doing some performance testing.
