Hitachi Vantara Pentaho Community Forums

Thread: Development: java vs spoon

  1. #1
    Join Date
    Aug 2016
    Posts
    290

    Default Development: java vs spoon

    I wondered if anyone with a software developer background had some thoughts about using Java (pure code) vs Spoon, and also about the possibility of using the Kettle library in Java code instead of Spoon.

    From my point of view, Spoon is sometimes better, especially for easier tasks, and it gives a quick 'big picture' understanding of what is happening. It is also fairly fast to get things done; in particular, connections to other resources (databases, web services and more) usually work without problems.

    Spoon really is an abstraction layer above the programming language. All class structures are already finished and hidden from the user (inheritance, interfaces etc.). What I miss in Spoon is handling data in more detail, especially between jobs and transformations. While Spoon can handle single-value arguments and a single list (result rows), Java can handle any number of arguments (multiple result rows, lists as arguments etc.). Java also has strict typing, so you will hardly ever be in a situation where a string is confused with an int, while Spoon is more like a scripting language without this safety (except between steps in a transformation). I also miss the opportunity to use basic object-oriented principles, instead of having to re-use code with the help of arguments.

    A huge bonus in Spoon is the ability to use custom Java code in steps. This works well for smaller pieces of code, but the text editor in Spoon for Java code is nothing compared to a full Java development environment (Eclipse etc.). If you want a larger Java code base in Spoon, you could of course import it as a library, but then you would need to edit it separately (outside Spoon).

    Do you have any thoughts, or maybe tips on how to better fill the gaps between Spoon and Java programming? Have you tried using the Kettle library directly in Java code?
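
    To make the question concrete, the kind of direct use of the Kettle library I have in mind looks roughly like this (a minimal sketch; the .ktr path is made up):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunFromJava {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();  // load plugins and initialize the Kettle runtime
            TransMeta meta = new TransMeta("/path/to/example.ktr");  // hypothetical file
            Trans trans = new Trans(meta);
            trans.execute(null);       // run with no extra arguments
            trans.waitUntilFinished();
            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }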

  2. #2

    Default

    In most cases where I've seen a company embed Kettle libraries into a Java application, they tend to use the Java application as a scheduler / orchestrator for an existing legacy enterprise application, of which PDI is a small component. Typically they just call the methods to start a job or transformation on a Carte server.

    If instead they're trying to manipulate/enrich the data itself as part of a transformation, there's a much lower barrier to entry in writing a plugin with the necessary logic than in doing a full-on embedding of Kettle inside a Java application.

    I'm not sure I understand why you're thinking PDI is a scripting language. It's a compiled Java application. PDI exposes various configuration settings through steps and job entries, allowing end users to customize how the data is processed as it passes through the various steps. There are some steps that include a scripting engine (e.g. Modified Java Script Value), but the underlying application is compiled, which means Java data types are enforced.

    Hope that helps

  3. #3

    Default

    If you are a coder, you can rebuild everything yourself. When you want an understandable and optimised solution, including its limitations, take a tool. Personally, I wouldn't want to code the more complex steps, and I find tools greatly superior for administration as well. Otherwise, feel free to do the same thing Matt did.

  4. #4
    Join Date
    Aug 2016
    Posts
    290

    Default

    Thanks for sharing your thoughts!

    My original thought was: "is Spoon really the best tool here, or would I be better off writing my own Java program from scratch?". This applies to solving a standard or complex ETL problem: read from some source (file, database, web etc.), transform it in some way, and write the results to a database (statistics in my case). For the easier tasks, I think Spoon is actually superior to plain Java code. But for the more complicated problems I'm not so sure, so I sometimes ask myself whether this is the right tool or not. So far it has been, but new challenges come up periodically.

    Having worked a couple of years with Spoon, and having some background in pure Java (mostly educational/training), I found these differences:

    Spoon pros:
    - fast development of basic functionality
    - simple handling of events (success/failure)
    - easy connections to external sources (database, web, file and more)
    - scripted, no compile step
    - can take custom Java code and additional Java libraries (jars)
    - rich functionality already included (steps)
    - class structure is already defined, optimized and hidden (interfaces, inheritance, design patterns)

    Java pros:
    - better debugging and testing environment
    - more freedom to pass complex data structures like arrays as variables or objects
    - object oriented
    - type checking, more safety and alerts if you have a mismatch

  5. #5

    Default

    I can't comment on the debugging and testing part, as I am not a Java developer. As for object orientation, I see no added value for working with data to fill data warehouses. I think in terms of the problem I have to solve. PDI is an optimised tool that solves the issues you face when filling a data warehouse. When it falls short on one or more of those issues, I first ask myself whether I am thinking from the perspective of the ETL sub-components, or as a software developer or a SQL developer/user.

    You are always welcome to add functionality you feel is missing from PDI.

  6. #6
    Join Date
    Aug 2016
    Posts
    290

    Default

    When I'm making multiple transformations for multiple fact tables that share certain similarities, I sometimes think object-oriented principles would be nice. I'd like to use interfaces, inheritance etc. Instead, I'm often duplicating code to do the same thing multiple times. On the other hand, there is already good use of design patterns below the Kettle layer, so it's nice to have that optimized and out of the way.

  7. #7

    Default

    Which similarities do they share?

  8. #8
    Join Date
    Aug 2016
    Posts
    290

    Default

    They will typically share a number of dimensions (time, date, source, destination). Then they will each have a number of unique dimensions. This would be a classic case for inheritance in Java, but so far I've just made a bunch of duplicated steps in transformations.
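
    As a sketch of what I mean: in Java, the shared dimensions could live in a parent class and each fact loader would only add its own unique dimensions (class and method names here are hypothetical):

    import java.sql.Timestamp;
    import java.util.Map;

    // Hypothetical sketch: conformed dimensions are handled once in the parent class.
    abstract class FactLoader {
        // Shared dimension key (e.g. date): written once, inherited by every fact loader.
        protected long lookupDateKey(Timestamp ts) {
            return ts.toInstant().getEpochSecond() / 86_400L; // derive the key by conversion
        }

        // Each fact table implements only its own unique dimensions.
        protected abstract void lookupUniqueDimensions(Map<String, Object> row);

        public final void load(Map<String, Object> row) {
            row.put("date_key", lookupDateKey((Timestamp) row.get("event_time")));
            lookupUniqueDimensions(row);
            // ... insert the completed row into the fact table
        }
    }

    class SalesFactLoader extends FactLoader {
        @Override
        protected void lookupUniqueDimensions(Map<String, Object> row) {
            // sales-specific dimension lookups go here
        }
    }

    If the shared lookup logic ever changes, only FactLoader needs editing.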

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Very few users here have written Kettle transforms in Java.
    One of the big bonus points of using PDI is that you *don't* need to be a coder to build and maintain the scripts.

    Building internal company templates (like you would for MS Word!) is how you can get around a lot of your perceived shortcomings.
    I haven't written any Java code in over 10 years, and yet I can troubleshoot a PDI transformation quite well.

  10. #10
    Join Date
    Aug 2016
    Posts
    290

    Default

    By templates, do you mean using search and replace to insert custom values? I've used that before. It is useful for creating the transformations, but if you are going to change them later, you still need to go into each and every one and apply the same change. So you still end up with duplicate code after using templates. That's the magic of object-oriented programming and inheritance: if you need to change something, you change it once and it applies everywhere.

  11. #11

    Default

    Sorry, but what advantage does object orientation have with shared (e.g. conformed) dimensions? Typically you look up the data warehouse key of your dimension using the business key from your stream. There is a step for that. You can copy that step into your various fact transformations, provided the stream field always has the same name. It typically doesn't, as your facts have different sources, especially for date. So then you have to change the stream field (or fields, if your dimension is only unique on multiple business keys) to reuse it. Nothing fancy for me.
    As for date dimension keys, if you follow best practice, you derive them with a conversion. No lookup needed.
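
    For example, assuming the common yyyyMMdd "smart key" convention, the key falls straight out of the date itself (a minimal sketch in plain Java):

    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    class DateKeyDemo {
        public static void main(String[] args) {
            // Derive the yyyyMMdd key directly from the date: no lookup step needed.
            long dateKey = Long.parseLong(
                    LocalDate.of(2018, 3, 12).format(DateTimeFormatter.BASIC_ISO_DATE));
            System.out.println(dateKey); // prints 20180312
        }
    }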

  12. #12
    Join Date
    Aug 2016
    Posts
    290

    Default

    You can define the shared code once in a parent class. How do you do that with transformations?

    I have no problem handling the technical keys, or inserting new dimensions and facts, in Java code or in transformations. This is about re-using code, basic principles. Every time I made a new transformation, I copied the steps you mention: dimension lookups and inserting new dimension members. If I ever needed to change the implementation of these shared dimensions, I would have to do it in 20+ transformations, all containing the same steps ("code").

    The only way I've found to re-use transformations is to use variables and logic based on which variable was given. That is sub-optimal in many cases.
    Last edited by Sparkles; 03-12-2018 at 07:29 AM.

  13. #13
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    My point was that even Java developers get inheritance wrong (on a somewhat regular basis).
    PDI transforms are maintainable by people who DON'T CODE.
    You won't get that with Java development.

    PDI is about putting the tools into the hands of the people with the business knowledge, rather than needing a dedicated resource just for the transforms.

  14. #14

    Default

    The steps are the code; I really don't see your point. When I use a lookup step, I add 3-4 variables and I'm done. I think you just have an issue letting go of coding. I see no point in trying to explain it further.

  15. #15
    Join Date
    Sep 2015
    Posts
    20

    Default

    Sparkles, here are some things you might look at to avoid repeating the same code in lots of places:
    1) You can call a transformation from within another transformation (Flow > Transformation Executor)
    2) You can pass information to at least some steps (such as Table Output) to configure them via fields of the data stream
    3) We quite successfully use variables defined in a job to control the operation of transformations (while in our case the variables are generally set in the first transformation, that's not required--and you can change their values in subsequent transformations if needed)
    4) Be sure to check out the metadata injection capabilities as well
    5) Plug-ins (we're [admittedly slowly] creating some to handle situations where we need to apply the same logic repeatedly to different data streams--the logic in question is not available in a standard step, so for now we use User Defined Java Classes; see the sketch below)
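
    For reference, the code inside a User Defined Java Class step usually follows this skeleton (PDI generates the surrounding class and imports; you only write processRow). A minimal sketch:

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
        Object[] r = getRow();                 // fetch one row from the input stream
        if (r == null) {                       // no more input rows
            setOutputDone();
            return false;
        }
        r = createOutputRow(r, data.outputRowMeta.size());
        // ... the logic you want to reuse across data streams goes here ...
        putRow(data.outputRowMeta, r);         // pass the row to the next step
        return true;
    }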

    Maybe some of those ideas will be useful.

    John

  16. #16
    Join Date
    Aug 2016
    Posts
    290

    Default

    Thanks, great input!
