Hitachi Vantara Pentaho Community Forums

Thread: Debugging Kettle and Big Data plugin in Eclipse

  1. #1

    Debugging Kettle and Big Data plugin in Eclipse

    Hi everybody,
    I want to have Kettle and its Big Data plugin both available for some coding inside my Eclipse project(s).
    Now, I've been trying these guidelines:
    http://wiki.pentaho.com/display/EAI/...ettle+4+plugin - I tried both solutions from here

    as well as the instructions from the README.txt file inside the src-plugins package:
    http://jira.pentaho.com/browse/PDI-6...mmary-tabpanel

    Unfortunately, NONE of them has worked completely.

    I managed to get the plugins to show up as icons, and their dialogs open fine, but for some reason executing them fails. The same examples run successfully with the binary version of Kettle (inside Spoon).

    The exception I get when I run it from Eclipse is:
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : java.lang.NullPointerException
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.hadoop.PluginPropertiesUtil.loadPluginProperties(PluginPropertiesUtil.java:53)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.loadPMRProperties(JobEntryHadoopTransJobExecutor.java:1011)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.execute(JobEntryHadoopTransJobExecutor.java:815)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.Job.execute(Job.java:534)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.Job.execute(Job.java:673)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.Job.execute(Job.java:673)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.Job.execute(Job.java:398)
    2012/08/01 17:29:55 - ParseFileInsideMapper - ERROR (version 5.0.0-M1 from 2012/08/01 17:29:11.809) : at org.pentaho.di.job.Job.run(Job.java:317)

    I debugged it, and the exception occurs because the plugin directory for the Pentaho Hadoop job entry from the Big Data plugin is set to null (this variable is also null for all the other plugins!), and Spoon searches that directory to find pentaho-mapreduce.properties.

    Is there some run/debug configuration that I have missed in Eclipse?
    Has anyone run into something similar before?


    Also, I noticed that when I open, for example, the Hadoop Copy Files job entry dialog, only the local file system is available and no HDFS or S3. Again, in the binary version this works fine.
    I guess this is an issue:
    http://jira.pentaho.com/browse/PDI-8224
    I'm trying to fix it.

    Thanks a lot for the help.

  2. #2
    Join Date
    Aug 2010
    Posts
    87

    Default

    Hey kepha,

    The Big Data plugin relies on being loaded from the file system and likely won't fully load when configured the way you have it. Most of our developers rely on remote debugging for Kettle plugins: we remotely debug Kettle with the Big Data Plugin installed in $KETTLE/plugins. I've outlined the process under the section "Remote Debugging" here: http://wiki.pentaho.com/display/BAD/...ava+Developers
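
    For what it's worth, a typical remote-debugging setup looks roughly like the sketch below. The JDWP options are the standard ones (the same ones quoted later in this thread); the environment variable name and the port are illustrative assumptions, not taken from the wiki page, so adjust them to match your spoon.sh/Spoon.bat:

        # Launch Spoon from a normal file-system install with the debug agent enabled
        # (alternatively, append these options to the Java options line in spoon.sh / Spoon.bat):
        export PENTAHO_DI_JAVA_OPTIONS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
        ./spoon.sh

        # Then, in Eclipse: Run > Debug Configurations... > Remote Java Application,
        # connection type "Standard (Socket Attach)", host localhost, port 5005,
        # with the Kettle and Big Data plugin projects on the source lookup path.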

    Hope this helps. Let me know if something is unclear.

  3. #3

    Default

    Maybe I'm missing something here, but I followed the instructions from that link and still can't get the Big Data functionality incorporated into Spoon.
    I want both Kettle and the Big Data plugin inside Eclipse, as I'm adding different things to both projects.

    When running Kettle I put the JVM arguments "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005" in the run/debug configuration of my Kettle project, with Spoon as the main class.
    I don't see the Big Data options in the step or job entry palette in Spoon, so I cannot debug them.

    In fact, when I download the recommended CI build (4.4.0-snapshot) and start Spoon, I also don't have any Big Data functionality (with 4.3 I do have it, but there are some other things from 4.3 that I want to avoid).

    Is there something else, other than the things mentioned in the instructions at http://wiki.pentaho.com/display/BAD/...ava+Developers for remote debugging, that I might have skipped?
    If so, it would be great if you could share it.


    Thanks

  4. #4
    Join Date
    Aug 2010
    Posts
    87

    Default

    The Big Data Plugin must reside on disk in the plugins/ directory. The most common development environment is to use the ant script to build the plugin once and extract it into Kettle's plugins/ directory (creating plugins/pentaho-big-data-plugin). You can then debug Kettle via Eclipse as you would any other Java application. There's no need for the remote debug options if you're using Eclipse to launch Kettle. As long as the Big Data Plugin resides in ${working.dir}/plugins, Kettle will load it. As you make additional modifications to the Big Data plugin you will have to do one of the following (a rough command-line sketch of this loop follows below):

    1) Rebuild the plugin using ant and update the install in plugins/.
    2) Only rebuild the jar (much faster) and update the existing jar in plugins/pentaho-big-data-plugin.
    3) Modify code during a debug session and let the JVM hot-swap the classes in.

    If you're attempting to work on any Hadoop Configuration (shim) or Pentaho MapReduce you'll likely need to rebuild the entire plugin and redeploy. I have found no quicker way to work on the Big Data Plugin as it is a collection of many Kettle plugin types: Step Plugin, Job Entry Plugin, Database Plugin, VFS Provider, and Kettle Lifecycle Plugin.
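
    For reference, the rebuild/redeploy loop might look something like this on the command line; the ant targets are the ones mentioned later in this thread, while the dist/ output path and the KETTLE_HOME placeholder are assumptions about your local layout:

        # 1) Full rebuild: build the plugin and copy the assembled directory into Kettle's plugins/.
        cd pentaho-big-data-plugin
        ant resolve install-plugin
        rm -rf "$KETTLE_HOME/plugins/pentaho-big-data-plugin"
        cp -r dist/pentaho-big-data-plugin "$KETTLE_HOME/plugins/"

        # 2) Jar-only change (much faster): rebuild just the plugin jar and overwrite it in place.
        cp dist/pentaho-big-data-plugin/*.jar "$KETTLE_HOME/plugins/pentaho-big-data-plugin/"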

  5. #5

    Default

    Thanks! This seems like a reasonable solution for my setup.
    However, I now have a problem when starting Spoon: it is unable to register the plugins and throws some exceptions.
    Exactly which files/folders should I copy to the plugins/pentaho-big-data-plugin directory?

    I used "ant resolve install-plugin" which is recommended on the page you gave me.
    I tried with coping the dist and lib to plugin directory. Should I also modify classspath of my kettle project (e.g., should I inlude the jars from plugins/pentaho-big-data-plugin directory/lib directory)?

    If I'm doing something wrong, could you please give me some further explanation?


    Thanks again.

  6. #6

    Default

    I finally succeeded in including the Big Data plugin in my Kettle project.
    I used "ant resolve install-plugin", which creates the directory that should be copied to the plugins/pentaho-big-data-plugin directory.

    However, when starting Spoon I got the warning:
    "WARN 23-08 23:00:02,739 - Unable to load Hadoop Configuration from "file:///C:/Kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh3u4". For more information enable debug logging."

    Because of this, the MapReduce job raises the exception:

    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : java.lang.RuntimeException: class org.apache.hadoop.security.ShellBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.ShellBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:898)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.Groups.<init>(Groups.java:48)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:137)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.hadoop.shim.common.CommonHadoopShim.getFileSystem(CommonHadoopShim.java:99)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.execute(JobEntryHadoopTransJobExecutor.java:697)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.Job.execute(Job.java:534)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.Job.execute(Job.java:673)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.Job.execute(Job.java:673)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.Job.execute(Job.java:398)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.pentaho.di.job.Job.run(Job.java:317)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : Caused by: java.lang.RuntimeException: class org.apache.hadoop.security.ShellBasedUnixGroupsMapping not org.apache.hadoop.security.GroupMappingServiceProvider
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:892)
    2012/08/23 23:09:28 - Pentaho MapReduce 2 - ERROR (version 5.0.0-M1 from 2012/08/23 23:00:00.960) : ... 18 more


    I tried including the jars from the plugin in the Kettle project's classpath, but that didn't solve it.

    Do you know what could cause this problem? Where should I set the Hadoop configuration?

  7. #7
    Join Date
    Aug 2010
    Posts
    87

    Default

    You shouldn't need to add any files to the Kettle classpath. The Big Data plugin is designed to be self-sufficient (with the exception of the hive-jdbc driver shim jar that's already in Kettle/libext/JDBC).

    The Hadoop Configuration is configured through the Big Data plugin's plugin.properties. The failure to load a specific plugin is usually not a problem, but it may indicate that you don't have your environment quite set up properly yet. I recommend you set up a clean Kettle install (the 4.4.0-stable build), build the Big Data plugin locally, and install it into that clean Kettle install. See if the Big Data plugin works then. If so, you've at least ruled out a problem building the Big Data plugin locally and you can move on to debugging your dev environment.
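
    For reference, the relevant entry in plugin.properties looks roughly like this; the property name is the one mentioned later in this thread, and the configuration folder name is just an example:

        # plugins/pentaho-big-data-plugin/plugin.properties
        # Selects which folder under hadoop-configurations/ is loaded as the active shim.
        active.hadoop.configuration=cdh3u4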

    You can find more info on the new Hadoop configurations support here: http://wiki.pentaho.com/display/BAD/...Configurations

    Hope this helps. Seems like you're close.

  8. #8

    Default

    Well, I understand that I should align the Hadoop distribution with the one we are using on the cluster, but what about the other configurations? It says that I should copy the configuration that most closely matches the one I want to communicate with.

    On the cluster we have hadoop-core-0.20.203.0 and the default here is hadoop-20. I copied the hadoop-core-0.20.203.0.jar that we use on the cluster, but where can I find the other libraries that I need to copy?

  9. #9
    Join Date
    Aug 2010
    Posts
    87

    Default

    Likely copying just the hadoop-core jar for 0.20.203.0 will be sufficient. I would copy the hadoop-20/ configuration into a hadoop-20-203/ and update the hadoop-core jar there. You'll then need to update plugin.properties and set active.hadoop.configuration=hadoop-20-203.
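
    Spelled out as commands, those steps might look roughly like this; the hadoop-configurations path and the lib/ jar location follow the layout described elsewhere in this thread, and the KETTLE_HOME placeholder and local jar path are assumptions:

        cd "$KETTLE_HOME/plugins/pentaho-big-data-plugin/hadoop-configurations"

        # Clone the stock hadoop-20 configuration and swap in the cluster's hadoop-core jar.
        cp -r hadoop-20 hadoop-20-203
        rm hadoop-20-203/lib/hadoop-core-*.jar
        cp /path/to/hadoop-core-0.20.203.0.jar hadoop-20-203/lib/

        # Then, in plugins/pentaho-big-data-plugin/plugin.properties, set:
        #   active.hadoop.configuration=hadoop-20-203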

    Probably more info than you need, but good to know: if you want the jars to line up exactly as they are on the cluster (or if you cannot connect by simply swapping out the hadoop-core jar), you'll need to either copy them into that hadoop-20-203/lib/ or add a classpath entry in the properties file for your new configuration (config.properties) pointing to your HADOOP_HOME/lib directory and HADOOP_HOME/hadoop-core-0.20.203.0.jar (if Hadoop is installed locally).

  10. #10

    Default

    You were right, my environment was not set up properly, since I had previously tried to set up the debugging environment a different way and left many things dirty. I checked out a fresh Kettle 4.4.0 and now it works fine.
    My only comment is that in Kettle 4.4.0 (unlike in the stable Kettle 4.3 build), in the Pentaho MapReduce job entry properties, on the "Cluster" tab, I don't have the "Hadoop Distribution" option to change the distribution. I suppose this is OK, considering that PHD is no longer required.

  11. #11
    Join Date
    Aug 2010
    Posts
    87

    Default

    Great news! And yes, we've removed the Hadoop distribution UI options in favor of the new Hadoop Configurations (shim) support. You can configure your active configuration in pentaho-big-data-plugin/plugin.properties.
