Hitachi Vantara Pentaho Community Forums

Thread: Running Pentaho MapReduce Jobs from Eclipse on the CDH42 Shim

  1. #1
    Join Date
    Feb 2014
    Posts
    9

    Default Running Pentaho MapReduce Jobs from Eclipse on the CDH42 Shim

    Hi,

    I have created a MapReduce application using the Pentaho Big Data Plugin and Pentaho Hadoop Shims libraries.
    The transformation (.ktr) and Hadoop job (.kjb) files run smoothly in Spoon.
    Using the same .ktr, .kjb and plugin.properties from the Spoon environment in Eclipse fails.
    Any thoughts?

    Configuration reads:
    FileObject pluginProperties = VFS.getManager().resolveFile(System.getProperty("user.home") + "/workspace/PDIMapReduceApps/plugin.properties");
    HadoopConfiguration config = new HadoopConfiguration(pluginProperties, "cdh42", "cloudera", new HadoopShim());

    Error is
    java.lang.NullPointerException
    at org.pentaho.hadoop.PluginPropertiesUtil.loadProperties(PluginPropertiesUtil.java:60)
    at org.pentaho.hadoop.PluginPropertiesUtil.loadPluginProperties(PluginPropertiesUtil.java:80)
    at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.loadPMRProperties(JobEntryHadoopTransJobExecutor.java:970)
    at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.execute(JobEntryHadoopTransJobExecutor.java:779)
    at com.impetus.di.MapOnlyWebLogsTransformationJobExecutor.main(MapOnlyWebLogsTransformationJobExecutor.java:91)
    2014/02/16 04:51:19 - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) :
    2014/02/16 04:51:19 - ERROR (version 5.0.1-stable, build 1 from 2013-11-15_16-08-58 by buildguy) : java.lang.NullPointerException
    2014/02/16 04:51:19 - at org.pentaho.hadoop.PluginPropertiesUtil.loadProperties(PluginPropertiesUtil.java:60)
    2014/02/16 04:51:19 - at org.pentaho.hadoop.PluginPropertiesUtil.loadPluginProperties(PluginPropertiesUtil.java:80)
    2014/02/16 04:51:19 - at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.loadPMRProperties(JobEntryHadoopTransJobExecutor.java:970)
    2014/02/16 04:51:19 - at org.pentaho.di.job.entries.hadooptransjobexecutor.JobEntryHadoopTransJobExecutor.execute(JobEntryHadoopTransJobExecutor.java:779)
    2014/02/16 04:51:19 - at com.impetus.di.MapOnlyWebLogsTransformationJobExecutor.main(MapOnlyWebLogsTransformationJobExecutor.java:91)

  2. #2
    Join Date
    Sep 2012
    Posts
    71

    Default

    A couple of quick notes/questions:

    1) Instead of VFS.getManager() (if that's the Apache VFS class, not your own), use KettleVFS.
    2) That NPE looks like it can't find the PDI plugins. Does your code call KettleEnvironment.init() and perhaps KettleClientEnvironment.init()? From Eclipse you might need to set the KETTLE_PLUGIN_CLASSES system property to include various plugin classes, but for core plugins like JobEntryPlugin you might just need to set your launch config's working directory to the PDI base folder, and/or add all the kettle-* JARs to your classpath. (See the sketch below.)
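
    For what it's worth, a minimal sketch of that setup (the plugin.properties path is the one from the first post; KettleVFS.getFileObject is just one way to resolve it, and this assumes the launch config's working directory is the data-integration folder):

    KettleEnvironment.init();          // initializes Kettle and registers the plugin types
    KettleClientEnvironment.init();    // client-side initialization, as suggested above

    // KettleVFS instead of the raw Apache VFS manager
    FileObject pluginProperties = KettleVFS.getFileObject(System.getProperty("user.home") + "/workspace/PDIMapReduceApps/plugin.properties");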

  3. #3
    Join Date
    Feb 2014
    Posts
    9

    Default

    I had KettleEnvironment.init() but not KettleClientEnvironment.init(); I have now added that too.
    Changed VFS.getManager to KettleVFS.getManager.
    Added all the JARs from the Big Data Plugin and Hadoop Shims (in my case CDH42) folders (lib, client, pmr).
    Added KETTLE_PLUGIN_CLASSES to the run configuration, pointing to the big-data-plugin folder.

    The current error is below.

    at org.pentaho.di.core.plugins.PluginFolder.findJarFiles(PluginFolder.java:108)
    at org.pentaho.di.core.plugins.JarFileCache.getFileObjects(JarFileCache.java:45)
    at org.pentaho.di.core.plugins.BasePluginType.findAnnotatedClassFiles(BasePluginType.java:235)
    at org.pentaho.di.core.plugins.BasePluginType.registerPluginJars(BasePluginType.java:511)
    at org.pentaho.di.core.plugins.BasePluginType.searchPlugins(BasePluginType.java:117)
    at org.pentaho.di.core.plugins.PluginRegistry.registerType(PluginRegistry.java:517)
    at org.pentaho.di.core.plugins.PluginRegistry.init(PluginRegistry.java:489)
    at org.pentaho.di.core.plugins.PluginRegistry.init(PluginRegistry.java:457)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:110)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:65)

  4. #4
    Join Date
    Sep 2012
    Posts
    71

    Default

    KETTLE_PLUGIN_CLASSES would be set to a comma-delimited list of fully-qualified plugin classes. Also if you're accessing the cdh42 JARs via the API/SPI, you shouldn't add them to your classpath in Eclipse; rather you want to set your working directory (for your launch config) to the data-integration folder. The plugin types will search the classpath and then the plugins/ directory off your working directory for things like plugin.xml, annotated classes, etc.
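
    As a rough illustration (the plugin class names below are placeholders, not verified entries), the Eclipse launch configuration might look like:

    Working directory: /home/cloudera/data-integration
    VM arguments:      -DKETTLE_PLUGIN_CLASSES=com.example.MyJobEntryPlugin,com.example.MyStepPlugin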

  5. #5
    Join Date
    Feb 2014
    Posts
    9

    Default

    It looks like it is reading the configuration from where my Pentaho installation is, but it fails to load the Hadoop configuration.
    (P.S. The same configuration works fine within Spoon.)

    WARN 17-02 01:40:29,498 - Unable to load Hadoop Configuration from "file:///home/cloudera/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh42". For more information enable debug logging.
    Exception in thread "main" org.pentaho.di.core.exception.KettleException:
    Error invoking lifecycle listener: org.pentaho.di.core.hadoop.HadoopConfigurationBootstrap@edc86eb
    Error initializing Hadoop configurations. Aborting initialization.


    at org.pentaho.di.core.lifecycle.KettleLifecycleSupport.onEnvironmentInit(KettleLifecycleSupport.java:102)
    at org.pentaho.di.core.KettleEnvironment.initLifecycleListeners(KettleEnvironment.java:131)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:117)
    at org.pentaho.di.core.KettleEnvironment.init(KettleEnvironment.java:66)
    at com.impetus.di.MRTransformation.main(MRTransformation.java:34)
    Caused by: org.pentaho.di.core.lifecycle.LifecycleException: Error initializing Hadoop configurations. Aborting initialization.
    at org.pentaho.di.core.hadoop.HadoopConfigurationBootstrap.onEnvironmentInit(HadoopConfigurationBootstrap.java:109)
    at org.pentaho.di.core.lifecycle.KettleLifecycleSupport.onEnvironmentInit(KettleLifecycleSupport.java:97)
    ... 4 more
    Caused by: org.pentaho.hadoop.shim.ConfigurationException: Invalid active Hadoop configuration: "cdh42".
    at org.pentaho.di.core.hadoop.HadoopConfigurationBootstrap.onEnvironmentInit(HadoopConfigurationBootstrap.java:97)
    ... 5 more
    Caused by: org.pentaho.hadoop.shim.ConfigurationException: Unknown Hadoop Configuration: "cdh42"
    at org.pentaho.hadoop.shim.HadoopConfigurationLocator.getConfiguration(HadoopConfigurationLocator.java:495)
    at org.pentaho.hadoop.shim.HadoopConfigurationLocator.getActiveConfiguration(HadoopConfigurationLocator.java:503)
    at org.pentaho.di.core.hadoop.HadoopConfigurationBootstrap.onEnvironmentInit(HadoopConfigurationBootstrap.java:95)
    ... 5 more

  6. #6
    Join Date
    Feb 2014
    Posts
    9

    Default

    Any thoughts?

  7. #7
    Join Date
    Sep 2012
    Posts
    71

    Default

    Is there a folder at that location called cdh42? Also, did you remove all the JARs in the Big Data plugin and your shim from your classpath? The HadoopConfigurationLocator wires up all the JARs, classloaders, etc. itself; if those classes are already loaded or otherwise interfere, that can prevent the shim from loading correctly.

    One last suggestion: debug from Eclipse with breakpoints in HadoopConfigurationBootstrap and HadoopConfigurationLocator to see where the exception originates and why.
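
    For reference, the layout the locator should see under the working directory is roughly this (based on the path in the warning above and the folders mentioned earlier; exact contents vary by shim):

    data-integration/
      plugins/
        pentaho-big-data-plugin/
          plugin.properties
          hadoop-configurations/
            cdh42/    <- the folder the locator needs to find
              lib/    <- shim JARs (client, pmr, ...)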

  8. #8
    Join Date
    Feb 2014
    Posts
    9

    Default

    I have pentaho-hadoop-shims-api-TRUNK-SNAPSHOT.jar, pentaho-hadoop-shims-cdh42-hbase-comparators-5.0.1-stable.jar and pentaho-hadoop-shims-cdh42-mapred-5.0.1-stable.jar

    If I remove them, this line won't resolve:
    HadoopConfiguration config = new HadoopConfiguration(pluginProperties, "cdh42", "cloudera", new CommonHadoopShim());

  9. #9
    Join Date
    Sep 2012
    Posts
    71

    Default

    You shouldn't have to create your own HadoopConfiguration; the Locator and Bootstrap code should load the "active" shim, and you can get to it via the following static method:

    HadoopConfigurationBootstrap.getHadoopConfigurationProvider().getActiveConfiguration()
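
    Roughly, as a sketch (KettleEnvironment.init() has to have run first so the bootstrap can load the shim):

    KettleEnvironment.init();
    KettleClientEnvironment.init();

    HadoopConfiguration activeConfig = HadoopConfigurationBootstrap.getHadoopConfigurationProvider().getActiveConfiguration();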

  10. #10
    Join Date
    Feb 2014
    Posts
    9

    Default

    The code is pretty much like this, and it is still throwing the same error: Unable to load Hadoop Configuration from "file:///home/cloudera/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh42".

    The folders (plugins/pentaho-big-data-plugin, etc.) are there, and this same location works fine for jobs submitted via Spoon.

    // Initialize the Kettle environment before using any PDI APIs.
    KettleEnvironment.init();
    KettleClientEnvironment.init();

    // Parent job for the Pentaho MapReduce job entry.
    Job job = new Job();
    JobEntryHadoopTransJobExecutor executor = new JobEntryHadoopTransJobExecutor();

    executor.setParentJob(job);

    // HDFS and JobTracker endpoints.
    executor.setHdfsHostname("localhost");
    executor.setHdfsPort("8020");
    executor.setJobTrackerHostname("localhost");
    executor.setJobTrackerPort("8021");

    executor.setHadoopJobName("Web Logs Transformation");

    // Transformation used as the mapper.
    executor.setMapTrans("./WebLogTransformer.ktr");

    // HDFS input/output paths, formats, and the MapReduce input/output step names inside the transformation.
    executor.setInputPath("/user/cloudera/weblogs/raw");
    executor.setOutputPath("/user/cloudera/weblogs/parse");
    executor.setInputFormatClass("org.apache.hadoop.mapred.TextInputFormat");
    executor.setOutputFormatClass("org.apache.hadoop.mapred.TextOutputFormat");
    executor.setMapInputStepName("MapReduce Input");
    executor.setMapOutputStepName("MapReduce Output");

    // Run the job entry directly and collect the result.
    Result result = new Result();
    executor.execute(result, 0);

  11. #11
    Join Date
    Feb 2014
    Posts
    9

    Default

    The Big Data plugin and its libraries were already in the distributed cache (uploaded via Spoon jobs). Could this be a cause?

  12. #12
    Join Date
    Sep 2012
    Posts
    71

    Default

    Only if you've added JARs, plugins, etc. It couldn't hurt (the Pentaho artifacts, at least) to remove /opt/pentaho/mapreduce/* from HDFS and try again, as long as that won't disrupt anything else on your system (concurrent users, for example).

  13. #13
    Join Date
    Feb 2014
    Posts
    9

    Default

    I have no /opt/pentaho/mapreduce/ on my HDFS.

    I removed the folders and their files from lib and plugins on HDFS (at the location based on pmr.kettle.installation.id) and tried again:
    WARN 17-02 20:27:27,311 - Unable to load Hadoop Configuration from "file:///home/cloudera/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh42".

  14. #14
    Join Date
    Sep 2012
    Posts
    71

    Default

    Sorry, /opt/pentaho/mapreduce is the default installation location; if you have set the pmr.kettle.installation.id property, then that is the location I'm referring to. At this point I think you'd need to debug through HadoopConfigurationBootstrap to see why it can't locate/load the cdh42 shim.
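
    For completeness, the relevant entries live in plugins/pentaho-big-data-plugin/plugin.properties and look something like this (the installation id value here is just an example):

    active.hadoop.configuration=cdh42
    # optional; if set, PMR installs and looks for its Kettle runtime under this id on HDFS
    pmr.kettle.installation.id=5.0.1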

  15. #15
    Join Date
    Feb 2014
    Posts
    9

    Default

    Thanks. It looks like the configuration was not loaded or the location could not be found. I will debug.

    Exception in thread "main" org.pentaho.hadoop.shim.ConfigurationException: Not initialized. Please make sure the KettleEnvironment has been initialized.
    at org.pentaho.di.core.hadoop.HadoopConfigurationBootstrap.getHadoopConfigurationProvider(HadoopConfigurationBootstrap.java:80)
    at com.impetus.di.HadoopTransformation.main(HadoopTransformation.java:23)
