
Hadoop job executor cannot take custom output format



Jasper
02-17-2011, 12:56 PM
Hi,

I have a complete jar that runs perfectly from the command line. It creates multiple text output files, one for every distinct key value.

However, I run into trouble when I try to run it from PDI. The problem is the entry in the "Output Format" box: it doesn't accept a class that is inside the jar.

When I enter the standard org.apache.hadoop.mapred.TextOutputFormat, the job runs but replaces my MultipleTextOutputFormat with the standard output files part-0000, part-0001, etc.
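For anyone following along, a subclass like the one described above typically overrides generateFileNameForKeyValue from the old org.apache.hadoop.mapred API. A rough sketch (class name and key/value types are hypothetical, and this depends on hadoop-core being on the classpath, so it is not compiled here):

```
// Sketch only: assumes the old mapred API and Text keys/values.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // One output file per distinct key instead of part-0000, part-0001, ...
        return key.toString();
    }
}
```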

Is there a way to overcome this? I really don't want to have to start my job from a shell script to accomplish this.

Thnx.

jganoff
02-17-2011, 03:37 PM
You have a few options if you cannot package the required classes in the main jar file.

1) You can package dependency jars in a /lib subdirectory inside your main jar.
2) You can set a User Defined property named "tmpjars" with a comma separated list of dependency jar paths in HDFS. Any user defined name/value pair is added to the JobConf exactly as provided, so the tmpjars job configuration property lets you declare dependency jars by their HDFS paths; the catch is that they must already exist in HDFS.

For example,

If you are running myjob.jar, which requires some third-party library "A" packaged in a.jar, you can declare this dependency by providing a comma separated list of HDFS jar locations:

name: tmpjars
value: /lib/a.jar

Let me know what works out for you. I would recommend going with option 1 if the jars are small.
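Option 1 above (bundling dependencies under a /lib subdirectory inside the main jar) can be sketched with the JDK's own jar APIs, since a jar is just a zip archive. The file names myjob.jar and a.jar are placeholders, and the stub dependency stands in for a real library:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

public class BundleLibJar {
    public static void main(String[] args) throws IOException {
        // Stand-in for the real dependency a.jar.
        Path dep = Paths.get("a.jar");
        Files.write(dep, "stub dependency".getBytes());

        // Write the dependency under lib/ inside the main job jar,
        // equivalent to: jar uf myjob.jar lib/a.jar
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream("myjob.jar"))) {
            out.putNextEntry(new JarEntry("lib/a.jar"));
            Files.copy(dep, out);
            out.closeEntry();
        }

        // List entries to confirm the lib/ layout.
        try (JarFile jar = new JarFile("myjob.jar")) {
            jar.stream().forEach(e -> System.out.println(e.getName()));
        }
    }
}
```

Running this prints the single entry lib/a.jar, the layout the job executor looks for when unpacking nested dependencies.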

jganoff
02-17-2011, 03:40 PM
By the way, from the screenshot it looks like you are already defining the output format class in the main class of your jar. Do you see any errors when running that might indicate the configuration isn't valid in some way?

Jasper
02-17-2011, 07:34 PM
Yes, that's the point. The class shown is directly available from the main class (same jar), but in the configuration shown it raises an error, which I will post later.

I have tried multiple entries in the "Output Format" box, like org.apache.hadoop.mapred.lib.MultipleOutputFormat or MultipleTextOutputFormat (which my custom class overrides), but they all raise the same error.

Only org.apache.hadoop.mapred.TextOutputFormat seems to work there.


Later..............
------------------------------------------------------------------------------------------------------
When I enter the custom class in the "Output Format" box, the error is:

17:44:04,761 WARN [JobClient] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: SanomaInbound$SanomaMultipleTextOutputFormat
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1005)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:829)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)


-------------------------------------------------------------------------------------------------------------------
When I enter org.apache.hadoop.mapred.lib.MultipleTextOutputFormat (also an existing class) in the "Output Format" box, the error is:

17:39:05,837 WARN [JobClient] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
java.lang.RuntimeException: java.lang.InstantiationException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:829)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:767)
at org.pentaho.di.job.entries.hadoopjobexecutor.JobEntryHadoopJobExecutor.execute(JobEntryHadoopJobExecutor.java:384)