
Hadoop Job Executor job step



Jasper
02-17-2011, 09:33 AM
Hi,

I have a few questions about the Hadoop Job Executor job step.

1. First of all, I wonder where to put the connection details (JobTracker address and port / HDFS address and port) when executing a self-made Hadoop jar in the "simple" version of the Job Executor. Or is this "simple" option only meant for executing the Hadoop jar in the local Pentaho JVM (so not on a cluster at all)? Does the "simple" option read these connection parameters from the "Cluster" tab of the advanced option?

2. Second, I have a working jar for a Hadoop job with everything it needs (mapper class, reducer class, custom partitioner class, multi-file text output class) wrapped in one main class. It works fine when I execute this jar from the command line from within the Hadoop cluster. So, if the approach under 1. isn't possible, I wonder what the use is of all the required inputs on the "advanced" tab of the Job Executor. Will these "interfere" with my jar? And where can I enter the main class to execute and the package it is in?

jganoff
02-17-2011, 11:56 AM
1. First of all, I wonder where to put the connection details (JobTracker address and port / HDFS address and port) when executing a self-made Hadoop jar in the "simple" version of the Job Executor. Or is this "simple" option only meant for executing the Hadoop jar in the local Pentaho JVM (so not on a cluster at all)? Does the "simple" option read these connection parameters from the "Cluster" tab of the advanced option?


The simple option is designed to run a jar that contains a class with a main method that submits the job to Hadoop using the JobClient (or any other method you choose). The simple executor finds all classes with main methods in the jar provided and executes them. A jar that runs on the Hadoop command line will work with the Hadoop Job Executor with minor modification. You must set the following JobConf properties so the JobClient knows where the Name Node and Job Tracker are, e.g.:



public static void main(String[] args) throws Exception {
    JobConf conf = ...; // build your JobConf here

    // Tell the JobClient where the Name Node and Job Tracker are:
    conf.set("fs.default.name", "hdfs://my-hadoop-server:9000");
    conf.set("mapred.job.tracker", "my-hadoop-server:9001");

    // More configuration goes here, e.g.:
    // conf.setMapperClass(org.pentaho.hadoop.mapreduce.sample.SimpleWordCount.Map.class);
    // conf.setCombinerClass(org.pentaho.hadoop.mapreduce.sample.SimpleWordCount.Reduce.class);
    // conf.setReducerClass(org.pentaho.hadoop.mapreduce.sample.SimpleWordCount.Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}


This is how the shipping sample.jar is configured.



2. Second, I have a working jar for a Hadoop job with everything it needs (mapper class, reducer class, custom partitioner class, multi-file text output class) wrapped in one main class. It works fine when I execute this jar from the command line from within the Hadoop cluster. So, if the approach under 1. isn't possible, I wonder what the use is of all the required inputs on the "advanced" tab of the Job Executor. Will these "interfere" with my jar? And where can I enter the main class to execute and the package it is in?

The advanced configuration allows you to configure the JobConf properties through the job entry. The Hadoop Job Executor job entry then builds up the JobConf based on that configuration and submits it with the JobClient. You're welcome to browse our source repository and see how this is done here: http://source.pentaho.org/svnkettleroot/kettle-hadoop/trunk/src/org/pentaho/di/job/entries/hadoopjobexecutor/JobEntryHadoopJobExecutor.java
You'll want to look in JobEntryHadoopJobExecutor's execute method:

public Result execute(Result result, int arg1) throws KettleException;
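For illustration only, here is a minimal sketch of what such an execute method could look like: it builds a JobConf from the values configured on the job entry and submits it with the JobClient. This is not the actual Pentaho source; the host names, paths, and the MyMapper/MyReducer classes below are placeholders for whatever is entered on the advanced tab, and the usual org.apache.hadoop.mapred imports are assumed.

public Result execute(Result result, int arg1) throws KettleException {
    try {
        // Values below stand in for what the user entered on the advanced tab.
        JobConf conf = new JobConf();
        conf.setJobName("my job");
        conf.set("fs.default.name", "hdfs://my-hadoop-server:9000");
        conf.set("mapred.job.tracker", "my-hadoop-server:9001");

        conf.setMapperClass(MyMapper.class);          // "Mapper Class" field (placeholder)
        conf.setReducerClass(MyReducer.class);        // "Reducer Class" field (placeholder)
        conf.setOutputKeyClass(Text.class);           // "Output Key Class" field
        conf.setOutputValueClass(IntWritable.class);  // "Output Value Class" field

        FileInputFormat.setInputPaths(conf, new Path("/user/pentaho/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/pentaho/output"));

        // Submit the job and block until it completes
        RunningJob runningJob = JobClient.runJob(conf);
        result.setResult(runningJob.isSuccessful());
    } catch (Exception e) {
        result.setNrErrors(1);
        throw new KettleException("Error executing Hadoop job", e);
    }
    return result;
}

Because JobClient.runJob blocks until the job finishes, the job entry can report success or failure back to the Kettle job through the Result it returns.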

Here's the more general link to our other repositories (Get the Code): http://community.pentaho.com/getthecode/

Hope this helps!

Jasper
03-10-2011, 05:29 PM
Jordan,

I have done some further testing on this but it is just not working.

Here is a snippet of the main class of my WordCountJas.jar.

--------------------------------------------------------------------
public class WordCountJas {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.out.println("Usage: WordCountJas <input dir> <output dir>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(WordCountJas.class);
        conf.setJobName("WordCountJas");

        // FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileInputFormat.setInputPaths(conf, new Path("/user/pentaho/weblogs/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/pentaho/weblogs/output"));

        conf.set("fs.default.name", "hdfs://rhhadoop1:8020");
        conf.set("mapred.job.tracker", "rhhadoop1:8021");

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.set("mapred.textoutputformat.separator", "@");

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
        System.exit(0);
    }
}
--------------------------------------------------------------------------------------

The connection details to my Hadoop cluster are now in there. This jar runs fine from the cmd prompt on the Hadoop server itself by issuing "hadoop jar /<location>/wordcount_simple.jar WordCountJas /user/pentaho/weblogs/input /user/pentaho/weblogs/output"

However, when I try to submit the same jar via the Hadoop Job Executor, I find the following:

-When you enter 1 or 3 arguments (instead of the 2 that the main method expects) in the cmd line arguments box of the Hadoop Job Executor, it is not the job that crashes but Spoon itself, which shuts down without a warning!! (Because of the System.exit(-1) in the jar? See the sketch after this list.)

-When you enter a valid input & output path in the cmd line arguments box, the job finishes successfully, but on the Hadoop side nothing seems to happen. There is NO MapReduce job, and nothing in the jobtracker/tasktracker logs.
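If the simple executor really does invoke the jar's main method inside Spoon's JVM, as described above, then the System.exit(-1) in the argument check would take Spoon down with it. A minimal sketch, assuming that is the cause, of how the check could fail without terminating the host JVM:

public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        // Throwing an exception instead of calling System.exit(-1) lets the
        // calling JVM (e.g. Spoon) report the failure instead of being shut down.
        throw new IllegalArgumentException("Usage: WordCountJas <input dir> <output dir>");
    }
    // ... JobConf setup and JobClient.runJob(conf) as in the snippet above ...
}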

The only thing I could trace around the job submission time is this from the namenode log:

-------------------------------------------------------------------------------------------------------
INFO 10-03 22:02:53,424 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=mkdirs src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042 dst=null perm=Jasper_VLC:hadoop:rwxr-xr-x
WARN 10-03 22:02:53,455 - got exception trying to get groups for user Jasper_VLC
org.apache.hadoop.util.Shell$ExitCodeException: id: Jasper_VLC: No such user

at org.apache.hadoop.util.Shell.runCommand(Shell.java:250)
at org.apache.hadoop.util.Shell.run(Shell.java:177)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:456)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:439)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:56)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:43)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:79)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:970)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:50)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4920)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOwner(FSNamesystem.java:4883)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setPermission(FSNamesystem.java:769)
at org.apache.hadoop.hdfs.server.namenode.NameNode.setPermission(NameNode.java:556)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
INFO 10-03 22:02:53,458 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=setPermission src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042 dst=null perm=Jasper_VLC:hadoop:rwx------
INFO 10-03 22:02:53,468 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=create src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar dst=null perm=Jasper_VLC:hadoop:rw-r--r--
INFO 10-03 22:02:53,470 - BLOCK* NameSystem.allocateBlock: /var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar. blk_8920746123232052559_2793
INFO 10-03 22:02:53,479 - BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.0.195:50010 is added to blk_8920746123232052559_2793 size 3391
INFO 10-03 22:02:53,482 - Removing lease on file /var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar from client DFSClient_-1685639398
INFO 10-03 22:02:53,482 - DIR* NameSystem.completeFile: file /var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar is closed by DFSClient_-1685639398
INFO 10-03 22:02:53,484 - Increasing replication for file /var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar. New replication is 10
INFO 10-03 22:02:53,485 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=setReplication src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar dst=null perm=null
INFO 10-03 22:02:53,488 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=setPermission src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042/job.jar dst=null perm=Jasper_VLC:hadoop:rw-r--r--
INFO 10-03 22:02:53,493 - BLOCK* NameSystem.addToInvalidates: blk_8920746123232052559 is added to invalidSet of 192.168.0.195:50010
INFO 10-03 22:02:53,495 - ugi=Jasper_VLC ip=/192.168.0.196 cmd=delete src=/var/lib/hadoop-0.20/cache/hadoop/mapred/staging/Jasper_VLC/.staging/job_201103072315_0042 dst=null perm=null
INFO 10-03 22:02:53,568 - BLOCK* ask 192.168.0.195:50010 to delete blk_8920746123232052559_2793
-----------------------------------------------------------------------------------------------------

It looks like the job.jar of the next sequential job 0042 cannot be put on HDFS because of permissions, and the job.jar block ends up in the "invalidSet" (i.e. it is deleted again).