Clustering/clouds made easy



MattCasters
03-18-2009, 06:40 PM
Dear Kettle fans,

The last main item on my agenda before we could publish a release candidate of version 3.2 was the inclusion of a number of features that make dynamic clustering easier.

Dynamic clustering was already possible, but thanks to Sven Boden’s parameters we can now take it up a level. Let me explain what we did with a small, simple example…

http://www.ibridge.be/images/clustering-transformation.png

The “Slaved” step takes the place of one or more steps that you would like to see run clustered (optionally partitioned) on a number of machines.

So let’s say this transformation is part of a job…

http://www.ibridge.be/images/clustering-job.png

We want to have this run on Amazon EC2. So I created an AMI just for you to test with:


IMAGE ami-f63ed99f kettle32/PDI32_CARTE_CLUSTER_V4.manifest.xml 948932434137 available public i386 machine

The AMI takes a piece of XML as input. This XML configures the Carte instance on the image, which starts automatically at boot:




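<slave_config>
  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>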



This file, let’s call it carte-master.xml, is passed when we run our instance:


ec2-run-instances -f carte-master.xml -k YOUR_KEYPAIR ami-f63ed99f

Once it has booted, we take the internal Amazon EC2 IP address of this server and put it into a second document, let’s call it carte-slave.xml:





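<slave_config>
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>Internal IP address</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>

  <report_to_masters>Y</report_to_masters>

  <slaveserver>
    <name>carte-slave</name>
    <hostname>localhost</hostname>
    <network_interface>eth0</network_interface>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>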
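
Filling in the master address by hand gets tedious if you restart the cluster often, so here is a minimal Python sketch that generates carte-slave.xml from the master's internal IP. The element names (slave_config, masters, slaveserver, report_to_masters and so on) are my understanding of the Carte configuration format, so check them against your Carte version. The function names and the 10.0.0.12 address are made up for illustration.

```python
import xml.etree.ElementTree as ET


def _server(parent, name, hostname, port, master, interface=None):
    """Append one <slaveserver> element with the fields used in the post."""
    node = ET.SubElement(parent, "slaveserver")
    ET.SubElement(node, "name").text = name
    ET.SubElement(node, "hostname").text = hostname
    if interface is not None:
        ET.SubElement(node, "network_interface").text = interface
    ET.SubElement(node, "port").text = str(port)
    # The post uses cluster/cluster as credentials on the test AMI.
    ET.SubElement(node, "username").text = "cluster"
    ET.SubElement(node, "password").text = "cluster"
    ET.SubElement(node, "master").text = "Y" if master else "N"
    return node


def build_slave_config(master_ip, port=8080):
    """Return carte-slave.xml content that reports to the given master."""
    root = ET.Element("slave_config")
    masters = ET.SubElement(root, "masters")
    _server(masters, "master1", master_ip, port, master=True)
    ET.SubElement(root, "report_to_masters").text = "Y"
    _server(root, "carte-slave", "localhost", port, master=False, interface="eth0")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 10.0.0.12 is a hypothetical internal EC2 address of the master.
    print(build_slave_config("10.0.0.12"))
```

Redirect the output to carte-slave.xml and pass that file when you launch the slaves.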



Then we fire up 5 slaves with that configuration…


ec2-run-instances -f carte-slave.xml -k YOUR_KEYPAIR ami-f63ed99f -n 5
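
If you prefer to script the two launches, the sketch below only assembles the command lines and prints them; it does not call EC2. It assumes the classic EC2 API tools, where -f names the user data file, -k the key pair, -n the instance count, and the AMI ID is a separate argument. The key pair name "my-keypair" and the function name are placeholders of mine.

```python
import shlex


def run_instances_command(ami_id, user_data_file, keypair, count=1):
    """Build an ec2-run-instances argument list; nothing is executed here."""
    args = ["ec2-run-instances", ami_id, "-f", user_data_file, "-k", keypair]
    if count > 1:
        # -n is only needed when more than one instance is requested.
        args += ["-n", str(count)]
    return args


if __name__ == "__main__":
    ami = "ami-f63ed99f"
    # "my-keypair" is a placeholder for the name of your own EC2 key pair.
    commands = (
        run_instances_command(ami, "carte-master.xml", "my-keypair"),
        run_instances_command(ami, "carte-slave.xml", "my-keypair", count=5),
    )
    for command in commands:
        print(shlex.join(command))
```

Remember that the slave launch can only happen after the master is up, since carte-slave.xml needs the master's internal IP address.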

These 5 slaves will report to the master and tell it where they can be reached. So all we need to do in our PDI job/transformation is create a slave server definition for the master:

http://www.ibridge.be/images/clustering-master-slave-server.png

To top it off, we define MASTER_HOST and MASTER_PORT as parameters in the job and transformation…

http://www.ibridge.be/images/clustering-job-parameters.png

So all that’s left to do is specify these parameters when you execute the job…

http://www.ibridge.be/images/clustering-job-execution-on-ec2.png
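
To make the role of the two parameters concrete, here is a toy Python model of what happens to ${MASTER_HOST} and ${MASTER_PORT} at execution time. It is not Kettle's own code: it only mimics the ${NAME} substitution, and it assumes the master's slave server definition uses these references in its hostname and port fields. The host name below is made up.

```python
import re

# Matches Kettle-style ${NAME} variable references.
_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def substitute(text, parameters):
    """Replace ${NAME} references; unknown names are left untouched."""
    return _VARIABLE.sub(lambda m: parameters.get(m.group(1), m.group(0)), text)


if __name__ == "__main__":
    # Assumed field values of the "Cluster Master" slave server definition.
    definition = {"hostname": "${MASTER_HOST}", "port": "${MASTER_PORT}"}
    # Hypothetical values, as you would enter them in the execution dialog.
    parameters = {"MASTER_HOST": "ec2-master.example.com", "MASTER_PORT": "8080"}
    resolved = {key: substitute(value, parameters) for key, value in definition.items()}
    print(resolved)
```

The point of the design is that the job and transformation never hard-code the master's address, so the same files work against whichever master instance you happened to start.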

As you can see from the dialog, we pass the complete job (including sub-jobs and sub-transformations) to the “Cluster Master” slave server before execution. Spoon neither can nor needs to contact the individual slave servers directly, because they report to the master with their internal EC2 IP addresses. We wouldn’t want it any other way: internal traffic offers the best performance (and costs less).

These goodies are soon to be had in a 3.2.0-RC1 near you…

Until next time,
Matt



More... (http://www.ibridge.be/?p=160)