PDA

View Full Version : What is your most wanted feature in Kettle?



MattCasters
12-07-2005, 03:43 AM
Of any ETL feature that is missing in Kettle, please tell me what you would like to see implemented first...
Feel free to ask!

Matt

kettle_anonymous
12-07-2005, 09:28 PM
I also would like to see SFTP/SCP support in Jobs, as well as POP3/IMAP support for grabbing file attachments.

MattCasters
12-07-2005, 11:35 PM
> What about XML support as input and output source ? And XSLT transformations?

I have been thinking about this one a long time.

In what form would you like to see the output?
XML is a hierarchical structure, the rows that a Kettle step emits are flat.

My idea was to do the following:

<Customer id="ID1"> <name>NAME1</name><address>ADDRESS1</address> </Customer>
<Customer id="ID2"> <name>NAME2</name><address>ADDRESS2</address> </Customer>
...

Turn this into:

CustomerID=1,CustomerName="NAME1", CustomerAddress="ADDRESS1"
CustomerID=2,CustomerName="NAME2", CustomerAddress="ADDRESS2"
...

So I would concatenate to build field names.

Whould that work for you?

The problem is that XML can have any form.
Perhaps it would be best to also have a selection possiblility of the starting point in the XML document.

I don&#39;t know if style sheets are really the answer here. In any case I would prefer a flexible solution to the requirement of the existence of an XSLT document.

Any further thoughts?

Thanks,

Matt

MattCasters
12-07-2005, 11:42 PM
This will probably take a bit more work.
FTP was easy to do as there are open source libraries available to make to job easy.

With SFTP or SCP some "piecing together" will be involved. :-(

Same with POP3 and/or IMAP support, I&#39;ll have to find a good open source library to avoid re-invention of the wheel(s).

I&#39;ll look into this though, it&#39;s a good suggestion.

Thanks,

Matt

MattCasters
12-08-2005, 03:10 AM
Over lunch I found this library: http://www.jcraft.com/
Seems to have a BSD style license.

For SFTP, I can probably re-use the GUI and most of the meta-data code.

kettle_anonymous
12-09-2005, 04:40 AM
There is an open source library for SFTP available I think. Look at the ANT documentation you can find the necessary jars there. ANT is using SCP and SSH too.

MattCasters
12-09-2005, 05:54 AM
Actually, from what I read last night it should be jsch by the people from jcraft.
I found some examples at the site and it should be rather straightforward to implement scp and sftp.

kettle_anonymous
12-09-2005, 12:41 PM
A job scheduler might be a nice addition, using something like Quartz (no need to reinvent the wheel).

Once that is up, adding a lightweight dashboard (maybe a JSP or two) and perhaps an Email notification system so people could subscribe to a feed and be notified of failures (or even just regular status), so you don&#39;t have to continually watch it yourself.

Just a thought.

kettle_anonymous
12-12-2005, 01:21 AM
For POP3 and IMAP I would simply use javamail API

MattCasters
12-12-2005, 04:16 AM
Ofcourse, it&#39;s been a while since I implemented the mail job entry, but the jar is already included in Kettle.
It looks like I can just get an InputStream and store the attachement(s) to disk.
We better think of supporting saving the mail message itself too.

ChrisPerrin
12-14-2005, 12:56 PM
XML is really freaking tricky because you have three ways to define XML. An instance XML doc, a Schema, and DTD. In most cases, I think we can ignore DTD as they&#39;re kind of going a way from the infinitely more verbose Schema.

So let&#39;s look at the problem of either of those two options.

First, like you said, you need to define what defines a row and then pull your field names from that. However, that can get ugly. Consider this case:
<Items>
<Item id="101001">
<Type value="type1"/>
<Attribute name="height">10 ft</Attribute>
<Attribute name="weight">100</Attribute>
<SubType>
<SubTypeDescription>This is a description</SubTypeDescription>
</SubType>
<SubType>
<SubTypeDescription>This is another description</SubTypeDescription>
</SubType>
</Item>
<Item id="101002">
<Type value="type2"/>
<Attribute name="height">20 ft</Attribute>
<Attribute name="weight">102</Attribute>
<SubType>
<SubTypeDescription>This is not a description</SubTypeDescription>
</SubType>
<SubType>
<SubTypeDescription>This is not another description</SubTypeDescription>
</SubType>
</Item>
</Items>

How do you pull field names out of that? That&#39;s the problem with XML. It&#39;s beautiful in it&#39;s complexity.

You&#39;ve got a number of issues with this:
1. What is my row? It&#39;s legit to say that Item, Attribute, or SubType could be rows.
2. Assuming Item is the row, what are field names for Attribute?

The Open Source engine BIE tried to tackle this mapping problem and their solution was horrible complex.

You could require a schema. Schemas are easy to generate off instance docs. You still have the same issues, but things are organized a little more neatly. Even if you didn&#39;t require a schema, I think it would be cool to have an input which is only the schema and allow Kettle to dynamically pull the XML at runtime from a file or a Web Service.

MattCasters
12-15-2005, 03:55 AM
Actually, getting the data out is rather easy.
The format in which it ends up in Kettle rows is a different thing.

My proposal is the following (based on your sample XML)

ROW
------------
Items_Item_id = "101001"
Items_Item_Type_Value = "type1"
Items_Item_Type_Attribute_name = "height"
Items_Item_Type_Attribute = "10 ft"

ROW
------------
Items_Item_id = "101001"
Items_Item_Type_Value = "type1"
Items_Item_Type_Attribute_name = "weight"
Items_Item_Type_Attribute = "100"

ROW
------------
Items_Item_id = "101001"
Items_Item_Type_Value = "type1"
Items_Item_Type_SubType_SubDescription = "This is a description"

ROW
------------
Items_Item_id = "101001"
Items_Item_Type_Value = "type1"
Items_Item_Type_SubType_SubDescription = "This is another description"

ROW
------------
Items_Item_id = "101002"
Items_Item_Type_Value = "type2"
Items_Item_Type_Attribute_name = "height"
Items_Item_Type_Attribute = "20 ft"

ROW
------------
Items_Item_id = "101002"
Items_Item_Type_Value = "type2"
Items_Item_Type_Attribute_name = "weight"
Items_Item_Type_Attribute = "102"

ROW
------------
Items_Item_id = "101002"
Items_Item_Type_Value = "type2"
Items_Item_Type_SubType_SubDescription = "This is not a description"

ROW
------------
Items_Item_id = "101002"
Items_Item_Type_Value = "type2"
Items_Item_Type_SubType_SubDescription = "This is not another description"


Processing the file with SAX would give us good performance and (optionally) allow us to check the document for validity.

Thoughts?

Matt

ChrisPerrin
12-15-2005, 06:58 AM
Matt, just to make sure we&#39;re on the same page.

So if I understand your correctly, there would be 7 rows created off of two Items? If I were loading that XML into a table, I&#39;d have 7 different rows?

Can Kettle support more heirarchical rows?

MattCasters
12-15-2005, 07:37 AM
Chris,

A row of fields is by definition not hierarchical.
The same goes for a relational database table.
There is just no way of mapping XML directly into these formats.

If we would have to keep it very simple, something like this is basically my proposal.
As "Extra" options you could envision ways of cleaning up the data.
For example, if you add a checkbox "repeat fields" you would get better results.

Perhaps you could envision a way of telling Kettle that it should keep all fields until </item> gets hit.
That way you&#39;dd get 2 rows. (Implementation details ...)

We&#39;re just brainstorming, nothing final, hoping to hear other proposals.

Cheers,

Matt

By the way, I know this is off topic, but this example just shows what a complete mistake attributes are in XML.
There is not a single attribute that can&#39;t be rewritten using just elements.

ChrisPerrin
12-16-2005, 10:33 AM
Matt,

I figured that rows couldn&#39;t be heirarchical. But it never hurts to ask.

I feel your pain on this one becuase it&#39;s going to be tough, but I think that XML support is going to be necessary since so many companies use it. I also am having to do something like this for a different project (parse XML and turn it into RDMS data without having any idea what the incoming XML looks like.) In fact, I am hoping to use Kettle in a number of applications, but we can talk more on that later...

Anyway, I understand your comments about attributes and know that the XML example I used is pretty much the worst case senario I could think of off the top of my head so it&#39;s nasty by nature.

I think you are right on with your ideas. Scan the XML to get a list of modes. Let the user select the repeating node they are concerned with. That takes us potentially from 7 rows to 2 (if they choose item.) But I would prompt them the repeating node first. Now, you&#39;re not going to like my next thought at all. But really, I think the fields for this document should be:

Item Id, Type, Height, Weight, SubTypeDescription1, SubTypeDescription2 but you&#39;re only going to know that if you scan the doc first and you have to ope there&#39;s only two SubTypeDescriptions. Perhaps you can designate a column a 1-to-many and support variable numbers of columns.

The other thing you can do is basically say that Kettle is not going to support relational data in an inherently non-relational structure (your row set) and have itemid, type, height, weight and a single subTypeDescription. Reading the schema might also be an option.

MattCasters
12-16-2005, 11:06 AM
Actually, I think the example is very nice. It allows us to think of the potential solutions.
I understand that you want [ Item Id, Type, Height, Weight, SubTypeDescription1, SubTypeDescription2 ], however...
I think the poblem is clearly situated around Height and Weigth:

What if you wrote: <Attribute name="height" info="in meters unless otherwise specified">10 ft</Attribute> ??
Some people would like to have that information, others just want height = "10 ft".

So your point about scanning the document twice is correct. However, we only need to do it once, specify what we want by means of a GUI and store this in Kettle meta-data. The parsing of the actual documents should be single pass!

Pffew, you could even think about pattern detectors to propose "sane" defaults: scan for repeating <item>...</item> like structures. However, we can do this later.

<OffTopic>My hands are itching to get started on this, but I kind of promised the Pentaho guys I would look at their BI suite (which is super-exellent de luxe by the way)</OffTopic>

OK, how about this as a starter kit: scan the XML document based on the repeating item:
- we specify <items><item>
- Kettle scans the complete file and notes all occurences in the file:
* id (101001)
* TypeValue (type1)
* Attribute1Name (Heigth)
* Attribute1Info (in meters unless otherwise specified)
* Attribute1 (10 ft)
* Attribute2Name (Weigth)
* Attribute1Info (in inches)
* Attribute2 (100)
* SubType1SubTypeDescription (This is a description)
* SUbType2SubtypeDescription (This is another description)

--> These fields are presented in the GUI. We store them as meta-data
--> If there are others fields that are not present in out "test-file" we can add them. (value would be null)
--> This means that the "keys" have to be saved: a key would be a sequence of elements
For example:
Field: Attribute1Info
Key: E=Attribute, A=Info
--> Additional advantage of specifying the fields is that we can do conversions to number, date, boolean, ... we can even give Kettle the conversion mask.

Please note that if required, you can normalize wide rows with the Kettle row normalizer ;-)

So, would do you think about this proposal? I&#39;m certainly seeing it as a solution that could work because you can generate the meta-data from XML file, even without requiring any "XML descriptive files" like DTD, Schema, ...

ChrisPerrin
12-16-2005, 01:56 PM
<ShamelessPlug>Yes, Pentaho looks VERY sweet</ShamlessPlug>

I think for a first pass what you are suggesting sounds fine with one question. You say we specify <item> </item>. I assume by that you mean that the user via the GUI specifies that. Is that accurate?

Other than that, I think you&#39;re the man. I can&#39;t wait for this!

MattCasters
12-16-2005, 02:10 PM
>You say we specify <item> </item>. I assume by that you mean that the user via the GUI specifies that. Is >that accurate?

Yeah, that&#39;s right.

OK I&#39;m losing some sleep here, but I couldn&#39;t help myself and hacked up a prototype.
Will try to finish it up over the weekend, but for now it&#39;s gone passed midnight in this part of the world ;-)

Grab source and/or kettle.jar to get a peek under Experimental.

I&#39;m not using SAX at the moment for simplicity. (SAX is event driven and I need to work out the conflict with Kettle treads, so for now I&#39;m using DOM: all in memory)

I guess the most important thing is to get functionality, metadata and gui first, then speed.

TTY Later!

Matt

MattCasters
12-19-2005, 12:19 AM
I anounced the XMLInput step in another discussion thread.


See: New XMLInput Step available (http://forums.pentaho.org/showthread.php?t=48243)



For fun, check out this XML:



The Flat-File Society Does XML (http://www.thedailywtf.com/forums/53627/ShowPost.aspx)






Matt

ChrisPerrin
12-22-2005, 01:35 PM
Matt,

Great job. Sorry I dropped off the face the planet. Life's been hectic. But this is awesome!

Chris

kettle_bijugv
02-17-2006, 04:13 AM
Kettle is already a fantastic tool. If you still ask me "MOST WANTED", I would think that it is a fantastic idea to have :

1. A GUI mapper that shows input and output fileds on enabled drag-and-drop mapping. It make life so easy for end-users. In its current form (version 2.2.2) I find it extremely time consuming to map input columns from a process flow to the columns of a database table. I guess this is the only functionality that kettle lacks compared to other ETL tools.

2. The tool does not have a list of available transformation functions (e.g concat, substring, lpad, rpad, replace.... ). I agree that these are supported within the Javascript Transformation step, however having a Toolbox window with all supported functions catagorized into groups like "string functions", numeric functions etc would make kettle usable to end-users not familier with programming.

3. Ability to save a step or part of a transformation as a "composite step" can help in making reusable flows.


Certinly looking forward to see these functionalities in the future releases of Kettle.

Regards
Biju.