PDA

View Full Version : Excel/CSV Input



ghost
12-13-2005, 01:54 AM
Attachment: TestCase.jpg (http://forums.pentaho.org/archived_att/kettle/TestCase.jpg) Hi,

i've got a problem in read/transform a XLS as well as an CSV file. I'll describe in short:

I want to migrate data from outlook to a database. Therefore i exported the data from MS Outlook to Excel format. The Excel file itself contains all Outlook related data and is used as input. But i only want to migrate some of its data fields to my database. So i specified my needed fields only in the "Excel Input". In preview the column headers of my selected fields are shown, but their data isn't right. It seems that only the defined starting position is responsible for choosing the data. So, if i define a position: "Start row = 0" and "Start column = 0" in the "Sheet" tab and have selected the columns 3 & 4 of the Excel file in the "Fields" tab, i would get the right column headers but the data of 0 & 1. When i switch the position to "Start row = 0" and "Start column = 3" is works right. But when i need the data of column 3 & 5, i reached a similiar problem: Headers are right too, as well as the data of column 3, but the column 5 takes the data of column 4.
I added a "Select values" transform item between in- and output to make the selection with it. But i reached the same behaviour. A similiary problem occurs, if i use a CSV file instead of the XLS file.

In my test case i use a CSV file as output instead of database. And also in execution the problem remains.

Maybe i'm the responsible for this problem ... ;) So i post here and hope to get some help. Many thanks.

Regards
Gunther

MattCasters
12-13-2005, 02:05 AM
Hi Gunther,

For me this works fine:
- I exported my Contacts from Outlook 2000 to Excel.
- Put a new ExcelInput step on the canvas
- Added the file (for me: contacts.xls) to the list of files.
- Selected the sheet (for me: contacts).
- Contents: leave default
- Fields: pressed "Get fields from header row..."
- At this stage the "preview rows" button already should work.

Like you said, use the SelectValues step to select fields and/or change the names. It's not a good idea to select the fields in the ExcelInput step as Excel doesn't know about fields, only about coordinates.

Good luck!

Matt

ghost
12-13-2005, 02:56 AM
Ok,

thanks for (very) fast reply. I found out that a column is completely ignored when the first row contains no value.

Assuming the Excel files sheet contains a table like:

<table>
<tr><td>Name<td> Phone<td>Birthday</tr>
<tr><td>Oliver<td>12345<td>18.08.1980</tr>
<tr><td>Gunther<td><td>19.02.1978</tr>
</table>

If you click on the "Get fields from header row ..." button in the "Excel Input" at "Fields tab", all the columns are inserted. But if the example table looks like:

<table>
<tr><td>Name<td> Phone<td>Birthday</tr>
<tr><td>Oliver<td><td>18.08.1980</tr>
<tr><td>Gunther<td>12345<td>19.02.1978</tr>
</table>

the column "Phone" isn&#39;t detected within the "Fields" tab. Only "Name" and "Birthday" are received. As i see, that may relate to my problem. Because like you said that it isn&#39;t a good idea to do column filtering within the "Excel input", but at the "Select values".

Gunther

ghost
12-13-2005, 02:58 AM
Sorry about the above, i forgot to choose the content format HTML ... ;) Here again:



Ok,

thanks for (very) fast reply. I found out that a column is completely ignored when the first row contains no value.



Assuming the Excel files sheet contains a table like:



<table>
<tr><td>Name<td> Phone<td>Birthday</tr>
<tr><td>Oliver<td>12345<td>18.08.1980</tr>
<tr><td>Gunther<td><td>19.02.1978</tr>
</table>



If you click on the "Get fields from header row ..." button in the "Excel Input" at "Fields tab", all the columns are inserted. But if the example table looks like:



<table>
<tr><td>Name<td> Phone<td>Birthday</tr>
<tr><td>Oliver<td><td>18.08.1980</tr>
<tr><td>Gunther<td>12345<td>19.02.1978</tr>
</table>



the column "Phone" isn't detected within the "Fields" tab. Only "Name" and "Birthday" are received. As i see, that may relate to my problem. Because like you said that it isn't a good idea to do column filtering within the "Excel input", but at the "Select values".



Gunther

MattCasters
12-13-2005, 03:58 AM
OK, I don&#39;t have the time to verify this now, but I opened a bug so it gets verified later.

Bug - [# 1348] ExcelInput: a column is completely ignored when the first row contains no value.

Thanks for the detailed report!
I hope you still found Kettle useful to work with.

All the best,

Matt

ghost
12-13-2005, 04:33 AM
No problem.

To get a workaround take notice about the below:
To get right functionallity here you only need to insert a dummy row as first row into the XLS which gives all columns initial data. But take care about a right datatype setting, e.g. it&#39;s a bad idea to give the "Birtday" column from the above case a value like "12345" or "dummy".

But its still a great tool!

Gunther

MattCasters
12-13-2005, 04:51 AM
Hi Gunther,

The comment about the data types is probably spot on.
Kettle detects the data type from Excel by reading the first row.
That way the construction of new rows is optimised (set metadata only once).
It also ensures that all rows have the same data type. (not a strict requirement of Kettle, BTW, but a good idea :-))
If there is no value you can&#39;t get the data type from Excel.

This should probably be fixed in a couple of days or so.
Watch the bug in the Trackers and then grab the daily built kettle.jar to get a fix.

Cheers,

Matt

Amit Gupta
02-01-2007, 09:45 AM
hi

when i used to Excelsheet as Input
in this case i m getting error as following
and what abt Regular Expression and Variable.How to define it.
error.....
Excel Input.0 - ERROR (version 2.3.1, build 63 from 2006/09/14 12:04:05 @ sam) : No input fields defined!

Excel Input - ERROR (version 2.3.1, build 63 from 2006/09/14 12:04:05 @ sam) : Error initializing step [Excel Input]

Plz telll me...

Regards
Amit Gupta
www.davlin.co.in (http://www.davlin.co.in)