Hitachi Vantara Pentaho Community Forums
Results 1 to 20 of 20

Thread: Different comparison rules between "Unique rows" and "Stream lookup"

  1. #1
    Join Date
    Dec 2005
    Posts
    20

    Default Different comparison rules between "Unique rows" and "Stream lookup"

    Attachment: js.jpg Hi,


    here again. I cause an behavioural inconsistency by doing the following steps:



    Assuming there are many contact data from outlook, imported via "Excel input". These contacts joins data of an individual person together with its company data. In the whole imported data there may several persons from the same company. So the company data includes some redundancy. I intend to build a db structure, that defines companies as discrete records. The person's data record then references its company data only, using a unique record id.



    The way to reach these structure is to split (copy) the main steps line into two single lines. These lines takes exactly the same source data.
    The first one continues with a "Sort rows" followed by a "Unique rows" on company's name. This is necessary to filter each company exactly one time. As finish step a unique record id is generated.
    The second line runs ahead with a "Stream lookup". This lookup is coupled with the first line's record id generation as "Source step". The comparison field is the company's name again. The provided new data field is the generated record id. So you make sure that every company (by name) takes exactly the same record id.



    But here i received an inconsist behaviour. Responsible for that seems to be a different comparison rules of the "Unique rows" and the "Stream lookup". It seems that the "Unique rows" compare with ignore case, whereas the "Stream lookup" compare case-sensitive. So the id can't be applied to all that syntax of company's name, that isn't provided by the "Unique rows".



    A workaround to avoid this can be done by normalizing the company's name to a defined syntax, using e.g. "Java Script value".



    Any other suggestions are strongly welcome.



    Gunther.

  2. #2
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Hi Gunther,

    The problem is indeed that the following method in the Value class compares Strings ignoring case.

    public int compare(Value v)

    In most cases, like sorting etc, it is probably the desired behaviour, in your case it's not. We have to figure out a way of making this configurable.
    The quick fix ofcourse would be to make the strings do compareTo in stead of compareToIgnoreCase. However, like I said someone else will be unhappy. :-(

    For that reason I'll add another method in Value: compareIgnoreCase(Value v) and in Row: compareIgnoreCase(...)

    This more work then you might think as I would need to add flags to every field parameter in the UniqueRows step.

    As far as the StreamLookup is concerned, StreamLookup is doing a lookup in a Hashtable and of-course, the Hascode of a string is always case sensitive. There too, it would be a good thing to have a case insensitive option for every matching key field.

    I'll add a Tracker later today so that you can see how far the work has progressed.

    In the mean time, for your problem, try the following approach: Add a field in both streams with a little piece of javascript:

    var companyLower = company.Clone().lower().getString();

    Do the lookup on this field. That is, if you want UniqueRows to ignore the case, otherwise you'dd have to write your own UniqueRows function in javascript: (something like this)

    ----------------------------------
    var keep="Yes";

    var previousCompany;

    if ( previousCompany != null )
    {
    if ( company.getString().equals( previousCompany.getString() ) ) keep="No";
    }

    previousCompany = company.Clone();
    ------------------------------------

    Grab the "keep" variable as a String and then filter on Keep="Yes".


    Hope this helps,

    All the best,

    Matt

    P.S. Splitting a stream and re-joining them with a lookup ONLY works when the number of records read is SMALLER then the rows buffered in between the source and the stream lookup step. That is because streamlookup will only start processing when it has all records in memory. (you can change the number of rows buffered in the Transformation window CTRL-T)
    So the cautious thing to do would be to split this up in 2 transformations.

  3. #3
    Join Date
    Sep 2005
    Posts
    1,403

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Thanks a lot - that helps me.

    An explicit choice of case-sensitive or -insensitive comparison - i think - is a good solution for this issue.

    Salute
    Gunther.

  4. #4
    Join Date
    Sep 2005
    Posts
    1,403

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Ok - i tried your approach, using the Javascript expression to change the company's name to lower case. It won't work. The test of the "Java Script value" (Company Name) says OK , but in execution i reached the error:

    "Company Name.0 - ERROR: Javascript error: TypeError: undefined is not a function. (script; line 23)"

    ( [Line 23]: "var lookupCompanyName = Company.Clone().lower().getString();" )

    I'm searched in the Spoon documentation and tried to set "var lookupCompanyName = Company; " and than apply "lookupCompanyName.lower();". Similiar thing - The Javascript check says OK, but transformation failed in execution at "lookupCompanyName.lower();". Also the change to "upper()" brings no luck.

    What goes wrong here - i can't explain myself?

    Gunther.

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    OK, the image attachement is somehow not working.
    To compensate this, especially for *you* I implemented the case sensitivity on a field level in UniqueRows.

    Grab a new kettle.jar and start cooking!

    Matt

  6. #6
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Hey - great!

    Especially for me? - I'm very impressed! Nobody has done such a thing for me only before.

    You work really hard, i see - 21:52. Do you needn't to relax any time?

    But let me say it again:
    You offers a really great support for your application! My deeply respects to your great work!

    Thanks a lot!

    Best wishes,
    Gunther

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    >Especially for me?

    I just want to see Kettle get off to a good start. Kettle has been dowloaded more then 4350 times as we speak and it has gotten off very well I think. The number of bugs reported is really minimal.
    But I guess every improvement and bug fix is a step in the right direction. Over time it will make the difference with "other" ETL tools on the market. (even commercial ones)

    The downside is indeed that most of the work needs to get done after work when my son is in bed. (if he sees me working on my laptop he wants to "join in" ;-))
    That leaves me with a couple of hours, but Kettle is very well structured so most things can be done in that period of time.
    The work that requires more time just gets pushed back until I have a longer stretch of time to work with.
    For example, writing the XML plugin would probably require a day to write or a week of work every night. That's still not a lot of work though, there is a lot of code you can reuse when programming Kettle.

    Anyway, glad to be of assistance ;-)

    Matt

  8. #8
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Yes . i can see - The forum isn't used very often. Until 3 days I'm one of the most active users here But i agree with you. Feedback from the users is an important aspect to improve the application's functionallity.



    So i'll go on - perhaps more fast then you like Because i tried your change on "Unique rows" in a simple test case:



    The source XLS:


    <table>
    <tr><td>Name<td>Phone<td>Birthday</tr>
    <tr><td>Oliver<td>12345<td>18.08.1980</tr>
    <tr><td>Gunther<td>67891<td>19.02.1978</tr>
    <tr><td>GUNTHER<td>67891<td>19.02.1978</tr>
    </table>



    is read by an "Excel input". The following steps are a "Sort rows" on name (Ascending = Y) and a "Unique rows" on name again (Ignore case = N). The result is written back to the file system, using a "Text file output". With my setup of "Unique rows" I would expect a result that looks like the source, including 3 rows with data. But the following output is generated:



    <table>
    <tr><td>Name<td>Phone<td>Birthday</tr>
    <tr><td colspan="3" align="CENTER">[empty row]</tr>
    <tr><td>GUNTHER<td>67891<td>19.02.1978</tr>
    <tr><td>Oliver<td>12345<td>18.08.1980</tr>
    </table>



    When i change the "Unique rows" setup to "Ignore Case = Y", the same is given out again. If i change the "Sort rows" setup to "Ascending = N" the output looks like the previews one without the empty row. Changing the "Unique rows" setup has no effect to the output.



    Gunther

  9. #9
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Ok, thanks - i'll try.

    Beside i found the failure on my Javascript:




    var Company = "";



    if(Firma.getString() == null) {



    if(Vorname.getString() != null) {

    Company += Vorname.getString();

    }



    if(Nachname.getString() != null) {



    if(Vorname.getString() != null) {

    Company += " ";

    }




    Company += Nachname.getString();

    }

    }

    else {



    Company = Firma;

    }



    var lookupCompanyName = Company.Clone().lower().getString();




    The "Test script" runs without error. But when i cut off the conditions only for testing to this:




    var Company = "";


    var lookupCompanyName = Company.Clone().lower().getString();





    i reached an error in executing "Test script". I need to work with the field directly and not with its string representation ... my fault (And something goes wrong in "Test script" function)



    Gunther

    PS: Are you daily work within a "normal" employee contract and nightly you work on your project, or? Is the project your personal work or there are other active developers too?

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Basically, at the moment I&#39;m the only active Kettle developer although I really hope that that will change over time.
    Again, anyone with good java skilzz is welcome. ;-)

    JavaScript is really a programming language all on its own.
    The test button can only check very basic things like existence of variables.
    Also it&#39;s a static check, a single dummy value row is tested against your script.

    For that reason, I&#39;m building a Step that will have less flexibility but will be able to perform all possible calculations on field values.

    See also: Change Request - [# 1347] Expose Value API in a step
    The first code for that has been written.

    Cheers,

    Matt

    P.S. I&#39;m on my lunch break right now, I&#39;m working as a contractor on a data warehouse in Brussels.

  11. #11
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Attachment: js.jpg Seems to work here. See atachement.

  12. #12
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Attachment: name - sort - unique.ktr Well, it seems to work alright in my setup.
    Copy the XML at the bottom and paste it using Transformation / Paste transformation from clipboard.

    So I tried your example as well:

    TEST TEXT-FILE:
    ------------------------------------
    Name,Phone,Birthday
    Oliver,12345,18.08.1980
    Gunther,67891,19.02.1978
    GUNTHER,67891,19.02.1978

    After TextFile input:
    ------------------------------------
    Name Phone Birthday
    Oliver 12345 1980/08/18 00:00:00.000
    Gunther 67891 1978/02/19 00:00:00.000
    GUNTHER 67891 1978/02/19 00:00:00.000

    After Sort: (Name: ascending)
    ------------------------------------
    Name Phone Birthday
    Gunther 67891 1978/02/19 00:00:00.000
    GUNTHER 67891 1978/02/19 00:00:00.000
    Oliver 12345 1980/08/18 00:00:00.000

    After Unique (Name: Ignore case : N)
    ------------------------------------
    Name Phone Birthday
    Gunther 67891 1978/02/19 00:00:00.000
    GUNTHER 67891 1978/02/19 00:00:00.000
    Oliver 12345 1980/08/18 00:00:00.000


    TIP: Try the preview functionality to debug what&#39;s happening after each step

    Good luck,

    Matt



    <?xml version="1.0" encoding="UTF-8"?>
    <transformation>
    <info>
    <name></name>
    <directory>\</directory>
    <log>
    <read></read>
    <write></write>
    <input></input>
    <output></output>
    <update></update>
    <connection></connection>
    <table></table>
    <use_batchid>N</use_batchid>
    <use_logfield>N</use_logfield>
    </log>
    <maxdate>
    <connection></connection>
    <table></table>
    <field></field>
    <offset>0.0</offset>
    <maxdiff>0.0</maxdiff>
    </maxdate>
    <size_rowset>350</size_rowset>
    <dependencies>
    </dependencies>
    </info>
    <notepads>
    </notepads>
    <order>
    <hop> <from>Gunther (Lower)</from><to>Unique rows</to><enabled>Y</enabled> </hop> <hop> <from>GUNTHER (Upper)</from><to>Unique rows</to><enabled>Y</enabled> </hop> <hop> <from>Unique rows</from><to>Dummy (do nothing)</to><enabled>Y</enabled> </hop> </order>

    <step>
    <name>Gunther (Lower)</name>
    <type>RowGenerator</type>
    <description></description>
    <distribute>Y</distribute>
    <copies>1</copies>
    <fields>
    <field>
    <name>Name</name>
    <type>String</type>
    <format></format>
    <currency></currency>
    <decimal></decimal>
    <group></group>
    <nullif>Gunther</nullif>
    <length>-1</length>
    <precision>-1</precision>
    </field>
    </fields>
    <limit>1</limit>
    <GUI>
    <xloc>249</xloc>
    <yloc>157</yloc>
    <draw>Y</draw>
    </GUI>
    </step>

    <step>
    <name>GUNTHER (Upper)</name>
    <type>RowGenerator</type>
    <description></description>
    <distribute>Y</distribute>
    <copies>1</copies>
    <fields>
    <field>
    <name>Name</name>
    <type>String</type>
    <format></format>
    <currency></currency>
    <decimal></decimal>
    <group></group>
    <nullif>GUNTHER</nullif>
    <length>-1</length>
    <precision>-1</precision>
    </field>
    </fields>
    <limit>1</limit>
    <GUI>
    <xloc>250</xloc>
    <yloc>226</yloc>
    <draw>Y</draw>
    </GUI>
    </step>

    <step>
    <name>Unique rows</name>
    <type>Unique</type>
    <description></description>
    <distribute>Y</distribute>
    <copies>1</copies>
    <count_rows>N</count_rows>
    <count_field></count_field>
    <fields> <field> <name>Name</name>
    <case_insensitive>N</case_insensitive>
    </field> </fields> <GUI>
    <xloc>425</xloc>
    <yloc>199</yloc>
    <draw>Y</draw>
    </GUI>
    </step>

    <step>
    <name>Dummy (do nothing)</name>
    <type>Dummy</type>
    <description></description>
    <distribute>Y</distribute>
    <copies>1</copies>
    <GUI>
    <xloc>559</xloc>
    <yloc>198</yloc>
    <draw>Y</draw>
    </GUI>
    </step>

    </transformation>

  13. #13
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Attachment: StringTable.java Hi,

    i tried your case and yes, in preview it is running like it should. But using a "Text file output" there&#39;s always generated only 1 row with "GUNTHER". Hmm ... i&#39;ll keep on testing


    "... There too, it would be a good thing to have a case insensitive option for every matching key field. "

    Referring to the above issue of case in/-sensitivity i attached a class (written in a minute - hope it works in any case ) that implements a configurable case state. So its offers a Hashtable functionallity, but with explicit settable ignoring case state or not. Perhaps it helps you.
    I tested it, using this:

    StringTable table1 = new StringTable(StringTable.CASE_SENSITIVE);
    StringTable table2 = new StringTable(StringTable.CASE_INSENSITIVE);

    table1.put("TEST", "dummy");
    table1.put("Test", "DUMMY");
    table2.put("TEST", "dummy");
    table2.put("Test", "DUMMY");

    System.out.println(table1.get("Test")+" -- "+table2.get("Test"));
    System.out.println(table1.get("TEST")+" -- "+table2.get("TEST"));

    Best wishes,
    Gunther.

  14. #14
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Hi,

    i hadn&#39;t know about the only basic Javascrip test - so i only was amazed about. But it is not dramatic - Because complex scripts i write in a adequate editor - so i only need to check, what&#39;s supported by your application and test scripts on the editor itself.

    I already posted here today some messages above (as answer to your message that contains your KTR file).

    Best wishes,
    Gunther

  15. #15
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"



    Thanks Gunther,



    It's a nice idea to subclass Hashtable, however, I'm not putting Strings into the Hashtable but Rows:


    StreamLookup.java



    But, it's a good idea: I will extend Hashtable into a RowHashtable.

    Furthermore, we want to be able to set case sensitivity on a field level.

    For that reason I built a new compare in Row.java:



    <code>
    /**

    489: * Compare 2 rows with each other using certain values in the rows and also considering an ascending clause.

    490: * @param r The row to compare with

    491: * @param fieldnrs The indexes of the values to compare

    492: * @param ascending an entry for each value to compare where true means and normal compare, false the reverse.

    493: * @return -1 if the row is smaller, 0 if they are equal and 1 if the row is larger.

    494: */

    495: public int compare(Row r, int fieldnrs[], boolean ascending[], boolean caseInsensitive[])



    </code>


    I just need to find a way of using this compare in Hashtable, in stead of the normal compare that it is using now. (case sensitive)



    All the best,



    Matt


  16. #16
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"



    Actually, it seems that the people from Eclipse WTP now have some sort of javascript editor inherited from the JSEditor.

    I'll have to look into this when I have some time available.



    javascript component

    ;-)



    Matt

  17. #17
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Ok - if you compare on Row&#39;s and the Row class itself should inherit the comparison rules, you may add two methods like "compareIgnoreCase" and "compare" to the Row class. Then you only need to make a descision in the way of the StringTable, using Row instead of String. CASE_INSENSITIVE setting of the RowHashtable then provides the comparison of two key rows on calling their "compareIgnoreCase" method - CASE_SENSITIVE analogue performs the call of the method "compare".

    But think about this:
    You could simply overwrite the toString-method of the Row class. The toString then returns all field values wrapped into a single String. Then you can use the StringTable because your comparison is done on this String representation of a row. The StringTable always use the toString-method to convert an Object to a String. I think that may the simplest way, because you needn&#39;t to take care about the different datatypes, like Integer, Boolean, etc. .

    The toString-method of the Row class may looks like:

    public String toString() {

    String str = "";
    for(Iterator i = list.iterator(); i.hasNext() {
    str += i.next().toString();
    }

    return str;
    }

    Gunther.

  18. #18
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Hi Gunther,

    You may be supprised by this, but (like I said in a previous message) the case sensitive compare AND toString() already exist in Row().
    However, I have a bad feeling about having to do a toString() just to compare rows. This can prove to be expensite in case of double precision variables, dates, etc.
    Even then, we compare the complete row case sensitive and I want to do it on field by field.
    We should always strive to do the best we can!

    But, like I said, subclassing Hashtable is a good idea and we&#39;ll implement it this week ;-)
    (just based on Row instead of String)

    Cheers,

    Matt

  19. #19
    Join Date
    Dec 2005
    Posts
    20

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Hmm ... ok - i hadn&#39;t seen your toString()-method ... my fault. Ok - second idea was more a simple way to reach case sensitivity, but maybe (or definitely! ) not the best one ( I justify myself ). Like you said: "We should always strive to do the best we can! " - I thought about it and have to agree with you!

    Sorry about my "not so good" (or better stupid ... ) suggestion!

    Salute
    Gunther

  20. #20
    Join Date
    Nov 1999
    Posts
    9,729

    Default RE: Different comparison rules between "Unique rows" and "Stream lookup"

    Please don't feel that way, it's great to get feedback!



    You know, earlier this year when I was down with the flu for a couple of days, I read this book Better, Faster, Lighter Java by Bruce Tate.



    In it, Tate says something like: often it's better to write something simple that works, even it's not the most optimal solution. If it didn't work, you didn't spend too much time on it. And if it works, all the better.

    So far, that's how Kettle has been constructed, piece by piece, slowly progressing over the last 5 years.

    However, to improve now, we have to really think about what we're doing.

    For example, we have to ensure backward compatibility at all cost.



    So, give suggestions, by all means, so far they have been good, and I'll try to find the optimal solution in the Kettle framework.



    Cheers,



    Matt

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.