Hitachi Vantara Pentaho Community Forums

Thread: How to pull out two foreign keys (technical keys) from the same table

  1. #1
    Join Date
    Sep 2012
    Posts
    27

    Default How to pull out two foreign keys (technical keys) from the same table

    Say I have a Table input step with two columns, orig_country and dest_country. As part of the transformation, I need to populate a table called country and also populate a fact table with the technical keys orig_country_key and dest_country_key.
    How do I do this?

    Thanks

  2. #2

    Default

    Quote Originally Posted by lramamur View Post
    Say I have a Table input step with two columns, orig_country and dest_country. As part of the transformation, I need to populate a table called country and also populate a fact table with the technical keys orig_country_key and dest_country_key.
    How do I do this?

    Thanks

    Hi lramamur.

    Can you post a screenshot of your transformation?
    Ricardo Díaz
    Pentaho Consultant
    Quito - Ecuador

  3. #3
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by lramamur View Post
    Say I have a table input step which has two columns orig_country and dest_country. As part of the transformation, I need to populate the table called country and also a fact table step populating the technical keys orig_country_key and dest_country_key.
    How do I do this?
    What I would do in this scenario is use both source columns to update a single country dimension, then populate both TKs in the fact table by looking up into that dimension. Then create two views of that country table which rename all the fields (i.e., one view would have DestCountryName, DestCountryCode, etc., and the other view would have OrigCountryName, OrigCountryCode, etc.) so your queries can join on the individual views (which look like separate tables) and reference distinct column names.

    I have to do this with my date dimension fairly often when I have something like a StartDate and an EndDate or an OpenDate and a CloseDate - with one date dimension table and several views of that table.
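    The core idea (one shared dimension serving two role-playing keys) can be sketched outside PDI like this. This is only an illustrative in-memory sketch; the function name get_key and the sample data are made up, not part of any PDI step:

    ```python
    # Minimal sketch: one shared country dimension serving two role-playing keys.
    # All names and sample data here are illustrative only.
    country_dim = {}  # country name -> technical (surrogate) key

    def get_key(country):
        """Insert the country once if it is new, then return its technical key."""
        if country not in country_dim:
            country_dim[country] = len(country_dim) + 1
        return country_dim[country]

    fact_rows = []
    for orig, dest in [("US", "CA"), ("US", "MX"), ("CA", "US")]:
        fact_rows.append({
            "orig_country_key": get_key(orig),  # both keys are resolved against
            "dest_country_key": get_key(dest),  # the same dimension table
        })
    ```

    Each distinct country is stored once, yet every fact row carries two keys into the same dimension; the role-specific views only come into play later, at query time.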
    Last edited by darrell.nelson; 11-05-2013 at 04:43 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  4. #4
    Join Date
    Sep 2012
    Posts
    27

    Default

    Quote Originally Posted by darrell.nelson View Post
    What I would do with this scenario is use both source columns to update a single country dimension and then populate both TK's in the fact table by looking up into that table. Then have two views of that country table which rename all the fields (i.e., one view would have DestCountryName, DestCountryCode, etc., and the other view would have OrigCountryName, OrigCountryCode, etc.) so your queries can join on the individual views (which look like separate tables) and reference distinctive column names.

    I have to do this with my date dimension fairly often when I have something like a StartDate and an EndDate or an OpenDate and a CloseDate - with one date dimension table and several views of that table.

    I am not clear on what you are saying.
    I have attached my transformation:
    1. A CSV input step with source and destination fields.
    2. Populate the country table with source alone.
    3. Rename country_key to source_country_key.
    4. Populate the country table with destination.
    5. Populate the fact table's source_country_key and dest_country_key from country_key.
    In this case, I had to set the commit size to 1 to avoid duplicates in the country table. I would like to avoid that.

    [Attachment: onetable_twokeys.jpg]

  5. #5
    Join Date
    Nov 2008
    Posts
    777

    Default

    Here are my observations on your transformation diagram:
    1. I usually build my dimensions in one transformation and then my fact table in another one. This prevents the need to set the commit size to 1.
    2. In my first transformation, I would normalize the two country columns into one column before updating the dimension table. That way you need to update only one dimension table and caching/batching can be used.
    3. I don't think the Combination lookup/update step is the right choice. That is for "junk" dimensions. I would use the Dimension lookup/update step.


    Other than those items, your fact table appears to be built with the proper keys. So now, create those database views. In MySQL, it can be done like this:
    Code:
    CREATE VIEW orig_country AS SELECT country_key AS orig_country_key, country AS orig_country FROM country;
    CREATE VIEW dest_country AS SELECT country_key AS dest_country_key, country AS dest_country FROM country;
    Those views can then be used in your queries as if they were actual tables, even though they are just different views of the same table:
    Code:
    SELECT * FROM country_fact f
    LEFT JOIN orig_country o ON f.orig_country_key=o.orig_country_key
    LEFT JOIN dest_country d ON f.dest_country_key=d.dest_country_key
    Last edited by darrell.nelson; 11-07-2013 at 06:59 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  6. #6
    Join Date
    Sep 2012
    Posts
    27

    Default

    Thanks for your reply.
    My basic doubt is: when I have two columns, source_country and dest_country, in the same stream (in the same row), how can I populate both into the same country table?
    Should we use the same step (whatever it is, either Dimension lookup/update or some other one) twice?
    I didn't understand your second bullet. Can you please explain?

  7. #7
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by lramamur View Post
    My basic doubt is: when I have two columns, source_country and dest_country, in the same stream (in the same row), how can I populate both into the same country table?
    The answer to your question is, of course, understanding my second bullet. In your example, "normalizing" means converting a 2-column x 5-row data matrix into a 1-column x 10-row data matrix. You can use the Row Normalizer step to do this.

    With your 2-column input, set up the Row Normalizer like this:

    [Attachment: RowNormSetup.png]

    and the output of that step will be this:

    [Attachment: RowNormOutput.png]

    Now you can run the "country" column through a single Dimension lookup/update step (in update mode) to maintain a single country dimension. As you can see, there are more duplicate countries now but if you enable caching the performance should not suffer. In most of my applications, using the default cache size of 5000 (and commit size of 100) has been a very good place to start.
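    The effect of the Row Normalizer step described above can be sketched like this (the sample data is hypothetical, not from the original attachment):

    ```python
    # Two country columns per row in, one "country" column out with twice as
    # many rows, mirroring what the Row Normalizer step produces.
    rows = [
        ("US", "CA"),
        ("US", "MX"),
        ("FR", "DE"),
        ("FR", "CA"),
        ("DE", "US"),
    ]  # (orig_country, dest_country): 2 columns x 5 rows

    normalized = [country for pair in rows for country in pair]
    # 1 column x 10 rows; duplicates remain, and the Dimension lookup/update
    # step's caching absorbs them when maintaining the dimension table.
    ```

    The duplicates are harmless because the dimension step performs an insert-if-missing lookup rather than a blind insert.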
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  8. #8
    Join Date
    Sep 2012
    Posts
    27

    Default

    Thanks a lot for taking the time to reply. I understood, and it works.
    Now my next question is on your first bullet. Why do you want the fact table populated in a separate transformation? This again would require us to have a CSV input step, followed by a lookup on orig_country_view and a lookup on dest_country_view, to populate the fact table, wouldn't it?
    Are you proposing a separate transformation just because we are pulling two values from the same table?
    Thanks once again.

  9. #9
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by lramamur View Post
    Now my next question is on your first bullet. Why do you want to have fact table populated in a separate transformation?
    You are proposing separate transformation just because we are pulling two values from the same table?
    Your example is simple enough that it probably could be done in one transformation, either by reducing the commit size as you have done, or, I suppose, by adding a blocking step in the middle to force the dimension table to update completely before building the fact table. However, I've never had much success with those throttling or blocking techniques. Remember that all the steps in a transformation initialize and start running at the same time, with each step then waiting for rows from a previous step or some external operation. Contrast this with a job where, by default, the entries run in series and don't even start until the previous entry finishes.

    With those concepts in mind, I recommended that you run your process in two passes based on my experience. Here are my reasons for that recommendation:
    1. Simplicity - What happens if your requirement grows a little and all of a sudden the entire screen is filled with steps and paths meandering back and forth from left-to-right and top-to-bottom? When that happens to me, sometimes I can't even understand what everything does and I'm the one that created it! Logically splitting it into separate parts helps keep each part simple.
    2. Maintainability - I assume that, at some point in time, someone else will need to take over my work and either fix it or enhance it. Having a very simple flow in two sequentially run transformations will be far easier for someone else to understand than a single flow that requires throttling or blocking to make it work. Those timing constraints are sometimes the hardest things to figure out if you weren't the one who instilled them.
    3. Modularity of Design - Some dimensions are by nature what I call independent, a date dimension for instance, in which there is very little source data. I find it best to build dimensions like that in a completely separate transformation so I just follow that lead and build (almost) all of my dimensions separately.
    4. Modularity of Execution - It's a really nice operational feature to be able to run only part of the whole process if an execution error is encountered. If your first transformation "coughs" on some of the input data for instance, you can debug it with the first transformation alone and then, when the issue is resolved, run the second transformation alone to finish off the process. This is especially beneficial with large data sets.
    5. Flexibility - You stated that splitting your example into two transformations would require you to have two CSV input steps. Yes, that is true. However, I rarely find that all the data for both the dimensions and the fact table is in one source. What's more common (for me anyway) is to have several sources of data, such as normalized tables in an OLTP system, where there is a different table to query for each dimension and some joined query for the fact table.
    6. Commonality - Again, your example is small, but what if you (or a colleague) have to create another fact table that also has a country dimension in it? Would you want to build it again in yet another all-encompassing transformation and then store it in another table? No, of course not. It should be built once and used twice. I like to think of my dimensions as global so there is one and only one version of the truth. Building the dimensions in transformations that are separate from the fact table transformations allows me to achieve that.

    In summary, and to answer your second question, I am recommending a design philosophy of using two passes so you build/update your dimension(s) in the first pass and reference/lookup your dimension(s) to build your fact table in the second pass. I think you will find it very rare to have all the source data for all your dimensions and your fact table in one single flat file.
    Last edited by darrell.nelson; 11-08-2013 at 12:47 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  10. #10
    Join Date
    Sep 2012
    Posts
    27

    Default

    Thanks a lot again for taking the time to explain the design philosophy. I really appreciate it.
    In my case, we have a single source (a database) with all the data for the dimensions and fact tables. The source has data for around 20 dimensions. The dimensions are similar to country, and my source table has orig/dest columns for each of the 20 dimensions. We have one transformation which populates the dimension and fact tables, similar to the one I attached: populating the source columns into the dimensions (with commit size 1), renaming the keys, populating the dest columns into the dimensions, followed by fact table population. I have been trying to improve the performance, since the commit size is 1. Do you think the approach you suggested would improve performance? I tried the simple transformation (country) with both approaches using 20,000 rows. The one I had earlier was faster, because the second approach requires two transformations.

  11. #11
    Join Date
    Nov 2008
    Posts
    777

    Default

    If you can build 20x2 dimensions and a fact table all in one transformation then I think you've got it. Can you post a screenshot?
    Last edited by darrell.nelson; 11-11-2013 at 10:06 AM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

