Hitachi Vantara Pentaho Community Forums

Thread: Populating a Fact Table

  1. #1
    Join Date
    Aug 2017
    Posts
    11

    Default Populating a Fact Table

    I'm very new to Pentaho and to ETL development in general, so please forgive me if I'm not clear about what I'm looking for.

    I've created a small data warehouse to house a set of surveys gathered annually through Qualtrics for several years.

    The data warehouse consists of 5 tables:
    dim_date
    year_id
    year

    dim_institution
    institution_id
    institution_name

    dim_Program
    program_id
    program_name

    dim_question
    question_id
    question_name

    fact_response
    year_id
    institution_id
    program_id
    question_id
    response

    I've populated each dimension table, but I don't know how to populate the fact table.
    I'm not sure how to transform the csv containing all of the responses into something usable for my fact table. Does this question make sense? Sorry if it's confusing or I'm asking it wrong. If there is anything you can point me to to read, I would much appreciate it.

  2. #2
    Join Date
    May 2016
    Posts
    277

    Default

    So you have a csv with these fields:
    year
    institution_name
    program_name
    question_name
    response

    My usual way to populate this would be:
    - A transformation that loads a table which is a copy of the csv file as-is, perhaps only cleaning some fields (trimming leading/trailing blanks, applying TO_UPPER or TO_LOWER). Each time I run the load, the table is reloaded with a TRUNCATE/INSERT. If you need to keep a history of what's in the csv, create another transformation that only INSERTs, so you keep track of what you load each time. Although the two transformations are practically identical, I usually keep them separate, so if something fails afterwards I at least have a starting point.
    - A transformation to load each of the dim tables with a SELECT DISTINCT field_name FROM tcopy_csv, INSERTing only new values and using a sequence to create their IDs. Depending on your database you can use a SEQUENCE object, or a sequence step in the transformation (get the MAX id in your dim and add to that value for each new ID you insert). By the way, I wouldn't bother creating an ID for the year dimension; I'd use the year itself as the ID. The dimension tables should also have a row for UNKNOWN values. I usually reserve ID 0 for UNKNOWN and start "normal" IDs at 1000, so I can add other reserved IDs (N/A, Not applicable, etc.) if I need them; in the year dimension I use 1900 for unknown values, and 3000 for open END_DATE values in the fact table so I don't have nulls.
    - A transformation to load the FACT_RESPONSE table, with TRUNCATE/INSERT or INSERT only (depending on your need for historical values), where you look up the dimension tables to get the IDs, defaulting to the UNKNOWN ID whenever the lookup finds no match. I never leave null IDs or values in a fact table, unless I need to perform AVG on a metric and want nulls excluded from the average.

    Depending on how complicated your dimensions and facts are, you add transformations and intermediate tables to perform the Transform part of ETL, but those are the basics.
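    In pandas terms, the staging -> dims -> fact flow above could be sketched like this. This is purely illustrative: the sample rows, column names, and the 0/1000 ID scheme are invented to match the description, not anyone's actual data.

```python
import pandas as pd

# Staging table: stand-in for the csv loaded as-is (sample rows are made up).
staging = pd.DataFrame({
    "year": [2017, 2017, 2016],
    "institution_name": [" Acme U ", "acme u", "State College"],
    "program_name": ["Nursing", "Biology", "Nursing"],
    "question_name": ["Q1", "Q1", "Q2"],
    "response": [5, 9, 7],
})
# Light cleaning: trim whitespace, fold case.
for col in ["institution_name", "program_name", "question_name"]:
    staging[col] = staging[col].str.strip().str.upper()

def build_dim(values, name_col, id_col, start_id=1000):
    """SELECT DISTINCT + sequence: new values get IDs from start_id upward;
    ID 0 is reserved for UNKNOWN."""
    dim = pd.DataFrame({name_col: sorted(values.dropna().unique())})
    dim[id_col] = range(start_id, start_id + len(dim))
    unknown = pd.DataFrame({name_col: ["UNKNOWN"], id_col: [0]})
    return pd.concat([unknown, dim], ignore_index=True)

dim_institution = build_dim(staging["institution_name"], "institution_name", "institution_id")
dim_program = build_dim(staging["program_name"], "program_name", "program_id")
dim_question = build_dim(staging["question_name"], "question_name", "question_id")

# Fact load: look up each dimension ID, defaulting to 0 (UNKNOWN) on a miss.
fact = staging
for dim, name_col, id_col in [
    (dim_institution, "institution_name", "institution_id"),
    (dim_program, "program_name", "program_id"),
    (dim_question, "question_name", "question_id"),
]:
    fact = fact.merge(dim, on=name_col, how="left")
    fact[id_col] = fact[id_col].fillna(0).astype(int)

fact_response = fact[["year", "institution_id", "program_id", "question_id", "response"]]
```

    In PDI you would do the same thing with Table input/output, Sort rows + Unique rows (or a Combination lookup), and Database lookup steps; the point is only the shape of the flow.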
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  3. #3
    Join Date
    Aug 2017
    Posts
    11

    Default

    Thank you for your response. I think that mostly makes sense, or at least I can look up how to do it.

    So you have a csv with this fields:
    year
    institution_name
    program_name
    question_name
    response
    One question about this: are you saying that I need to create a csv with these column names to populate my fact table? My problem is that the raw data from Qualtrics doesn't look like that. The Qualtrics output has a column for institution_name and then a column for each program/question combination, with responses by institution, like this (p stands for program and q for question):
    [Attachment: Capture1.PNG]

    You are saying that I need to make it look like this:
    [Attachment: Capture2.jpg]

    Am I understanding this correctly?

    I can do this manually, but it gets a little time-intensive. Is this just something I have to do by hand, or perhaps with a VBA script? Is there some sort of converter out there that will reshape my data the way the fact table needs it? Is there a term for the way the data needs to be structured?

  4. #4
    Join Date
    May 2016
    Posts
    277

    Default

    Ok, I didn't know the Qualtrics data format, so I assumed the layout of the csv from your example.
    So yes, you'll have to add steps and intermediate tables to do the Transform part of ETL. Before the transformations that build the dimensions, unpivot your data with the Row Normaliser step, and then use string steps (there are several that can do it) to derive the program and question columns.
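    For reference, the unpivot half of that is what pandas calls a melt. A rough sketch (the column names and values here are invented, assuming one column per program/question pair in the export):

```python
import pandas as pd

# Wide Qualtrics-style export: one column per program/question combination.
wide = pd.DataFrame({
    "year": [2017, 2017],
    "institution_name": ["Acme U", "State College"],
    "P1Q1": [5, 20],
    "P1Q2": [12.5, None],
    "P2Q1": [9, None],
})

# Unpivot: one row per (year, institution, program/question label).
long = wide.melt(id_vars=["year", "institution_name"],
                 var_name="prog_q", value_name="response")
```

    The string steps then split the prog_q label into separate program and question fields.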
    Regards
    OS: Ubuntu 16.04 64 bits
    Java: Openjdk 1.8.0_131
    Pentaho 6.1 CE

  5. #5
    Join Date
    Aug 2017
    Posts
    11

    Default

    Quote Originally Posted by Ana GH View Post
    Ok, I didn't know the Qualtrics data format, so I assumed the layout of the csv from your example.
    So yes, you'll have to add steps and intermediate tables to do the Transform part of ETL. Before the transformations that build the dimensions, unpivot your data with the Row Normaliser step, and then use string steps (there are several that can do it) to derive the program and question columns.
    Regards

    Thank you so much! This looks like exactly what I need.

  6. #6
    Join Date
    Aug 2017
    Posts
    11

    Default

    Quote Originally Posted by Ana GH View Post
    Ok, I didn't know the Qualtrics data format, so I assumed the layout of the csv from your example.
    So yes, you'll have to add steps and intermediate tables to do the Transform part of ETL. Before the transformations that build the dimensions, unpivot your data with the Row Normaliser step, and then use string steps (there are several that can do it) to derive the program and question columns.
    Regards

    I've been able to figure out the Row Normaliser, but I still can't figure out how to separate the program and question columns. Any more suggestions for what I should look at?

  7. #7
    Join Date
    Apr 2008
    Posts
    4,686

    Default

    You'll likely need to run two Normalizers back-to-back

    Input -> Row Normalizer (Programs) -> Row Normalizer (Questions) -> Output

    From your Input, after the first Normalizer, you'll have data like:
    Code:
    year	INST	Prog	Q1	Q2
    2017	1	1	5	12.5
    2017	1	2	9	22.5
    2017	1	3	6	15
    2017	2	1	20	<null>
    2017	2	2	<null>	<null>
    2017	2	3	50	<null>
    2017	3	1	8	20
    2017	3	2	10	25
    2017	3	3	<null>	<null>
    2016	1	1	4	10
    2016	1	2	8	20
    2016	1	3	7	17.5
    2016	2	1	25	62.5
    2016	2	2	<null>	<null>
    2016	2	3	<null>	<null>
    2016	3	1	7	17.5
    2016	3	2	9	25.5
    2016	3	3	<null>	<null>
    2015	1	1	3	7.5
    2015	1	2	7	17.5
    2015	1	3	6	15
    2015	2	1	28	70
    2015	2	2	<null>	<null>
    2015	2	3	<null>	<null>
    2015	3	1	6	15
    2015	3	2	8	20
    2015	3	3	<null>	<null>
    Now you can feed into the other normalizer to get data like:
    Code:
    year	INST	Prog	Q	Data
    2017	1	1	1	5
    2017	1	1	2	12.5
    2017	1	2	1	9
    2017	1	2	2	22.5
    2017	1	3	1	6
    2017	1	3	2	15
    2017	2	1	1	20
    2017	2	1	2	<null>
    2017	2	2	1	<null>
    2017	2	2	2	<null>
    2017	2	3	1	50
    2017	2	3	2	<null>
    2017	3	1	1	8
    2017	3	1	2	20
    2017	3	2	1	10
    2017	3	2	2	25
    2017	3	3	1	<null>
    2017	3	3	2	<null>
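    The same two-pass idea can be sketched in pandas (purely illustrative; the loop plays the role of the first Normalizer, the melt plays the second, and the data values are made up):

```python
import pandas as pd

# Wide input: one column per program/question pair (sample values invented).
wide = pd.DataFrame({
    "year": [2017], "INST": [1],
    "P1Q1": [5], "P1Q2": [12.5],
    "P2Q1": [9], "P2Q2": [22.5],
})

# Pass 1 (programs): one row per program, with Q1/Q2 still as columns.
frames = []
for p in (1, 2):
    part = wide[["year", "INST", f"P{p}Q1", f"P{p}Q2"]].copy()
    part.columns = ["year", "INST", "Q1", "Q2"]
    part["Prog"] = p
    frames.append(part)
stage1 = pd.concat(frames, ignore_index=True)

# Pass 2 (questions): melt Q1/Q2 into a question number plus a value.
long = stage1.melt(id_vars=["year", "INST", "Prog"],
                   var_name="Q", value_name="Data")
long["Q"] = long["Q"].str.lstrip("Q").astype(int)
```

    The result has one row per (year, INST, Prog, Q), which is exactly the shape the fact table lookup needs.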

  8. #8
    Join Date
    Aug 2017
    Posts
    11

    Default

    Thank you! I'll have to play with that some more this week. I guess I didn't think about sending it through a second normaliser.

    Quote Originally Posted by gutlez View Post
    You'll likely need to run two Normalizers back-to-back

    Input -> Row Normalizer (Programs) -> Row Normalizer (Questions) -> Output

  9. #9
    Join Date
    Apr 2008
    Posts
    4,686

    Default

    Another option would be the "Split Fields" step with a delimiter of Q (P1Q1 -> P1 in the first column and 1 in the second).
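    In plain Python terms, the split works like this (assuming the labels always follow the PnQm pattern; the sample labels are invented):

```python
# Splitting "P1Q1" on the delimiter "Q" yields the program part and the
# question number, mirroring the "Split Fields" step.
labels = ["P1Q1", "P1Q2", "P2Q1"]
pairs = [label.split("Q", 1) for label in labels]
programs = [p.lstrip("P") for p, _ in pairs]   # "P1" -> "1"
questions = [q for _, q in pairs]
```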
