Hi Community,

i have a question about building up a dimension. The basic problem is that i am building the dimension from csv files from an internet service that come in on a daily basis. So there is no real source system wich natural keys i could use for looking up.

I am using Kettle 5.2 Community Edition and the data looks like this:

date | domain | query | impression | klicks
20141216 | www.example.de | example | 12345 | 651

My plan was to build up a dim_date, dim_domain and dim_query and a fct_query. Before i started using kettle i've used a php skript that did the inserting into a mysql database. First i loaded the data into a staging table and used an sql query to load the domain and query tuples into the dimensions.

It was something like:
INSERT INTO dim_domain (dim_domain.domain)
SELECT DISTINCT stg_query.domain
FROM stg_query
WHERE NOT EXISTS (SELECT dim_domain.domain from dim_domain WHERE dim_domain.domain = stg_query.domain);

So i just used the string values for comparison because i don't have natural keys. If i am building up something similar with the dimension lookup step it is running very slow (3 r/s). I guess this is because i'm using the string value as lookup key. The query field will be something about 100.000 new distinct values every day, so 3 r/s is not an option.

So what could be a best practice for a scenario like this?

I hope you have some advice for me :-) thank u!