US and Worldwide: +1 (866) 660-7555
Page 2 of 4 FirstFirst 1234 LastLast
Results 11 to 20 of 35

Thread: saiku + mondrian using star schema, boolean column showing all results, not unique

  1. #11
    Join Date
    Apr 2011
    Posts
    103

    Default

    Q: Cardinality estimate for the internal_number dimension?
    A: I would expect it to be smaller than the external one. May be more realistic to say 20-50k; though, for some, it could be as small as 5k (that is, if I rebuild this datamodel again for a separate single instance).

    Q: Same country, state, and city fields for the internals?
    A: not currently. the internal number is (in this context) irrelevant. That is, 'isValid' for an internal number doesn't matter and city/state data is already known.

    Q: Estimated number of customers?
    A: between 1-10. I have zero experience with scalability/etc, so really apprehensive about adding too many to any individual model.

    Q: Is external_num_raw a string and external_num an integer?
    A: They are both strings, though, raw could be an integer. That is, external_num would be something like [1](202)4561111 where _raw would be 12024561111. They are both currently strings though.

    Q: Do you need both in your cube?
    A: I think so. One object is top callers per day to particular numbers, so in that context, yes.

    Q: Are there really that many repeat calls per day?
    A: Depends on the vertical, but surprisingly, between 30-50%. Consider that the majority are business partners, affiliates, remote offices etc who call in part of regular business.

    Q: Would it matter that much if you put every call in the fact table?
    A: I don't think it would...probably worth restating I have zero scalability experience so it didn't even cross my mind the just use 'raw data' so to speak - I assumed daily aggregation was a 'smart' step;

    Q: Won't you have to partition it anyway at some point?
    A: I am actually reading that more like 'I will have to partition it anyway at some point' (which I appreciate, actually) - I am not familiar with partitioning. I have heard the term (as it pertains to DBs) but will have to R&D what I need to do.

    Cool!

  2. #12
    Join Date
    Nov 2008
    Posts
    738

    Default

    Good answers! Now back to your problem with the isValidNumber thing. My great idea is to create a junk dimension for it. You'll see what I mean below but the main reason for that recommendation, and my suggestions on the whole DB design, is this: cardinality. Your external number dimension will get kind of big. A junk dimension will hold various common flags and strings which will help trim it down. And your fact table will get really big. At 100K records per day that's 20M to 36.5M records per year. That's not unmanageable but it's up there. At that size, I think you will need to have partitions (totally handled by the database server) but that can wait for 6 months or more. The other thing you can do to help with the size factor is keep the fact table narrow. For that reason, among others, I completely withdraw my suggestion of a degenerate dimension. Knowing what I know now about your application you certainly should not do that.

    And here are my thoughts...in the attached PDF...because they are kind of long. Sorry about that.

    I've also included another version of my example transformation with expanded JavaScript code to return all the "junk".
    Attached Files Attached Files
    Last edited by darrell.nelson; 12-07-2013 at 02:54 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  3. #13
    Join Date
    Apr 2011
    Posts
    103

    Default

    Blowing my mind, Homie...but I am following you (mostly!)!

    I was reading about junk dimensions since you mentioned it and from Mondrian in Action (page 55):

    It should be noted that the only reason to do this is for performance; from Mondrian's perspective, assuming it's been configured properly, there's no logical difference between consolidating multiply degenerate dimensions into a junk dimension and leaving them as degenerate dimensions in the fact table


    I (think) what I'm pointing out is I'm confused with your use of 'junk dimension' and 'the books' use of it - because M.I.A. talks about it being a cartesian product of all possible combinations of attributes along with the surrogate key. The book reads like this:

    There's another technique for taking lots of single-attribute degenerate dimensions and putting them back into the standard dimension of unrelated attributes. This is referred to as a combination (or more commonly called a junk) dimension.

    It does sound, in some ways, like that term is what you are talking about and, again, learning a great deal here. It also sounds like the External Flags Dimension (in this context) is just another dimension table now (?) and not what the book calls junk? Idk, just trying to get my language right.

    I LOVE your re-iterating Kimball's points and putting those into practice in your last suggestion to even further simplify the dimensions.

    I also agree there is no need for customer_id in the customer_dim - as I should be able to (easily) identify that by way of the fact table (since customer_id and internal_num_id are both present in the fact).

    Ok, so where am I still confused.
    1. "Start with an un-aggregated fact table." Isn't that what is in the original OLTP db? what's the difference and why use PDI to build (another) un-aggregated table?

    2. is the first table you listed in the PDF (essentially) the aggregated fact table I've been talking about? That is, so your columns CallCount and Call Minutes (from my previous example) would be 15/30 (for a particular day)? If so, I am not following the CallCounter pk (col 4).

    3. the idea of combined natural key (nk) and/or creating an index on more than 1 column is awesome (new to me); but how is this used? That is, as a key? I can see how the pk/fk pair lines up on 'ExternalNumberKey' but perhaps i need to go read about natural keys....

    4. I cannot (in my feeble rookie mind) see how the external_flags_dim get 'tied' to the external_number...unless the ExternalFlagsKey is THE SAME as the ExternalNumberKey (?). Indeed, from reading this, it sounds like that may be what's going on? If that is what is going on, then what is the value of reducing the width of the dimension table?

    The laptop I am on today doesn't have PDI (so I haven't looked at the *.ktr file, yet) but DEFINITELY AM GOING TO!

    Dude...seriously, above and beyond, Amigo in this conversation!!! And now a brother has to go buy some Kimball books!

  4. #14
    Join Date
    Nov 2008
    Posts
    738

    Default

    One of my admitted weaknesses in all this is having the foresight in designing the DB tables so they fit properly in the cube schema when I get to Schema Workbench and Mondrian. For instance, in retrospect I maybe should have recommended keeping the customer and internal numbers in the same table. That way when you get to the schema, the customer name could be the first level and the internal numbers could be the second level. That's another example of that coarse-to-fine level progression i mentioned before. In my defense, when I saw "customer_id" I thought "snowflake" and it broke one of the Kimball rules. In reality, Mondrian handles snowflakes just fine anyway.

    Oh yeah, the junk dimension. I was basically proposing 3 flags because that's what I found in my search for useful methods in those Java libraries. Those would be isValidNumber (Boolean), ValidationResult (1 to 4), and NumberType (1 to 11). Most likely those belong in the fact table and could be considered degenerate dimensions exactly the way you quoted the manual. However, I propose putting them in a junk dimension and actually making them more useful in the process. Instead of just the Boolean flag I also included a string equivalent ("VALID" and "INVALID") and similar companion strings for the two integer flag fields. Adding the strings doesn't change the Cartesian product since they match the flag fields one-for-one. In essence, each flag has two pieces - a numeric value and a string equivalent. Make sense now?

    1 & 2. Okay, the fact table I proposed was not aggregated. It is the root of all data. The grain is one call. Yes, it may look like the source data but that data is not in a dimensional model I suspect. Where your row was (15, 30) mine would be 15 rows. The CallNumbers would be a sequence from 1 to 15, the CallCounts all 1, and the CallMinutes all something less than 30. Aggregating my 15 rows and dropping the sequence reduces them down to your one row. The moral of all that is the concept of aggregate tables. Read that chapter in M.I.A. and you will see what I mean. Given the size of your fact table you may need to aggregate even further to improve query times.

    3. The natural key fields are what you use in the Dimension Lookup/Update step to see if your new data matches an existing dimension row or needs a new row added. It is what differentiates each row in the dimension. Everything else in the dimension is fluff (basically properties) except, of course, the surrogate key. Creating a UNIQUE index in the DB table makes sure that uniqueness is enforced but also speeds up the performance of the dimension lookup step. It takes more DB space for the index table though.

    4. No, the flags key and the external number key are not the same. They are tied together by the fact that they are side by side in every fact table row. I proposed separating them into different tables in the model to save space in the fact table/external dimension table because those tables are very long and those wide flags will be much more efficient in a very short table. Note that this puts the data in separate tables but that doesn't necessarily mean separate dimensions in the schema. Hence the mea culpa above... I will do a little more research on this and try to make more sense of it in practice.

    P.S. You can open .ktr files with an ordinary text editor and probably find the code somewhere in the XML. There's nothing really new there though. It just demonstrates the use of all those Java methods I listed in the notes columns of the PDF.
    Last edited by darrell.nelson; 12-07-2013 at 08:27 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  5. #15
    Join Date
    Apr 2011
    Posts
    103

    Default

    Thank you, again for the conversation. Can't say that enough.

    I am definitely concerned with DB size (the original one, not the aggregated one) and think I will need to aggregate by day (the current fact table I have, so the 15/30 deal) - though I do understand the CallNumber column now (1-15). Very interesting to me.

    Going back to my original thoughts towards the external number (1202 is in DC and isValid, just like 2013-12-07 is a Saturday and is in Dec) - I don't (think I) like the two external number tables requiring a fact table to be related. It seems 'cluttered' to me - but that may be because I'm a noobie here! I do like the idea of a 'thin' dimension table, though, and really like your last line (that just because there may be 2 tables, doesn't mean it has to look like - or be - two dimensions...that would have not have come naturally to me!). While I don't know 'what' exactly I'd do with the external dimension at this point, I do know that it is interesting on it's own and I can see running reports against it to get the number of INVALID or TOO_LONG entries, etc. That is, I can see doing analytics against the external_dimension_tables themselves - part of why I built the fat one I have today.

    For a little more clarity on the natural key, you were saying that the 'lookup' would be on BOTH natural keys right (the cc and phNum)? I read a little about and understand why it's called natural key, I just didn't grasp the concept of joining on both in a schema (I have joined plenty of tables on 2 columns, though).

    This is all really great stuff, man! I'm excited to keep building this joker out!

  6. #16
    Join Date
    Nov 2008
    Posts
    738

    Default

    Regarding the compound natural key, it (they) would really only be used during lookups in Kettle when building/updating the dimension and fact table. They are not used for joining tables. The single field surrogate key is still used for that. In reality, that's its main purpose since it adds no value (sic) to the data.

    Thinking again about the flags dimension, maybe it makes sense to hang it off the external number dimension, i.e., hold the foreign key to the flags in each row of the numbers dimension instead of putting both keys in the fact table. This goes against one of the Kimball rules as it then becomes a snowflake schema but it would definitely narrow that very long external number dimension. Give it some thought anyway, and so will I...
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  7. #17
    Join Date
    Apr 2011
    Posts
    103

    Default

    Thank you, Sir.

    For posterity of those future readers of this thread (as well as our continuing conversation), from M.I.A. section 3.2.3 (snowflake design):

    There are typically two use cases where snowflakes make good sense:
    1. To reduce the size of dimension table by factoring out seldom used but really big columns, and
    2. To more easily manage a Type I type attribute in an otherwise Type II dimension.
    I suppose the decision to break apart (in this use case we are discussing) the external_dim table into a snowflake (external_flags) would (should?) be made based on an understanding of the performance gain in doing so. I definitely understand the reasons for your earlier list of questions! I don't have a good feel (and zero experience) on what it might be, but by understanding the potential size, etc, one could figure that....and that, said, a good 'rule of thumb' sounds like 'thin' dimensions.

    I'll revisit it this week and maybe I can create the time to test/build-both.

    Many many thanks.

  8. #18
    Join Date
    Nov 2008
    Posts
    738

    Default

    Well, at least we're getting somewhere! I've led you around in circles but my last post resonates with the case #1 (reduce size) you just quoted. The flags dimension should hang off the external number dimension for that very reason without cluttering up your keys in the fact table. Nice.

    i suppose the customer dimension I suggested should hang off the internal number dimension in a similar manner. Given the lower cardinality of the internal numbers, however, the customer information could also be included directly in with the numbers but I like the idea of holding the customer info on its own. It could potentially be a case #2 scenario but I suspect you might find a use for a standalone customer dimension in another application down the road. Incidentally, that's one more of Ralph's lessons: standardize those dimensions and make them reusable.

    Oh, and regarding your "thin" dimension rule, that should only apply to dimensions with very high cardinality. For smaller dimensions (I would say less than 500K) I wouldn't hesitate to load them up with lots of fields and stay with the more simple star model. Just look at how wide Kimball's date dimension is. And mine is usually even wider with the multiple hierarchies I use.

    Happy hunting...
    Last edited by darrell.nelson; 12-08-2013 at 05:42 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  9. #19
    Join Date
    Apr 2011
    Posts
    103

    Default

    Thinking through this external_num_flags snowflake...

    Thoughts on using the external_num_id for the external_num_flags_id, as well?

    In other words, I have a dedicated dim_external_num job that builds the 1st dimension and it will (probably) look like:

    1. external_num_id
    2. external_num
    3. external_num_raw
    4. country
    5. stateprov
    6. city

    to link to external_num_flags my first thought was to have a "7. external_flags_id" but then it dawned on me that it's always going to be 1:1 so why create a new surrogate key (when there is a unique one in dim_external_num)?

    external_num_flags might be:
    1. external_num_id
    2. regionCode
    3. nationalNumber
    4. isValid
    etc...

    In this way, as well, the ID in the fact table could be used to link to both dimensions directly (maintain star) or, if in a snowflake, I'd be using the external_num_id in the external_num dimension as both a fk (to join the fact table) and as a pk (to join the flags dimension).

    I am not sure the disadvantage here, though?

  10. #20
    Join Date
    Nov 2008
    Posts
    738

    Default

    The relationship between the external numbers dimension and the external flags dimension is definitely NOT a 1:1 relationship! If that were the case, there would be no reason to create a flags dimension at all! You would just cram everything into one table.

    Looking back you will see that I proposed breaking out the external flags into a combination (junk) dimension - specifically isValidNumber (T/F), ValidationResult (1..4), and NumberType (1...11) - as well as their text string equivalents. The maximum cardinality of such a dimension table would be 2 x 4 x 11, or 88. My reason for the external flags dimension was solely to keep those fairly long "text string equivalents" out of the very high cardinality external number dimension. My "great idea" and proposal was to make those strings available in your analysis queries because you seemed to be struggling (re: title of your post) with the Boolean nature of the isValidNumber field - and not bloat the external number dimension in the process. Remember now?
    Last edited by darrell.nelson; 12-09-2013 at 02:53 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •