Hitachi Vantara Pentaho Community Forums
Results 1 to 10 of 20

Thread: Job/Batch IDs ... no way to link the log files?

  1. #1
    Join Date
    Apr 2007
    Posts
    14

    Default Job/Batch IDs ... no way to link the log files?

    Greetings,

    Still refining my ETL project to read files in a directory, process the files, update tables, then move the files.

    I am using a JOB to fire off multiple transformations as well as a single Shell step.

    I have turned on logging (and batch ID) in the job settings as well as in each of the transformations. I set up a job log table as well as a transformation log table; both have the requisite batch-ID fields.

    Edit: Pass IDs to jobs is turned on.

    When I run the job, I get an entry into the job log table and multiple entries into the transformation log table (one for each transformation process).

    What I don't get is any common "batch ID" (or any ID) in either table (job log or transformation log) that links anything together: no job ID linking the transformation rows back to the job, and no ID grouping the transformation rows with each other.

    I would assume that a "batch-id" would start at the JOB level and be passed into each of the transformation logs linking them all together.

    Isn't this what the batch ID was meant for? And if so, what am I doing wrong?
    Last edited by DavidJ; 05-07-2007 at 01:52 PM.

  2. #2
    Join Date
    Apr 2007
    Posts
    14

    Default

    Bump for a reply.

    Did I make it too complicated?

    Basically, I would like to be able to create a web page that displays all the logs for a particular run of a particular job, but that requires something universal in all the logs that are germane to that one run.

    Is this just beyond the scope of the logging system?

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Here is what I usually propose:

    Create a table yourLogTable with a field BATCH_ID in it, seeded at 1.
    At the start of your job, do a SELECT max(batch_id) FROM yourLogTable, call the result "batch_id", and assign it to a variable for ease of use.
    At the end of your job, increment it by 1, but ONLY if the complete job ran OK.

    Store that ID in your fact tables and dimensions to keep track of what got updated when.

    Before loading, delete all rows from the fact tables carrying that "batch_id". Normally 0 rows will be deleted, but if a previous job was aborted, this performs the cleanup first.
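    The scheme above can be sketched end to end like this (a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real log database; only yourLogTable and BATCH_ID come from the post, everything else is illustrative):

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")  # stand-in for the real logging database
    cur = con.cursor()

    # Create yourLogTable with a single BATCH_ID row, seeded at 1.
    cur.execute("CREATE TABLE yourLogTable (BATCH_ID INTEGER)")
    cur.execute("INSERT INTO yourLogTable VALUES (1)")

    def start_of_job(cur):
        # At job start: SELECT max(batch_id) and keep it in a variable.
        cur.execute("SELECT MAX(BATCH_ID) FROM yourLogTable")
        return cur.fetchone()[0]

    def end_of_job(cur, job_ok):
        # At job end: increment by 1, but ONLY if the complete job ran OK,
        # so an aborted run reuses (and first cleans up) the same batch id.
        if job_ok:
            cur.execute("UPDATE yourLogTable SET BATCH_ID = BATCH_ID + 1")

    batch_id = start_of_job(cur)
    # ... run the job: stamp batch_id into facts/dimensions, after first
    # deleting any fact rows already carrying this batch_id ...
    end_of_job(cur, job_ok=True)
    ```

    On the first run batch_id is 1; only after a fully successful run does the next run see 2.
    
    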

    We did not automate this because it's easy to do in a job and because preferences vary too much here.
    The batch_id used in jobs and transformations is more of a way to id the operational metadata.

    HTH,

    Matt

  4. #4
    Join Date
    Apr 2007
    Posts
    14

    Default

    Matt,

    Thanks for the response, but after reading it, I don't think you understood what I was asking for.

    I am not asking about putting a "batch id" in Fact/Dimension tables.

    I am desiring to relate Log files together for easy, and reliable, extraction.

    In a JOB, under Job Settings, under Log, there are two options:

    1) Use Batch ID?

    2) Pass Batch ID?

    The problem seems to be that option #2 doesn't appear to "pass" the Batch ID to Transformations (or their Logs).

    In a Transformation, under transformation settings, under Logging there is one option:

    1) Use Batch-ID

    I am expecting a second option:

    2) Use Batch-ID from previous Job/Transformation

    That's it.

    If #2 is checked, then it inherits the batch-id from the job/transformation that called it.

    The problem? If a complicated job spawns 25+ log files that need to be reviewed, and the job runs every 30 mins, by the end of the day (or week) it's going to be extremely hard to do so without some piece of data that relates the log files together for each 30-minute run.

    Relational databases are fantastic ... when you have data that can be related :-)

    Does that make more sense?

  5. #5
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    I gave you a method of generating a unique batch ID per successful run.
    You have an internal batch_id that Kettle generates.
    That gives you the data to link the 2 together.

    If you then also store the date of the run in yourLogTable you can report on all the jobs, subjobs, transformations executed for a single run/day/week/month.

    Storing the batch_id in the facts and dimensions is irrelevant to the topic but considered "best practice" by some.

    All the best,

    Matt

  6. #6
    Join Date
    Apr 2007
    Posts
    14

    Default

    Matt,

    Thank you for your time, but I am still in the same position as I was when I started this thread.

    Quote Originally Posted by MattCasters View Post
    You have an internal batch_id that Kettle generates.
    That gives you the data to link the 2 together.
    Where is this internal batch_id that links the 2 together? I can't find it.

    My MySQL Log structure for "Jobs" has an "ID_Job" that is unique for each job (or row in the table).

    My MySQL Log structure for "Transformations" has an "ID_Batch" that is likewise "Unique" for each transformation run (or row in the table).

    By "internal" do you mean hidden, as in not being written to the database log tables? And if so, how do I get that internal batch_id written into the log rows in both tables so I can link them?

    I just ran my job again, I got 1 row in the "Job Log" and 13 rows in the "Transformation Log" and there is no common data in the 13 rows to be able to even link the 13 rows together, let alone link those 13 rows back to the 1 Job Log entry.

    FYI: Both MySQL Log Structures were created by Kettle using the SQL buttons.

    Help me, Obi-Wan Kenobi ...

    - David

  7. #7
    Join Date
    Apr 2007
    Posts
    19

    Default

    I was facing the same problem that David is facing. Under Job -> Settings there is a check box, "Pass batch-id", which, when checked, is supposed (per the docs) to pass the batch ID to the transformations within the job. Instead, each transformation gets its own batch ID, which makes it difficult to link them to a job execution.

    I am using Kettle version 2.5.


    Uppili

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Well guys, again, it doesn't matter what meaning those Kettle-generated batch IDs have. They could be random; you shouldn't care about them.

    If you have an ID per run / night / week / month, call it RUN_ID.
    The Job batch ID is called JOB_BATCH_ID (from the Job log table)
    The Trans batch ID is called TRANS_BATCH_ID (from the trans log table)

    It follows that if you look at, say, the transformations executed at any given time, you have both the RUN_ID and the TRANS_BATCH_ID.
    Likewise, for the job you have the RUN_ID and the JOB_BATCH_ID.

    That allows you to group the log entries of each run to get an overview.
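    As a sketch of that grouping: once every log row also carries the RUN_ID, one run's entries come back from a plain join (sqlite3 via Python as a stand-in database; apart from RUN_ID, JOB_BATCH_ID, and TRANS_BATCH_ID named above, the table layout and sample data are illustrative assumptions):

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Minimal stand-ins for the job and transformation log tables,
    # each extended with the RUN_ID described above.
    cur.execute("CREATE TABLE job_log (JOB_BATCH_ID INTEGER, JOBNAME TEXT, RUN_ID INTEGER)")
    cur.execute("CREATE TABLE trans_log (TRANS_BATCH_ID INTEGER, TRANSNAME TEXT, RUN_ID INTEGER)")
    cur.execute("INSERT INTO job_log VALUES (101, 'nightly_load', 7)")
    cur.executemany("INSERT INTO trans_log VALUES (?, ?, ?)",
                    [(501, 'read_files', 7),
                     (502, 'update_tables', 7),
                     (400, 'old_trans', 6)])

    # One query returns every log entry belonging to a single run.
    cur.execute("""
        SELECT j.JOBNAME, t.TRANSNAME
        FROM job_log j
        JOIN trans_log t ON t.RUN_ID = j.RUN_ID
        WHERE j.RUN_ID = 7
    """)
    rows = cur.fetchall()
    # rows holds only run 7's transformations, linked back to their job.
    ```
    
    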

    All the best,

    Obi-Wan Kenobi a.k.a. "Your only hope"

  9. #9
    Join Date
    Apr 2007
    Posts
    14

    Default

    Matt,

    Thanks for taking the time to answer. I think we are getting to the nitty-gritty ... I have no JOB_BATCH_ID or TRANS_BATCH_ID in my tables.

    Read on ...

    Quote Originally Posted by MattCasters View Post
    If you have an ID per run / night / week / month, call it RUN_ID.
    See schema below, I believe it's the ID_JOB.

    The Job batch ID is called JOB_BATCH_ID (from the Job log table)
    The Trans batch ID is called TRANS_BATCH_ID (from the trans log table)
    Neither of those fields exist in my tables. I have a fresh 2.5 install and used the SQL options under the logging tabs in both Job and Transformation to create my tables. Schema follows:

    CREATE TABLE `impLogJob` (
    `ID_JOB` int(11) default NULL,
    `JOBNAME` varchar(50) default NULL,
    `STATUS` varchar(15) default NULL,
    `LINES_READ` bigint(20) default NULL,
    `LINES_WRITTEN` bigint(20) default NULL,
    `LINES_UPDATED` bigint(20) default NULL,
    `LINES_INPUT` bigint(20) default NULL,
    `LINES_OUTPUT` bigint(20) default NULL,
    `ERRORS` bigint(20) default NULL,
    `STARTDATE` datetime default NULL,
    `ENDDATE` datetime default NULL,
    `LOGDATE` datetime default NULL,
    `DEPDATE` datetime default NULL,
    `REPLAYDATE` datetime default NULL,
    `LOG_FIELD` mediumtext
    ) ;

    CREATE TABLE `impLogTrans` (
    `ID_BATCH` int(11) default NULL,
    `TRANSNAME` varchar(50) default NULL,
    `STATUS` varchar(15) default NULL,
    `LINES_READ` bigint(20) default NULL,
    `LINES_WRITTEN` bigint(20) default NULL,
    `LINES_UPDATED` bigint(20) default NULL,
    `LINES_INPUT` bigint(20) default NULL,
    `LINES_OUTPUT` bigint(20) default NULL,
    `ERRORS` bigint(20) default NULL,
    `STARTDATE` datetime default NULL,
    `ENDDATE` datetime default NULL,
    `LOGDATE` datetime default NULL,
    `DEPDATE` datetime default NULL,
    `REPLAYDATE` datetime default NULL,
    `LOG_FIELD` mediumtext
    ) ;

    It follows that if you look at, say, the transformations executed at any given time, you have both the RUN_ID and the TRANS_BATCH_ID.
    Likewise, for the job you have the RUN_ID and the JOB_BATCH_ID.

    That allows you to group the log entries of each run to get an overview.
    So, where are JOB_BATCH_ID and TRANS_BATCH_ID? As you can see, they are not in my schema.

    Furthermore:

    Job Properties -> Log -> SQL yields: "No sql needs to be executed. The log table looks great!"

    Transformation Properties -> Logging -> SQL yields the same.

    ###### Test ########

    Upon the rare chance there was a bug in the SQL create/alter procedures, I manually added impLogJob.JOBS_BATCH_ID and impLogTrans.TRANS_BATCH_ID as unsigned, nullable integers with no defaults.

    Ran my process and those fields are NULL.

    ###### Test 2 #####

    Switched the fields to impLogJob.TRANS_BATCH_ID and impLogTrans.JOBS_BATCH_ID ... just in case.

    Ran again ... nothing but NULLS!!

    ##### Test 3 #####

    Dropped both log tables and let Kettle recreate them. JOBS_BATCH_ID and TRANS_BATCH_ID were not created.

    So Matt, do your job and transformation schemas match mine, or are they different? Are you on 2.5 stable or a later dev version?

    - David

  10. #10
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Dear Mr Carter,

    impLogJob.ID_BATCH and impLogTrans.ID_BATCH are being populated.
    You can generate a batch ID per run, successful job or not (up to you). Set a variable.

    impLogTrans.ID_BATCH is available through Get System Info if logging is enabled.
    impLogJob.ID_BATCH is available through Get System Info (parent job id) if logging is enabled.
    If you for some reason can't get at it, how 'bout doing a select max(id_batch) from the job log table?

    That means all the parameters are indeed available to link it all together.
    Granted, I didn't tell you how to do it.
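    The suggested fallback can be sketched like this (sqlite3 via Python as a stand-in; the table name impLogJob comes from the schema posted above, and ID_BATCH follows this post's wording, so treat the column name as an assumption to verify against your actual table):

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    cur = con.cursor()

    # Stand-in for the Kettle-generated job log table, assuming its
    # batch-id column is populated on each run as this post describes.
    cur.execute("CREATE TABLE impLogJob (ID_BATCH INTEGER, JOBNAME TEXT)")
    cur.executemany("INSERT INTO impLogJob VALUES (?, ?)",
                    [(1, 'import'), (2, 'import'), (3, 'import')])

    # "select max(id_batch) from the job log table": the id of the most
    # recent job run, usable to tag the matching transformation log rows.
    cur.execute("SELECT MAX(ID_BATCH) FROM impLogJob")
    current_batch = cur.fetchone()[0]
    ```

    Inside Kettle itself, the Get System Info step is the more direct route; the query is only the fallback Matt mentions.
    
    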

    Have fun,

    Said Matt


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.