Hitachi Vantara Pentaho Community Forums
Results 1 to 6 of 6

Thread: EDI parser step

  1. #1
    Darren Hartford Guest

    EDI parser step

    Before I begin the venture of writing an EDI File Input step, just a
    couple of questions:

    1) Has someone already started one?

    2) Any tips as I start working with this, especially concerning EDI
    being very hierarchical? One of my concerns is 'denormalizing' the EDI
    file to the point where each record/datastream would have many (200?)
    fields (transaction segments, header segments repeated for each detail
    segment). Think deeply nested XML file.

    The intent is to create a module that can help define how to parse an
    EDI file into useful fields. The initial use case would be X12-style
    files (there are many formats). If you don't know about EDI files,
    well, count yourself lucky ;-)
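    To make the parsing problem concrete, here is a minimal sketch of splitting an X12-style interchange into segments and elements. This is an illustration, not a real step implementation; the "~" segment terminator and "*" element separator are the common X12 defaults, though real files declare their delimiters in the ISA header.

    ```python
    # Split an X12-style EDI string into (segment_id, elements) pairs.
    # Separators are assumed defaults; real X12 declares them in the ISA segment.
    def parse_x12(data, segment_sep="~", element_sep="*"):
        """Yield (segment_id, elements) for each segment in the data."""
        for raw in data.split(segment_sep):
            raw = raw.strip()
            if not raw:
                continue  # skip trailing empty chunk after the last terminator
            parts = raw.split(element_sep)
            yield parts[0], parts[1:]

    # A tiny (made-up) purchase-order fragment:
    sample = "ST*850*0001~BEG*00*SA*PO123~PO1*1*10*EA*9.95~SE*4*0001~"
    segments = list(parse_x12(sample))
    # segments[0] -> ('ST', ['850', '0001'])
    ```

    The hierarchy Darren mentions comes on top of this: which segments group under which (header vs. detail) is defined per transaction set, which is where the real complexity lives.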



    --~--~---------~--~----~------------~-------~--~----~
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com
    To unsubscribe from this group, send email to kettle-developers-unsubscribe (AT) g...oups (DOT) com
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en
    -~----------~----~----~----~------~----~------~--~---

  2. #2
    Tim Pigden Guest

    RE: EDI parser step

    Hmm

    Surely we (the Kettle community) have to be doing something wrong if
    you are forced to denormalise a hierarchical structure only to put it
    back together again at a later stage.

    You ought to be able to decompose this into multiple output streams
    without having to bend the Kettle model (which currently only really
    caters for split, similar output streams plus error streams).

    In the long term, surely it has got to be multiple inputs and multiple
    outputs and possibly even hierarchical inputs and outputs.

    If the system doesn't fit the underlying data model then the system is
    wrong or has to have a very good excuse (like relational calculus and
    query optimisation) for being the way it is.

    Perhaps, if we're looking at version 3 or 4 improvements, this aspect
    of the underlying model should be high on the list of priorities.
    XML-type nested data structures are getting more and more common
    because, well, these days we can. Why use 4 files, copying an
    underlying relational model of the data, with all the attendant
    problems of the files getting out of sync, when you can use a natural
    structure like XML and send it around in one indivisible lump?

    I know complex hierarchies may be impractical for people processing
    zillions of identically structured records, but even those people will
    probably have to deal with other structures some time.

    Tim
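    The decomposition Tim suggests - one hierarchical document split into several flat streams linked by a key, instead of one 200-field denormalized row - can be sketched in a few lines. The field names and document shape here are made up for illustration; this is not any real Kettle API.

    ```python
    # Split nested documents into two flat streams (headers and details)
    # joined by a generated doc_id, rather than repeating header fields
    # on every detail row.
    def split_streams(documents):
        headers, details = [], []
        for doc_id, doc in enumerate(documents):
            headers.append({"doc_id": doc_id, "sender": doc["sender"]})
            for line in doc["lines"]:
                details.append({"doc_id": doc_id, **line})
        return headers, details

    docs = [{"sender": "ACME", "lines": [{"item": "A1", "qty": 10},
                                         {"item": "B2", "qty": 5}]}]
    headers, details = split_streams(docs)
    ```

    Alexandre's worry below still applies: with deep hierarchies this produces many streams, and recombining them downstream is where the transformations get complex.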




  3. #3
    Alexandre Guest

    Re: EDI parser step

    Tim

    I'm wondering whether splitting a very deep hierarchical structure
    into various output streams won't produce overly complex
    transformations (like 8 or 10 streams coming out of a step, with data
    from some of these streams being used together in calculation or
    JavaScript steps).

    I love the idea of structured data streams, but I don't know enough of
    Kettle to analyze the impact of an idea like this yet. I'm just
    starting to look at the source code :-(

    Alexandre


    --
    Alexandre Machado
    axmachado (AT) gmail (DOT) com


  4. #4
    Sven Boden Guest

    Re: EDI parser step

    Splitting hierarchical data over multiple streams won't really work
    currently, since all steps run in separate threads in parallel and the
    structure of each row over a hop has to be the same (no optional
    fields).

    Kettle is currently pretty close to relational calculus, and changing
    it to support hierarchical data would be "biggish". Personally, I
    haven't found a good way to describe actions on hierarchical data,
    e.g. a variable number of attributes, optional fields, ... If you
    consider XPath or something like it, it would get pretty complex and
    pretty slow.

    There is a Serializable type in Kettle right now that can be used to
    store about anything in rows, but you have to write customized steps
    for it. It was added as a backdoor for one company or another, and I
    would not really advise using it.

    If I need to process hierarchical data, I currently would pre-process
    it with custom applications: it's not because you have a hammer (i.e.
    Kettle) that everything in the world is a nail ;-)

    Regards,
    Sven

    P.S. On a side note, not that it's a good excuse, but all major ETL
    products use a kind of relational structure to pass information
    internally.



  5. #5
    Tim Pigden Guest

    RE: EDI parser step

    Sven,
    When Matt is telling us that at the moment

    "we're doing too much work in the various steps by creating metadata
    objects and settings for all the Value objects.
    In a step that reads 1M rows with 150 strings each, we create the
    following objects 150.000.000 too much:"

    we're clearly carrying around all the object overhead without the
    benefits of an object-based record solution.

    Now we could view the opportunity in three ways. We could go for
    rigid fixed metadata, which would clearly give us the opportunity to
    do away with lots of objects and go for efficient processing of byte
    arrays and the like.

    Or we could say "hell, it's got the overhead of a smart process, why
    not make it one?"

    Or we can take an intermediate step and treat the raw data one way and
    the objects another - the classic compromise of int vs. Integer that
    has been in Java from the first version (until v5, where autoboxing
    tries to brush the distinction under the carpet).
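    The first option - holding the metadata once per stream rather than once per value - can be illustrated with a small sketch. The class name RowMeta and its shape are illustrative only, not Kettle's actual classes: the point is that a million rows share one metadata object and each row is a bare value array.

    ```python
    # One metadata object per stream; rows are plain value arrays.
    # A million rows carry zero per-value descriptor objects.
    class RowMeta:
        def __init__(self, fields):
            self.fields = fields                              # [(name, type), ...]
            self.index = {n: i for i, (n, _) in enumerate(fields)}

        def get(self, row, name):
            """Look a value up by field name via the shared index."""
            return row[self.index[name]]

    meta = RowMeta([("id", int), ("name", str)])
    rows = [[i, f"name{i}"] for i in range(3)]    # bare lists, no metadata attached
    # every row is interpreted through the single shared `meta`
    ```

    This is roughly the "rigid fixed metadata" end of the trade-off: cheap rows, but every row on a hop must conform to the one shared layout - which is exactly what makes optional, hierarchical fields awkward.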

    But while I would admit that this is not too bad in a single-JVM
    environment, I can see it would be hairy when it comes to passing data
    around between members of a cluster. I just don't know what happens
    there.

    But I can see that if Kettle is merely an excellent hammer, people may
    go elsewhere to buy a full toolkit (which may be stretching the analogy
    too far, since you can't have incompatible hammers ...)

    ETL systems do what they do because they have always done it that way.
    If your primary sources of data have always been databases, flat files
    or homogeneous record sets, then it's an easy decision to make your
    tools work like that - especially if that's what everybody else does.

    I know EDI has been around for donkey's years, but it's niche. It
    moves data between a few (big) retail systems, but after that you can
    probably get at the data some easier way in any case. But as Darren
    said or implied in the first place, it's like XML: once you get over
    the syntactic wrapping of the actual file (/data) layout, it's very
    like XML. And whereas EDI has been pretty niche, XML is big.

    Apologies for my ignorance of how Kettle actually approaches the
    processing of streams, but surely a single-input, multi-output step
    doesn't intrinsically contradict a thread-per-stream design. Nor
    should a transformation that has to consume all its inputs before
    producing any outputs.

    As for the issue of describing the actions - why not XPath? It might
    not work for your 10,000,000 records, but it surely works for my 500.
    As long as the Spoon developer can see that using XPath on 10M records
    is going to take a long time, he/she can see the error of his/her ways
    and choose a more efficient method for the data they have (a raw SAX
    parser transform?).
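    The trade-off Tim is pointing at can be shown with Python's standard library: build a full tree and query it by path for small files, or stream events SAX-style for large ones. The element names here are made up; iterparse is used as the streaming stand-in for a raw SAX parser.

    ```python
    import io
    import xml.etree.ElementTree as ET

    doc = "<orders><order id='1'/><order id='2'/></orders>"

    # Small data: parse the whole tree, then query with a path expression.
    tree = ET.fromstring(doc)
    ids_xpath = [o.get("id") for o in tree.findall(".//order")]

    # Big data: stream end-of-element events and discard elements as
    # they complete, so the whole document never sits in memory at once.
    ids_stream = []
    for _, elem in ET.iterparse(io.StringIO(doc), events=("end",)):
        if elem.tag == "order":
            ids_stream.append(elem.get("id"))
            elem.clear()
    ```

    Both give the same answer; the difference is memory and speed on the 10M-record file, which is exactly the judgement call Tim leaves to the developer.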

    Anyway, enough of this late Friday night rambling - I'm sure you've
    got better things to do this weekend than read this.

    Goodnight
    Tim




  6. #6
    Sven Boden Guest

    Re: EDI parser step

    Lol... I'm not per se against hierarchical processing ;-) ... I just
    haven't found a nice and elegant way to express actions on it (and
    even the ugly versions run like a snail). Have a look, e.g., at the
    XML input step, which only handles a fraction of all possible XML. I
    tried building some hierarchical support for a COBOL file step (in a
    local Kettle version), but it didn't really work out; I'm still
    thinking about the COBOL step.

    The current way of the rows (however it's implemented) is like
    relational calculus: simple and elegant.

    If you have an elegant way of describing actions on XML files,
    hierarchical data... try it out locally and propose it. If it's good,
    it's good. Amaze me.

    Enough rambling for today (for me anyway).

    Regards,
    Sven




Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.