Hitachi Vantara Pentaho Community Forums

Thread: Spoon reads an extra line for file input with wrapped data

  1. #1
    Peter Hunsberger Guest

    Spoon reads an extra line for file input with wrapped data

    Basic issue is in org.pentaho.di.trans.steps.textfileinput.TextFileInput.

    At line 1305 we have:

    int bufferSize = 1;

    which means that the code at line 1322:

    for (int i = 0; i < bufferSize && !data.doneReading; i++)

    will always read at least one line of the file when a file is first opened.

    However, the code at line 1360 does:

    data.pageLinesRead = 0;

    Apparently this does not cause problems when you are not using wrap
    (though I haven't investigated why not). However, for files with
    wrapped lines it means that an extra line of data is read on the
    first set of wrapped lines. If your data buffer size is set to the
    expected length, the last line read is lost completely. If it is set
    bigger, you will see a bigger set of data for the first set of lines
    than for all subsequent sets of lines.
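
    To make the symptom concrete, here is a minimal, self-contained sketch
    of the effect described above. It is a toy model, not the actual
    TextFileInput code: the class name, the group() helper and the preRead
    parameter (standing in for bufferSize) are all made up for illustration.
    It only assumes that lines read at open time end up in the first record
    without being counted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Kettle.
class WrapReadSketch {

    /**
     * Groups physical lines into records of (1 + wraps) lines each.
     * preRead models the lines pulled in when the file is first opened.
     */
    static List<List<String>> group(List<String> lines, int wraps, int preRead) {
        List<List<String>> records = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        int pos = 0;

        // Lines read at file-open time land in the pending record...
        for (int i = 0; i < preRead && pos < lines.size(); i++) {
            pending.add(lines.get(pos++));
        }
        // ...but the counter starts from zero afterwards, so they are not counted.
        int linesCounted = 0;

        while (pos < lines.size()) {
            pending.add(lines.get(pos++));
            linesCounted++;
            if (linesCounted == wraps + 1) {
                records.add(pending);
                pending = new ArrayList<>();
                linesCounted = 0;
            }
        }
        if (!pending.isEmpty()) {
            records.add(pending); // trailing partial record
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a1", "a2", "b1", "b2", "c1", "c2");
        // One line pre-read: the first record is one line too long.
        System.out.println(group(lines, 1, 1)); // [[a1, a2, b1], [b2, c1], [c2]]
        // Nothing pre-read: every record has the expected two lines.
        System.out.println(group(lines, 1, 0)); // [[a1, a2], [b1, b2], [c1, c2]]
    }
}
```

    With one line pre-read, the first record swallows an extra line and
    every later record is shifted by one, which is the pattern I see in my
    data.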

    There are two possible fixes:

    1) the easiest is to just initialize bufferSize = 0; at line 1305

    2) if other code depends on the fact that there is in fact a line of
    data read at file open time, then the fix would be more complex. A
    flag would have to be added that says the file was just opened and set
    to true when the file is first opened. The flag would be checked on
    the first read of the wrapped lines, and the number of wrapped lines
    to be read would be decremented by one for the first read, and the
    flag now would be set to false.
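
    A rough sketch of what the flag in option 2 could look like, again as a
    standalone toy rather than a patch against TextFileInput. The class,
    the justOpened field and both method names are hypothetical; where the
    flag would really live in the step's data is exactly the open question.

```java
// Toy model only; none of these names exist in Kettle.
class JustOpenedFlagSketch {
    private final int linesPerRecord;    // 1 + number of wrapped lines
    private boolean justOpened = false;  // hypothetical flag from option 2

    JustOpenedFlagSketch(int linesPerRecord) {
        this.linesPerRecord = linesPerRecord;
    }

    /** Models opening a file: one line has been pre-read, so raise the flag. */
    void openFile() {
        justOpened = true;
    }

    /** Number of lines still to read for the next record. */
    int linesToRead() {
        if (justOpened) {
            justOpened = false;         // only the first read is shortened
            return linesPerRecord - 1;  // one line is already in the buffer
        }
        return linesPerRecord;
    }

    public static void main(String[] args) {
        JustOpenedFlagSketch step = new JustOpenedFlagSketch(3);
        step.openFile();
        System.out.println(step.linesToRead()); // 2
        System.out.println(step.linesToRead()); // 3
    }
}
```

    The sketch is single-threaded on purpose; it says nothing about the
    asynchronous cases.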

    The second fix gets into issues of where to set the flag and possibly
    asynchronous use cases which I haven't investigated. So the question
    is, is it possible that any part of the code requires a line of data
    to be read at file open time? If not, the trivial fix is the way to
    go (and it in fact works fine for my data). Do I even need a patch
    for a fix like this, and if so, what format?


    --
    Peter Hunsberger

    --~--~---------~--~----~------------~-------~--~----~
    You received this message because you are subscribed to the Google Groups "kettle-developers" group.
    To post to this group, send email to kettle-developers (AT) googlegroups (DOT) com
    To unsubscribe from this group, send email to kettle-developers+unsubscribe (AT) googlegroups (DOT) com
    For more options, visit this group at http://groups.google.com/group/kettle-developers?hl=en
    -~----------~----~----~----~------~----~------~--~---

  2. #2
    Matt Casters Guest

    Re: Spoon reads an extra line for file input with wrapped data

    Very nice Peter. Now put this information in JIRA, NOT on this mailing list.

    TIA!

    Matt

    Matt Casters <mcasters (AT) pentaho (DOT) org>
    Chief Data Integration
    Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
    Pentaho : The Commercial Open Source Alternative for Business Intelligence



    On Thursday 20 August 2009 18:24:28 Peter Hunsberger wrote:
    >
    > Basic issue is in org.pentaho.di.trans.steps.textfileinput.TextFileInput.
    > [...]

  3. #3
    samatar hassan Guest

    Re: Spoon reads an extra line for file input with wrapped data

    FYI,

    the related jira is PDI-2607

    http://jira.pentaho.com/browse/PDI-2607

    Samatar





  4. #4
    Peter Hunsberger Guest

    Re: Spoon reads an extra line for file input with wrapped data

    On Thu, Aug 20, 2009 at 11:28 AM, Matt Casters<mcasters (AT) pentaho (DOT) org> wrote:
    >
    > Very nice Peter. Now put this information in JIRA, NOT on this mailing list.
    >


    That's already been done, question is, do you want a patch for a one
    line fix or do you want the more complex patch?

    --
    Peter Hunsberger


  5. #5
    Jens Bleuel Guest

    Re: Spoon reads an extra line for file input with wrapped data

    > That's already been done, question is, do you want a patch for a one
    > line fix or do you want the more complex patch?


    Please stop discussing this singular case on this list - I strongly
    believe it is not of interest for all developers and you need time to
    evaluate this.

    Just imagine everyone who has a JIRA case would go this way and beat the
    drums on the market place....

    Everyone, who is interested in this case, can follow up on the JIRA
    case: http://jira.pentaho.com/browse/PDI-2607

    Thanks,
    Jens


  6. #6
    Peter Hunsberger Guest

    Re: Spoon reads an extra line for file input with wrapped data

    On Thu, Aug 20, 2009 at 2:52 PM, Jens Bleuel <jbleuel (AT) pentaho (DOT) com> wrote:


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.