Fetch change list


Fetch change list

anup s
Hi,
                I'm trying to fetch a set of files that have recently changed on a filesystem, while keeping the original copies in place.
To pick up only the recently changed files, I'm using a PutFile with the "replace" conflict-resolution strategy piped to a GetFile with a minimum file age of 5 sec, a maximum file age of 30 sec, and Keep Source File set to true.
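For reference, a minimal sketch of the configuration described above (these are the standard GetFile/PutFile property names; the directory path is taken from the errors below, and the remaining values are illustrative):

    PutFile
      Directory                    : /nifi/UNZ
      Conflict Resolution Strategy : replace

    GetFile
      Input Directory  : /nifi/UNZ
      Keep Source File : true
      Minimum File Age : 5 sec
      Maximum File Age : 30 sec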

I'm also running in clustered mode, and I'm seeing the issues below:

- The queue starts growing if there's an error.

- Continuous 'NoSuchFileException' errors.

- "Penalizing StandardFlowFileRecord" errors.




ERROR    161.91.234.248:7087    18:45:56 IST
GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve files due to org.apache.nifi.processor.exception.FlowFileAccessException: Failed to import data from /nifi/UNZ/log201403230000.log for StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,claim=,offset=0,name=6908587554457536,size=0] due to java.nio.file.NoSuchFileException: /nifi/UNZ/log201403230000.log

ERROR    161.91.234.248:6087    10:54:50 IST
PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,claim=1430717088883-73580,offset=0,name=file1.log,size=29314779] and transferring to failure due to org.apache.nifi.processor.exception.ProcessException: Could not rename /nifi/UNZ/.file1.log: org.apache.nifi.processor.exception.ProcessException: Could not rename: /nifi/UNZ/.file1.log

ERROR    161.91.234.248:7087    10:54:56 IST
PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,claim=1430717094668-73059,offset=1533296,name=file2.log,size=28014262] and transferring to failure due to org.apache.nifi.processor.exception.ProcessException: Could not rename: /data/softwares/RS/nifi/OUT/.file2.log: org.apache.nifi.processor.exception.ProcessException: Could not rename: /nifi/OUT/.file2.log



Do I have to tweak the Run Schedule, or adjust the minimum and maximum file ages, to overcome this issue?
What would be an elegant solution in NiFi?


Thanks,
anup


Re: Fetch change list

Corey Flowers
Good morning Anup!

         Is the pickup directory coming from a network share mount point?


RE: Fetch change list

anup s
Yes Corey, right now the pickup directory is a network share mount point. The data is picked up from one location and transferred to the other; I'm using site-to-site communication.


Re: Fetch change list

Corey Flowers
OK, the GetFile that is running is basically causing a race condition between all of the servers in your cluster. That is why you are seeing the "NoSuchFile" error. If you change the scheduling strategy on that processor to "On primary node", then the only system that will try to pick up data from that mount point is the server you have designated as the "primary node". This should fix that issue.
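For reference, a minimal sketch of the change Corey suggests, as it appears on the processor's Configure > Scheduling tab (the Run Schedule value is illustrative):

    GetFile > Configure > Scheduling
      Scheduling Strategy : On primary node
      Run Schedule        : 10 sec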


RE: Fetch change list

anup s
Thanks, Corey, for that info. But the major problem I'm facing is that I am backing up a large set of data into HDFS (with a GetHDFS, source retained as true) and then trying to fetch the delta from it (get only the files that have arrived recently, by using the min age and max age). But I'm unable to get the exact delta if I have 'Keep Source File' set to true.
I played around a lot with the schedule time and the min & max age, but it didn't help.
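To make the windowing problem concrete (the numbers here are illustrative, not from the thread): with Keep Source File set to true the file stays in the source directory, so with a min age of 5 sec and a max age of 30 sec, a processor on a 10 sec Run Schedule sees a file that arrived at t=0 on its runs at t=10 and t=20 and lists it twice, while a processor on a 30 sec Run Schedule misses a file that is 4 sec old at one run (too new) and 34 sec old at the next (too old). No choice of schedule and age window gives an exact delta; that requires remembering what has already been pulled.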


RE: Fetch change list

Mark Payne
Anup,
With the 0.1.0 release that we are working on right now, there are two new processors, ListHDFS and FetchHDFS, that are able to keep state about what has been pulled from HDFS. This way you can keep the data in HDFS and still pull in only new data. Will this help?
Thanks
-Mark
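For reference, a minimal sketch of the List/Fetch pattern Mark describes, as it could be wired up once those processors are available (the FetchHDFS property shown and its ${path}/${filename} default are assumptions here; the directory value is illustrative):

    ListHDFS   (scheduled on the primary node; keeps cluster-wide state of what it has listed)
      Directory : /data/backup
          |
          v   one FlowFile per newly listed file, carrying path/filename attributes
    FetchHDFS  (runs on every node)
      HDFS Filename : ${path}/${filename}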


Re: Fetch change list

Corey Flowers
Wahoo! Thanks Mark for saving me on this one!

Anup, before this release, it would not have been pretty to pull that delta off! :-)


Re: Fetch change list

Oscar de la Pena
Hi Mark,

My team and I are working on a scenario similar to Anup's, but we're using SFTP rather than HDFS as the remote file source. I'm wondering if there will also be processors like ListSFTP and FetchSFTP in the 0.1.0 release that can keep state about what has already been pulled? We are thinking of implementing a custom processor just to do that.

Thanks!
Owie


Re: Fetch change list

Mark Payne
Oscar,

No, we have not separated GetSFTP out into List/Fetch processors. It probably makes sense to do this for many of the Get* processors. However, we do have a ticket to add simple state management to the framework, so that processors can easily save state across the cluster. That is the difficult part of List* processors: it's pretty easy if you don't have to worry about running in a cluster, but clusters make it much more difficult. So I have held off on doing this for many processors until that is implemented.

If you want to tackle (S)FTP, though, and/or contribute to the framework's ability to manage state, we would be more than happy to work with you on it!

Thanks
-Mark
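To make the custom-processor idea concrete, here is a hypothetical Java sketch of the cluster-wide state handling a List-style processor needs. It uses the StateManager API that the state-management ticket Mark mentions eventually added to the framework; that API did not exist at the time of this thread, and the processor name and listing logic below are invented for illustration:

    // Hypothetical List-style processor that remembers the newest modification
    // time it has already emitted, shared across the whole cluster.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.nifi.annotation.behavior.Stateful;
    import org.apache.nifi.components.state.Scope;
    import org.apache.nifi.components.state.StateManager;
    import org.apache.nifi.components.state.StateMap;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;

    @Stateful(scopes = Scope.CLUSTER,
              description = "Remembers the newest file modification time already listed")
    public class ListRemoteFiles extends AbstractProcessor {

        private static final String LAST_TIMESTAMP_KEY = "last.listed.timestamp";

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            try {
                final StateManager stateManager = context.getStateManager();
                final StateMap stateMap = stateManager.getState(Scope.CLUSTER);

                // null on the very first run, i.e. list everything
                final String lastListed = stateMap.get(LAST_TIMESTAMP_KEY);
                final long minTimestamp = lastListed == null ? 0L : Long.parseLong(lastListed);

                // ... list the remote directory here and emit one FlowFile per file
                // whose modification time is strictly greater than minTimestamp ...
                long newestSeen = minTimestamp; // updated as files are listed

                // Persist the high-water mark so every node in the cluster sees it.
                final Map<String, String> newState = new HashMap<>();
                newState.put(LAST_TIMESTAMP_KEY, String.valueOf(newestSeen));
                stateManager.setState(newState, Scope.CLUSTER);
            } catch (final IOException ioe) {
                throw new ProcessException("Failed to read or update processor state", ioe);
            }
        }
    }

Running the listing on the primary node only, as Corey suggested earlier, avoids the race condition; the cluster-scoped state means a new primary node can pick up where the old one left off.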










Re: Fetch change list

anup s
Thanks, Mark, for that one; that should be a big relief. I'd be waiting to check that out!

Regards,
anup

On 05/05/15 9:09 pm, "Mark Payne" <[hidden email]> wrote:

>Anup,
>With the 0.1.0 release that we are working on right now, there are two
>new processors: ListHDFS, FetchHDFS, that are able to keep state about
>what has been pulled from HDFS. This way you can keep the data in HDFS
>and still only pull in new data. Will this help?
>Thanks-Mark
Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

Oscar de la Pena
Thanks Mark for the response. We will try to work on the SFTP List/Retrieve,
and we will be glad to contribute it back if time permits and it fits our
task schedule.
Owie

Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

Oscar de la Pena
Hi Mark,
I would like to help and contribute to the project by writing an implementation of List/Fetch (S)FTP.
I am thinking of taking a similar approach to List/Fetch HDFS, where the list of previously fetched
files is persisted in the distributed cache service. Using the cache service, the processor can
check which filenames have already been downloaded and skip those.
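
Roughly, the filtering step I have in mind would look something like the
sketch below (only a sketch; CacheClient here is a hypothetical stand-in
for the distributed cache service client, not the real API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the distributed cache service client API.
interface CacheClient {
    boolean contains(String key);
    void put(String key);
}

class FetchedFileFilter {

    // Keep only the listed files that have not been fetched before,
    // and record the ones we keep so that other nodes skip them.
    static List<String> filterNew(List<String> remoteListing, CacheClient cache) {
        List<String> toFetch = new ArrayList<>();
        for (String fileName : remoteListing) {
            if (!cache.contains(fileName)) {
                cache.put(fileName);
                toFetch.add(fileName);
            }
        }
        return toFetch;
    }
}

Keying the cache on filename plus modification timestamp, rather than the
name alone, would also let changed or re-uploaded files through, which
seems to matter for the delta case discussed earlier in this thread.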

I understand that the "simple state management to the framework" you mentioned before might
be the long-term solution and perhaps the more elegant implementation. Please let me know if there
is already an ongoing effort to implement that, or a plan to implement it soon; in that case I may just wait for it.

I'd be happy to work on the List/Fetch SFTP and share the implementation, since I will be using it on my own project as well.
Thanks!
Owie

Reply | Threaded
Open this post in threaded view
|

RE: Fetch change list

Mark Payne
Owie,

I think at this point we are still in the phase of designing how the state management would work and throwing around ideas. So it'll likely be a while before that's available.

If you want to tackle the (S)FTP stuff, then by all means go for it! We'd love to have the contribution. We can worry about iterating to take advantage of the new state management features once they exist; since I don't know when that will happen, it makes total sense to do at least a first-round implementation without them.

Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

anup s
In reply to this post by anup s
Hi Mark,
   I downloaded the latest version, and I see that the FetchHDFS processor could be used for the delta files that have arrived in HDFS. But how do I maintain a sync from a local file system to HDFS? I cannot move files from the local filesystem; they need to be copied.

I'm facing queueing issues while trying to maintain the sync.

Any thoughts on how I could tackle this issue?

Regards,
anup
Reply | Threaded
Open this post in threaded view
|

RE: Fetch change list

Mark Payne
Anup,

The List/Fetch HDFS would allow you to pull new data from HDFS without destroying it.

But it sounds like what you want here is to also pull from disk without removing the data. The GetFile processor does
not currently keep any state about what it has pulled in. It would likely be a fairly easy modification to GetFile if it is
reading from a local filesystem. If it is reading from a network-mounted filesystem like NFS, it gets much more complex, as
the state would have to be shared across the cluster, as with ListHDFS.

A few possible solutions I can offer in the meantime (I realize none is great, but each should work):

1. If you can move the data, you could use GetFile and then immediately route to PutFile, with PutFile writing the data to a different directory.

2. Similar to #1, you could use GetFile -> UpdateAttribute -> PutFile and put the data back into the same directory, using
UpdateAttribute to change the filename, perhaps to "${filename}.pulled", and then configure GetFile to ignore files that
end with ".pulled" (an example filter is shown below).

3. Use GetFile configured with a "Maximum File Age" of, say, 10 minutes, and run it only every 5 minutes. Then use DetectDuplicate
and throw away any duplicates (more on the cache key below). The downside here is that you would potentially pull in the same data
a couple of times, which is not super efficient. If there is a huge amount of data coming in, this may be less than ideal; but if
the data is arriving slowly, say 10 MB/sec, then maybe this is fine.
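
For #2, the File Filter on GetFile is a regular expression, so something
along these lines ought to skip the renamed copies (an untested sketch;
adjust it to your naming scheme):

    [^\.].*(?<!\.pulled)$

That keeps the default filter's behavior of ignoring hidden files and adds
"but not names ending in .pulled". For #3, if the filename alone is not a
unique enough key, a Cache Entry Identifier such as
${filename}-${file.lastModifiedTime} (GetFile writes file.lastModifiedTime
as an attribute, if I recall correctly) would let a changed file through
while still dropping straight re-reads.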

Does any of this help?

Thanks
-Mark

Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

anup s
Thanks Mark for those tips. I was mostly trying the third option,
but it didn't work well, as the data we are playing with is huge.

The problem with options 1 and 2 is that we cannot move or update that
directory.



On 22/05/15 7:25 pm, "Mark Payne" <[hidden email]> wrote:

>The List/Fetch HDFS would allow you to pull new data from HDFS without
>destroying it.
>


Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

anup s
Suppose I have 1 TB of data that I need to back up/sync to an HDFS location and then pass on to Kafka. Is there a way to do that?
Reply | Threaded
Open this post in threaded view
|

Re: Fetch change list

Joe Witt
Anup,

Cross posting this to users since it is a great user question.

The answer is: absolutely.

So, a couple of details to iron out to get started. I'll ask the
questions and explain why. First, some background:
- Kafka ideally wants the small events themselves.
- HDFS wants those events bundled together, typically at whatever
block size you have in HDFS. (A rough flow sketch follows below.)
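
To make that concrete before the questions, one common shape for the flow
(purely a sketch; the right processors depend on your answers below) is:

    source --> MergeContent (bin events up toward the HDFS block size) --> PutHDFS
    source --> SplitText (break bundles down to individual events) --> PutKafka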

The questions:
- Where is this 1TB dataset living today?  This will help determine the
best way to pull the dataset in.

- What is the current nature of the dataset?  Is it already in large
bundles as files, or is it a series of tiny messages?  Does it
need to be split, merged, etc.?

- What is the format of the data?  Is it something that can easily be
split/merged, or will it require special processing to do so?

These are good to start with.

Thanks
Joe


Reply | Threaded
Open this post in threaded view
|

RE: Fetch change list

Mark Payne
In reply to this post by anup s
Anup,

I have created a ticket for two new processors, ListFile and FetchFile. These should provide a much nicer user experience for what you're trying to do here.

The ticket is NIFI-631:  https://issues.apache.org/jira/browse/NIFI-631

Thanks
-Mark
