How to count the number of occurrences of a certain string in file

How to count the number of occurrences of a certain string in file

tzhu
Hi,

I have a text log file. I want to count the number of occurrences of certain
strings in the file and store the counts in a SQL table. For example, I want to
count the number of times the string "Established connection to server" appears
in the file as TOTAL_CONNECTIONS, and the number of times "Lost connection to
server" appears in the same file as TOTAL_DISCONNECTION.

I don't know how I should count the strings in this case. I found this blog
post, which I might be able to use as a template:
https://www.batchiq.com/database-injest-with-nifi.html
Maybe I can modify the flow it describes?

Any suggestions would be appreciated.

Thanks,

Tina

---------------------
Update: The solution turned out to be simple. I want to list some key points here for anyone who might have a similar task in the future.

The processors I used are "TailFile", "ExecuteScript", and "PutSQL". The hard part was writing the custom script, and I found the ExecuteScript Cookbook by Matt Burgess useful. I used the recipe for "Overwrite an incoming flow file with updated content using a callback". For Python it says "implicit return at the end"; I just wrote "session.transfer(flowFile, REL_SUCCESS)" at the end.
To write to the SQL table, I just set the output text in ExecuteScript to be a SQL query ("Insert into [table name] values (%d, %d)").
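
The last step above, turning the two counts into the text that PutSQL executes, could be sketched roughly as follows in plain Python. This is only an illustration, not the script from the post: the function name `build_insert_sql`, the table name `connection_stats`, and the column order (connections first, then disconnections) are all assumptions.

```python
def build_insert_sql(table, total_connections, total_disconnections):
    """Return an INSERT statement carrying the two counts as integer literals."""
    # int() raises on non-numeric input such as arbitrary text, so only
    # numbers are ever formatted into the statement.
    return "INSERT INTO %s VALUES (%d, %d)" % (
        table, int(total_connections), int(total_disconnections))


# Hypothetical table name and counts, for illustration only.
sql = build_insert_sql("connection_stats", 12, 3)
print(sql)
```

Inside ExecuteScript, a string like this would be written out as the new FlowFile content, so that PutSQL receives one statement per FlowFile.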

I hope anyone who reads this in the future finds it helpful. And thanks to Mark Payne for replying and suggesting things to try!


--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: How to count the number of occurrences of a certain string in file

Mark Payne
Tina,

I don't believe there are any processors right now that will count the number of occurrences
of some string in a FlowFile. I would recommend either using an ExecuteScript processor and
scripting it out in Groovy/Jython, or if you're comfortable enough with Java and want to get your
hands a bit dirtier, you could actually update the ScanContent processor to optionally count
the number of occurrences of each term in the dictionary, rather than simply routing to 'matched'
or 'unmatched'. Then we could have ScanContent provide attributes such as <Search Term> = <Occurrences>.
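
As a rough illustration of the "<Search Term> = <Occurrences>" idea, the counting itself might look like the plain-Python sketch below. It is not ScanContent code and not NiFi API; `count_terms` and the sample log lines are made up for the example, and the counts are non-overlapping occurrences as returned by `str.count`.

```python
def count_terms(text, terms):
    """Map each search term to its number of non-overlapping occurrences in text."""
    # Attribute values in NiFi are strings, so the counts are stringified here.
    return {term: str(text.count(term)) for term in terms}


# Made-up sample content, for illustration only.
sample = (
    "10:00 Established connection to server\n"
    "10:05 Lost connection to server\n"
    "10:06 Established connection to server\n"
)
attributes = count_terms(
    sample, ["Established connection to server", "Lost connection to server"])
print(attributes)
```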

Thanks
-Mark

> On Nov 6, 2017, at 3:58 PM, tzhu <[hidden email]> wrote:


Re: How to count the number of occurrences of a certain string in file

tzhu
Hi Mark,

I am confused about the whole process. I have the following questions:

1. From what I read, I can use TailFile to read the log file. However, it
would only read the file once (as the input file does not change). Is there
a way to have it read the file again every time it gets started?

2. As you suggested, I am writing my own Python script to handle the
count. Most of the examples online are in Jython. (My NiFi version is 1.3.0.
I suppose Jython is similar to Python, correct?)
I found
https://community.hortonworks.com/articles/75032/executescript-cookbook-part-1.html
to be a useful guide, but I don't understand which recipe to choose. What are
the input and output of the script? I want to read the file line by line and
count the string occurrences. I'm currently using  key,value =
flowFile.getAttributes().iteritems() to get the file content, but it fails with
"too many values to unpack". (The original file is 41.14MB.)
For the output, the common way seems to be using a callback. Is that necessary
in my case? Or can I just add the counts as attributes on the output FlowFile
and extract the attributes later?

3. To write the columns into the SQL table, the common way seems to be to use
"ReplaceText" and "PutSQL". I also noticed there's a processor called
PutDatabaseRecord that might combine the functions of ExecuteScript and
PutSQL. Since my Python script doesn't work so far, I can't really
test the result. But is this an easier approach?
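
On the "too many values to unpack" error in question 2: it most likely comes from the assignment itself rather than from the file size. The sketch below reproduces it with an ordinary dict in plain Python 3, where `items()` plays the role that `iteritems()` plays in Jython 2.7. The attribute names and values are invented for the example.

```python
# Invented attribute map standing in for a FlowFile's attributes.
attributes = {"filename": "server.log", "path": "/var/log/", "uuid": "1234"}

# Two names on the left demand exactly two items on the right, but the
# dict yields three (key, value) pairs, hence the unpacking error.
try:
    key, value = attributes.items()
    error = None
except ValueError as exc:
    error = str(exc)

# Looping unpacks one (key, value) pair per iteration instead.
pairs = sorted((key, value) for key, value in attributes.items())
print(error)
print(pairs)
```

Either way, the attribute map only holds metadata such as the file name; the log lines themselves are in the FlowFile content.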

Any help is appreciated...

Thanks,
Tina




Re: How to count the number of occurrences of a certain string in file

tzhu
In reply to this post by Mark Payne
Hi Mark,

According to the Expression Language guide
<https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#count>,
the count can be done with:
${allMatchingAttributes("abc","xyz"):contains("world"):count()}

I'm wondering if I can use this in one of the processors, as a by-product
of UpdateAttribute or some other processor.

Thanks,

Tina




Re: How to count the number of occurrences of a certain string in file

Mark Payne
Tina,

In NiFi, a FlowFile is made up of two parts: Content (the payload, or bytes) and Attributes (metadata about the content).
When you are using the Expression Language, you are operating on FlowFile Attributes, not the content. So if you are
wanting to count the number of occurrences of some string in the content, the Expression Language will not help you.
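
The content/attribute split can be pictured with a toy model in plain Python. `ToyFlowFile` is a made-up stand-in, not the NiFi class, and the attribute check only loosely mimics what an attribute-based expression can see.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFlowFile:
    """Toy stand-in for a FlowFile; not the NiFi class."""
    content: bytes = b""
    attributes: dict = field(default_factory=dict)


ff = ToyFlowFile(
    content=b"Lost connection to server\nLost connection to server\n",
    attributes={"filename": "server.log"},
)

phrase = "Lost connection to server"
# What an attribute-only check can see: the metadata values.
in_attributes = sum(phrase in value for value in ff.attributes.values())
# What a script reading the payload can see: the bytes themselves.
in_content = ff.content.decode("utf-8").count(phrase)
print(in_attributes, in_content)
```

The phrase appears twice in the payload but in none of the attributes, which is why an attribute-level count comes back empty for this use case.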

This is why I was suggesting using the ExecuteScript processor. You can have a script that reads the content from
StdIn (this will provide you the content of the FlowFile) and count the number of occurrences of each word/phrase of
interest. Once you have counted the number of occurrences, you can add those to the FlowFile as attributes.
Or, alternatively, you could write out the data to the contents of the FlowFile, from within your script.

Thanks
-Mark


> On Nov 9, 2017, at 4:05 PM, tzhu <[hidden email]> wrote:


Re: How to count the number of occurrences of a certain string in file

tzhu
Hi Mark,

This is the script I'm using currently:

import sys
TOTAL_DISCONNECTIONS = 0
TOTAL_CONNECTIONS = 0
flowFile = sys.stdin
if (flowFile != None):
    for line in flowFile:
        if "Lost connection to server" in line:
            TOTAL_DISCONNECTIONS += 1
        if "Established connection to server" in line:
             TOTAL_CONNECTIONS += 1
attrMap = {"TOTAL_DISCONNECTIONS":TOTAL_DISCONNECTIONS,
           "TOTAL_CONNECTIONS":TOTAL_CONNECTIONS}  
flowFile = session.putAllAttributes(flowFile, attrMap)



Does it make sense to you? The error message says the "filereader" object is
not iterable. So what is in flowFile now? How should I access the content of
the flowFile?

Thanks,

Tina




Re: How to count the number of occurrences of a certain string in file

Mark Payne
Tina,

My apologies - I was conflating ExecuteScript with ExecuteStreamCommand.
In ExecuteScript, the contents of the FlowFile are not read from StdIn; rather,
you'd want to use the session to read the contents. So in Groovy we'd do something like:

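import org.apache.nifi.processors.script.ExecuteScript

def flowFile = session.get()
if (flowFile == null) {
    return  // nothing queued for this trigger
}

def totalDisconnections = 0
def totalConnections = 0

// 'in' is a reserved word in Groovy, so the stream gets a different name
def inputStream = session.read(flowFile)
try {
    // Here, 'inputStream' is the InputStream that contains the data
    inputStream.eachLine('UTF-8') { line ->
        if (line.contains('Lost connection to server')) totalDisconnections++
        if (line.contains('Established connection to server')) totalConnections++
    }
} finally {
    inputStream.close()
}

// putAttribute takes the FlowFile first, and attribute values must be Strings
flowFile = session.putAttribute(flowFile, "TOTAL_DISCONNECTIONS", String.valueOf(totalDisconnections))
flowFile = session.putAttribute(flowFile, "TOTAL_CONNECTIONS", String.valueOf(totalConnections))
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)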

Sorry, I am not familiar enough with Python to provide any sort of meaningful suggestions on what
to do there.
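
For readers following along in Python, the counting step on its own can be sketched in plain Python as below. This is not Jython/NiFi code: it assumes the content is already available as an ordinary binary file-like object, and the function name `count_connection_events` is made up. Adapting the Java InputStream that ExecuteScript hands to a callback is not covered here.

```python
import io


def count_connection_events(stream, encoding="utf-8"):
    """Count lines mentioning each phrase, reading one line at a time."""
    total_connections = 0
    total_disconnections = 0
    # TextIOWrapper decodes lazily, so a large log is never held in memory whole.
    for line in io.TextIOWrapper(stream, encoding=encoding):
        if "Established connection to server" in line:
            total_connections += 1
        if "Lost connection to server" in line:
            total_disconnections += 1
    return total_connections, total_disconnections


# Made-up log content, for illustration only.
log = io.BytesIO(
    b"Established connection to server\n"
    b"Lost connection to server\n"
    b"Established connection to server\n"
)
print(count_connection_events(log))
```

Like the script earlier in the thread, this counts matching lines, so a phrase that appears twice on one line is counted once.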

Thanks
-Mark



On Nov 10, 2017, at 4:01 PM, tzhu <[hidden email]<mailto:[hidden email]>> wrote:
