Wednesday, February 24, 2010

Filtering lines efficiently

Whenever you are dealing with log lines, or your program is filtering data, you always have to handle escaping efficiently.

While developing a log module using a gen_event, I needed to escape single quotes.
Sometimes those quotes were already escaped...

I found this regexp, which handles that case gracefully:
re:replace( Bin, "(?<!\\\\)'", "\\\\'", [ global ] ).

It is not so easy to read, and because of the various backslashes, this regexp needed some testing
before being fully usable.
Basically, this regexp only escapes a single quote when it is not already escaped: (?<!\\) is a negative lookbehind that rejects any quote preceded by a backslash.
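To make the behaviour concrete, here is a minimal sketch that wraps the regexp in a function. The module name quote_demo and the function name escape_quotes/1 are made up for this illustration, and the {return, binary} option is an addition of mine so that the result is a flat binary rather than iodata (which is what re:replace returns by default).

```erlang
-module(quote_demo).
-export([escape_quotes/1]).

%% Escape every single quote that is not already preceded by a backslash.
%% {return, binary} is an assumption added for this sketch; without it
%% re:replace/4 returns iodata.
escape_quotes(Bin) ->
    re:replace(Bin, "(?<!\\\\)'", "\\\\'", [global, {return, binary}]).
```

With this sketch, quote_demo:escape_quotes(<<"it's">>) should give <<"it\\'s">>, and passing that result through the function a second time should leave it unchanged.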

Now that I have my filter for quotes, I need a filter for newline characters.
Every time newlines end up in your log files, you can be sure that many tools already in your network will not handle them properly; worse, this may break everything after...

So I came across this regexp:
re:replace( Bin, "[\\n\\r]+", " ", [ global ] ).
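As a quick check of what this does, here is a small sketch in the same spirit. The module name newline_demo and the function name strip_newlines/1 are hypothetical, and {return, binary} is again my own addition to get a plain binary back.

```erlang
-module(newline_demo).
-export([strip_newlines/1]).

%% Replace each run of CR and/or LF characters with a single space.
%% {return, binary} is an assumption added for this sketch.
strip_newlines(Bin) ->
    re:replace(Bin, "[\\n\\r]+", " ", [global, {return, binary}]).
```

Because of the + quantifier, a Windows-style "\r\n" or several consecutive blank lines collapse into one space instead of several.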


Finally, having two filter functions for every line, I wanted to be able to add or remove functions easily.
You can reference functions with this notation:
fun filterquotes/1
Or a list of functions:
[ fun filterquotes/1, fun filternewline/1 ]

With this notation and the famous 'lists:foldl', I can filter a line with many functions quite nicely:
% Data is the accumulator: each filter's output becomes the next filter's input
        lists:foldl( fun( Fun, Data ) ->
                         Fun(Data) 
        end, Bin, [ fun filterquotes/1, fun filternewline/1 ]).
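The same fold can be pulled out into a small helper that takes the list of filters as an argument. This is only a sketch: the module filter_demo, the helper apply_filters/2 and the demo/0 function are hypothetical names, and the two stand-in filters (string:trim/1 and string:uppercase/1, available in OTP 20 and later) are there purely to show the order of application.

```erlang
-module(filter_demo).
-export([apply_filters/2, demo/0]).

%% Run Data through each one-argument fun in Filters, left to right.
%% The accumulator holds the output of the previous filter.
apply_filters(Data, Filters) ->
    lists:foldl(fun(Fun, Acc) -> Fun(Acc) end, Data, Filters).

%% Stand-in filters, used only for illustration: trim first, then uppercase.
demo() ->
    apply_filters("  hello  ", [fun string:trim/1, fun string:uppercase/1]).
```

Adding or removing a filter then only means editing the list; an empty list returns the data untouched.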


Here's the code:
filter( Bin ) when is_atom(Bin) ->
        Bin;
filter( Bin ) when is_integer(Bin) ->
        Bin;
filter( Bin ) ->
        lists:foldl( fun( Fun, Data ) ->
                         Fun(Data) 
        end, Bin, [ fun filterquotes/1, fun filternewline/1 ]).

filterquotes(Bin) ->
        re:replace( Bin, "(?<!\\\\)'", "\\\\'", [ global ] ).

filternewline(Bin) ->
        re:replace( Bin, "[\\n\\r]+", " ", [ global ] ).
