Thursday, November 15, 2007

A gen_server for mass regexp computing... (LibTre)

This is the first test session of my 'tregex_srv', a gen_server that provides some nice regexp features:

266> l(tregex_srv).
{module,tregex_srv}
267> tregex_srv:start_link().
{ok,<0.4045.0>}
268> tregex_srv:store( [<<"[0-9+] pid">>, <<"[a-z]+.tmp">>]).
ok
269> tregex_srv:grep(<<"test 9405904.tmp acuu.tmpmulaor 10+ pid">>).
[[[{34,39,<<"+ pid">>}],[{17,25,<<"acuu.tmp">>}]]]
270> tregex_srv:store( [{ <<"test">>, fun(X) -> io:format("found: ~p~n", [X]) end}, <<"[0-9][0-9]">>]).
ok
271> tregex_srv:grep(<<"test 9405904.tmp acuu.tmpmulaor 10+ pid">>).
found: [{0,4,<<"test">>}]
[[[{34,39,<<"+ pid">>}],[{17,25,<<"acuu.tmp">>}]],
[[{0,4,<<"test">>}],[{5,7,<<"94">>}]]]
272> tregex_srv:store( [{ <<"SRC=[^ ]+">>, fun(X) ->
[{_,_,M}] = X, io:format("Source: ~p~n", [M])
end}]).
ok
273> tregex_srv:grep(<<"test 9405904.tmp acuu.tmpmulaor 10+ pid">>).
found: [{0,4,<<"test">>}]
[[[{34,39,<<"+ pid">>}],[{17,25,<<"acuu.tmp">>}]],
[[{0,4,<<"test">>}],[{5,7,<<"94">>}]],
[]]
274> tregex_srv:grep(<<"tst SRC=192.135.15.1 pid">>).
Source: <<"SRC=192.135.15.1">>
[[[{19,24,<<"1 pid">>}]],
[[{8,10,<<"19">>}]],
[[{4,20,<<"SRC=192.135.15.1">>}]]]
275> tregex_srv:store( [{ <<"SRC=([^ ]+)">>, fun(X) ->
[{_,_,_}, {_,_,M}] = X, io:format("Source IP: ~p~n", [M])
end}]).
ok
276> tregex_srv:grep(<<"tst SRC=192.135.15.1 pid">>).
Source IP: <<"192.135.15.1">>
Source: <<"SRC=192.135.15.1">>
[[[{19,24,<<"1 pid">>}]],
[[{8,10,<<"19">>}]],
[[{4,20,<<"SRC=192.135.15.1">>}]],
[[{4,20,<<"SRC=192.135.15.1">>},{8,20,<<"192.135.15.1">>}]]]

As you can see, you can associate Funs with regexp matches, which means that you can bind an action to a regexp.
First we store new {RE, Fun} tuples (in fact, the regexps are added to the existing regexp list):

275> tregex_srv:store( [{ <<"SRC=([^ ]+)">>, fun(X) ->
[{_,_,_}, {_,_,M}] = X, io:format("Source IP: ~p~n", [M])
end}]).
ok

Now grep still calls the funs that were already registered, but it also calls the new one since our regexp matches. You can see that the new fun prints only the IP address, so the "submatches" feature works as expected:

276> tregex_srv:grep(<<"tst SRC=192.135.15.1 pid">>).
Source IP: <<"192.135.15.1">>
Source: <<"SRC=192.135.15.1">>
...
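
The fun above binds its argument with a strict pattern, and 'exec' calls it inside the gen_server, so a match list of another shape would raise a badmatch in the server process. A minimal sketch of a more tolerant extractor, assuming the match list has the shape shown in the session ([{Start, End, Whole}, {Start, End, Sub}, ...]); the module and function names are hypothetical:

```erlang
-module(submatch_sketch).
-export([source_ip/1]).

%% Hypothetical helper, not part of tregex_srv.
%% Takes a match list as printed by grep above and returns the first
%% submatch, or 'error' when the list does not have that shape.
source_ip([{_, _, _Whole}, {_, _, Ip} | _]) ->
    {ok, Ip};
source_ip(_Other) ->
    error.
```

A stored fun could then call submatch_sketch:source_ip(X) and only print when it gets {ok, Ip} back.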

The gen_server state is the following:

-record(state, {
    requests,
    reindex,
    re = [],
    pids = []
}).

Its init function is:

init(_Args) ->
    process_flag(trap_exit, true),
    {ok, #state{
        re = ets:new(?MODULE, [set, private]),
        requests = 0,
        reindex = 1 }}.

Internally the module calls 'treregex:compile' to compile each regexp and collects the resulting #port values in a list, which is then stored in the 'ets' table. Every call to 'tregex_srv:store' creates a new entry in the ets table.

%% Storing REs and Funs.
%% A regexp given without a Fun gets a default fun that does nothing.
%%
store([], Res, State) ->
    ets:insert(State#state.re, {State#state.reindex, Res});
store([{Regexp, Fun} | List], Res, State) ->
    {ok, Re} = treregex:compile(iolist_to_binary(Regexp), [extended]),
    store(List, [{Re, Fun} | Res], State);
store([Regexp | List], Res, State) ->
    {ok, Re} = treregex:compile(iolist_to_binary(Regexp), [extended]),
    store(List, [{Re, fun(_) -> false end} | Res], State).

'tregex_srv:grep' just uses 'ets:foldl' to compute the results:

handle_call({grep, Line}, _From, State) ->
    Requests = State#state.requests,
    %% Each ets entry is {Reindex, ReList}; run every list against Line.
    Grep = fun({_Reindex, ReList}, Acc) ->
                   [exec(ReList, Line, []) | Acc]
           end,
    {reply, ets:foldl(Grep, [], State#state.re), State#state{requests = Requests + 1}}.

%% exec, using a List of {Re, Funs}
exec([], _Line, Acc) ->
    Acc;
exec([{Re, Fun} | ReList], Line, Acc) ->
    case treregex:exec(Re, Line) of
        {ok, Matches} ->
            Fun(Matches),
            exec(ReList, Line, [Matches | Acc]);

        {error, nomatch} ->
            exec(ReList, Line, Acc)
    end;
exec([_Any | ReList], Line, Acc) ->
    exec(ReList, Line, Acc).
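
To try the same store/exec loop without the LibTre binding, here is a minimal sketch that swaps in the standard 're' module for 'treregex'. This is a stand-in, not the code used above: 're' uses PCRE syntax rather than POSIX extended, so some patterns may behave differently, and the module and function names are hypothetical.

```erlang
-module(re_exec_sketch).
-export([compile_all/1, exec/3]).

%% Compile a list of {Regexp, Fun} tuples or bare regexps with the
%% standard 're' module (a stand-in for treregex, which is not used here).
compile_all(Specs) ->
    [compile_one(Spec) || Spec <- Specs].

compile_one({Regexp, Fun}) when is_function(Fun, 1) ->
    {ok, Re} = re:compile(Regexp),
    {Re, Fun};
compile_one(Regexp) ->
    {ok, Re} = re:compile(Regexp),
    %% Default fun that does nothing, as in store/3 above.
    {Re, fun(_) -> false end}.

%% Same shape as exec/3 above: run every regexp on Line (a binary),
%% call the fun on a match, and accumulate the match lists.
exec([], _Line, Acc) ->
    Acc;
exec([{Re, Fun} | Rest], Line, Acc) ->
    case re:run(Line, Re, [{capture, all, index}]) of
        {match, Indexes} ->
            %% re reports {Start, Length}; convert to {Start, End, Binary}.
            %% Unset groups are reported as {-1, 0} and are skipped.
            Matches = [{Start, Start + Len, binary:part(Line, Start, Len)}
                       || {Start, Len} <- Indexes, Start >= 0],
            Fun(Matches),
            exec(Rest, Line, [Matches | Acc]);
        nomatch ->
            exec(Rest, Line, Acc)
    end.
```

Calling re_exec_sketch:exec(re_exec_sketch:compile_all(Specs), Line, []) plays the role of one ets entry in the fold above.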

The code is still young, but seems to work.

The main purpose here is to be able to process log lines in bulk. I want to be able to
spawn multiple processes on multiple nodes that will extract valuable content from
various lines. This is the first step forward :-)

I may clean up the 'grep' fun, since it returns an empty list for every stored regexp list that didn't match anything in the supplied line...
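
One possible cleanup, as a minimal sketch: skip the empty lists while folding. The helper name 'collect' is hypothetical, and it assumes the per-entry result is the list returned by exec/3.

```erlang
-module(grep_cleanup_sketch).
-export([collect/2]).

%% Hypothetical helper for the Grep fun in handle_call.
%% Matches is what exec/3 returned for one ets entry; Results is the
%% fold accumulator. Entries without any match are left out.
collect([], Results) ->
    Results;
collect(Matches, Results) ->
    [Matches | Results].
```

Inside the Grep fun, the body would become collect(exec(ReList, Line, []), Acc). The trade-off is that the result no longer has one slot per 'store' call, so a caller cannot tell which entry produced which matches.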

I'm really excited to think that I will be able to use the 'gen_server:multi_call' with this module :)
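
A minimal sketch of what that could look like, assuming every node runs a tregex_srv that is registered locally under the name 'tregex_srv' (registration is not shown in this post, so this is an assumption); the module and function names are hypothetical:

```erlang
-module(tregex_multi_sketch).
-export([grep_all/2]).

%% Assumption: each node in Nodes has a gen_server registered locally
%% as tregex_srv that answers {grep, Line} calls.
grep_all(Nodes, Line) ->
    %% multi_call sends the same request to the named server on every node.
    {Replies, BadNodes} = gen_server:multi_call(Nodes, tregex_srv, {grep, Line}),
    %% Replies is a list of {Node, Reply}; BadNodes lists the nodes that
    %% did not answer (node down or no such server).
    {Replies, BadNodes}.
```

Since the stored funs run inside each server, their side effects (such as io:format) would be triggered by the server on every node that has a match.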
