Wednesday, October 10, 2007

LibTRE for fast regular expressions

After reading the almost-famous article about regular expressions, I decided to try TRE.
TRE is POSIX compliant, but unfortunately it doesn't have an Erlang driver, so I downloaded posregex and hacked it a little to make it use TRE.

Here are a few experiments with it:

22> Init = fun() -> erl_ddll:load_driver(code:priv_dir(treregex)++"/bin", "TRE_drv") end.
#Fun
23> Init().
ok
24> f(List), List = treutils:build([<<"test">>, <<"toto">>, <<"[a-z][0-9]$">>, <<"^[a-zA-Z][a-zA-Z0-9_]+">>]).
[{<<"test">>,#Port<0.84>},
{<<"toto">>,#Port<0.85>},
{<<"[a-z][0-9]$">>,#Port<0.86>},
{<<"^[a-zA-Z][a-zA-Z0-9_]+">>,#Port<0.87>}]

25> treutils:exec(List, <<"alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksd
jlkgjsdlkfgjlsdk fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test asl;dksdf">>).
{<<"alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test a"...>>,

[{ok,<<"^[a-zA-Z][a-zA-Z0-9_]+">>},{ok,<<"test">>}]}

26> treutils:exec(List, <<"10alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test asl;dksdf">>).
{<<"10alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test"...>>,

[{ok,<<"test">>}]}


I build a list of ports, each one holding a compiled regular expression, then I iterate through the list and match every expression against the line.

Here's the treutils module (by the way, you can replace treregex with posregex if you want):

-module(treutils).
-export([build/1, exec/2, destroy/1]).

build(List) ->
    Fun = fun(X) ->
                  {ok, RE} = treregex:compile(X, [extended]),
                  {X, RE}
          end,
    lists:map(Fun, List).

destroy(List) ->
    lists:foreach(fun({_Name, RE}) -> treregex:free(RE) end, List).

exec(List, Line) ->
    exec(List, Line, []).

exec([], Line, Acc) ->
    {Line, Acc};
exec([H|Rest], Line, Acc) ->
    {Name, RE} = H,
    case treregex:match(RE, Line) of
        ok ->
            exec(Rest, Line, [{ok, Name} | Acc]);

        {error, nomatch} ->
            exec(Rest, Line, Acc)
    end.
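
For comparison, here is a rough sketch of the same build/exec/destroy idea written directly in C against the plain POSIX regex.h interface, which is the interface TRE aims to be compatible with. The names Pattern, build_patterns, match_all and destroy_patterns are made up for this illustration; they are not part of TRE, posregex or treregex. Instead of returning the pattern names, the sketch records the indices of the patterns that matched.

```c
#include <assert.h>
#include <regex.h>   /* POSIX regcomp/regexec/regfree */
#include <stddef.h>

/* Hypothetical type: a compiled pattern plus the text it was built from. */
typedef struct {
    const char *source;
    regex_t     re;
} Pattern;

/* Compile n sources as POSIX extended regexes (like [extended] above).
 * Returns how many were compiled; stops at the first failure. */
size_t build_patterns(Pattern *out, const char *const *sources, size_t n)
{
    size_t i;
    assert(out != NULL && sources != NULL);
    for (i = 0; i < n; i++) {
        out[i].source = sources[i];
        /* REG_NOSUB: we only want a yes/no answer, no match offsets. */
        if (regcomp(&out[i].re, sources[i], REG_EXTENDED | REG_NOSUB) != 0)
            break;
    }
    return i;
}

/* Try every pattern against line. The indices of the patterns that
 * matched are written to hits[]; the return value is their count. */
size_t match_all(const Pattern *pats, size_t n, const char *line, size_t *hits)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        if (regexec(&pats[i].re, line, 0, NULL, 0) == 0)
            hits[count++] = i;
    }
    return count;
}

/* Release the first n compiled patterns. */
void destroy_patterns(Pattern *pats, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        regfree(&pats[i].re);
}
```

Called with the four patterns from the shell session, match_all should report the first and last pattern for a line that starts with a letter and contains "test", and only the first one when the line starts with a digit.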


Other libraries are also available.

Unfortunately, I can't make this module rock solid yet: it segfaults as soon as I call 'exec' a second time. I think the problem is in TRE_drv.c (heavily copied from RE_drv.c), in the 'RE_from_erlang' function. The regexec call may be corrupting some of its internal state...

typedef struct _desc {
    ErlDrvPort port;
    ErlDrvTermData dport; /* the port identifier as ErlDrvTermData */
    regex_t re;
    regmatch_t pm[16];
    int compiled;
} Desc;

... snip ...

static void RE_from_erlang(ErlDrvData drv_data, char *buf, int len)
{
    int status;
    unsigned int op = get_int32(buf);
    unsigned int flags = get_int32(buf+4);
    Desc *d = (Desc*) drv_data;

    switch(op) {

    ... snip ...

    case EXEC:
        status = regexec(&d->re, buf+8, (size_t) 16, &d->pm[0], flags);
        if (status != 0) {
            driver_send_status(d, status);
            return;
        }
        driver_send_pm(d);
        break;
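
One thing that might be worth checking, and this is only a guess that I have not verified: regexec expects a NUL-terminated string, while the buffer handed to a driver callback comes with an explicit length. Unless the Erlang side already appends a trailing 0 byte, regexec could read past the end of buf+8. The sketch below shows a way to rule that out by copying the payload into a terminated buffer first. The helper name regexec_len is hypothetical and belongs to neither TRE nor posregex.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run regexec() on a length-delimited buffer by
 * copying it into a NUL-terminated string first.
 * Returns the regexec() status, or REG_ESPACE if the copy cannot be
 * allocated. A 0 byte inside buf still ends the subject string early. */
int regexec_len(const regex_t *re, const char *buf, size_t len,
                size_t nmatch, regmatch_t *pm, int eflags)
{
    char *copy;
    int status;

    assert(re != NULL);
    copy = malloc(len + 1);
    if (copy == NULL)
        return REG_ESPACE;
    if (len > 0)
        memcpy(copy, buf, len);
    copy[len] = '\0';   /* the terminator regexec() relies on */

    status = regexec(re, copy, nmatch, pm, eflags);
    free(copy);
    return status;
}
```

In the EXEC branch this would presumably be called with buf+8 and len-8 in place of the bare regexec call. If I read the TRE documentation correctly, it also lists length-taking variants of the matching functions, which might avoid the copy altogether.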


I hope to come back soon with good news, since the TRE library looks very, very promising.
If anyone has a clue ;p, please comment!
