Monday, October 22, 2007

LibTRE returning matching values

New version of libTRE driver, now exec returns also matching binaries:

Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.5.5 (abort with ^G)
1> erl_ddll:load_driver(code:priv_dir(treregex)++"/bin", "TRE_drv").
ok
2> {ok, RE} = treregex:compile(<<"([a-z]+)([0-9]+)">>, [extended]).
{ok,#Port<0.79>}
3> treregex:exec(RE, <<"this is a test9234 of blast">>).
{ok,[{10,18,<<"test9234">>},{10,14,<<"test">>},{14,18,<<"9234">>}]}
4>
4> treregex:exec(RE, <<"this is arolpghin39235 test9234 of blast">>).
{ok,[{8,22,<<"arolpghin39235">>},{8,17,<<"arolpghin">>},{17,22,<<"39235">>}]}
5>

Soon to be released !

Wednesday, October 17, 2007

LibTRE in Action, the approximative match

Here's a sample session showing some of the libTre features:

Erlang (BEAM) emulator version 5.5.5 [source] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.5.5 (abort with ^G)
1> erl_ddll:load_driver(code:priv_dir(treregex)++"/bin", "TRE_drv").
ok
2> f(RE), {ok, RE} = treregex:compile(<<"fear">>, [extended]).
{ok,#Port<0.74>}
3> treregex:approx(RE, <<"fir">>, 0, []).
{error,nomatch}
4> treregex:approx(RE, <<"fir">>, 1, []).
{error,nomatch}
5> treregex:approx(RE, <<"fir">>, 2, []).
{ok,[{0,3}]}
6> treregex:approx(RE, <<"fir">>, 1, []).
{error,nomatch}
7> treregex:exec(RE, <<"fir">>).
{error,nomatch}
8> treregex:free(RE).
ok
9> f(RE), {ok, RE} = treregex:compile(<<"loubov">>, [extended]).
{ok,#Port<0.76>}
10> treregex:approx(RE, <<"love">>, 3, []).
{ok,[{0,3}]}
11> treregex:approx(RE, <<"to love">>, 0, []).
{error,nomatch}
12> treregex:approx(RE, <<"to love">>, 1, []).
{error,nomatch}
13> treregex:approx(RE, <<"to love">>, 2, []).
{error,nomatch}
14> treregex:approx(RE, <<"to love">>, 3, []).
{ok,[{0,6}]}
15> treregex:approx(RE, <<"aimer">>, 3, []).
{error,nomatch}
16> treregex:approx(RE, <<"aimour">>, 3, []).
{error,nomatch}
17> treregex:approx(RE, <<"amour">>, 3, []).
{error,nomatch}
18> treregex:approx(RE, <<"amour">>, 10, []).
{ok,[{1,6}]}

LibTRE, I finally got it working !

Hi !
I'm please to say that I finally managed the 'TRE_drv.c' code to work as expected. This time there's no more 'segfault'. The solution is to use only non conflicting function names:
  • regncomp and not regcomp
  • regnexec and not regexec
  • tree_free and not regfree
Since every posix Regex function will be used instead of their Tre equivalent, I was forced do find a special case for 'regfree'. There's no special regX function for 'regfree'.
Because regfree can be called at many places, when an erlang error occurs, or simply when we quit the shell; the function is trying to free the posix regex_t with a pointer to a TRE regex_t, so the crash is always there.
While looking at 'tre-internal.h' I've found the 'tre_free' function that will do the real job of freeing the TRE regex_t... So I've just declared as 'extern' this function directly into the TRE_drv.c file...
extern void tree_free(regex_t *preg);
static void tre_stop(ErlDrvData drv_data)
{
struct driver *d = (struct driver*) drv_data;

// if (d->compiled)
// regfree(&(d->re));

if (d->compiled)
tre_free(&d->re);

driver_free(d);
}


This time I've also added a call to the 'reganexec' function that's able to find approximate matches. The code is of course located into the 'tre_from_erlang' function. (this function gets called whenever erlang talks to the driver).

... snip ...
case APPROX:

/* <> */

flags = get_int32(buf + 4);
cost = get_int32(buf + 8);

memset(&amatches, 0, sizeof(amatches));
amatches.pmatch = matches;
amatches.nmatch = MAX_MATCHES;

regaparams_default(®aparams);
regaparams.max_cost = cost;


status = reganexec(&d->re, buf + 12, len - 12, &amatches, regaparams, flags);
if (status != 0) {
driver_send_status(d, status);
return;
}

driver_send_matches(d, matches, MAX_MATCHES);
break;
... snip ...


With this you're able to make things like this:

{ok, RE} = treregex:compile(<<"test">>, [extended]).
treregex:approx(RE, <<"this is a tast">>, 0, []).

the 'fun approx/4' has the 'cost' (an integer) as extra argument, this correspond to the cost of manipulation characters to find a match. You'll find more about this here.

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be
close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure
(also known as the Levenshtein distance) where characters can be inserted, deleted, or
substituted in the searched text in order to get an exact match. Each insertion, deletion, or
substitution adds the distance, or cost, of the match. TRE can report the matches which have a
cost lower than some given threshold value. TRE can also be used to search for matches with the
lowest cost.


Finally, now that the driver and the erlang module works well I'll upload it to various repositories...

Saturday, October 13, 2007

LibTre and Posix, the segfault explanation

You already know that I want to make a libTRE driver such as the RE posix driver... But what you may not know is that I've never lost this amount of time debugging things...
I've, for a long long time ago, called 'gdb' my friend to rescue me ...
Let me explain the problem: we want to call 'regcomp' and 'regexec' from the libTRE package. Take your time and reread this sentence:

We want to call 'regcomp' and 'regexec' from the libTRE package.

This seems really easy, but we want to make this from a shared library... A shared library that we will open at runtime.
And lauching our erlang shell make the ld process finds symbols addresses in various .so files for us automagically...

Since I'm using ubuntu, the posix regexp are located in the libc.so.6 library, so ld knows about regcomp and regexec before I can even call 'erl_ddll:load_driver/2'...

Loading the driver into the erlang vm from my freshly built .so file, don't really work as expected... 'regcomp' is still the one from the glibc, and not the one from the libTRE driver...

But 'regex_t' is the one from tre, and 'regcomp' is the one from posix... There definition of the regex_t struct isn't the same :p

The driver structure:

typedef struct _desc {
ErlDrvPort port;
ErlDrvTermData dport; /* the port identifier as ErlDrvTermData */
regex_t re;
regmatch_t pm[16];
int compiled;
} Desc;

Now the 'sizeof(regex_t)' from Tre is 8 and the posix is a lot more ... So when I call the 'regcomp' function like this:

switch(op) {
case COMPILE:
... snip ...
status = regcomp(&d->re, buf+8, flags);
d->compiled = 1;
... snip ...


The 'regmatch_t' is exactly overflowed by exactly
( sizeof (posix regex_t) - (2 * sizeof (tre regex_t)) ) 
bytes ...

This is the reason why I get a segfault (from my last post) when I try to call 'regexec'...

I'm now rewriting a simpler interface to libtre that will not have names that can collide with posix...

Wednesday, October 10, 2007

LibTre for Fast regular expression

While reading the almost famous article about regular expressions, I tried to use TRE.
Since TRE is posix Compliant, but unfortunately don't have any erlang driver I downloaded ''posregex'' and hack it a little to make it use TRE.

Now here's some experiment with it:

22> Init = fun() -> erl_ddll:load_driver(code:priv_dir(treregex)++"/bin", "TRE_drv") end.
#Fun
23> Init().
ok
24> f(List), List = treutils:build([<<"test">>, <<"toto">>, <<"[a-z][0-9]$">>, <<"^[a-zA-Z][a-zA-Z0-9_]+">>]).
[{<<"test">>,#Port<0.84>},
{<<"toto">>,#Port<0.85>},
{<<"[a-z][0-9]$">>,#Port<0.86>},
{<<"^[a-zA-Z][a-zA-Z0-9_]+">>,#Port<0.87>}]

25> treutils:exec(List, <<"alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksd
jlkgjsdlkfgjlsdk fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test asl;dksdf">>).
{<<"alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test a"...>>,

[{ok,<<"^[a-zA-Z][a-zA-Z0-9_]+">>},{ok,<<"test">>}]}

26> treutils:exec(List, <<"10alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test asl;dksdf">>).
{<<"10alkjlskdjflskjglksjflakgjlkfgjl;dkgjklsdjglkdsjfglksdjlkgjsdlkfgjlsdk
fg989t9sgdkgj lkyrdy sjd gyrdsl;gkj test"...>>,

[{ok,<<"test">>}]}


I build a List of Port that are compiled regular expressions, then I iterate thru the list matching "Line".

Here's the TreUtils module (BTW, you can replace treregex with posregex if you want...)

-module(treutils).
-export([build/1, exec/2, destroy/1]).

build(List) ->
Fun = fun(X) ->
{ok, RE} = treregex:compile(X, [extended]),
{X, RE}
end,
lists:map(Fun, List).

destroy(List) ->
lists:foreach( fun({_Name, RE}) -> treregex:free(RE) end, List).

exec(List, Line) ->
exec(List, Line, []).

exec([], Line, Acc) ->
{Line, Acc};
exec([H|Rest], Line, Acc) ->
{Name, RE} = H,
case treregex:match(RE, Line) of
ok ->
exec(Rest, Line, [ {ok, Name} | Acc ]);

{error, nomatch} ->
exec(Rest, Line, Acc)
end.


Other libraries are also available.

Unfortunately I'm unable to make this module rock solid since it segfault if I ever call the 'exec' method two times... I think that there's a problem in the TRV_drv.c (heavily copied from the RE_drv.c) in the 'RE_from_erlang' function. The regexec call may garbage some of its internal...

typedef struct _desc {
ErlDrvPort port;
ErlDrvTermData dport; /* the port identifier as ErlDrvTermData */
regex_t re;
regmatch_t pm[16];
int compiled;
} Desc;

... snip ...

static void RE_from_erlang(ErlDrvData drv_data, char *buf, int len)
{
int status;
unsigned int op = get_int32(buf);
unsigned int flags = get_int32(buf+4);
Desc *d = (Desc*) drv_data;

switch(op) {

... snip ...

case EXEC:
status = regexec(&d->re, buf+8, (size_t) 16, &d->pm[0], flags);
if (status != 0) {
driver_send_status(d, status);
return;
}
driver_send_pm(d);
break;


I hope I'll come back soon with good news, since this TRE library looks very very promising...
If anyone have any clue ;p, please comment !

Tuesday, October 2, 2007

High Order Functions must be tested before use

From my previous article comments it is a small and efficient method of filtering datas:

Cpu = fun({cpu, _, _}) -> true; (_) false end.

But whenever we want to transposing it into a HOF (high order function):

Filter = fun(Elem) ->
fun({Elem, _, _}) -> true; (_) false end
end.

This naive approach doesn't work:

1> Filter = fun(Elem) -> fun({Elem, _, _}) -> true; (_) -> false end end.
#Fun<erl_eval.6.49591080>
2> C = Filter(cpu).
#Fun<erl_eval.6.49591080>
3> C({test, t, t}).
true % this should have been false ...

As explained in this document, we need to use guards to make our high order function effective:

The rules for importing variables into a fun has the consequence that certain pattern matching
operations have to be moved into guard expressions and cannot be written in the head of the fun.

The correct way is:

Filter = fun(Elem) ->
fun({X, _, _}) when X == Elem -> true; (_) false end
end.

So we must keep in mind that sometimes we should really check that our HOF is working as expected !

Monday, October 1, 2007

High Order Functions, filtering lists...

I have a list of collected cpu and network values, from eth0 and eth1:
 L =
[{cpu,user,<<"3.05">>},
{cpu,nice,<<"0.00">>},
{cpu,system,<<"0.72">>},
{cpu,iowait,<<"0.03">>},
{cpu,steal,<<"0.00">>},
{cpu,idle,<<"96.20">>},
{eth0,rxpck,<<"2.52">>},
{eth0,txpck,<<"0.15">>},
{eth0,rxbyt,<<"173.80">>},
{eth0,txbyt,<<"44.68">>},
{eth0,rxcmp,<<"0.00">>},
{eth0,txcmp,<<"0.00">>},
{eth0,rxmcst,<<"1.25">>},
{eth1,rxpck,<<"0.00">>},
{eth1,txpck,<<"0.02">>},
{eth1,rxbyt,<<"0.00">>},
{eth1,txbyt,<<"1.00">>},
{eth1,rxcmp,<<"0.00">>},
{eth1,txcmp,<<"0.00">>},
{eth1,rxmcst,<<"0.00">>}]

If I want to manipulate such data set I'll need some filter functions that will help me to extract values, I need cpu values and eth0 values. High order function can do that for me !

First we need to select tuples, 'cpu' tuples:

Cpu = fun({X, Y, Z}) ->
if X == cpu ->
true;
true ->
false
end
end.


A better and far more erlangish method (thanks Zvi):

Cpu = fun({cpu,_,_}) -> true;
(_) -> false
end.


Okay this fun will return true whenever the first element of the tuple is 'cpu'.
But this fun is static, since 'cpu' is written in the function body. Let's make it dynamic:

Filter = fun(Motif) ->
fun({X, Y, Z}) ->
if X == Motif ->
true;
true ->
false
end
end
end.


The new High order function 'Filter' is generated by 'fun(Motif)' and takes as argument a tuple '{X, Y, Z}', this is a fun that return a fun...
This function can be used like this:

List = {cpu, test, dummy}. % a sample list
Cpu = Filter(cpu). % generate the Cpu fun
Cpu(List). %executing the Cpu fun
true.
List2 = {test, cpu, dummy}. %Other dummy list
Cpu(List2).
false.


Come back to our initial data set, and realize that we must iterate thru the list to extract possible values. Iterate and apply a fun to every element is what we need to do, futhermore we need to retrieve matching values... In fact we need to build the list of extracted values, and this can be accomplished by 'lists:foldl':

extractor(Motif) ->
fun(L) ->
lists:foldl(
fun({X, Y, Z}, List) ->
if X == Motif
-> [ {Y,Z} | List ];
true -> List
end
end,
[], L)
end.


In details, 'List' is the accumulator list, the one that will grow with valid tuple, the one we will return. The fun is the same as described before. To make things clear 'extractor/1' returns a fun that will parse a list of tuple extracting values that matches 'Motif'.

Another easier method, using only list comprehension:

extractor(Motif) ->
fun(L) when is_list(L) ->
[ {Y, Z} || {X, Y, Z} <- L, X == Motif]
end.

Usage:

38> Cpu = module:extractor(cpu).
#Fun<module.1.131615259>
39> Cpu(L).
[{idle,<<"96.20">>},
{steal,<<"0.00">>},
{iowait,<<"0.03">>},
{system,<<"0.72">>},
{nice,<<"0.00">>},
{user,<<"3.05">>}]
40> Eth0 = module:extractor(eth0).
41> Eth0(L).
[{rxmcst,<<"1.25">>},
{txcmp,<<"0.00">>},
{rxcmp,<<"0.00">>},
{txbyt,<<"44.68">>},
{rxbyt,<<"173.80">>},
{txpck,<<"0.15">>},
{rxpck,<<"2.52">>}]


I use such code for my monitoring project, I use the sar output (sysstat package), and I graph using rrdtool... This is a part of a bigger project that will compose a 'scheduler'.

Sticky