Wednesday, October 17, 2007

LibTRE, I finally got it working !

Hi !
I'm please to say that I finally managed the 'TRE_drv.c' code to work as expected. This time there's no more 'segfault'. The solution is to use only non conflicting function names:
  • regncomp and not regcomp
  • regnexec and not regexec
  • tree_free and not regfree
Since every posix Regex function will be used instead of their Tre equivalent, I was forced do find a special case for 'regfree'. There's no special regX function for 'regfree'.
Because regfree can be called at many places, when an erlang error occurs, or simply when we quit the shell; the function is trying to free the posix regex_t with a pointer to a TRE regex_t, so the crash is always there.
While looking at 'tre-internal.h' I've found the 'tre_free' function that will do the real job of freeing the TRE regex_t... So I've just declared as 'extern' this function directly into the TRE_drv.c file...
extern void tree_free(regex_t *preg);
static void tre_stop(ErlDrvData drv_data)
{
struct driver *d = (struct driver*) drv_data;

// if (d->compiled)
// regfree(&(d->re));

if (d->compiled)
tre_free(&d->re);

driver_free(d);
}


This time I've also added a call to the 'reganexec' function that's able to find approximate matches. The code is of course located into the 'tre_from_erlang' function. (this function gets called whenever erlang talks to the driver).

... snip ...
case APPROX:

/* <> */

flags = get_int32(buf + 4);
cost = get_int32(buf + 8);

memset(&amatches, 0, sizeof(amatches));
amatches.pmatch = matches;
amatches.nmatch = MAX_MATCHES;

regaparams_default(®aparams);
regaparams.max_cost = cost;


status = reganexec(&d->re, buf + 12, len - 12, &amatches, regaparams, flags);
if (status != 0) {
driver_send_status(d, status);
return;
}

driver_send_matches(d, matches, MAX_MATCHES);
break;
... snip ...


With this you're able to make things like this:

{ok, RE} = treregex:compile(<<"test">>, [extended]).
treregex:approx(RE, <<"this is a tast">>, 0, []).

the 'fun approx/4' has the 'cost' (an integer) as extra argument, this correspond to the cost of manipulation characters to find a match. You'll find more about this here.

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be
close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure
(also known as the Levenshtein distance) where characters can be inserted, deleted, or
substituted in the searched text in order to get an exact match. Each insertion, deletion, or
substitution adds the distance, or cost, of the match. TRE can report the matches which have a
cost lower than some given threshold value. TRE can also be used to search for matches with the
lowest cost.


Finally, now that the driver and the erlang module works well I'll upload it to various repositories...

No comments:

Sticky