Tuesday, July 17, 2007

Parallelizing simple external commands ... Part I

Once upon a time I needed to parse enormous files to find simple patterns... So far my preferred tools had been the shell-based ones, e.g. 'grep'.

But now I have a Magical Ability, Erlang Magic... So I decided to split this enormous file using the 'split' command, 'split -l 10000' for example.

Now that I have a lot of smaller files, I can parallelize their parsing, and this is where Erlang comes in...

First, let's design a bit:
  • I need a central process that will control all my processes
  • Processes and master must be able to communicate
That's all. Happily, the latter is provided directly by Erlang: it is the ! (send) operator.
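As a quick illustration of the ! operator, here is a minimal sketch in which a child process reports back to the process that spawned it. The module name 'bang_sketch' and the '{done, Pid}' message are made up for this example and are not part of the code in this post.

```erlang
-module(bang_sketch).
-export([demo/0]).

%% Spawn a child that reports back to its parent with the ! operator.
%% Returns ok when the message arrives, or timeout after one second.
demo() ->
    Parent = self(),
    Child = spawn(fun() -> Parent ! {done, self()} end),
    receive
        %% Child is already bound, so only a message from that pid matches.
        {done, Child} -> ok
    after 1000 ->
        timeout
    end.
```

Sending never blocks the sender; the message simply waits in the receiver's mailbox until a matching 'receive' picks it up.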
The master process will be a little trickier, but this is 'easyerl', remember, so here we go:

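%% Start the master process and register it so workers can reach it by name.
doit(Step) ->
    Master = spawn(?MODULE, test, [Step]),
    register(computing_master, Master).

%% Collect the split segments, then hand them to upto/3.
%% Step is the maximum number of concurrent workers.
test(Step) ->
    ok = file:set_cwd("/home/rolphin/Work"),
    List = filelib:wildcard("seg-a*"),
    upto(Step, 0, List).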


We create a process running the test function, whose job is to start the upto/3 function...
What's interesting here is 'filelib:wildcard/1', which gives me the list of files in the directory '/home/rolphin/Work' whose names match "seg-a*".
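If you would rather not change the working directory of the whole Erlang node, 'filelib:wildcard/2' takes the directory as a second argument. A minimal sketch, where the module name 'wildcard_sketch' and the function 'segments/1' are made up for this example:

```erlang
-module(wildcard_sketch).
-export([segments/1]).

%% List the "seg-a*" files found in Dir without calling file:set_cwd/1.
%% The names returned are relative to Dir; they are sorted here so the
%% order does not depend on the file system.
segments(Dir) ->
    lists:sort(filelib:wildcard("seg-a*", Dir)).
```

Because the names are relative to Dir, a command run from another directory would need the full path, for example 'filename:join(Dir, Name)'.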

Now let's describe the 'upto/3' function:

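%% No files left to start: hand over to loop/3.
upto(Max, Current, []) ->
    loop(Max, Current, []);

%% Already Max workers started: hand over to loop/3.
upto(Max, Max, List) ->
    loop(Max, Max, List);

%% Room for another worker: spawn a grep on the next file.
upto(Max, Current, [New|List]) ->
    io:format("upto: ~p/~p~n", [Max, Current]),
    spawn(?MODULE, grep, ["user.list", New, ["result-", New]]),
    upto(Max, Current + 1, List).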


More details:
  • upto with an empty list just calls the loop/3 function
  • upto with the current number of processes equal to the maximum allowed (Max) calls loop/3
  • upto with fewer active processes than the maximum, and a non-empty list, spawns a child process
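The three cases above amount to a throttle: start workers until the limit is reached, then wait for one to finish before starting the next. The loop/3 function is not shown in this post, so the following is only a guess at the general pattern and not the actual code. The module 'throttle_sketch', the functions 'run/3' and 'fill/4' and the 'Work' fun are all hypothetical names.

```erlang
-module(throttle_sketch).
-export([run/3]).

%% Apply Work to every item, keeping at most Max workers alive at once.
%% Returns ok once every worker has reported back.
run(Max, Work, Items) when is_integer(Max), Max > 0 ->
    fill(Max, 0, Work, Items).

%% Nothing left to start and nothing running: finished.
fill(_Max, 0, _Work, []) ->
    ok;
%% Room for another worker and items remain: start one.
fill(Max, Active, Work, [Item | Rest]) when Active < Max ->
    Master = self(),
    spawn(fun() -> Master ! {exited, Work(Item)} end),
    fill(Max, Active + 1, Work, Rest);
%% Otherwise wait for one worker to finish, then try again.
fill(Max, Active, Work, Items) ->
    receive
        {exited, _Result} -> fill(Max, Active - 1, Work, Items)
    end.
```

For example, 'throttle_sketch:run(2, fun(X) -> X * 2 end, [1,2,3,4,5])' never has more than two workers alive at once. Note that a worker which crashes never sends its message, so this sketch would wait forever in that case.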
The child process runs a 'grep' command, and here it is:

grep(File, Source, Result) ->
    % Command line is: "grep -f motif_file sourcefile > result"
    Cmd = [ "grep -f ", File, $\s, Source, $>, Result ],
    io:format("Starting: ~p~n", [Cmd]),
    % Note: os:cmd/1 returns the command's output, not its exit code
    Status = os:cmd(Cmd),
    computing_master ! {exited, Status}.
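One caveat about the code above: 'os:cmd/1' returns whatever the command printed, so 'Status' holds text, not the exit code. If the real exit code matters (grep exits with 0 on a match, 1 on no match, 2 on error), a port opened with the 'exit_status' option can report it. This is a different technique from the one used in this post, sketched under the assumption that '/bin/sh' exists; the module 'status_sketch' and its functions are made-up names.

```erlang
-module(status_sketch).
-export([run/1]).

%% Run Cmd through /bin/sh and return {ExitCode, Output}.
%% Cmd is a (possibly nested) list of characters, like Cmd in grep/3.
run(Cmd) ->
    Port = open_port({spawn_executable, "/bin/sh"},
                     [{args, ["-c", lists:flatten(Cmd)]},
                      exit_status, binary]),
    collect(Port, []).

%% Gather output chunks until the port reports the exit status.
collect(Port, Acc) ->
    receive
        {Port, {data, Bin}} ->
            collect(Port, [Bin | Acc]);
        {Port, {exit_status, Code}} ->
            {Code, iolist_to_binary(lists:reverse(Acc))}
    end.
```

With something like this, the child could send '{exited, Code}' to the master, which could then tell a grep that found nothing apart from one that failed.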

Okay, that's it for today! It's a little late now, and I need some sleep to successfully pass the required skill tests for my new job!

More of this tomorrow...
