Code Comments
Programming Forum and web based access to our favorite programming groups.I wrote the following function to split a string into a list of words, using blanks as a delimiter: % Tokenize a sentence into a list of words. strtok([], [], _):-!. strtok([], CharsAccum, [Word]):- string_chars(Word, CharsAccum), !. strtok([32|T], CharsAccum, [Word | W_T]):- !, string_chars(Word, CharsAccum), strtok(T, [], W_T). strtok([C|T], CharsAccum, WordList):- C \= 32, !, append(CharsAccum, [C], CharsAccum2), strtok(T, CharsAccum2, WordList). str_to_list(S, L):- string_chars(S, Chars1), strtok(Chars1, [], L). What I don't like about what I've done is that I've used the cut frequently to make the (overall) predicate deterministic, but I only want the predicate to attempt to return one answer. Is there a more elegant way to do this? BTW, I know the predicate could be more efficient if I used a difference list for the CharsAccum list, but I was feeling lazy. Thanks.
Post Follow-up to this message"Robert Oschler" wrote in message
news:l4mdnYOnQ8zWwN_cRVn-qA@adelphia.com...
> I wrote the following function to split a string into a list
> of words, using blanks as a delimiter:
>
> % Tokenize a sentence into a list of words.
> strtok([], [], _):-!.
> strtok([], CharsAccum, [Word]):-
> string_chars(Word, CharsAccum),
> !.
> strtok([32|T], CharsAccum, [Word | W_T]):-
> !,
> string_chars(Word, CharsAccum),
> strtok(T, [], W_T).
> strtok([C|T], CharsAccum, WordList):-
> C \= 32,
> !,
> append(CharsAccum, [C], CharsAccum2),
> strtok(T, CharsAccum2, WordList).
>
> str_to_list(S, L):-
> string_chars(S, Chars1),
> strtok(Chars1, [], L).
>
> What I don't like about what I've done is that I've used the
> cut frequently to make the (overall) predicate deterministic,
> but I only want the predicate to attempt to return one answer.
> Is there a more elegant way to do this?
>
> BTW, I know the predicate could be more efficient if I used a
> difference list for the CharsAccum list, but I was feeling lazy.
I posted the following Win-Prolog program almost a year
back (2003-09-26), which leads me to believe that you are
asking for help with a homework problem. This might not be
the case, however, but since the program is already in the
public domain, I suppose there's no reason to exhibit
scruples by not answering your request. ;) Of course, only
you can decide whether it is elegant enough for you.
/* Substrings extraction program */
% substrings_all(+String, -Substrings) is true if
% Substrings is a list of all substrings in String.
substrings_all(String, Substrings) :-
Delimiter = 32, % could specify a list of delimiters
substrings(String,Delimiter,Substrings).
% substrings(+String, +Delimiter, -Substrings)
substrings(InString, Delimiter, OutStrings) :-
(InString = []
-> OutStrings = []
; frontstring(InString,Delimiter,Substring
,Rest),
string_chars(S,Substring), % conv. chars. to string
OutStrings = [S|Substrings],
substrings(Rest,Delimiter,Substrings)
).
% frontstring(+String, +Delimiter, -FrontString, -Rest)
frontstring(InString, Delimiter, OutString, Rest) :-
(InString = [H|T]
-> (H == Delimiter % replace if more than one delimiter
-> OutString = [],
Rest = T
; OutString = [H|FrontString],
frontstring(T,Delimiter,FrontString,Rest
)
)
; OutString = [],
Rest = []
).
/* End of substrings extraction program */
Example goal:
| ?- substrings_all("Say hello, Molly", Substrings).
Substrings = ['Say','hello,','Molly']
Note that the comma in the input string is part of the
second word, not a substring (or token) or delimiter per
se. I have placed comments in substrings_all/2 and
frontstring/4 to indicate where to (start to) make changes
to enable a 'true' substrings extractor (or tokenizer).
Consider this program a framework for further development.
Good luck!
--
Mail sent to this email address is automatically deleted
(unread) on the server. Send replies to the newsgroup.
Post Follow-up to this messageRobert Oschler wrote: > I wrote the following function to split a string into a list of words, usi ng > blanks as a delimiter: > > % Tokenize a sentence into a list of words. > strtok([], [], _):-!. > strtok([], CharsAccum, [Word]):- > string_chars(Word, CharsAccum), > !. > strtok([32|T], CharsAccum, [Word | W_T]):- > !, > string_chars(Word, CharsAccum), > strtok(T, [], W_T). > strtok([C|T], CharsAccum, WordList):- > C \= 32, > !, > append(CharsAccum, [C], CharsAccum2), > strtok(T, CharsAccum2, WordList). > > str_to_list(S, L):- > string_chars(S, Chars1), > strtok(Chars1, [], L). > > What I don't like about what I've done is that I've used the cut frequentl y > to make the (overall) predicate deterministic, but I only want the predica te > to attempt to return one answer. Is there a more elegant way to do this? and yet another way to do it: str_to_lists(A, L) :- atom_codes(A, AL), strtok(AL, Acu, L1), join_sublists([Acu|L1], L). strtok([], [], []) :- !. strtok([32|M], [], [Acu|L]) :- !, strtok(M, Acu, L). strtok([C|M], [C|Acu], L) :- C \= 32, strtok(M, Acu, L). join_sublists([], []) :- !. join_sublists([H|T], [H1|T1]) :- atom_codes(H1, H), join_sublists(T, T1). bye - Salva
Post Follow-up to this messageDCG may sometimes be rather elegant (still using cuts, however):
words([Word|Words]) --> word(Word), !, words(Words).
words(Words) --> blank, words(Words).
words([]) --> "".
word(Word) --> char(Char), chars(Chars), {atom_chars(Word,[Char|Chars])}.
chars([Char|Chars]) --> char(Char), !, chars(Chars).
chars([]) --> "".
char(Char) --> [Char], {Char >= 65, Char =< 122}.
blank --> " ".
Use like so:
| ?- words(L,"this is a test","").
L = [this,is,a,test]
Cheers,
Torbjörn
"Robert Oschler" <no-mail-please@nospam.com> wrote in message
news:<l4mdnYOnQ8zWwN_cRVn-qA@adelphia.com>...
> I wrote the following function to split a string into a list of words, usi
ng
> blanks as a delimiter:
>
> % Tokenize a sentence into a list of words.
> strtok([], [], _):-!.
> strtok([], CharsAccum, [Word]):-
> string_chars(Word, CharsAccum),
> !.
> strtok([32|T], CharsAccum, [Word | W_T]):-
> !,
> string_chars(Word, CharsAccum),
> strtok(T, [], W_T).
> strtok([C|T], CharsAccum, WordList):-
> C \= 32,
> !,
> append(CharsAccum, [C], CharsAccum2),
> strtok(T, CharsAccum2, WordList).
>
> str_to_list(S, L):-
> string_chars(S, Chars1),
> strtok(Chars1, [], L).
>
> What I don't like about what I've done is that I've used the cut frequentl
y
> to make the (overall) predicate deterministic, but I only want the predica
te
> to attempt to return one answer. Is there a more elegant way to do this?
>
> BTW, I know the predicate could be more efficient if I used a difference
> list for the CharsAccum list, but I was feeling lazy.
>
> Thanks.
Post Follow-up to this messageOverall, I think using cut in this predicate is fine, it simply expresses when a predicate commits to the clause. Prolog is always a mixture of deterministic and non-deterministic predicates. Personally, I use a predicate mode descriptor: % strtok(+TokenList,+CharsAccum,-WordList) to express input and output arguments. But as I personally dont like to have to call strtok/3 from other clauses with a [] as the start value of the accumulator, which somehow is informatio n that should be local to strtok/3, not to the calling environment, I would add a clause: strtok(TokenList,WordList) :- strtok(TokenList,[],WordList). and use strtok/2 from all calling locations. Export only strtok/2 from the module. > I wrote the following function to split a string into a list of words, usi ng > blanks as a delimiter: > > % Tokenize a sentence into a list of words. > strtok([], [], _):-!. > strtok([], CharsAccum, [Word]):- > string_chars(Word, CharsAccum), > !. Here the !/0 should be placed before the call to string_chars/2. Where you have it, it is a "red" cut, pruning possible alternatives to string_chars/2 from inside strtok/3. Placing it before will make it a "green" cut. > strtok([32|T], CharsAccum, [Word | W_T]):- > !, > string_chars(Word, CharsAccum), > strtok(T, [], W_T). > strtok([C|T], CharsAccum, WordList):- > C \= 32, > !, > append(CharsAccum, [C], CharsAccum2), > strtok(T, CharsAccum2, WordList). In this clause the !/0 is a "blue" cut as it prunes \=/2 which is a builtin. For me builtins are all ISO predicates not library predicates like: string_chars/2 For a description of red green and blue cuts please see Richard O'Keefe s book "The Craft of Prolog" it is excellent on this respect. Of course, in IF/Prolog we just write: atom_split(Atom,' ',SplitAtomList) to split a string into a list of sub-strings.... it avoids all the unnecessary list creation in the tokenisation.
Post Follow-up to this messageRobert Oschler wrote:
> I wrote the following function to split a string into a list of words, usi
ng
> blanks as a delimiter:
>
> % Tokenize a sentence into a list of words.
> strtok([], [], _):-!.
> strtok([], CharsAccum, [Word]):-
> string_chars(Word, CharsAccum),
> !.
> strtok([32|T], CharsAccum, [Word | W_T]):-
> !,
> string_chars(Word, CharsAccum),
> strtok(T, [], W_T).
> strtok([C|T], CharsAccum, WordList):-
> C \= 32,
> !,
> append(CharsAccum, [C], CharsAccum2),
> strtok(T, CharsAccum2, WordList).
>
> str_to_list(S, L):-
> string_chars(S, Chars1),
> strtok(Chars1, [], L).
>
> What I don't like about what I've done is that I've used the cut frequentl
y
> to make the (overall) predicate deterministic, but I only want the predica
te
> to attempt to return one answer. Is there a more elegant way to do this?
>
> BTW, I know the predicate could be more efficient if I used a difference
> list for the CharsAccum list, but I was feeling lazy.
>
> Thanks.
>
>
There is another way of looking at the question.
Suppose I had a predicate next_token(Str,Token) which
returned successive tokens contained in string Str, much in the way
member(X,L) returns successive members of list L.
Then, to tokenize any given string, I could just say
findall(Token,next_token(Str,Token),Toke
nList)
However, if I use member(X,L) as my model:
next_token(Str,Token) :-
scan_token(Str,_,Token).
next_token(Str,Token) :-
scan_token(Str,Str0,_),
next_token(Str,Token).
each token is scanned twice, first to get it, then to get past it.
This is where a neat trick I learned in this n.g. comes in handy:
next_token(Str,Token) :-
s_token(Str,Str0,CurrToken),
dispatch(Str0,CurrToken,Token).
dispatch(_Str0,CurrToken,Token) :-
Token = CurrToken.
dispatch(Str0,_,Token) :-
next_token(Str0,Token).
s_token([X|Str],Str0,Token) :-
X = 32,
!,
s_token(Str,Str0,Token).
s_token([X|Str],Str0,[X|TokenTail]) :-
\+ X = 32,
s_tokentail(Str,Str0,TokenTail).
s_tokentail([],[],[]).
s_tokentail([X|Str],Str,TokenTail) :-
X = 32,
!,
TokenTail=[].
s_tokentail([X|Str],Str0,[X|TokenTail]) :-
\+ X = 32,
s_tokentail(Str,Str0,TokenTail).
Now the scan resumes at the point where it left off after each
successive token is scanned, and
?- findall(Tok,next_token(" how now brown cow ",Tok),Tokl).
Tok = _G527
Tokl = [[104, 111, 119], [110, 111, 119], [98, 114, 111, 119, 110], [99,
111, 119]] ;
No
?-
Done in haste.
Bill
Post Follow-up to this messagereader wrote: > There is another way of looking at the question. > > Suppose I had a predicate next_token(Str,Token) which > returned successive tokens contained in string Str, much in the way > member(X,L) returns successive members of list L. > > Then, to tokenize any given string, I could just say > > findall(Token,next_token(Str,Token),Toke nList) > > However, if I use member(X,L) as my model: > > next_token(Str,Token) :- > scan_token(Str,_,Token). > next_token(Str,Token) :- > scan_token(Str,Str0,_), > next_token(Str,Token). <--------------- error > > each token is scanned twice, first to get it, then to get past it. > That should be next_token(Str,Token) :- scan_token(Str,_,Token). next_token(Str,Token) :- scan_token(Str,Str0,_), next_token(Str0,Token). --
Post Follow-up to this messagePowered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.