Author General Purpose parsing
Reuben Grinberg

2007-04-06, 4:16 am

I'm trying to write a parser in Prolog, and all the tutorials on the net
assume that I'm only interested in a limited number of tokens:

sentence --> verb, noun.
verb(hammering).
noun(hammer).
noun(nail).

I'd like to pick up tokens that fit a certain regular expression:
[a-zA-Z][a-zA-Z0-9]*

I've tried a number of different things.

First, I tried:
id --> [S], {regmatch("^[a-zA-Z][a-zA-Z0-9]*$", S)}.

| ?- phrase(id, "bob").
no

So then I tried this, thinking that the problem might have to do with
the difference between strings and atoms:

id --> [C], {atom_chars(C,S), regmatch("^[a-zA-Z][a-zA-Z0-9]*$", S)}.

Again:

| ?- phrase(id, "bob").
no

When I trace through, it doesn't even look like regmatch is being called:
| ?- phrase(id, "bob").
1 1 Call: id([98,111,98],[]) ?
2 2 Call: 'C'([98,111,98],_1148,[]) ?
2 2 Fail: 'C'([98,111,98],_1148,[]) ?
1 1 Fail: id([98,111,98],[]) ?
no

I tried a different tack, which was to manually encode the regular
expression:
id --> [C], {letter([C])}.
id --> [C], restid, {letter([C])}.
restid --> [C], {letter([C])}.
restid --> [C], {num([C])}.

restid --> [C], restid, {letter([C])}.
restid --> [C], restid, {num([C])}.

letter("a").
....
num("9").

This seems to work:
| ?- phrase(id, "bob").
yes
| ?- phrase(id, "bob ").
no


But now I have the following problem.
I've added another rule:
bind --> id, ["="], id.

Now when I try this rule out it doesn't work:
| ?- phrase(bind, "a=b").
no

I've pasted in the trace for this below, and it's really freaking long
for some reason.

Any advice on the correct way to parse arbitrary tokens using regular
expressions, and on why I can't get my 'bind' rule to work, would be
appreciated.

Thanks,
Reuben Grinberg


| ?- phrase(bind, "a=b").
1 1 Call: bind([97,61,98],[]) ?
2 2 Call: id([97,61,98],_1107) ?
3 3 Call: 'C'([97,61,98],_1773,_1107) ?
3 3 Exit: 'C'([97,61,98],97,[61,98]) ?
4 3 Call: letter([97]) ?
? 4 3 Exit: letter([97]) ?
? 2 2 Exit: id([97,61,98],[61,98]) ?
5 2 Call: 'C'([61,98],[61],_1101) ?
5 2 Fail: 'C'([61,98],[61],_1101) ?
2 2 Redo: id([97,61,98],[61,98]) ?
4 3 Redo: letter([97]) ?
4 3 Fail: letter([97]) ?
6 3 Call: 'C'([97,61,98],_1779,_1780) ?
6 3 Exit: 'C'([97,61,98],97,[61,98]) ?
7 3 Call: restid([61,98],_1107) ?
8 4 Call: 'C'([61,98],_3730,_1107) ?
8 4 Exit: 'C'([61,98],61,[98]) ?
9 4 Call: letter([61]) ?
9 4 Fail: letter([61]) ?
10 4 Call: 'C'([61,98],_3730,_1107) ?
10 4 Exit: 'C'([61,98],61,[98]) ?
11 4 Call: num([61]) ?
11 4 Fail: num([61]) ?
12 4 Call: 'C'([61,98],_3736,_3737) ?
12 4 Exit: 'C'([61,98],61,[98]) ?
13 4 Call: restid([98],_1107) ?
14 5 Call: 'C'([98],_5717,_1107) ?
14 5 Exit: 'C'([98],98,[]) ?
15 5 Call: letter([98]) ?
? 15 5 Exit: letter([98]) ?
? 13 4 Exit: restid([98],[]) ?
16 4 Call: letter([61]) ?
16 4 Fail: letter([61]) ?
13 4 Redo: restid([98],[]) ?
15 5 Redo: letter([98]) ?
15 5 Fail: letter([98]) ?
17 5 Call: 'C'([98],_5717,_1107) ?
17 5 Exit: 'C'([98],98,[]) ?
18 5 Call: num([98]) ?
18 5 Fail: num([98]) ?
19 5 Call: 'C'([98],_5723,_5724) ?
19 5 Exit: 'C'([98],98,[]) ?
20 5 Call: restid([],_1107) ?
21 6 Call: 'C'([],_7704,_1107) ?
21 6 Fail: 'C'([],_7704,_1107) ?
22 6 Call: 'C'([],_7704,_1107) ?
22 6 Fail: 'C'([],_7704,_1107) ?
23 6 Call: 'C'([],_7710,_7711) ?
23 6 Fail: 'C'([],_7710,_7711) ?
24 6 Call: 'C'([],_7710,_7711) ?
24 6 Fail: 'C'([],_7710,_7711) ?
20 5 Fail: restid([],_1107) ?
25 5 Call: 'C'([98],_5723,_5724) ?
25 5 Exit: 'C'([98],98,[]) ?
26 5 Call: restid([],_1107) ?
27 6 Call: 'C'([],_7704,_1107) ?
27 6 Fail: 'C'([],_7704,_1107) ?
28 6 Call: 'C'([],_7704,_1107) ?
28 6 Fail: 'C'([],_7704,_1107) ?
29 6 Call: 'C'([],_7710,_7711) ?
29 6 Fail: 'C'([],_7710,_7711) ?
30 6 Call: 'C'([],_7710,_7711) ?
30 6 Fail: 'C'([],_7710,_7711) ?
26 5 Fail: restid([],_1107) ?
13 4 Fail: restid([98],_1107) ?
31 4 Call: 'C'([61,98],_3736,_3737) ?
31 4 Exit: 'C'([61,98],61,[98]) ?
32 4 Call: restid([98],_1107) ?
33 5 Call: 'C'([98],_5717,_1107) ?
33 5 Exit: 'C'([98],98,[]) ?
34 5 Call: letter([98]) ?
? 34 5 Exit: letter([98]) ?
? 32 4 Exit: restid([98],[]) ?
35 4 Call: num([61]) ?
35 4 Fail: num([61]) ?
32 4 Redo: restid([98],[]) ?
34 5 Redo: letter([98]) ?
34 5 Fail: letter([98]) ?
36 5 Call: 'C'([98],_5717,_1107) ?
36 5 Exit: 'C'([98],98,[]) ?
37 5 Call: num([98]) ?
37 5 Fail: num([98]) ?
38 5 Call: 'C'([98],_5723,_5724) ?
38 5 Exit: 'C'([98],98,[]) ?
39 5 Call: restid([],_1107) ?
40 6 Call: 'C'([],_7704,_1107) ?
40 6 Fail: 'C'([],_7704,_1107) ?
41 6 Call: 'C'([],_7704,_1107) ?
41 6 Fail: 'C'([],_7704,_1107) ?
42 6 Call: 'C'([],_7710,_7711) ?
42 6 Fail: 'C'([],_7710,_7711) ?
43 6 Call: 'C'([],_7710,_7711) ?
43 6 Fail: 'C'([],_7710,_7711) ?
39 5 Fail: restid([],_1107) ?
44 5 Call: 'C'([98],_5723,_5724) ?
44 5 Exit: 'C'([98],98,[]) ?
45 5 Call: restid([],_1107) ?
46 6 Call: 'C'([],_7704,_1107) ?
46 6 Fail: 'C'([],_7704,_1107) ?
47 6 Call: 'C'([],_7704,_1107) ?
47 6 Fail: 'C'([],_7704,_1107) ?
48 6 Call: 'C'([],_7710,_7711) ?
48 6 Fail: 'C'([],_7710,_7711) ?
49 6 Call: 'C'([],_7710,_7711) ?
49 6 Fail: 'C'([],_7710,_7711) ?
45 5 Fail: restid([],_1107) ?
32 4 Fail: restid([98],_1107) ?
7 3 Fail: restid([61,98],_1107) ?
2 2 Fail: id([97,61,98],_1107) ?
1 1 Fail: bind([97,61,98],[]) ?
no





Markus Triska

2007-04-06, 4:16 am

Reuben Grinberg <reuben.grinberg@aya.yale.edu> writes:

> id --> [S], {regmatch("^[a-zA-Z][a-zA-Z0-9]*$", S)}.
>
> | ?- phrase(id, "bob").
> no


S will only be a single element of the list (of character codes). You
need to accumulate more elements for "bob".

> I tried a different tack, which was to manually encode the regular
> expression:


Idea for a shorter version:

id --> [C], { between(0'a, 0'z, C) }, id_r.
id_r --> [].
id_r --> [C], { between(0'a, 0'z, C) ; between(0'0, 0'9, C)}, id_r.
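The idea above covers lowercase letters only. A minimal sketch that also accepts uppercase letters, to match [a-zA-Z][a-zA-Z0-9]*, and hands back the matched codes might look like the following. The names letter/1, digit/1 and id_rest//1 are made up for illustration, and plain comparisons are used instead of between/3 on the assumption that not every system has it built in.

```prolog
% Sketch: [a-zA-Z][a-zA-Z0-9]* over a list of character codes.
% letter/1, digit/1 and id_rest//1 are illustrative names, not from the thread.
letter(C) :- C >= 0'a, C =< 0'z.
letter(C) :- C >= 0'A, C =< 0'Z.
digit(C)  :- C >= 0'0, C =< 0'9.

% id(Codes): Codes is the list of character codes forming the identifier.
id([C|Cs]) --> [C], { letter(C) }, id_rest(Cs).

% Try to consume another letter or digit first, otherwise stop.
id_rest([C|Cs]) --> [C], { letter(C) ; digit(C) }, id_rest(Cs).
id_rest([])     --> [].
```

For example, `atom_codes(bob42, Cs), phrase(id(Name), Cs), atom_codes(A, Name)` should bind A to bob42, while the codes of '9a' or 'bob ' should be rejected.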

> I've added another rule:
> bind --> id, ["="], id.


We have:

%?- X = "=".
%@% X = [61]

So the rule is actually:

bind --> id, [[61]], id.

Try instead:

bind --> id, [0'=], id.

Because:

%?- X = 0'=.
%@% X = 61;
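
The difference can be checked in isolation. In this small sketch (eq_nested//0 and eq_code//0 are illustrative names), each terminal list is matched against the input element by element, so a terminal that is itself a list never matches a plain character code.

```prolog
% Sketch: a terminal list matches input elements one for one.
% eq_nested//0 and eq_code//0 are illustrative names, not from the thread.
eq_nested --> [[0'=]].   % one terminal that is itself the list [61]
eq_code   --> [0'=].     % one terminal that is the integer 61
```

Here `phrase(eq_code, [61])` succeeds and `phrase(eq_nested, [61])` fails; eq_nested only accepts the one-element list `[[61]]`.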

All the best,
Markus

--
comp.lang.prolog FAQ: http://www.logic.at/prolog/faq/

Reuben Grinberg

2007-04-06, 7:06 pm

Thanks for your reply Markus!

Your suggestions fixed my problem and the id definition you suggested is
much better than the one I had.

Could you explain to me what the 0' syntax is? I'm using SICStus and I
searched the documentation for 0' and didn't find anything. Is 0'a an
atom, a string, or a single character?

Also, do you know why I can't use spaces now in my phrase?

expr --> [let], id.

| ?- phrase(expr, "let a").
no
| ?- phrase(expr, "leta").
yes


Also, I'm unclear about why my code is working differently than this
snippet:
a --> b, c.
b --> [the].
c --> [dog].

| ?- phrase(a, "the dog").
no
| ?- phrase(a, "thedog").
no
| ?- phrase(b, "the").
no
| ?- phrase(b, [the]).
yes
| ?- phrase(a, [the, dog]).
yes


But this doesn't work with my code:
| ?- phrase(expr, [let, a]).
! Domain error in argument 1 of >= /2
! expected expression, found let
! goal: let>=97


I'm guessing some of the problems I'm having have to do with Prolog
trying to parse and tokenize at the same time. I have a predicate that
tokenizes exactly the way I want:
hasktok(X,Y) :- tokenize(
"+|([a-zA-Z][a-zA-Z0-9]*|\\(|\\)|\\\\|\\.|=|;)",
X,Y).

But because my expr, bind, etc. won't accept lists of atoms in phrase/2,
this doesn't work:

| ?- hasktok("let a", Y).
Y = [let,a] ? ;
Y = [le,t,a] ? ;
Y = [l,et,a] ? ;
Y = [l,e,t,a] ?
yes
| ?- hasktok("let a", Y), phrase(expr, Y).
! Domain error in argument 1 of >= /2
! expected expression, found let
! goal: let>=97


Any advice you have would be much appreciated!

Thanks,
Reuben Grinberg







Markus Triska

2007-04-07, 8:05 am

Reuben Grinberg <reuben.grinberg@aya.yale.edu> writes:

> Could you explain to me what the 0' syntax is?


0'X denotes the ASCII/Unicode code point of character X.

> I'm using sicstus and I searched the documentation for 0' and didn't
> find anything. is 0'a an atom, a string, or a single character?


Ask Prolog:

%?- atom(0'a).
%@% No

%?- string(0'a).
%@% No

%?- number(0'a).
%@% Yes

%?- 97 =:= 0'a.
%@% Yes

> Also, do you know why I can't use spaces now in my phrase?


The space character (0' followed by a space, =:= 32) is neither between
0'a and 0'z nor between 0'0 and 0'9, so it's not permitted in id/2.

> expr --> [let], id.


This mixes atoms (`let') with character codes (id/2).
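
One way to stay consistent when parsing character codes directly is to spell the keyword as codes and to allow blanks explicitly. The following is only a sketch: kw_let//0, blank//0, blanks//0 and word//0 are illustrative names, and word//0 (lowercase letters only) stands in for the real id.

```prolog
% Sketch: every terminal is a character code, and blanks are explicit.
% All nonterminal names below are illustrative, not from the thread.
expr --> kw_let, blank, blanks, word.

kw_let --> [0'l, 0'e, 0't].      % the keyword, spelled as codes

blank  --> [32].                 % 32 is the code of the space character
blanks --> blank, blanks.
blanks --> [].

% word//0: one or more lowercase letters, a stand-in for id.
word      --> lower, word_rest.
word_rest --> lower, word_rest.
word_rest --> [].
lower     --> [C], { C >= 0'a, C =< 0'z }.
```

With this grammar the codes of 'let a' are accepted, and the codes of 'leta' are rejected because at least one blank is required after the keyword.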

> | ?- phrase(expr, "leta").
> yes


"leta" is a valid id. "leta" is shorthand notation for a list of
character codes:

%?- Xs = "leta".
%@% Xs = [108, 101, 116, 97]

> Also, I'm unclear about why my code is working differently than this
> snippet:


Your grammar generates lists of character codes. The snippet generates
lists of atoms.

> I'm guessing some of the problems I'm having have to do with prolog
> trying to parse and tokenize at the same time.


It's common to first tokenise (= convert character codes to atoms and
compound terms), and then parse based on these tokens for clarity. You
can do it simultaneously too of course, but not by arbitrarily
intermingling the phases. DCGs are suitable for both phases: For
tokenising, since strings are lists. For parsing, since lists of terms
are - well, also lists. So it's list processing in both cases.
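
A rough sketch of the two phases, each written as a DCG, could look like this. All predicate and nonterminal names are illustrative; the tokeniser only knows identifiers, '=' and spaces, and the cut commits to the longest identifier, so only one tokenisation is produced.

```prolog
% Phase 1: character codes -> list of atoms. Names are illustrative.
tokens([T|Ts]) --> blanks, token(T), !, tokens(Ts).
tokens([])     --> blanks.

blanks --> [32], blanks.          % skip spaces (code 32)
blanks --> [].

token(A)   --> [C], { letter(C) }, alnums(Cs), { atom_codes(A, [C|Cs]) }.
token('=') --> [0'=].

alnums([C|Cs]) --> [C], { letter(C) ; digit(C) }, alnums(Cs).
alnums([])     --> [].

letter(C) :- C >= 0'a, C =< 0'z.
letter(C) :- C >= 0'A, C =< 0'Z.
digit(C)  :- C >= 0'0, C =< 0'9.

% Phase 2: parse the list of atoms.
expr --> [let], name.
bind --> name, ['='], name.
name --> [A], { atom(A), A \== let, A \== '=' }.

% parse_expr(Codes): tokenise Codes, then parse the tokens as an expr.
parse_expr(Codes) :- phrase(tokens(Ts), Codes), phrase(expr, Ts).
```

For instance, `atom_codes('let a', Cs), parse_expr(Cs)` should succeed, and tokenising the codes of 'a=b' should give `[a,=,b]`, which bind//0 accepts.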

--
comp.lang.prolog FAQ: http://www.logic.at/prolog/faq/

Reuben Grinberg

2007-04-11, 7:04 pm

Thanks a lot for your help. I got it working.

