Home > Archive > PHP Language > February 2007 > Multi - level parsing
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Multi - level parsing
|
|
| Michael 2007-02-03, 6:58 pm |
| Hello all you (reg|parsing)experts
Here's one for you ;)
I'm creating a kind of markup system, parsing a number of (custom) markup
tags with the following syntax:
[tag|arg1|arg2|...]contents[/tag]
where any tag with no arguments may be written as [tag]contents[/tag] and
any tag with no contents may be written as [tag|arg1|arg2|...|argn /].
What I have now works fine for constructions like
[first][second]Hello[/second] world[/first]!
But consider
[first]Hello [first]world[/first][/first]!
Here it will look for the closing tag for the first tag, [first], which it
finds right after "world". It will then process it's contents, "Hello
[first]world". If [first] happens to be a tag which leaves its contents more
or less intact, it will then find another [first] / [/first] pair and parse
it, but if [first] returns something entirely different (like a database
value) this will leave a trailing end tag (eg. if the tag maps "Hello
[first]world" to "Nicey-nice", the result will be "Nicey-nice[/first]").
Of course, what I want it to do is replace the innermost tag first (with
contents "world"), substitute the result in the original string and then
process the outermost tag (if first replaces "world" with "earth", the
original string would first become "[first]Hello earth[/first]" which poses
no more problem).
The regex I´m currently using is
'#\[(.*?)((?:\|.*?)*)(?:/|\](.*?)\[/\1)\]#s'
which basically
a) finds an opening tag [aaa]
b) gobbles arguments separated by | until it finds
b1) the closing bracket ], the contents of the tag -- that's the
(.*?) part -- and a closing tag [/aaa]
or
b2) an implied bracket /]
Of course the problem is the (.*?) part, which stops as soon as a matching
closing tag is encountered. What I actually need is some way to count the
number of opening tags of the same name (say aaa) and the number of those
that are closed, and only match [/aaa] once no more [aaa] tags INSIDE the
one I'm parsing are open.
Another way I could think of to solve the problem is to find the first
opening tag, find the LAST matching closing tag (which could be very far
away, if I'd use <B>..</B> on each line it would match the entire body of
the document), then recursively find the first opening tag INSIDE of that
with a matching closing tag, until there are no more opening tags inside -
effectively parsing the "shortest" tags first (in other words: from the
inside outwards).
If you're still with me, please help me out on this because currently I'm
kind of at a loss.
Thank you so much.
Michael.
| |
|
| Michael <mischa.tbaINSERTATHERExs4all.nl> wrote:
> Hello all you (reg|parsing)experts
>
> Here's one for you ;)
>
> I'm creating a kind of markup system, parsing a number of (custom) markup
> tags with the following syntax:
> [tag|arg1|arg2|...]contents[/tag]
> where any tag with no arguments may be written as [tag]contents[/tag] and
> any tag with no contents may be written as [tag|arg1|arg2|...|argn /].
>
> What I have now works fine for constructions like
> [first][second]Hello[/second] world[/first]!
> But consider
> [first]Hello [first]world[/first][/first]!
> Here it will look for the closing tag for the first tag, [first], which
> it
> finds right after "world". It will then process it's contents, "Hello
> [first]world". If [first] happens to be a tag which leaves its contents
> more
> or less intact, it will then find another [first] / [/first] pair and
> parse
> it, but if [first] returns something entirely different (like a database
> value) this will leave a trailing end tag (eg. if the tag maps "Hello
> [first]world" to "Nicey-nice", the result will be "Nicey-nice[/first]").
> Of course, what I want it to do is replace the innermost tag first (with
> contents "world"), substitute the result in the original string and then
> process the outermost tag (if first replaces "world" with "earth", the
> original string would first become "[first]Hello earth[/first]" which
> poses
> no more problem).
>
> The regex I´m currently using is
>
> which basically
> a) finds an opening tag [aaa]
> b) gobbles arguments separated by | until it finds
> b1) the closing bracket ], the contents of the tag -- that's the
> (.*?) part -- and a closing tag [/aaa]
> or
> b2) an implied bracket /]
> Of course the problem is the (.*?) part, which stops as soon as a
> matching
> closing tag is encountered. What I actually need is some way to count the
> number of opening tags of the same name (say aaa) and the number of those
> that are closed, and only match [/aaa] once no more [aaa] tags INSIDE the
> one I'm parsing are open.
> Another way I could think of to solve the problem is to find the first
> opening tag, find the LAST matching closing tag (which could be very far
> away, if I'd use <B>..</B> on each line it would match the entire body of
> the document), then recursively find the first opening tag INSIDE of that
> with a matching closing tag, until there are no more opening tags inside
> -
> effectively parsing the "shortest" tags first (in other words: from the
> inside outwards).
>
> If you're still with me, please help me out on this because currently I'm
> kind of at a loss.
Ask any expert, and they'll say regexes have huge limitations in parsing.
You might want to consider scanning the string and creating a stack of the
tags. It seems to me a better way to be able to process 'broken' tags.
However, for you particular problem, I'd go the opposite way then you
propose. You propose to process the innermost tags first, I'd say process
the outer tags and let it ignore inner tags with exactly the same name.
Maybe something like helps (beware of typing errors, untested):
'%\[ # start of opening tag
([^\]|]+) # type of opening tag captured in \1
(?:|([^\]]+)) # 'attributes'
(
/] # self-closing tag
| # or
] # end of tag, scan futher for end-tag:
(.*?) # arbitrary data
(\[\1(?:|[^\]*])*? # start of nested opening tag of same type
( # ignore untill closing tag of nested set
/\] # again, either self-closing
| # or
.*?\[/\1\] # data + closing tag
).*? # some more arbitrary data
)* # allow for more then one nested tag
.*? # again, some more loose data
\[/\1\] # closing of actual tag
)%sx'
Something similar to this should work, but you can see the problem: this
will only allow for the nesting of 1 level, for every level possible we
would have to alter the regex.
One way I've used in the past is the following:
1. Match (don't replace yet) all opening tags, closing tags and
'self-contained' tags in different array's, with PREG_MATCH_OFFSET.
2. For every opening tags, search the nearest closing tag from the other
array, which doesn't have an opening tag in between.
3. Replace in string (using substr() for instance), remove opening and
closing tags from their respective arrays (keep track of changing offsets,
it's a pain...)
4. Loop untill arrays are empty, or throw an error when tags don't match
(opening tag without closing tag or the other way around).
I know I have thath code somewhere, can't seem to find it at the moment
though :P
--
Rik Wasmus
| |
|
| Rik <luiheidsgoeroe@hotmail.com> wrote:
> with PREG_MATCH_OFFSET.
Euhm, PREG_OFFSET_CAPTURE offcourse.
--
Rik Wasmus
| |
|
| Well, another method then previously described, but something I just =
thought of: your idea to process tags only if they don't have nested tag=
s =
in them might work indeed... As long as a normal '[' is strictly forbidd=
en =
in the text itself. Again, without testing, so beware of typing errors, =
or =
other huge mistakes on my part :-)
<?php
/* first, process all self_closing tags */
function process_self_closing(){
//you logic for replacing
var_dump(func_get_args());
}
$string =3D =
preg_replace_callback('#\[([^\]|]+)(?:|([^\]|]+))*/]#','process_self_clo=
sing',$string);
/* match tags that don't have children */
function process_tags((){
//you logic for replacing
var_dump(func_get_args());
}
$count =3D 1;
while($count !=3D 0){
$string =3D =
preg_replace_callback('#\[([^\]|]+)(?:|([^\]|]+))*/]([^\[]*)\[/\1]#','pr=
ocess_tags',$string,-1,$count);
}
?>
Note: count parameter is available since PHP 5.1.0.
-- =
Rik Wasmus
| |
| Michael 2007-02-03, 6:59 pm |
| Hey Rik,
Wow that sounds complicated :)
I just thought up a method of inside-out parsing (don't know if it's gonna
work though, haven't had time to type it out)
function ParseTag($contents) {
Find a complete tag at position n...n+p (simple regex will do)
Check if it's contents contain another tag.
If not, process the tag we found
If it does, ParseTag(substr($contents, n+p+1)) and start over
}
Long example: in
[first][first]Hello [second]beautiful
[first]world![/first][/second][/first][/first]
this would first match a "first" tag, with contents
[first]Hello [second]beautiful [first]world!
this contains an opening tag, so find look at
[first]Hello [second]beautiful
[first]world![/first][/second][/first][/first]
we find a "first" tag with contents
Hello [second]beautiful [first]world!
containing another tag, so we parse
Hello [second]beautiful [first]world![/first][/second][/first][/first]
matching a second tag with
beautiful [first]world![/first]
so we parse
beautiful [first]world![/first]
and find a self-contained first-tag. We can process it, replacing the
original string
[first][first]Hello [second]beautiful
[first]world![/first][/second][/first][/first]
by (assume "[first]text[/first]" processes as ""text"")
[first][first]Hello [second]beautiful
"world"[/second][/first][/first]
and now we start the process all over -- replacing the second tag -- and
over until all tags have been resolved.
Any traling opening or closing tags will not form a problem as the regex
will only match opening tags with a closing tag.
Only problem is, for large documents this may become slow.
Vriendelijke groet
Michael
PS I should find out if Wikipedia allows nested tags. If so I can perhaps
try to find my way through their sourcecode... not looking forward to that
though.
| |
| Michael 2007-02-04, 6:58 pm |
| I found this solution:
function ReplaceTags($input) {
$offset = 0;
$regex = '#\[(.*?)((?:\|.*?)*)(?:/|\](.*?)\[/\1)\]#s';
// Find a matching tag pair, using offset $offset
while (preg_match($regex, $input, $matches, PREG_OFFSET_CAPTURE,
$offset)) {
// If it contains another opening tag, skip over this and let
// the next loop execution handle it
if (preg_match('#\[[^/](.*?)\]#s', $matches[3][0], $m))
$offset = $matches[0][1] + 1;
else
{
// If it's not an opening tag, we can safely parse it
// Replace the tag with it's processed value and start from the beginning
$input = substr_replace($input, ParseTag(array($matches[0][0],
$matches[1][0], $matches[2][0], $matches[3][0])), $matches[0][1],
strlen($matches[0][0]));
$offset = 0;
}
}
return $input;
}
Don't mind the ParseTag call, it's the old version expecting an array
directly from the preg_match. Doesn't seem to be too slow - I expect to be
parsing mostly top-level and singly nested tags anyway.
Kind regards
Michael.
|
|
|
|
|