Code Comments
I am building a servlet that needs to parse XHTML files (with DTD and
everything) in order to find the links to the pictures
(<img src="getmeifyoucan.gif" />).

I thought I had already solved the problem elegantly when I realized that
the XML parsing package automatically opens a connection to a website on
the internet to retrieve the DTD! Since this happens on every request to
the servlet, this behaviour is unacceptable for my application.

Apparently there is no simple way to disable this behaviour, since the XML
spec demands that the DTD is retrieved. I tried to treat the XML as a
string and remove the DTD reference but, unfortunately, the library then
fails as soon as an entity is encountered ( for example).

I am puzzled. If I treat the XML as a string, String methods and regexps
are hardly powerful enough to achieve the task. On the other hand, XML
parsing turns out to introduce even more problems than the one I am trying
to solve (as an aside, wasn't XML supposed to be simple?).

Is there an easy way to achieve my goal? XML parsing or regexps?

thanks
Luca
luca passani wrote:

> I am building a servlet that needs to parse XHTML files (with DTD and
> everything), in order to figure out the link to the pictures (<img
> src="getmeifyoucan.gif" /> )
>
> I thought I had already solved the problem elegantly when I realized that
> the package to parse XML would automatically open a connection to
> a website on the internet to retrieve the DTD!
> Since this happens at every request to the servlets, this behaviour
> is unacceptable for my application.
>
> Apparently, there is no simple way to disable this behavior, since
> the XML spec demands that the DTD is retrieved.
> I tried to treat the XML as a string and remove the DTD reference,
> but, unfortunately, the library will fail if an entity is encountered
> ( for example).
>
> I am puzzled. If I treat the XML as a string, String methods and
> regexps are hardly powerful enough to achieve the task.

On the contrary. Those methods are more than powerful enough to handle the
described task. How do I know? That is how the class responsible for this
task does it.

> On the other hand, XML parsing turns up to introduce
> even more problems than I am trying to solve

Name them.

> (as an aside, wasn't
> XML supposed to be simple?)

No, that is a myth. XML is supposed to eliminate unnecessary duplication
and provide a way to standardize data structures. If the data structures
are complex, so is the XML representation.

> Is there an easy way to achieve my goal? XML parsing or regexps?

What can I say? Yes? XML parsing and regular expressions seem to be part of
the same topic.

--
Paul Lutus
http://www.arachnoid.com
Paul Lutus wrote:

> On the contrary. Those methods are more than powerful enough to handle the
> described task. How do I know? That is how the class responsible for this
> task does it.

Which class? How do you handle:

<img
src="pippo"
/>

> Name them.

I just did. The stupid parser tries to open an HTTP connection to retrieve
the DTD.

> No, that is a myth.

If this is a myth, it is one that the XML industry has helped to fuel.
Have a look at the first line of:

http://www.w3.org/XML/

"Extensible Markup Language (XML) is a simple, very flexible text format
derived from SGML"

> XML is supposed to eliminate unnecessary duplication and
> provide a way to standardize data structures. If the data structures are
> complex, so is the XML representation.

The problem is that a lot of complexity is there also for mega-simple
stuff.

> What can I say? Yes? XML parsing and regular expressions seem to be part of
> the same topic.

So, how do you handle:

<img
src="pippo"
/>

with regexps in Java?

Luca
On Tue, 28 Sep 2004 09:24:30 GMT, luca <passani@eunet.no> wrote:

> I just did. The stupid parser tries to open an HTTP connection to retrieve
> the DTD

Use a non-validating parser then. :)

--
Whom the gods wish to destroy they first call promising.
On Tue, 28 Sep 2004 09:24:30 GMT, luca <passani@eunet.no> wrote:
> So, how do you handle:
>
> <img
> src="pippo"
> />
Off the top of my head:

"<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*>"

should match pretty much any img tag that has no alt, height etc.
attributes. To add those, look at the alternation operator (it is the | ).
--
Whom the gods wish to destroy they first call promising.
Stefan Schulz wrote:

> Use a non-validating Parser then. :)

Which one? Even SAX goes for the DTD!!!

Also, be careful, because what I found out by discussing with XML gurus is
that even non-validating parsers are required to go after the DTD if they
see one, according to the XML specs!!!!!

Luca
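
One commonly suggested way to keep the parser and still avoid the download is to install an org.xml.sax.EntityResolver on the DocumentBuilder, so that the DOCTYPE's system identifier is answered from local data instead of over HTTP. What follows is only a minimal sketch under stated assumptions: the class name ImgSrcExtractor, the method extractImgSrcs and the one-line stub DTD are made up for illustration, and the stub declares only the nbsp entity. A real servlet would probably return a local copy of the XHTML DTD files from the resolver, so that every entity the documents use is declared.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ImgSrcExtractor {

    // Hypothetical stand-in for the real XHTML DTD: it declares one entity only.
    private static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Parses the XHTML text and returns the src attribute of every img element.
    static List<String> extractImgSrcs(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Answer every external entity request (including the DTD) from the
        // local stub, so the parser has no reason to open a connection.
        builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader(STUB_DTD)));

        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<String> srcs = new ArrayList<>();
        NodeList imgs = doc.getElementsByTagName("img");
        for (int i = 0; i < imgs.getLength(); i++) {
            srcs.add(((Element) imgs.item(i)).getAttribute("src"));
        }
        return srcs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<?xml version=\"1.0\"?>\n"
              + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
              + "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
              + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>\n"
              + "<p>a&nbsp;b <img src=\"getmeifyoucan.gif\" alt=\"\" /></p>\n"
              + "<p><img\nheight=\"25\"\nsrc=\"pippo\"\n/></p>\n"
              + "</body></html>";
        System.out.println(extractImgSrcs(xhtml));
    }
}
```

Because the tags are read by a real parser, attribute order and line breaks inside the img element do not matter here. The price is that the resolver has to supply declarations for whatever entities the documents contain.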
Stefan Schulz wrote:
> Off the top of my head:
>
> "<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*>"
>
> should match pretty much any img tag that has no alt, height etc.
> attributes. To add those, look at the alternation operator (it is the | ).
but this is not good enough for me (this is why I went for XML parsing
in the first place). All I know about my mark-up is that it's well-formed,
but I don't know anything about the order or the availability
of other attributes:
<img
src="pippo"
/>
<img alt="pippo"
height="25"
src="pippo"
/>
<img src="pippo" alt="pippo"
height="25" />
<img height="35" src="pippo" />
These are all fine. BTW the XML guys claimed confidently that RegExps
are, generally speaking, not powerful enough to parse XML!
Luca
On Tue, 28 Sep 2004 13:49:52 GMT, luca <passani@eunet.no> wrote:

> Stefan Schulz wrote:
>
> but this is not good enough for me (this is why I went for XML parsing
> in the first place). All I know about my mark-up is that it's
> well-formed,
> but I don't know anything about the order or the availability
> of other attributes:

Well, in that case do what I said: within the tag, make an alternation of
all the possible attributes (refer to the DTD for the list of allowed
attributes).

> These are all fine. BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

Generally speaking, this is true. In this particular case, however, you can
do it, since the only thing XML can do that regular expressions cannot is
build trees. img tags, however, are necessarily leaves of the document
tree.

--
Whom the gods wish to destroy they first call promising.
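
The "alternation of attributes" idea can be sketched without listing every attribute by name: skip any number of name="value" pairs, then capture the value of src. This is a rough illustration rather than a complete solution. The class name ImgSrcRegex and the method findImgSrcs are invented for the example, attribute names are only approximated by [\w:.-]+, and the pattern makes no attempt to skip comments or CDATA sections or to decode entities inside the value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcRegex {

    // "<img", then lazily any number of name="value" (or name='value')
    // attributes, then the src attribute itself. \s also matches line
    // breaks, so tags spread over several lines are handled too.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img(?:\\s+[\\w:.-]+\\s*=\\s*(?:\"[^\"]*\"|'[^']*'))*?"
          + "\\s+src\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the src value of every img tag found, in document order.
    static List<String> findImgSrcs(String markup) {
        List<String> srcs = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(markup);
        while (m.find()) {
            // Group 1 is set for double-quoted values, group 2 for single-quoted.
            srcs.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return srcs;
    }

    public static void main(String[] args) {
        String markup =
                "<img\nsrc=\"a.gif\"\n/>\n"
              + "<img alt=\"pippo\"\nheight=\"25\"\nsrc=\"b.gif\"\n/>\n"
              + "<img src=\"c.gif\" alt=\"pippo\"\nheight=\"25\" />\n"
              + "<img height=\"35\" src='d.gif' />";
        System.out.println(findImgSrcs(markup));
    }
}
```

Matching whole quoted values, rather than "anything up to the next >", keeps the pattern from being confused by a > character inside an attribute value, which well-formed XML allows.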
luca wrote:

> BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

They aren't. Parsing XML requires a stack (or, more precisely, the parser
needs to remember all the previous states that led to the current state).
Regular languages can be parsed without remembering earlier states.
However, some of the available regular expression packages contain
constructs that are quite a bit more powerful than real regular
expressions.

--
Daniel Sjöblom
Remove _NOSPAM to reply by mail
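
To make the stack argument concrete, here is a small sketch that checks whether tags are properly nested. A regular expression is used only to find the individual tags; deciding whether each closing tag matches the most recently opened element is done with an explicit stack, which is the part a plain regular expression cannot do for arbitrary depth. The class name TagNesting and the method isProperlyNested are hypothetical, and the tag pattern is a simplification that ignores comments, CDATA sections and > characters inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagNesting {

    // Group 1: "/" for a closing tag. Group 2: element name.
    // Group 3: "/" for a self-closing tag such as <img ... />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*?(/?)>");

    // Returns true if every opened element is closed in the right order.
    static boolean isProperlyNested(String markup) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            String name = m.group(2);
            if (!m.group(1).isEmpty()) {
                // A closing tag must match the most recently opened element.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (m.group(3).isEmpty()) {
                // An ordinary opening tag: remember it until it is closed.
                open.push(name);
            }
            // Self-closing tags are leaves and never touch the stack.
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested("<a><b><img src=\"x\" /></b></a>"));
        System.out.println(isProperlyNested("<a><b></a></b>"));
    }
}
```

Self-closing img tags never reach the stack, which fits the observation earlier in the thread that img elements are leaves and can therefore be picked out without tracking nesting.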
luca wrote:

> Paul Lutus wrote:
>
> Which class? How do you handle:
>
> <img
> src="pippo"
> />

To what does "you" refer? Existing classes, or your own classes? The answer
in both cases is "easily", but that is beside the point.

> I just did. The stupid parser tries to open an HTTP connection to retrieve
> the DTD

How long will this take? I already told you -- write your own parsing
class.

> If this is a myth, it is one that the XML industry has helped
> to fuel. Have a look at the first line of:
>
> http://www.w3.org/XML/
>
> "Extensible Markup Language (XML) is a simple, very flexible text format
> derived from SGML"

And simple languages can be used to convey complex ideas. If that were not
true, the language would be abandoned.

> The problem is that a lot of complexity is there also for mega-simple
> stuff.

No, not really. Simple tasks can be handled using simple XML. Complex tasks
require complex XML.

> So, how do you handle:
>
> <img
> src="pippo"
> />
>
> with regexps in Java?

Trivially:

    String result = original.replaceAll("\\n+"," ");

Working example:

    public class Test {
        public static void main(String[] args) {
            // The img tag from the question, spread over three lines.
            String a = "<img\n" + "src=\"pippo\"\n" + "/>";
            // Collapse each run of newlines into a single space.
            String b = a.replaceAll("\\n+", " ");
            System.out.println(a + " -> " + b);
        }
    }

Result:

    <img
    src="pippo"
    /> -> <img src="pippo" />

Wow, that was really hard!

--
Paul Lutus
http://www.arachnoid.com