Code Comments

quick and easy way to parse XML
I am building a servlet that needs to parse XHTML files (with DTD and everything),
in order to figure out the link to the pictures (<img src="getmeifyoucan.gif" />)

I thought I had already solved the problem elegantly when I realized that
the package to parse XML would automatically open a connection to
a website on the internet to retrieve the DTD!
Since this happens at every request to the servlet, this behaviour
is unacceptable for my application.

Apparently, there is no simple way to disable this behavior, since
the XML spec demands that the DTD is retrieved.
I tried to treat the XML as a string and remove the DTD reference,
but, unfortunately, the library will fail if an entity is encountered
(&nbsp; for example).

I am puzzled. If I treat the XML as a string, String methods and
regexps are hardly powerful enough to achieve the task.
On the other hand, XML parsing turns out to introduce
even more problems than I am trying to solve (as an aside, wasn't
XML supposed to be simple?)

Is there an easy way to achieve my goal?  XML parsing or regexps?

thanks

Luca

luca passani
09-27-04 09:02 PM


Re: quick and easy way to parse XML
luca passani wrote:

>
> I am building a servlet that needs to parse XHTML files (with DTD and
> everything), in order to figure out the link to the pictures (<img
> src="getmeifyoucan.gif" /> )
>
> I thought I had already solved the problem elegantly when I realized that
> the package to parse XML would automatically open a connection to
> a website on the internet to retrieve the DTD!
> Since this happens at every request to the servlet, this behaviour
> is unacceptable for my application.
>
> Apparently, there is no simple way to disable this behavior, since
> the XML spec demands that the DTD is retrieved.
> I tried to treat the XML as a string and remove the DTD reference,
> but, unfortunately, the library will fail if an entity is encountered
> (&nbsp; for example).
>
> I am puzzled. If I treat the XML as a string, String methods and
> regexps are hardly powerful enough to achieve the task.

On the contrary. Those methods are more than powerful enough to handle the
described task. How do I know? That is how the class responsible for this
task does it.

> On the other hand, XML parsing turns out to introduce
> even more problems than I am trying to solve

Name them.

> (as an aside, wasn't
> XML supposed to be simple?)

No, that is a myth. XML is supposed to eliminate unnecessary duplication and
provide a way to standardize data structures. If the data structures are
complex, so is the XML representation.

> Is there an easy way to achieve my goal?  XML parsing or regexps?

What can I say? Yes? XML parsing and regular expressions seem to be part of
the same topic.

--
Paul Lutus
http://www.arachnoid.com


Paul Lutus
09-27-04 09:02 PM


Re: quick and easy way to parse XML

Paul Lutus wrote:
> On the contrary. Those methods are more than powerful enough to handle the
> described task. How do I know? That is how the class responsible for this
> task does it.

Which class? How do you handle:

<img
src="pippo"
/>
 
> Name them.

I just did. The stupid parser tries to open an HTTP connection to retrieve the DTD.

> No, that is a myth.

If this is a myth, it is one that the XML industry has contributed
to fuel. Have a look at the first line of:

http://www.w3.org/XML/

"Extensible Markup Language (XML) is a simple, very flexible text format
derived from SGML"

> XML is supposed to eliminate unnecessary duplication and
> provide a way to standardize data structures. If the data structures are
> complex, so is the XML representation.

the problem is that a lot of complexity is there also for mega-simple stuff.

> What can I say? Yes? XML parsing and regular expressions seem to be part of
> the same topic.

So, how do you handle:

<img
src="pippo"
/>

with regexps in Java?

Luca



luca
09-28-04 02:04 PM


Re: quick and easy way to parse XML
On Tue, 28 Sep 2004 09:24:30 GMT, luca <passani@eunet.no> wrote:
 
> I just did. The stupid parser tries to open an HTTP connection to retrieve
> the DTD

Use a non-validating Parser then. :)


--

Whom the gods wish to destroy they first call promising.

Stefan Schulz
09-28-04 02:04 PM


Re: quick and easy way to parse XML
On Tue, 28 Sep 2004 09:24:30 GMT, luca <passani@eunet.no> wrote:

> So, how do you handle:
>
> <img
>   src="pippo"
> />

Off the top of my head:

"<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*/?>" should match pretty
much any img tag that has no alt, height etc. attributes. How to add them...
look at the alternation operator (it is the | )


--

Whom the gods wish to destroy they first call promising.

Stefan Schulz
09-28-04 02:04 PM


Re: quick and easy way to parse XML

Stefan Schulz wrote:

> Use a non-validating Parser then. :)

Which one? Even SAX goes for the DTD!!!

Also, be careful, because what I found out by discussing
with XML gurus is that even non-validating parsers are required
to go after the DTD if they see one according to XML specs!!!!!

Luca


luca
09-28-04 09:13 PM
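
One way to keep the parser off the network, sketched here under stated assumptions rather than taken from the thread: SAX lets the application register an EntityResolver, and the parser asks it first whenever it wants the external DTD. The class name OfflineImgExtractor and the one-line LOCAL_DTD string are hypothetical stand-ins; a real servlet would more likely return a locally stored copy of the XHTML DTD so that every named entity is declared.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineImgExtractor {

    // Hypothetical stand-in for a local copy of the XHTML DTD.
    // Assumption: the documents only use the &nbsp; entity.
    private static final String LOCAL_DTD = "<!ENTITY nbsp \"&#160;\">";

    /** Returns the src value of every img element, without fetching the DTD. */
    public static List<String> imgSources(String xhtml) throws Exception {
        final List<String> sources = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // non-validating, as suggested in the thread
        XMLReader reader = factory.newSAXParser().getXMLReader();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Answer every external entity request from memory,
                // so no HTTP connection is ever opened.
                return new InputSource(new StringReader(LOCAL_DTD));
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The factory is not namespace-aware by default, so use qName.
                if ("img".equals(qName)) {
                    String src = atts.getValue("src");
                    if (src != null) {
                        sources.add(src);
                    }
                }
            }
        };

        reader.setContentHandler(handler);
        reader.setEntityResolver(handler);
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return sources;
    }
}
```

Because a real parser reads the attributes, their order and the line breaks inside the tag do not matter. Parsers derived from Apache Xerces also document a feature named http://apache.org/xml/features/nonvalidating/load-external-dtd that can be switched off; whether a given parser recognizes it is something to check before relying on it.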


Re: quick and easy way to parse XML

Stefan Schulz wrote:

> Off the top of my head:
>
> "<\p{Space}*img\p{Space}+src=\"\p{Graph}+\"\p{Space}*/?>" should match pretty
> much any img tag that has no alt, height etc. attributes. How to add them...
> look at the alternation operator (it is the | )

but this is not good enough for me (this is why I went for XML parsing
in the first place). All I know about my mark-up is that it's well-formed,
but I don't know anything about the order or the availability
of other attributes:


<img
src="pippo"
/>


<img alt="pippo"
height="25"
src="pippo"
/>


<img src="pippo" alt="pippo"
height="25" />


<img height="35" src="pippo" />

These are all good. BTW the XML guys claimed confidently that RegExps
are, generally speaking, not powerful enough to parse XML!

Luca



luca
09-28-04 09:13 PM


Re: quick and easy way to parse XML
On Tue, 28 Sep 2004 13:49:52 GMT, luca <passani@eunet.no> wrote:

> Stefan Schulz wrote:
>
> but this is not good enough for me (this is why I went for XML parsing
> in the first place). All I know about my mark-up is that it's
> well-formed,
> but I don't know anything about the order or the availability
> of other attributes:

Well, in that case do what I said: within the tag, make an alternation of all the
possible attributes (refer to the DTD for the list of allowed attributes).

>
> These are all good. BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

Generally speaking, this is true. In this particular case, however, you can do it,
since the only thing XML can do that regular expressions cannot is build trees.

img tags, however, are necessarily leaves on the document tree.


--

Whom the gods wish to destroy they first call promising.

Stefan Schulz
09-28-04 09:13 PM
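
A minimal sketch of the regular expression route for the attribute-order problem, assuming Java's java.util.regex. Instead of listing every allowed attribute as an alternative, this variant skips over anything that is not ">" until it reaches src, which is a swapped-in simplification and not the pattern posted above. The class name ImgSrcRegex and the sample file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRegex {

    // <img, then any run of non-'>' characters (this includes newlines),
    // then whitespace, src, '=', and a quoted value captured in group 2.
    // Group 1 remembers which quote character opened the value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1[^>]*>");

    /** Returns the src value of every img tag found in the markup. */
    public static List<String> imgSources(String markup) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(markup);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        // The four layouts from the question, with distinct file names.
        String page = "<p><img\nsrc=\"a.gif\"\n/>"
                + "<img alt=\"pippo\"\nheight=\"25\"\nsrc=\"b.gif\"\n/>"
                + "<img src=\"c.gif\" alt=\"pippo\"\nheight=\"25\" />"
                + "<img height=\"35\" src=\"d.gif\" /></p>";
        System.out.println(imgSources(page)); // prints [a.gif, b.gif, c.gif, d.gif]
    }
}
```

The pattern relies on the input being well-formed and will be fooled by a ">" inside another attribute value or by an img tag inside a comment or CDATA section, so it is a shortcut for markup you control, not a general XML parser.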


Re: quick and easy way to parse XML
luca wrote:
>
> BTW the XML guys claimed confidently that RegExps
> are, generally speaking, not powerful enough to parse XML!

They aren't. Parsing XML requires a stack (or more precisely, the parser
needs to remember all the previous states that led to the current
state.) Regular languages can be recognized without remembering that
history; a fixed, finite set of states is enough.

However, some of the available regular expression packages contain
constructs that are quite a bit more powerful than real regular expressions.

--
Daniel Sjöblom
Remove _NOSPAM to reply by mail

Daniel Sjöblom
09-28-04 09:13 PM
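
To illustrate the point about the stack, here is a small sketch, assuming simplified markup with no comments or CDATA sections. A regular expression is used only to find individual tags; deciding whether they nest correctly needs a stack of the currently open element names. The class name TagBalanceCheck is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagBalanceCheck {

    // Group 1: "/" for a closing tag; group 2: element name;
    // group 3: "/" for a self-closing tag such as <img ... />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*?(/?)>");

    /** Returns true if every opened element is closed in the right order. */
    public static boolean isBalanced(String markup) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                // A closing tag must match the most recently opened element.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!selfClosing) {
                open.push(name);
            }
        }
        return open.isEmpty();
    }
}
```

The stack can grow as deep as the document nests, which is exactly the unbounded memory a pure regular expression lacks. A self-closing img tag never touches the stack, which fits the earlier remark that img elements are leaves.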


Re: quick and easy way to parse XML
luca wrote:

> Paul Lutus wrote:
>
> Which class? How do you handle:
>
> <img
>   src="pippo"
> />

To what does "you" refer? Existing classes, or your own classes? The answer
in both cases is "easily", but that is beside the point.

> I just did. The stupid parser tries to open an HTTP connection to retrieve
> the DTD

How long will this take? I already told you -- write your own parsing class.

> If this is a myth, it is one that the XML industry has contributed
> to fuel. Have a look at the first line of:
>
> http://www.w3.org/XML/
>
> "Extensible Markup Language (XML) is a simple, very flexible text format
> derived from SGML"

And simple languages can be used to convey complex ideas. If that were not
true, the language would be abandoned.

> the problem is that a lot of complexity is there also for mega-simple
> stuff.

No, not really. Simple tasks can be handled using simple XML. Complex tasks
require complex XML.
 
> So, how do you handle:
>
> <img
>   src="pippo"
> />
>
> with regexps in Java?

Trivially:

String result = original.replaceAll("\\n+"," ");

Working example:

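public class Test {

    public static void main(String[] args) {
        // An img tag split across three lines, as in the question.
        String a = "<img\n"
                + "src=\"pippo\"\n"
                + "/>";
        // Collapse each run of newlines into a single space.
        String b = a.replaceAll("\\n+", " ");
        System.out.println(a + " -> " + b);
    }
}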

Result:

<img
src="pippo"
/> -> <img src="pippo" />

Wow, that was really hard!

--
Paul Lutus
http://www.arachnoid.com


Paul Lutus
09-28-04 09:13 PM


All times are GMT.

 

Copyrights CodeComments.com 2004 - 2006
