Regex to grab keywords from HTML header
| Digital.Rebel.18@gmail.com 2005-11-18, 7:04 pm |
| I'm trying to figure out how to extract the keywords from an HTML
document.
The input string would typically look like:
<meta name='keywords' content='word1, more stuff, etc'>
Either single or double quotes can be used, and there can be any
number of spaces or line breaks between elements. Keywords can contain
special characters other than a comma or a closing angle bracket (>). For
example, the HTML might be:
<
meta name =
'
keywords'
content=
"word1 ,
more
stuff
,
etc"
>
The best thing would be to have a routine that returns one
keyword at a time (the keywords are separated by commas). However, I'd
be happy just to have the routine return only the keywords without the
surrounding HTML.
Here's what I've tried so far for a Regex string.
"[< ][\s\n\r\t]*meta[\s\n\r\t]name[\s\n\r\t]*='[\s\n\r\t]*'[\s\n\r\t]*keywords[\s\n\r\t]content[\s\n\r\t]*=[\s\n\r\t]*'[^>]'[\s\n\r\t]*>"
It's not working very well :) (this regex stuff is complicated!)
Can anybody help a regex newbie?
| |
|
| You might want to try Expresso
http://www.ultrapico.com/
or
Regulator (search Google for the address)
HIH
| |
|
| Try one of the many RegEx sites:
How To Use Regular Expressions in Microsoft Visual Basic 6.0
http://support.microsoft.com/defaul...kb;en-us;818802
RegEx Tutorial for VB:
http://juicystudio.com/tutorial/vb/regexp.asp
RegEx Library:
http://www.regexlib.com/
RegEx Module for VB:
http://www.aivosto.com/regexpr.html
--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--
| |
| Digital.Rebel.18@gmail.com 2005-11-18, 7:04 pm |
| Woohoo! Great reference Boni!
Here's the regex string that returns the keywords:
<\s*meta\s*name\s*=\s*"\s*keywords\s*"\s*content\s*=\s*"\s*([^"]+)"\s*>
This makes a lot more sense now...
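The pattern above accepts only double quotes, while the original question allowed either quote style. A minimal sketch of one way to handle both follows, written in Python purely for illustration since the thread does not settle on VB6 or VB.Net. The names META_KEYWORDS and extract_keyword_content are hypothetical, and the sketch assumes name= always comes before content=, as in the examples posted. The backreferences (\1, \2) require the closing quote to match the opening one; the same pattern syntax should carry over to other engines that support backreferences.

```python
import re

# Hypothetical names; not from the thread.
# Assumption: the name attribute precedes the content attribute.
META_KEYWORDS = re.compile(
    r"""<\s*meta\s+name\s*=\s*(['"])\s*keywords\s*\1"""  # name='keywords' or name="keywords"
    r"""\s*content\s*=\s*(['"])([^>]*?)\2\s*/?\s*>""",   # content value in matching quotes
    re.IGNORECASE,
)


def extract_keyword_content(html):
    """Return the raw content string of the keywords meta tag, or None."""
    match = META_KEYWORDS.search(html)
    return match.group(3) if match else None


if __name__ == "__main__":
    # The spread-out example from the original question.
    sample = "<\nmeta name =\n'\nkeywords'\ncontent=\n\"word1 ,\nmore\nstuff\n,\netc\"\n>"
    print(repr(extract_keyword_content(sample)))
```

Because [^>] also matches line breaks, no multiline option is needed. The value is captured lazily up to the matching quote that is followed by the end of the tag.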
Is there a way to further parse the keywords inside the ([^"]+) group
by adding to the pattern above?
Keywords are listed as:
at least one keyword (ending in either a quote or a comma)
if it ends with a quote, then throw away the quote and we're done.
if it ends with a comma, then look for repeating groups of [,next keyword]
and throw away the comma each time.
I've looked at a number of tutorials online, but this part is more
complicated.
Note: Veign - the "juicystudio" reference 404'd out :(
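On splitting the captured group into individual keywords: in many regex engines a repeated capturing group reports only its last repetition, so extending the pattern itself tends to be awkward. A simpler route, sketched here in Python under that assumption, is to capture the whole content string first and then split it on commas in a second step. The function names split_keywords and iter_keywords are hypothetical and not from the thread.

```python
def split_keywords(content):
    """Split a captured content string on commas and tidy each keyword."""
    keywords = []
    for part in content.split(","):
        cleaned = " ".join(part.split())  # collapse spaces and line breaks
        if cleaned:                       # skip empties left by stray commas
            keywords.append(cleaned)
    return keywords


def iter_keywords(content):
    """Yield one keyword at a time, as the original question asked."""
    yield from split_keywords(content)


if __name__ == "__main__":
    # Content string as it would be captured from the spread-out example.
    for keyword in iter_keywords("word1 ,\nmore\nstuff\n,\netc"):
        print(keyword)
```

Keeping the split separate from the tag match means each step can be checked on its own, and the comma handling does not depend on which quote style the tag used.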
| |
|
|
| Michael Cole 2005-11-20, 9:59 pm |
| Digital.Rebel.18@gmail.com wrote:
> I'm trying to figure out how to extract the keywords from an HTML
> document.
> The input string would typically look like:
> <meta name='keywords' content='word1, more stuff, etc'>
You have posted this to both a dotnet group and a VB Classic group - the two
are different languages. You need to specify which language you are using,
because...
--
<response type="generic" language="VB.Net">
This newsgroup (.vb.syntax) is for users of Visual Basic version 6.0
and earlier and not the misleadingly named VB.Net
or VB 200x. Solutions, and often even the questions,
for one platform will be meaningless in the other.
When VB.Net was released Microsoft created new newsgroups
devoted to the new platform so that neither group of
developers need wade through the clutter of unrelated
topics. Look for newsgroups with the words "dotnet" or
"vsnet" in their name. For the msnews.microsoft.com news
server try these:
microsoft.public.dotnet.general
microsoft.public.dotnet.languages.vb
</response>
--
Regards,
Michael Cole