











 

Search pattern for non-ASCII alphabetic characters
Hermann Peifer

2008-02-03, 7:01 pm

Hi,

Occasionally, I'd like to search for non-ASCII alphabetic characters in
UTF-8 encoded text documents.

In the absence of an appropriate character class (at least I wouldn't
know of any), I do something like:

awk '/[ÀÁÂÃÄÅ ...and so on... ŸŹźŻżŽž]/{ action }'

This is perhaps not the smartest solution. Any better idea?

TIA. Hermann

Janis Papanagnou

2008-02-03, 7:01 pm

Hermann Peifer wrote:
> Hi,
>
> Occasionally, I'd like to search for non-ASCII alphabetic characters in
> UTF-8 encoded text documents.
>
> In the absence of an appropriate character class (at least I wouldn't
> know of any), I do something like:
>
> awk '/[ÀÁÂÃÄÅ ...and so on... ŸŹźŻżŽž]/{ action }'
>
> This is perhaps not the smartest solution. Any better idea?
>
> TIA. Hermann


I can't tell if it is a smarter solution but you could use the inverse
logic based on the existing character classes...

LANG=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'

(Note: there's also the ANSI character class [:ascii:] but my GNU awk
seems to not support it.)

Janis
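
A minimal sketch of the inverse-logic pattern above, run on made-up sample lines. It assumes an awk that understands POSIX character classes (gawk does) and uses LC_ALL=C rather than LANG=C, because LC_ALL overrides LANG and any other LC_* variable that may be set in the environment. The variable names `sample` and `matches` are illustrative only.

```shell
# Made-up UTF-8 sample input, one test case per line.
sample='plain ASCII line
Žižek wrote this
N°1 in the list
punctuation, digits 123: ok!'

# In the C locale awk compares single bytes. Every byte of a multi-byte
# UTF-8 character is >= 0x80, so it belongs to none of the four classes
# and the negated bracket expression matches it.
matches=$(printf '%s\n' "$sample" |
  LC_ALL=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/')

printf '%s\n' "$matches"
```

This should print the "Žižek" line and the "N°1" line, so the pattern reports any non-ASCII character, not only letters. As for [:ascii:]: it is not among the twelve classes that POSIX defines, which may be why gawk rejects it.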

Hermann Peifer

2008-02-03, 7:01 pm

Janis Papanagnou wrote:
> Hermann Peifer wrote:
>
> I can't tell if it is a smarter solution but you could use the inverse
> logic based on the existing character classes...
>
> LANG=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'
>
> (Note: there's also the ANSI character class [:ascii:] but my GNU awk
> seems to not support it.)
>
> Janis



Thanks for the hint. This pattern also finds: N°1
Which does not exactly contain a non-ASCII *alphabetic* character, but
it's still better than the long character list I was using. I can filter
out some false positives.

Hermann
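
One possible way to filter out false positives such as N°1, sketched under assumptions: delete a hand-picked list of non-alphabetic, non-ASCII characters from a copy of each line, then apply the inverse-logic pattern to what is left. The exclusion list (°, ×, ÷, §) is hypothetical and would have to be extended for real data, and the names `sample`, `matches` and `rest` are illustrative only.

```shell
# Made-up UTF-8 sample input, one test case per line.
sample='plain ASCII line
Žižek wrote this
N°1 in the list
5 × 3 = 15
café § 4'

matches=$(printf '%s\n' "$sample" | LC_ALL=C awk '
{
  rest = $0
  # Hypothetical exclusion list: non-ASCII characters that are not letters.
  # In the C locale each alternative is matched as its UTF-8 byte sequence.
  gsub(/°|×|÷|§/, "", rest)
}
# Print the original line if any non-ASCII byte survives in the copy.
rest ~ /[^[:alnum:][:punct:][:blank:][:cntrl:]]/
')

printf '%s\n' "$matches"
```

With this sample, the "N°1" and "5 × 3" lines should drop out, while the "Žižek" and "café" lines remain. If a UTF-8 locale is installed, a locale-aware awk such as gawk may offer a shorter route, since [[:alpha:]] can then match non-ASCII letters directly; that variant is not shown here because it depends on the locales available on the system.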








Copyright 2008 codecomments.com