Search pattern for non-ASCII alphabetic characters
| Hermann Peifer 2008-02-03, 7:01 pm |
| Hi,
Occasionally, I'd like to search for non-ASCII alphabetic characters in
UTF-8 encoded text documents.
In the absence of an appropriate character class (at least I wouldn't
know of any), I do something like:
awk '/[ÀÁÂÃÄÅ ...and so on... ŸŹźŻżŽž]/{ action }'
This is perhaps not the smartest solution. Any better idea?
TIA. Hermann
| |
| Janis Papanagnou 2008-02-03, 7:01 pm |
| Hermann Peifer wrote:
> Hi,
>
> Occasionally, I'd like to search for non-ASCII alphabetic characters in
> UTF-8 encoded text documents.
>
> In the absence of an appropriate character class (at least I wouldn't
> know of any), I do something like:
>
> awk '/[ÀÁÂÃÄÅ ...and so on... ŸŹźŻżŽž]/{ action }'
>
> This is perhaps not the smartest solution. Any better idea?
>
> TIA. Hermann
I can't tell if it is a smarter solution but you could use the inverse
logic based on the existing character classes...
LANG=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'
(Note: there's also the ANSI character class [:ascii:] but my GNU awk
seems to not support it.)
Janis
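
The inverse-logic pattern above can be tried on a small sample. This is a minimal sketch, not a definitive recipe: the function name find_non_ascii and the sample lines are made up for illustration, and the non-ASCII characters are written as octal escapes (é = \303\251, ° = \302\260 in UTF-8) so the script itself stays pure ASCII. It assumes an awk that supports POSIX character classes (e.g. gawk).

```shell
# Hypothetical helper: print every input line containing a byte that is
# not alphanumeric, punctuation, blank or control in the C locale,
# i.e. any byte outside the ASCII range.
find_non_ascii() {
    LC_ALL=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'
}

# Three sample lines: pure ASCII, "caf" + e-acute, and "N" + degree sign + "1".
printf 'plain ascii\ncaf\303\251\nN\302\2601\n' | find_non_ascii
```

The sketch sets LC_ALL rather than LANG because LC_ALL takes precedence: if LC_ALL or LC_CTYPE is already set in the environment, LANG=C alone does not switch awk to the C locale.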
| |
| Hermann Peifer 2008-02-03, 7:01 pm |
| Janis Papanagnou wrote:
> Hermann Peifer wrote:
>
> I can't tell if it is a smarter solution but you could use the inverse
> logic based on the existing character classes...
>
> LANG=C awk '/[^[:alnum:][:punct:][:blank:][:cntrl:]]/'
>
> (Note: there's also the ANSI character class [:ascii:] but my GNU awk
> seems to not support it.)
>
> Janis
Thanks for the hint. This pattern also finds: N°1
Which does not exactly contain a non-ASCII *alphabetic* character, but
it's still better than the long character list I was using. I can filter
out some false positives.
Hermann
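
One way to filter out such false positives might look like the sketch below. It is only an illustration under stated assumptions: the function name filter_non_ascii_alpha is hypothetical, and the exclusion list contains just the degree sign (UTF-8 bytes \302\260); a real list would have to be extended with whatever non-alphabetic characters turn up in the data.

```shell
# Hypothetical helper: like the pattern above, but known non-alphabetic
# non-ASCII characters are removed from a copy of the line before testing.
filter_non_ascii_alpha() {
    LC_ALL=C awk '
    {
        line = $0
        # Assumed exclusion list: only the degree sign (U+00B0),
        # which is the byte pair 0xC2 0xB0 in UTF-8.
        gsub(/\302\260/, "", line)
        # Print the original line if non-ASCII bytes remain in the copy.
        if (line ~ /[^[:alnum:][:punct:][:blank:][:cntrl:]]/) print
    }'
}

# "caf" + e-acute should be reported; "N" + degree sign + "1" should not.
printf 'caf\303\251\nN\302\2601\nplain ascii\n' | filter_non_ascii_alpha
```

This works on bytes, not characters, so it still cannot tell letters from other non-ASCII symbols that are missing from the exclusion list.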