Trying to toss input with non-printable characters
| lvirden@yahoo.com 2004-05-17, 12:36 pm |
|
I'm trying to write a tiny script that reads its stdin and, if
the input line has above a particular threshold of characters
that won't print out normally, drop them. Here's what I have:
#! /usr/tcl84/bin/tclsh
# while more data on stdin
#   get a line of data
#   if line contains less than N non printable bytes
#     print to stdout
# end loop
# How many non-printables in a row constitutes an offense?
set N 4
while { [gets stdin input] >= 0 } {
    if {[regexp -- \[:graph:\]\{$N,\} $input] == 0 } {
        puts $input
    }
}
However, when I run the above, I'm not getting the results I'd like.
For instance, here's one of the lines I want to drop:
¢½¢¾¢Î¾ÆÀú¾¾~Á¦
For some reason, the filter above doesn't reject this.
Can anyone give me a tip on how to improve my filter so that it
drops the stuff ?
--
<URL: http://wiki.tcl.tk/ > In God we trust.
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.
<URL: mailto:lvirden@yahoo.com > <URL: http://www.purl.org/NET/lvirden/ >
| |
| Bruce Hartweg 2004-05-17, 1:39 pm |
|
lvirden@yahoo.com wrote:
> I'm trying to write a tiny script that reads its stdin and, if
> the input line has above a particular threshold of characters
> that won't print out normally, drop them. Here's what I have:
>
>
> #! /usr/tcl84/bin/tclsh
>
> # while more data on stdin
> # get a line of data
> # if line contains less than N non printable bytes
> # print to stdout
> # end loop
>
> # How many non-printables in a row constitutes an offense?
> set N 4
>
> while { [gets stdin input] >= 0 } {
>
> if {[regexp -- \[:graph:\]\{$N,\} $input] == 0 } {
> puts $input
> }
>
> }
>
a couple of things: the [:graph:] character class is only
valid *inside* a bracket expression, so you would actually
need [[:graph:]]
also, your expression does not match your description. you
state you want to skip the line if there are 4 or more bad
chars in a row, but you are actually checking the opposite
(i.e., 4 good chars in a row) and then negating the result.
so instead of looking for 4 bad, you have NOT(4 good): a line
is printed only when it has no run of 4 good chars, so you will
drop a line with no bad characters at all as soon as it has
4 good chars in a row, and you will print a line full of bad
characters as long as it never has 4 good ones in a row.
a better check would be
if {[regexp -- "\[^\[:graph:\]\]{$N,}" $input] == 0} {
    # no run of N or more non-graphic characters: keep the line
    puts $input
}
Bruce
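A minimal sketch comparing the two senses of the test, assuming the double-bracket fix is applied to both. The variable names good and bad and the two sample strings are made up for illustration, with control characters standing in for unprintable input; they are not from the original script.

```tcl
set N 4
# N or more graphic characters in a row (the sense of the posted test)
set good [format {[[:graph:]]{%d,}} $N]
# N or more non-graphic characters in a row (the sense of the description)
set bad  [format {[^[:graph:]]{%d,}} $N]

# Illustrative inputs: control characters stand in for unprintable bytes.
set clean "hello"
set junk  "ab\u0001\u0002\u0003\u0004\u0005yz"

# In both variants a line is kept when the regexp does NOT match.
foreach {name line} [list clean $clean junk $junk] {
    set keptByGood [expr {[regexp -- $good $line] == 0}]
    set keptByBad  [expr {[regexp -- $bad  $line] == 0}]
    puts "${name}: NOT(good run) keeps=$keptByGood, NOT(bad run) keeps=$keptByBad"
}
```

With these inputs only the negated class keeps the clean line and drops the junk one; the posted sense does the reverse.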
| |
| Glenn Jackman 2004-05-17, 2:34 pm |
| The 'graph' character class indicates "A character with a visible
representation". I tend to think of that as a character that will use
ink if printed. Your sample line contains visible characters, although
perhaps your application cannot display them. Perhaps you want
non-ASCII instead?
# RE represents whitespace and the visible ASCII characters
# '!' through '~'
set RE {[^\s!-~]}
set input {¢½¢¾¢Î¾ÆÀú¾¾~Á¦}
set N 4
if {[regexp -- "$RE{$N,}" $input]} {
    puts "too many consecutive non-ASCII chars: '$input'"
}
Your requirements seem confusing:
"if line contains less than N non printable bytes"
or
"How many non-printables in a row..."
If you want only lines with fewer than $N non-ASCII chars in total,
reject those with $N or more:
if {[regexp -all -- {[^\s!-~]} $input] >= $N} {
    puts "input contains $N or more non-ASCII characters: '$input'"
}
Also, the [:graph:] character class itself must be placed within a
[bracketed expression], so the proper usage is, for example:
# note the "double" brackets
if {[regexp {^[[:xdigit:]]+$} $aString]} {puts "a hex number"}
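A rough sketch of the two readings of the requirement side by side, assuming the same [^\s!-~] class. The helper names hasRun and countAll and the sample string are hypothetical; the sample is written with \u escapes so that it does not depend on the encoding of the script file.

```tcl
# Anything that is neither whitespace nor a visible ASCII character
# ('!' through '~').
set nonAscii {[^\s!-~]}

# Hypothetical helper: 1 if $line has $n or more such characters in a row.
proc hasRun {line n} {
    global nonAscii
    return [regexp -- "${nonAscii}{$n,}" $line]
}

# Hypothetical helper: total number of such characters anywhere in $line.
proc countAll {line} {
    global nonAscii
    return [regexp -all -- $nonAscii $line]
}

# Two non-ASCII characters, some plain text, then three more.
set sample "\u00a2\u00bd ok \u00a2\u00be\u00a2"
puts [hasRun $sample 4]    ;# is there a run of at least 4?
puts [countAll $sample]    ;# how many in total?
```

The sample has five non-ASCII characters but no run longer than three, so the two criteria disagree about it. That is why it matters which one is meant.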
lvirden@yahoo.com <lvirden@yahoo.com> wrote:
>
> I'm trying to write a tiny script that reads its stdin and, if
> the input line has above a particular threshold of characters
> that won't print out normally, drop them. Here's what I have:
>
>
> #! /usr/tcl84/bin/tclsh
>
> # while more data on stdin
> # get a line of data
> # if line contains less than N non printable bytes
> # print to stdout
> # end loop
>
> # How many non-printables in a row constitutes an offense?
> set N 4
>
> while { [gets stdin input] >= 0 } {
>
> if {[regexp -- \[:graph:\]\{$N,\} $input] == 0 } {
> puts $input
> }
>
> }
>
>
> However, when I run the above, I'm not getting the results I'd like.
> For instance, here's one of the lines I want to drop:
>
> ¢½¢¾¢Î¾ÆÀú¾¾~Á¦
>
>
> For some reason, the filter above doesn't reject this.
>
> Can anyone give me a tip on how to improve my filter so that it
> drops the stuff ?
--
Glenn Jackman
NCF Sy min
glennj@ncf.ca
| |
| Donal K. Fellows 2004-05-18, 5:31 am |
| lvirden@yahoo.com wrote:
> I'm trying to write a tiny script that reads its stdin and, if
> the input line has above a particular threshold of characters
> that won't print out normally, drop them. Here's what I have:
Two comments. First, I prefer to construct a regular expression once
and stash it in a variable rather than building it dynamically every
time (and if you do that, you make the RE engine run quickly too by
handling caching yourself.) Secondly, the sense of your RE is very
odd, and that is what is making things go wrong. Here are some
alternatives.
### ALTERNATIVE ONE ###
# Drop lines containing too many bad chars in a row
set RE [format {[^[:graph:]]{%d,}} $N]
while {[gets stdin line] > -1} {
    if {[regexp $RE $line]} continue
    puts $line
}
### ALTERNATIVE TWO ###
# Drop lines containing too many bad chars in total
set RE {[^[:graph:]]}
while {[gets stdin line] > -1} {
    if {[regexp -all $RE $line] >= $N} continue
    puts $line
}
Hope these help.
Donal.
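A small sketch of the "build the RE once" idea, written so that it can be tried without piping anything into stdin. The helper keepGood and the sample list are assumptions made for illustration; the filtering logic is that of ALTERNATIVE ONE.

```tcl
# Build the RE once, outside the loop.
set N 4
set RE [format {[^[:graph:]]{%d,}} $N]

# Hypothetical helper: return the elements of $lines that do NOT match $re.
proc keepGood {lines re} {
    set kept {}
    foreach line $lines {
        if {[regexp -- $re $line]} continue
        lappend kept $line
    }
    return $kept
}

# Control characters stand in for unprintable input.
set lines [list "first line" "bad\u0001\u0002\u0003\u0004line" "last"]
puts [keepGood $lines $RE]
```

The single space in "first line" is a non-graphic character too, but one alone does not reach the threshold of four in a row.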
| |
| lvirden@yahoo.com 2004-05-18, 9:35 am |
|
According to Donal K. Fellows <donal.k.fellows@man.ac.uk>:
:lvirden@yahoo.com wrote:
:> I'm trying to write a tiny script that reads its stdin and, if
:> the input line has above a particular threshold of characters
:> that won't print out normally, drop them. Here's what I have:
:
:Two comments. First, I prefer to construct a regular expression once
:and stash it in a variable rather than building it dynamically every
:time (and if you do that, you make the RE engine run quickly too by
:handling caching yourself.)
Great idea - the application doesn't run much, but it is a great
idea and I will take advantage of this if I can.
:Secondly, the sense of your RE is very
:odd, and that is what is making things go wrong. Here are some
:alternatives.
Yes, I don't know why I didn't think of negating the expression.
Sigh - my brain just wasn't firing on Monday morning I guess.
: ### ALTERNATIVE ONE ###
: # Drop lines containing too many bad chars in a row
: set RE [format {[^[:graph:]]{%d,}} $N]
: while {[gets stdin line] > -1} {
: if {[regexp $RE $line]} continue
: puts $line
: }
Okay, I tried this example. However, more items are being removed
than should match.
So I changed the code to be this:
set N 4
set RE [format {([^[:graph:]]{%d,})} $N]
while { [gets stdin input] > -1 } {
    set m1 ""
    set m2 ""
    set m3 ""
    if {[regexp -- $RE $input m1 m2 m3] == 0 } {
        puts $input
    } else {
        puts stderr "$input has m1 = .$m1. , m2 = .$m2. , m3 = .$m3."
    }
}
and I ran it. My results say:
Id        Time  Score From             Subject has m1 = .        . , m2 = .        . , m3 = ..
or, as od -bc says:
0000000 111 144 040 040 040 040 040 040 040 040 124 151 155 145 040 040
          I   d                                   T   i   m   e
0000020 123 143 157 162 145 040 106 162 157 155 040 040 040 040 040 040
          S   c   o   r   e       F   r   o   m
0000040 040 040 040 040 040 040 040 123 165 142 152 145 143 164 040 150
                                      S   u   b   j   e   c   t       h
0000060 141 163 040 155 061 040 075 040 056 040 040 040 040 040 040 040
          a   s       m   1       =       .
0000100 040 056 040 054 040 155 062 040 075 040 056 040 040 040 040 040
              .       ,       m   2       =       .
0000120 040 040 040 056 040 054 040 155 063 040 075 040 056 056 012
                      .       ,       m   3       =       .   .  \n
0000137
If I add [:space:] to the RE, then none of the items are removed.
I'm just trying to get rid of lines which have 4 or more characters > 127 .
I'm still missing something here.
So I'm . Why is
--
<URL: http://wiki.tcl.tk/ > In God we trust.
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.
<URL: mailto:lvirden@yahoo.com > <URL: http://www.purl.org/NET/lvirden/ >
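The od dump suggests that the matched text is a run of plain spaces (octal 040), which are not in the [:graph:] class. A minimal sketch that reproduces this on a header line rebuilt from the dump; the variable names are made up for illustration.

```tcl
set RE {[^[:graph:]]{4,}}

# Header line rebuilt from the od dump: 8 spaces after "Id",
# 13 spaces before "Subject".
set header "Id[string repeat { } 8]Time  Score From[string repeat { } 13]Subject"

# -inline returns the matched text, so the offending run can be inspected.
set match [lindex [regexp -inline -- $RE $header] 0]
puts [string length $match]      ;# length of the first offending run
puts [string is space $match]    ;# 1 if that run is nothing but whitespace

# Excluding whitespace as well stops this header from matching.
puts [regexp -- {[^[:graph:][:space:]]{4,}} $header]
```

This only explains why the header is caught. It does not by itself catch the sample line, which, as noted elsewhere in the thread, consists of characters that count as visible.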
| |
| Donal K. Fellows 2004-05-18, 10:46 am |
| lvirden@yahoo.com wrote:
> I'm just trying to get rid of lines which have 4 or more characters > 127 .
> I'm still missing something here.
OK, now we have a better specification of what you're really up to. :^)
Try this:
while {[gets stdin line] > -1} {
    if {[regexp -all {[\u007f-\uffff]} $line] < 4} {
        puts $line
    }
}
OK, this version also throws out lines where the offending characters
are in discontiguous sequences, but that's probably OK in practice. If
you're handling a lot of non-English text, you probably need to
reconsider what you're doing though. That's much more app-specific...
Donal.
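A hedged sketch of the same counting idea wrapped in procs so it can be tried on individual strings. The helper names countHigh and keepLine are hypothetical, and the range here starts at \u0080 to match "characters > 127" literally, whereas the loop above starts at \u007f and so also counts DEL.

```tcl
# Hypothetical helper: number of characters in $line above U+007F.
proc countHigh {line} {
    return [regexp -all -- {[\u0080-\uffff]} $line]
}

# Hypothetical helper: 1 if $line has fewer than $limit such characters.
proc keepLine {line {limit 4}} {
    expr {[countHigh $line] < $limit}
}

puts [keepLine "Id        Time  Score From"]             ;# plain ASCII header
puts [keepLine "\u00a2\u00bd\u00a2\u00be\u00a2\u00ce"]   ;# six characters above 127
```

Note that this counts characters after Tcl has decoded the input using the channel's encoding, so it is not necessarily a count of raw bytes above 127.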