Code Comments

Stoll, Steven R. wrote:
> After solving the case sensitivity issue, separating the alternations, and
> solving the un-escaped /, here is what we are left with.
>
> (p(ost)?[.\s]*o(ffice)?[.\s]*box)
> po(b|x|drawer|stoffice|[ ]bx|box)
> p[\/]o
> b(x|ox|uzon)
> a(partado|ptdo)

If we unroll that to

post[.\s]*o(ffice)?[.\s]*box
p[.\s]*o(ffice)?[.\s]*box
pob
pox
podrawer
postoffice
po[ ]bx
pobox
p[\/]o
bx
box
buzon
apartado
aptdo

reassembling it, we obtain

(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))

I'm happy that they thought to check for 'pox' as a shorthand to postoffice
box. I'll remember to use that next time I need to address such a letter.

David
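
For anyone who wants to check the reassembly rather than eyeball it, here is a minimal Perl sketch. It assumes the patterns are meant to match whole strings (hence the \A and \z anchors, which are not in the original post), and the sample strings are hand-picked illustrations, one per unrolled alternative. The last branch is written as b(?:uzon|o?x), without a space, since that is what the unrolled "bx" and "box" require.

```perl
use strict;
use warnings;

# The unrolled alternatives, as listed in the post.
my @unrolled = (
    qr{post[.\s]*o(ffice)?[.\s]*box},
    qr{p[.\s]*o(ffice)?[.\s]*box},
    qr{pob}, qr{pox}, qr{podrawer}, qr{postoffice},
    qr{po[ ]bx}, qr{pobox}, qr{p[\/]o},
    qr{bx}, qr{box}, qr{buzon}, qr{apartado}, qr{aptdo},
);

# The reassembled pattern.
my $assembled = qr{(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))};

# Hand-picked sample strings (an assumption of this sketch), one per alternative.
my @samples = (
    "post office box", "p.o. box", "pob", "pox", "podrawer", "postoffice",
    "po bx", "pobox", "p/o", "bx", "box", "buzon", "apartado", "aptdo",
);

my @mismatches;
for my $s (@samples) {
    # Count how many unrolled alternatives match the whole string.
    my $by_list = grep { $s =~ /\A(?:$_)\z/ } @unrolled;
    # Check the reassembled pattern against the whole string.
    my $by_asm  = $s =~ /\A$assembled\z/;
    push @mismatches, $s unless $by_list && $by_asm;
    printf "%-16s list=%s assembled=%s\n",
        $s, ($by_list ? "yes" : "no"), ($by_asm ? "yes" : "no");
}

print @mismatches ? "MISMATCH: @mismatches\n" : "all samples agree\n";
```

This only shows that the two forms agree on these samples; it does not prove the patterns equivalent for every input.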

Dan Collins wrote:
> On Jan 7, 2008 5:06 PM, Uri Guttman <uri@stemsystems.com> wrote:
> > > I don't even want to know what that's supposed to do.
> First, and most obviously, that should use /i.

You can also find code like this in HTML::Template. Sam Tregar
explained the reason in the FAQ:

    Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!

    A: Simple - the case-insensitive match switch is very
    inefficient. According to "Mastering Regular Expressions"
    from O'Reilly Press, /[Tt]/ is faster and more space
    efficient than /t/i - by as much as double against long
    strings. //i essentially does a lc() on the string and
    keeps a temporary copy in memory.

    When this changes, and it is in the 5.6 development series,
    I will gladly use //i. Believe me, I realize [Tt] is hideously
    ugly.

» http://search.cpan.org/dist/HTML-Template/Template.pm#FREQUENTLY_ASKED_QUESTIONS

--
Sébastien Aperghis-Tramoni
Close the world, txEn eht nepO.

Sébastien Aperghis-Tramoni wrote:
> Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!
>
> A: Simple - the case-insensitive match switch is very
> inefficient. According to "Mastering Regular Expressions"
> from O'Reilly Press, /[Tt]/ is faster and more space
> efficient than /t/i - by as much as double against long
> strings. //i essentially does a lc() on the string and
> keeps a temporary copy in memory.
>
> When this changes, and it is in the 5.6 development series,
> I will gladly use //i. Believe me, I realize [Tt] is hideously
> ugly.

Looks like it was painfully true for 5.5...

$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'

real 0m4.882s
user 0m4.761s
sys 0m0.026s

$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'

real 0m40.656s
user 0m39.587s
sys 0m0.149s

And the reverse is now true in this highly inaccurate test...

$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'

real 0m5.732s
user 0m5.565s
sys 0m0.027s

$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'

real 0m2.589s
user 0m2.544s
sys 0m0.015s

--
<Schwern> What we learned was if you get, grab someone and swing them
around a few times
-- Life's lessons from square dancing
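
The same comparison can be sketched with the core Benchmark module instead of wall-clock `time`, which at least keeps perl's startup cost out of the numbers. This is only a rough sketch under the same assumptions as the one-liners (a 10000-character run of "x" with a single "T" at the end); the labels char_class and ignore_case are made up for illustration, and the results will depend on the perl version it is run under.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same shape of input as the one-liners: many "x" characters, one "T" at the end.
my $foo = ("x" x 10_000) . "T";

# Sanity check: both patterns should find the trailing "T".
die "character class failed to match" unless $foo =~ /[Tt]/;
die "/i failed to match"              unless $foo =~ /t/i;

# A negative count asks Benchmark to run each sub for at least that many
# CPU seconds, then print a table comparing the rates.
cmpthese(-1, {
    char_class  => sub { my $hit = $foo =~ /[Tt]/ },
    ignore_case => sub { my $hit = $foo =~ /t/i },
});
```

Running it under each perl of interest (for example perl5.8.8 and a 5.10 build) gives one comparison table per interpreter.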

Michael G Schwern wrote:
> And the reverse is now true in this highly inaccurate test...
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'
>
> real 0m5.732s
> user 0m5.565s
> sys 0m0.027s
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'
>
> real 0m2.589s
> user 0m2.544s
> sys 0m0.015s
And if I recall my perl510delta correctly, /i should be even faster on
5.10.0. No, hang on, it's when UTF-8 strings are involved.
% time perl5.8.8 -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'
real 0m22.855s
user 0m22.827s
sys 0m0.016s
% ./perl -v
This is perl, v5.10.0 DEVEL32604 built for i386-freebsd-thread-multi
% time ./perl -Ilib -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'
real 0m22.957s
user 0m22.948s
sys 0m0.001s
Well, look on the bright side. It's no worse.
The benchmark may be flawed, since my appreciation of Unicode is little
more than "things went downhill after 7-bit ASCII".
David

On Jan 11, 2008, at 8:01 AM, David Landgren wrote:
> The benchmark may be flawed, since my appreciation of Unicode is
> little more than "things went downhill after 7-bit ASCII".

Haven't I read that you live in Paris? I figured that anyone who
lives in a country whose dominant language was not fully expressible
in ASCII would love Unicode.

On a major tangent, have others noticed the resurgence of the umlaut
in printed English? I keep seeing things like coöperation or
coördinates -- particularly in Technology Review, but in other
publications on occasion too. Is that because it's *supposed* to be
spelled that way, but ASCII and the typewriter have suppressed that
spelling for my lifetime?

Chris

Chris Dolan wrote:
> On a major tangent, have others noticed the resurgence of the umlaut
> in printed English? I keep seeing things like coöperation or
> coördinates -- particularly in Technology Review, but in other
> publications on occasion too. Is that because it's *supposed* to be
> spelled that way, but ASCII and the typewriter have suppressed that
> spelling for my lifetime?

A quick use of Google-fu unearthed a blog entry
http://www.dwelle.org/archives/2007...l-the-umlauts/, which in turn
pointed to the page
http://ourworld.compuserve.com/homepages/profirst/d.htm that says:

    *dieresis* or *diæresis*: A diacritical mark ( ¨ ) optionally used
    in English, oftentimes replaced by a hyphen. In English, the
    dieresis is used on a second identical vowel to indicate a change
    in pronunciation of that vowel or indicate it is pronounced in a
    separate syllable. It is sometimes referred to as an « umlaut »
    when used with a single character or in a « diphthong. »
    Examples: reëlecting, reëncoding, coöperation, coördination.

Well I, for one, never knew that such a thing existed. Neato! Too bad
about the name of the mark, though, which is definitely unfortunate.

Joÿ,
`/anick

On Sun, Jan 13, 2008 at 12:23:35AM +0100, Georg Moritz wrote:
> Well, that's sort of quotemeta for the double o - differentiating e.g.
> double-o usage invs. cooperation. I haven't seen that usage in
> english yet, but it's used in spanish to mark a vowel as literal, e.g. in
> "Parque Güell".

The only English word I think it's commonly seen with is naïve, to
indicate that the ai isn't a digraph.

--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom, "Arcadia"

Chris Dolan wrote:
> Haven't I read that you live in Paris? I figured that anyone who lives
> in a country whose dominant language was not fully expressible in ASCII
> would love Unicode.

"Not fully expressible" seems mild to apply to writing French in ASCII
(which after all has no diacritics). The phrase seems more appropriate
for writing French in ISO-8859-1 (because of the lack of "oe" ligature).

--
Keith C. Ivey <keith@iveys.org>
Washington, DC

* Chris Dolan <chris@chrisdolan.net> [2008-01-12 23:55]:
> I figured that anyone who lives in a country whose dominant
> language was not fully expressible in ASCII would love Unicode.

For bonus points, try writing, say, German (fully expressible with
an ISO-8859 charset) and Greek (fully expressible[^1] with an
ISO-8859 charset) in the same document.

[1]: Well, Modern Greek anyway.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>