AWK archive, March 2007: gensub "g"
| Hi,
I need to fix names in a BibTeX file, which looks like:
@ARTICLE{1690,
author = {Andrades-Miranda, J. and Oliveira, L.F.B. and Lima-Rosa, C.A.V. and
Nunes, A.P. and Zanchin, N.I.T. and Mattevi, M.S.},
...
}
where each initial needs to be separated by a space; e.g. Lima-Rosa,
C.A.V. above should be rewritten as Lima-Rosa, C. A. V. Of course, names
may not only appear in the author field (in the BibTeX sense), but also
elsewhere in the entry (e.g. editor). I thought I might do this with:
gawk --re-interval '{print gensub(/([A-Z]{1}\.)([A-Z]{1}\.)/, "\\1 \\2", "g")}' myfile.bib
but this returns:
@ARTICLE{1690,
author = {Andrades-Miranda, J. and Oliveira, L. F.B. and Lima-Rosa, C. A.V. and
Nunes, A. P. and Zanchin, N. I.T. and Mattevi, M. S.},
...
}
I would have expected the "g" argument to do the replacements everywhere
in each line. Any ideas as to what I'm missing here?
Cheers,
--
Seb
| |
| Vassilis 2007-03-05, 6:58 pm |
|
Seb wrote:
> Hi,
>
> I need to fix names in a BibTeX file, which looks like:
>
>
> @ARTICLE{1690,
> author = {Andrades-Miranda, J. and Oliveira, L.F.B. and Lima-Rosa, C.A.V. and
> Nunes, A.P. and Zanchin, N.I.T. and Mattevi, M.S.},
> ...
> }
>
>
> where each initial needs to be separated by a space; e.g. Lima-Rosa,
> C.A.V. above should be rewritten as Lima-Rosa, C. A. V. Of course, names
> may not only appear in the author field (in the BibTeX sense), but also
> elsewhere in the entry (e.g. editor). I thought I might do this with:
>
>
> gawk --re-interval '{print gensub(/([A-Z]{1}\.)([A-Z]{1}\.)/, "\\1 \\2", "g")}' myfile.bib
>
>
> but this returns:
>
>
> @ARTICLE{1690,
> author = {Andrades-Miranda, J. and Oliveira, L. F.B. and Lima-Rosa, C. A.V. and
> Nunes, A. P. and Zanchin, N. I.T. and Mattevi, M. S.},
> ...
> }
>
>
> I would have expected the "g" argument to do the replacements everywhere
> in each line. Any ideas as to what I'm missing here?
>
>
> Cheers,
>
> --
> Seb
First off, I don't think intervals are necessary in this case ([A-Z]
already matches exactly one character).
Anyway, regexps match leftmost longest, and a global substitution only
picks up matches that do not overlap.
That is, in the second record of your input, the leftmost longest match
of /([A-Z]{1}\.)([A-Z]{1}\.)/ is ``L.F.''.
``B.'' stays unmatched because the ``F.'' before it has already been
consumed, and then gensub continues to C.A. and so on.
HTH
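
To see which characters each match consumes, the pattern can be run through plain gsub() with the matched text wrapped in brackets. This is only a sketch for illustration: mark_pairs is a made-up helper name, and gsub() stands in for gensub() so that it runs in any POSIX awk. Both scan left to right and resume after the end of the previous match.

```shell
# Hypothetical helper: bracket every match of the two-initial pattern
# so it is visible what each match swallows.
mark_pairs() {
  awk '{ gsub(/[A-Z]\.[A-Z]\./, "[&]"); print }'
}

printf '%s\n' 'Oliveira, L.F.B. and Lima-Rosa, C.A.V. and' | mark_pairs
# prints: Oliveira, [L.F.]B. and Lima-Rosa, [C.A.]V. and
```

The F. and A. sit inside the brackets, so there is nothing left in front of B. or V. for a second match to start from.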
| |
|
| On 5 Mar 2007 15:55:38 -0800,
"Vassilis" <F.H.Novalis@gmail.com> wrote:
[...]
> First off, I don't think intervals are necessary in this case ([A-Z]
> matches one character). Anyway, regexps match leftmost longest. That
> is, in the second record of your input, leftmost longest match of
> /([A-Z]{1}\.)([A-Z]{1}\.)/ is ``L.F.''. ``B.'' stays unmatched, and
> then, gensub continues to C.A. and so on. HTH
Thanks Vassilis. I'll have to think of a more appropriate regexp.
Cheers,
--
Seb
| |
|
| On 5 Mar 2007 15:55:38 -0800,
"Vassilis" <F.H.Novalis@gmail.com> wrote:
[...]
> First off, I don't think intervals are necessary in this case ([A-Z]
> matches one character). Anyway, regexps match leftmost longest. That
> is, in the second record of your input, leftmost longest match of
> /([A-Z]{1}\.)([A-Z]{1}\.)/ is ``L.F.''. ``B.'' stays unmatched, and
> then, gensub continues to C.A. and so on.
I don't quite understand, though, why F.B. is not the next match in this
case, rather than C.A. It seems that I will have to use a while() loop to
repeat the gensub() until all initials have been separated by a space.
--
Seb
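
The loop idea could look roughly like the sketch below. It is not the poster's code: it swaps gensub() for POSIX match() and substr() so it does not depend on gawk, and space_initials_loop is a made-up name. Each pass inserts one space into the leftmost remaining pair and then rescans, so the loop ends once no two initials are adjacent.

```shell
# Hypothetical helper: repeat a single insertion until no "X.Y." pair is left.
space_initials_loop() {
  awk '{
    line = $0
    # RSTART is the position of the first letter of the leftmost "X.Y." pair,
    # so RSTART + 1 is its dot; put one space right after that dot and rescan.
    while (match(line, /[A-Z]\.[A-Z]\./))
      line = substr(line, 1, RSTART + 1) " " substr(line, RSTART + 2)
    print line
  }' "$@"
}

printf '%s\n' 'Oliveira, L.F.B. and Zanchin, N.I.T. and Mattevi, M.S.},' | space_initials_loop
# prints: Oliveira, L. F. B. and Zanchin, N. I. T. and Mattevi, M. S.},
```

Because the rescan starts from the beginning of the line each time, this handles any number of adjacent initials, at the cost of one extra scan per inserted space.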
| |
| Vassilis 2007-03-05, 9:58 pm |
|
Seb wrote:
> On 5 Mar 2007 15:55:38 -0800,
> "Vassilis" <F.H.Novalis@gmail.com> wrote:
>
> [...]
>
>
> I don't quite understand, though, why F.B. is not the next match in this
> case, rather than C.A. It seems that I will have to use a while() loop to
> repeat the gensub() until all initials have been separated by a space.
>
>
> --
> Seb
F.B. is not the next match, because the regexp machinery never sees
it.
``L.F.'' is transformed to ``L. F.'' as requested, and then the scan
moves forward to match anything else /after/ the first match.
I wouldn't use awk on this occasion.
Sed is just fine.
sed 's/\([A-Z]\.\)\([A-Z]\.\)\([A-Z]\.\)/\1 \2 \3/g ; s/\([A-Z]\.\)\([A-Z]\.\)/\1 \2/g' file
| |
| gerryt 2007-03-06, 3:58 am |
| On Mar 5, 5:23 pm, "Vassilis" <F.H.Nova...@gmail.com> wrote:
> Seb wrote:
> [...]
>
> F.B. is not the next match, because the regexp machinery never sees
> it.
> ``L.F.'' is transformed to ``L. F.'' as requested, and then the scan
> moves forward to match anything else /after/ the first match.
> I wouldn't use awk on this occasion.
> Sed is just fine.
> sed 's/\([A-Z]\.\)\([A-Z]\.\)\([A-Z]\.\)/\1 \2 \3/g ; s/\([A-Z]\.\)\([A-Z]\.\)/\1 \2/g' file
That's GNU sed dialect (the ";" between the two commands); otherwise
give each command its own -e.
You can do this with gawk too, but there must be a better way than
feeding the output of one gawk into another? As in:
gawk '{print gensub(/([A-Z]\.)([A-Z]\.)([A-Z]\.)/, "\\1 \\2 \\3", "g")}' myfile.bib |
gawk '{print gensub(/([A-Z]\.)([A-Z]\.)/, "\\1 \\2", "g")}'
| |
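
One way to avoid the second process, offered as a sketch rather than a definitive answer: gawk's gensub() takes an optional fourth argument naming the string to work on, so the result of the first pass can be handed straight to the second. The function name space_initials_gawk is made up for this example, and it needs gawk because gensub() is a gawk extension.

```shell
# Hypothetical helper: both passes inside a single gawk process.
space_initials_gawk() {
  gawk '{
    # first pass: runs of three initials
    s = gensub(/([A-Z]\.)([A-Z]\.)([A-Z]\.)/, "\\1 \\2 \\3", "g")
    # second pass: remaining pairs, applied to the result of the first pass
    print gensub(/([A-Z]\.)([A-Z]\.)/, "\\1 \\2", "g", s)
  }' "$@"
}

# usage: space_initials_gawk myfile.bib
```

Like the two-stage pipeline, this assumes the runs of initials are short; a sequence of five or more adjacent initials can still come out with one pair left unseparated.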
| Lorenz 2007-03-06, 3:58 am |
| Hi,
Seb wrote:
>I need to fix names in a BibTeX file, which looks like:
>
>
>@ARTICLE{1690,
> author = {Andrades-Miranda, J. and Oliveira, L.F.B. and Lima-Rosa, C.A.V. and
> Nunes, A.P. and Zanchin, N.I.T. and Mattevi, M.S.},
> ...
>}
>[...]
>gawk --re-interval '{print gensub(/([A-Z]{1}\.)([A-Z]{1}\.)/, "\\1 \\2", "g")}' myfile.bib
>[...]
Wouldn't
gawk '{print gensub(/\.([A-Z])/, ". \\1", "g")}' myfile.bib
do the job?
Lorenz
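
The reason this shorter pattern catches every initial is that each match consumes only a dot and the single letter after it, so matches never need to overlap. The sketch below expresses the same substitution with sed instead of gawk, purely so it can be tried where gawk is not installed; space_after_dot is a made-up name.

```shell
# Hypothetical helper: insert a space between any dot and an uppercase
# letter that immediately follows it.
space_after_dot() {
  sed 's/\.\([A-Z]\)/. \1/g' "$@"
}

printf '%s\n' 'Oliveira, L.F.B. and Lima-Rosa, C.A.V. and' | space_after_dot
# prints: Oliveira, L. F. B. and Lima-Rosa, C. A. V. and
```

One caveat: the pattern fires on any dot followed directly by a capital letter, not just on initials, so a string such as www.Example.org elsewhere in the entry would become www. Example.org.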
| |
| Ed Morton 2007-03-08, 6:57 pm |
| Seb wrote:
> Hi,
>
> I need to fix names in a BibTeX file, which looks like:
>
>
> @ARTICLE{1690,
> author = {Andrades-Miranda, J. and Oliveira, L.F.B. and Lima-Rosa, C.A.V. and
> Nunes, A.P. and Zanchin, N.I.T. and Mattevi, M.S.},
> ...
> }
>
>
> where each initial needs to be separated by a space; e.g. Lima-Rosa,
> C.A.V. above should be rewritten as Lima-Rosa, C. A. V. Of course, names
> may not only appear in the author field (in the BibTeX sense), but also
> elsewhere in the entry (e.g. editor). I thought I might do this with:
>
>
> gawk --re-interval '{print gensub(/([A-Z]{1}\.)([A-Z]{1}\.)/, "\\1 \\2", "g")}' myfile.bib
>
>
> but this returns:
>
>
> @ARTICLE{1690,
> author = {Andrades-Miranda, J. and Oliveira, L. F.B. and Lima-Rosa, C. A.V. and
> Nunes, A. P. and Zanchin, N. I.T. and Mattevi, M. S.},
> ...
> }
>
>
> I would have expected the "g" argument to do the replacements everywhere
> in each line. Any ideas as to what I'm missing here?
>
>
> Cheers,
>
Rather than worrying about whether or not there's already a space, just
always add a space after the '[A-Z]\.', then reduce multiple spaces to one:
awk '{gsub(/[A-Z]\./,"& ");gsub(/ +/," ")}1' file
Obviously that assumes you don't need to preserve existing chains of
spaces. If you do, just gsub() them to some other pattern before
starting, then gsub() them back afterwards.
Regards,
Ed.
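
If existing runs of spaces do need to survive, the protect-and-restore idea might be sketched as follows. The details are assumptions, not from the post: byte 28 (octal 034) is used as a placeholder on the assumption that it never occurs in the file, and space_initials_keep_runs is a made-up name.

```shell
# Hypothetical helper: add a space after each initial without collapsing
# runs of spaces that were already in the input.
space_initials_keep_runs() {
  awk 'BEGIN { ph = sprintf("%c", 28) }   # placeholder assumed absent from the data
  {
    gsub(/ /, ph)            # protect every existing space
    gsub(/[A-Z]\./, "& ")    # add a space after every initial
    gsub(" " ph, ph)         # drop the new space where one was already there
    gsub(ph, " ")            # turn the placeholders back into spaces
    print
  }' "$@"
}

printf '%s\n' '  author = {Oliveira, L.F.B. and  Nunes, A.P.},' | space_initials_keep_runs
# prints:   author = {Oliveira, L. F. B. and  Nunes, A. P. },
```

The indentation and the double space after "and" are kept. Like the one-liner, this still puts a space after the last initial of a group, so A.P.} comes out as A. P. } and may need a further cleanup step.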