pattern-range blocks on an Acad DXF file
|
|
|
| I am looking for an elegant way to parse Acad .dxf files... one of the most
inelegant ASCII file structures that I know of. I'm encouraged by
awk's pattern1,pattern2 range syntax, i.e. a /this/,/that/ action block.
/CIRCLE/,/^ 0$/ { circle[++i]=$0 }
END{ print paste(circle,"|")}
These two lines of code, amazingly, almost work... they print all the
CIRCLE entities as one long string on a single line, like so:
CIRCLE| 5|27|330|.....| 0|CIRCLE| 5|33|330|....|4.13| 0|CIRCLE|
5|34|330|.......|4.13| 0
What I want is each CIRCLE entity on a separate line:
CIRCLE| 5|27|330|...| 10|-49.40| 20|-21.23| 30|0.0| 40|4.13|
CIRCLE| 5|33|330|...| 10|-44.18| 20|0.0048| 30|0.0| 40|4.13|
CIRCLE| 5|34|330|...| 10|-45.99| 20|-10.41| 30|0.0| 40|4.13|
This means that I need to restart the ++i after pattern2 and also assign a
unique array index to each CIRCLE.
I'm using Tawk. Any suggestions?
The next step will be to go after the number that follows the | 5|, which is
Acad's unique hex handle for each entity.
.dxf file structure primer 101:
The dxf file mimics the internal Acad database, which is a long, long list
of entities following a Lisp-like structure.
Looking at a snip of a dxf file, it is (generally) a long list of single
values: first a flag, then a data string. An important
flag is a zero with two spaces preceding it, which indicates the
beginning of an entity list. My brainstorm is that
I've flipped the meaning of 0 to signify the end of a list instead of the
start. Amazingly, this seems to work, as the entities I'm interested in are
chained... I therefore start with an entity name like CIRCLE or POLYLINE etc.
and end with 0.
0
SECTION
2
HEADER
9
$ACADVER
1
AC1015
9
$ACADMAINTVER
====================snip.
zillions of lines
====================snip.
0
SECTION
2
ENTITIES
=====================snip.
.. 0
CIRCLE
5
27
330
2
100
AcDbEntity
8
0
100
AcDbCircle
10
-49.40078978037286
20
-21.23479143819105
30
0.0
40
4.1340663209653
0
POLYLINE
5
28
| |
| William James 2005-02-14, 3:56 am |
| Try something like this:
* /CIRCLE/ { array[++count]=$0 "|";grabbing=1;next }
* /^ 0$/ { grabbing=0 }
* grabbing { array[count] = array[count] $0 "|" }
*
* END {
* for (i=1; i in array; i++)
* print array[i]
* }
Let us know how this works for you.
| |
| Kenny McCormack 2005-02-14, 3:56 am |
| In article <9uydncaW5OpwZpLfRVn-pg@comcast.com>, trexx <foo@foo.com> wrote:
>/CIRCLE/,/^ 0$/ { circle[++i]=$0 }
>END{ print paste(circle,"|")}
I know nothing of the actual problem you are trying to solve or what your
input file really looks like, but...
What happens if you change:
END{ print paste(circle,"|")}
to:
END{ print paste(circle,"\n")}
Also, what platform is this?
| |
| Ed Morton 2005-02-14, 3:56 am |
|
trexx wrote:
> . 0
> CIRCLE
> 5
> 27
> 330
> 2
> 100
> AcDbEntity
> 8
> 0
> 100
> AcDbCircle
> 10
> -49.40078978037286
> 20
> -21.23479143819105
> 30
> 0.0
> 40
> 4.1340663209653
> 0
> POLYLINE
> 5
> 28
If the "." before the " 0" preceding the line with "CIRCLE" above is a
mistake, then this:
gawk 'BEGIN{RS="(^|\n) 0";OFS="|"}NR==1{next}{$1=$1}1' file
will print every entity on its own line, with fields separated by "|"s.
e.g.:
SECTION|2|HEADER|9|$ACADVER|1|AC1015|9|$ACADMAINTVER|====================snip.|zillions|of|lines|====================snip.
SECTION|2|ENTITIES|=====================snip.
CIRCLE|5|27|330|2|100|AcDbEntity|8|0|100|AcDbCircle|10|-49.40078978037286|20|-21.23479143819105|30|0.0|40|4.1340663209653
POLYLINE|5|28
Then to just print the CIRCLE ones you just change the constant
condition "1" at the end to a test, e.g.
gawk 'BEGIN{RS="(^|\n) 0";OFS="|"}NR==1{next}{$1=$1}/CIRCLE/' file
CIRCLE|5|27|330|2|100|AcDbEntity|8|0|100|AcDbCircle|10|-49.40078978037286|20|-21.23479143819105|30|0.0|40|4.1340663209653
I don't know if you'd do anything different for tawk since there's
nowhere to get it from and it's unsupported but you could try the above
and see. You might want to switch to gawk anyway since it's still
supported and readily available.
Ed.
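For awks without regex support in RS (POSIX only defines a single-character RS), much the same fold can be sketched in portable awk by reading the file as pairs of lines, a group code followed by its value, which is the layout the primer above describes. This is only a sketch under that assumption: the fold_dxf wrapper name is made up for illustration, and the padding in front of the group codes is dropped, as it is with $1=$1 above.

```sh
# fold_dxf (hypothetical helper): one DXF entity per output line, "|"-separated.
# Assumes the file is strict pairs of lines: a group code, then its value.
fold_dxf() {
    awk '
        {
            code = $0                          # first line of the pair: group code
            if ((getline value) <= 0) exit     # second line: the value for that code
            sub(/\r$/, "", code);  sub(/\r$/, "", value)   # tolerate DOS line ends
            gsub(/^[ \t]+|[ \t]+$/, "", code)  # drop the padding around both
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (code == "0") {                 # group code 0 starts a new entity
                if (rec != "") print rec       # flush the previous entity
                rec = value                    # the value is the type, e.g. CIRCLE
            } else if (rec != "")
                rec = rec "|" code "|" value
        }
        END { if (rec != "") print rec }
    ' "$@"
}
```

Something like fold_dxf file | grep '^CIRCLE|' would then keep only the circles. Pairing the lines also keeps a value of 0 (such as the layer name in the sample) from being mistaken for the start of an entity.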
| |
|
|
Thanks Kenny
The platform is Win2K. As a first step I'm trying to simplify, i.e. "fold",
the dxf file so that 1 entity = 1 line. I can then go on to process these
entities' properties (color, layer, etc.) for further processing.
I can also easily clip out unwanted entities, i.e. lines that are too
short, etc. My idea with the pipe "|" symbol is to allow me to unfold my
processed file so that the exact number of spaces before each entity code
is maintained... this appears to be a critical issue (and previously a
frustrating one) for producing a valid dxf file.
REX
"Kenny McCormack" <gazelle@yin.interaccess.com> wrote in message
news:cup4jl$66d$1@yin.interaccess.com...
> In article <9uydncaW5OpwZpLfRVn-pg@comcast.com>, trexx <foo@foo.com> wrote:
>
> I know nothing of the actual problem you are trying to solve or what your
> input file really looks like, but...
>
> What happens if you change:
>
> END{ print paste(circle,"|")}
> to:
> END{ print paste(circle,"\n")}
>
> Also, what platform is this?
>
| |
|
| Thanks...Your code works like a charm. I will make some more coffee and
start playing with variations.
REX
"William James" <w_a_x_man@yahoo.com> wrote in message
news:1108349510.650833.243210@z14g2000cwz.googlegroups.com...
> Try something like this:
>
> * /CIRCLE/ { array[++count]=$0 "|";grabbing=1;next }
> * /^ 0$/ { grabbing=0 }
> * grabbing { array[count] = array[count] $0 "|" }
> *
> * END {
> * for (i=1; i in array; i++)
> * print array[i]
> * }
>
> Let us know how this works for you.
>
| |
|
| Ed
Thanks...
I need to get Gawk to play with your code... it's obviously elegant.
Is there a Gawk compiled for Win2000? I quickly looked at the GNU site and
missed seeing one. In converting from Tawk to Gawk, I have some questions:
1. Tawk compiles the code; Gawk stays interpreted. In your experience,
will there be a big hit in processing time?
2. For cracking binary files, Tawk is currently my first choice with its
Pack/Unpack. Do you go to C++ or Perl or something else to do this before
using Gawk?
3. When programming goes sour, I use Tawk's debugger. How does one step
through Gawk code and take a look at variables, arrays, etc.?
4. Any other gotchas I should be concerned with?
I looked at the Gawk PDF manual and was pleased to see that the automatic
sorting of an array is part of Gawk.
I can live with the syntax change for multidimensional arrays, Tawk[x][y][z]
versus Gawk[x,y,z].
REX
| |
| Ed Morton 2005-02-14, 3:55 pm |
|
trexx wrote:
> Ed
> Thanks...
> I need to get Gawk to play with your code ... it's obviously elegant...
> Is there a Win2000 compiled Gawk?... I quickly looked at the GNU site and
> missed seeing one.
I use gawk on my PC as part of the cygwin distribution (www.cygwin.com)
which should run fine on Win2000. I haven't looked for another version.
> In converting from Tawk to Gawk... I have some questions:
> 1. Tawk compiles the code... Gawk stays interpreted ... in your experience
> will there be a big hit in processing time?
Probably not. I've done some experiments in the past using "awkcc" to
convert awk scripts to C and compile them and there was just a small
performance improvement.
> 2. For cracking binary files Tawk is currently my first choice with its
> Pack/Unpack.. Do you go to C++ or Perl or ?? to do this before
> using Gawk?
I never use awk on binary files.
> 3. When programming goes sour, I use Tawk's Debugger ... how does one step
> through Gawk code and take a look at variables and arrays etc?
You set the appropriate debugging flags.
> 4. Other gotcha I should be concerned with?
Maybe, but I've never used tawk.
> I looked at the Gawk PDF manual and was pleased to see that the automatic
> sorting of an array is part of Gawk.
> I can live with the syntax change of multidimensional arrays, Tawk[x][y][z]
> versus Gawk[x,y,z].
Again, I don't know tawk, but there MAY be a gotcha there since gawk
just simulates MD arrays by using "subscript1 SUBSEP subscript2 ..." so
accessing the subscripts may not work the same in gawk as in tawk.
Ed.
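To illustrate the SUBSEP point, a minimal sketch in plain awk: a subscript list such as [1, 2] is joined into a single string key, and split() on SUBSEP gets the parts back. The names subsep_demo, grid, key and idx are made up for the example.

```sh
# subsep_demo (hypothetical helper): how awk simulates multidimensional arrays.
subsep_demo() {
    awk 'BEGIN {
        grid[1, 2] = "a"                   # stored under the key "1" SUBSEP "2"
        grid[3, 4] = "b"
        if ((1, 2) in grid)                # membership test on a subscript list
            print "found 1,2"
        for (key in grid) {
            split(key, idx, SUBSEP)        # recover the individual subscripts
            print idx[1], idx[2], grid[key]
        }
    }'
}
```

The order of a for (key in grid) loop is unspecified, so pipe the output through sort if a stable order matters. Unlike grid[x][y] in Tawk, there is no sub-array grid[1] to loop over on its own; every element lives in the one flat array.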
| |
| Kenny McCormack 2005-02-14, 3:55 pm |
| In article <vJqdnVvEN_JPfY3fRVn-3Q@comcast.com>,
Ed Morton <morton@lsupcaemnt.com> wrote:
....
You can certainly get GAWK with DOS/Windows, but there's no need when you
already have TAWK. Most (and I mean, all, except in some very obscure
cases) code written for GAWK will run with TAWK (less so, in the opposite
direction, of course).
>I use gawk on my PC as part of the cygwin distribution (www.cygwin.com)
>which should run fine on Win2000. I haven't looked for another version.
>
>
>Probably not. I've done some experiments in the past using "awkcc" to
>convert awk scripts to C and compile them and there was just a small
>performance improvement.
TAWK *is* much faster than GAWK. Make no mistake about that.
Notes:
1) This pickup is not specifically because it is "compiled" (a term
with many different meanings), but rather because the interpreter is very
well written (but the code is still interpreted at runtime).
2) AWKCC uses the GAWK routines, so there is essentially no speed
difference between it and (interpreted) GAWK.
>
>I never use awk on binary files.
Wise man. Yes, this is one of the many strengths of TAWK.
>
>You set the appropriate debugging flags.
I've actually never used the TAWK debugger, but it does look impressive.
Nice if you are the sort who likes such things.
>
>Maybe, but I've never used tawk.
>
Really? The WHINY_USER functionality is now actually documented in the PDF?
I'll be darned...
>
>Again, I don't know tawk, but there MAY be a gotcha there since gawk just
>simulates MD arrays by using "subscript1 SUBSEP subscript2 ..." so
>accessing the subscripts may not work the same in gawk as in tawk.
TAWK's true multi-dimensional arrays are very nice. I could never go back
to the simulated stuff.
| |
| William Park 2005-02-14, 8:55 pm |
| Kenny McCormack <gazelle@yin.interaccess.com> wrote:
> TAWK *is* much faster than GAWK. Make no mistake about that.
> Notes:
> 1) This pickup is not specifically because it is "compiled" (a
> term with many different meanings), but rather because the
> interpreter is very well written (but the code is still
> interpreted at runtime).
> 2) AWKCC uses the GAWK routines, so there is essentially no
> speed difference between it and (interpreted) GAWK.
>
>
> Wise man. Yes, this is one of the many strengths of TAWK.
Kenny, are there any features of TAWK that you'd like to see in Bash?
--
William Park <opengeometry@yahoo.ca>, Toronto, Canada
Slackware Linux -- because I can type.
| |
| Aharon Robbins 2005-02-15, 8:55 am |
| In article <cuqos5$d2f$1@yin.interaccess.com>,
Kenny McCormack <gazelle@interaccess.com> wrote:
> 2) AWKCC uses the GAWK routines, so there is essentially no speed
>difference between it and (interpreted) GAWK.
AWKCC is based on the Bell Labs awk, it has nothing to do with gawk.
But your point remains valid, it uses chunks of that awk for much of
the basic functionality.
Gawk can handle binary files, if you can specify something that acts
as a record separator. However, that doesn't mean it's a good idea,
and I would use a different tool for working on binary data.
This is admittedly not what I'd like. You can get dumps of variable and array
values, at program exit. I hope one day that this will improve.
>
>Really? The WHINY_USER functionality is now actually documented in the PDF?
>I'll be darned...
No, it's not in any processed form of the documentation. At least, not in
anything I distribute. asort() and asorti() are documented.
Of course, I have no idea where the PDF manual that was being looked at
came from.
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 206 350 8765
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL
| |
| Kenny McCormack 2005-02-15, 3:56 pm |
| In article <4211c44b$1@news.012.net.il>,
Aharon Robbins <arnold@skeeve.com> wrote:
>In article <cuqos5$d2f$1@yin.interaccess.com>,
>Kenny McCormack <gazelle@interaccess.com> wrote:
>
>AWKCC is based on the Bell Labs awk, it has nothing to do with gawk.
>But your point remains valid, it uses chunks of that awk for much of
>the basic functionality.
Oops - I was thinking of AWK2C, which *is* based on gawk.
I've played around a bit with AWK2C, not with AWKCC.
>
>No, it's not in any processed form of the documentation. At least, not in
>anything I distribute. asort() and asorti() are documented.
Most likely, the OP was referring to asort() or asorti().
In which case, using the terminology "automatic sorting" is inaccurate at
best, misleading at worst...
| |
| Patrick TJ McPhee 2005-02-16, 3:56 am |
| In article <9uydncaW5OpwZpLfRVn-pg@comcast.com>, trexx <foo@foo.com> wrote:
% ineloquent file ascii file structurs that I know of. I'm encourage by the
% awk's pattern1,pattern2, i.e. /this/,/that/ action block syntax.
%
% /CIRCLE/,/^ 0$/ { circle[++i]=$0 }
% END{ print paste(circle,"|")}
[...]
% what I want is each CIRCLE entity on a separate line.
You could do something like
/CIRCLE/,end = /^ 0$/ { circle[++i]=$0 }
end { print paste(circle,"|")}
--
Patrick TJ McPhee
North York Canada
ptjm@interlog.com
| |
|
| Patrick
Thanks, I think I understand that the variable 'end' goes from 0 to 1 when
the second pattern is met.
I also see that 'end' is not 'END' in awk.
By experimenting with your code, I found that by resetting end=0,i=0 and
adding 'next' all works perfectly.
/CIRCLE/,end = /^ 0$/ { circle[++i]=$0 }
end { end=0;i=0;print paste(circle,"|");next}
Now, by replacing /CIRCLE/ with /LWPOLYLINE/ one gets a lightweight
polyline, a more complex Acad entity.
The code I made chops up the line, which has a variable number of vertices
(the x-coord follows |10| and the y-coord follows |20|).
It does use Tawk's style of arrays. I've shown the result as strings, but
in practice I would make x and y into arrays
and do the trig to get the total length of the line. This code looks more
like modified Object Rexx! Any suggestions
on streamlining this code in Awk would be appreciated.
* INIT{
* $0="LWPOLYLINE| 5|2B|snip|100|snip| 43|0.0| 10|-64.893| 20|-51.351| 10|7.880| 20|3.920| 10|16.443| 20|4.872| 0"
* FS = "|"
* i = 0
* }
* BEGIN{
* for (i=1; i<NF; ++i){
* ++count[$i]
* array[$i][count[$i]] = $(i+1)
* }
* for (i=1; i<= count[" 10"]; i++){
* print "x[" i "]= " array[" 10"][i], "," "y[" i "]= " array[" 20"][i]
* }
* }
x[1] = -64.893 , y[1] = -51.351
x[2] = 7.88 , y[2] = 3.92
x[3] = 16.443 , y[3] = 4.872
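One possible way to streamline this in portable awk (no Tawk-style arrays), as a rough sketch only: it assumes a folded line in which field 1 is the entity name and everything after it is strict group code/value pairs (so no "snip" placeholders), treats the polyline as open with straight segments, and uses Pythagoras for the length. The polyline_length name and the output format are invented for the example.

```sh
# polyline_length (hypothetical helper): total length of each folded LWPOLYLINE.
polyline_length() {
    awk -F'|' '
        $1 == "LWPOLYLINE" {
            n = 0; len = 0
            # field 1 is the entity name; after it come (group code, value) pairs
            for (i = 2; i < NF; i += 2) {
                code = $i + 0                          # numeric, so " 10" and "10" both match
                if (code == 10) x[++n] = $(i + 1)      # 10 = vertex x
                else if (code == 20) y[n] = $(i + 1)   # 20 = y of the same vertex
            }
            for (i = 2; i <= n; i++)                   # sum the segment lengths
                len += sqrt((x[i] - x[i-1])^2 + (y[i] - y[i-1])^2)
            printf "%s vertices=%d length=%.3f\n", $1, n, len
        }
    ' "$@"
}
```

Stepping through the fields two at a time means a value that happens to be 10 or 20 is never mistaken for a group code, which can happen when every field is tested as in the loop above.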
REX
"Patrick TJ McPhee" <ptjm@interlog.com> wrote in message news:37g6guF5aejveU1@uni-berlin.de...
> In article <9uydncaW5OpwZpLfRVn-pg@comcast.com>, trexx <foo@foo.com> wrote:
>
> % ineloquent file ascii file structurs that I know of. I'm encourage by the
> % awk's pattern1,pattern2, i.e. /this/,/that/ action block syntax.
> %
> % /CIRCLE/,/^ 0$/ { circle[++i]=$0 }
> % END{ print paste(circle,"|")}
>
> [...]
>
> % what I want is each CIRCLE entity on a separate line.
>
> You could do something like
>
> /CIRCLE/,end = /^ 0$/ { circle[++i]=$0 }
> end { print paste(circle,"|")}
>
> --
>
> Patrick TJ McPhee
> North York Canada
> ptjm@interlog.com
| |