| Author |
Splitting files with SED or AWK or other app/script
|
|
| matt@bobsroom.ca 2006-07-23, 7:59 am |
| I have a situation where I need to split a file that is being received
into two separate files. The file I am receiving is both ASCII data and
binary data. The binary data is a TIFF file which is appended to an
ASCII file. Something like this:
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
text text tex text text text text text text text tex text text text
text text
##TIFF##
binary data
binary data
binary data
binary data
This is ALWAYS the format of the file. The binary data always appears
at the very bottom of the file and is always preceded by the text tag
"##TIFF##"
I need to be able to split this file into two separate files: one that
contains only the ASCII data, and one that contains only the ##TIFF##
tag and the binary data.
I currently use sed for some search and replace, but I know
very little about it. I have done a little reading (and will do more),
but was wondering if anyone else might have some suggestions.
Thanks.
Matt
| |
| Janis Papanagnou 2006-07-23, 6:59 pm |
| matt@bobsroom.ca wrote:
> I need to be able to split this file into two separate files. One which
> contains only the ascii data, and one that contains only the ##TIFF##
> tag and the binary data.
sed '/##TIFF##/q' <infile >part1
sed -n '/##TIFF##/,$p' <infile >part2
The string "##TIFF##" is ASCII data and thus also in part1 as you said;
if you don't like it that way you have to strip the last line from part1.
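As a sketch of the two sed commands above, plus a variant that leaves the tag line out of the text part, the following can be tried on a throwaway file. The file name sample.dat, the name part1_notag, and the contents are made up for illustration. The payload here is plain text, so this only demonstrates the line-splitting logic; whether a given sed copes with real binary data depends on the implementation.

```shell
#!/bin/sh
# Sketch only: sample.dat and its contents are hypothetical test data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'line one\nline two\n##TIFF##\nfake binary payload\n' > sample.dat

# Part 1: print up to and including the tag line, then quit.
sed '/^##TIFF##$/q' < sample.dat > part1

# Part 2: print from the tag line to the end of the file.
sed -n '/^##TIFF##$/,$p' < sample.dat > part2

# Variant: delete everything from the tag line onward,
# leaving the text part without the tag.
sed '/^##TIFF##$/,$d' < sample.dat > part1_notag
```

Anchoring the pattern with ^ and $ is a small precaution so that a line merely containing the tag text does not trigger the split.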
Janis
| |
| Loki Harfagr 2006-07-23, 6:59 pm |
| On Sun, 23 Jul 2006 06:59:12 -0700, matt@bobsroom.ca wrote:
> I need to be able to split this file into two separate files. One which
> contains only the ascii data, and one that contains only the ##TIFF##
> tag and the binary data.
>
> I currently use the SED command for some search and replace but know
> very little about it and have done a little reading (and will do more)
> but was wondering if anyone else might have some suggestions.
You already had a good answer using 'sed' from Janis. Since your
subject line says 'or other app/script', here's the way I'd favour, using
the 'csplit' tool:
$ csplit MISCFILES/splitin2.txt /^##TIFF##/+1
790
48
It'll create two files, xx00 and xx01:
$ cat xx01
binary data
binary data
binary data
binary data
You can choose the output file names in the call itself with the options
-f (prefix) and -b (suffix format); see the manpage to adapt to
your taste :-)
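A minimal sketch of the naming options, assuming GNU csplit: -f is in POSIX, but -b is a GNU extension (POSIX offers -n for the number of suffix digits instead). The file name sample.dat, the prefix "part", and the contents are made up for illustration. The pattern is used without an offset, so the second piece starts at the tag line itself.

```shell
#!/bin/sh
# Sketch only: sample.dat and the "part" prefix are hypothetical names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'line one\nline two\n##TIFF##\nfake binary payload\n' > sample.dat

# -s  suppress the byte counts csplit normally prints
# -f  prefix for the output files (the default is "xx")
# -b  printf-style suffix format (GNU extension)
csplit -s -f part -b '%02d.dat' sample.dat '/^##TIFF##$/'

# Expect part00.dat (text) and part01.dat (tag line plus the rest).
ls part*.dat
```

As with any line-oriented tool, this says nothing about how csplit behaves once the second piece holds real binary data.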
| |
| Jon LaBadie 2006-07-23, 6:59 pm |
| Loki Harfagr wrote:
> On Sun, 23 Jul 2006 06:59:12 -0700, matt@bobsroom.ca wrote:
>
>
> You already had a good answer in 'sed' from Janis, as you
> "subjected" 'or other app/script' here's the way I'd favour, using
> the 'csplit' tool :
> $ csplit MISCFILES/splitin2.txt /^##TIFF##/+1
> 790
> 48
csplit was going to be my suggestion as well.
The OP wanted the TIFF tag in the file with the binary data.
For this, drop the "+1" from the end of the command line.
>
> It'll create two files xx00 and xx01 :
>
> $ cat xx01
> binary data
> binary data
> binary data
> binary data
>
>
> You can use autonaming inside the call with the parameters
> -f and -b for prefix and suffixes, see the manpage to adapt to
> your taste :-)
>
| |
| matt@bobsroom.ca 2006-07-23, 6:59 pm |
| With both the csplit and sed examples you provided, I ran into the exact
same problem.
When it processes anything after the ##TIFF## tag, it generates an
error saying "input file is a binary file" and stops.
When I did the csplit, I end up with a file that contains the TIFF tag
only and no binary data.
When I run the 2nd sed command provided, I get no file at all.
Any ideas?
Jon LaBadie wrote:
> Loki Harfagr wrote:
>
> csplit was going to be my suggestion as well,
>
> The OP wanted the TIFF tag in the file with the binary data.
> For this, drop the "+1" from the end of the command line.
>
>
| |
| Jon LaBadie 2006-07-23, 6:59 pm |
| matt@bobsroom.ca wrote:
> For both the csplit and sed examples you provided, I had the exact same
> problem happen on both.
>
> When it processes anything after the ##TIFF## tag, it generates an
> error saying "input file is a binary file" and stops.
>
> When I did the csplit, I end up with a file that contains the TIFF tag
> only and no binary data.
>
> When I run the 2nd sed command provided, I get no file at all.
>
> Any ideas?
>
> Jon LaBadie wrote:
>
Better to reply inline or at the bottom.
Some utilities are "binary safe"; others are not. My original test was on
SuSE Linux and it worked fine. I could use tail +2 on the xx01 file and get an
exact copy of the TIFF file I had appended after the tag.
After your reply I repeated the experiment on a Solaris system.
I did not get your error message, but I definitely got a corrupt file.
I also tried, on Solaris and Linux, a script using the ex link to vim:
echo '1,/^##TIFF##$/-1 w part1
1;//,$ w part2
q!
' | ex - datafile
This "worked" on both OSes.
The resulting TIFF file was usable, but had a newline added by ex (vim),
hence the quotes around "worked".
| |
| John W. Krahn 2006-07-23, 6:59 pm |
| matt@bobsroom.ca wrote:
> I need to be able to split this file into two separate files. One which
> contains only the ascii data, and one that contains only the ##TIFF##
> tag and the binary data.
>
> I currently use the SED command for some search and replace but know
> very little about it and have done a little reading (and will do more)
> but was wondering if anyone else might have some suggestions.
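#!/usr/bin/perl
use strict;
use warnings;
my $input = shift
    or die "usage: $0 inputfilename [textfilename] [tifffilename]\n";
my $text = @ARGV ? shift : "$input.text";
my $tiff = @ARGV ? shift : "$input.tiff";
# Read the input and write the TIFF part in raw mode so no bytes are translated.
open IN, '<:raw', $input or die "open '$input' $!";
open TEXT, '>', $text or die "open '$text' $!";
open TIFF, '>:raw', $tiff or die "open '$tiff' $!";
# First record: everything up to the tag line; chomp strips the tag itself.
$/ = "##TIFF##\n";
chomp( my $text_data = <IN> );
print TEXT $text_data;
# Remainder: copy the binary data in fixed-size 1024-byte records.
$/ = \1024;
print TIFF while <IN>;
__END__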
John
--
use Perl;
program
fulfillment
| |
| Janis Papanagnou 2006-07-24, 3:58 am |
| matt@bobsroom.ca wrote:
[Please don't top-post.]
> For both the csplit and sed examples you provided, I had the exact same
> problem happen on both.
>
> When it processes anything after the ##TIFF## tag, it generates an
> error saying "input file is a binary file" and stops.
>
> When I did the csplit, I end up with a file that contains the TIFF tag
> only and no binary data.
>
> When I run the 2nd sed command provided, I get no file at all.
>
> Any ideas?
If the tools you are using are not binary-clean, try the following: separate
the first part (the ASCII text) using sed, then determine the size
of that file, and finally use that size in a dd command to skip the
ASCII part.
Janis
| |
| matt@bobsroom.ca 2006-07-24, 7:59 am |
|
Jon LaBadie wrote:
> matt@bobsroom.ca wrote:
>
> Better to reply inline or at the bottom.
>
> Some utilities are "binary safe". My original test was on SuSE linux
> and it worked fine. I could use tail +2 on the xx01 file and get an
> exact copy of the tiff file I had appended after the tag.
>
> After your reply I repeated the experiment on a Solaris system.
> Did not get your error message, but definitely got a corrupt file.
>
> I also tried on solaris and linux a script using the ex link to vim.
>
> echo '1,/^##TIFF##$/-1 w part1
> 1;//,$ w part2
> q!
> ' | ex - datafile
>
> This "worked" on both OS's.
> The resulting tiff file was useable, but had a newline added by ex(vim)
> Thus the quotes around worked.
Well, I tried your solution on my system and ended up getting the TIFF
tag at the top of the second file which I don't want.
I adjusted what you gave me and tried this:
echo '1,/^##TIFF##/ w part1
1;//+1,$ w part2
q!
' | ex - ./mysrc
This generates the second file without the TIFF tag at the top and
appears to look right, but it does not create a valid TIF file. It's
weird, if I cut out all of the ASCII data from my original data file
including the TIFF tag and save it, it comes up fine as a TIF, but when
I process it like this, I don't get a TIF file.
This is the closest I have gotten yet to getting this to work.
| |
| Janis Papanagnou 2006-07-24, 7:59 am |
| matt@bobsroom.ca wrote:
> Jon LaBadie wrote:
I wouldn't trust line-oriented text editors like ex to work on data other
than ASCII text. (At the very least, they add a terminating newline.)
>
> Well, I tried your solution on my system and ended up getting the TIFF
> tag at the top of the second file which I don't want.
(Which is what you asked for in your original posting.)
> I adjusted what you gave me and tried this:
>
> echo '1,/^##TIFF##/ w part1
> 1;//+1,$ w part2
> q!
> ' | ex - ./mysrc
>
> This generates the second file without the TIFF tag at the top and
> appears to look right, but it does not create a valid TIF file. It's
> weird, if I cut out all of the ASCII data from my original data file
> including the TIFF tag and save it, it comes up fine as a TIF, but when
> I process it like this, I don't get a TIF file.
>
> This is the closest I have gotten yet to getting this to work.
Have you tried the dd approach? dd is supposed to work with binaries.
# untested
sed '/##TIFF##/q' <infile >part1
len=$( stat -c %s part1 )
dd if=infile of=part2 bs=1 skip=$len
If you don't have stat you may use
len=$( ls -l part1 | awk '{print $5}' )
Depending on your file sizes (text part and binary part) you may want to
change the dd options to bs=$len and skip=1.
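The bs=$len / skip=1 variant can be sketched as follows. This is an illustration under assumptions, not a tested recipe for the real data: the file names infile and payload.bin and their contents are made up, and wc -c is used here as another way to get the size where stat -c is unavailable. The fake payload deliberately contains NUL and high bytes so the final cmp is a meaningful check.

```shell
#!/bin/sh
# Sketch only: infile, payload.bin and their contents are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a fake binary payload, then a combined input file.
printf 'II*\000\010\377\376\n\000end' > payload.bin
{ printf 'text line 1\ntext line 2\n##TIFF##\n'; cat payload.bin; } > infile

# Text part, including the tag line; sed stops processing at the tag.
sed '/^##TIFF##$/q' < infile > part1

# Size of the text part in bytes (tr strips padding some wc versions add).
len=$(wc -c < part1 | tr -d ' ')

# Skip one block of $len bytes, then copy the rest in $len-byte blocks.
# This avoids the byte-at-a-time copying of bs=1.
dd if=infile of=part2 bs="$len" skip=1 2>/dev/null

cmp part2 payload.bin && echo "binary part extracted intact"
```

The approach relies on sed only ever handling the text lines; the binary data is copied by dd alone.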
Janis
| |
| matt@bobsroom.ca 2006-07-24, 6:59 pm |
|
Janis Papanagnou wrote:
> matt@bobsroom.ca wrote:
>
> I wouldn't trust line oriented text editors like ex to work on data other
> than ASCII text. (The least thing they do is adding a terminating newline.)
>
>
> (Which is what you asked for in your original posting.)
>
>
> Have you tried the dd approach? dd is supposed to work with binaries.
>
> # untested
> sed '/##TIFF##/q' <infile >part1
> len=$( stat -c %s part1 )
> dd if=infile of=part2 bs=1 skip=$len
>
> If you don't have stat you may use
>
> len=$( ls -l part1 | awk '{print $5}' )
>
> Depending on your file sizes (text part and binary part) you may want to
> change the dd options to bs=$len and skip=1.
>
> Janis
Ok Janis... I am getting much closer...
I tried your method with dd and it seems to be working perfectly except
for one minor problem.
It appears that when dd generates part2, it is cutting off 2 characters
at the very beginning of the file.
For example, say my original file contained something similar to this:
II*XOXXXWANG TIFF!4xR ....
When dd outputs the file to disk and I open it, I would find the first
line to read as:
*XOXXXWANG TIFF!4xR ....
It's cutting off the first two characters. Can I modify that len value
somehow to tell it to go back two chars?
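One way to see where the computed offset actually lands is to dump the bytes on both sides of the boundary with od. The sketch below uses made-up data (the file name infile and its contents are hypothetical) in which the offset is correct; on a real file, the same two dumps would show what the text part really ends with and what the binary part really starts with. A TIFF file normally begins with "II*" or "MM".

```shell
#!/bin/sh
# Sketch only: infile and its contents are hypothetical test data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'text line\n##TIFF##\nII*\000\377payload' > infile

sed '/^##TIFF##$/q' < infile > part1
len=$(wc -c < part1 | tr -d ' ')

# Last bytes of the text part: should end with the tag and one newline.
tail -c 12 part1 | od -c

# First bytes after the computed offset: should be the TIFF header.
dd if=infile bs=1 skip="$len" count=4 2>/dev/null | od -c

# Simple check on the first two bytes of the binary part.
first=$(dd if=infile bs=1 skip="$len" count=2 2>/dev/null)
[ "$first" = "II" ] && echo "offset lands on the TIFF header"
```

If the first dump shows extra bytes after the tag, or the second dump does not start at the header, that difference is the amount by which the skip value needs adjusting.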
| |
| matt@bobsroom.ca 2006-07-24, 6:59 pm |
|
matt@bobsroom.ca wrote:
> Janis Papanagnou wrote:
>
> Ok Janis... I am getting much closer...
>
> I did your method with DD and it seems to be working perfectly except
> for one minor problem.
>
> It appears that when DD generates part2, it is cutting off 2 characters
> at the very beginning of the file.
>
> For example, say my original file contained something similar to this:
>
> II*XOXXXWANG TIFF!4xR ....
>
> When dd outputs the file to disk and I open it, I would find the first
> line to read as:
>
> *XOXXXWANG TIFF!4xR ....
>
> It's cutting off the first two characters. Can I modify that len value
> somehow to tell it to go back two chars?
I got it!
let "mylen=len-2"
I used that to do the math, changed the variable in the dd command for
skip and it works!
Now, the next question I have is how can I recombine these two files
after I have made modifications to one of them?
Can I just use something like this:
cat part1 part2 > final.out
??
| |
| matt@bobsroom.ca 2006-07-24, 6:59 pm |
|
matt@bobsroom.ca wrote:
> matt@bobsroom.ca wrote:
>
> I got it!
>
> let "mylen=len-2"
>
> I used that to do the math, changed the variable in the dd command for
> skip and it works!
>
> Now, the next question I have is how can I recombine these two files
> after I have made modifications to one of them?
>
> Can I just use something like this:
>
> cat part1 part2 > final.out
>
> ??
OK, I've been given more to do than I realized.
How could I do this same process with multiple binary and ASCII files
already combined into one?
Say I have a file like this:
#TOPTAG#
ascii ascii ascii ascii ascii ascii
ascii ascii ascii ascii ascii ascii
ascii ascii ascii ascii ascii ascii
##TIFF##
binary binary binary binary binary binary binary binary
binary binary binary binary binary binary binary binary
binary binary binary binary binary binary binary binary
#TOPTAG#
ascii ascii ascii ascii ascii ascii
ascii ascii ascii ascii ascii ascii
ascii ascii ascii ascii ascii ascii
##TIFF##
binary binary binary binary binary binary binary binary
binary binary binary binary binary binary binary binary
binary binary binary binary binary binary binary binary
I now need to extract each ASCII section and each binary section as
separate data, and then recombine them afterwards in the same order.
Any ideas?
| |
| Chris F.A. Johnson 2006-07-24, 6:59 pm |
| On 2006-07-24, matt@bobsroom.ca wrote:
>
> I got it!
>
> let "mylen=len-2"
The portable syntax is:
mylen=$(( $len - 2 ))
> I used that to do the math, changed the variable in the dd command for
> skip and it works!
>
> Now, the next question I have is how can I recombine these two files
> after I have made modifications to one of them?
>
> Can I just use something like this:
>
> cat part1 part2 > final.out
Yes; that's what cat is for.
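A small round-trip sketch, under the assumption that part1 still includes the tag line and that neither part has been modified yet: splitting and then concatenating should give back a byte-identical file, which cmp can confirm. The file name original.dat and its contents are made up for illustration.

```shell
#!/bin/sh
# Sketch only: original.dat and its contents are hypothetical test data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'text line\n##TIFF##\nII*\000\377payload' > original.dat

# Split: text part (with tag line) via sed, binary part via dd.
sed '/^##TIFF##$/q' < original.dat > part1
len=$(wc -c < part1 | tr -d ' ')
dd if=original.dat of=part2 bs="$len" skip=1 2>/dev/null

# Recombine in the original order.
cat part1 part2 > final.out

# With no modifications made, the result should be byte-identical.
cmp original.dat final.out && echo "round trip OK"
```

Running this check once before editing either part is a cheap way to confirm that the split itself loses nothing.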
--
Chris F.A. Johnson, author <http://cfaj.freeshell.org>
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
===== My code in this post, if any, assumes the POSIX locale
===== and is released under the GNU General Public Licence
|
|
|
|
|