Reading in data until I have a full structure
| pwaring@gmail.com 2007-02-25, 7:03 pm |
| I've got a text file which is full of questions in a format similar to
the following:
QUESTION_ID "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
...
ANSWER_N)
At the moment I can parse each individual question into its component
parts without any problems (it's not the most pleasant regex in the
world, but it works). However, I'm having trouble turning the whole
file into an array of questions which I can then parse individually.
Each question is separated from the next by at least two newlines, but
unfortunately there are sometimes two newlines between SHORT_QUESTION
and (ANSWER_1 as well, so I can't assume that two newlines indicate the
end of a question, which is what I've been doing so far.
I was wondering if anyone could point me in the right direction for a
way to get around this problem. Basically, I need to read in data
until I know I've got a full question with its answers, parse that
(which I can already do), save the results somewhere (already done as
well) and then carry on to read in the next question. Assuming that a
question ends at two newlines often means I get the answers as a
separate chunk, which causes problems when I try to split it into
smaller parts.
If anyone has any ideas as to how I can get around this, I'd be very
grateful.
Thanks in advance,
Paul
| |
| Mark Clements 2007-02-25, 7:03 pm |
| pwaring@gmail.com wrote:
> I've got a text file which is full of questions in a format similar to
> the following:
>
> QUESTION_ID "QUESTION_META_DATA
> FULL_QUESTION"
> /"SHORT_QUESTION"
> (ANSWER_1,
> ANSWER_2,
> ...
> ANSWER_N)
>
> At the moment I can parse each individual question into its component
> parts without any problems (it's not the most pleasant regex in the
> world, but it works), however I'm having trouble turning the whole
> file into an array of questions which I can then parse individually.
> Each question is separated from the next by at least two newlines, but
> unfortunately there is sometimes two newlines between SHORT_QUESTION
> and (ANSWER_1, so I can't assume that two newlines indicate the end of
> a question, which is what I've been doing so far.
>
> I was wondering if anyone could point me in the right direction for a
> way to get around this problem - basically I need to read in data
> until I know I've got a full question with answers (assuming this ends
> at two newlines often means I get the answers separately, which causes
> problems when I try to split this into smaller parts), parse that
> (which I can already do), save the results somewhere (already done as
> well) and then carry on to read in the next question.
>
I'm sure someone here who knows far more about regular expressions than
I do will come up with a workable solution, but personally I'd be
tempted to use a lexer instead.
http://www.perl.com/pub/a/2006/01/05/parsing.html
Mark
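To give a rough idea of what the lexer approach could look like, here is a minimal sketch. It assumes the format only needs six token types (quoted strings, a slash, parentheses, commas and bare words); the tokenize sub and the token names are made up for illustration and are not taken from the linked article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tokenizer: \G anchors each match where the previous one
# ended, and /c keeps that position when a match fails.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while ( pos($text) < length $text ) {
        if    ( $text =~ /\G\s+/gc )             { }    # whitespace is discarded
        elsif ( $text =~ /\G"([^"]*)"/gc )       { push @tokens, [ STRING => $1 ] }
        elsif ( $text =~ /\G\//gc )              { push @tokens, [ SLASH  => '/' ] }
        elsif ( $text =~ /\G\(/gc )              { push @tokens, [ LPAREN => '(' ] }
        elsif ( $text =~ /\G\)/gc )              { push @tokens, [ RPAREN => ')' ] }
        elsif ( $text =~ /\G,/gc )               { push @tokens, [ COMMA  => ',' ] }
        elsif ( $text =~ /\G([^\s"\/(),]+)/gc )  { push @tokens, [ WORD   => $1 ] }
        else {
            # e.g. a quote that is never closed
            die "Unexpected character at offset " . pos($text) . "\n";
        }
    }
    return @tokens;
}

# Made-up sample with a blank line between the short question and the answers.
my $sample = qq{QUESTION_1 "META\n  FULL"\n  /"SHORT"\n\n  (A1,\n   A2)\n};
my @tokens = tokenize($sample);
print join( ' ', map { $_->[0] } @tokens ), "\n";
```

Because whitespace is thrown away during tokenizing, a blank line between SHORT_QUESTION and the answers no longer matters; whatever consumes the tokens can decide where a question ends from the token types alone.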
| |
| A. Sinan Unur 2007-02-25, 7:03 pm |
| "pwaring@gmail.com" <pwaring@gmail.com> wrote in
news:1172429732.393516.115680@j27g2000cwj.googlegroups.com:
> I've got a text file which is full of questions in a format similar to
> the following:
Please read the posting guidelines for this group before posting again.
> QUESTION_ID "QUESTION_META_DATA
> FULL_QUESTION"
> /"SHORT_QUESTION"
> (ANSWER_1,
> ANSWER_2,
> ...
> ANSWER_N)
>
....
> I was wondering if anyone could point me in the right direction for a
> way to get around this problem - basically I need to read in data
> until I know I've got a full question with answers (assuming this ends
> at two newlines often means I get the answers separately, which causes
> problems when I try to split this into smaller parts), parse that
> (which I can already do), save the results somewhere (already done as
> well) and then carry on to read in the next question.
You might want to read perldoc perlvar, especially the entry for $/
(the input record separator):
#!/usr/bin/perl

use strict;
use warnings;

# Treat ")\n\n" (a closing parenthesis followed by a blank line) as the
# end of each record, so a blank line inside a question no longer
# splits it.
local $/ = ")\n\n";

my %questions;

while( my $chunk = <DATA> ) {
    chomp $chunk;               # strips the trailing ")\n\n"
    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;
    if( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \n+\s+
            (\w+)       # FULL_QUESTION
            "\n\s+/"
            (\w+)       # SHORT_QUESTION
            "\n+\s+\(
            (.+)        # ANSWERS
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ($2, $3, $4);
        $q{ answers } = [ split /,\n\s+/, $5 ];
        $questions{ $1 } = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;
__DATA__
QUESTION_1 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_3,
    ANSWER_4,
    ANSWER_N)

QUESTION_2 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_X,
    ANSWER_N)

C:\DOCUME~1\asu1\LOCALS~1\Temp\2> t
$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };
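To see what the record separator does in isolation, here is a small sketch that reads from an in-memory filehandle. The sample string and variable names are invented for the demonstration; the point is only that $/ is matched as a literal string (not a regex), so a blank line that is not preceded by ")" does not end a record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample: the first record contains a blank line of its own.
my $text = "a\n\n(b,\nc)\n\nd\n(e)\n\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = ")\n\n";         # literal string, not a pattern
    while ( my $record = <$fh> ) {
        chomp $record;          # removes a trailing ")\n\n" if present
        push @records, $record;
    }
}
close $fh;

printf "%d records\n", scalar @records;
```

Note that chomp removes the whole separator, closing parenthesis included, which is why the last answer in the output above has no ")" attached.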
| |
| pwaring@gmail.com 2007-02-25, 7:03 pm |
| On Feb 25, 7:49 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
> You might want to read perldoc perlvar, especially about $/ :
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> local $/ = ")\n\n";
That looks almost like what I want, but I should have mentioned in my
original post that the brackets are optional if there is only one
answer, so I don't think that looking for )\n\n would work.
Paul
| |
| A. Sinan Unur 2007-02-25, 7:03 pm |
| "pwaring@gmail.com" <pwaring@gmail.com> wrote in
news:1172436165.434037.47720@t69g2000cwt.googlegroups.com:
> On Feb 25, 7:49 pm, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>
> That looks almost like what I want, but I should have mentioned in my
> original post that the brackets are optional if there is only one
> answer, so I don't think that looking for )\n\n would work.
Well, here's your last fish:
#!/usr/bin/perl

use strict;
use warnings;

my %questions;

# A question starts on a line beginning with QUESTION and runs until
# the next such line (or end of input), however many blank lines it
# contains.
LINE: while( my $line = <DATA> ) {
    next LINE unless $line =~ /\AQUESTION/;
    NEW_QUESTION: my $chunk = $line;
    do {
        $line = <DATA>;
        unless ( defined $line ) {
            parse_chunk( $chunk );      # end of input: flush the last question
            last LINE;
        }
        if ( $line =~ /\AQUESTION/ ) {
            parse_chunk( $chunk );      # next question starts: flush this one
            goto NEW_QUESTION;
        }
        $chunk .= $line;
    } while ( 1 );
}

sub parse_chunk {
    my ($chunk) = @_;
    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;
    # Note: this pattern still expects an opening parenthesis, and the
    # closing one is left on the last answer.
    if( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \n+\s+
            (\w+)       # FULL_QUESTION
            "\n\s+/"
            (\w+)       # SHORT_QUESTION
            "\n+\s+\(
            (.+)        # ANSWERS
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ($2, $3, $4);
        $q{ answers } = [ split /,\n\s+/, $5 ];
        $questions{ $1 } = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;
__DATA__
QUESTION_1 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_3,
    ANSWER_4,
    ANSWER_N)

QUESTION_2 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    (ANSWER_1,
    ANSWER_2,
    ANSWER_X,
    ANSWER_N)
$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };
Sinan
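For the case where the parentheses are optional, one possible variation is sketched below. It is only an illustration under a few assumptions: every question starts on a line beginning with QUESTION (as in the script above), the fields are single words like the placeholder data, and answers contain no commas or parentheses of their own. The names split_questions and parse_question are invented for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group lines into one chunk per question. A line starting with
# QUESTION opens a new chunk; every other line joins the current one.
sub split_questions {
    my ($text) = @_;
    my @chunks;
    for my $line ( split /^/m, $text ) {
        if ( $line =~ /\AQUESTION/ ) {
            push @chunks, $line;
        }
        elsif (@chunks) {
            $chunks[-1] .= $line;
        }
    }
    return @chunks;
}

# Returns ( id, hashref ) on success, or an empty list if the chunk
# does not match.
sub parse_question {
    my ($chunk) = @_;
    return unless $chunk =~ m{
        \A \s*
        (\w+)                   # QUESTION_ID
        \s+ "
        (\w+)                   # QUESTION_META_DATA
        \s+
        (\w+)                   # FULL_QUESTION
        " \s+ / "
        (\w+)                   # SHORT_QUESTION
        " \s+
        (?: \( (.+?) \)         # parenthesized answer list
          | ([^\s(),]+)         # or one bare answer
        )
        \s* \z
    }xms;
    my ( $id, $meta, $full, $short ) = ( $1, $2, $3, $4 );
    my $answers = defined $5 ? $5 : $6;
    return ( $id, {
        qmeta   => $meta,
        qfull   => $full,
        qshort  => $short,
        answers => [ split /\s*,\s*/, $answers ],
    } );
}

# Made-up input: a blank line before the answers in the first question,
# a single answer without parentheses in the second.
my $text = <<'END';
QUESTION_1 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"

    (ANSWER_1,
    ANSWER_2)


QUESTION_2 "QUESTION_META_DATA
    FULL_QUESTION"
    /"SHORT_QUESTION"
    ANSWER_ONLY
END

my %questions;
for my $chunk ( split_questions($text) ) {
    my ( $id, $q ) = parse_question($chunk) or next;
    $questions{$id} = $q;
}

printf "%s: %s\n", $_, join( ', ', @{ $questions{$_}{answers} } )
    for sort keys %questions;
```

Keeping the chunking separate from the parsing means the existing per-question regex can be dropped in place of parse_question without touching the loop, and the closing parenthesis is consumed by the pattern rather than left on the last answer.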