Re: How to pull Text from a PDF using Perl?
Author: DJ Stunks
Date: 2007-01-06, 7:04 pm

Wagner, David --- Senior Programmer Analyst --- WGO wrote:
> I have tried both PDF::API2 and CAM::PDF and I must be
> misunderstanding how to use these modules. Here is the way
> I attempted using CAM::PDF
>
> Source portion:
> …
> use CAM::PDF;
> …………
>
> $MyPDF = CAM::PDF->new($MyFileIn); # a PDF file which has text
>
> $MyPDFPgCnt = $MyPDF->numPages();
>
> my $contentTree = $MyPDF->getPageContentTree(1);
> $contentTree->render("CAM::PDF::Renderer::Text");
>
> I get a lot of blank lines and the characters I do get, look like:
>
> 3 U L Q W ♥ ' D W H ↔ ♥ ¶ § ↕ § § ↕ § ‼ ‼ ↓
>
> & K L O G ♥ $ F F R X Q W V
> 7 L P H ↔ ♥ ¶ § ↔ ¶ ∟ 3 0

I think your use of render() isn't right. This seems to work for me:

#!/usr/bin/perl

use strict;
use warnings;

use CAM::PDF;
use CAM::PDF::PageText;

my $filename = shift || die "Supply pdf on command line\n";

my $pdf = CAM::PDF->new($filename);

print text_from_page(1);

sub text_from_page {
    my $pg_num = shift;

    return
        CAM::PDF::PageText->render($pdf->getPageContentTree($pg_num));
}

__END__
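
If you need more than page 1, here is a rough sketch along the same lines that walks every page using the numPages() call from your own snippet. The names text_from_pages and the "--- page N ---" separator are just made up for illustration, and I'm assuming that new() and getPageContentTree() can hand back undef when something is missing, so both are guarded. Treat it as a starting point, not tested against your file:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: collect text for pages 1..$num_pages.
# $extract is a code ref that takes a page number and returns
# that page's text, or undef if there is nothing to return.
sub text_from_pages {
    my ($num_pages, $extract) = @_;

    my @chunks;
    for my $pg_num (1 .. $num_pages) {
        my $text = $extract->($pg_num);
        next unless defined $text && length $text;   # skip empty pages
        push @chunks, "--- page $pg_num ---\n$text\n";
    }
    return join '', @chunks;
}

# Only load CAM::PDF when a file is actually given on the command line.
if (@ARGV) {
    require CAM::PDF;
    require CAM::PDF::PageText;

    # Assumption: new() returns a false value if the file cannot be parsed.
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Could not open $ARGV[0] as a PDF\n";

    print text_from_pages(
        $pdf->numPages(),
        sub {
            my $tree = $pdf->getPageContentTree(shift);
            # Assumption: a page without a content stream yields undef.
            return defined $tree ? CAM::PDF::PageText->render($tree) : undef;
        },
    );
}
```

Passing the extractor in as a code ref keeps the loop separate from CAM::PDF, so you can try the loop with a stub before pointing it at a real PDF.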

HTH,
-jp
