extract text from string literals
| Andre Friedrich 2004-10-10, 3:56 am |
| Hi experts,
I try to extract the text from a ps document (without any tools). But I don't
know how any white spaces (like space, tabs, etc) are encoded. I don't mean
the white-spaces between '(' and ')' I mean the white-spaces which are
declared outside of these.
For example:
-------------------------
/F0 10/Times-Roman@0 SF 104.275(CDRECORD\(1\) Schily\264s)72 48 R
(USER COMMANDS)2.5 E(CDRECORD\(1\))106.775 E/F1 10.95/Times-Bold@0 SF
-.219(NA)72 84 S(ME).219 E F0
(cdrecord \255 record audio or data Compact Discs from a master)108 96 Q
F1(SYNOPSIS)72 112.8 Q/F2 10/Times-Bold@0 SF(cdr)108 124.8 Q(ecord)-.18
E F0([)2.5 E/F3 10/Times-Italic@0 SF -.1(ge)2.5 G(ner).1 E(al options)
-------------------------
I got this:
-------------------------
CDRECORD(1) Schily´sUSER COMMANDSCDRECORD(1)NAMEcdrecord _ record audio or
data Compact Discs from a masterSYNOPSIScdrecord[general options
-------------------------
but it should look like this:
-------------------------
CDRECORD(1) Schily's USER COMMANDS CDRECORD(1)
NAME
cdrecord - record audio or data Compact Discs from a master
SYNOPSIS
cdrecord [ general options
-------------------------
Can anyone tell me how white-spaces are encoded outside of parentheses, please?
Bye,
André
| |
| Aandi Inston 2004-10-10, 3:56 am |
| "Andre Friedrich" <dragon_af_de@yahoo.de> wrote:
>Hi experts,
>I try to extract the text from a ps document (without any tools). But I don't
>know how any white spaces (like space, tabs, etc) are encoded.
White space is not typically encoded. Each string is placed at a
particular position. When the human eye reads the page, if a string is
further apart it sees spaces. If things are placed at the same X
co-ordinate it may see tabs.
To extract an approximation of the intended text you need to know the
metrics of all fonts used, so you can determine the widths between
strings. Then you can use fuzzy logic to decide which of those widths
are to be considered to be spaces or something else.
This is not the work of an afternoon, or a week. Now you see,
perhaps, why people use tools for this job... to do an accurate job
you do need a PostScript interpreter, which a fast and experienced
programmer could perhaps put together in a couple of years.
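As a rough illustration of the gap-based approach described above, here is a minimal Python sketch. It assumes each string has already been reduced to a hypothetical run (left edge, advance width, text) on a single baseline, with the width taken from the font metrics. The `Run` type, the `join_runs` name, the sample widths and the threshold of 0.25 times the font size are all illustrative assumptions, not values from any real tool.

```python
from dataclasses import dataclass

@dataclass
class Run:
    x: float      # left edge of the string, in points
    width: float  # advance width from the font metrics, in points
    text: str

def join_runs(runs, font_size, space_fraction=0.25):
    """Join runs on one baseline, inserting a space where the gap is wide."""
    threshold = font_size * space_fraction
    out = []
    prev_end = None
    for run in sorted(runs, key=lambda r: r.x):
        # A gap wider than the threshold is treated as an intended space.
        if prev_end is not None and run.x - prev_end > threshold:
            out.append(" ")
        out.append(run.text)
        prev_end = run.x + run.width
    return "".join(out)

# Made-up positions and widths, for illustration only.
runs = [Run(72.0, 20.0, "cdr"), Run(92.0, 25.0, "ecord"), Run(120.0, 3.3, "[")]
print(join_runs(runs, font_size=10))
```

The threshold is the "fuzzy" part: set it too low and kerned pairs split into separate words, set it too high and real word gaps disappear. It also only works once the positions and widths are known, which is exactly what needs an interpreter and font metrics.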
----------------------------------------
Aandi Inston quite@dial.pipex.com http://www.quite.com
Please support usenet! Post replies and follow-ups, don't e-mail them.
| |
| Andre Friedrich 2004-10-10, 3:56 am |
|
"Aandi Inston" <quite@dial.pipex.con> wrote:
> White space is not typically encoded. Each string is placed at a
> particular position. When the human eye reads the page, if a string is
> further apart it sees spaces. If things are placed at the same X
> co-ordinate it may see tabs.
So far so good...
> To extract an approximation of the intended text you need to know the
> metrics of all fonts used, so you can determine the widths between
> strings. Then you can use fuzzy logic to decide which of those widths
> are to be considered to be spaces or something else.
> This is not the work of an afternoon, or a week. Now you see,
> perhaps, why people use tools for this job... to do an accurate job
> you do need a PostScript interpreter, which a fast and experienced
> programmer could perhaps put together in a couple of years.
It sounds like a lot of hard work. Now I know why one should use tools for
this job.
Thanks for your answer,
André
| |
| Aandi Inston 2004-10-10, 3:56 am |
| "Andre Friedrich" <dragon_af_de@yahoo.de> wrote:
>It sounds like a lot of hard work. Now I know why one should use tools for
>this job.
I agree. If anyone is in any doubt, consider this sample (which may be
handled badly even by text extraction tools):
%!
/Courier 10 selectfont
100 500 moveto
(The amount tendered was $) show
[
(9) 500 264
(1) 500 250
(0) 500 256
(0) 500 256
(0) 500 256
]
aload
length 3 idiv
{ exch moveto show } repeat
( in guaranteed bonds) pop
showpage
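To make the trap in this sample concrete, here is a hedged Python sketch that replays it by hand. It assumes Courier's fixed advance of 0.6 em (6 points at size 10). The `shows` list is a hand transcription of what an interpreter would execute, not output from a real tool, and `page_text` with its 3-point gap threshold is a hypothetical helper.

```python
CHAR_W = 6.0  # Courier advance at 10 pt (600/1000 em), assumed

# Strings as a naive scanner would collect them, in source order.
source_strings = ["The amount tendered was $", "9", "1", "0", "0", "0",
                  " in guaranteed bonds"]

# (x, y, text) in execution order: `repeat` consumes the aload-ed triples
# from the top of the stack, `exch moveto` uses the last number as x, and
# the final string is popped, never shown.
shows = [(100, 500, "The amount tendered was $"),
         (256, 500, "0"), (256, 500, "0"), (256, 500, "0"),
         (250, 500, "1"), (264, 500, "9")]

def page_text(shows, space_gap=3.0):
    """Rebuild one line from positioned strings (all assumed on one baseline)."""
    out, prev_end = [], None
    # set() drops exact overprints; sorting gives left-to-right reading order.
    for x, _y, text in sorted(set(shows)):
        if prev_end is not None and x - prev_end > space_gap:
            out.append(" ")
        out.append(text)
        prev_end = x + CHAR_W * len(text)
    return "".join(out)

print("".join(source_strings))  # naive extraction from the source
print(page_text(shows))         # what a reader of the page would see
```

Under these assumptions the naive reading is "$91000 in guaranteed bonds", while the page shows "$109" and no mention of bonds: the three zeros overprint each other, the digits are drawn out of order, and the last string is discarded.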
| |
| Ken Sharp 2004-10-10, 3:56 am |
| In article <2smttmF1n4bonU1@uni-berlin.de>, dragon_af_de@yahoo.de
says...
> Hi experts,
> I try to extract the text from a ps document (without any tools). But I don't
> know how any white spaces (like space, tabs, etc) are encoded. I don't mean
> the white-spaces between '(' and ')' I mean the white-spaces which are
> declared outside of these.
>
> For example:
> -------------------------
> /F0 10/Times-Roman@0 SF 104.275(CDRECORD\(1\) Schily\264s)72 48 R
> (USER COMMANDS)2.5 E(CDRECORD\(1\))106.775 E/F1 10.95/Times-Bold@0 SF
> -.219(NA)72 84 S(ME).219 E F0
> (cdrecord \255 record audio or data Compact Discs from a master)108 96 Q
> F1(SYNOPSIS)72 112.8 Q/F2 10/Times-Bold@0 SF(cdr)108 124.8 Q(ecord)-.18
> E F0([)2.5 E/F3 10/Times-Italic@0 SF -.1(ge)2.5 G(ner).1 E(al options)
> -------------------------
> I got this:
> -------------------------
> CDRECORD(1) Schily´sUSER COMMANDSCDRECORD(1)NAMEcdrecord _ record audio or
> data Compact Discs from a masterSYNOPSIScdrecord[general options
> -------------------------
> but it should look like this:
> -------------------------
> CDRECORD(1) Schily's USER COMMANDS CDRECORD(1)
> NAME
> cdrecord - record audio or data Compact Discs from a master
> SYNOPSIS
> cdrecord [ general options
> -------------------------
>
> Can anyone tell me how white-spaces are encoded outside of parentheses, please?
It isn't. PostScript is both a programming language and a page
description language. The text can be positioned at any point on the
page. In this case the 'missing' white space is produced by explicit
movement commands:
(ge)2.5 G(ner).1 E(al options)
G will be a procedure for moving the current point and then drawing the
text (or drawing the text and then moving the current point, or possibly
drawing the text kerned). E will be a similar but different procedure.
These procedures will be defined in the prologue of the program.
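For the narrow task of pulling the string literals out of a line like the one above, a minimal Python sketch follows. It handles balanced parentheses, backslash escapes and octal codes, but it ignores hex strings, comments and everything the procedures do with the numbers. `extract_strings` is a hypothetical helper, and mapping each byte with `chr()` assumes a Latin-1-like encoding, which the font in use may not follow.

```python
ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
           "\\": "\\", "(": "(", ")": ")",
           "\n": ""}  # backslash-newline is a line continuation

def extract_strings(ps):
    """Return the decoded (...) string literals found in PostScript source."""
    strings, buf, depth, i = [], [], 0, 0
    while i < len(ps):
        c = ps[i]
        if depth == 0:
            # Outside a string: skip everything until the next "(".
            if c == "(":
                depth, buf = 1, []
            i += 1
            continue
        if c == "\\" and i + 1 < len(ps):
            nxt = ps[i + 1]
            if nxt in "01234567":
                # Octal escape: one to three digits, e.g. \264
                j = i + 1
                while j < len(ps) and j < i + 4 and ps[j] in "01234567":
                    j += 1
                buf.append(chr(int(ps[i + 1:j], 8) & 0xFF))
                i = j
            else:
                # For unknown escapes the backslash is simply dropped.
                buf.append(ESCAPES.get(nxt, nxt))
                i += 2
            continue
        if c == "(":
            depth += 1          # nested, balanced parenthesis
        elif c == ")":
            depth -= 1
            if depth == 0:      # end of the literal
                strings.append("".join(buf))
                i += 1
                continue
        buf.append(c)
        i += 1
    return strings

print(extract_strings(r"(ge)2.5 G(ner).1 E(al options)"))
```

Joining the returned strings reproduces the run-together text from the original question. The numbers and procedure names between the literals carry the spacing, and this sketch deliberately throws them away.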
PostScript is not a format for information exchange; it's a program
intended to be run on a printer and to make marks on the printed output.
Extracting text (or anything else) from a PostScript program is
non-trivial and in extreme cases impossible.
You *might* be able to use pstotext, or this document (if you're feeling
adventurous) might help:
http://craig.nevill-manning.com/pub...ed-IHW-Extract-Text.pdf
But unless you have a few years to spare, don't try writing a program to
extract 'text' from PostScript in the general case.
            Ken
| |