Beginner's question
| Erik ? 2005-05-17, 3:59 am |
| Hi, I'm basically trying to write an app which will extract text from
a postscript (PS)-file.
According to page 43 in the Postscript Language Reference
(http://partners.adobe.com/public/de.../en/ps/PLRM.pdf) it seems
like the text in a PS-file is expressed through string-objects.
Is this correct? If so, then there shouldn't really be a problem to
extract the text from a PS-file, right? All you would have to do is to
look for the data within the () and the <> characters.
This is where the trouble comes...I have exported a simple text-file
as a PS-file and in it I can find hexadecimal values within
<>-characters.
For example:
<10141D0B011E141F1D200F12130E0E1E>
The length of it (32 hexadecimal characters which is the same as 16
ASCII-characters) matches this line of text from the original
document:
"Bonuspoäng 2500p"
Same thing with the other hexadecimal strings, they match perfectly
in length (divided by two) and order to the lines of text from the
original document.
The thing is that when you convert these hexadecimal values to decimal
values, they don't match to the original text's ascii-values.
For example: if you take the first hexadecimal value mentioned before,
10, and convert it to a decimal you get the number 16. If you look in
the ASCII table, 16 is a control character and not the
letter "B" as expected. (B's ascii-value is 66). That goes for all the
hexadecimal values I can find. When converted to decimal values they
are all in the range 0 - 30.
So why is this? Does it have anything to do with the document being
coded in Clean7Bit?
Please help, I'm going nuts over this and can't figure it out...!
(big thanks in advance)
| |
| Jim Land 2005-05-17, 3:59 am |
| skickahit10@hotmail.com (Erik ?) wrote in
news:36c22c81.0505161608.f42563e@posting.google.com:
> ...I have exported a simple text-file
> as a PS-file...
How did you make your PS file? From which application? On which OS?
Using which PS driver? I ask because there are hundreds of ways of making
a PS file, and the files they produce aren't all the same.
| |
| Ken Sharp 2005-05-17, 8:58 am |
| In article <36c22c81.0505161608.f42563e@posting.google.com>, skickahit10
@hotmail.com says...
> According to page 43 in the Postscript Language Reference
> (http://partners.adobe.com/public/de.../en/ps/PLRM.pdf) it seems
> like the text in a PS-file is expressed through string-objects.
Yes, and again, no. Some (indeed most) text is coded inside string
objects, but not all. glyphshow for instance takes either a name object
or a numeric CID.
The only way to deal with arbitrary PostScript files in this way is to
write a full PostScript interpreter. Even then, it may still not be
possible to completely extract 'text'.
In addition to encoding (see later) there are two cases to consider:
1) Text in images. You would need an OCR package to extract this text.
2) Text converted to outlines. This appears in the program as vector
objects.
> Is this correct?
Mostly
> If so, then there shouldn't really be a problem to
> extract the text from a PS-file, right?
Nope.
> All you would have to do is to
> look for the data within the () and the <> characters.
What about images, patterns, shfill function objects and /Indexed colour
spaces ? All of these can use strings as a data source. A PostScript
program is a program, it can contain strings for a variety of purposes.
> For example: if you take the first hexadecimal value mentioned before,
> 10, and convert it to a decimal you get the number 16. If you look in
> the ASCII table, 16 is a control character and not the
> letter "B" as expected. (B's ascii-value is 66). That goes for all the
> hexadecimal values I can find. When converted to decimal values they
> are all in the range 0 - 30.
>
> So why is this? Does it have anything to do with the document being
> coded in Clean7Bit?
No. That really just means it's suitable for transmission by 7-bit
systems, such as email.
> Please help, I'm going nuts over this and can't figure it out...!
Text isn't really 'text'. It's an instruction to the interpreter to
extract a specific glyph from the current font, and follow the program
contained there.
Glyphs in PostScript fonts are identified by *name*, not by number
(caveat, CIDFonts are different again). So if you want the program for
the letter 'A' then you would access the glyph named '/A'.
Of course, that's rather a cumbersome way to use glyphs. Each font has
an 'Encoding'. This is an array of names which essentially maps a number
to a glyph name. E.g.:
index name
===========
0 /.notdef
....
....
0x30 /zero
0x31 /one
....
etc
So, when you present a string to the show operator, it takes each byte
of the string, maps it to a glyph name using the Encoding array,
extracts the glyph program from the font, and then executes the program.
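The lookup step can be sketched in a few lines of Python. This is only an illustration of the byte-to-name mapping, not of a real interpreter: the encoding list below is a hypothetical one holding just the two entries from the table above, and glyph_names is a made-up helper name.

```python
# Hypothetical 256-entry Encoding array; unassigned slots map to .notdef.
encoding = [".notdef"] * 256
encoding[0x30] = "zero"
encoding[0x31] = "one"

def glyph_names(string_bytes, encoding):
    """Return the glyph name selected by each byte of a show string."""
    return [encoding[b] for b in string_bytes]

# The literal string (10) holds the two bytes 0x31 and 0x30.
print(glyph_names(b"10", encoding))               # ['one', 'zero']
# The hex string <10> holds the single byte 0x10.
print(glyph_names(bytes.fromhex("10"), encoding))  # ['.notdef']
```

With this particular encoding the byte 0x10 selects nothing useful. A re-encoded font could just as well place a letter in that slot.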
Now, you will often find that a font is encoded with an ASCII Encoding,
so that the text will seem 'readable'. However, this is not guaranteed:
embedding a font in the program will often lead to it being re-encoded
in some way which the author of the driver felt was more useful.
So:
1) you can't extract 'text' just by reading the string arguments,
because non-glyph operators can use string arguments
2) Even if you work only with string arguments to the various show
operators you can't simply use the bytes in the string, because these
may be encoded oddly. You would have to examine the encoding and try to
make sense of the glyph names.
3) Not all the show operators take a string argument
4) An embedded font may not use meaningful glyph names. I've seen fonts
where the glyphs are named '/G00', '/G01' etc.
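As a rough Python sketch of points 2 and 4: once you have glyph names, you still need a table to turn them back into characters, and that table has nothing to say about names like 'G00'. The table below is a tiny hypothetical excerpt using standard Adobe glyph names, and names_to_text is a made-up helper.

```python
# Small hypothetical excerpt of a glyph-name-to-character table.
GLYPH_TO_CHAR = {
    "B": "B",
    "o": "o",
    "adieresis": "\u00e4",  # a with diaeresis
    "space": " ",
    "zero": "0",
    "one": "1",
}

def names_to_text(names, table=GLYPH_TO_CHAR, unknown="\ufffd"):
    """Map glyph names to characters; unrecognised names become U+FFFD."""
    return "".join(table.get(name, unknown) for name in names)

print(names_to_text(["one", "zero"]))       # 10
print(names_to_text(["G00", "G01"]) == "\ufffd\ufffd")  # True: no text recovered
```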
If this is anything other than a hobby project, then I would suggest
that you look into the commercial and free offerings available. There
are commercial conversion packages which will do a decent job of this,
and GhostScript (for example) is a free (for non-commercial use) utility
which I believe has text extraction capability as an optional adjunct.
Note, there are a number of areas of encoding which I have glossed over
above. For example I haven't considered OCF fonts (Chinese Japanese
Korean Vietnamese) which use two bytes for each glyph in their
'encoding', nor have I mentioned CIDFonts, which utilise a CMap instead
of an Encoding, and which can use a variable number of bytes for glyph
mapping.
If you continue with this project you will run into these, and other
issues. Please don't think this is a comprehensive treatment of the
subject.
Ken
| |
| Aandi Inston 2005-05-17, 8:58 am |
| skickahit10@hotmail.com (Erik ?) wrote:
>According to page 43 in the Postscript Language Reference
>(http://partners.adobe.com/public/de.../en/ps/PLRM.pdf) it seems
>like the text in a PS-file is expressed through string-objects.
Yes.
>
>Is this correct? If so, then there shouldn't really be a problem to
>extract the text from a PS-file, right?
If you think this is easy, you haven't read enough of the PS Reference
yet. To do this with full accuracy, you must have a full PostScript
interpreter. That isn't to say that you might not be able to get
reasonable text in some cases.
>All you would have to do is to
>look for the data within the () and the <> characters.
If it were that simple it wouldn't be a project of 2-5 person years!
>
>This is where the trouble comes...I have exported a simple text-file
>as a PS-file and in it I can find hexadecimal values within
><>-characters.
>
>For example:
><10141D0B011E141F1D200F12130E0E1E>
>
>The length of it (32 hexadecimal characters which is the same as 16
>ASCII-characters) matches this line of text from the original
>document:
>"Bonuspoäng 2500p"
This seems perfectly normal. You are assuming that a string will
contain ASCII characters. It need not. It contains numbers which are
an index into the font's encoding array, which is used to look up
glyph names.
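You can check this against the sample with a short Python sketch. It decodes the hex string into code values and pairs them with the known line of text. Note the assumption: the mapping here is inferred from this one sample only, it is not read from the font's actual Encoding array, and infer_mapping is a made-up helper name.

```python
hex_string = "10141D0B011E141F1D200F12130E0E1E"
known_text = "Bonuspo\u00e4ng 2500p"  # the line from the original document

codes = bytes.fromhex(hex_string)  # 16 code values, one per glyph

def infer_mapping(codes, text):
    """Pair each code with the character at the same position."""
    if len(codes) != len(text):
        raise ValueError("length mismatch")
    mapping = {}
    for code, char in zip(codes, text):
        # A code must always stand for the same character.
        if mapping.setdefault(code, char) != char:
            raise ValueError(f"code {code:#04x} maps to two characters")
    return mapping

mapping = infer_mapping(codes, known_text)
print(list(codes)[:4])                                      # [16, 20, 29, 11]
print("".join(mapping[c] for c in codes) == known_text)     # True
```

The pairing is consistent (both 'o's are 0x14, both 'n's are 0x1D), which is what you would expect if each byte is an index into a re-encoded font rather than an ASCII code.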
Here are some more interesting things to make it hard.
Consider this fragment.
% Demonstrating that you must interpret the PostScript
(Is that a ) show 2 { (yo) show } repeat (?) show
% Demonstrating that word breaks might not be spaces
(Is) show ... rmoveto (that) ... rmoveto (a) show ...
% Demonstrating that strings need not be split on words
(I) show ... rmoveto (s tha) show ... rmoveto (t a) show
% Demonstrating that strings need not be in reading order
(t a) show ... rmoveto (s tha) show ... rmoveto (I) show
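For the last case, an extractor has to know where each string was painted, not just the order it appeared in the file. A minimal Python sketch, assuming an interpreter has already recorded an x position for each string on a single line of text (the positions below are invented, and reading_order is a made-up helper):

```python
def reading_order(fragments):
    """Join (x, text) fragments from one line of text, left to right."""
    return "".join(text for x, text in sorted(fragments, key=lambda f: f[0]))

# Hypothetical x positions, listed in file order (modelled on the last fragment).
fragments = [(130.0, "t a"), (105.0, "s tha"), (100.0, "I")]
print(reading_order(fragments))  # Is that a
```

Real pages need more than this: multiple lines, rotated text and gaps that stand in for spaces all have to be handled.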
----------------------------------------
Aandi Inston quite@dial.pipex.com http://www.quite.com
Please support usenet! Post replies and follow-ups, don't e-mail them.