Compare two pdf documents
|
|
|
| Hi guys,
I need to write automated tests for an application that manipulates
the data it holds. One of its features is generating a PDF document
from that data. I need some way to compare two PDFs generated from
the same data (the generated PDFs should be the same), but a simple
bit-by-bit comparison does not work because the PDF has a
timestamp/signature header that changes with each generation. So my
question is: is there any way to parse the PDF, or to split the PDF
data from the header and compare just the data, or is there any
other way to compare two PDF documents?
| |
|
| ziki wrote...
>
>Hi guys,
>I need to write automated tests for an application that manipulates
>the data it holds. One of its features is generating a PDF document
>from that data. I need some way to compare two PDFs generated from
>the same data (the generated PDFs should be the same), but a simple
>bit-by-bit comparison does not work because the PDF has a
>timestamp/signature header that changes with each generation. So my
>question is: is there any way to parse the PDF, or to split the PDF
>data from the header and compare just the data, or is there any
>other way to compare two PDF documents?
I printed your posting to Acrobat PDF twice, as test1.pdf and test2.pdf.
Running diff on the two files showed the following differences.
It is not hard to filter out the differing lines that contain
"CreateDate", "ModifyDate", etc.
d:\>gdiff --unified=3 test1.pdf test2.pdf | grep "^[+-]"
--- test1.pdf 2005-10-14 12:39:46.000000000 -0700
+++ test2.pdf 2005-10-14 12:40:04.000000000 -0700
-<</Size 17/Prev 7902/Root 7 0 R/Info 5 0 R/ID[<00A171CFAFAD570BB1F043C125716AF4><64464BAA5F4335419FDA6E557532239C>]>>
+<</Size 17/Prev 7902/Root 7 0 R/Info 5 0 R/ID[<C22804FD5EFAC7716F4D80D6B021F1AE><B6749592377C084CB6BED9834E47706E>]>>
- <xap:ModifyDate>2005-10-14T12:39:43-07:00</xap:ModifyDate>
- <xap:CreateDate>2005-10-14T12:39:43-07:00</xap:CreateDate>
+ <xap:ModifyDate>2005-10-14T12:40:03-07:00</xap:ModifyDate>
+ <xap:CreateDate>2005-10-14T12:40:03-07:00</xap:CreateDate>
- <xapMM:DocumentID>uuid:d35513ef-e9c6-45db-875b-1be90a4f1fc8</xapMM:DocumentID>
- <xapMM:InstanceID>uuid:2212b4a6-d4d0-416c-96aa-17bdf2047155</xapMM:InstanceID>
+ <xapMM:DocumentID>uuid:a84f5c84-7676-4dfb-b086-e19f46747061</xapMM:DocumentID>
+ <xapMM:InstanceID>uuid:a72bc779-ee61-410b-a41d-2f278992d006</xapMM:InstanceID>
5 0 obj<</CreationDate(D:20051014123943-07'00')/Author(Harry)/Creator(PScript5.dll Version 5.2)/Producer(Acrobat Distiller 7.0 \(Windows\))/ModDate(D:200510141239
xrefbj00')/Title(WinVn Article)>>
5 0 obj<</CreationDate(D:20051014124003-07'00')/Author(Harry)/Creator(PScript5.dll Version 5.2)/Producer(Acrobat Distiller 7.0 \(Windows\))/ModDate(D:200510141240
xrefbj00')/Title(WinVn Article)>>
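
The same diff-then-filter idea can be sketched in a script. What follows is only an illustration in Python, not something posted in the thread: the marker list is an assumption copied from the output above, and the function name `unexpected_differences` is made up for the example. It diffs the two files line by line and reports only the changed lines that do not mention one of the known volatile fields.

```python
import difflib
import re

# Assumed markers, copied from the diff output above; adjust the list
# to whatever your own PDF generator writes.
VOLATILE = re.compile(
    rb"CreationDate|ModDate|CreateDate|ModifyDate|DocumentID|InstanceID|/ID\["
)


def unexpected_differences(data_a: bytes, data_b: bytes) -> list[bytes]:
    """Return changed lines ('-' = only in A, '+' = only in B) that do
    not contain any of the known volatile metadata markers."""
    # bytes.splitlines() splits on \n, \r and \r\n, so PDFs that use
    # carriage returns as line endings are handled as well.
    lines_a = data_a.splitlines()
    lines_b = data_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += [b"-" + line for line in lines_a[i1:i2]]
            changed += [b"+" + line for line in lines_b[j1:j2]]
    return [line for line in changed if not VOLATILE.search(line)]
```

To use it, read both files in binary mode (`open(path, "rb").read()`) and pass the two byte strings; an empty result means every difference was on a line carrying one of the listed markers. Note that a line is ignored as a whole, so real content sharing a line with a marker would be missed, and compressed content streams may differ in many bytes even for a small visible change.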
| |
|
| One way to do this is to convert the two PDFs to Word format using a
tool like (www.solidpdf.com). From there you can copy and paste the
text into notepad and use a Diff comparison tool like Beyond Compare
(www.scootersoftware.com).
I have found this technique quite useful recently for migration testing
and verifying whether or not two PDFs are alike. It may sound like a
bit of work (converting from pdf to txt etc.) but I can guarantee that
this method is much quicker than doing it manually.
Let me know if this is of any use to you.
Regards,
David
| |
| Michael Bolton 2005-10-16, 9:57 pm |
|
Harry wrote:
> ziki wrote...
>
> I printed your posting to Acrobat PDF twice, as test1.pdf and test2.pdf.
> On diff, it showed the following difference.
> It is not hard to filter those line difference containing "CreateDate",
> "ModifyDate", etc.
Harry's approach is an excellent first project for someone who is
interested in learning Perl, Ruby, or any other scripting language
that supports regular expressions. He's also
shown us how to do the investigation. The general idea is that if two
files differ, they tend to differ in similar ways; I find the
comparison easiest when I open the binary file in a good text editor
(like TextPad).
Having figured out where the files differ (that is, in the first thirty
lines or so), you can use regular expressions to figure out which parts
of the file you do or do not wish to compare. (Support for regular
expressions is central to Perl and a key factor of Ruby.) Read the
file into memory, but as you're reading, ignore the lines about which
you don't care. Byte-compare the rest. If the two output streams are
supposed to be identical, then the computational (that is, programming)
task is easy.
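
A minimal sketch of that read, ignore, and byte-compare procedure follows. It is written in Python rather than Perl or Ruby purely for illustration; the ignore patterns are assumptions based on the differences Harry found, and the names `significant_bytes` and `same_content` are hypothetical.

```python
import re

# Assumed "don't care" patterns, based on the fields that differed in
# Harry's diff; extend or narrow them for your own documents.
IGNORE = re.compile(
    rb"CreationDate|ModDate|CreateDate|ModifyDate|DocumentID|InstanceID|/ID\["
)


def significant_bytes(data: bytes) -> bytes:
    """Drop every line that matches an ignore pattern and return the
    remaining bytes unchanged, line endings included."""
    kept = [
        line
        for line in data.splitlines(keepends=True)
        if not IGNORE.search(line)
    ]
    return b"".join(kept)


def same_content(path_a: str, path_b: str) -> bool:
    """Byte-compare two files after the ignored lines are removed."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        return significant_bytes(file_a.read()) == significant_bytes(file_b.read())
```

An automated test would then assert `same_content("test1.pdf", "test2.pdf")`. This only works when the generator is otherwise deterministic; if the ignored fields can change length between runs, byte offsets recorded elsewhere in the file may shift and would need their own pattern.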
If it's just the text you're interested in (which is often a
good-enough difference detector) and programming is a struggle, you can
pick Select off the Acrobat Reader menu, then Ctrl-A for Select All,
then copy the results to a text file for each file that you're
inspecting; then you CAN do a binary compare. That may not be
complete, but it may be good enough.
If I were in a terrible rush, I'd use the latter technique, but if I
had practically any time at all, I'd use the former.
---Michael B.
| |
| Matti Vuori 2005-10-17, 7:58 am |
| "dwebb" <dwebb83@gmail.com> wrote in news:1129447015.728441.317590@g44g2000cwa.googlegroups.com:
> One way to do this is to convert the two PDFs to Word format using a
> tool like (www.solidpdf.com). From there you can copy and paste the
> text into notepad and use a Diff comparison tool like Beyond Compare
> (www.scootersoftware.com).
As the idea was to automate the process, this seems a bit cumbersome.
However, there are many free applications that can convert the (textual)
contents of a well-behaved PDF to ASCII; headers can then be stripped
and the rest analyzed with any comparison software.
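
Once the text has been extracted, the strip-and-compare step could look roughly like this. The sketch assumes the extraction has already been done by some external converter (a command-line tool such as pdftotext is one option) and that the volatile header occupies a fixed number of leading lines; the function name and the `header_lines` parameter are hypothetical.

```python
import difflib


def text_differences(text_a: str, text_b: str, header_lines: int = 0) -> list[str]:
    """Return a unified diff of two extracted texts, skipping the first
    header_lines lines of each (assumed to hold the volatile header)."""
    lines_a = text_a.splitlines()[header_lines:]
    lines_b = text_b.splitlines()[header_lines:]
    return list(
        difflib.unified_diff(lines_a, lines_b, "first", "second", lineterm="")
    )
```

An empty list means the remaining text is identical; otherwise the returned lines can be printed in the test log to show what changed. Comparing extracted text ignores layout, fonts and images, so it catches content changes only.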
--
Matti Vuori
http://sivut.koti.soon.fi/mvuori/index-e.htm
| |
|
| Thanks a lot, guys.
I'm trying to automate the comparison programmatically, as you
suggested; I would also like to hear more ideas :)
Thanks again!
| |
|
| Sorry, I misread the initial post. The approach I suggested was
only meant to aid manual testing when comparing two PDFs.
| |
|
|
| grigsoft@gmail.com 2005-10-22, 3:57 am |
| Hello,
You may want to check our Compare It! tool at
http://www.grigsoft.com/. We have an add-in for PDF comparison.
Igor Green
http://www.grigsoft.com/
Compare It! + Synchronize It! - files and folders comparison never was
easier!
| |
| fprax@web.de 2005-10-23, 7:01 pm |
| "Michael Bolton" <google@michaelbolton.net> wrote:
>
>Harry wrote:
A simple way is to use Acrobat (not the Reader; the full package).
This program includes functionality to manipulate PDFs (cut out
certain pieces) and to compare two files with different options (content,
content and format, graphical). As a result, you get a new PDF
with the differences marked with coloured borders.
We use this in a larger project, and for convenience we automated the
comparison with a test automation tool (like SilkTest), so we can
compare hundreds of documents with one click.
Maybe not the cheapest solution, but it works fine.
Ciao,
Franz
| |
| sunlightgupta@yahoo.co.in 2005-10-24, 7:58 am |
| I believe that Adobe Inc. should provide some user-level API for this
kind of work (I think some of the API is available as a .dll).
Regards,
-Ravi Prakash
|