compare jpg and corresponding doc file
| Hi,
Please email me or post a reply if you know anything about the
following question.
Suppose we have a jpg file that contains only text, obtained by
scanning a page from a book (text only, no photos).
The jpg file was converted to a doc file using FineReader.
Now I would like to know the procedure for comparing each and every word in the
jpg with the corresponding doc file, in order to correct it or prepare a report.
My mail id: sai_ksvsk@yahoo.co.in
| |
| Mike Williams 2006-04-26, 3:56 am |
| "sai" <sai_ksvsk@yahoo.co.in> wrote in message
news:1146036442.354259.147600@g10g2000cwb.googlegroups.com...
> suppose we had a jpg file which consists of only text data
No such thing.
> which was obtained by scanning a page from a book
> (contains only text but not photos).
That's a jpg which consists of compressed bitmap data, and the bitmap just
happens to be a copy of a document containing text.
> The said jpg file is converted to doc file using fine reader.
Presumably that's some sort of OCR program?
> now i like to know the procedure to compare each and
> every word in the jpg and the corresponding doc file to
> correct or prepare a report.
If there was an easy way of "reading the text" in the jpg image file to do
the comparison then you would have used it to extract the text in the first
place, and you would not therefore have needed to use FineReader ;-)
Optical character recognition (OCR) is not a trivial task and there isn't an
application out there that can do it with complete accuracy. I think the
best you can do is to OCR the file twice, or perhaps more than twice, using
two or more different OCR engines, and then compare their outputs using a bit
of intelligent processing. You're still not going to get absolute accuracy
though, and the outcome might even be worse than the FineReader output in
the first place! Nothing you do can give you absolute accuracy, except
perhaps having a human being read the result of the OCR (your doc file)
and the original text side by side. Human beings are much cleverer than
computers!
Mike
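As a rough illustration of the "OCR it twice and compare the outputs" idea, here is a minimal sketch in Python (not VB) that lines up two plain-text OCR results word by word and reports where they disagree. It assumes the text has already been extracted from each engine's output; the function name word_differences and the two sample strings are made up for the example and do not come from FineReader or any other product.

```python
import difflib


def word_differences(text_a, text_b):
    """Return (tag, words_a, words_b) for every span where the texts disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    report = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted or deleted words
            report.append((tag, words_a[a1:a2], words_b[b1:b2]))
    return report


# Hypothetical outputs from two different OCR engines for the same page.
engine_one = "The quick brown fox jumps over the lazy dog"
engine_two = "The quick brovvn fox jumps over the 1azy dog"

for tag, a_words, b_words in word_differences(engine_one, engine_two):
    print(tag, a_words, b_words)
```

A disagreement only tells you where to look. It does not say which engine is right, and words that both engines misread in the same way will not show up at all, so a human check is still needed.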
| |
|
| Hi Mike,
Thanks for your response.
Actually, I would like to know: if a jpg and the corresponding doc
file exist, is it possible to compare the binary code of both
files character by character so that we can identify the words that differ and
prepare a report? If not, can we compare the tif and doc files? If so,
please advise.
sai
mail id : sai_ksvsk@yahoo.co.in
| |
|
| The binary code of a Jpeg (or tiff) contains no words, only information as it
relates to the image (pixel information). It just happens that the image is
of text but it still is nothing more than an image. There is no way to
easily compare a Jpeg and a Word document - this would be a huge undertaking.
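To make the point concrete, here is a small Python sketch (an illustration only, with a hand-drawn 5x5 bitmap rather than a real scan) comparing the bytes of a picture of the letter "T" with the bytes of the text "T". A real jpg is further away still, because its pixel data is compressed as well.

```python
# A 5x5 monochrome bitmap of the letter "T": 1 = dark pixel, 0 = light pixel.
# The glyph is drawn by hand for this example, not taken from a scanned page.
GLYPH_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

# The "image file": one byte per pixel, row after row.
image_bytes = bytes(pixel for row in GLYPH_T for pixel in row)

# The "text file": the character code for the letter T.
text_bytes = "T".encode("ascii")

print(list(image_bytes))          # 25 values, each 0 or 1
print(list(text_bytes))           # a single character code
print(text_bytes in image_bytes)  # is the character code anywhere in the image?
```

The image holds 25 pixel values and the text holds one character code, and the character code does not appear anywhere in the pixel data, so a byte-by-byte comparison of the two has nothing to line up.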
--
Chris Hanscom - Microsoft MVP (VB)
Veign's Resource Center
http://www.veign.com/vrc_main.asp
Veign's Blog
http://www.veign.com/blog
--
"sai" <sai_ksvsk@yahoo.co.in> wrote in message
news:1146063625.540812.61830@u72g2000cwu.googlegroups.com...
> hi Mike
>
> Thanks for your response.
> Actually i would like to know that if a jpg and the corresponding doc
> file exists then is it possible to compare the binary code of both the
> files character wise so that we can identify the dissimilar words to
> prepare a report. If not, can we compare the tif and doc files if so
> please give an advice
>
> sai
> mail id : sai_ksvsk@yahoo.co.in
>
| |
| Mike Williams 2006-04-26, 6:56 pm |
| "sai" <sai_ksvsk@yahoo.co.in> wrote in message
news:1146063625.540812.61830@u72g2000cwu.googlegroups.com...
> Actually i would like to know that if a jpg and the corresponding
> doc file exists then is it possible to compare the binary code of both
> the files character wise so that we can identify the dissimilar words
I think you've missed the point of what I said, Sai. A jpg file is a picture
file. It contains picture data in a compressed format. When you decompress
it you end up with a bitmap which tells you what colour each pixel is in the
image. That is all. There is no text data in it. If you want to write some
code to look at that pixel data in an attempt to recognise the shape of text
characters then you have a *huge* job on your hands. Imagine you take a
snapshot of your family with a digital camera. You, as a person, can look at
that photo and perhaps be able to say, "Yes, that's my Aunt Mabel with her
two grandchildren and the dog they got for their birthday". People can do
that sort of thing. They find it very easy. Computers, on the other hand,
are rubbish at doing that stuff. In order to analyse a picture and make that
sort of sense out of it a computer needs to be fed *massive amounts* of
information and you would need a team of hard working programmers to spend
many months or even years writing a program to "teach" the computer how to
analyse that data. It's much the same with a photo of a page from a book or
a newspaper page or whatever. A person, even a not very intelligent person,
can look at that photo and read the page. In that respect, people are very,
very much cleverer than computers. What you're basically asking for is
information on how to write some VB code to perform what is called "optical
character recognition" by analysing the pixel data contained in a photo and
attempting to find out if any of the "shapes" in that photo represent
characters or words. It is a *massive* undertaking, and teams of
professionals have been trying to get it right for twenty years or more, and
they still haven't succeeded! Some OCR programs these days are quite good of
course, but even the best of them is still far from perfect. So, to
summarise, your question seems to be, "How can I write code to analyse the
binary code of a jpg in order to recognise and sort the shape of any text
characters it might contain". And the answer is that you can do one of three
things:
1. Buy lots of books dealing with OCR and spend the next twenty years,
preferably in a locked room to avoid interruptions, writing and testing your
code until it manages to be able to recognise sufficient characters to
enable it to make a half decent guess as to what might be on the page! . . .
or
2. Buy a commercial OCR program that allows you to interface with VB and
spend a few weeks or perhaps months learning how to "drive" it properly.
3. Download a free OCR program and waste a few weeks discovering that it
really isn't any good.
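To give a feel for why recognising character shapes from pixel data is so hard, here is a deliberately naive Python sketch of the simplest possible approach: compare a small bitmap against stored templates and pick the one with the fewest differing pixels. The three 3x5 templates and the function names are invented for this illustration, and real OCR engines are far more elaborate than this.

```python
# Hand-drawn 3x5 templates ("1" = dark pixel). Hypothetical, for illustration.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def pixel_distance(glyph, template):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(glyph):
    """Return the template letter with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda letter: pixel_distance(glyph, TEMPLATES[letter]))


clean_l = ["100", "100", "100", "100", "111"]
faded_i = ["111", "010", "010", "010", "010"]  # an "I" whose bottom stroke faded

print(recognise(clean_l))
print(recognise(faded_i))
```

The clean "L" is matched correctly, but the "I" that lost two pixels from its bottom stroke is now pixel-for-pixel the same as the "T" template and is reported as "T". Scanned pages add noise, skew, different fonts and sizes, and touching characters on top of that, which is where the years of work go.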
Mike
| |
|
|
"Mike Williams" <Mike@WhiskyAndCoke.com> wrote in message
news:ehiAZaVaGHA.1352@TK2MSFTNGP05.phx.gbl...
> 1. Buy lots of books dealing with OCR and spend the next twenty years,
> preferably in a locked room to avoid interruptions, writing and testing
> your code until it manages to be able to recognise sufficient characters to
> enable it to make a half decent guess as to what might be on the page! . . .
> or
>
> 2. Buy a commercial OCR program that allows you to interface with VB and
> spend a few weeks or perhaps months learning how to "drive" it properly.
>
> 3. Download a free OCR program and waste a few weeks discovering that it
> really isn't any good.
>
> Mike
>
Do option 3.
I can speak from experience as I have done that at least every 3 or 4 months
for the last twenty years. <g>
-ralph