Home > Archive > PERL Miscellaneous > November 2005 > Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word doc
Re: Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word doc
| Purl Gurl 2005-11-27, 6:58 pm |
| Barry Millman wrote:
(snipped)
>The code that is not finding the text on the Win XP machine (same as
> the Win 2000 machine which does find the text) is:
(snipped)
Move this line above and outside your while loop:
> $dir="C:\\IGINproducts\\UserDocuments\\";
The reason for moving that line above and outside your while loop
is you are creating a new value for that variable with each loop
iteration. That is inefficient because that variable has a "fixed"
value; set the value above and outside your while loop.
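A minimal sketch of that change. The @names list and the loop body are
made up for illustration, since the original loop is snipped; only the
$dir value comes from the posted code.

```perl
use strict;
use warnings;

# Set the fixed directory once, before the loop.
my $dir = "C:\\IGINproducts\\UserDocuments\\";

# @names is a made-up list standing in for whatever the real loop reads.
my @names = ("mydoc.doc", "other.doc");
my @paths;
while (defined(my $name = shift @names)) {
    push @paths, $dir . $name;   # only the file name changes per iteration
}
print "$_\n" for @paths;
```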
You do not need double backslashes for your file path; using
them causes no harm, but single forward slashes also work
for your path in an open(FILE) call, as shown below.
However, despite claims of one of the "experts" in this group,
you must use double backslashes for some syntax,
certainly for some system command syntax on Win32.
For a file open, you do not need double slashes but it
is perfectly ok to use them.
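A small sketch comparing the two spellings of the same path. The
variable names are mine, chosen for illustration.

```perl
use strict;
use warnings;

# In a double-quoted string, "\\" is one backslash character.
my $escaped = "C:\\IGINproducts\\UserDocuments\\mydoc.doc";
# Forward slashes need no escaping.
my $forward = "C:/IGINproducts/UserDocuments/mydoc.doc";

print "$escaped\n";
print "$forward\n";

# Converting one form to the other shows both spell the same path.
(my $converted = $forward) =~ tr{/}{\\};
print "same path\n" if $converted eq $escaped;
```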
Uppercase letters in a file path are not needed for Win32
but are ok to use; no problem.
Your code produces this directory / file name path:
C:\IGINproducts\UserDocuments\mydoc.doc
That "appears" to be a valid path. Check to be sure it is valid.
Double check to be sure there are not spaces in a directory
name, such as, User Documents which is typical.
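One way to check the path before opening anything is Perl's file test
operators. This is only a sketch; describe_path is a helper name I
made up.

```perl
use strict;
use warnings;

# Report whether a path exists and what kind of thing it is.
sub describe_path {
    my ($path) = @_;
    return "missing"   unless -e $path;
    return "directory" if -d $path;
    return "file";
}

# The path from the post; substitute your own.
my $file = "C:/IGINproducts/UserDocuments/mydoc.doc";
print "$file: ", describe_path($file), "\n";
```

Testing each directory level in turn will show exactly where a
misspelled or space-containing name breaks the path.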
You do not show your syntax for your OUTFILE open for write.
Be sure to use error checking to verify that file opens for write.
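Since your OUTFILE open is not shown, here is a sketch of an
error-checked open for write. The temporary directory and the
results.txt name are assumptions so the example runs anywhere; use
your real output path.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A throwaway directory so the sketch is self-contained.
my $outdir  = tempdir(CLEANUP => 1);
my $outfile = "$outdir/results.txt";   # hypothetical file name

# Check open, print and close; $! holds the system error message.
open(my $out, '>', $outfile) or die "Cannot open $outfile for write: $!";
print {$out} "HYPERLINK search results\n" or die "Write failed: $!";
close($out) or die "Cannot close $outfile: $!";

my $size = -s $outfile;
print "wrote $size bytes\n";
```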
Run this test code,
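#!perl
use strict;
use warnings;
# Three-argument open with a lexical filehandle; die reports why it failed.
open (my $test, '<', "c:/iginproducts/userdocuments/mydoc.doc") or die "File Open Failed: $!";
while (<$test>)
{
# index returns -1 when the string is not in the line
if (index ($_, "HYPERLINK") > -1)
{ print "HYPERLINK found at line $.\n"; }
}
close ($test) or die "File Close Failed: $!";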
Clearly I cannot test that code without having your file.
However, my syntax is ok,
C:\APACHE\USERS\TEST>perl -c test.pl
test.pl syntax OK
Running that test code will determine if your file path and file name
are valid, and will determine if HYPERLINK is actually in your file.
Be cautious. If your HYPERLINK word spans lines, index will not
find that specific instance.
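If you suspect the word spans lines, one possible workaround is to
read the whole file at once and remove the line breaks before calling
index. A sketch, using a small sample file built on the spot in place
of your document:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a sample file in which the word is split across two lines.
my ($fh, $sample) = tempfile(UNLINK => 1);
print {$fh} "see HYPER\nLINK here\n";
close($fh) or die "File Close Failed: $!";

# Read the whole file at once instead of line by line.
open(my $in, '<', $sample) or die "File Open Failed: $!";
my $text = do { local $/; <$in> };
close($in) or die "File Close Failed: $!";

# Remove line breaks so a word split by one becomes contiguous again.
(my $joined = $text) =~ s/\r?\n//g;

my $pos = index($joined, "HYPERLINK");
my $msg = $pos > -1 ? "HYPERLINK found at offset $pos" : "HYPERLINK not found";
print "$msg\n";
```

Note that the line number in $. is lost with this approach; only a
character offset is available.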
Often, reducing your code to the simplest version possible will find
errors for you, quickly.
Purl Gurl
| Purl Gurl 2005-11-27, 6:58 pm |
| Barry Millman wrote:
> Purl Gurl wrote:
(snipped)
>I tried your suggestions, but no luck.
Then you have verified the word HYPERLINK does not exist, as plain text,
in what Perl reads from your file. Your regex will never match a word
that is not there.
I would instantly question why that word exists in your Win2K file and
does not exist in your WinXP file. You did indicate both files are the same,
or at least I think you did. I have not gone back to read again.
> There is something really odd in MS Word storage in Win XP. If I save
> the document to RTF it finds the stuff in the RTF file.
> I looked at both the MS Word and RTF files with the XVI32 Hex editor.
> They both showed the same hex values for the string HYPERLINK.
I never work with MS Word nor RTF in a programming environment. I do
use those for writing business letters, however!
Documents of that type do contain binary data. This presents myriad
problems for Perl based programs.
An example problem is that some binary data creates a false end of file
signal when read in text mode (on Win32, a Control Z byte, hex 1A),
resulting in reading terminating early.
There are myriad other problems created by reading in an ASCII mode
and encountering binary data; no telling what will happen.
The obvious problem is you will never find HYPERLINK if that word is not
in the data Perl reads as plain text. It is possible that word is in binary
format or in partial binary format. Again, use of Perl's index function
has verified that word is not there as plain text in what was read. So,
there is no error in your search code.
You have some choices.
You can open your file in binary mode, using binmode, and search the
raw bytes for your string. Perl reads binary data without trouble once
binmode is set, but the search can still fail if Word stores the text
in a form other than plain single byte characters.
An alternative is to print your Word / RTF file in "plaintext" to a test file,
then open and search that file as you are currently trying to do. My use
of "plaintext" means in pure ASCII format, such as these articles we
post and read.
A rather wild alternative, and I use this method at times, is to write a
simple Visual Basic macro which runs your MS Word processor, sends
a command to search and find instances of your HYPERLINK, then returns
data to your Perl program. For some cases, you can use VB commands,
Control C
OR
Control Insert
to move data to your clipboard, then use the Win32 clipboard module to
capture that data to move it into your Perl program.
My choice would be to print your binary document to a file, in plaintext, then
use that file for my Perl program.
Whatever, index has verified your search word does not exist. Now you
know what is causing your problem.
Purl Gurl
| Purl Gurl 2005-11-27, 6:58 pm |
| Purl Gurl wrote:
> Whatever, index has verified your search word does not exist. Now you
> know what is causing your problem.
I have looked over Word Perfect and MS Word, but not RTF formats, on a
Win 9x machine, a 2K machine and an XP machine.
There are some variations, but very minor. All present an ability to
save a file in a plaintext (.txt) format. Word Perfect on my XP contains
MS Word software. There is a feature which will convert various file
formats to other formats by using "all files" or specific file types.
However, I tried converting a desktop.ini file, which I know to contain
binary data, and this caused Word Perfect to hang. Eventually I had
to kill the process, with some difficulties; the "hang" remained in
RAM and I could not run Word Perfect again until a reboot. I would
suggest you not try to move outside limits of Perfect / Word software.
Bottom line is if you need to convert a MS Word document to a
plaintext format, this is very easy.
A follow-up suggestion is to use my same test script but index
for instances of,
"http:"
If that search succeeds, you could pull http hyperlinks without
searching for HYPERLINK as you were doing. Upon success,
which is dubious, you could try a regex to match hyperlink
URL formats, sans your word HYPERLINK in your regex.
if ($_ =~ m{(http://[your regex set]+) })
{ print "$1\n"; }  # ( ) captures the URL into $1 for printing
I would expect a space to follow a hyperlink, thus my space at the end.
Perhaps Word does not binary encode hyperlinks. No guarantee.
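A sketch of such a regex, run against an invented sample line. The
character class (anything up to whitespace or a double quote) is my
guess at a workable "regex set", not a complete URL grammar.

```perl
use strict;
use warnings;

# Sample line; the URLs are invented for illustration.
my $line = 'HYPERLINK "http://www.example.com/a.html" and http://example.org/b end';

# Capture each run of characters from http:// up to whitespace or a quote.
my @urls = $line =~ m{(http://[^\s"]+)}g;
print "$_\n" for @urls;
```

Excluding the quote character matters if the URL sits inside quotes,
since a trailing space alone would not end the match there.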
A hex editor will display any plaintext found in a binary file. I use
Hex Workshop v. 2.2x for this. Very old program but it works very
well. You could simply open your Word document with a
hex editor, then search for http: from there.
Purl Gurl