Patent::Retrieve Request for Comments
wanda_b_anon@yahoo.com

2005-02-13, 3:55 am

I have written a new module and propose to submit it to CPAN. Your
comments would be appreciated.

Patent::Retrieve is alpha software and my first module. My intent is
to see if the Perl community has any interest in the idea.

The module provides a consistent way to obtain patent documents from
various patent offices that make them available on the web. Typically,
doing this is relatively easy by hand, but involves screen-scraping if
you want to do it effectively for many pages or documents. The offices
typically make it hard to get the whole document, presumably because
that is one source of revenue.

The module uses submodules, specific to patent offices, and comes with
working examples for the USPTO and EPO, which between them supply
granted patents in html and tiff (USPTO) and pdf (US, EP, and much of
the world...).
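
A minimal sketch of how per-office submodules could be dispatched, assuming each office registers one handler. The name provide_doc_sketch and the handler table are hypothetical and are not the module's actual internals; the defaults are taken from the SYNOPSIS below.

```perl
use strict;
use warnings;

# Hypothetical per-office handlers. Real submodules would fetch over HTTP;
# these only report what they were asked for.
my %office_handler = (
    uspto => sub {
        my (%arg) = @_;
        return "USPTO $arg{format} page $arg{page} of $arg{doc_id}";
    },
    espace_ep => sub {
        my (%arg) = @_;
        return "EPO $arg{format} page $arg{page} of $arg{doc_id}";
    },
);

# Merge caller options over the defaults, then hand off to the office handler.
sub provide_doc_sketch {
    my ($doc_id, %opt) = @_;
    my %arg = (office => 'uspto', format => 'htm', page => 1, %opt, doc_id => $doc_id);
    my $handler = $office_handler{ $arg{office} }
        or die "No submodule registered for office '$arg{office}'\n";
    return $handler->(%arg);
}

print provide_doc_sketch('6123456'), "\n";
print provide_doc_sketch('6123456', office => 'espace_ep', format => 'tif', page => 2), "\n";
```

Supporting another office would then mean adding one more entry to the table, without touching the calling code.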

For casual users, this module should simplify life. Abusive users will
likely find their IP address banned by the patent office being
spidered.

I propose a new name space, "Patent", because I see no related modules
in another name space; I am happy to take suggestions. I think it is
reasonable to have a "Patent" namespace, since patents involve a lot of
text-wrangling that is single purpose. For example, searches of the
prior art, patent family relationships, patent applications via XML,
etc. With a namespace, related modules may be grouped easily.

Here is the documentation as it now stands:


Patent::Retrieve

NAME
Patent::Retrieve - retrieve a patent page (from the United States Patent and
Trademark Office (uspto) website or the European Patent Office (espace_ep))

SYNOPSIS
Please see the test suite for working examples. The following is not
guaranteed to be working or up-to-date.

use Patent::Retrieve;

my $patent_document = Patent::Retrieve->new(); # new object

my $document1 = $patent_document->provide_doc('6,123,456');
# defaults: office  => 'uspto',
#           country => 'US',
#           format  => 'htm',
#           page    => '1',   # typically htm IS "1" page
#           modules => qw/ us ep /,

my $document2 = $patent_document->provide_doc('US_6_123_456',
    office => 'espace_ep',
    format => 'tif',
    page   => 2,
);

my $pages_known = $patent_document->pages_available(   # e.g. TIFF
    document => '6 123 456',
);

DESCRIPTION
Intent: Use public sources to retrieve patent documents such as
TIFF images of patent pages, html of patents, pdf, etc.
Expandable for your office of interest by writing new submodules.
Alpha release by a newbie to find out whether there is any interest.

USAGE
See also SYNOPSIS above

To install the module...

perl Makefile.PL

make

make test

make install

If you are on a Windows box you could try to use 'nmake' rather than
'make'.

Examples of use:

$patent_document = Patent::Retrieve->new(
    doc_id => 'US6,654,321(B2)issued_2_Okada',
    office => 'espace_ep',
    format => 'tif',
    page   => 2,
    agent  => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
);

# 'Windows IE 6'    => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
# 'Windows Mozilla' => 'Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4b) Gecko/20030516 Mozilla Firebird/0.6',
# 'Mac Safari'      => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85',
# 'Mac Mozilla'     => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.4a) Gecko/20030401',
# 'Linux Mozilla'   => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624',
# 'Linux Konqueror' => 'Mozilla/5.0 (compatible; Konqueror/3; Linux)',

my %attributes = $patent_document->get_patent('all'); # hash of all

my $document_id = $patent_document->get_patent('doc_id'); # US6,654,321(B2)issued_2_Okada

my $office_used = $patent_document->get_patent('office'); # ep

my $country_used = $patent_document->get_patent('country'); # US

my $doc_id_used = $patent_document->get_patent('doc_id'); # 6654321

my $page_used = $patent_document->get_patent('page'); # 2

my $kind_used = $patent_document->get_patent('kind'); # B2

my $comment_used = $patent_document->get_patent('comment'); # issued_2_Okada

my $format_used = $patent_document->get_patent('format'); # tif

my $pages_total = $patent_document->get_patent('pages_available'); # 101

my $terms_and_conditions = $patent_document->terms('us'); # and conditions

my $document = $patent_document->get_patent('document'); # the loot
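
The examples above pass document identifiers in several shapes ('6,123,456', 'US_6_123_456', 'US6,654,321(B2)issued_2_Okada'). A minimal sketch of how such an identifier might be split into country, number, kind, and comment follows. The helper parse_doc_id is hypothetical, its accepted grammar is an assumption, and it is not the module's actual parser.

```perl
use strict;
use warnings;

# Hypothetical helper: split a free-form patent identifier into its parts.
sub parse_doc_id {
    my ($raw) = @_;
    my ($country, $digits, $kind, $comment) = $raw =~ m{
        ^ \s*
        ([A-Za-z][A-Za-z])?        # optional two-letter country code
        ([\d,_\s]+)                # digits, possibly separated by , _ or space
        (?: \( ([A-Za-z]\d?) \) )? # optional kind code such as (B2)
        (.*) \z                    # anything left over is a free-form comment
    }x or return;
    $digits =~ tr/0-9//cd;         # drop the separators
    return unless length $digits;
    return (
        country => defined $country ? uc $country : 'US',   # assumed default
        doc_id  => $digits,
        kind    => defined $kind ? uc $kind : '',
        comment => $comment,
    );
}

for my $raw ('US6,654,321(B2)issued_2_Okada', '6 123 456') {
    my %part = parse_doc_id($raw);
    print "$raw => $part{country} $part{doc_id} [$part{kind}] [$part{comment}]\n";
}
```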

BUGS
Pre-alpha release, to gauge whether the perl community has any
interest.

Code contributions, suggestions, and critiques are welcome.

Error handling is undeveloped.

By definition, a non-trivial program contains bugs.

For United States Patents (US) via the USPTO (us), the 'kind' is ignored
in method provide_doc.

SUPPORT
Yes, please. Checks are best. Or email me at Wanda_B_Anon@yahoo.com to
arrange fund transfers.

AUTHOR
Wanda B. Anon
Wanda_B_Anon@yahoo.com

COPYRIGHT
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included
with this module.

ACKNOWLEDGEMENTS
Andy Lester for WWW::Mechanize, which got me thinking, even if cygwin was
trouble,

The authors of Finance::Quote, which served as an example of providing
submodules,

Erik Oliver for patentmailer, serving as an example of getting patent
documents,

Howard P. Katseff of AT&T Laboratories for wsp.pl, version 2, a proxy
that speaks LWP and understands proxies,

and of course Larry and Randal and the gang.

SEE ALSO
perl(1).


_countries_known()
Usage : internal method only
Purpose : list all entities that could give a patent
Returns : ref to a hash with keys of abbreviations and values of
entities (usually a country) ...

John Bokma

2005-02-13, 3:55 am

wanda_b_anon@yahoo.com wrote:

> I propose a new name space, "Patent", because I see no related modules
> in another name space;


The webscraping modules?

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

anon-wb

2005-02-13, 8:56 pm

John Bokma wrote:

> The webscraping modules?


Which namespace do you propose?

WWW::Patent::Retrieve might be reasonable. But the information source
need not be the web, it could also be a file server of cached pages or
documents or your own drive. Often the question is whether the patent
document is already on my drive, or if I have to go out on the web.

Also, WWW seems to take the opposite hierarchy: e.g. WWW::Search::Ebay
implies we should name it WWW::Retrieve::Patent. That would
necessitate reorganizing so that uspto.pm, espace_ep.pm, etc. are in
folder Patent rather than Retrieve, which seems backward in logic.
Also, it would put a patent searching module into WWW::Search::Patent,
which is a long way from WWW::Retrieve::Patent.

But it could work. The new namespace would not be top level.

Any other suggestions?

John Bokma

2005-02-13, 8:56 pm

anon-wb wrote:

> Which namespace do you propose?
>
> WWW::Patent::Retrieve might be reasonable. But the information source
> need not be the web, it could also be a file server of cached pages or
> documents or your own drive.


I can imagine that there is no problem at all for the namespace if the
scraping module does smart caching.

> Often the question is whether the patent
> document is already on my drive, or if I have to go out on the web.


That's just a cache. I can even imagine that other WWW:: modules use a
caching mechanism, or otherwise can profit from one.

I see there is a WWW::Mechanize::Cached or maybe Cache::Cached is better
for your module.
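
A minimal sketch of the "check my drive first, then go out on the web" idea, using a plain directory of files as the cache. The function fetch_with_cache, its file-naming scheme, and the fetch callback are assumptions for illustration; this is not how WWW::Mechanize::Cached or any other caching module works.

```perl
use strict;
use warnings;
use File::Spec;

# Return the cached copy of a page if one exists; otherwise call the
# supplied fetch callback, store its result, and return that.
sub fetch_with_cache {
    my ($cache_dir, $doc_id, $page, $format, $fetch) = @_;
    my $file = File::Spec->catfile($cache_dir, "${doc_id}_p${page}.$format");

    if (-e $file) {
        open my $in, '<:raw', $file or die "Cannot read $file: $!";
        local $/;                       # slurp the whole file
        my $data = <$in>;
        close $in;
        return $data;
    }

    my $data = $fetch->($doc_id, $page, $format);   # the only network access
    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $data;
    close $out or die "Cannot close $file: $!";
    return $data;
}
```

Calling it twice with the same arguments should invoke the callback only once; the second call is served from the file written by the first.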

> Also, WWW seems to take the opposite hierarchy: e.g. WWW::Search::Ebay
> implies we should name it WWW::Retrieve::Patent.


WWW::Search::Patent?

> That would
> necessitate reorganizing so that uspto.pm, espace_ep.pm, etc. are in
> folder Patent rather than Retrieve, which seems backward in logic.
> Also, it would put a patent searching module into WWW::Search::Patent,
> which is a long way from WWW::Retrieve::Patent.


The search modules also retrieve results. There is no point in searching
and not getting results :-D.

> But it could work. The new namespace would not be top level.


That seems more logical to me.


--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

anon-wb

2005-02-15, 3:57 pm

Regarding the naming of a new module to retrieve pages of patent
documents:

The module does not cache. The point about having documents on one's own
drive (not WWW) was that the WWW is not the only source of the
documents; you might scan them yourself, so maybe a file:// URL would
be a source. But that is more the exception than the rule, so I see no
obvious reason to rule out the WWW hierarchy.

I am leaning toward

WWW::Patent::Page

since this module will retrieve "pages" (such as html, tiff, pdf), given a
document identifier, without parsing the page. It leads to

WWW::Patent::Page::uspto.pm
WWW::Patent::Page::espace_ep.pm
etc. for the page sources.

This hierarchy makes sense in light of future possible WWW
interactions:

WWW::Patent::Information (given a patent document ID, retrieve
associated information such as inventors, assignees, earliest filing
date, etc., family, possibly by screen-scraping some html or going to a
database)

WWW::Patent::Search (input topics, receive document identifiers or
related information)

WWW::Patent::Submit (input a patent application, receive
acknowledgement of filing office)

WWW::Patent::Submit::XML (use an XML interface, e.g. at the USPTO)

We are somewhat distracted by focussing on screen scraping. The
scraping only happens here to find out where the document resides, then
the document is retrieved. The scraping results are mostly internal
and not returned to the user, except gems like how many pages are
available for the complete document.
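
A minimal sketch of that two-step flow: scrape an intermediate page only to learn where the image lives and how many pages exist. The sample markup and the function locate_page_image are hypothetical; a real patent office page will look different and needs its own patterns.

```perl
use strict;
use warnings;

# Pull the image link and the total page count out of an intermediate page.
# Either value comes back undef when its pattern does not match.
sub locate_page_image {
    my ($html) = @_;
    my ($url)   = $html =~ m{<a\s+href="([^"]+\.tiff?)"}i;
    my ($total) = $html =~ m{\bpage\s+\d+\s+of\s+(\d+)\b}i;
    return ($url, $total);
}

# Hypothetical markup, for illustration only.
my $sample = '<a href="/images/6123456/2.tif">Full page image</a> Page 2 of 101';
my ($url, $total) = locate_page_image($sample);
print "fetch $url next; $total pages available\n";
```

Only the page count would be handed back to the user; the URL stays internal and feeds the actual retrieval.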

"Patents" work has three main information needs: "searching" for patent
documents of interest, relating those documents to similar documents,
e.g. in different countries or by the same inventor (a family, an
inventor), "retrieving" documents of interest or associated information
(cited documents, inventors, assignees), and "getting" (submitting an
application and being granted) a new patent.

Comments welcome. Any objections to WWW::Patent::Page?

John Bokma

2005-02-16, 3:57 pm

anon-wb wrote:

> I am leaning toward
>
> WWW::Patent::Page
>
> since this module will retrieve "pages", given a document identifier,
> (without parsing the page) such as html, tiff, pdf, and leads to


How about WWW::Patent::Document?

> WWW::Patent::Page::uspto.pm
> WWW::Patent::Page::espace_ep.pm
> etc. for the page sources.


IIRC lower case module names are reserved for pragmas.

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

anon-wb

2005-02-16, 8:57 pm

> How about WWW::Patent::Document?

The module is lower level or more primitive than "Document"; mainly it
retrieves a page at a time, as the offices typically allow. I leave it
to the user to decide how to, if desired, stitch the pages together
into a document. So, someone might take WWW::Patent::Page and use it
for making WWW::Patent::Document. Thus, Page seems more accurate than
Document as the finest level of naming.
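
A minimal sketch of how a user might stitch single pages into a document on top of a page-at-a-time retriever. The function collect_pages, its callback, and the optional delay are assumptions for illustration; the pause is only a courtesy so that a multi-page fetch does not hammer the office's server.

```perl
use strict;
use warnings;

# Fetch pages 1..$total through a callback and return them in order.
# $delay (seconds, optional) is slept between requests, not after the last.
sub collect_pages {
    my ($fetch_page, $total, $delay) = @_;
    my @pages;
    for my $n (1 .. $total) {
        my $page = $fetch_page->($n);
        die "Page $n of $total could not be retrieved\n" unless defined $page;
        push @pages, $page;
        sleep $delay if $delay && $n < $total;
    }
    return \@pages;
}
```

The callback would wrap whatever single-page call the module ends up exposing, and the total would come from the page count it reports.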

> IIRC lower case module names are reserved for pragmas.


WWW::Patent::Page::Uspto.pm ?
WWW::Patent::Page::USPTO.pm ?

Is there a preferred way of naming modules that are worthless without
their parent?
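
Whichever capitalization is chosen, the name maps directly onto a file path, so the parent can load an office submodule by name at run time. A minimal sketch of that standard idiom follows; module_path and load_class are hypothetical helper names, and File::Basename stands in for a real submodule only because it ships with Perl.

```perl
use strict;
use warnings;

# Turn a package name into the relative path that require looks up in @INC.
sub module_path {
    my ($class) = @_;
    (my $path = "$class.pm") =~ s{::}{/}g;
    return $path;
}

# Load a class by name; require dies if the file cannot be found.
sub load_class {
    my ($class) = @_;
    my $path = module_path($class);
    require $path;
    return $class;
}

print module_path('WWW::Patent::Page::USPTO'), "\n";
print load_class('File::Basename'), "\n";
```

Because the path is derived character for character from the name, the case used in the package name must match the file name exactly on case-sensitive file systems.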

Peter Scott

2005-02-16, 8:57 pm

In article <Xns95FF81D68D10Bcastleamber@130.133.1.4>,
John Bokma <postmaster@castleamber.com> writes:
>anon-wb wrote:
>
>IIRC lower case module names are reserved for pragmas.


Only when they're single words. They're okay on the end of modules that
begin with capital letters. See, for example, LWP::Protocol::{http,ftp,...}.

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/

John Bokma

2005-02-16, 8:57 pm

Peter Scott wrote:

> In article <Xns95FF81D68D10Bcastleamber@130.133.1.4>,
> John Bokma <postmaster@castleamber.com> writes:
>
> Only when they're single words. They're okay on the end of modules
> that begin with capital letters. See, for example,
> LWP::Protocol::{http,ftp,...}.


Yeah, actually I saw those two days ago :-D. (But personally I would stick
to HTTP, FTP, etc.)

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

John Bokma

2005-02-16, 8:57 pm

anon-wb wrote:

>
> The module is lower level or more primitive than "Document"; mainly it
> retrieves a page at a time, as the offices typically allow.


Ah, ok, didn't know that. Yeah, in that case Page is more appropriate.

> WWW::Patent::Page::Uspto.pm ?
> WWW::Patent::Page::USPTO.pm ?


The latter

> Is there a preferred way of naming modules that are worthless without
> their parent?


I would capitalize the first letter, and for an acronym I would use all
uppercase, especially if that's the common spelling:

http://www.answers.com/uspto

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
