read all available pages on a Website

Brad Tilley bradtilley at usa.net
Mon Sep 13 09:49:02 EDT 2004
Alex Martelli wrote:
> Leif K-Brooks <eurleif at ecritters.biz> wrote:
> 
> 
>>Tim Roberts wrote:
>>
>>>Brad Tilley <bradtilley at usa.net> wrote:
>>>
>>>
>>>>Is there a way to make urllib or urllib2 read all of the pages on a Web
>>>>site?
>>>
>>>By the way, there are many web sites for which this sort of behavior is not
>>>welcome.
>>
>>Any site that didn't want to be crawled would most likely use a 
>>robots.txt file, so you could check that before doing the crawl.
> 
> 
> Python's Tools/webchecker/ directory has just the code you need for all
> of this.  The directory is part of the Python source distribution, but
> it's all pure Python code, so, if your distribution is binary and omits
> that directory, just download the Python source distribution, unpack it,
> and there you are.
> 
> 
> Alex

Thank you, this is ideal.
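
For anyone finding this thread later, here is a minimal sketch of the robots.txt check suggested above. It is not webchecker's code. It assumes a current Python 3, where the parser lives in urllib.robotparser (in the Python 2 of this thread it was the top-level robotparser module). The agent name "MyCrawler", the example.com URLs, and the sample rules are made up for illustration; a real crawl would fetch the site's own robots.txt with set_url() and read() rather than parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

def may_fetch(rp, url, agent="MyCrawler"):
    """True if the parsed rules allow `agent` to fetch `url`."""
    return rp.can_fetch(agent, url)

rp = build_parser(SAMPLE_ROBOTS)
for url in ("http://example.com/index.html",
            "http://example.com/private/data.html"):
    print(url, may_fetch(rp, url))
```

A crawler would call may_fetch() on each URL before requesting it and skip anything that comes back False.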


