Issue523041
Created on 2002-02-26 17:14 by cmalamas, last changed 2022-04-10 16:05 by admin. This issue is now closed.
| Messages (5) | |||
|---|---|---|---|
| msg9427 - (view) | Author: Costas Malamas (cmalamas) | Date: 2002-02-26 17:14 | |
Robotparser uses re to evaluate the Allow/Disallow directives: nowhere in the RFC is it specified that these directives can be regular expressions. As a result, directives such as the following are misinterpreted:
User-Agent: *
Disallow: /.
The directive (which is actually syntactically incorrect according to the RFC) denies access to the root directory, but not the entire site; it should pass robotparser but it fails (e.g. http://www.pbs.org/robots.txt).
From the draft RFC (http://www.robotstxt.org/wc/norobots.html):
"The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html"
Also the final RFC excludes * as valid in the path directive (http://www.robotstxt.org/wc/norobots-rfc.html).
Suggested fix (also fixes bug #522898):
robotparser.RuleLine.applies_to becomes:
def applies_to(self, filename):
    if not self.path:
        self.allowance = 1
    return self.path=="*" or self.path.find(filename) == 0
|
|||
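The difference between treating a Disallow value as a regular expression and treating it as a plain path prefix can be illustrated with a small sketch. It is written for Python 3 and does not reproduce the actual robotparser code; `regex_disallows` and `prefix_disallows` are hypothetical helper names, and the use of `re.match` is only an assumption about how a regex-based check might behave.

```python
import re


def regex_disallows(rule_path, url_path):
    # Assumption: the rule value is used directly as a regex pattern,
    # so "." matches any character instead of a literal dot.
    return re.match(rule_path, url_path) is not None


def prefix_disallows(rule_path, url_path):
    # Draft-RFC reading: a URL is disallowed if it starts with the rule value.
    return url_path.startswith(rule_path)


# "Disallow: /." read as a regex matches every path with one character after "/".
print(regex_disallows("/.", "/index.html"))
# Read as a literal prefix, the same rule does not match "/index.html".
print(prefix_disallows("/.", "/index.html"))
# The draft's own example: "/help" covers "/help.html".
print(prefix_disallows("/help", "/help.html"))
```

Under the prefix reading, a rule such as `/.` only affects paths that literally begin with `/.`, which matches the behaviour the report asks for.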
| msg9428 - (view) | Author: Bastian Kleineidam (calvin) | Date: 2002-02-27 14:11 | |
Logged In: YES
user_id=9205
Patch is not good:
>>> print RuleLine("/tmp", 0).applies_to("/")
1
>>>
This would apply the filename "/" to rule "Disallow: /tmp".
I think it should be:
return self.path=="*" or filename.startswith(self.path)
|
|||
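The objection above comes down to which string is searched inside which. A minimal Python 3 sketch follows, using standalone functions rather than the real `RuleLine` class; `applies_find` and `applies_startswith` are hypothetical names introduced only for this comparison.

```python
def applies_find(rule_path, filename):
    # First proposal: looks for the requested filename at the start of the rule.
    return rule_path == "*" or rule_path.find(filename) == 0


def applies_startswith(rule_path, filename):
    # Suggested correction: checks that the filename starts with the rule.
    return rule_path == "*" or filename.startswith(rule_path)


# The rule "/tmp" wrongly applies to "/" because "/tmp" begins with "/".
print(applies_find("/tmp", "/"))
# With startswith the rule "/tmp" no longer applies to "/".
print(applies_startswith("/tmp", "/"))
# A path below "/tmp" is still covered by the corrected check.
print(applies_startswith("/tmp", "/tmp/cache.html"))
```

The `find` variant also misses longer paths such as `/tmp/cache.html`, since the longer string can never be found inside the shorter rule.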
| msg9429 - (view) | Author: Martin v. Löwis (loewis) | Date: 2002-02-28 15:25 | |
Logged In: YES
user_id=21627
This has been fixed in robotparser.py 1.11.
|
|||
| msg9430 - (view) | Author: Costas Malamas (cmalamas) | Date: 2002-03-06 12:09 | |
Logged In: YES
user_id=71233
calvin is right; the patch was incorrect. A better one
(and more tested by now):
def applies_to(self, filename):
    if not self.path:
        self.allowance = 1
    return self.path=="*" or urllib.quote(filename).startswith(self.path)
|
|||
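The revised patch percent-encodes the requested path before comparing it with the rule. The sketch below shows the effect in Python 3, where the Python 2 `urllib.quote` used in the message is available as `urllib.parse.quote`. The standalone `applies_to` function and the sample paths are illustrative assumptions, not the code that was committed.

```python
from urllib.parse import quote


def applies_to(rule_path, filename):
    # Percent-encode the requested path, then apply the prefix test.
    return rule_path == "*" or quote(filename).startswith(rule_path)


# A rule written in encoded form matches a path given with a raw space.
print(applies_to("/my%20docs", "/my docs/index.html"))
# Slashes are left alone by quote(), so ordinary paths compare unchanged.
print(applies_to("/tmp", "/tmp/cache.html"))
# Caveat: a path that is already encoded gets encoded a second time.
print(quote("/my%20docs"))
```

The last line illustrates that this approach assumes the filename arrives unencoded; an already encoded path would need to be unquoted first.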
| msg9431 - (view) | Author: Martin v. Löwis (loewis) | Date: 2002-03-06 12:18 | |
Logged In: YES
user_id=21627
Can you please review the code which is currently in CVS? I believe it fixes your problem, as well as a number of other problems.
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-10 16:05:02 | admin | set | github: 36164 |
| 2002-02-26 17:14:30 | cmalamas | create | |
