Issue32779
Created on 2018-02-06 04:48 by pfish, last changed 2022-04-11 14:58 by admin.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 5645 | open | python-dev, 2018-02-12 22:01 | |
| Messages (7) | |||
|---|---|---|---|
| msg311704 - (view) | Author: Paul Fisher (pfish) * | Date: 2018-02-06 04:48 | |
urljoining with '?' will not clear a query string:
ACTUAL:
>>> import urllib.parse
>>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
'http://a/b/c?d=e'
EXPECTED:
'http://a/b/c' (optionally, with a ? at the end)
WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string:
https://url.spec.whatwg.org/#relative-state
Seen in versions 3.6 and 2.7, but probably also affects later versions.
|
|||
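For readers who only need the query removed, a minimal workaround sketch follows. It sidesteps `urljoin` entirely and rebuilds the URL from its split components. The helper name `clear_query` is hypothetical and is not part of `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def clear_query(url):
    """Return url with its query string removed (hypothetical helper).

    Scheme, netloc, path and fragment are kept as parsed.
    """
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query=""))


print(clear_query("http://a/b/c?d=e"))
```

This does not implement reference resolution; it only covers the case where the caller already knows the query should be dropped.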
| msg311937 - (view) | Author: Paul Fisher (pfish) * | Date: 2018-02-10 06:05 | |
I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the GitHub process, but hopefully it will be a good start for the discussion. |
|||
| msg312201 - (view) | Author: Andrew Svetlov (asvetlov) * | Date: 2018-02-15 11:04 | |
Python follows the RFC, not WhatWG. https://tools.ietf.org/html/rfc3986#section-5.2.2 is the proper definition of the URL joining algorithm. |
|||
| msg312223 - (view) | Author: Paul Fisher (pfish) * | Date: 2018-02-15 20:28 | |
In this case, the RFC does not match the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I perceived as "the right thing" (and I expect other users would be surprised too).
I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch?
For reference, Java's stdlib does what I would expect here:
URI base = URI.create("https://example.com/?a=b");
URI rel = base.resolve("?");
System.out.println(rel);
https://example.com/?
|
|||
| msg394648 - (view) | Author: Irit Katriel (iritkatriel) * | Date: 2021-05-28 09:52 | |
The relevant part in the RFC pseudo code is
    if defined(R.query) then
        T.query = R.query;
    else
        T.query = Base.query;
    endif;
which is implemented in urljoin as:
    if not query:
        query = bquery
Is this correct? Should the code not say "if query is not None"?
(I can't see in the RFC a definition of defined()).
|
|||
| msg394649 - (view) | Author: Irit Katriel (iritkatriel) * | Date: 2021-05-28 10:00 | |
Sorry, urlparse returns '' rather than None when there is no query.
So we indeed need to check something like
if '?' not in url:
or what's in Paul's patch.
However, my main point was to question whether fixing this is actually in contradiction with the RFC.
|
|||
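The point about `''` versus `None` can be illustrated with a short sketch: by default `urlsplit` reports an empty string both when the query is absent and when it is present but empty, so the `?` delimiter has to be looked for in the raw reference. The helper `has_query_delimiter` is a hypothetical name, and it is not claimed to be what the patch does. It ignores any `?` that appears after `#`, since that belongs to the fragment.

```python
from urllib.parse import urlsplit


def has_query_delimiter(ref):
    """Return True if ref contains a '?' that starts a query (hypothetical helper).

    Everything after the first '#' is the fragment, so a '?' there
    does not introduce a query component.
    """
    before_fragment = ref.split("#", 1)[0]
    return "?" in before_fragment


# Both references parse to an empty query string by default,
# so the parsed result alone cannot tell them apart.
print(repr(urlsplit("").query), repr(urlsplit("?").query))
print(has_query_delimiter(""), has_query_delimiter("?"))
```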
| msg394664 - (view) | Author: Paul Fisher (pfish) * | Date: 2021-05-28 14:36 | |
Reading more into this, from section 5.2.1:
> A component is undefined if its associated delimiter does not appear in the URI reference
So you could say that since there is a '?', the query component is *defined*, but *empty*. This would mean that assigning the target query to be '' has the desired effect, as implemented by browsers and other languages' standard libraries. |
|||
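The "defined but empty" reading can be sketched with the regular expression given in RFC 3986, Appendix B, where an unmatched group naturally yields `None`. This is an illustration under stated assumptions, not the code in the linked pull request: the names `URI_RE`, `split_query` and `resolve_query` are hypothetical, and `resolve_query` covers only the query-selection step for a reference with no scheme, no authority and an empty path.

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
# Group 7 is the query; it is None when no '?' delimiter is present.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_query(ref):
    """Return the query of ref, or None if the query is undefined."""
    return URI_RE.match(ref).group(7)


def resolve_query(base, ref):
    """Pick the target query as in the RFC pseudocode quoted above.

    Assumes ref has no scheme, no authority and an empty path,
    which is the only branch where Base.query can be inherited.
    """
    r_query = split_query(ref)
    if r_query is not None:      # defined(R.query), possibly empty
        return r_query
    return split_query(base)     # otherwise inherit Base.query


print(repr(resolve_query("http://a/b/c?d=e", "?")))
print(repr(resolve_query("http://a/b/c?d=e", "")))
```

Under this reading, a reference of just `?` yields an empty target query, while an empty reference inherits the base query.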
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:57 | admin | set | github: 76960 |
| 2021-05-28 14:36:47 | pfish | set | messages: + msg394664 |
| 2021-05-28 10:00:15 | iritkatriel | set | messages: + msg394649 |
| 2021-05-28 09:52:36 | iritkatriel | set | nosy: + iritkatriel messages: + msg394648 |
| 2018-02-15 20:28:51 | pfish | set | messages: + msg312223 |
| 2018-02-15 11:04:20 | asvetlov | set | nosy: + asvetlov messages: + msg312201 |
| 2018-02-12 22:01:41 | python-dev | set | keywords: + patch stage: patch review pull_requests: + pull_request5446 |
| 2018-02-10 06:05:14 | pfish | set | messages: + msg311937 |
| 2018-02-10 03:38:04 | terry.reedy | set | nosy: + orsenthil |
| 2018-02-06 04:48:49 | pfish | create | |
