Splitting URLs
Tim Chase
python.list at tim.thechases.com
Sun Oct 21 15:55:01 EDT 2007
> URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
>       'ectory/file.html;params?query#fragment'
>
> If I split the URL, I would like to get the following components:
>
> scheme = 'http'
> netloc = 'steve:secret@www.domain.com.au:82'
> username = 'steve'
> password = 'secret'
> hostname = 'www.domain.com.au'
> port = 82
> path = '/directory/file.html'
> parameters = 'params'
> query = 'query'
> fragment = 'fragment'
>
> I can get *most* of the way with urlparse.urlparse: it will split the URL
> into a tuple:
>
> ('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html',
> 'params', 'query', 'fragment')
>
> If I'm using Python 2.5, I can split the netloc field further with named
> attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
> to support 2.4). Before I write code to split the netloc field by hand (a
> nuisance, but doable) I thought I'd ask if there was a function somewhere
> in the standard library I had missed.

There are some goodies in urllib for doing some of this splitting.
Example code is at the bottom of my reply (though it seems to choke on
certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
return the netloc properly for them).

> This second question isn't specifically Python related, but I'm asking it
> anyway...
>
> I'd also like to split the domain part of a HTTP netloc into top level
> domain (.au), second level (.com), etc. I don't need to validate the TLD,
> I just need to split it. Is splitting on dots sufficient, or will that
> miss some odd corner case of the HTTP specification?

I believe that dots are the sanctioned separator. HOWEVER, you can have
a non-qualified machine name with local scope, so you can easily have
NO TLD, such as

  http://user:password@localhost:8000/path/to/thing

There's also the ambiguity of what "TLD" means if you use IP addresses:

  http://user:password@192.168.1.1:8000/path/to/thing

Does that make the TLD "1"?
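
To make that ambiguity concrete, here is a minimal sketch of what naive dot-splitting gives for the three kinds of host mentioned above. It is written for Python 3 (which postdates this thread), and `split_labels` is a hypothetical helper name, not something from the standard library.

```python
def split_labels(host):
    """Split a host on dots and return the labels, rightmost label first.

    This does no validation at all: it is only meant to show what a
    plain split produces, not to decide what a "TLD" really is.
    """
    return host.split('.')[::-1]


for host in ('www.domain.com.au', 'localhost', '192.168.1.1'):
    print(host, '->', split_labels(host))

# www.domain.com.au -> ['au', 'com', 'domain', 'www']
# localhost -> ['localhost']
# 192.168.1.1 -> ['1', '1', '168', '192']
```

So the split itself always works, but for "localhost" the only label is the machine name, and for the dotted quad the "top level" label is just the last octet. Any caller has to decide separately whether the first label means anything.
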
Other odd edge-cases that are usually allowable (but frowned upon, and
mostly used by spammers/phishers) include using a long-int as the
domain-name, such as

  http://user:password@2130706433:8000/path/to/thing

In an attempt to play with these functions, I present the code below.

-tkc

import urlparse, urllib

tests = (
    'http://steve:secret@www.example.com.au:82/'
        'directory/file.html;params?query#fragment',
    'http://user:password@192.168.1.2/path/to/thing/',
    'http://192.168.1.2/path/to/thing/',
    'http://2130706433/path/to/thing/',
    'http://localhost/path/to/thing/',
    'http://user:password@localhost/path/to/thing/',
    'telnet://foo@bar.com',
    'ssh://user@example.com',
    'gopher://wais.example.edu',
    'svn+ssh://user:password@svn.example.com/svn/here/there/',
    'mailto:joe@example.com',
    )

def is_ip_address(s):
    # True only for a dotted quad whose four parts are all in 0..255
    parts = s.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        try:
            # check the part itself, not its index
            if not 0 <= int(part) <= 255:
                return False
        except ValueError:
            return False
    return True

def steve_parse(url):
    (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
    # netloc is "user:password@host:port"; peel it apart piece by piece
    creds, host = urllib.splituser(netloc)
    username, password = urllib.splitpasswd(creds or '')
    host, port = urllib.splitport(host)
    # only treat the last label as a TLD for dotted names that aren't IPs
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain = host
        tld = ''
    return (
        scheme, username, password, domain, tld,
        port, path, params, query, fragment)

if __name__ == '__main__':
    for test in tests:
        print test
        (scheme, username, password, domain, tld,
         port, path, params, query, fragment) = steve_parse(test)
        print '\tScheme:   ', scheme
        print '\tUsername: ', username
        print '\tPassword: ', password
        print '\tDomain:   ', domain
        print '\tTLD:      ', tld
        print '\tPort:     ', port
        print '\tPath:     ', path
        print '\tParams:   ', params
        print '\tQuery:    ', query
        print '\tFragment: ', fragment
        print '='*50
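
For anyone reading this later who is not stuck on 2.4: below is a rough sketch of the same idea for Python 3, which is an assumption on my part since the thread is about Python 2.4/2.5. There, `urllib.parse.urlparse` returns a result with `username`, `password`, `hostname` and `port` attributes, and the `ipaddress` module can do the IP check. The names `steve_parse3` and `long_int_to_dotted` are hypothetical, chosen just for this illustration.

```python
import ipaddress
from urllib.parse import urlparse


def is_ip_address(host):
    """Return True if host is a literal IPv4/IPv6 address string."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def steve_parse3(url):
    """Split url into the same ten fields as steve_parse above."""
    parts = urlparse(url)
    # hostname is None when there is no netloc (e.g. "mailto:" URLs)
    host = parts.hostname or ''
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain, tld = host, ''
    return (parts.scheme, parts.username, parts.password, domain, tld,
            parts.port, parts.path, parts.params, parts.query,
            parts.fragment)


def long_int_to_dotted(n):
    """Show which dotted quad a long-int host such as 2130706433 denotes."""
    return str(ipaddress.ip_address(n))


if __name__ == '__main__':
    print(steve_parse3('http://steve:secret@www.example.com.au:82/'
                       'directory/file.html;params?query#fragment'))
    print(long_int_to_dotted(2130706433))
```

Note that this version reports the port as an integer (or None when absent) and lowercases the hostname, so its output is not field-for-field identical to the Python 2 code above.
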