This shouldn't be a problem: there's no rule that says that different objects must have different hashes. Indeed, with a countable infinity of possible hashable inputs, a deterministic hashing algorithm, and only finitely many outputs, such a rule would be a mathematical impossibility: by the pigeonhole principle, some inputs must share a hash. For example:
>>> hash(-1) == hash(-2)
True
Are these hash collisions causing real issues in your code? A single hash collision like this shouldn't be an issue, but many collisions within a single (non-artificial) dataset _can_ cause performance problems: dict and set lookups degrade from constant time toward linear time as more entries pile into the same bucket.
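As a rough illustration (BadHash is a made-up class for this sketch, and the exact timings will vary by machine), forcing every object into a single hash bucket turns set construction from roughly linear into quadratic work:

import timeit

class BadHash:
    # Pathological case: every instance hashes to the same bucket.
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

# Deduplicating n all-colliding objects costs O(n**2) equality checks;
# with well-distributed hashes it's O(n).
print(timeit.timeit(lambda: set(BadHash(i) for i in range(2000)), number=1))
print(timeit.timeit(lambda: set(range(2000)), number=1))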
Looking at the code, we could probably do a better job of making the hash collisions less predictable. The current code looks like:
def __hash__(self):
    return hash(int(self.network_address) ^ int(self.netmask))
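Because a network address's set bits always lie inside its netmask, the XOR just clears them, so whole families of distinct networks collapse to the same hash. For instance, with the current scheme:
>>> from ipaddress import ip_network
>>> hash(ip_network('0.0.0.0/0')) == hash(ip_network('128.0.0.0/1'))
True
>>> hash(ip_network('0.0.0.0/1')) == hash(ip_network('64.0.0.0/2'))
True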
I'd propose hashing a tuple instead of using the xor. For example:
def __hash__(self):
    return hash((int(self.network_address), int(self.netmask)))
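A quick way to compare the two schemes without patching the stdlib is a throwaway subclass (PatchedNetwork is just an illustrative name):
>>> from ipaddress import IPv4Network
>>> class PatchedNetwork(IPv4Network):
...     def __hash__(self):
...         return hash((int(self.network_address), int(self.netmask)))
...
>>> hash(PatchedNetwork('0.0.0.0/0')) == hash(PatchedNetwork('128.0.0.0/1'))
False
Since the tuple hash mixes both values rather than cancelling shared bits, the colliding pairs shown above now hash differently.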
Hash collisions would almost certainly still occur with this scheme, but they'd be a tiny bit less obvious and harder to find.