Friday, July 24, 2020

Bloomfilters debunked: Dispelling 30 Years of math with Coq

There's this rather nifty feature of modern web browsers (such as Firefox or Chrome) where the browser will automatically warn the user if they happen to navigate to a "malicious" url:

While conceptually simple, this feature actually requires more engineering effort than one would expect - in particular, tracking the set of known malicious URLs in a practical manner turns out to be somewhat difficult.

This is because, on one hand, trying to store the database of all known malicious URLs, which, bear in mind, may contain millions and millions of entries, is something that is just practically infeasible for most users.

Conversely, sending every URL that a user visits to some external service, where it could be logged and data-mined by nefarious third parties (i.e Google), is something that most users should probably not be comfortable with.

As it turns out, browsers actually implement this functionality by means of a rather interesting compromise.

Using a probabilistic data structure known as a Bloomfilter, Browsers maintain a approximate representation of the set of known malicious URLs locally. By querying this space-efficient local set, browsers will only send up a small proportion of URLs that have a high likelihood of actually being malicious.

Use a Bloomfilter to act as a local proxy for queries to an external database.

This proxy technique, which safe-guards the privacy of millions of users ever day, depends on two key properties of Bloomfilters:

No false negatives
This states that if a query for an URL in the Bloomfilter returns negative, then the queried item can be guaranteed to not be present in the set of malicious URLs - i.e the Bloomfilter is guaranteed to return a positive result for all known malicious URLs.
Low false positive rate
This property states that for any URL that is not in the set of malicious URLs, the likelihood of a Bloomfilter query returning a positive result should be fairly low - thus minimising the number of unnecessary infractions on user privacy.

This mechanism works quite well, and the guarantees of a Bloomfilter seem to be perfectly suited for this particular task, however it has one small problem, and that is:

The widely cited expression for the false positive rate of a bloomfilter is wrong!

In fact, as it turns out, the behaviours of a Bloomfilter have actually been the subject of 30 years of mathematical contention, requiring multiple corrections and even corrections of these corrections.

Given this history of errors, can we really have any certainty in our understanding of a Bloomfilter at all?

Well, never fear, I am writing this post to inform you that we have just recently used Coq to produce the very first certified proof of the false positive rate of a Bloomfilter, finally putting an end to this saga of errors and returning certainty (pardon the pun) to a mechanism that countless people rely on every single day.

In the rest of this post, we'll explore the main highlights of this research, answering the following questions:

  • What is a Bloomfilter?
  • Why did these errors initially arise?
  • What is the true behaviour of a Bloomfilter? and how can we be certain?

This research was just recently presented at CAV2020 under the title "Certifying Certainty and Uncertainty in Approximate Membership Query Structures", you can find the paper here, and a video presentation of it here.

The code and proofs used in this research project are FOSS, and can be found on the Ceramist repo: https://github.com/certichain/ceramist



from Hacker News https://ift.tt/3hAHvyw

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.