Monday, April 13, 2020

Digging into the Privacy Sandbox


Summary #

The Privacy Sandbox is a series of proposals to satisfy third-party use cases without third-party cookies or other tracking mechanisms.

Thanks to Michael Kleber and Marshall Vale for their help in writing this post.


Why does the web use third-party code? #

Websites use services from other companies to provide analytics, serve video and do lots of other useful stuff. Composability is one of the web's superpowers.

Most notably, ads are included in web pages via third-party JavaScript and iframes. Ad views, clicks and conversions are tracked via third-party cookies and scripts. That's how most of the web is funded.

Relevant ads are less annoying to users and more profitable for publishers (the people running ad-supported websites). Third-party ad targeting tools make ad space more valuable to advertisers (the people who purchase ad space on websites), which in turn increases revenue for ad-supported websites and enables content to be created and published.

Reliable measurement and anti-fraud protection are also crucial. Advertisers and site owners must be able to distinguish between malicious bots and trustworthy humans. If advertisers can't reliably tell which ad clicks are from real humans, they spend less, so site publishers get less revenue. Many third party services currently use techniques such as device fingerprinting to combat fraud.

The problem is… privacy.

The current state of privacy on the web #

When you visit a website you may not be aware of the third parties involved and what they're doing with your data. Even publishers and web developers may not understand the entire third-party supply chain.

Ad targeting, conversion measurement, and other use cases currently rely on establishing stable cross-site identity. Historically this has been done by using third-party cookies, but browsers have begun to restrict access to these cookies. There's been an increase in the use of other mechanisms for cross-site user tracking, such as covert browser storage, device fingerprinting, and requests for personal information like email addresses.
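
To make the problem concrete, here is a minimal sketch of how cookie-based cross-site tracking works today. The tracker.example domain, cookie name, and endpoint are hypothetical; the snippet stands in for a script running inside a third-party iframe embedded on many unrelated sites.

// Minimal sketch of cookie-based cross-site tracking. Assume this code
// runs inside a third-party iframe served from the hypothetical domain
// tracker.example and embedded on many unrelated sites.

function getOrCreateUserId(): string {
  // The cookie belongs to tracker.example, so the same value is visible
  // whenever this iframe is embedded, regardless of the top-level site.
  const match = document.cookie.match(/(?:^|; )uid=([^;]+)/);
  if (match) {
    return match[1];
  }
  const uid = Math.random().toString(36).slice(2);
  document.cookie = `uid=${uid}; max-age=31536000; SameSite=None; Secure`;
  return uid;
}

// document.referrer reveals which page embedded the iframe, so the
// tracker can assemble a cross-site browsing profile keyed by uid.
void fetch('https://tracker.example/log', {
  method: 'POST',
  body: JSON.stringify({uid: getOrCreateUserId(), page: document.referrer}),
});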

This is a dilemma for the web. How can legitimate third-party use cases be supported without enabling users to be tracked across sites?

In particular, how can websites fund content by enabling third parties to show ads and measure ad performance—but not allow individual users to be profiled? How can advertisers verify real users, and site owners check that users are trustworthy, without resorting to dark patterns such as device fingerprinting?

The way things work at the moment can be problematic for everyone concerned, not just users. For publishers and advertisers, tracking identity and using a variety of non-native, un-standardised third-party solutions can add to technical debt, code complexity and data risk. Users, developers, publishers, and advertisers shouldn't have to worry.

Introducing the Privacy Sandbox #

The Privacy Sandbox introduces a set of privacy-preserving APIs to accomplish tasks that use tracking today.

The Privacy Sandbox APIs require web browsers to take on a new role. Rather than working with limited tools and protections, the APIs enable the user's browser to act on the user's behalf to ensure that data is never shared without their knowledge and consent. The APIs enable use cases such as ad targeting and conversion measurement, but without revealing individual private and personal information.

This is a shift in direction for browsers. The Privacy Sandbox authors' vision of the future has browsers providing specific tools to target specific use cases, while preserving user privacy.

The Privacy Sandbox proposals #

In order to successfully transition away from third-party cookies the Privacy Sandbox authors need your support. The proposal explainers need feedback from developers as well as publishers, advertisers, and advertising platforms, to suggest missing use cases and more-private ways to accomplish their goals.

You can comment on the explainers by filing issues against each repository:

  • Privacy Model for the Web
    Establish the range of web activity across which the user's browser can let websites treat a person as having a single identity. Identify the ways in which information can move across identity boundaries without compromising that separation.
  • Privacy Budget
    Limit the total amount of potentially identifiable data that sites can access. Update APIs to reduce the amount of sensitive data revealed. Make access to sensitive data measurable.
  • Trust Token API
    Enable an origin that trusts a user to issue them cryptographic tokens, which are stored by the user's browser so they can be used in other contexts to verify the user's authenticity.
  • Willful IP Blindness
    Enable sites to 'blind' themselves to IP addresses so they can avoid consuming privacy budget.
  • First-Party Sets
    Allow related domain names owned by the same entity (such as apple.com and icloud.com) to declare themselves as the same first party.
  • Aggregated Reporting API
    Provide privacy preserving mechanisms to support a variety of use cases such as view-through-conversion, brand lift, and reach measurement.
  • Event-level conversion measurement
    Provide privacy preserving click-through-conversion measurement.
  • Federated Learning of Cohorts
    Enable the user's browser, not the advertiser, to hold information about what a person is interested in. Target ads based on these interests without cross-site user tracking and while preserving the privacy of the user.
  • TURTLEDOVE
    Enable some form of on-device 'auction' to choose the most relevant ads, including remarketing ads from advertisers the user has previously expressed an interest in.

You can dive into the explainers right away, and over the coming months we'll be publishing posts about each of the API proposals.

Use cases and goals #

Measure conversion #

Goal: Enable advertisers to measure ad performance.

There are two proposals for APIs that would allow the user's browser to gather impression and conversion data and report this back to advertisers in a way that prevents linking of identities across sites or collecting user browsing history:

  • Event-Level Click-Through Conversion Measurement allows advertisers to determine which ad clicks later turned into conversions. (API name suggestions welcome!) A hedged sketch of the idea follows this list.
  • Aggregated Reporting aggregates browsing data for multiple sites and multiple users in a single report, while preserving privacy by only allowing aggregate reporting on things that a lot of different people did.
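
Below is a hedged, conceptual sketch of the event-level idea, written as if the browser itself kept the impression log. The interface names, reporting endpoint, delay, and 3-bit clamp illustrate the shape of the proposal rather than its actual API.

// Conceptual sketch only: how a browser might keep impression data
// locally and emit a minimal, delayed conversion report. Type names,
// the endpoint path, and the constants are illustrative.

interface Impression {
  adId: string;            // which ad was clicked
  destination: string;     // advertiser site where a conversion may happen
  reportingOrigin: string; // who eventually receives the report
}

const impressions: Impression[] = [];

// Called when the user clicks an ad.
function recordImpression(impression: Impression): void {
  impressions.push(impression);
}

// Called when the advertiser's site registers a conversion. The report
// carries only the ad ID plus a few bits of conversion data, and is
// sent after a random delay so timing can't re-identify the user.
function recordConversion(destination: string, conversionData: number): void {
  const impression = impressions.find(i => i.destination === destination);
  if (!impression) {
    return;
  }
  const report = {
    adId: impression.adId,
    conversionData: conversionData & 0b111, // clamp to 3 bits (values 0-7)
  };
  const delayMs = Math.floor(Math.random() * 48 * 60 * 60 * 1000);
  setTimeout(() => {
    void fetch(`${impression.reportingOrigin}/conversion-report`, {
      method: 'POST',
      body: JSON.stringify(report),
    });
  }, delayMs);
}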

Other companies have been investigating similar ideas, such as Facebook's Cross-Browser Anonymous Conversion Reporting, Apple's Ad Click Attribution API and Brave's ad conversion attribution.

Target ads #

Goal: Enable advertisers to display ads relevant to users.

There are many ways to make ads relevant to the user, including the following:

  • First-party-data targeting: Show ads relevant to topics a person has told a website they have an interest in, or content a person has looked at previously on this web site.
  • Contextual targeting: Choose where to display ads based on site content. For example, 'Put this ad next to articles about knitting.'
  • Remarketing: Advertise to people who've already visited your site, while they are not on your site. For example, 'Show this ad for discount wool to people who visited your store and left knitting items in their shopping cart—while they're visiting craft sites.'
  • Interest-based targeting: Select ads based on a user's browsing history. For example, 'Show this ad to users whose browsing behaviour indicates they might be interested in knitting'.

First-party-data and contextual targeting can be achieved without knowing anything about the user other than their activity within a site. These techniques don't require cross-site tracking.

Remarketing is usually done by using cookies or some other way to recognize people across web sites: adding users to lists and then targeting them with specific ads.

TURTLEDOVE moves the final ad "auction" (to choose the most relevant ads) into the browser. The API leverages information stored only in the user's browser about advertisers the user has previously expressed an interest in, along with information about the current page. Two requests are sent for ads: one to retrieve an ad based on contextual data, and one to retrieve an ad based on an advertiser-defined interest. The browser is responsible for ensuring these requests are independent and uncorrelated, so they can't be linked to tell an ad network that they come from the same person. The browser then conducts an "auction" to choose the most relevant ad, using JavaScript code provided by the advertiser. This code can only be used to choose between ads: it cannot make network requests, or access the DOM or external state.
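
A hedged sketch of that flow follows: the ad network origin, request parameters, ad shape, and bidding logic are invented for illustration and are not the proposed API.

// Illustrative sketch of a TURTLEDOVE-style flow. The ad network
// origin, request parameters, and scoring logic are hypothetical.

interface AdCandidate {
  renderUrl: string;
  bid: number;
}

// Two independent requests: one based only on the current page's
// context, one based only on an interest group stored in the browser.
// The browser is responsible for keeping them uncorrelated.
async function fetchContextualAds(pageTopic: string): Promise<AdCandidate[]> {
  const response = await fetch(`https://adnetwork.example/ads?context=${pageTopic}`);
  return response.json();
}

async function fetchInterestGroupAds(interestGroup: string): Promise<AdCandidate[]> {
  const response = await fetch(`https://adnetwork.example/ads?interest=${interestGroup}`);
  return response.json();
}

// The "auction" runs locally: the advertiser-supplied logic may only
// compare candidates; it has no network or DOM access.
function runLocalAuction(candidates: AdCandidate[]): AdCandidate | undefined {
  return candidates.reduce<AdCandidate | undefined>(
    (best, ad) => (best === undefined || ad.bid > best.bid ? ad : best),
    undefined,
  );
}

async function chooseAd(pageTopic: string, interestGroup: string): Promise<AdCandidate | undefined> {
  const [contextual, interestBased] = await Promise.all([
    fetchContextualAds(pageTopic),
    fetchInterestGroupAds(interestGroup),
  ]);
  return runLocalAuction([...contextual, ...interestBased]);
}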

Interest-based targeting currently uses cookies or device fingerprinting to track user behaviour across as many sites as possible. Many people are concerned about the privacy implications of this kind of ad targeting. The Privacy Sandbox includes two alternatives: TURTLEDOVE, described above, and Federated Learning of Cohorts (FLoC).

FLoC generates clusters of similar people, known as cohorts or "flocks". The data is generated locally in the user's browser, not by a third party. The browser shares only the generated flock identifier, which cannot be used to identify or track individual users. This enables companies to target ads based on the aggregate behaviour of people with similar browsing habits, while preserving privacy.
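
As a rough illustration (the real proposal clusters similar users with locality-sensitive hashing such as SimHash; the hash below is just a stand-in), the key property is that only a small, locally computed cohort number ever leaves the browser:

// Illustrative only: derive a coarse cohort ID from browsing history
// entirely on-device. The real FLoC proposal used locality-sensitive
// hashing so that similar histories map to the same cohort; this
// simple FNV-1a hash just shows that nothing finer-grained than a
// small cohort number is shared.

function hash32(input: string): number {
  let h = 2166136261;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

function computeCohortId(visitedDomains: string[], bits = 8): number {
  const combined = visitedDomains.map(hash32).reduce((acc, h) => acc ^ h, 0);
  return combined % (1 << bits); // one of only 256 possible cohorts
}

// Only this small number would ever be exposed to sites or ad networks.
const cohort = computeCohortId(['knitting.example', 'wool-shop.example']);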

Combat fingerprinting #

Goal: Reduce the amount of sensitive data revealed by APIs, and make access to sensitive data controllable by users and measurable.

Browsers have taken steps to deprecate third-party cookies, but techniques to identify and track the behaviour of individual users, known as fingerprinting, have continued to evolve. Fingerprinting uses mechanisms that users aren't aware of and can't control.

The Privacy Budget proposal aims to limit the potential for fingerprinting by identifying how much fingerprint data is exposed by JavaScript APIs or other 'surfaces' (such as HTTP request headers) and setting a limit on how much of this data can be accessed.

Fingerprinting surfaces such as the User-Agent header will be reduced in scope, and the data made available by alternative mechanisms such as Client Hints will be subject to Privacy Budget limits. Other surfaces, such as the device orientation and battery-level APIs, will be updated to keep the information exposed to a minimum.
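
A hedged sketch of how such budget accounting might work follows. The per-surface bit costs and the overall limit are made-up numbers, purely for illustration.

// Illustration of privacy-budget accounting. The entropy costs and the
// budget limit are invented values, not figures from the proposal.

const ENTROPY_COST_BITS: Record<string, number> = {
  'user-agent': 10,
  'screen-resolution': 5,
  'device-memory': 2,
  'accept-language': 7,
};

const BUDGET_LIMIT_BITS = 15;

class PrivacyBudget {
  private readonly accessed = new Set<string>();

  private spentBits(): number {
    let total = 0;
    for (const surface of this.accessed) {
      total += ENTROPY_COST_BITS[surface] ?? 1;
    }
    return total;
  }

  // Returns true if the site may read this surface. Once the identifying
  // information a site could have gathered exceeds the budget, further
  // surfaces are refused (or could be answered with noisy values).
  request(surface: string): boolean {
    if (this.accessed.has(surface)) {
      return true; // already counted once
    }
    const cost = ENTROPY_COST_BITS[surface] ?? 1;
    if (this.spentBits() + cost > BUDGET_LIMIT_BITS) {
      return false;
    }
    this.accessed.add(surface);
    return true;
  }
}

const budget = new PrivacyBudget();
budget.request('user-agent');        // true: 10 bits spent
budget.request('screen-resolution'); // true: 15 bits spent
budget.request('accept-language');   // false: would exceed the 15-bit limit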

Combat spam, fraud and denial-of-service attacks #

Goal: Verify user authenticity without fingerprinting.

Anti-fraud protection is crucial for keeping users safe and for ensuring that advertisers and site owners get accurate ad performance measurements.

Unfortunately, the techniques used to identify legitimate users and block spammers, fraudsters, and bots work in ways similar to fingerprinting techniques that damage privacy.

The Trust Tokens API proposes an alternative approach, allowing authenticity of a user in one context, such as gmail.com, to be conveyed to another context, such as an ad running on nytimes.com—without identifying the user or linking the two identities.
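
The flow can be sketched conceptually as follows. This is not the proposed browser integration; the issuer and redeemer origins, endpoints, and token shape are placeholders, and the real design uses blind signatures so that issued and redeemed tokens can't be linked.

// Conceptual sketch of the Trust Token flow, with placeholder origins
// and endpoints. Real tokens are blind-signed, so the issuer can't
// link the token it signs to the token that is later redeemed.

interface TrustToken {
  value: string; // stands in for a blind-signed token
}

// 1. An origin that trusts the user (for example, after a normal
//    sign-in) issues a small batch of tokens that the browser stores.
async function issueTokens(count: number): Promise<TrustToken[]> {
  const response = await fetch('https://issuer.example/issue-tokens', {
    method: 'POST',
    body: JSON.stringify({count}),
  });
  return response.json();
}

// 2. Later, in an unrelated context (such as an ad iframe on a news
//    site), the browser spends one token. The redeemer learns only
//    "this user was trusted by issuer.example", not who the user is.
async function redeemToken(token: TrustToken): Promise<boolean> {
  const response = await fetch('https://redeemer.example/redeem-token', {
    method: 'POST',
    body: JSON.stringify(token),
  });
  return response.ok;
}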

IP address security #

Goal: Control access to IP addresses to reduce covert fingerprinting, and allow sites to opt out of seeing IP addresses in order to not consume privacy budget.

Your IP address is the public 'address' of your computer on the internet, which in most cases is dynamically assigned by the network through which you connect to the internet. However, even dynamic IP addresses may remain stable over a significant period of time.

Not surprisingly, this means that IP addresses are a significant source of fingerprint data. The Willful IP Blindness proposal is an attempt to provide a privacy-preserving alternative.

Third parties that aren't third parties #

Goal: Enable entities to declare that related domain names are owned by the same first party: apple.com and icloud.com, for example.

Many organizations own sites across multiple domains. For example, google.com, google.co.uk, and youtube.com are owned by the same entity, as are apple.com and icloud.com, or amazon.com.au and amazon.de. This can become a problem if restrictions are imposed on tracking user identity across sites that are seen as 'third-party' but actually belong to the same organization.

First Party Sets aims to make the web's concept of first and third parties more closely aligned with the real world's by enabling multiple domains to declare themselves as belonging to the same first party.
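
To illustrate the idea: the proposal defines a declaration that member domains serve, but the shape below is a guess written as a TypeScript object for illustration, not the actual format.

// Hypothetical shape of a first-party-set declaration. The field names
// and structure are illustrative; consult the explainer for the actual
// format.

interface FirstPartySet {
  owner: string;     // the domain that owns the set
  members: string[]; // other domains claiming the same first party
}

const exampleSet: FirstPartySet = {
  owner: 'https://apple.com',
  members: ['https://icloud.com'],
};

// A browser could treat two origins as the same party if both appear
// in the same mutually acknowledged set.
function sameParty(a: string, b: string, set: FirstPartySet): boolean {
  const all = [set.owner, ...set.members];
  return all.includes(a) && all.includes(b);
}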

Next steps #

To reiterate: the Privacy Sandbox authors need your support. The explainers need feedback—in particular to suggest missing use cases and more-private ways to accomplish their goals.


Appendix: Glossary of terms used in the explainers #

Click-through rate (CTR) #

The proportion of users who click on an ad after seeing it. (See also impression.)

Click-through-conversion (CTC) #

A conversion attributed to an ad that was 'clicked'.

Conversion #

The completion of an action on an advertiser's website by a user who has previously interacted with an ad from that advertiser. For example, purchase of a product or sign-up for a newsletter after clicking an ad that links to the advertiser's site.

Differential privacy #

Sharing information about a dataset to reveal patterns of behaviour, without revealing private information about individuals or whether they belong to the dataset.

Domain #

See Top-Level Domain and eTLD.

eTLD, eTLD+1 #

'Effective' top level domains are defined by the Public Suffix List. For example:

co.uk
appspot.com
glitch.me

Effective TLDs are what enable foo.appspot.com to be a different site from bar.appspot.com. The effective top-level domain (eTLD) in this case is appspot.com, and the whole site name (foo.appspot.com, bar.appspot.com) is known as the eTLD+1.

See also Top-Level Domain.
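
As a rough sketch of how a browser might compute the eTLD+1, using a tiny, hard-coded stand-in for the real Public Suffix List:

// Toy illustration of eTLD+1 computation. PUBLIC_SUFFIXES is a tiny
// stand-in for the real Public Suffix List.

const PUBLIC_SUFFIXES = new Set(['com', 'co.uk', 'appspot.com', 'glitch.me']);

// Returns the effective TLD plus one extra label (the eTLD+1).
function etldPlusOne(hostname: string): string {
  const labels = hostname.split('.');
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (PUBLIC_SUFFIXES.has(suffix)) {
      return labels.slice(i - 1).join('.');
    }
  }
  return hostname;
}

etldPlusOne('foo.appspot.com');   // 'foo.appspot.com'
etldPlusOne('maps.google.com');   // 'google.com'
etldPlusOne('www.example.co.uk'); // 'example.co.uk'

// Two hostnames belong to the same site if their eTLD+1 values match.
const sameSite = (a: string, b: string) => etldPlusOne(a) === etldPlusOne(b);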

Entropy #

A measure of how much an item of data reveals individual identity.

Data entropy is measured in bits. The more that data reveals identity, the higher its entropy value.

The total number of humans on the planet is around eight billion, which is almost equal to two to the power of 33. This means you need around 33 bits worth of entropy to identify an individual.

Each bit of entropy halves the number of potential individuals a piece of data could refer to. For example, binary gender data provides around 1 bit of entropy. Assuming that birthdays are evenly distributed throughout the year, revealing your birthday (such as 1 January) provides around 8.5 bits of entropy (since 2 to the power of 8.5 is approximately 365). A postal code might be worth somewhere between 10 and 25 bits. This means that for postal areas with a small population, knowing a person's birthday and postal code gives you 8.5 + 25 (= 33.5) bits of data, which is likely to be enough to identify an individual.

Data can be combined to identify an individual, but it can be difficult to work out whether a new piece of data adds entropy. For example, knowing a person is from Australia adds no identifying information if you already know the person is from Kangaroo Island.
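
The arithmetic above, worked through in code. The postal-code figure is an arbitrary example within the 10 to 25 bit range mentioned.

// Worked example of the entropy arithmetic. All figures are rough.

const bitsToIdentifyAnyone = Math.log2(8e9); // ~33 bits for ~8 billion people
const birthdayBits = Math.log2(365);         // ~8.5 bits
const genderBits = Math.log2(2);             // 1 bit
const postcodeBits = Math.log2(30000);       // ~14.9 bits (varies by country)

// Independent data points add their bits together; the closer the sum
// gets to ~33 bits, the closer the data comes to pinpointing one person.
const combined = birthdayBits + genderBits + postcodeBits; // ~24.4 bits
console.log({bitsToIdentifyAnyone, combined});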

Fingerprinting #

Techniques to identify and track the behaviour of individual users. Fingerprinting uses mechanisms that users aren't aware of and can't control. Sites such as Panopticlick and amiunique.org show how fingerprint data can be combined to identify you as an individual.

Fingerprinting surface #

Something that can be used (probably in combination with other surfaces) to identify a particular user or device. For example, the navigator.userAgent JavaScript property and the User-Agent HTTP request header provide access to a fingerprinting surface (the user agent string).

First-party #

Resources from the site you're visiting. For example, the page you're reading is on the site web.dev and includes resources from that site. See also Third-party.

Impression #

View of an ad. (See also click-through rate.)

k-anonymity #

A measure of anonymity within a data set. If you have k-anonymity, you can't be distinguished from at least k-1 other individuals in the data set. In other words, at least k individuals (including you) share the same information.
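
A minimal sketch of checking k-anonymity for a released data set; the attribute names are arbitrary.

// Minimal k-anonymity check: every combination of released attributes
// must be shared by at least k records.

interface Row {
  birthYear: number;
  postcode: string;
}

function isKAnonymous(rows: Row[], k: number): boolean {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = `${row.birthYear}|${row.postcode}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts.values()).every(count => count >= k);
}

isKAnonymous(
  [
    {birthYear: 1980, postcode: 'SW1'},
    {birthYear: 1980, postcode: 'SW1'},
    {birthYear: 1991, postcode: 'EC2'}, // unique combination
  ],
  2,
); // false: the 1991/EC2 record can be singled out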

Nonce #

Arbitrary number used once only in cryptographic communication.

Origin #

The origin of a request: the scheme and hostname (and port, if any), with no path information. For example: https://web.dev.

Passive surface #

Some fingerprinting surfaces, such as user agent strings, IP addresses and accept-language headers, are available to every website whether the site asks for them or not. That means passive surfaces can easily consume a site's privacy budget.

The Privacy Sandbox initiative proposes replacing passive surfaces with active ways to request specific information, for example using Client Hints a single time to get the user's language rather than sending an Accept-Language header with every request to every server.

Publisher #

The Privacy Sandbox explainers are mostly about ads, so the kinds of publishers referred to are ones that put ads on their web sites.

Reach #

The total number of people who see an ad.

Remarketing #

Advertising to people who've already visited your site. For example, an online store could show ads for a toy sale to people who previously viewed toys on their site.

Site #

See Top-Level Domain and eTLD.

Surface #

See Fingerprinting surface and Passive surface.

Third-party #

Resources served from a domain that's different from the website you're visiting. For example, a website foo.com might use analytics code from google-analytics.com (via JavaScript), fonts from use.typekit.net (via a link element) and a video from vimeo.com (in an iframe). See also First-party.

Top-level domain (TLD) #

Top-level domains such as .com and .org are listed in the Root Zone Database.

Note that some 'sites' are actually just subdomains. For example, translate.google.com and maps.google.com are just subdomains of google.com (which is the eTLD + 1).

.well-known #

It can be useful to access policy or other information about a host before making a request. For example, robots.txt tells web crawlers which pages to visit and which pages to ignore. IETF RFC8615 outlines a standardised way to make site-wide metadata accessible in standard locations in a /.well-known/ subdirectory. You can see a list of these at iana.org/assignments/well-known-uris/well-known-uris.xhtml.




