Understanding the limitations of data pollution tools
02 May 2017
Jeremy Gillula and Yomna Nasser write, on the EFF blog,
Internet users have been asking what they can do to protect their own data from this creepy, non-consensual tracking by Internet providers—for example, directing their Internet traffic through a VPN or Tor. One idea to combat this that’s recently gotten a lot of traction among privacy-conscious users is data pollution tools: software that fills your browsing history with visits to random websites in order to add “noise” to the browsing data that your Internet provider is collecting.
[T]here are currently too many limitations and too many unknowns to be able to confirm that data pollution is an effective strategy at protecting one’s privacy. We’d love to eventually be proven wrong, but for now, we simply cannot recommend these tools as an effective method for protecting your privacy.
This is one of those "two problems one solution" situations.
The problem for makers and users of "data pollution" or spoofing tools is QA. How do you know that your tool is working? Or are surveillance marketers just filtering out the impressions created by the tool, on the server side?
The problem for companies using so-called Non-Human Traffic (NHT) is that when users discover NHT software (bots), the users tend to remove it. What would make users choose to participate in NHT schemes so that the NHT software can run for longer and build up more valuable profiles?
So what if the makers of spoofing tools could get a live QA metric, and NHT software maintainers could give users an incentive to install and use their software?
NHT market as a tool for discovering information
Imagine a spoofing tool that offers an easy way to buy bot pageviews, I mean buy Perfectly Legitimate Data on how fast a site loads from various home Internet connections. When the tool connects to its server for an update, it gets a list of URLs to visit—a mix of random sites, popular sites, and paying customers.
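To make the update step concrete, here is a minimal sketch of how a server might assemble that list. Everything named here is hypothetical: the function `build_visit_list`, the `paid_share` parameter, and the idea of capping paid URLs at a fixed fraction of the list are assumptions for illustration, not a description of any existing tool.

```python
import random


def build_visit_list(random_sites, popular_sites, paid_urls, size,
                     paid_share=0.2, seed=None):
    """Return a shuffled list of URLs for the client to visit.

    Assumption: paid URLs fill at most `paid_share` of the list, and the
    rest is drawn from the random and popular sites so that paid visits
    are mixed in with ordinary-looking traffic.
    """
    rng = random.Random(seed)
    paid_urls = list(paid_urls)
    filler_pool = list(random_sites) + list(popular_sites)

    # Never ask for more items than a pool actually contains.
    n_paid = min(len(paid_urls), int(size * paid_share))
    n_filler = min(len(filler_pool), size - n_paid)

    visits = rng.sample(paid_urls, n_paid) + rng.sample(filler_pool, n_filler)
    rng.shuffle(visits)  # avoid a predictable "paid first" ordering
    return visits
```

A real service would presumably also weight popular sites more heavily and rotate paying customers, but the mixing step is the part that matters for the argument here.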
Now the spoofing tool maintainer will be able to tell right away if the tool is really generating realistic traffic, by looking at the market price of pageviews. The maintainer will even be able to tell whose tracking the tool can beat, by looking at which third-party resources are included on the pages getting paid-for traffic.
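The second check, seeing which third parties are loaded on a page that is paying for traffic, could be sketched roughly as follows. This is an assumption-laden illustration: `ResourceCollector` and `third_party_hosts` are made-up names, only `src` attributes on `script`, `img`, and `iframe` tags are inspected, and any hostname different from the page's own counts as third-party (so a site's own CDN subdomain would be flagged too).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collect src URLs from tags that commonly load trackers."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "img", "iframe"):
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)


def third_party_hosts(page_url, html):
    """Return the sorted hostnames of resources not served by the page's host."""
    collector = ResourceCollector()
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    hosts = set()
    for url in collector.urls:
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlparse(urljoin(page_url, url)).hostname
        if host and host != page_host:
            hosts.add(host)
    return sorted(hosts)
```

Run against the HTML of each paying customer's page, a list like this would hint at which tracking or fraud-detection vendors the tool's traffic is getting past.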
The money probably won't be significant, since real web ad money is moving to whitelisted, legit sites and away from fraud-susceptible schemes anyway, but in the meantime it's a way to measure effectiveness.