Wednesday, July 27, 2011

Degradation of Privacy

I guess this stance is an unpopular one - or, rather, an unfashionable one - but I really hate it when companies gather data about me and my friends.

I'm bringing this up here because I catch a lot of flak about it from even fairly geeky friends. So let me explain why I am against corporations collecting and retaining information. More specifically, why I think it should be illegal for corporations to use facial recognition software or otherwise "deanonymize" information.

Let's start small.

Point one: corporations cannot be trusted to keep data safe. Corporate IT practices are notoriously poor, and there have been hundreds of examples of accounts (even ones with financial information such as credit card numbers) being stolen by the hundreds of thousands. Sony is the most recent loud example, but Citibank and others show it is hardly a black swan.

So, even before we get to things like privacy, corporations can't even be trusted to keep safe the data they actually need to operate day to day. There are plenty of practices that can keep user data safe even if the corporate database is hacked or leaked. These all involve not keeping the data in the first place: keeping only hashes, discarding the credit card number except perhaps the last four digits for identification, and so on.
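To make the "don't keep it" idea concrete, here is a minimal Python sketch. The function names, the PBKDF2 settings, and the sample card number are all illustrative assumptions, not a description of what any particular company does. It stores only a salted hash of a password and only the last four digits of a card number.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor for this sketch; tune for real use.
ITERATIONS = 200_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). Only these two values get stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the digest from the attempt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, digest)


def last_four(card_number: str) -> str:
    """Keep only the last four digits of a card number for identification."""
    digits = [c for c in card_number if c.isdigit()]
    return "".join(digits[-4:])


if __name__ == "__main__":
    salt, digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, digest))
    print(last_four("4111 1111 1111 1111"))  # made-up test number
```

If a database holding only the salt, the digest, and the last four digits leaks, the thief gets far less to work with than if the raw password and full card number had been sitting there.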

Some data are less critical. It's a pain when someone steals 500,000 client email addresses, but it's probably not going to result in your clients actually being harmed significantly - just an uptick in the amount of useless spam being caught by the filters. Except that's not actually true: that data can help deanonymize other data, which is a problem most people don't bother considering.

Point two: Data give corporations advantages. Corporations are businesses, 99.99999% of which are out to get as much money as possible. Even without any information about you, corporations build their products to lock you in and drill your pockets as much as possible.

While I don't like this much, I understand that it's not feasible to magically make corporations stop doing that. So let's proceed with the idea that any advantage the corporation can get will go to mining its consumers as much as possible.

Normally, this is moderated by competition. If one company is too abusive, you can switch to a competitor that offers a very similar product.

However, data are nontransferable value. It's best to think of data as the on-line equivalent of "location, location, location!" The reason eBay is popular is that eBay is popular: the mass of data it has - the number of transactions, the ratings history, and so on - makes it more valuable to post your stuff to eBay than to a smaller competitor. The only competitors likely to succeed are those specializing in very limited fields where the noise on eBay is actually a downside.

This is true of social data as well. A big difficulty for most people moving from Facebook to Google+ was the need to recreate their social network. Google+ reduced this difficulty by offering up suggestions based on your email history. Google was able to (somewhat) overcome the mass of Facebook's data by leveraging its own, similar data. However, a social site such as Appleseed does not have data to leverage, and is therefore at a tremendous disadvantage.

This is not simply value-add data, either. Having information about your users allows you to advertise to them as well as actually make their experience better. Amazon is a great example of this: it will throw hundreds of targeted ads at you on every page - "also bought X" "if you like Y, try Z" "EVERYONE IS BUYING A KINDLE OH GOD WHY WON'T YOU BUY A KINDLE YOU BASTARD LOOK HERE IS EVERY KINDLE EVER JUST PICK ONE BUY IT PLEEEEEEEEAAAASE!"

And so on.

This is a business advantage. Better advertising of related services and products is an advantage. It can generate revenue via ad fees or via higher conversion rates on direct sales.

Summary: think of data as location. The more data someone has, the closer they are to your house. That means that you're more likely to shop there, and if they open another store in their mall, you'll be more likely to shop at that new store. Data are a direct business advantage, and you're not likely to drive an hour to go to some interesting new place no matter how fancy-pants it is in comparison.

Point three: data can be combined, and it is easier to do so if you have more data. Some of you may have noticed I've been sticking to "data as plural", normally a pedantic and irritating choice. This is because I think one detail most people miss about data is that they are plural. Data are not like a bouncy ball. Data are like water in a cup: pour two cups together and you just have more water.

Companies can combine data. This is one reason why companies frequently sell data to each other. To date, 99% of the data on the web about you has been more or less the same: your name and email address are the details that can get sold, your home address and credit card information are the details that cannot be sold.

But these days, there are a lot more details out there than you might think. What movies you like to rent, where you shop, who you phoned, what topics of conversation are common in your emails and social network chats, what kind of porn you surf for, which aliases are yours and not somebody else's, what your political preferences are, what kind of stupid shit you said ten years ago.

Most of these data are pretty useless to most people. Most corporations don't even care about them. But they are out there. It is really easy to trawl your Facebook or Twitter or Google+ account to collect a list of everyone you talk to and who talks to you.

Right now, your Amazon.com account is tied to an email address. Your Google+ address, probably. So Amazon can automatically, with no humans involved, look to see who is in your circles and vice versa. And, next time you go to Amazon.com, it'll say "Hey, Greg bought this book, you should buy it too!" Of course, Greg bought "Animal Sex and You: A Practical Guide", so it might be a popup you wish you'd never seen...

Think it's outlandish? Here's a fun experiment! Go to Amazon.com, and search for wishlist. Just randomly punch in people's gmail accounts. I got a hit rate of around 20%.

Point four: no, really, data can be combined. I'm not joking. I don't really think I stressed this enough. Data can be automatically aggregated and combined. Even if you're not involved.

For example, if one of my friends posts a picture of me and then labels my face, I am now in the datastream. Especially if he labels me by email address or other unique identifier. Moreover, with fun facial recognition software, it's possible to then go and find other pictures that are likely to be of me. Even if I'm not in the system as a user, the system has information about me.

Ever link to someone? "Oh, my friend Jerry posted this on Facebook: kalinkylinky". Congrats, your friend Jerry has now been linked to you, even if he's not even on the same service as you. Even if he has you blocked because you're a creepy stalker.

Think making your privacy settings strict will save you? Nope, it's pretty easy for me to reconstruct your social network using your friends who do NOT have strict privacy settings.

Let me make that clear: even if you set your account to strict privacy or don't participate at all, if you have any connections to other people who aren't quite so strict, your data can be easily reconstructed.
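Here is a rough Python sketch of that reconstruction, just to show how little work it takes. Everything in it is hypothetical: the names, the data layout, and the assumption that each lax user's friend list is readable as a simple list. The person called "you" publishes nothing at all, yet still ends up with a friend list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def reconstruct_friends(public_friend_lists: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Infer everyone's connections from the friend lists that ARE public.

    Friendship is treated as mutual, so if Alice lists you,
    you are linked to Alice whether or not you publish anything.
    """
    inferred: Dict[str, Set[str]] = defaultdict(set)
    for owner, friends in public_friend_lists.items():
        for friend in friends:
            inferred[owner].add(friend)
            inferred[friend].add(owner)
    return inferred


if __name__ == "__main__":
    # Hypothetical data: "you" have strict settings and appear nowhere as a key.
    public = {
        "alice": ["you", "bob"],
        "bob": ["you", "alice"],
        "carol": ["you"],
    }
    print(sorted(reconstruct_friends(public)["you"]))  # ['alice', 'bob', 'carol']
```

The strict account never shows up as a source of data, only as a name in other people's lists, and that is enough.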

This is the same concept as "deanonymization". It's easy to take data that are supposed to be private or anonymous and link them up to a particular person using data from another source (or from the same source but another vector - i.e., your friends' accounts).
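As a toy illustration of that linking step, here is a Python sketch that joins a "pseudonymous" dataset to an "identified" one on a shared email address. The record fields (alias, email, name) and all of the sample values are made up for the example. Real deanonymization tends to use fuzzier keys than an exact email match, so treat this as the simplest possible case.

```python
from typing import Dict, List


def normalize(email: str) -> str:
    """Lowercase and trim so trivially different spellings still match."""
    return email.strip().lower()


def link_records(pseudonymous: List[Dict[str, str]],
                 identified: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each alias to a real name wherever the two datasets share an email."""
    name_by_email = {normalize(r["email"]): r["name"] for r in identified}
    links: Dict[str, str] = {}
    for record in pseudonymous:
        key = normalize(record["email"])
        if key in name_by_email:
            links[record["alias"]] = name_by_email[key]
    return links


if __name__ == "__main__":
    # Hypothetical forum accounts that are "anonymous" apart from a signup email.
    forum_accounts = [
        {"alias": "sparkle_fan_99", "email": "J.Doe@example.com"},
        {"alias": "quiet_lurker", "email": "nobody@example.org"},
    ]
    # Hypothetical customer list, e.g. from a leak or a data sale.
    customer_list = [
        {"name": "Jane Doe", "email": "j.doe@example.com "},
    ]
    print(link_records(forum_accounts, customer_list))
```

One shared field is all it takes, which is why a stolen list of email addresses is more than a spam problem.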

Point five: data may be used to discriminate against you. Assuming that you don't care about all that above, let me remind you that data can be used against you, and probably already are.

Most employers will at least perform a simple search on your name when you interview with them. Many employ a "drilling" service that will trawl for all accounts that can be linked to you and then trawl through their posts, looking for references to things like drugs.

These are services which already exist and are used reasonably frequently. It's only half a step to finding your circle of friends and, even if your account posts are private, finding what THEIR posts are about and what kind of comments you've left on their posts.

I use this as an introductory problem because it is one most people realize exists already. However, it is flat-out minor in comparison to other discriminatory activities. And here I don't mean racial discrimination, but the more general term meaning "judgment based on details or categories".

For example, we've already seen a few cases where people have gotten in trouble for Facebook pictures showing them carousing when they were supposed to be unable to work. Posts about your health are apparently court-worthy evidence when it comes to not paying out health insurance.

You have a side of you that employers, parents, government officials, and Amazon.com shouldn't know about? Well, your aliases are a fragile anonymity. Once broken, pseudonyms disintegrate. If you posted pornographic Twilight fanfic as a high school student under a pseudonym, the minute that pseudonym is linked to your adult identity, you are forever labeled as someone who has really, really shitty taste.

To those of us in a pretty swank position of privilege, this seems like a rather minor inconvenience. A) It's not. It's a major lifestyle change to give up all privacy to everyone who might want to track you for any reason. B) People who aren't rich white guys are much more subject to problems due to this kind of privacy invasion, so their issues will be 10x worse.

Anyway, the only solution I can see is to make it illegal for companies to deanonymize data or leverage publicly available data on their consumers. Otherwise, this stuff will happen if even a few of your friends are lax in their privacy settings.

Of course, that illegality doesn't extend to things like foreign companies and governments. But it should slow the spread if most of the major players that normally accrue data (such as Google, Amazon, etc.) aren't allowed to accrue it in ways that endanger you. Having to compile all that data themselves is a significant stumbling block to any oppressive foreign power seeking to crack down on, say, demonstrations against them.

Most of us live in a very cushy world where we can't imagine data being used against us in any significant way. "It's just ads!" That's nearsighted and egotistical.
