Fake Facebook News

There have been lots of articles over the last week or so talking about “fake news” on Facebook, many revolving around the US election.

The ‘poster child’ of Facebook Fake News is this post: “FBI Agent Suspected in Hillary Email Leaks Found Dead…“. It appeared a few days before the US presidential election, and was shared a phenomenal number of times (567,752 according to Facebook’s API). It turned out the “Denver Guardian” does not actually exist – the site is just a shell set up to spread fake news, registered under an anonymous domain owner.

Here’s a quote from an article debunking it:

Interesting, eh? So the fake Denver Guardian article was “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”. And here’s a graph from that article, backing that up:

Quite a compelling chart. From that graph it looks like that Denver Guardian article is way way way more popular than anything the Boston Globe, LA Times, Chicago Tribune, and others have ever posted. Here you can see that debunking article shared on Twitter – Benedict Evans of the famous VC firm Andreessen Horowitz is retweeting it here, on an original tweet from Jay Rosen, who’s a Professor of Journalism at NYU:

408 retweets – I bet quite a few people read that post. Except… if you read into the detail properly, and check the actual data… that graph is not representative either. Here is why:

  • The author of the article just picked a single post, listed as ‘top story’, from each of the publications listed above, on a single day. If he’d picked a day earlier, at a different time, he’d have found much more popular articles; if he’d picked a day later, he might have too.
  • That line about “this article from a fake local paper was shared one thousand times more than material from real local papers” – strictly speaking that’s true, because “material” could mean any article. But it provides a false impression.

I spent a few minutes finding the actual most-shared posts on each of the websites listed above, and remade the graph using those. I went back to the start of September 2016. Here’s how the amended graph looks:

The “Denver Guardian” post is still very high there, but it’s not “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”.

In other words: An article debunking fake news on Facebook actually gives a very false impression of reality itself. It was compelling enough that an NYU professor shared it, & several hundred people retweeted that. The article has itself been shared more than 1,500 times on Facebook.

The author was told that the article was wrong. He quietly updated some of it, and added an explicit update note to the end later on, but most of the elements in the post are left as-is. It still says the Denver Guardian’s article is “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”, and the graph remains intact. The NYU professor was told too, but left the RT as-is. Both probably did all of this with good intent, but the result is that some who read it may take it at face value, and believe the problem to be “several orders of magnitude” greater than it likely is.

Summary:

  • Yes, there is fake information on Facebook. Some of it is deliberate; some of it is due to simple incompetence.
  • If you pick the most shared ‘fake news’ article of all time on Facebook, and compare it against some moderately shared posts from reputable news outlets, the outcome is that the problem looks much greater than it may be.
  • Sometimes very reputable people accidentally share false information; sometimes they leave it there even after it’s noted as being not quite right.
  • Fake news is still a problem. If you wanted, you could probably cheat the stock market, or nudge one or two votes in an election, by pushing a piece of fake news at the right time. And, realistically, there are plenty of avenues Facebook could explore to limit the effectiveness of ‘fake news’.

Take what you read with a pinch of salt and, where you have a few moments spare, do a little of your own research to double check its validity. If it does not “pass the smell test”, maybe wait before hitting RT. But don’t overreact to the problem… it’s extremely unlikely that this fake news is “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”.

The Real Original Source of the Phrase “Big Data”

Big Data

In early 2013, Steve Lohr of the New York Times published an article where he tracked down the origin of the phrase “Big Data”. He found several different sources, and declared that it originated in the mid-1990s. But… he specifically opted to conclude that the very earliest source he could find – from 1989 – was not the originator. His reasoning was based on 2 factors:

  1. He wanted to credit someone who used the phrase in a technical way: “The credit, it seemed to me, should go to someone who was aware of the computing context.”
  2. He did not feel that the original usage of the phrase fitted the same idea of ‘Big Data’ as his. He therefore concluded the first usage was: “not, I don’t think, a use of the term that suggests an inkling of the technology we call Big Data today.”

I read Steve’s article at the time, where he declared that the first ever use of “Big Data” was not the originator, and thought “that’s a little unfair”. I keep going back to it, because the first source he found – apparently the original usage of the phrase “Big Data” – was very insightful, and covers perhaps the two biggest issues in relation to data today: its massive worth from a corporate point of view, and its massive privacy implications from a consumer point of view.

The original article was published on July 26th, 1989, under the headline “How Did They Get Your Name? Direct-mail Firms Have Vast Intelligence Network Tracking Consumers”. It was written by Erik Larson (now a best-selling author). The article talks about organisations gathering, joining, and mining data on millions of people, to use for marketing purposes. Here are a couple of example paragraphs:

“We’ve been scavenged by data pickers who sifted through our driving record and auto registrations, our deed and our mortgage, in search of what direct mailers see as the keys to our identities: our sexes, ages, the ages of our cars, the equity we hold in our home.

The scavengers record this data in central computers, which, in turn, merge it with other streams of revelatory data collected from other sources – the types of magazines we subscribe to, the organizations we support, how much credit we’ve got left – and then spit it all out (for a price) to virtually anyone who wants it.”

It goes on to talk about future implications of all of this:

It is an interesting exercise to imagine the big marketing databases put to use in other times, other places, by less trustworthy souls. What, for instance, might health insurers do with the subscription lists of gay publications?

Despite the dated & simplistic example, this is of course what many people today worry about: what governments try to regulate, where companies spend millions setting up & utilising systems, what we use in real time to deliver relevant ads to people as they browse websites, and – with a little stretching – what much of the NSA/Edward Snowden stuff was about. It is an article from 1989 talking about one of the biggest issues in technology today. And there, in the middle, is the first ever usage of the phrase “Big Data”:

[Image: excerpt from the 1989 article containing the first use of the phrase “Big Data”]

There’s a copy of the original article over on the Orlando Sentinel website, ironically now full of real-time targeted ads. Erik Larson later released a book expanding on the topic, “The Naked Consumer: How Our Private Lives Become Public Commodities”. Despite being 25 years old, both the article and the book essentially describe one of the senses in which we use the phrase “Big Data” today: a cornerstone of modern marketing from a corporate point of view, and, for many, a privacy worry from a consumer point of view.

BuzzFeed is Watching You

When you visit BuzzFeed, they record lots of information about you.

Most websites record some information. BuzzFeed record a whole ton. I’ll start with the fairly mundane stuff, and then move on to one example of some slightly more scary stuff.

First: The Mundane Bits

Here’s a snapshot of what BuzzFeed records when you land on a page. They actually record much more than this, but this is just the info they pass to Google (stored within Google Analytics):

Here’s a description of what’s going on there:

The first line there is how many times in total I’ve visited the site (above this, which I’ve skipped for brevity, it also records the time I first visited, and a timestamp of my current visit).

Below that, the ‘Custom Var’ block is made up of elements BuzzFeed have actively decided “we need to record this in addition to what Google Analytics gives us out of the box”. Against these, you can see ‘scope’. A scope of ‘1’ means it’s something recorded about the user, ‘2’ means it’s recorded about the current visit, ‘page’ means it’s just a piece of information about the page itself.
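
For reference, the ‘Custom Var’ mechanism here is the custom variable feature of the classic Google Analytics ga.js library. Here’s a minimal sketch of how a site sets these, purely to illustrate the scope values above – the variable names below are made up, not BuzzFeed’s actual code:

```typescript
// Hypothetical sketch of setting custom variables with the classic ga.js API.
// The variable names are illustrative only, not BuzzFeed's real ones.
declare const _gaq: Array<unknown[]>;

// _setCustomVar(index, name, value, scope)
// scope 1 = visitor-level (the user), 2 = session-level (this visit), 3 = page-level
_gaq.push(['_setCustomVar', 1, 'fb_connected', 'yes', 1]);    // about the user
_gaq.push(['_setCustomVar', 2, 'logged_in', 'true', 2]);      // about this visit
_gaq.push(['_setCustomVar', 3, 'page_category', 'quiz', 3]);  // about this page
```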

In that snapshot you can also see other info they’re tracking, including:

  • Have you connected Facebook with BuzzFeed?
  • Do you have email updates enabled?
  • Do they know your gender & age?
  • How many times have you shared their content directly to Facebook & Twitter & via Email?
  • Are you logged in?
  • Which country are you in?
  • Are you a BuzzFeed editor?
  • …and about 25 other pieces of information.

Within this you can also see it records ‘username’. I think that’s recording my user status, and an encoded version of my username. If I log in using 2 different browsers right now, it assigns me that same username string, but I’m going to caveat that I’m not 100% sure they’re recording that it is ‘me’ browsing the site (ie. that they’re able to link the data they’re recording in Google Analytics about my activity on the site back to my email address and other personally identifiable information). Either way, everything we’ve covered so far is quite mundane.

The Scary Bit

The scary bit occurs when you think about certain types of BuzzFeed content; most specifically: quizzes. Most quizzes are extremely benign – the stereotypical “Which [currently popular fictional TV show] Character Are You?” for example. But some of their quizzes are very specific, and very personal.

Here, for example, is a set of questions from a “How Privileged are You?” quiz, which has had 2,057,419 views at the time I write this. I’ve picked some of the questions that may cause you to think “actually, I wouldn’t necessarily want anyone recording my answers here”.

When you click any of those quiz answers, BuzzFeed record all of the mundane information we looked at earlier, plus they also record this:

Here’s what they’re recording there:

  • ‘event’ simply means something happened that BuzzFeed chose to record in Google Analytics.
  • ‘Buzz:content’ is how they’ve categorised the type of event.
  • ‘clickab:quiz-answer’ means that the event was a quiz answer.
  • ‘ad_unit_design3:desktopcontrol’ seems to be their definition of the design of the quiz answer that was clicked.
  • ‘ol:1218987’ is the quiz ID. In other words, if they wish, they could say “show me all the data for quiz 1218987” knowing that’s the ‘Check Your Privilege’ quiz.
  • ‘1219024’ is the actual answer I checked. Each quiz answer on BuzzFeed has a unique ID like this. Ie. if you click “I have never had an eating disorder” they record that click.
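
Those pieces map fairly neatly onto Google Analytics’ standard event fields (category, action, label, value). As a purely hypothetical sketch – the mapping and the function are my assumptions, not BuzzFeed’s actual implementation – a quiz-answer click could be sent like this with the classic ga.js API:

```typescript
// Hypothetical sketch of sending a quiz-answer click as a GA event (ga.js).
// Field mapping and names are assumptions, not BuzzFeed's real code.
declare const _gaq: Array<unknown[]>;

function onQuizAnswerClick(quizId: number, answerId: number): void {
  _gaq.push([
    '_trackEvent',
    'Buzz:content',          // category: the type of event
    'clickab:quiz-answer',   // action: a quiz answer was clicked
    `ol:${quizId}`,          // label: which quiz
    answerId,                // value: which specific answer was clicked
  ]);
}

// e.g. onQuizAnswerClick(1218987, 1219024);
```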

In other words, if I had access to the BuzzFeed Google Analytics data, I could query data for people who got to the end of the quiz & indicated – by not checking that particular answer – that they have had an eating disorder. Or that they have tried to change their gender. Or I could run a query along the following lines if I wished:

  • Show me all the data for anyone who answered the “Check Your Privilege” quiz but did not check “I have never taken medication for my mental health”.
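
To make that concrete, here’s roughly what such a query could look like if the hit-level data were exported somewhere queryable. Everything below (the record shape, field names, IDs) is hypothetical – a sketch of the idea, not a real BuzzFeed schema:

```typescript
// Hypothetical sketch: given exported hit-level analytics rows, find the
// visitors who answered a quiz but never clicked one particular answer.
// The Hit shape, field names, and IDs are illustrative, not a real schema.
interface Hit {
  visitorId: string;
  eventAction: string;  // e.g. 'clickab:quiz-answer'
  eventLabel: string;   // e.g. 'ol:1218987' (the quiz ID)
  eventValue: number;   // e.g. 1219024 (the answer ID)
}

function visitorsWhoSkippedAnswer(hits: Hit[], quizId: number, answerId: number): string[] {
  const quizHits = hits.filter(
    h => h.eventAction === 'clickab:quiz-answer' && h.eventLabel === `ol:${quizId}`,
  );
  const tookQuiz = new Set(quizHits.map(h => h.visitorId));
  const checkedAnswer = new Set(
    quizHits.filter(h => h.eventValue === answerId).map(h => h.visitorId),
  );
  return [...tookQuiz].filter(id => !checkedAnswer.has(id));
}
```

The point is not the code itself, but that once every click is recorded against an ID, this kind of “who didn’t check X” question becomes trivial to ask.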

In BuzzFeed’s defense, I’m sure when they set up the tracking in the first place they didn’t foresee that they’d be recording data from quizzes of this personal depth. This is just a single example, but I suspect this particular quiz would have had less than 2 million views if everyone completing it realised every click was being recorded & could potentially be reported on later – whether that data is fully identifiable back to individual users, or pseudonymous, or even totally anonymous.

What do you think?

The Mirror’s Crying Child Photo – Not All That it Seems

Here’s the front cover of the Daily Mirror. A haunting image of a starving British child, crying their eyes out.

Only… the child is from the Bay Area, and the photo was purchased from Flickr via Getty Images…

[Image: the Daily Mirror front cover]

Here’s the source of the original image: https://www.flickr.com/photos/laurenrosenbaum/4084544644/ (Here’s a happier one taken the following day: https://www.flickr.com/photos/laurenrosenbaum/4086511962/. Apparently she was crying over an earthworm.)

An excellent photo, taken by the excellent Lauren Rosenbaum in November 2009, shared on a US website (Flickr), sold by an American photo agency (Getty Images), used to illustrate poverty in Britain.

  • Does it matter that the photo is not really a starving child?
  • Does it matter that the photo wasn’t even taken in the UK?
  • Is there an ethical issue in buying a stock photo of a child – not in poverty – and using it to illustrate poverty?
  • Does it matter that the headline begins “Britain, 2014”, but the photo is actually “USA, 2009”?

I’m not sure of the answers to any of the above, but they’re interesting to think about.

What do you think?

 


Twitter Is Telling Google Not to Follow Your Links

Over the last couple of years, Twitter silently changed the way they treat any links you include in tweets. In doing so, they have given themselves a very nice competitive advantage in lots of areas, but they’ve also silently taken away the ability for search engines to follow the links you post to Twitter.

Here’s what Twitter changed:

  • In the past, clicking a link within Twitter took you directly to the destination.
  • Today, any link you click within Twitter first takes you invisibly to Twitter’s ‘t.co’ URL redirect. Once there, Twitter record various information about the click, before taking you on to your destination. All of this takes a tiny fraction of a second.

For example, clicking this link: http://t.co/1nKSjDDRhd will take you first to ‘t.co’, where Twitter will record the fact that you clicked it, and then you’ll be moved on to the destination URL (in that case, a previous blog post I wrote).
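
If you want to see that hop for yourself, here’s a minimal sketch (assuming Node 18+ with its built-in fetch; t.co’s exact behaviour varies by user agent – sometimes an HTTP redirect, sometimes a small HTML page with a meta refresh):

```typescript
// Minimal sketch: request a t.co link without following redirects, and print
// the status plus the Location header pointing at the real destination.
// Assumes t.co answers this client with an HTTP redirect response.
async function inspectTcoLink(url: string): Promise<void> {
  const res = await fetch(url, { redirect: 'manual' });
  console.log(res.status, res.headers.get('location'));
}

inspectTcoLink('http://t.co/1nKSjDDRhd');
```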

This is a very clever, simple way of allowing Twitter to gather piles of data on which links are most popular, who shares them, who clicks them, etc. As an illustration of the scale, Alexa now treats ‘t.co’ as the 66th most popular website in the world.

The Oddity

The oddity here is this – the robots.txt file Twitter have created to tell all search engines what they can/cannot do with t.co links (http://t.co/robots.txt):

[Image: the contents of the t.co robots.txt file]

Roughly translated into English, the first 2 lines there say:

  • “TwitterBot, there is nothing you are disallowed from crawling.” (ie. Twitterbot is allowed to crawl everything)

The second block of 2 lines says:

  • “All other bots: You are disallowed from crawling anything.” (ie. Unless you’re “Twitterbot”, you are not allowed to crawl anything at all on t.co)
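
Put back into plain text, the file described above boils down to something like this (reconstructed from the description, not copied verbatim):

```
User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
```

In robots.txt terms, an empty Disallow means “nothing is off limits”, while “Disallow: /” blocks the whole site.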

Twitter could make this information available in other ways – for example via their API – but they famously cut off Google from full access to this.

So What?

This is sensible from Twitter’s point of view, as it means they don’t have Google and other search engines crawling every URL posted to Twitter, eating their bandwidth.

But from a website owner’s point of view, and a user’s point of view, it means that Twitter have blocked Google (and any other search engine) from following the links you post to Twitter.

The Hypocrisy of Big News Sites on State Surveillance in Seven Images

Every large news site is preaching about the NSA PRISM programme, and Obama’s apparent hypocrisy in monitoring his citizens.

What none of them mention explicitly is that they themselves use hundreds of technologies to track their readers both on their own sites, and as their readers move around the web.

Here are 6 images showing some of the tracking technologies on big news sites, plus 1 comparison chart of 68 technologies used across 10 large news sites. Note the ironic headlines on a few of these articles.

The Wall Street Journal

The WSJ says ‘US Collects Vast Data Trove’. Take a look at the 44 tracking technologies used on that page alone:

[Image: the 44 tracking technologies on the Wall Street Journal article]

The Washington Post

The Washington Post talks about ‘sweeping surveillance’ on a page with 19 tracking technologies.

[Image: the 19 tracking technologies on the Washington Post article]

Cnet

Admittedly this is an old Cnet article, but take a look at their 20+ tracking technologies:

[Image: the 20+ tracking technologies on the Cnet article]

The Atlantic

The Atlantic often publish articles on privacy. Virtually their entire front page is devoted to the NSA PRISM programme at present. They themselves use a whole host of tracking tools, both directly & via their many social plugins.

[Image: the tracking technologies on The Atlantic]

GigaOm

No hypocrisy between the headline & the tracking technologies used by Om Malik, but interesting nonetheless.

[Image: the tracking technologies on GigaOm]

The New York Times

And double-irony from the NYT here. Take a look at the ad that’s automatically displayed – ‘2 friends are spying on you’ – while the page itself has 17 tracking tools recording data about you.

[Image: the 17 tracking tools on the New York Times article]

Comparison of 68 Technologies Used by UK News Sites:

Finally, here’s a comparison I put together for an Econsultancy article (who use 13 technologies themselves) covering this:

[Image: comparison chart of 68 tracking technologies across 10 news sites]

The tools used for most of this were the excellent Ghostery, and Google Chrome’s Developer Tools.

Do share this with others if you have the chance. Outside of tech circles, I’m not sure many people realise quite how much of this is going on.

Goldman Sachs, Bloomberg, and Data Literacy

The biggest finance/data story of the month is that “Bloomberg snooped on Goldman Sachs”. Here is one of the dozens (thousands) of articles covering it: http://theweek.com/article/index/244050/is-bloomberg-news-spying-on-goldman-sachs

What’s the fuss about?

This is the summary of the story:

  1. Most banks & financial institutions use Bloomberg systems to gather information about financial markets.
  2. Bloomberg record data on who accesses those systems, when they do it, and what they do.
  3. Bloomberg’s journalists were using that information, and analysis of how their terminals were being used, as the basis of news articles.
  4. Goldman figured this out, and confronted Bloomberg accusing them of snooping.

Gawker (very foolishly in my opinion) say this about it:

“The whole thing sounds like the News of the World scandal, except if the targets were paying Rupert Murdoch $20,000 for the privilege.”

Here’s the irony:

What is Goldman Sachs’ advice on how companies should use data?

In October of last year, Goldman Sachs themselves were crowing that ‘data’ was the biggest opportunity for companies.

Their co-head of Internet Investment Banking at the time put out a series of videos covering this. Here was his (paraphrased by VentureBeat) advice on what companies needed in order to harness this opportunity:

  1. Access to proprietary data,
  2. Wherewithal/knowledge of what to do with it/how to process it, and
  3. The right relationship with the consumer in order to apply the data.

Think through those three, and compare them to what Bloomberg did.

Of course, there are enormous marketing & trust implications with using & exposing customer data in the way Bloomberg did, but it’s madness (verging on ‘data illiterate’) that Goldman Sachs would simply assume that zero analysis was taking place on how their staff were using Bloomberg terminals, especially as both Goldman & Bloomberg are in the business of data and analysis. And even more so because Bloomberg’s contractual terms allowed them to capture and analyse the data.