How to Repeat what Cambridge Analytica Did

Lots of the newspaper articles about Cambridge Analytica speak as if what they (apparently) did was extremely advanced. Many also speak about it as if it’s unrepeatable.

It was not extremely advanced. And it is repeatable.

  • Advanced? Most of the things they apparently did were standard marketing tactics stretched in an unethical way.
  • Repeatable? In order to repeat what they did: All you would need is money, knowledge, & some moderate data skill (moderate may be read as ‘talented year 1 university computer science’ level; more is better, but with the right knowledge & setup, what they did doesn’t seem to be a task of extreme skill).

Below is an example detailing at a high level how someone could repeat what they did today.

Important caveats:

  • This doesn’t take into account legality, though all of the tools mentioned are widely available (+data laws change frequently).
  • This doesn’t take into account ethics. Some of those who’ve come forward from Cambridge Analytica have essentially said “I was just doing my job” – I think if you are working to alter people’s behaviour, you should be ethically aligned with what you’re doing (or at least not fully opposed).
  • It’s important also to note that Cambridge Analytica say they did nothing wrong, and that the data they used was gathered legitimately. That may well be the case: It would not be hard to do what they appear to have done by gathering the data legitimately, just more costly.

I’ve split this into 3 parts: ‘The Data’, ‘Taking Action’, ‘Summary’.

Section A: The Data

Data Step 1. Gather Personality Type Data

One of Cambridge Analytica’s big ‘Unique Selling Point’ claims was that they used ‘Psychographic Targeting’ to influence potential voters. This has a fancy name, but is a fairly simple principle, and is used within some mainstream marketing activity (mainly mass advertising, or data work designed to understand user behaviour).

Psychographic Targeting basically means categorising users into different personality types, and presenting them with different information or different ads designed to appeal to the flaws/nuances of their personality in order to push them toward carrying out particular behaviours. In the case of Cambridge Analytica, they say they did this either to urge users to vote, to move users toward ‘advocacy’ (pushing others to vote), or to suppress the likelihood of some voting for an opponent.

We know Cambridge Analytica claimed Psychographic Targeting as a big part of what they did, as the CEO (Alexander Nix) liked to do presentations telling people they did:


In part, this may be sales pitch: Anyone who has worked with high-spend media agencies knows they occasionally embellish the truth a little, saying they’re using advanced techniques. Here, though, it looks like they did carry out some of this activity:

Below is a slide where you can see him showing data including what’s referred to as ‘Big Five Inventory’. That means targets categorised by the ‘Big 5’ personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism (‘OCEAN’).

(snipped from:

Whether or not you believe this stuff works to the extent they say it did aside: This is probably the hardest element in replicating what Cambridge Analytica did, as it requires a big base of personality data. That’s essentially what they (Cambridge Analytica) & Facebook are in trouble for: The claim is that they used dodgy methods to acquire that personality data, and did not delete that data when Facebook asked them to.

Gathering a Pool of Personality Data

There are many ways to gather personality type data. If you wanted to replicate what Cambridge Analytica did, and you had the money, you could simply buy up a legitimate source of personality data.

There are many companies that collect this kind of thing. Some are very small operations sitting on lots of data, and therefore likely to be ‘cheap’ to acquire. Here’s one example, who use ads on The Guardian and many other publications to bring new customers to their IQ and general Personality Tests:

(Ironically the above is from a Guardian article on Cambridge Analytica)

To get an idea of the amount of data those people hold, below are some rough monthly estimates of the traffic for the site listed in the ad there (the stats below are gathered from another tool that monitors users’ behaviour – SimilarWeb – who buy up data on millions of web users’ habits):

Whether you create a set of personality data yourself or acquire it via a third-party source (either buying from a company that has it, or buying outright a company that has that type of data), you could then fairly simply categorise users within that data, allowing you to A) test the effectiveness of ads by personality type, and B) use that data & that feedback loop to tailor messages for other channels.

Personality Data Summary: It should be quite straightforward to buy a big source of personality data. If you need to grow one from scratch, you can still do that legitimately using ads to send people to personality tests (ironically, The Guardian host lots of those ads at present).

Data Step 2. Add to your Personality pool with a broader data set.

Your pool of personality type data could be used to segment some users. An extra claim re Cambridge Analytica is that they also linked this to a wider set of data (In their case: all the Facebook contacts of users who took personality tests). They appear to have had various information on those users – age, gender, location, relationship to other users within their pool, and more.

2.1: Buying a set of data.

You could replicate that, but Facebook have tightened up their controls since. The data could, however, be gathered in different ways: There are many vendors of Email Data. Sometimes this data comes with the permission to email users, sometimes it does not. Sometimes it comes with many of the above attributes (age, gender, location, etc).

Summary: Many companies sell data on US / UK / citizens of other countries.

2.2: Adding to that set of data.

There are also many, many services who allow you to ‘augment’ email data. Some of these simply provide age, gender, whether the email address is active or not. Some provide much more in depth categorisation. Here are 3 examples who provide much more:

  • FullContact lets you supply a list of email addresses & returns with details from their accessible social media profiles (including profiling of bands, books, topics users have mentioned, etc).
  • Clearbit does something very similar, collecting ‘data from over 150 sources in real time’.
  • Pipl is another tool, with data on over 3,163,959,452 people, allowing you to gather details of their social media profiles, job information, and more.

Here’s a snapshot of some of the data available from FullContact, provided matched against a list of email addresses you provide to their system:

Using some of the above sources, it may also be possible to carry out the ‘social graph’ analysis that Cambridge Analytica’s set of data appears to have allowed them (ie, which users are connected to which others; which are of particular influence, etc).

2.3: Further data sources.

Each of the above is a quite legitimate data source (albeit having ‘permission’ to contact the people in it relies on that having already been granted & documented somewhere). Alongside each of the above, there are many less legitimate sources of data available. Two examples of this type of data:

  • Data gathered by browser plugins. Many users don’t realise, but the plugins you add to web browsers can often record which sites you visit, when you visit them, other activity in the browser (text you put into some forms, etc).
  • Data gathered by mobile apps. For example: Those ‘Free Flashlight’ apps on your phone? How do you think they make money? Some mobile apps record what you do on your phone, which other apps you use, where you go, which websites you visit, your contacts, and more.

Each of the above could be used 

There are many other sources of data like this, from big credit reference agencies, to small operations who buy up email data.

Data Step 3. Link any data you’ve sourced with data gathered directly yourself

Every political campaign gathers its own data, and uses this to understand which voters to target as ‘voters’ and which as ‘advocates’ (ie. those who will push others to vote – sometimes creating lists of those people individually so that campaigners can ring round on the day to ensure everyone’s voted). Some may also identify opposition voters to attempt to ‘suppress’.

The official UK Brexit campaign built a system like this called ‘VICS’, the ‘Voter Intention Collection System’. Many use off-the-shelf tools.

One such off-the-shelf tool is ‘Nation Builder’ – which allows anyone with money to build up big databases of their own supporters. Nation Builder has been used by almost every political organisation you can think of (plus charities and others). It was used by the UK ‘Remain’ Brexit campaign, as well as by UKIP (the UK Independence Party). It was used by Donald Trump and it was used by Bernie Sanders. It was used by Ted Cruz, even though it was created by a former John Kerry aide. It’s used by the Conservative Party in the UK, and also used by the Labour Party and their loosely affiliated ‘Momentum’ group.

You will recognise the pattern: A web page asks you to support a candidate. It usually gathers one or two small pieces of information (your postal code, whether you’re registered to vote, etc). From there, you receive regular emails asking for donations, you’re organised into a community of others where you may be asked to organise grassroots events, you may be encouraged to share particular messages on social media, you may be given little surveys that basically look like quizzes, but are intended to figure out which issues can be used to increase the likelihood of you voting for a candidate, or promoting a candidate, or – on the reverse – which issues could be used to prevent you from voting for the opposition.

Here’s how they explain what they do:

And here’s how they build profiles of supporters, also offering some matching up with social profiles:

So there we have 3 pools of data, which we’ve joined together:

  1. A base of personality type data, which we can use to target particular types of ads/stories.
  2. A whole heap of behavioural data, showing which websites people visit, what they talk about on social media, etc.
  3. A lot of data we’ve gathered ourselves, all organised neatly in a way that allows us to identify supporters / advocates, gather donations, and push users to action.
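To make the ‘joined together’ part concrete, here is a minimal Python sketch of linking three pools on a shared key. It assumes every pool is keyed by email address. The pool names, the attribute names and the `join_pools` function are all invented for illustration; nothing here is taken from Cambridge Analytica or any vendor.

```python
def join_pools(personality, behaviour, campaign):
    """Merge three {email: attributes} pools into one profile per email."""
    pools = (
        ("personality", personality),
        ("behaviour", behaviour),
        ("campaign", campaign),
    )
    profiles = {}
    for pool_name, pool in pools:
        for email, attrs in pool.items():
            # Normalise the key so the same address matches across pools.
            key = email.strip().lower()
            profiles.setdefault(key, {})[pool_name] = attrs
    return profiles


# Invented example records (hypothetical people and attributes).
personality = {"Ann@example.com": {"dominant_trait": "openness"}}
behaviour = {"ann@example.com": {"sites_visited": 42}}
campaign = {
    "ann@example.com ": {"donated": True},
    "bob@example.com": {"donated": False},
}

profiles = join_pools(personality, behaviour, campaign)
print(profiles["ann@example.com"])
```

The point is less the code than how little of it there is: once the pools share an identifier, joining them is a few lines of work.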

Section B: Taking Action

We’ve discussed the data side of things; the other area in which Cambridge Analytica provided services was taking action on the basis of that data.

The actual actions Cambridge Analytica say they took, according to leaked presentations, are fairly basic. The data, and a process around adding to the data & understanding the current status of users, is the key part.

Here are 3 slides released by The Guardian, explaining what they did.

Action Tactic 1: Segmented / 1:1 targeting via email, social media.

The above is essentially action based around the elements we’ve spoken about so far:

  1. Grow a large pool of data.
  2. Create ads (& other ‘content’) to target to particular groups or individuals within that pool. Show them those ads/pieces of content via Facebook, Youtube, Twitter, Snapchat, email, and also use for TV ads & messaging.
  3. Run polling to understand the current status of each user, or each segment (intent to vote, direction of vote, whether they’ve encouraged others/would encourage others, etc).
  4. Monitor the data.
  5. Understand whether you’re on target/off target with particular groups or particular regions, allowing you to prioritise.
  6. Alter ads/content on the basis of points 3, 4, 5 above.
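Steps 3 to 5 above boil down to comparing polled results against a target for each group. Here is a rough Python sketch of that comparison. The segment names, the poll answers, the 60% target and the `off_target_segments` function are all assumptions made up for the example.

```python
def off_target_segments(polls, targets):
    """Return segments polling below target, biggest shortfall first.

    polls:   {segment: list of 0/1 answers to 'do you intend to vote?'}
    targets: {segment: share of 'yes' answers the campaign wants}
    """
    shortfalls = {}
    for segment, answers in polls.items():
        if not answers:
            continue  # no polling data yet for this segment
        share = sum(answers) / len(answers)
        gap = targets[segment] - share
        if gap > 0:
            shortfalls[segment] = gap
    return sorted(shortfalls, key=shortfalls.get, reverse=True)


# Hypothetical polling results per region.
polls = {
    "region_a": [1, 1, 0, 0],
    "region_b": [1, 1, 1, 0],
    "region_c": [0, 0, 0, 1],
}
targets = {"region_a": 0.6, "region_b": 0.6, "region_c": 0.6}

print(off_target_segments(polls, targets))
```

The list that comes back is the prioritisation in step 5: the groups furthest behind target come first, and those are the ones whose ads/content you’d look at changing in step 6.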

That’s most of the 1:1 work, covered by social channels & email.

Action Tactic 2: Bigger Media Buys for Direct Action

Here Cambridge Analytica say they placed ‘broader’ media buys. This is the type of activity that any very large consumer-facing campaign would carry out – buying mass ads to display to hundreds or millions of users. In their case, they say they targeted these in 3 ways:

  1. They based the messaging on elements they’d already measured as being successful within their more targeted advertising.
  2. They’d segmented these by location.
  3. They placed these on key outlets, where they knew they had a chance to affect behaviour, and placed them at crucial moments.

All three of the above are extremely straightforward if you’ve monitored your data closely through a campaign.

In terms of ‘crucial moments’: You can see the date on the above examples was election day itself. Ie, after all of the prior testing of ads, they ran a big push on the day. Everyone working in digital marketing will be familiar with the fact that the bulk of results for ‘direct response’ techniques (email, direct message campaigns, direct response ads) come immediately at the point that the money is spent, or the emails are sent. The ‘brand’ work to create users with the potential to act happens before that; the direct response activity is designed to have an impact at the point it goes live.

The above examples are also similar to the technique chosen by the official UK ‘Vote Leave’ Brexit campaign, who chose to spend large amounts of their media budget right at the end. 



Action Tactic 3: Search Advertising

This is such a minor, simple detail compared to the other elements above, but Cambridge Analytica thought it worth setting out separately and showing off in presentations. That may be because it’s such a broad, simple tactic that it would appeal to any potential client without much awareness of digital advertising, or it may be that they actually felt it to be advanced.

Alongside all of the other tactics, they also explain they did something rather basic: Placed ads within Google search results, against particular terms users may search for. Here’s an example from their presentation:

There are 3 ads there, that cover different elements of potential ‘voter journeys’:

  • Ad 1: An ad that’s targeted at users searching for a particular factual topic. This is interesting in that it positions him as opposing the Iraq war: Presumably that came back as a positive among the polling they did.
  • Ad 2: A particularly negative ad re Hillary, related to a particular factual search. (user intent: which candidate am I aligned with re trade?)
  • Ad 3: A positive ad re Donald Trump.

Each of the ads pointed to pages that set forward information, and also attempted to gather donations / email opt ins. This is vaguely interesting, but is particularly basic marketing that anyone working on a campaign with a decent amount of money would likely cover.

Overall Summary

What Cambridge Analytica appear to have done was not particularly advanced. It simply required money, people to carry out, and (it appears) the willingness of some to ignore their ‘ethical misalignments’ with the campaign.

All of the actual data/marketing techniques carried out were relatively straightforward, known tactics.

It’s no longer so easy to gather the data they did from the source they took it from, but there are plenty of other possibilities for gathering similar data sets – whether gathered directly in its entirety, partly gathered directly with additional data added to it from other sources, or bought in from either a fully legitimate source or a less legitimate source.

Fake Facebook News

There have been lots of articles over the last week or so talking about “fake news” on Facebook, many revolving around the US election.

The ‘poster child’ of Facebook Fake News is this post: “FBI Agent Suspected in Hillary Email Leaks Found Dead…“. It appeared a few days before the US presidential election, and was shared a phenomenal number of times (567,752 according to Facebook’s API). It turned out the “Denver Guardian” does not actually exist – the site is just a shell set up to spread fake news, registered under an anonymous domain owner.

Here’s a quote from an article debunking it:

Interesting, eh? So the fake Denver Guardian article was “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”. And here’s a graph from that article, backing that up:

Quite a compelling chart. From that graph it looks like that Denver Guardian article is way way way more popular than anything the Boston Globe, LA Times, Chicago Tribune, and others have ever posted. Here you can see that debunking article shared on Twitter – Benedict Evans of the famous VC firm Andreessen Horowitz is retweeting it here, on an original tweet from Jay Rosen, who’s a Professor of Journalism at NYU:

408 retweets – I bet quite a few people read that post. Except… if you read into the detail properly, and check the actual data… that graph is not representative either. Here is why:

  • The author of the article just picked a single post, listed as ‘top story’, from each of the publications listed above, on a single day. If he’d picked a day earlier at a different time, he’d have found much more popular articles; if he’d picked a day later he may have too.
  • That line about “this article from a fake local paper was shared one thousand times more than material from real local papers” – strictly speaking that’s true, because “material” could mean any article. But it provides a false impression.

I spent a few minutes looking for the actual most shared posts on each of the above listed websites to remake the graph taking the actual ‘most shared’ posts. I went back to the start of September 2016. Here’s how the amended graph looks:

The “Denver Guardian” post is still very high there, but it’s not “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”.
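A quick way to sanity-check a phrase like “several orders of magnitude” is to take the base-10 logarithm of the ratio between two share counts. The Python sketch below does that. Only the 567,752 figure comes from this post; the two comparison counts are invented placeholders, since the numbers behind the graphs aren’t reproduced here.

```python
import math


def orders_of_magnitude(a, b):
    """Number of powers of ten separating share count a from share count b."""
    return math.log10(a / b)


fake_story_shares = 567_752  # figure quoted above from Facebook's API

# Hypothetical comparison figures, NOT taken from the original graphs.
arbitrary_top_story_shares = 500   # a story that happened to be 'top' on one day
most_shared_story_shares = 150_000  # an outlet's genuinely most shared story

gap_vs_arbitrary = orders_of_magnitude(fake_story_shares, arbitrary_top_story_shares)
gap_vs_most_shared = orders_of_magnitude(fake_story_shares, most_shared_story_shares)

print(f"vs arbitrary story:   {gap_vs_arbitrary:.1f} orders of magnitude")
print(f"vs most shared story: {gap_vs_most_shared:.1f} orders of magnitude")
```

With placeholder numbers like these, the gap is about three orders of magnitude against an arbitrary story and well under one against a paper’s most shared story. That is the whole difference between the two graphs: which post you choose to compare against.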

In other words: An article debunking fake news on Facebook actually gives a very false impression of reality itself. It was compelling enough that an NYU professor shared it, & several hundred people retweeted that. The article has itself been shared more than 1,500 times on Facebook.

The author was told that the article was wrong. He quietly updated some of it, and added an explicit update note to the end later on, but most of the elements in the post are left as-is. It still says the Denver Guardian’s article is “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”, and the graph remains intact. The NYU professor was told too, but left the RT as-is. Both probably did all of this with good intent, but the result is some who read it may take it at face value, and believe the problem to be “several orders of magnitude” greater than it likely is.


  • Yes, there is fake information on Facebook. Some of it is deliberate; some of it is due to simple incompetence.
  • If you pick the most shared ‘fake news’ article of all time on Facebook, and compare it against some moderately shared posts from reputable news outlets, the outcome is that the problem looks much greater than it may be.
  • Sometimes very reputable people accidentally share false information; sometimes they leave it there even after it’s noted as being not quite right.
  • Fake news is still a problem. If you wanted, you could probably cheat the stock market, or nudge one or two votes in an election, by timing & pushing a piece of fake news at the right time. And, realistically, there are plenty of avenues Facebook could explore to limit the effectiveness of ‘fake news’.

Take what you read with a pinch of salt and, where you have a few moments spare, do a little of your own research to double check its validity. If it does not “pass the smell test”, maybe wait before hitting RT. But don’t overreact to the problem… it’s extremely unlikely that this fake news is “several orders of magnitude more popular a story than anything any major city paper publishes on a daily basis”.

The Real Original Source of the Phrase “Big Data”

Big Data

In early 2013, Steve Lohr of the New York Times published an article where he tracked down the origin of the phrase “Big Data”. He found several different sources, and declared that it originated in the mid-1990s. But… he specifically opted to conclude that the very earliest source he could find – from 1989 – was not the originator. His reasoning was based on 2 factors:

  1. He wanted to credit someone who used the phrase in a technical way: “The credit, it seemed to me, should go to someone who was aware of the computing context.”
  2. He did not feel that the original usage of the phrase fitted the same idea of ‘Big Data’ as his. He therefore concluded the first usage was: “not, I don’t think, a use of the term that suggests an inkling of the technology we call Big Data today.”

I read Steve’s article at the time, where he declared that the first ever use of “Big Data” was not the originator, and thought “that’s a little unfair”. I keep going back to it, because the first source he found, and apparently the original usage of the phrase “Big Data” was very insightful, and covers perhaps the two biggest issues in relation to data today: its massive worth from a corporate point of view, and its massive privacy implications from a consumer point of view.

The original article was published on July 26th, 1989, under the headline “How Did They Get Your Name? Direct-mail Firms Have Vast Intelligence Network Tracking Consumers”. It was written by Erik Larson (now a best-selling author). The article talks about organisations gathering, joining, and mining data on millions of people, to use for marketing purposes. Here are a couple of example paragraphs:

“We’ve been scavenged by data pickers who sifted through our driving record and auto registrations, our deed and our mortgage, in search of what direct mailers see as the keys to our identities; our sexes, ages, the ages of our cars, the equity we hold in our home.

The scavengers record this data in central computers, which, in turn, merge it with other streams of revelatory data collected from other sources – the types of magazines we subscribe to, the organizations we support, how much credit we’ve got left – and then spit it all out (for a price) to virtually anyone who wants it.”

It goes on to talk about future implications of all of this:

It is an interesting exercise to imagine the big marketing databases put to use in other times, other places, by less trustworthy souls. What, for instance, might health insurers do with the subscription lists of gay publications?

Despite the dated & simplistic example, this is of course what many people today worry about: what governments try to regulate, where companies spend millions setting up & utilising systems, what we use in real time to deliver relevant ads to people as they browse websites, and – with a little stretching – what much of the NSA/Edward Snowden stuff was about. It is an article from 1989 talking about one of the biggest issues in technology today. And there, in the middle, is the first ever usage of the phrase “Big Data”:


There’s a copy of the original article over on the Orlando Sentinel website, ironically now full of real-time targeted ads. Erik Larson later released a book expanding on the topic: “The Naked Consumer: How Our Private Lives Become Public Commodities”. Despite being 25 years old, both the article and the book essentially talk about one of the versions of the phrase “Big Data” we use today: a cornerstone of modern marketing from a corporate point of view, and a privacy worry from a consumer point of view for many.

BuzzFeed is Watching You

When you visit BuzzFeed, they record lots of information about you.

Most websites record some information. BuzzFeed record a whole ton. I’ll start with the fairly mundane stuff, and then move on to one example of some slightly more scary stuff.

First: The Mundane Bits

Here’s a snapshot of what BuzzFeed records when you land on a page. They actually record much more than this, but this is just the info they pass to Google (stored within Google Analytics):

Here’s a description of what’s going on there:

The first line there is how many times in total I’ve visited the site (above this, which I’ve skipped for brevity, it also records the time I first visited, and a timestamp of my current visit).

Below that, the ‘Custom Var’ block is made up of elements BuzzFeed have actively decided “we need to record this in addition to what Google Analytics gives us out of the box”. Against these, you can see ‘scope’. A scope of ‘1’ means it’s something recorded about the user, ‘2’ means it’s recorded about the current visit, ‘page’ means it’s just a piece of information about the page itself.
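To show how the ‘scope’ values split that block up, here is a small Python sketch that groups custom variables by scope, using the three scope values described above. The variable names and values in the example are made up; they are not BuzzFeed’s real field names.

```python
# Scope codes as described above: '1' = about the user,
# '2' = about the current visit, 'page' = about the page itself.
SCOPE_LABELS = {"1": "user", "2": "visit", "page": "page"}


def group_by_scope(custom_vars):
    """Group (name, value, scope) tuples into {scope label: {name: value}}."""
    grouped = {label: {} for label in SCOPE_LABELS.values()}
    for name, value, scope in custom_vars:
        label = SCOPE_LABELS.get(str(scope), "unknown")
        grouped.setdefault(label, {})[name] = value
    return grouped


# Hypothetical custom variables, loosely modelled on the list in this post.
custom_vars = [
    ("facebook_connected", "no", "1"),
    ("logged_in", "yes", "2"),
    ("country", "uk", "2"),
    ("page_type", "quiz", "page"),
]

grouped = group_by_scope(custom_vars)
print(grouped["user"])
print(grouped["visit"])
print(grouped["page"])
```

Anything in the ‘user’ group follows you from visit to visit, which is why the scope matters more than it first appears.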

There you can see other info they’re tracking, including:

  • Have you connected Facebook with BuzzFeed?
  • Do you have email updates enabled?
  • Do they know your gender & age?
  • How many times have you shared their content directly to Facebook & Twitter & via Email?
  • Are you logged in?
  • Which country are you in?
  • Are you a BuzzFeed editor?
  • …and about 25 other pieces of information.

Within this you can also see it records ‘username’. I think that’s recording my user status, and an encoded version of my username. If I log in using 2 different browsers right now, it assigns me that same username string, but I’m going to caveat that I’m not 100% sure they’re recording that it is ‘me’ browsing the site (ie. that they’re able to link the data they’re recording in Google Analytics about my activity on the site back to my email address and other personally identifiable information). Either way, everything we’ve covered so far is quite mundane.

The Scary Bit

The scary bit occurs when you think about certain types of BuzzFeed content; most specifically: quizzes. Most quizzes are extremely benign – the stereotypical “Which [currently popular fictional TV show] Character Are You?” for example. But some of their quizzes are very specific, and very personal.

Here, for example, is a set of questions from a “How Privileged are You?” quiz, which has had 2,057,419 views at the time I write this. I’ve picked some of the questions that may cause you to think “actually, I wouldn’t necessarily want anyone recording my answers here”.

When you click any of those quiz answers, BuzzFeed record all of the mundane information we looked at earlier, plus they also record this:

Here’s what they’re recording there:

  • ‘event’ simply means something happened that BuzzFeed chose to record in Google Analytics.
  • ‘Buzz:content’ is how they’ve categorised the type of event.
  • ‘clickab:quiz-answer’ means that the event was a quiz answer.
  • ‘ad_unit_design3:desktopcontrol’ seems to be their definition of the design of the quiz answer that was clicked.
  • ‘ol:1218987’ is the quiz ID. In other words, if they wish, they could say “show me all the data for quiz 1218987” knowing that’s the ‘Check Your Privilege’ quiz.
  • ‘1219024’ is the actual answer I checked. Each quiz answer on BuzzFeed has a unique ID like this. Ie. if you click “I have never had an eating disorder” they record that click.
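Here is a minimal Python sketch of turning those six pieces into a structured record. It assumes the event arrives as a list of strings in the order listed above; the real wire format may well differ, and `parse_quiz_event` is a name I’ve made up.

```python
def parse_quiz_event(fields):
    """Turn the six recorded pieces of a quiz click into a dictionary.

    Assumes the order: hit type, category, action, design, quiz, answer.
    """
    hit_type, category, action, design, quiz, answer = fields
    return {
        "hit_type": hit_type,
        "category": category,
        # Keep only the part after the first colon, e.g. 'quiz-answer'.
        "action": action.split(":", 1)[1],
        "design": design.split(":", 1)[1],
        "quiz_id": int(quiz.split(":", 1)[1]),
        "answer_id": int(answer),
    }


# The values listed above, in the assumed order.
fields = [
    "event",
    "Buzz:content",
    "clickab:quiz-answer",
    "ad_unit_design3:desktopcontrol",
    "ol:1218987",
    "1219024",
]

record = parse_quiz_event(fields)
print(record["quiz_id"], record["answer_id"])
```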

In other words, if I had access to the BuzzFeed Google Analytics data, I could query data for people who got to the end of the quiz & indicated – by not checking that particular answer – that they have had an eating disorder. Or that they have tried to change their gender. Or I could run a query along the following lines if I wished:

  • Show me all the data for anyone who answered the “Check Your Privilege” quiz but did not check “I have never taken medication for my mental health”.
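To show how simple that kind of query is, here is a Python sketch over a made-up list of recorded clicks. The visitor IDs and the answer ID 1219999 are invented, and ‘answered the quiz’ is approximated as ‘clicked at least one answer’ rather than ‘reached the end’. I’m not suggesting BuzzFeed run anything like this, only that the recorded data would allow it.

```python
def answered_but_did_not_check(events, quiz_id, answer_id):
    """Visitors who clicked something in the quiz but never clicked answer_id.

    events: iterable of (visitor_id, quiz_id, answer_id) tuples.
    """
    took_quiz = set()
    checked = set()
    for visitor, event_quiz, event_answer in events:
        if event_quiz != quiz_id:
            continue  # click belongs to a different quiz
        took_quiz.add(visitor)
        if event_answer == answer_id:
            checked.add(visitor)
    return took_quiz - checked


QUIZ_ID = 1218987     # quiz ID shown earlier in this post
ANSWER_ID = 1219999   # hypothetical ID for one particular answer

# Invented click log: (visitor, quiz, answer).
events = [
    ("visitor_1", QUIZ_ID, 1219024),
    ("visitor_1", QUIZ_ID, ANSWER_ID),
    ("visitor_2", QUIZ_ID, 1219024),
    ("visitor_3", 555, ANSWER_ID),
]

print(answered_but_did_not_check(events, QUIZ_ID, ANSWER_ID))
```

The sensitive part is the set difference at the end: the absence of a click is itself a piece of information.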

In BuzzFeed’s defense, I’m sure when they set up the tracking in the first place they didn’t foresee that they’d be recording data from quizzes of this personal depth. This is just a single example, but I suspect this particular quiz would have had less than 2 million views if everyone completing it realised every click was being recorded & could potentially be reported on later – whether that data is fully identifiable back to individual users, or pseudonymous, or even totally anonymous.

What do you think?

The Mirror’s Crying Child Photo – Not All That it Seems

Here’s the front cover of the Daily Mirror. A haunting image of a starving British child, crying their eyes out.

Only… the child is from the Bay Area, and the photo was purchased from Flickr via Getty Images…


Here’s the source of the original image: (Here’s a happier one taken the following day: Apparently she was crying over an earthworm.)

An excellent photo, taken by the excellent Lauren Rosenbaum in November 2009, shared on a US website (Flickr), sold by an American photo agency (Getty Images), used to illustrate poverty in Britain.

  • Does it matter that the photo is not really a starving child?
  • Does it matter that the photo wasn’t even taken in the UK?
  • Is there an ethical issue in buying a stock photo of a child – not in poverty – and using it to illustrate poverty?
  • Does it matter that the headline begins “Britain, 2014”, but the photo is actually “USA, 2009”?

I’m not sure of the answers to any of the above, but they’re interesting to think about.

What do you think?


Twitter Is Telling Google Not to Follow Your Links

Over the last couple of years, Twitter silently changed the way they treat any links you include in tweets. In doing so, they have given themselves a very nice competitive advantage in lots of areas, but they’ve also silently taken away the ability for search engines to follow the links you post to Twitter.

Here’s what Twitter changed:

  • In the past, clicking a link within Twitter took you directly to the destination.
  • Today, any link you click within Twitter first takes you invisibly to Twitter’s ‘’ URL redirect. Once there, Twitter record various information about the click, before taking you on to your destination. All of this takes a tiny fraction of a second.

For example, clicking this link: will take you first to ‘’, where Twitter will record the fact that you clicked it, and then you’ll be moved on to the destination URL (in that case, a previous blog post I wrote).

This is a very clever, simple way of allowing Twitter to gather piles of data on which links are most popular, who shares them, who clicks them, etc. As an illustration of how big this is, as a result of this Alexa treats ‘’ as the 66th most popular website in the world.
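The mechanics of a click-recording redirect are simple enough to sketch in a few lines of Python. This is a generic illustration, not Twitter’s implementation: the `ClickTracker` class, the short code, the user names and the choice of a 301 status are all assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


class ClickTracker:
    """Toy link shortener that logs every click before redirecting."""

    def __init__(self, links):
        self.links = dict(links)  # short code -> destination URL
        self.clicks = []          # one entry per recorded click

    def resolve(self, code, user_id):
        """Record the click, then return (status, destination) like a redirect."""
        destination = self.links[code]
        self.clicks.append(
            {"code": code, "user": user_id, "at": datetime.now(timezone.utc)}
        )
        return 301, destination  # 301 is an assumed status code

    def most_clicked(self):
        """Short codes ordered by number of recorded clicks."""
        return Counter(click["code"] for click in self.clicks).most_common()


# Hypothetical short code and destination.
tracker = ClickTracker({"abc123": "https://example.com/blog-post"})
status, destination = tracker.resolve("abc123", user_id="reader_1")
tracker.resolve("abc123", user_id="reader_2")

print(status, destination)
print(tracker.most_clicked())
```

Every click passes through `resolve`, so whoever runs the redirect ends up with a complete log of who clicked what and when, without the destination site being involved at all.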

The Oddity

The oddity here is this – the robots.txt file Twitter have created to tell all search engines what they can/cannot do with links (


Roughly translated into English, the first 2 lines there say:

  • “TwitterBot, there is nothing you are disallowed from crawling.” (ie. Twitterbot is allowed to crawl everything)

The second block of 2 lines says:

  • “All other bots: You are disallowed from crawling anything.” (ie. Unless you’re “Twitterbot”, you are not allowed to crawl anything at all on
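You can check that reading with Python’s standard `urllib.robotparser` module. The rules below are my reconstruction from the two descriptions above rather than a copy of Twitter’s actual file, and `example.com` stands in for the redirect domain.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the description above; not a copy of the real file.
ROBOTS_TXT = """\
User-agent: twitterbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL standing in for a shortened link.
link = "https://example.com/abc123"

twitterbot_allowed = parser.can_fetch("Twitterbot", link)
googlebot_allowed = parser.can_fetch("Googlebot", link)

print("Twitterbot allowed:", twitterbot_allowed)
print("Googlebot allowed: ", googlebot_allowed)
```

An empty `Disallow:` line means nothing is off limits for that bot, while `Disallow: /` blocks every path, so only Twitterbot gets through.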

Twitter could make this information available in other ways – for example via their API – but they famously cut off Google from full access to this.

So What?

This is sensible from Twitter’s point of view, as it means they don’t have Google and other search engines crawling every URL posted to Twitter, eating their bandwidth.

But from a website owner’s point of view, and a user point of view, it means that Twitter have blocked Google (and any other search engine) from following the links you post to Twitter.

The Hypocrisy of Big News Sites on State Surveillance in Seven Images

Every large news site is preaching about the NSA PRISM programme, and Obama’s apparent hypocrisy in monitoring his citizens.

What none of them mention explicitly is that they themselves use hundreds of technologies to track their readers both on their own sites, and as their readers move around the web.

Here are 6 images showing some of the tracking technologies on big news sites, plus 1 comparison chart of 68 technologies used across 10 large news sites. Note the ironic headlines on a few of these articles.

The Wall Street Journal

The WSJ says ‘US Collects Vast Data Trove’. Take a look at the 44 tracking technologies used on that page alone:


Washington Post

The Washington Post talks about ‘sweeping surveillance’ on a page with 19 tracking technologies.



Admittedly this is an old Cnet article, but take a look at their 20+ tracking technologies:


The Atlantic

The Atlantic often publish articles on privacy. Virtually their entire front page is devoted to the NSA PRISM programme at present. They themselves use a whole host of tracking tools, both directly & via their many social plugins.



No hypocrisy between the headline & the tracking technologies used by Om Malik, but interesting nonetheless.


The New York Times

And double-irony from the NYT here. Take a look at the ad that’s automatically displayed. ‘2 friends are spying on you’, while the page itself has 17 tracking tools recording data about you.


Comparison of 68 Technologies Used by UK News Sites:

Finally, here’s a comparison I put together for an Econsultancy article (who use 13 technologies themselves) covering this:

News Sites Combined

The tools used for most of this were the excellent Ghostery, and Google Chrome’s Developer Tools.
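If you want a very rough version of this without any plugins, the Python sketch below lists the third-party hosts that a page’s HTML loads scripts, images and iframes from. It’s a much cruder proxy than Ghostery, which does considerably more than compare hostnames, and the HTML and hostnames here are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ThirdPartyHosts(HTMLParser):
    """Collect hosts of script/img/iframe sources that are not the page's own."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "img", "iframe"):
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).hostname  # None for relative URLs
        if not host:
            return
        is_own = host == self.page_host or host.endswith("." + self.page_host)
        if not is_own:
            self.hosts.add(host)


def third_party_hosts(html, page_host):
    """Return a sorted list of third-party hosts referenced by the HTML."""
    collector = ThirdPartyHosts(page_host)
    collector.feed(html)
    return sorted(collector.hosts)


# Invented page source for a hypothetical news site.
SAMPLE_HTML = """
<html><body>
<script src="/static/app.js"></script>
<script src="https://ads.example.net/t.js"></script>
<img src="//stats.example.org/pixel.gif" />
<iframe src="https://cdn.news.example.com/widget"></iframe>
</body></html>
"""

found = third_party_hosts(SAMPLE_HTML, "news.example.com")
print(found)
```

Bear in mind many trackers are injected by other scripts after the page loads, so a static count like this will understate what a browser plugin sees.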

Do share this with others if you have the chance. Outside of tech circles, I’m not sure many people realise quite how much of this is going on.

Goldman Sachs, Bloomberg, and Data Literacy

The biggest finance/data story of the month is that “Bloomberg snooped on Goldman Sachs”. Here is one of the dozens (thousands) of articles covering it:

What’s the fuss about?

This is the summary of the story:

  1. Most banks & financial institutions use Bloomberg systems to gather information about financial markets.
  2. Bloomberg record data on who accesses those systems, when they do it, and what they do.
  3. Bloomberg’s journalists were using that information, and analysis of how their terminals were being used, as the basis of news articles.
  4. Goldman figured this out, and confronted Bloomberg accusing them of snooping.

Gawker (very foolishly in my opinion) say this about it:

“The whole thing sounds like the News of the World scandal, except if the targets were paying Rupert Murdoch $20,000 for the privilege.”

Here’s the irony:

What is Goldman Sachs’ advice on how companies should use data?

In October of last year, Goldman Sachs themselves were crowing that ‘data’ was the biggest opportunity for companies.

Their co-head of Internet Investment Banking at the time put out a series of videos covering this. Here was his (paraphrased by venturebeat) advice on what companies needed in order to harness this opportunity:

  1. Access to proprietary data,
  2. Wherewithal/knowledge of what to do with it/how to process it, and
  3. The right relationship with the consumer in order to apply the data.

Think through the 3 of those, and compare that to what Bloomberg did.

Of course, there are enormous marketing & trust implications with using & exposing customer data in the way Bloomberg did, but it’s madness (verging on ‘data illiterate’) that Goldman Sachs would simply assume that zero analysis was taking place on how their staff were using Bloomberg terminals, especially so as both Goldman & Bloomberg are in the business of data and analysis. And even more so again because Bloomberg’s contractual terms allowed them to capture and analyse the data.