Google Analytics: Creating Automatic “Bot Alert” Emails

‘Bots’ often cause problems for website owners. Among other negative effects, they damage the accuracy of web analytics data, leading site owners to make faulty decisions.

This comes up at least once a month with clients, and I’d spotted the great @peter_oneill, @matt_4ps, & @danieljtruman talking about it on Twitter, so I thought I’d share this.

Here’s a quick Google Analytics ‘Custom Alert’ to help you spot some bots before they’ve caused lots of damage. It doesn’t solve the problem, but it helps flag when it may be happening, allowing you to delve in & investigate further, and to then filter out the traffic if it is indeed a problem.

The Problem

Here’s an example of what bot traffic looks like when isolated from the rest of the traffic on a site:

[Image: bot traffic isolated from the rest of the site’s traffic – a near-constant ~375 visits per day]

In that example, a single robot is sending around 375 visits per day. That’s not huge, but it adds up to roughly 11,000 visits a month, none of which convert and all of which bounce. That causes the following issues:

  • It totally skews our bounce stats (see the quick sums just below this list).
  • It skews our conversion stats too.
  • Surrounding metrics like ‘per visit value’ and ‘% new visits’ become misleading.
  • Stats by region, browser, etc. are often messed up, as bots tend to favour one particular region or browser.
  • It does all of the above unpredictably, and it is time-consuming to hunt down.
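
To make the skew concrete, here’s a quick back-of-envelope sum. The ‘human’ figures below are made up purely for illustration; only the ~11,000 bot visits come from the example above:

human_visits = 50000        # made-up 'real' monthly traffic
human_bounce_rate = 0.40    # made-up 'real' bounce rate
bot_visits = 11000          # ~375/day, as in the screenshot above

# Bots bounce 100% of the time, so every bot visit adds a bounce.
bounces = human_visits * human_bounce_rate + bot_visits
reported = bounces / (human_visits + bot_visits)
print("Real bounce rate: %.0f%%, reported: %.0f%%"
      % (human_bounce_rate * 100, reported * 100))
# Real bounce rate: 40%, reported: 51%

A 40% bounce rate being reported as 51% is easily enough to send you chasing the wrong problem.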

The Solution: How to Alert Yourself When This Is Happening

One way to alert yourself when this may be happening is as follows:

  1. Set up an ‘advanced segment’ to spot ‘new, direct, bouncing’ traffic.
  2. Set up an alert so that if that ‘new, direct, bouncing’ traffic ever increases massively, Google Analytics sends you an email.

Here are those two steps in detail:

Step 1: Create An Advanced Segment

Usually, but not always, bots follow this pattern:

  • They visit the site ‘direct’.
  • They don’t store cookies, so are identified as ‘New’ visitors.
  • They record a 100% bounce rate.

Not all of that traffic will be bots, but if it jumps considerably, it’s far more likely to be a bot than if any other type of traffic jumps.

In order to isolate that traffic, we’ll set up an advanced segment:

[Image: the advanced segment settings – visits that are direct, from new visitors, and bounced]

(If you’re lazy, you can simply click this link to create the above: http://bit.ly/maybebots)
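
If you’d rather pull that same slice of traffic out programmatically, the segment can also be expressed as a ‘dynamic’ segment against the Core Reporting API (v3). Here’s a minimal sketch in Python – it assumes you already have an authorised ‘service’ object from google-api-python-client, and ‘ga:12345678’ is a placeholder view ID:

# Query 'new, direct' visits day by day via the Core Reporting API (v3).
result = service.data().ga().get(
    ids='ga:12345678',                  # placeholder: your view (profile) ID
    start_date='30daysAgo',
    end_date='today',
    metrics='ga:visits,ga:bounces',     # bounces vs. visits gives the bouncing part
    dimensions='ga:date',
    # Dynamic segment: direct traffic (medium '(none)') from new visitors.
    segment='dynamic::ga:medium==(none);ga:visitorType==New Visitor',
).execute()

for date, visits, bounces in result.get('rows', []):
    print(date, visits, bounces)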

Step 2: Set up the Alert

Following that, set up an alert to fire off an email when traffic from your new ‘may be a bot’ segment leaps:

To set that up, go to the following in Google Analytics: Intelligence Events > Overview > Custom Alerts > Manage Custom Alerts > Create new alert.

Once there, copy the following:

[Image: the custom alert settings – alert when ‘new, direct, bouncing’ visits on any day are more than 100% above the same day in the previous week]

That essentially says: “Please pay attention only to direct traffic that is new to the site and views only one page. If, on any given day, it’s more than double what it was on the same day last week, send me an email.”

Depending on your site’s traffic pattern, you may want to increase/decrease that ‘100%’ value.
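
If you ever need to replicate that rule outside of GA, or just sanity-check it, it’s trivial to script. A rough sketch in Python – wiring it up to your data source & email is left out:

# The alert rule: flag when today's 'maybe bots' visits more than double
# versus the same day last week.
def should_alert(visits_today, visits_same_day_last_week, threshold_pct=100.0):
    """True when the segment grew by more than threshold_pct percent."""
    if visits_same_day_last_week == 0:
        return visits_today > 0    # any traffic from nothing is suspicious
    growth = visits_today - visits_same_day_last_week
    return 100.0 * growth / visits_same_day_last_week > threshold_pct

print(should_alert(800, 375))    # True: more than doubled – send that email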

It’s not perfect, and it won’t catch everything, but it’s better than not spotting anything at all.

Summary

That’s it. Set that live & – when the ‘new, direct, bouncing’ traffic that bots often cause doubles – you’ll get an email telling you. From there you can investigate further & filter it out of your Google Analytics data if it is indeed a bot.

Do post any thoughts you have on this, or any other solutions.

10 Replies to “Google Analytics: Creating Automatic “Bot Alert” Emails”

  1. Nice post, Dan! Now that I’ve taken a look at our ‘Maybebots’, it’s looking like we’ve got a problem too! There’s a GA exclude filter that was shared in one of the MeasureCamp sessions to filter them out –
    Visitor ISP organisation: ^(google inc.|yahoo! inc.|iac search and media europe ltd|iac search media inc|inktomi corporation|site confidence test agent servers|site ?confidence|global crossing|apache ltd.|nielsen netratings|meebo inc.|stumbleupon inc.|taptu limited)$

    That works in a custom report using ISP as the filter – courtesy of Ravi Sodha @ravisodha
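
    A quick way to sanity-check a pattern like that before trusting it in a filter is to run it over a few sample ISP strings. A sketch using Python’s re module (IGNORECASE mirrors GA’s usual case-insensitive matching):

    import re

    # The exclude pattern from above, tested against sample ISP organisations.
    pattern = re.compile(
        r"^(google inc.|yahoo! inc.|iac search and media europe ltd"
        r"|iac search media inc|inktomi corporation"
        r"|site confidence test agent servers|site ?confidence"
        r"|global crossing|apache ltd.|nielsen netratings|meebo inc."
        r"|stumbleupon inc.|taptu limited)$",
        re.IGNORECASE,
    )

    for isp in ["Google Inc.", "siteconfidence", "BT Broadband"]:
        print(isp, "->", "excluded" if pattern.match(isp) else "kept")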

    1. Thanks, Chris! Phil has shared a good set of lists below too.

      Lots of the ‘maybebots’ traffic will be fine – real humans. If you look at (for example) the ‘cities’ report though, you can often see pockets of suspiciously regular traffic from a particular city where there shouldn’t be any.

      The purpose of ‘maybebots’ is really to watch out for when it spikes so that you can dig into it a bit.

      See you soon!!

      dan

  2. Most bots will not execute JavaScript – which means that they won’t even be tracked by Google Analytics. Any bots that do execute JavaScript will be tracked within GA, but these are likely to be few and far between.

    Tracking bots is notoriously difficult (I’ve looked at many, many methods and solutions in the past) and ultimately, there is no easy way to tell bots from humans.

    Tracking these is often more effort than it is worth, unless they are causing serious issues with server load etc.

    Building bots/scrapers is extremely simple and allows you to gather all kinds of things quickly and easily.

    1. Hi, Mick – in this case the purpose is to flag up some of the bots that *do* appear in Google Analytics (& therefore cause data/analysis issues).

      You say ‘these are likely to be few and far between’. Sadly they’re fairly common. The example you see in the post is a real one – one of about half a dozen I’ve come across this year so far. They’re often site speed monitors, affiliate scrapers, competitor pricing monitoring tools, etc. It’s useful to be able to spot these without having to manually dig through the data every few days.

      Thanks for the comment!

      dan

  3. Hi Dan,

    Here is a list to check or exclude for robot visits in GA…

    1a. Robots – list1
    ^(inktomi corporation|iac search.*|yahoo! inc.|facebook inc.|stumbleupon inc.|dub6 ec2|site confidence.*|apache ltd.|nielsen netratings|affinity internet inc|Amazon (A9|Web|Data|Tech).*|microsoft corp)$

    1b. Robots – list2
    ^(compuware corporation|global crossing|psinet uk dedicated hosting|cable & wireless telecommunication services gmbh|cable & wireless uk p.u.c.|ftip003(110010|235904) crosspoint colocation ltd)$

    2. Browser Whitelist method:
    ^(Internet Explorer|Firefox|Chrome|Safari|Safari \(in-app\)|Opera|Opera Mini|Android Browser|Apple Browser)$

    You can also try using a robots.txt as well…
    #####################################
    # http://www.yourdomain.com/robots.txt whitelist method
    #####################################

    # CPM: AdSense banner and contextual bots
    User-agent: Mediapartners-Google* # support.google.com/adsense/bin/answer.py?hl=en&answer=99376
    User-agent: Mediapartners-Google # Adsense contextual targeting bot
    User-agent: msnbot-media/1.0 # MSN contextual targeting bot
    Disallow:
    Allow: /

    # PPC: Adwords & AdCenter landing page bots
    User-agent: Adsbot-Google # AdwordsPPC tinyurl.com/list-of-Google-Crawlers
    User-agent: AdsBot-Google-Mobile # AdwordsPPC for mobile campaigns
    User-agent: Adidxbot # BingPPC (aka MSNPTC) http://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
    Disallow:
    Allow: /

    # SEO: Organic Crawlers
    User-agent: Googlebot # support.google.com/webmasters/bin/answer.py?hl=en&answer=1061943
    User-agent: Googlebot-Mobile
    User-agent: Googlebot-News
    User-agent: Googlebot-Video
    User-agent: Googlebot-Image
    User-agent: gsa-crawler # Google-Appliance tinyurl.com/google-sitesearch-appliance
    User-agent: bingbot # aka MSNBOT http://www.bing.com/bingbot.htm
    User-agent: MSNBot # old version of bingbot still used for multimedia and feeds crawls
    User-agent: BingPreview # bing.com/community/site_blogs/b/webmaster/archive/2012/10/26/page-snapshots-in-bing-windows-8-app-to-bring-new-crawl-traffic-to-sites.aspx
    User-agent: Slurp # aka Yahoo! Slurp – now part of Bingbot
    User-agent: Teoma # Ask.com
    User-agent: Baiduspider # Chinese search engine
    User-agent: Yandex # Russian search engine
    User-agent: naverbot # South Korean search engine
    User-agent: seznambot # Czech search engine
    User-agent: rogerbot # http://www.seomoz.org/dp/rogerbot
    User-agent: MJ12bot # Majestic SEO link index
    User-agent: AhrefsBot # Ahrefs SEO link index
    User-agent: Blekkobot # Blekko
    User-agent: xenu's # Link Sleuth
    User-agent: Xenu's Link Sleuth 1.1c
    User-agent: ia_archiver # archive.org
    User-agent: ScoutJet # help.blekko.com/index.php/can-i-submit-my-site-to-be-crawled/
    User-agent: Feedfetcher-Google # Feedburner http://www.google.com/feedfetcher.html
    User-agent: facebookexternalhit # http://www.facebook.com/externalhit_uatext.php
    User-agent: Twitterbot
    User-agent: LinkedInBot
    User-agent: bitlybot
    User-agent: Pinterest
    Crawl-delay: 20 # Reduce Server-Load using 20sec crawl delay
    Disallow:
    Allow: /

    # Server-Monitoring: BLACKLIST SiteConfidence & Gomez just in case they ignore user-agent: *
    User-agent: SiteCon # siteconfidence.com/services/load-testing/website.aspx
    User-agent: GomezAgent # gomeznetworks.com/help/gpn/MySettings/Last_Mile_PP_Test_Settings.htm
    User-agent: GomezAgent 1.0
    User-agent: GomezAgent 2.0
    User-agent: GomezAgent 3.0
    User-agent: YottaaMonitor
    User-agent: FunWebProducts
    Disallow: / #BLOCK

    # All undefined bots – Disallow website Crawl.
    User-agent: *
    Disallow: / #BLOCK

    Sitemap: http://www.yourdomain.com/sitemap.xml #ENTER your-domain here
    ################
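
    You can check how a whitelist file like this treats a given bot with Python’s built-in robotparser (a sketch – it assumes the file above is saved locally as robots.txt, and of course it only affects bots that honour robots.txt in the first place):

    from urllib import robotparser

    # Parse the local whitelist file and test a couple of user agents.
    rp = robotparser.RobotFileParser()
    with open("robots.txt") as fh:
        rp.parse(fh.read().splitlines())

    print(rp.can_fetch("Googlebot", "/some-page"))      # True: whitelisted
    print(rp.can_fetch("RandomScraper", "/some-page"))  # False: blocked by '*'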

    Thanks

    Phil.

  4. Hi Dan,

    This is one to bookmark and share. Thanks for writing. I have one thought to add. You said

    “They don’t store cookies, so are identified as ‘New’ visitors.”

    This could be key. I say could because I don’t know enough about bots to understand how they handle cookies. Do you mean they are not set at all (but then the GA code relies on setting a first-party cookie), or is the cookie dropped at every subsequent request?

    One thing I’m doing on one site is to extract the random visitor ID from the _utma cookie and set it as a visitor-level custom variable. This happens as soon as the GA code runs and the cookie is set, so every unique visitor should have it set.

    However, I’ve noticed a small number of visitors do not have this custom variable set, and they all seem to be from microsoft corp with (nearly) 100% bounce rate. If this traffic is indeed bot in origin, maybe setting visitor-level custom vars and segmenting out visitors without them holds the key. It would be good to get someone who knows how bots work to say whether this would work.
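
    For anyone wanting to try the same: the _utma value is dot-delimited and the random visitor ID sits in the second field, so extracting it is straightforward. A quick sketch in Python (the cookie value is a made-up example):

    # _utma format: domainhash.visitorId.firstVisit.previousVisit.currentVisit.sessions
    def utma_visitor_id(utma_value):
        parts = utma_value.split(".")
        if len(parts) != 6:
            raise ValueError("not a well-formed _utma value")
        return parts[1]    # the random visitor ID

    print(utma_visitor_id("173272373.1145678901.1356000000.1356100000.1356200000.3"))
    # -> 1145678901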

    I’m a bit wary about excluding based on browser for example as you may accidentally exclude valid traffic as well (I use Dolphin on my mobile devices, for example) but great points from Phil on investigating whether it is a problem!

  5. I have been looking at adding a custom alert based on the advanced segment you provided; however, I can’t seem to find it in the alert conditions drop-down menu. We have had some serious trouble with bots this month and I wanted to make sure we were alerted if something similar happened again.

    I have actually started using device detection and some PHP to understand if a visit is from a crawler/bot/scraper and then stop it from executing the GA JavaScript. This should solve our bot problem (along with some honeypot IP bans) in the future, but I need to make sure with some custom alerts.
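
    The server-side check is simple in outline. I’m doing it in PHP, but sketched in Python the idea is roughly this (the user-agent pattern is illustrative only – the real thing uses a proper device-detection library):

    import re

    # Skip the GA JavaScript for anything that looks like a crawler/bot.
    BOT_UA = re.compile(r"(bot|crawler|spider|slurp|scraper|monitor)", re.IGNORECASE)

    def should_emit_ga_snippet(user_agent):
        """Only render the GA tracking code for (apparently) human visitors."""
        return not BOT_UA.search(user_agent or "")

    print(should_emit_ga_snippet("Mozilla/5.0 (Windows NT 6.1)"))                     # True
    print(should_emit_ga_snippet("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # False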

    Is there a new method for adding your “Direct New Bouncers” advanced segment to the new custom alerts?

    Thanks in advance,

    Joe!

  6. Hi Dan,
    Late comment on this subject, but I’m adding it now because I’m seeing a similar problem with one of my clients and Site Confidence.

    I have noticed that, beyond bounce rate (100%), visitor type (New) and traffic source (Direct), Site Confidence also seems to emanate from Linux servers running a Mozilla (FF) user agent. So in this case I’d be inclined to add these two filters to the alert / profile view filter to weed out Site Con bots.

    What’s interesting in my case is that SiteCon is identified as ‘SiteCon Browser’ in the browser report, but the same user behaviour is exhibited by some traffic from the Linux / FF combo, so it seems that SiteCon is coming in under different guises, meaning that filtering or setting an alert just by looking for SiteCon / Site Confidence wouldn’t be 100% effective.

    Not sure if the Linux / Mozilla combo also applies to other bots, but it seems possible that Linux would be a common server OS.

    Regards, Hugh
