With the explosion of ChatGPT in 2023, it has been well documented that major players in journalism were blocking OpenAI’s crawlers from indexing their web content.
With Google transitioning to an artificial intelligence (AI) first company, Google’s Search Generative Experience (SGE) experiment poses one key question: how much traffic will Google send to websites once its AI embed goes live? If Google’s search generative results reduce traffic to news websites, publishers could be forced to block Googlebot from crawling their articles.
To understand what the fallout could be when Google’s AI embed goes live in the news space, we tracked the top 38 Google News results hourly between 15th January 2024 and 17th January 2024, using the search phrase ‘uk news’.
Findings
- In January 2024, 75% of Google News results in the top 3 positions blocked GPTBot.
- In August 2024, 89% of Google News results in those top 3 positions blocked GPTBot.
- Al Jazeera and The New Scientist now both block GPTBot.
- Notable websites that ranked highly in the Google News results and blocked GPTBot were bbc.co.uk, bbc.com, sky.com, theguardian.com, bloomberg.com and telegraph.co.uk.
- Out of 1,366 Google News results, 945 (69%) were articles on domains that blocked GPTBot (OpenAI’s web crawler).
- Out of the sample of Google News results analysed, only 12% blocked Google Bard (via the ‘Google-Extended’ control).
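For readers who want to run a similar check, here is a minimal Python sketch of how a robots.txt file can be tested for a GPTBot or Google-Extended block. It uses the standard library's `urllib.robotparser`. The sample robots.txt text, the `blocks_agent` helper and the rule that "blocked" means the site root is disallowed are all assumptions made for illustration, not necessarily the method used in this analysis.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, in the style used by publishers that
# block OpenAI's crawler but leave other crawlers alone.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocks_agent(robots_txt: str, user_agent: str,
                 url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt text disallows `user_agent` from `url`.

    Assumption: a site counts as "blocking" a crawler when the crawler may
    not fetch the site root.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        print(agent, "blocked:", blocks_agent(SAMPLE_ROBOTS_TXT, agent))
```

Note that `urllib.robotparser` matches user agents with a simple case-insensitive substring test, so real crawlers may interpret unusual robots.txt files slightly differently.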
What percentage of unique domains in Google News block GPTBot from crawling their content?
Whilst there are some very high ranking, household-name UK news brands that disallow GPTBot, 64% of the unique domains in the Google News results don’t block GPTBot.
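The two headline figures measure different things: 945 of 1,366 results (69%) sat on domains that block GPTBot, yet only the remaining 36% of unique domains block it. The gap arises because a few blocking domains account for many results. The Python sketch below illustrates the two ways of counting. The domain names and data are hypothetical, chosen only to show the effect, and are not taken from the study.

```python
# Hypothetical tracked results: one (domain, blocks_gptbot) pair per result.
# One large brand that blocks GPTBot appears four times; two smaller sites
# that do not block appear once each.
SAMPLE_RESULTS = [("big-brand.example", True)] * 4 + [
    ("small-site-a.example", False),
    ("small-site-b.example", False),
]


def share_of_results_on_blocking_domains(results: list[tuple[str, bool]]) -> float:
    """Fraction of individual results whose domain blocks GPTBot."""
    blocked = sum(1 for _, blocks in results if blocks)
    return blocked / len(results)


def share_of_domains_blocking(results: list[tuple[str, bool]]) -> float:
    """Fraction of unique domains that block GPTBot (each domain counted once)."""
    verdict_by_domain = dict(results)  # duplicates collapse to one entry per domain
    return sum(verdict_by_domain.values()) / len(verdict_by_domain)


if __name__ == "__main__":
    print(f"Results on blocking domains: {share_of_results_on_blocking_domains(SAMPLE_RESULTS):.0%}")
    print(f"Unique domains blocking:     {share_of_domains_blocking(SAMPLE_RESULTS):.0%}")
```

With this toy data, two thirds of the results sit on a blocking domain even though only one in three unique domains blocks GPTBot.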
Which websites that block GPTBot rank highly in Google?
BBC, Sky, The Guardian, Bloomberg and the Telegraph all block GPTBot. Of those journalistic brands, the BBC, Sky and The Guardian had the highest number of #1 positions.
Domain | Number of position 1 phrases | Block GPTBot? |
bbc.co.uk | 43 | Yes |
sky.com | 38 | Yes |
theguardian.com | 19 | Yes |
bbc.com | 7 | Yes |
bloomberg.com | 2 | Yes |
telegraph.co.uk | 2 | Yes |
aljazeera.com | 18 | Yes (changed since Jan 2024) |
independent.co.uk | 10 | No |
www.gov.uk | 6 | No |
newscientist.com | 1 | Yes (changed since Jan 2024) |
What website categories rank highly in Google News?
Predictably, the vast majority (79%) of results from the Google News research are articles from the ‘UK News’ category. 15% of the domains in the Google News analysis didn’t have a robots.txt file to allow or disallow specific crawlers; the ‘Sports’ category, for instance, contained 2 domains, neither of which had a live robots.txt file.
Why did we run this analysis?
Google’s search generative results signal a paradigm shift in how content is surfaced, and all eyes in the journalism space will be on how Google cites publishers’ content. If content is served in an AI embed and the new UI causes significant traffic loss, there could be a backlash in the industry.
What are the potential downsides of blocking ChatGPT?
Blocking ChatGPT and similar AI tools might limit the traffic that such tools could unlock. In digital marketing, many people have assumed that ChatGPT ‘shouldn’t be used as a search engine’. Yet in recent years, marketplaces like Amazon and social media platforms like TikTok have disrupted the search engine space for informational / transactional content discovery. According to a recent GWI survey, 20% of internet users use AI tools to find information online.
An obvious downside from OpenAI’s perspective is that ChatGPT cannot be trained on text from an influential, authoritative news website like the BBC. Bing.com has offered the AI-powered ‘Bing Chat’ thanks to Microsoft’s close relationship with OpenAI, but this has not moved the dial on Bing’s share of the search engine market. According to Statcounter, Bing’s search engine market share languishes at only 3.37%, up from 3.03% in December 2022.
If users continue to search for web content via ChatGPT, there could be an opportunity for OpenAI to funnel more downstream traffic to the web, giving Microsoft a means of eroding Google’s massive search engine market share. ‘ChatGPT’ has over 4 times the brand search volume that ‘Bing’ does, so it’s not inconceivable that in future years it could drive more value or even volume in terms of locating the best web content for its users. For instance, consider that ‘Browse with Bing’ has only been live since September 2023, providing ChatGPT users with a means of getting around the issue of GPTBot being blocked via the robots.txt file.
Can websites completely abstain from appearing in Google’s Search Generative Experience (SGE)?
In September 2023, Google-Extended was launched to allow webmasters to opt in to / opt out of inclusion in Bard & Vertex AI. However, SGE is classed as part of the Google search experience, so to exclude your website from SGE (albeit at ‘experiment’ stage) you currently need to remove your web content from Google Search entirely by blocking Googlebot via the robots.txt file.
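To make the distinction concrete, the Python sketch below compares two hypothetical robots.txt files: one that opts out of Bard and Vertex AI through the Google-Extended token, and one that blocks Googlebot outright. Both files and the `is_blocked` helper are illustrative assumptions, and Python's `urllib.robotparser` only approximates how Google's own crawlers read these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file 1: opt out of Bard / Vertex AI only.
OPT_OUT_OF_GOOGLE_EXTENDED = """\
User-agent: Google-Extended
Disallow: /
"""

# Hypothetical file 2: block Googlebot, which also withdraws the site
# from Google Search itself.
BLOCK_GOOGLEBOT = """\
User-agent: Googlebot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str) -> bool:
    """Return True if `user_agent` may not fetch the site root (assumed test URL)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "https://example.com/")


if __name__ == "__main__":
    for label, text in (("Google-Extended opt-out", OPT_OUT_OF_GOOGLE_EXTENDED),
                        ("Googlebot block", BLOCK_GOOGLEBOT)):
        print(label)
        for agent in ("Google-Extended", "Googlebot"):
            print("  ", agent, "blocked:", is_blocked(text, agent))
```

In the first file Googlebot can still crawl the site, which is why a Google-Extended opt-out alone does not keep content out of SGE.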
Conclusion
How people search for information is becoming more layered and fragmented than ever. Whilst the first stages of LLM adoption have raised concerns for journalistic websites, there are early signs that users will look to ChatGPT and other AI tools when conducting classic informational searches, as opposed to structured, detailed AI prompting. Innovation is on its way with partnerships like the OpenAI x Axel Springer deal that was announced in December 2023, so it will be interesting to see how bridges will be built between the companies producing high quality content and the AI organisations in coming years.