Detecting AI bots with PHP

Sharing knowledge is good. But some companies do not give credit where credit is due, so people are creating ways to protect their content from leechers.

For this website I started looking into how to add something like that. I don’t want to block all bots. For example, I like that my website is indexed by the Internet Archive. Send them money if you can.

To tackle the AI problem, one of the organisers of my local PHP user group created a solution called VolkswAIgen.

It is all standards compliant and modern. Usually I don't write PHP code that way, because I mostly work with legacy stuff: the kind with no documentation, no tests, and a different coding style from every developer who touched it. Obviously with no budget, and a deadline that was yesterday.

Also, I mostly do frontend stuff these days.

So here is my quick and dirty solution. First time I used a cache pool in PHP 🙌 Change the ingredients to your own liking 😉

For WordPress you could try the DefAI plugin. It uses the same library.

Install dependencies

composer require league/flysystem-local matthiasmullie/scrapbook volkswaigen/volkswaigen

Add PHP code

// Add the Composer autoloader (if not already added)
require __DIR__ . '/../vendor/autoload.php';

// Wrap in an anonymous function to avoid polluting the global scope.
(function () {
  // Initialise variables first.
  $adapter = new \League\Flysystem\Local\LocalFilesystemAdapter(__DIR__ . '/../cache', null, LOCK_EX);
  $filesystem = new \League\Flysystem\Filesystem($adapter);
  $cache = new \MatthiasMullie\Scrapbook\Adapters\Flysystem($filesystem);
  $cachePool = new \MatthiasMullie\Scrapbook\Psr6\Pool($cache);
  $volkswaigen = new \VolkswAIgen\VolkswAIgen\Main(
    new \VolkswAIgen\VolkswAIgen\ListFetcher($cachePool)
  );

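  // Fall back to empty strings when the server does not provide a value.
  $userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
  $ipAddress = $_SERVER['REMOTE_ADDR'] ?? '';
  // Behind a proxy, the client address arrives in X-Forwarded-For. The header
  // can hold a comma-separated list; the first entry is the original client.
  // Clients can forge this header, so only rely on it behind a proxy you trust.
  if (!empty($_SERVER['HTTP_X_FORWARDED_FOR'])) {
    $ipAddress = trim(explode(',', $_SERVER['HTTP_X_FORWARDED_FOR'])[0]);
  }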

  // Is it a bot?
  if ($volkswaigen->isAiBot($userAgent, $ipAddress)) {
    // It's AI! Feed them the good stuff.
    \http_response_code(418);
    echo '<h1>I’m a teapot</h1>';
    exit;
  }
})();
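
A note on the IP lookup: the X-Forwarded-For header can carry several comma-separated addresses, and any client can send it. Here is a minimal sketch of a stricter lookup, in case you want one. The helper name `resolveClientIp` and its `$trustProxy` flag are my own invention for illustration, not part of VolkswAIgen. It assumes the first entry in the header is the original client, which is the usual convention but depends on your proxy setup.

```php
<?php

/**
 * Hypothetical helper (not part of VolkswAIgen): pick the client IP
 * from a $_SERVER-like array.
 */
function resolveClientIp(array $server, bool $trustProxy = false): string
{
    $remoteAddr = $server['REMOTE_ADDR'] ?? '';

    // Only look at the forwarded header when a trusted proxy sets it.
    if ($trustProxy && !empty($server['HTTP_X_FORWARDED_FOR'])) {
        // The header may hold "client, proxy1, proxy2"; take the first entry.
        $candidate = trim(explode(',', $server['HTTP_X_FORWARDED_FOR'])[0]);

        // Accept the candidate only if it is a valid IPv4 or IPv6 address.
        if (filter_var($candidate, FILTER_VALIDATE_IP) !== false) {
            return $candidate;
        }
    }

    // Otherwise fall back to the address of the direct connection.
    return $remoteAddr;
}
```

With this in place, the `$ipAddress` lookup in the snippet could become `$ipAddress = resolveClientIp($_SERVER, true);` when the site sits behind a proxy you control, and `resolveClientIp($_SERVER)` otherwise.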

[Update] Using HTTP 418 response code instead of HTTP 404.