Getting rid of old spam is hard

phpBB is now pretty good at keeping spam users from registering and posting. In the phpBB 3.0 and 3.1 days, its defenses turned out to be pretty weak. The GD Image spambot countermeasure (still the default) was easily hacked. phpBB has at least added settings to let you tune it better, making it harder to hack. It also started supporting Google’s reCAPTCHA, but the version in phpBB 3.1 was quickly hacked and phpBB was not agile enough to quickly integrate its versions 2 and 3 reCAPTCHAs.

This led to the an inundation of spam on certain forums, mostly bogus spam registrations but also lots of spam posts in some forums. Some administrators countered by requiring all new users to be approved by an administrator. But when inundated with hundreds of these in a short period of time, it’s a hassle to delete them all, or discern the real new users from the spam ones. For a few years, I made quite a bit of money removing spam for clients.

With phpBB 3.2 things slowly got better, at least if administrators used best practices. Best practices were to use reCAPTCHA version 2 “I am not a robot”, or the Question & Answer, providing the questions were sufficiently difficult. A malicious human could still take the time to solve the questions, but these were unusual. There were also a few extensions that could help. The Sortables CAPTCHA was one of the more useful ones.

My go to for years has been the Cleantalk extension, which requires subscribing to their service. But now there is also an Akismet spam extension, which also requires a subscription, which can be free for personal sites.

All this is good at preventing spam, but how do you get rid of months or years of spam posts? That was my dilemma this week working with a client.

The latest version of the Cleantalk extension has a feature that removes spam users and their posts. But I discovered it has a few serious limitations:

  • It bases its judgment based on the IP of the poster. The user’s last IP is stored automatically. It doesn’t examine the post text. Over time, IPs that used to be marked as spam get cleaned up, and when this happens these IPs are no longer flagged, so spam registrations aren’t caught.
  • Its interface for finding these users is slow and can easily time out, which means sometimes it can’t succeed. It also lacks pagination.

Why this particular client ignored this problem for a few years, I don’t know. But cleaning up the database was a big challenge. The only real way to do it is to manually look at every post and flag those that were spam, then use moderator tools to get rid of them.

This was time prohibitive, but if these could be removed presumably people would start posting again and Google would rank the site as legitimate again, bringing in new people.

The next best solution I found was to try to identify when the spam started. This took quite a bit of analysis, but looking in the most posted forum on the board it looked like it started on Feburary 10, 2018. So I used phpBB’s Prune User feature to remove users and their posts that registered after the spam started.

This seems to have gotten rid of the spam. But it also removed accounts of some legitimate users, and their posts as well. Those who had accounts before then were unaffected and their posts remained.

phpBB needs a real solution which so far doesn’t exist.

But I think I have found a solution … if I write the extension. If you are an extension developer, please go ahead and develop it, just tell me so I don’t waste my time.

It turns out that the Akismet, the biggest solution out there and used widely in WordPress to moderate comments, has a Submit Spam API. So in theory, if you pass the needed information to it including the poster’s IP and the post text, it can render a judgment on whether it is spam or not. If these posts can be flagged, they can then be removed.

One possible issue is that the service requires sending it a User Agent string. phpBB does not store this. Perhaps a fake user agent string could be supplied, but would this render a correct judgment? If no, this solution wouldn’t work. Also, it requires an Akismet key to use, which might require some boards to purchase the key. This may be a limiting factor for some.

As I have time I hope to see if this is a viable approach finding and removing spam posts in phpBB.