How to stop bad bots from my website without interfering with real users?

I want to keep bad scrapers (that is, bad bots that by definition ignore robots.txt) from stealing content and consuming my website’s bandwidth. At the same time, I don’t want to interfere with the experience of legitimate users, and I don’t want to prevent well-behaved bots (such as Googlebot) from indexing the site.

The standard method for dealing with this problem has been described here: Tactics for dealing with misbehaving robots. However, the solutions proposed in that topic are not what I want.

Some bad bots connect via Tor or botnets, which means their IP addresses are short-lived and may later belong to a human using an infected computer.

Therefore, I have been thinking about how to improve on the industry-standard method so that the “false positives” (i.e. humans) on the IP blacklist can visit my website again. One idea is to stop blocking these IPs outright and instead require them to pass a CAPTCHA before being allowed access. Although I consider CAPTCHAs a PITA for legitimate users, vetting suspected bad bots with a CAPTCHA seems better than blocking those IPs completely. By tracking user sessions that complete the CAPTCHA, I should be able to determine whether they are human (and remove their IP from the blacklist), or bots smart enough to solve the CAPTCHA, which I would put on an even blacker list.
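To make that flow concrete, here is a minimal sketch of what I have in mind, assuming a Flask app; blacklist, darklist and verify_captcha_response() are hypothetical placeholders that a real deployment would back with a shared datastore and an actual CAPTCHA provider:

```python
# Sketch only: serve a CAPTCHA to blacklisted IPs instead of a hard block.
from flask import Flask, request, render_template_string, redirect

app = Flask(__name__)

blacklist = set()   # IPs currently suspected of being bad bots
darklist = set()    # IPs that solved the CAPTCHA but still behave like bots
                    # (populated elsewhere by session-tracking logic, omitted here)

CAPTCHA_PAGE = """
<form method="post" action="/captcha">
  <!-- the CAPTCHA widget from your provider would be embedded here -->
  <input type="submit" value="I am human">
</form>
"""

def verify_captcha_response(form):
    """Hypothetical helper: ask the CAPTCHA provider whether the response is valid."""
    raise NotImplementedError

@app.before_request
def challenge_blacklisted_ips():
    ip = request.remote_addr
    if ip in blacklist and request.path != "/captcha":
        # Instead of returning a flat 403, present the CAPTCHA challenge.
        return render_template_string(CAPTCHA_PAGE)

@app.route("/captcha", methods=["POST"])
def captcha():
    ip = request.remote_addr
    if verify_captcha_response(request.form):
        # A solved CAPTCHA is treated as evidence of a human: unblock the IP,
        # but keep watching the session; persistent scraping moves it to darklist.
        blacklist.discard(ip)
        return redirect("/")
    return render_template_string(CAPTCHA_PAGE)
```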

However, before I start implementing this idea, I want to ask the good people here whether they foresee any problems or weaknesses (I already know that some CAPTCHAs have been broken, but I think I will be able to handle that).

I believe the question is whether there are any foreseeable problems with the CAPTCHA. Before I dig into that, I would also like to talk about how you plan to catch bots in order to challenge them with a CAPTCHA. TOR and proxy nodes change regularly, so the IP list needs to be updated constantly. You can use Maxmind as a baseline proxy address list, and you can also find services that publish updated lists of all TOR node addresses. But not all bad bots come from those two vectors, so you need other ways to catch them. If you add rate limiting and spam lists, you should catch more than 50% of the bad bots. Any other tactics really have to be customized around your website.
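As a rough illustration of that first step, the sketch below combines a per-IP sliding-window rate limit with lookups against externally maintained TOR and proxy lists. The file names and thresholds are invented for the example; in practice they would be fed by whatever feeds you subscribe to and refreshed on a schedule:

```python
# Sketch: flag requests that are over a rate limit or come from known relays/proxies.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120   # tune to your site's real traffic

request_log = defaultdict(deque)   # ip -> timestamps of recent requests

def load_ip_list(path):
    """Load one IP per line, e.g. a TOR exit-node export or a Maxmind proxy list."""
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

tor_exit_ips = load_ip_list("tor_exit_nodes.txt")    # refreshed by a cron job
proxy_ips = load_ip_list("maxmind_proxies.txt")      # refreshed by a cron job

def is_suspicious(ip):
    """Return True if this request should be challenged with a CAPTCHA."""
    now = time.time()
    hits = request_log[ip]
    hits.append(now)
    # Drop timestamps that have fallen outside the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    over_rate = len(hits) > MAX_REQUESTS_PER_WINDOW
    known_relay = ip in tor_exit_ips or ip in proxy_ips
    return over_rate or known_relay
```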

Now let’s talk about the problems with CAPTCHAs. First, there are services like http://deathbycaptcha.com/. I don’t know if I need to elaborate on that, but it will render your method useless. Many of the other ways people bypass CAPTCHAs rely on OCR software, and the better a CAPTCHA is at beating OCR, the harder it is on your users. In addition, many CAPTCHA systems use client-side cookies, so someone can solve the CAPTCHA once and then share the cookie with all of their bots.
I think the most famous list is Karl Groves’ 28 ways to beat CAPTCHA: http://www.karlgroves.com/2013/02/09/list-of-resources-breaking-captcha/
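To illustrate that last point about cookies, here is a deliberately naive example of the kind of “CAPTCHA passed” cookie being warned about; the names and secret are invented for the example. Because the token is not tied to an IP, a session, or an expiry, one human solve produces a cookie that every bot in the farm can replay:

```python
# Sketch of a replayable "CAPTCHA passed" cookie (the weakness, not a recommendation).
import hmac
import hashlib

SECRET = b"change-me"   # illustrative only

def make_pass_cookie():
    # Signed, but bound to nothing: no IP, no session, no expiry.
    return hmac.new(SECRET, b"captcha_passed", hashlib.sha256).hexdigest()

def naive_check(cookie_value):
    expected = make_pass_cookie()
    return hmac.compare_digest(cookie_value, expected)

# One human solve yields a token that works for the whole botnet:
token = make_pass_cookie()
print(naive_check(token))   # True, from any client, indefinitely
```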

For full disclosure, I am a co-founder of Distil Networks, a SaaS solution for stopping bots. I often pitch our software as a more sophisticated system than a CAPTCHA setup you build yourself, so my opinion of the effectiveness of your solution is biased.
