Anti-(Anti-)Crawler Mechanism

To talk about anti-crawlers, we first have to talk about crawlers. The concept is simple: a crawler hands manual browsing tasks over to code so that they run automatically. An anti-crawler is a means of detecting whether a visitor is a real user or a program. An anti-anti-crawler, in turn, is a means of getting around the anti-crawler mechanism.

People say that “a double negative makes a positive”, so an anti-anti-crawler should be the same thing as a crawler. In fact it is not. On the surface the behavior is the same, but an anti-anti-crawler does more processing than a simple small crawler.

Generally speaking, anti-crawlers will start from the following levels:
-header: the browser request headers
-User-Agent: the user agent, a way to indicate the identity of the visiting client
-referer: the link from which the visit to the target link jumped (anti-leeching can start here)
-Host: used for same-origin address checks, so it is useful to set it.
-Same IP: if one IP accesses the site many times in a short period, it is very likely a crawler, and the anti-crawler will deal with it.
-Access frequency: many high-concurrency accesses in a short period are almost always problematic.
These are the common anti-crawler measures. Of course there are more advanced mechanisms, such as the most annoying one, CAPTCHAs (tesseract can handle recognition of relatively simple ones), user behavior analysis, and so on.
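
To make the "same IP" and "access frequency" checks concrete, here is a minimal sketch of how a site might flag an IP that is too busy. The class name `FrequencyGuard` and the thresholds are assumptions for illustration, not how any particular site works.

```python
import time
from collections import defaultdict, deque


class FrequencyGuard:
    """Flag an IP that sends more than `limit` requests within `window` seconds."""

    def __init__(self, limit=10, window=1.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.limit
```

A crawler that spaces out its requests, or spreads them over several IPs, stays under a threshold like this, which is exactly what the countermeasures in this post aim for.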

Now that we understand the common anti-crawler mechanisms, working out a countermeasure for each policy is no longer a shot in the dark. Yes, we have a countermeasure for each of the restrictions above.

  • For User-Agent, you can collect some common browser User-Agent strings and use one of them at random for each visit.
  • For IP, you can use proxy IPs.
  • For frequency restrictions, a random sleep between visits works well.
  • ……
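
The three countermeasures above can be combined into one "request plan" per visit. This is only a sketch: the pool contents, the placeholder proxy addresses, and the name `next_request_plan` are assumptions for illustration.

```python
import random

# Hypothetical pools; a real run would load many more entries.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
]
PROXIES = ["127.0.0.1:8080", "127.0.0.1:8081"]  # placeholder addresses


def next_request_plan(rng=random):
    """Pick a random User-Agent, proxy, and delay for one request."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": "http://{}".format(proxy)},
        "delay": rng.uniform(1.0, 3.0),  # seconds to sleep before sending
    }
```

With the requests library, one visit would then sleep for `plan["delay"]` seconds and call `requests.get(url, headers=plan["headers"], proxies=plan["proxies"])`.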

Hands-on practice

I used to blog on CSDN, and to be honest its anti-crawler mechanism is relatively shallow. On the one hand, it is not really necessary; on the other hand, building anti-crawler measures is not very cost-effective, and I guess they are not willing to spend the resources on it.

So inflating pageviews on CSDN is still fairly easy. Here are my ideas.
-Crawl proxy IPs, verify and clean the data, and update it regularly.
-Collect browser User-Agents and pick from them at random for each visit.
-Random sleep strategy, log processing, error recording, timed retries, etc.
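
The last idea (random sleep, logging, error recording, retries) is not implemented in the code below, so here is one possible sketch. The helper name `with_retry` and its parameters are assumptions, not part of the original project.

```python
import logging
import random
import time

logger = logging.getLogger("brush")


def with_retry(action, attempts=3, min_sleep=1.0, max_sleep=3.0, sleep=time.sleep):
    """Call `action()` until it succeeds, sleeping a random interval between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Record the error so failed visits can be inspected later.
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the caller handle the error
            sleep(random.uniform(min_sleep, max_sleep))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests; in normal use the default `time.sleep` is enough.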

Proxy IP processing

# coding: utf8
# @Author: 郭勃
# @File:
# @Time: 2017/10/5
# @Contact: [email protected]
# @blog: http://
# @Description: Grab proxy IPs and save them under the related Redis key
import requests
from bs4 import BeautifulSoup
from redishelper import RedisHelper


class ProxyIP(object):
    """
    Grab proxy IPs, then clean and verify them.
    """

    def __init__(self):
        self.rh = RedisHelper()

    def crawl(self):
        """
        Store everything, regardless of whether it is http or https.
        """
        # Handle the http-mode proxy IPs first
        httpurl = "http://www.xicidaili.com/nn/"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
        }
        html = requests.get(url=httpurl, headers=headers).text
        soup = BeautifulSoup(html, "html.parser")
        ips = soup.find_all("tr")
        # Row 0 is the table header, so start from row 1
        for index in range(1, len(ips)):
            tds = ips[index].find_all('td')
            ip = tds[1].text
            port = tds[2].text
            ipinfo = "{}:{}".format(ip, port)
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            # print(ipinfo)

    def _check(self, ip):
        """
        Check the validity of the proxy IP.
        """
        checkurl = ""  # left blank in the source
        localip = self._getLocalIp()
        # print("Local: {}, proxy: {}".format(localip, ip))
        return False if localip == ip else True

    def _getLocalIp(self):
        """
        Get the IP address of this machine. The API-based approach is not
        reliable, so for now copy and paste the address in manually.
        """
        return ""  # left blank in the source

    def clean(self):
        ips = self.rh.sGetAllAvalibleIps()
        for ipinfo in ips:
            ip, port = ipinfo.split(":")
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            else:
                # RedisHelper defines sRemoveAvalibeIp (the original call
                # used the undefined name sRemoveAvalibleIp)
                self.rh.sRemoveAvalibeIp(ipinfo)

    def update(self):
        pass


if __name__ == '__main__':
    pip = ProxyIP()
    # result = pip._check("", 53281)
    # print(result)
    pip.crawl()
    # pip.clean()


# coding: utf8
# @Author: 郭勃
# @File:
# @Time: 2017/10/5
# @Contact: [email protected]
# @blog:
# @Description: Helper methods for the Redis operations involved
import redis


class RedisHelper(object):
    """
    Stores the crawled blog article links and the proxy IPs.
    """

    def __init__(self):
        self.articlepool = "redis:set:article:pool"
        self.avalibleips = "redis:set:avalible:ips"
        self.unavalibleips = "redis:set:unavalibe:ips"
        pool = redis.ConnectionPool(host="localhost", port=6379)
        self.redispool = redis.Redis(connection_pool=pool)

    def sAddArticleId(self, articleid):
        """
        Add the id of a blog article to be crawled.
        :param articleid:
        :return:
        """
        self.redispool.sadd(self.articlepool, articleid)

    def sRemoveArticleId(self, articleid):
        self.redispool.srem(self.articlepool, articleid)

    def popupArticleId(self):
        # srandmember returns a random member without removing it
        return int(self.redispool.srandmember(self.articlepool))

    def sAddAvalibeIp(self, ip):
        self.redispool.sadd(self.avalibleips, ip)

    def sRemoveAvalibeIp(self, ip):
        self.redispool.srem(self.avalibleips, ip)

    def sGetAllAvalibleIps(self):
        return [ip.decode('utf8') for ip in self.redispool.smembers(self.avalibleips)]

    def popupAvalibeIp(self):
        return self.redispool.srandmember(self.avalibleips)

    def sAddUnavalibeIp(self, ip):
        self.redispool.sadd(self.unavalibleips, ip)

    def sRemoveUnavaibleIp(self, ip):
        self.redispool.srem(self.unavalibleips, ip)

CSDN blog tool class

# coding: utf8
# @Author: Guo Pu
# @File:
# @Time: 2017/10/5
# @Contact: [email protected]
# @blog:
# @Description: Tool class for crawling all blog links of a blogger, plus other related operations.
import re
import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Crawl the ids of all article links under a blogger id.
    """

    def __init__(self, bloger="marksinoberg"):
        self.bloger = bloger
        # self.blogpagelink = "{}/article/list/{}".format(self.bloger, 1)

    def _getTotalPages(self):
        blogpagelink = "http://{}/article/list/{}?viewmode=contents".format(self.bloger, 1)
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        # This is a fairly hacky approach; real development should not be this casual.
        temptext = soup.find('div', {"class": "pagelist"}).find("span").get_text()
        # Raw string so that \d is not treated as a string escape
        restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
        # print(restr)
        pages = restr[0][-1]
        return pages

    def _parsePage(self, pagenumber):
        blogpagelink = "{}/article/list/{}?viewmode=contents".format(self.bloger, int(pagenumber))
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find("div", {"id": "article_list"}).find_all("span", {"class": "link_title"})
        articleids = []
        for link in links:
            temp = link.find("a").attrs['href']
            # The article id is the last segment of the link path
            articleids.append(temp.split("/")[-1])
        # print(len(articleids))
        # print(articleids)
        return articleids

    def get_all_articleids(self):
        pages = int(self._getTotalPages())
        articleids = []
        for index in range(pages):
            tempids = self._parsePage(int(index + 1))
            articleids.extend(tempids)
        return articleids


if __name__ == '__main__':
    bs = BlogScanner(bloger="marksinoberg")
    # print(bs._getTotalPages())
    # bs._parsePage(1)
    articleids = bs.get_all_articleids()
    print(len(articleids))
    print(articleids)
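
The page-count extraction in `_getTotalPages` depends entirely on the text of the pager element. The sketch below isolates that step. The sample text "106 articles, 8 pages" is a made-up stand-in, since the real pager text is not shown in this post, and the function name `total_pages` is an assumption.

```python
import re

PAGE_INFO = re.compile(r"(\d+).*?(\d+)")


def total_pages(pager_text):
    """Return the last number in the first match, treated as the page count."""
    matches = PAGE_INFO.findall(pager_text)
    if not matches:
        raise ValueError("no page information in {!r}".format(pager_text))
    return int(matches[0][-1])


print(total_pages("106 articles, 8 pages"))  # prints 8
```

Raising an error when nothing matches makes a changed page layout fail loudly instead of with an obscure IndexError.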


# coding: utf8
# @Author: 郭勃
# @File:
# @Time: 2017/10/5
# @Contact: [email protected]
# @blog:
# @Description: Start brushing
import requests
import random
import time
from redishelper import RedisHelper


class FakeUserAgent(object):
    """
    A collection of User-Agents. Each call to popupUAs produces a different
    set of UAs to reduce the impact of anti-crawler mechanisms.
    More content:
    """

    def __init__(self):
        self.uas = [
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/ Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv: Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/ Mobile Safari/534.1+",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/4.0 (compatible; MSIE 6.0;) Opera/UCWEB7.0.2.37/28/999",
            "Openwave/ UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            "UCWEB7.0.2.37/28/999",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        ]

    def _generateIndexes(self):
        # Pick a random number of distinct indexes into self.uas
        numbers = random.randint(0, len(self.uas))
        indexes = []
        while len(indexes) < numbers:
            temp = random.randrange(0, len(self.uas))
            if temp not in indexes:
                indexes.append(temp)
        return indexes

    def popupUAs(self):
        uas = []
        indexes = self._generateIndexes()
        for index in indexes:
            uas.append(self.uas[index])
        return uas


class Brush(object):
    """
    Brush pageviews.
    """

    def __init__(self, bloger="marksinoberg"):
        self.bloger = "http://blog.{}".format(bloger)
        self.headers = {
            'Host': '',  # left blank in the source
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        self.rh = RedisHelper()

    def getRandProxyIp(self):
        ip = self.rh.popupAvalibeIp()
        proxyip = {}
        ipinfo = "http://{}".format(str(ip.decode('utf8')))
        proxyip['http'] = ipinfo
        # print(proxyip)
        return proxyip

    def brushLink(self, articleid, randuas=None):
        # None instead of a mutable default list
        randuas = randuas or []
        # /details/78058279
        bloglink = "{}/article/details/{}".format(self.bloger, articleid)
        for ua in randuas:
            self.headers['User-Agent'] = ua
            timeseed = random.randint(1, 3)
            print("Temporary sleep: {} seconds".format(timeseed))
            time.sleep(timeseed)
            for index in range(timeseed):
                # requests.get(url=bloglink, headers=self.headers, proxies=self.getRandProxyIp())
                requests.get(url=bloglink, headers=self.headers)


if __name__ == '__main__':
    # fua = FakeUserAgent()
    # indexes = [0, 2, 5, 7]
    # indexes = generate_random_numbers(0, 18, 7)
    # randuas = fua.popupUAs(indexes)
    # randuas = fua.popupUAs()
    # print(len(randuas))
    # print(randuas)
    # print(fua._generateIndexes())
    brush = Brush("marksinoberg")
    # brush.brushLink(78058279, randuas)
    print(brush.getRandProxyIp())
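
As a side note, the index-picking loop in `_generateIndexes` can be expressed with the standard library's `random.sample`, which returns distinct elements directly. This is an alternative sketch, not the author's code, and `pick_user_agents` is a hypothetical name.

```python
import random


def pick_user_agents(uas, rng=random):
    """Return a random-sized selection of distinct entries from `uas`."""
    count = rng.randint(0, len(uas))  # 0..len(uas), both ends included
    return rng.sample(uas, count)
```

Like the original, this can return an empty list, in which case `brushLink` sends no requests for that round. Using `rng.randint(1, len(uas))` would guarantee at least one User-Agent for a non-empty pool.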


# coding: utf8
# @Author: 郭勃
# @File:
# @Time: 2017/10/5
# @Contact: [email protected]
# @blog:
# @Description: Entry point
from csdn import *
from redishelper import RedisHelper
from brushhelper import *
import threading


def main():
    rh = RedisHelper()
    bs = BlogScanner(bloger="marksinoberg")
    fua = FakeUserAgent()
    brush = Brush(bloger="marksinoberg")
    counter = 0
    while counter < 12:
        # Start brushing
        print("{}th time! ".format(counter))
        try:
            uas = fua.popupUAs()
            articleid = rh.popupArticleId()
            brush.brushLink(articleid, uas)
        except Exception as e:
            print(e)
            # A log handler still needs to be added here
        counter += 1


if __name__ == '__main__':
    # Start 280 threads, each running main() once
    for i in range(280):
        temp = threading.Thread(target=main)
        temp.start()

Run result

I ran a test earlier on an article I wrote.
Blog link:

There were 301 views before brushing. After a brief round of brushing, the number of visits was as follows:

[Image: page views after a brief round of brushing]


That is roughly it. It is a prototype at most, since the code is only about 45% complete. If you are interested, you can add me on QQ (1064319632) or leave your suggestions in the comments, so that we can communicate and learn together.

