I blog partly so I can quickly review what I have learned and organize my thoughts, and partly to help other folks who have run into similar problems. But blogging is hard to stick with, for all sorts of reasons; at bottom, though, it comes down to the lack of "resonance."
As the old idiom about high mountains and flowing water goes, a kindred spirit is hard to find.
Honestly, what sustains the blogging habit are the little things: watching the daily visit count and the likes tick up, seeing your articles get comments, and so on.
Okay, enough rambling. Today let's talk about brushing page views. This runs counter to the original spirit of blogging, but it is still worth understanding how it works; after all, "technology itself is not a crime!"
Anti-(Anti-)Crawler Mechanism
To talk about anti-crawling, we first have to talk about crawling. The idea is simple: a crawler hands a manual task over to code so it can be automated. Anti-crawling is the set of techniques a site uses to detect whether a visitor is a real user or a piece of code; anti-anti-crawling is the set of countermeasures against those detection techniques.
They say "a double negative makes a positive," so an anti-anti-crawler ought to be the same thing as a plain crawler. Not quite: the outward behavior is the same, but an anti-anti-crawler does considerably more processing than a simple little crawler.
Generally speaking, anti-crawler mechanisms work at the following levels:
- Header: the browser request headers as a whole.
- User-Agent: identifies the client making the request.
- Referer: which page the visit to the target link jumped from (anti-leech protection can start here).
- Host: a same-origin check on the address; it comes in handy at times.
- Same IP: many accesses from one IP in a short period very likely mean a crawler, and anti-crawler mechanisms act on that.
- Access frequency: repeated, highly concurrent access in a short window is almost always problematic traffic.
Those are the common anti-crawler measures. There are more advanced mechanisms too, such as the ever-annoying CAPTCHA (tesseract can handle recognition of the relatively simple ones), user-behavior analysis, and so on.
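For the simpler CAPTCHAs, the tesseract route just mentioned looks roughly like this. A minimal sketch, assuming the pytesseract and Pillow packages are installed and a tesseract binary is on the PATH; "captcha.png" is a hypothetical file name used only for illustration:

```python
# Minimal sketch of simple CAPTCHA recognition via tesseract.
# Assumes pytesseract + Pillow; "captcha.png" is a hypothetical image.
from PIL import Image
import pytesseract

img = Image.open("captcha.png").convert("L")  # grayscale tends to help OCR
print(pytesseract.image_to_string(img).strip())
```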
Now that we understand the common anti-crawler mechanisms, implementing the corresponding countermeasures is no longer such a shot in the dark. For each of the restrictions above there is a response:
- For User-Agent, collect some common browser agent strings and pick one at random for each visit.
- For the IP restriction, use proxy IPs.
- For frequency limits, sleep for a random interval between accesses (a minimal sketch combining these three follows the list).
- ……
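Put together, the three countermeasures above amount to something like the following minimal sketch. The UA strings and the proxy address here are placeholders of my own, not values from the code later in this post:

```python
# Minimal sketch: random UA + optional proxy IP + random sleep per request.
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
]

def polite_get(url, proxy=None):
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate the UA
    proxies = {"http": proxy} if proxy else None          # optional proxy IP
    time.sleep(random.uniform(1, 3))                      # randomized interval
    return requests.get(url, headers=headers, proxies=proxies)

# resp = polite_get("http://example.com", proxy="http://1.2.3.4:8080")
```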
Hands-on
I have been blogging on CSDN, and to be honest its anti-crawler mechanism is fairly shallow. On one hand there is not much need for more; on the other, serious anti-crawling is not very cost-effective to run, and presumably they are not willing to spend the effort.
So brushing page views on CSDN is still quite easy. Here is my approach:
- Crawl proxy IPs, verify and clean them, and refresh them regularly.
- Collect browser User-Agents and randomize them per request.
- A random-sleep strategy, plus log handling, error recording, timed retries, and so on (a sketch of the retry idea follows the list).
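The logging and timed-retry part of that last item never made it into the code below, so here is a hedged sketch of what I mean by it; the `with_retry` helper is my own illustration, not part of the project:

```python
# Sketch of "logging + timed retry": retry a flaky call a few times,
# logging each failure, before giving up.
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)

def with_retry(times=3, delay=2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    logging.warning("attempt %d/%d failed: %s", attempt, times, e)
                    time.sleep(delay)
            logging.error("giving up after %d attempts", times)
        return wrapper
    return decorator
```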
Proxy IP processing
```python
# coding: utf8
# @Author: 郭勃
# @File: proxyip.py
# @Time: 2017/10/5
# @Contact: [email protected]
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Grab proxy IPs and save them to the relevant Redis key.

import requests
from bs4 import BeautifulSoup
from redishelper import RedisHelper


class ProxyIP(object):
    """Grab proxy IPs, clean and verify them."""

    def __init__(self):
        self.rh = RedisHelper()

    def crawl(self):
        """Store all of them, whether http or https."""
        # First handle the http-mode proxy IPs.
        httpurl = "http://www.xicidaili.com/nn/"
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
        html = requests.get(url=httpurl, headers=headers).text
        soup = BeautifulSoup(html, "html.parser")
        ips = soup.find_all("tr")
        for index in range(1, len(ips)):
            tds = ips[index].find_all('td')
            ip = tds[1].text
            port = tds[2].text
            ipinfo = "{}:{}".format(ip, port)
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            # print(ipinfo)

    def _check(self, ip):
        """Check the validity of a proxy IP."""
        # Note: checkurl is defined but not yet requested; the current
        # check only compares the proxy address against the local IP.
        checkurl = "http://47.94.19.186/common/checkip.php"
        localip = self._getLocalIp()
        # print("Local: {}, proxy: {}".format(localip, ip))
        return False if localip == ip else True

    def _getLocalIp(self):
        """Get this machine's IP. The interface-based approach is not
        reliable, so for now copy it manually from
        https://www.baidu.com/s?ie=UTF-8&wd=ip
        """
        return "223.91.239.159"

    def clean(self):
        ips = self.rh.sGetAllAvalibleIps()
        for ipinfo in ips:
            ip, port = ipinfo.split(":")
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            else:
                self.rh.sRemoveAvalibeIp(ipinfo)

    def update(self):
        pass


if __name__ == '__main__':
    pip = ProxyIP()
    # result = pip._check("223.91.239.159")
    # print(result)
    pip.crawl()
    # pip.clean()
```
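One caveat: `_check` above never actually sends traffic through the proxy; it only compares the proxy's address with the local IP. A more direct validation, sketched below, routes a test request through the proxy and treats any failure or timeout as "bad proxy". httpbin.org/ip is just a public echo service used here for illustration:

```python
# Hedged alternative to _check: route a real request through the proxy.
import requests

def check_proxy(ipinfo, timeout=5):
    proxies = {"http": "http://{}".format(ipinfo)}
    try:
        resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False
```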
Redis tools
```python
# coding: utf8
# @Author: 郭勃
# @File: redishelper.py
# @Time: 2017/10/5
# @Contact: [email protected]
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Tool methods for the Redis operations involved.

import redis


class RedisHelper(object):
    """Holds the crawled blog article ids and the proxy IPs."""

    def __init__(self):
        self.articlepool = "redis:set:article:pool"
        self.avalibleips = "redis:set:avalible:ips"
        self.unavalibleips = "redis:set:unavalibe:ips"
        pool = redis.ConnectionPool(host="localhost", port=6379)
        self.redispool = redis.Redis(connection_pool=pool)

    def sAddArticleId(self, articleid):
        """Add a blog article id to be brushed."""
        self.redispool.sadd(self.articlepool, articleid)

    def sRemoveArticleId(self, articleid):
        self.redispool.srem(self.articlepool, articleid)

    def popupArticleId(self):
        return int(self.redispool.srandmember(self.articlepool))

    def sAddAvalibeIp(self, ip):
        self.redispool.sadd(self.avalibleips, ip)

    def sRemoveAvalibeIp(self, ip):
        self.redispool.srem(self.avalibleips, ip)

    def sGetAllAvalibleIps(self):
        return [ip.decode('utf8') for ip in self.redispool.smembers(self.avalibleips)]

    def popupAvalibeIp(self):
        return self.redispool.srandmember(self.avalibleips)

    def sAddUnavalibeIp(self, ip):
        self.redispool.sadd(self.unavalibleips, ip)

    def sRemoveUnavaibleIp(self, ip):
        self.redispool.srem(self.unavalibleips, ip)
```
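A quick usage sketch, assuming a local Redis server on the default port as in `RedisHelper.__init__`; the article id is just an example value:

```python
# Usage sketch: seed the article pool and inspect the proxy set.
from redishelper import RedisHelper

rh = RedisHelper()
rh.sAddArticleId(78058279)      # queue an article id for brushing
print(rh.popupArticleId())      # 78058279, if it is the only member
print(rh.sGetAllAvalibleIps())  # whatever the proxy crawler has stored
```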
CSDN blog tool class
```python
# coding: utf8
# @Author: 郭勃
# @File: csdn.py
# @Time: 2017/10/5
# @Contact: [email protected]
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Tool class that crawls all article links of a blogger,
#               plus related operations.

import re
import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """Crawl all article link ids under the given blogger id."""

    def __init__(self, bloger="marksinoberg"):
        self.bloger = bloger
        # self.blogpagelink = "http://blog.csdn.net/{}/article/list/{}".format(self.bloger, 1)

    def _getTotalPages(self):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, 1)
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        # A quick hack; real code should not parse this so casually.
        temptext = soup.find('div', {"class": "pagelist"}).find("span").get_text()
        restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
        # print(restr)
        pages = restr[0][-1]
        return pages

    def _parsePage(self, pagenumber):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, int(pagenumber))
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find("div", {"id": "article_list"}).find_all("span", {"class": "link_title"})
        articleids = []
        for link in links:
            temp = link.find("a").attrs['href']
            articleids.append(temp.split("/")[-1])
        # print(len(articleids))
        # print(articleids)
        return articleids

    def get_all_articleids(self):
        pages = int(self._getTotalPages())
        articleids = []
        for index in range(pages):
            tempids = self._parsePage(int(index + 1))
            articleids.extend(tempids)
        return articleids


if __name__ == '__main__':
    bs = BlogScanner(bloger="marksinoberg")
    # print(bs._getTotalPages())
    # bs._parsePage(1)
    articleids = bs.get_all_articleids()
    print(len(articleids))
    print(articleids)
```
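The entry point later pops article ids out of Redis, so something has to put them there first. A gluing sketch along these lines (not shown in the original code) would do it:

```python
# Sketch: push every discovered article id into the Redis pool that
# Main.py later draws from.
from csdn import BlogScanner
from redishelper import RedisHelper

rh = RedisHelper()
bs = BlogScanner(bloger="marksinoberg")
for articleid in bs.get_all_articleids():
    rh.sAddArticleId(articleid)
```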
Brush tools
```python
# coding: utf8
# @Author: 郭勃
# @File: brushhelper.py
# @Time: 2017/10/5
# @Contact: [email protected]
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Start brushing.

import requests
import random
import time
from redishelper import RedisHelper


class FakeUserAgent(object):
    """A collection of User-Agents.

    Each popup produces a different set of UAs, to blunt the impact of
    anti-crawler mechanisms. More at: http://www.73207.com/useragent
    """

    def __init__(self):
        self.uas = [
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/4.0 (compatible; MSIE 6.0;) Opera/UCWEB7.0.2.37/28/999",
            "Openwave/ UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            "UCWEB7.0.2.37/28/999",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        ]

    def _generateIndexes(self):
        # Pick a random number of distinct indexes into the UA list.
        numbers = random.randint(0, len(self.uas))
        indexes = []
        while len(indexes) < numbers:
            temp = random.randrange(0, len(self.uas))
            if temp not in indexes:
                indexes.append(temp)
        return indexes

    def popupUAs(self):
        uas = []
        indexes = self._generateIndexes()
        for index in indexes:
            uas.append(self.uas[index])
        return uas


class Brush(object):
    """Brush the page views."""

    def __init__(self, bloger="marksinoberg"):
        self.bloger = "http://blog.csdn.net/{}".format(bloger)
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        self.rh = RedisHelper()

    def getRandProxyIp(self):
        ip = self.rh.popupAvalibeIp()
        proxyip = {}
        ipinfo = "http://{}".format(str(ip.decode('utf8')))
        proxyip['http'] = ipinfo
        # print(proxyip)
        return proxyip

    def brushLink(self, articleid, randuas=[]):
        # e.g. http://blog.csdn.net/marksinoberg/article/details/78058279
        bloglink = "{}/article/details/{}".format(self.bloger, articleid)
        for ua in randuas:
            self.headers['User-Agent'] = ua
            timeseed = random.randint(1, 3)
            print("Sleeping for {} seconds".format(timeseed))
            time.sleep(timeseed)
            for index in range(timeseed):
                # requests.get(url=bloglink, headers=self.headers, proxies=self.getRandProxyIp())
                requests.get(url=bloglink, headers=self.headers)


if __name__ == '__main__':
    # fua = FakeUserAgent()
    # randuas = fua.popupUAs()
    # print(len(randuas))
    # print(randuas)
    # print(fua._generateIndexes())
    brush = Brush("marksinoberg")
    # brush.brushLink(78058279, randuas)
    print(brush.getRandProxyIp())
```
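As an aside, `_generateIndexes` plus `popupUAs` amount to drawing a duplicate-free random subset of the UA list; `random.sample` from the standard library does the same thing in one call, as in this equivalent sketch:

```python
# Equivalent, simpler take on popupUAs: random.sample already returns a
# duplicate-free random subset of the requested size.
import random

def popup_uas(uas):
    return random.sample(uas, random.randint(0, len(uas)))
```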
Entry point
```python
# coding: utf8
# @Author: 郭勃
# @File: Main.py
# @Time: 2017/10/5
# @Contact: [email protected]
# @Blog: http://blog.csdn.net/marksinoberg
# @Description: Entry point.

import threading

from csdn import *
from redishelper import RedisHelper
from brushhelper import *


def main():
    rh = RedisHelper()
    bs = BlogScanner(bloger="marksinoberg")
    fua = FakeUserAgent()
    brush = Brush(bloger="marksinoberg")
    counter = 0
    while counter < 12:
        # Start brushing.
        print("Round {}!".format(counter))
        try:
            uas = fua.popupUAs()
            articleid = rh.popupArticleId()
            brush.brushLink(articleid, uas)
        except Exception as e:
            print(e)
            # TODO: add a proper log handler.
        counter += 1


if __name__ == '__main__':
    for i in range(280):
        temp = threading.Thread(target=main)
        temp.start()
```
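Spawning 280 raw threads is rather blunt; if I were polishing this further, a thread pool would cap the concurrency instead. A rough sketch, reusing the `main` above:

```python
# Hedged alternative to 280 raw threads: cap concurrency with a pool so
# neither this machine nor the target site is overwhelmed.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=20) as pool:
    for i in range(280):
        pool.submit(main)
```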
Results
I ran a test against an article I wrote earlier.
Blog link: http://www.voidcn.com/article/p-dyiqhtnx-bny.html
It had 301 views before the brush; after a quick run, the view count had visibly climbed (the before-and-after screenshots are not reproduced here).
Summary
That is roughly it. It is a prototype at best, since the code is only about 45% complete. If you are interested, add me on QQ at 1064319632, or leave your suggestions in the comments, and let's learn from each other.