This post continues the discussion of common anti-crawler measures and our solutions. As before, if it helps you, please give it a recommendation.
1. Anti-leeching
The anti-leeching I ran into this time includes, besides the Referer check covered in the previous post, Cookie-based anti-leeching and timestamp-based anti-leeching. Cookie anti-leeching is common on forums and communities: when a visitor requests a resource, the server checks the visitor's Cookie, and if it does not belong to one of the site's own users, the correct resource is not returned. Timestamp anti-leeching appends a timestamp parameter to the URL, so requesting the bare URL does not return the real page; you only get it when a valid timestamp is attached.
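To make the idea concrete, here is a minimal server-side sketch of Cookie plus Referer anti-leeching, assuming a Flask app; the domain, route, and cookie name are invented purely for illustration and are not how any particular site actually implements it.

# Hypothetical server-side sketch of Referer + Cookie anti-leeching (illustration only).
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/images/<image_id>.jpg")
def image(image_id):
    # Referer check: only serve images to pages on our own (made-up) site.
    if "mysite.com" not in request.headers.get("Referer", ""):
        return send_file("placeholder.jpg")
    # Cookie check: only serve to logged-in users; the cookie name is an assumption.
    if not request.cookies.get("user_session"):
        return send_file("placeholder.jpg")
    return send_file("images/{}.jpg".format(image_id))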
This example is the image sub-community of Tianya Community:
Here we open the developer tools, pick a picture at random, and grab its link. Then we use requests to download the picture, making sure to include the Referer field, and see how it turns out:
import requests

url = "http://img3.laibafile.cn/p/l/305989961.jpg"
headers = {
    "Referer": "http://pp.tianya.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36"
}
res = requests.get(url, headers=headers)
with open('test.jpg', 'wb') as f:
    f.write(res.content)
Our crawler runs fine and a test.jpg file is generated. But don't celebrate yet; open the picture and take a look:
I nearly coughed up blood: so that's the trick they pull! What now? Read on!
Solution:
Since the picture is only shared with Tianya community users, why don't we just become one of its users? Without further ado, I registered an account, logged in, and grabbed the cookies after logging in:
__auc=90d515c116922f9f856bd84dd81; Hm_lvt_80579b57bf1b16bdf88364b13221a8bd=1551070001,1551157745; user=w=EW2QER&id=138991748&f=1; right=web4=n&portal=n; td_cookie=1580546065; __cid=CN; Hm_lvt_bc5755e0609123f78d0e816bf7dee255=1551070006,1551157767,1551162198,1551322367; time=ct=1551322445.235; __asc=9f30fb65169320604c71e2febf6; Hm_lpvt_bc5755e0609123f78d0e816bf7dee255=1551322450; __u_a=v2.2.4; sso=r=349690738&sid=&wsid=71E671BF1DF0B635E4F3E3E41B56BE69; temp=k=674669694&s=&t=1551323217&b=b1eaa77438e37f7f08cbeffc109df957&ct=1551323217&et=1553915217; temp4=rm=ef4c48449946624e9d7d473bc99fc5af; u_tip=138991748=0
Note: cookies are time-sensitive; I don't know exactly when these will expire. Add the Cookie to the code and run it again, and you can see that the picture downloads successfully:
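The updated snippet isn't shown in the original post, so here is a minimal sketch of what adding the Cookie might look like; the cookie value is just the (now long-expired) string above, truncated here for readability.

import requests

url = "http://img3.laibafile.cn/p/l/305989961.jpg"
headers = {
    "Referer": "http://pp.tianya.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36",
    # Paste the full cookie string copied from the browser after logging in.
    "Cookie": "__auc=90d515c116922f9f856bd84dd81; user=w=EW2QER&id=138991748&f=1; ..."
}
res = requests.get(url, headers=headers)
with open('test.jpg', 'wb') as f:
    f.write(res.content)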
After all that effort we only got one picture; how could we be satisfied with that? Analyzing the page shows there are fifteen pictures per page, and when you scroll down you see "Loading, please wait":
We immediately realized that the rest is loaded via AJAX, so we opened the developer tools to look and found the following requests:
You can see that the part of each link before the "?" is basically the same, the number after "list_" indicates the page number, and what about the string of digits after "_="? Anyone with a bit of experience will quickly recognize it as a timestamp, so let's test it:
import time
import requests

t = time.time() * 1000  # millisecond timestamp, like the "_=" parameter in the captured request
url = "http://pp.tianya.cn/qt/list_4.shtml?_={}".format(t)
res = requests.get(url)
print(res.text)
After running this we get the result we want. Now that we can construct the links in code, how do we find out the maximum page number? Keep dragging the scroll wheel to pull the page down, and we find that no new pages appear after page 5:
What now? Don't worry: since we can already construct the links ourselves, we can request more pages by changing the number after "list_". However, my own testing shows there are only 15 pages at most; increasing the number beyond that is useless, presumably because the server limits the data to 15 pages. The following figure is the result returned after I changed the number to 16:
Finally, write the program and run it, and the pictures can be downloaded:
The complete code has been uploaded to GitHub!
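For reference, here is a rough sketch of what such a downloader might look like; the img-URL regex and the output folder name are assumptions (the real parsing logic lives in the GitHub code), and the cookie string must be replaced with your own logged-in cookie.

# Minimal sketch under the assumptions noted above; not the exact GitHub version.
import os
import re
import time
import requests

headers = {
    "Referer": "http://pp.tianya.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36",
    "Cookie": "paste your logged-in cookie string here"
}

os.makedirs("images", exist_ok=True)
for page in range(1, 16):  # only 15 pages are available
    t = int(time.time() * 1000)
    page_url = "http://pp.tianya.cn/qt/list_{}.shtml?_={}".format(page, t)
    html = requests.get(page_url, headers=headers).text
    # Assumption: picture links appear as src attributes pointing at laibafile.cn
    img_urls = re.findall(r'src="(http://img\d+\.laibafile\.cn/[^"]+\.jpg)"', html)
    for img_url in img_urls:
        res = requests.get(img_url, headers=headers)  # Referer + Cookie defeat the anti-leeching
        with open(os.path.join("images", img_url.split("/")[-1]), "wb") as f:
            f.write(res.content)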
2. Randomize the webpage source code
Some websites randomize their page source with display:none elements, randomize class and id names, and insert random tr and td tags, all of which makes parsing harder. One example is the "whole-network proxy IP" site:
Solution:
You can see that each IP is contained in a td whose class is "ip", so we can locate this td first and then parse further. Although this td contains many span and p tags and the position of each tag follows no pattern, there is still a way to parse it: extract all of the text inside the td, drop the adjacent duplicate pieces, and splice the rest together. The code is as follows:
from lxml import etree

et = etree.HTML(html)  # html: page source of the proxy list
for n in range(1, 21):
    # All text fragments inside the first td of row n, in document order.
    lst = et.xpath('//table/tbody/tr[{}]/td[1]//text()'.format(n))
    proxy = ""
    for i in range(len(lst) - 1):
        if lst[i] != lst[i + 1]:  # skip adjacent duplicates
            proxy += lst[i]
    proxy += lst[-1]
    print(proxy)
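As an aside, another way to deal with the display:none trick mentioned earlier is to strip out the hidden nodes before extracting text. That is not what the code above does; it is just a sketch of the idea, assuming the hidden elements carry an inline display:none style attribute.

from lxml import etree

et = etree.HTML(html)
# Remove any element hidden with an inline display:none style before reading the text.
for hidden in et.xpath('//*[contains(@style, "display:none")]'):
    hidden.getparent().remove(hidden)
for td in et.xpath('//table/tbody/tr/td[1]'):
    print("".join(td.xpath('.//text()')).strip())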
With this we finally get the data we want. However, the port numbers we extract do not match those displayed on the web page, because the port data is obfuscated with JavaScript. How to crack that will be shared next time.