Graduation project: web crawler (1)

The crawler processes four kinds of page objects: links, titles, paragraphs, and images.
baidu

xxxx

xxxx
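
As a rough illustration of pulling these four kinds of objects out of a page, here is a minimal sketch using the standard library's html.parser. The class name PageExtractor and the sample HTML are made up for this example; a real crawler may well use a different parser.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects links, the title, paragraphs, and image sources (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.paragraphs = []
        self.title = ""
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "title":
            self._current = tag
        elif tag == "p":
            self._current = tag
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        # Stop collecting text once the title or paragraph closes.
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data


# Made-up sample page, only for demonstration.
sample = (
    "<html><head><title>Demo</title></head><body>"
    '<p>Hello <a href="/about">about</a></p>'
    '<img src="logo.png">'
    "</body></html>"
)
parser = PageExtractor()
parser.feed(sample)
print(parser.title, parser.links, parser.paragraphs, parser.images)
```

Text inside a link that sits within a paragraph is kept as part of that paragraph, which is usually what you want when saving readable content.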


There are two types of links that must be excluded:
1. Internal jump links (links that jump within the same page)
xxxx
2. Links handled by scripts
xxxx
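
The two exclusions above could be sketched roughly as follows. This assumes that an internal jump link is an href starting with "#" and that a script-handled link is an href starting with "javascript:"; the function name filter_links is hypothetical, and other patterns may need excluding on real pages.

```python
from urllib.parse import urljoin


def filter_links(base_url, hrefs):
    """Drop in-page anchors and javascript: links, and make the rest absolute."""
    kept = []
    for href in hrefs:
        href = href.strip()
        # Assumption: an internal jump link looks like "#section".
        if not href or href.startswith("#"):
            continue
        # Assumption: a script-handled link looks like "javascript:...".
        if href.lower().startswith("javascript:"):
            continue
        kept.append(urljoin(base_url, href))
    return kept


links = filter_links(
    "https://www.baidu.com/",
    ["#top", "javascript:void(0)", "/more", "https://example.com/a"],
)
print(links)
```

Resolving relative links against the page URL with urljoin keeps the later fetching step simple, because every link that survives the filter is already a full URL.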


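For example, a page can be fetched with requests:

import requests

url = "https://www.baidu.com/"
r = requests.get(url)
print(r.text)         # response body decoded to str
print(r.content)      # raw response body as bytes
print(r.status_code)  # HTTP status code, e.g. 200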

Because the crawler saves a large number of web pages, each saved file needs a unique name. Common ways to generate the name are:
1. domain + filename (names may collide)
2. MD5 hash (I could not get hashlib installed, so I could not use it; hashlib is part of the Python standard library, so normally no separate install is needed)
3. timestamp (my choice; the timestamp precision depends on the crawl speed, for example seconds or microseconds)
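
Here is a minimal sketch of the MD5 and timestamp options from the list above, assuming Python 3.7 or later (for time.time_ns). The function names and the ".html" suffix are choices made for this example, not something fixed by the approach.

```python
import hashlib
import time


def name_from_md5(url):
    """Same URL always gives the same 32-character hex name."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"


def name_from_timestamp():
    """Name based on the current time in microseconds since the epoch."""
    return "%d.html" % (time.time_ns() // 1000)


md5_name = name_from_md5("https://www.baidu.com/")
ts_name = name_from_timestamp()
print(md5_name)
print(ts_name)
```

The MD5 name is stable across runs, so re-crawling a URL overwrites the old file. The timestamp name is new on every call, which is the trade-off to consider if a page may be fetched more than once.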


File save path

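# Baidu
url = "http://www.baidu.com/?tn=sitehao123_15"
# Extract the domain
start_pos = url.find("//") + 2      # first character after the scheme's "//"
end_pos = url.find("/", start_pos)  # first "/" after the domain (rfind breaks on longer paths)
if end_pos == -1:                   # no path: the domain runs to the end of the URL
    end_pos = len(url)
domain = url[start_pos:end_pos]
print(domain)  # www.baidu.com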
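
As an alternative to slicing the string by hand, the domain can also be read with urllib.parse.urlsplit and combined with a file name into a save path. This is only a sketch: the function name build_save_path, the "pages" root directory, and the example file name are assumptions made for illustration.

```python
import os
from urllib.parse import urlsplit


def build_save_path(url, filename, root="pages"):
    """Return root/<domain>/<filename>, using the URL's host as the domain."""
    # hostname drops any port and is lowercased; fall back if the URL has no host.
    domain = urlsplit(url).hostname or "unknown"
    return os.path.join(root, domain, filename)


path = build_save_path("http://www.baidu.com/?tn=sitehao123_15", "1700000000.html")
print(path)
# Create the directory before writing the page, for example:
# os.makedirs(os.path.dirname(path), exist_ok=True)
```

Grouping files in one directory per domain keeps same-named pages from different sites apart, whichever naming method is used for the file itself.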
