Graduation Project Crawler (1)

The objects processed by the crawler are links, titles, paragraphs, and pictures.
Example: the Baidu homepage (https://www.baidu.com/).
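As a minimal sketch, these four object types could be pulled out of a fetched page with BeautifulSoup; the library choice and the tag selection are assumptions on my part rather than something stated in the post:

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com/"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

links = [a.get("href") for a in soup.find_all("a")]          # links
title = soup.title.string if soup.title else ""              # page title
paragraphs = [p.get_text() for p in soup.find_all("p")]      # paragraphs
pictures = [img.get("src") for img in soup.find_all("img")]  # pictures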


There are two types of links that must be excluded (see the filtering sketch after this list):
1. Internal jump (anchor) links, which only point to another position on the same page
2. Links handled by a script, which run JavaScript instead of leading to a new page
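A small helper like the following could implement that filter; the function name and the exact checks (a leading "#" for anchor links, a "javascript:" scheme for script links) are my assumptions:

def is_crawlable(href):
    """Skip links that do not lead to a new page."""
    if not href:
        return False
    if href.startswith("#"):                    # internal jump (anchor) link
        return False
    if href.lower().startswith("javascript:"):  # link handled by a script
        return False
    return True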


For example, fetching a page with requests:

import requests

url = "https://www.baidu.com/"
r = requests.get(url)
print(r.text)         # response body decoded as text
print(r.content)      # raw response bytes
print(r.status_code)  # HTTP status code, e.g. 200

Because the crawler saves a large number of web pages, each saved file needs a unique name. Common naming schemes are:
1. domain + filename (names may still collide)
2. MD5 of the URL via hashlib (I could not get hashlib working, so I did not use it)
3. timestamp (my choice; pick the precision, such as seconds or microseconds, according to the crawl speed; see the sketch below)
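A minimal sketch of the timestamp scheme, assuming microsecond precision is enough to keep names unique at the crawl speed in question:

import time

def timestamp_filename(extension=".html"):
    # Microseconds since the epoch, e.g. "1718000000123456.html"
    return str(int(time.time() * 1_000_000)) + extension

print(timestamp_filename())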


File save path

To build the save path, first extract the domain from the URL:

# Baidu
url = "http://www.baidu.com/?tn=sitehao123_15"
# Extract the domain
start_pos = url.find("//")  # search from the front
end_pos = url.rfind("/")    # search from the back
domain = url[start_pos + 2:end_pos]
print(domain)               # prints: www.baidu.com
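For reference, the standard library's urllib.parse gives the same result without manual index arithmetic; this is an alternative, not what the post uses:

from urllib.parse import urlparse

url = "http://www.baidu.com/?tn=sitehao123_15"
print(urlparse(url).netloc)  # prints: www.baidu.com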
