Table of Contents

- 1. Key points of the requests module
  - (1) What is the robots protocol?
  - (2) List the network request packages you have used for python web crawlers
  - (3) What is the difference between the http and https protocols?
  - (4) What anti-crawler measures have you encountered when writing crawlers, and how did you solve them?
  - (5) The difference between POST and GET
  - (6) Why is a proxy used?
- 2. Key points of data parsing
  - (1) When using regular expressions to parse page source data, should you use re.S or re.M?
  - (2) What is the role of the xpath plug-in?
  - (3) Review of xpath expressions
  - (4) Review of common methods of bs4 parsing
- 3. The use process of the coding platform
- 4. Mobile data crawling
  - 1. fiddler settings
  - 2. fiddler certificate download
  - 3. LAN settings
- 5. Scrapy focus review
  - 1. Persistent storage operations
    - (1) Persistent storage based on terminal commands
    - (2) Persistent storage based on pipelines
  - 2. Sending requests with scrapy
    - (1) Manual get request sending
    - (2) Post request sending
    - (3) When are request pass-through parameters used?
    - (4) UA pool and proxy pool
    - (5) Application of selenium in scrapy
    - (6) Implementation steps of a distributed crawler
- 6. Frequently Asked Questions
  - 1. How to significantly improve the efficiency of crawlers
  - 2. Multi-threaded crawling based on the requests module
  - 3. How to improve the crawling efficiency of scrapy
    - (1) Increase concurrency
    - (2) Reduce the log level
    - (3) Forbid cookies
    - (4) Forbid retries
    - (5) Reduce the download timeout
  - 4. Commonly used proxy IP websites and their prices
- 7. Writing crawler projects in your resume
  - 1. Questions about the types of crawled data
    - Common types of crawled data
  - 2. The problem of crawled data volume
  - 3. Relevant crawler projects in the resume (Demo)
  - 4. Project name: xxx news recommendation system data collection
1. Key points of the requests module

(1) What is the robots protocol?

The robots protocol (also called the crawler protocol, crawler rules, or robot protocol) is robots.txt: through it a website tells search engines which pages may be crawled and which may not.

The robots protocol is an ethical convention that prevails in the website community on the Internet. Its purpose is to protect website data and sensitive information and to ensure that users' personal information and privacy are not infringed. Because it is not an enforced command, search engines need to follow it voluntarily.
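A quick way to check a site's robots protocol from Python is the standard library's urllib.robotparser; a minimal sketch (the site url and user-agent string below are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # placeholder site
rp.read()

# ask whether a given user agent may fetch a given url
print(rp.can_fetch('MyCrawler', 'https://www.example.com/some/page'))
```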
(2) List the network request packages you have used for python web crawlers

requests, urllib
(3) What is the difference between the http and https protocols?

The http protocol is the hypertext transfer protocol, used to transfer information between a web browser and a website server. http sends content in plain text and provides no form of data encryption, which is easy for attackers to exploit: if a hacker intercepts the traffic between the browser and the server, the content can be read directly. Therefore http is not suitable for transmitting important and sensitive information such as credit card numbers and payment verification codes.

The https protocol (http over the secure socket layer) was created to fix this security flaw of http. For the security of data transmission, https adds the SSL protocol on top of http: SSL relies on certificates to verify the identity of the server and encrypts the communication between the browser and the server. Even if a hacker intercepts the traffic, it cannot be decrypted and understood, so the information exchanged between the website and its users receives much stronger security guarantees.
(4) What anti-crawler measures have you encountered when writing crawlers, and how did you solve them?

Anti-crawler measures encountered:

- Anti-crawling based on request headers;
- Anti-crawling based on user behavior: e.g. too many visits from the same IP in a short period of time;
- Dynamic web pages (data requested through ajax or generated by JavaScript);
- Verification codes;
- Data encryption.

Solutions:

- For basic web page crawling, customize the request headers and send them along with the request (usually User-Agent and Cookie); see the sketch after this list;
- Use an IP proxy pool to crawl, or reduce the crawling frequency;
- Use selenium + phantomjs to grab dynamic data, or find the json interface from which the dynamic data is loaded;
- Use a coding platform to recognize verification codes;
- For partially encrypted data, you can use selenium to take screenshots and recognize them with python's pytesseract library; the slowest but most direct method is to find the encryption routine and reverse it.
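A minimal sketch of the first countermeasure, sending a request with a customized User-Agent and Cookie via requests (the url and header values are placeholders):

```python
import requests

url = 'https://www.example.com/list'  # placeholder url
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # disguise as a normal browser
    'Cookie': 'sessionid=xxx',                                  # placeholder cookie value
}

response = requests.get(url=url, headers=headers)
page_text = response.text
```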
(5) The difference between POST and GET

- GET transmits data with lower security, POST with higher security, because POST parameters are not saved in the browser history or in the web server's access log;
- For data queries, GET is recommended; for adding, modifying or deleting data, POST is recommended;
- GET transmits data in the url (as part of the request line and headers), while POST transmits data in the request body;
- The amount of data GET can transmit is small because it can only be carried in the url, while POST can transmit a relatively large amount of data and is essentially unrestricted;
- In terms of execution efficiency, GET is better than POST.
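A small illustration of the two request styles with requests (the urls and parameter names are placeholders):

```python
import requests

# GET: the parameters end up in the url's query string
resp_get = requests.get('https://www.example.com/search', params={'kw': 'crawler'})

# POST: the parameters are carried in the request body
resp_post = requests.post('https://www.example.com/login', data={'user': 'xxx', 'pwd': 'xxx'})
```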
(6) Why is a proxy used?

A website's anti-crawler policy may detect that the same IP is visiting too frequently and then block that IP. To avoid this problem during crawling, an IP proxy is used.
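A minimal sketch of routing a requests call through a proxy (the proxy address is a placeholder):

```python
import requests

proxies = {
    'http': 'http://1.2.3.4:8888',   # placeholder proxy address
    'https': 'http://1.2.3.4:8888',
}

resp = requests.get('https://www.example.com', proxies=proxies, timeout=10)
```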
2. Key points of data parsing

(1) When using regular expressions to parse page source data, should you use re.S or re.M?

re.S is single-line mode: the regular expression is applied to the entire page source treated as one large string, even if it contains line breaks. re.M is multi-line mode: the regular expression is applied to the source code line by line. In practice, re.S is the one used.
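A small sketch of why re.S matters when the target spans several lines (the html snippet is made up):

```python
import re

html = '''<div class="content">
first line
second line
</div>'''

# Without re.S the dot does not match the newlines, so nothing is found;
# with re.S the whole multi-line block is captured.
print(re.findall(r'<div class="content">(.*?)</div>', html))        # []
print(re.findall(r'<div class="content">(.*?)</div>', html, re.S))  # ['\nfirst line\nsecond line\n']
```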
(2) The role of the xpath plug-in?
You can directly verify the xpath expression in the browser, and then apply the xpath to the program after the verification is successful.
(3) Review of xpath expressions

```
//div[@class='xxx']                 # attribute positioning
//div[@id='xxx']/a[2]/span          # hierarchical index positioning
//a[@href='' and @class='xxx']      # logical operators
//div[contains(@class, 'xx')]       # fuzzy matching
//div[starts-with(@class, 'xx')]    # fuzzy matching
/div/text()  or  /div//text()       # get text
/@attribute_name                    # get an attribute
```
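A hedged sketch of applying such expressions with lxml's etree (the html string and class name are made up):

```python
from lxml import etree

html = '<html><body><div class="song"><a href="https://example.com">link</a></div></body></html>'
tree = etree.HTML(html)

div_text = tree.xpath('//div[@class="song"]//text()')  # list of text nodes
href = tree.xpath('//div[@class="song"]/a/@href')       # list of attribute values
print(div_text, href)
```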
(4) Review of common methods of bs4 parsing

```python
find()        # find the first tag that meets the requirements
find_all()    # find all tags that meet the requirements
select()      # find tags that match a CSS selector
# selector examples used with select():
# 'div a ul li'    multi-level (descendant) selector
# 'div>a>ul>li'    single-level (child) selector
```
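A minimal BeautifulSoup sketch of the methods above (the html snippet is made up):

```python
from bs4 import BeautifulSoup

html = '<div class="list"><ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

first_a = soup.find('a')                              # first matching tag
all_a = soup.find_all('a')                            # all matching tags
list_links = soup.select('div.list > ul > li > a')    # CSS selector
print(first_a.text, len(all_a), [a['href'] for a in list_links])
```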
3. The use process of the coding platform

- Grab the page data that carries the verification code
- Parse the verification code out of the page and download the verification code image locally
- Submit the verification code image to the coding platform for recognition and receive the recognition result
Use of the cloud coding platform:

- Register a normal user account and a developer account on the official website
- Log in as the developer user:
  - a) Download the sample code: Development documents --> Call samples and latest DLL --> Download the PythonHTTP sample
  - b) Create a software entry: My software --> Add new software (the software's secret key and id will be used later)
- Use the sample code to recognize the verification code image saved locally (a rough sketch of the workflow follows)
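The workflow roughly looks like the sketch below. Note that the client class and method names in the commented part are hypothetical placeholders; the real ones come from the PythonHTTP sample code downloaded from the platform.

```python
import requests

# 1. download the captcha image locally (the url is a placeholder)
img_data = requests.get('https://www.example.com/captcha.jpg').content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# 2. submit the image to the coding platform using the downloaded sample code
#    (CodingClient, recognize, username/password/soft_id/code_type are hypothetical names)
# client = CodingClient(username='xxx', password='xxx', soft_id='xxx')
# code_text = client.recognize('./code.jpg', code_type=3000)
# print(code_text)   # the recognized verification code string
```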
4. Mobile data crawling
Blog address: https://www.cnblogs.com/bobo-zhang/p/10068994.html
# For detailed steps, see the video. Link: https://pan.baidu.com/s/1BiNd3IPA44xGszN9n93_hQ Extraction code: 6slk

1. fiddler settings

- Set the port number and allow fiddler to capture packets from other machines (allow remote computers to connect)
- Enable https packet capture
2. fiddler certificate download

- Make sure the phone and the fiddler host are on the same network segment
- Download the security certificate and install it on the phone: in the phone's browser, enter the IP of the fiddler machine and fiddler's port number to download the certificate
- After the download succeeds, install the certificate
- Refer to the blog above for detailed steps

3. LAN settings

- Set the phone's network proxy to the IP of the fiddler computer and the port number of fiddler
- Test: start fiddler and capture the network packets requested by the phone
5. Scrapy focus review
1. Persistent storage operations
(1) Persistent storage based on terminal commands
```
scrapy crawl <spider_name> -o xxx.json
scrapy crawl <spider_name> -o xxx.xml
scrapy crawl <spider_name> -o xxx.csv
```
(2) Pipeline-based persistent storage

1. Encapsulate the crawled data in an item object
2. Use yield to submit the item to the pipeline (see the sketch below)
3. In the process_item method of the pipeline file, perform the persistence operation on the data in the item
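A minimal sketch of steps 1 and 2 (the item class, spider name, field names and xpaths are assumptions):

```python
import scrapy

class DemoItem(scrapy.Item):              # assumed item class, normally in items.py
    title = scrapy.Field()
    content = scrapy.Field()

class DemoSpider(scrapy.Spider):          # assumed spider
    name = 'demo'
    start_urls = ['https://www.example.com']   # placeholder

    def parse(self, response):
        item = DemoItem()
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['content'] = response.xpath('//div[@class="content"]//text()').extract()
        yield item                        # submit the item to the pipeline
```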
Interview question: if one copy of the crawled data needs to be stored in a disk file and another copy in a database, how do you do it in scrapy?

- The code in the pipeline file is:

```python
# pipelines.py
# Each pipeline class's process_item method implements one persistence operation.

class DoublekillPipeline(object):
    def process_item(self, item, spider):
        # persistence code (form 1: write to a disk file)
        return item

# To implement another form of persistence, define another pipeline class:
class DoublekillPipeline_db(object):
    def process_item(self, item, spider):
        # persistence code (form 2: write to the database)
        return item

# Note: when the data must be stored in different places, create multiple classes in
# pipelines.py. The item is handed from one pipeline class to the next through the
# `return item` in process_item, so every class except the last one must return the item.
```
- Enable the pipelines in settings.py:

```python
# The following structure is a dictionary: the keys are the pipeline classes to execute
# and the values are their priorities (a lower value runs earlier).
ITEM_PIPELINES = {
    'doublekill.pipelines.DoublekillPipeline': 300,
    'doublekill.pipelines.DoublekillPipeline_db': 200,
}
# With this configuration, the process_item methods of both pipeline classes above are
# executed, implementing two different forms of persistence.
```
2. Sending requests with scrapy

(1) Manual get request sending

```python
yield scrapy.Request(url, callback)
# url: the url to request
# callback: the method that parses the requested page
```
(2) Post request sending

Override the start_requests method of the parent class (it is called automatically by the framework by default). In this method you can send a post request with (see the sketch below):

```python
yield scrapy.FormRequest(url, callback=callback, formdata=formdata)
```

- The formdata parameter carries the post request's parameters; its type is a dictionary.
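A minimal sketch of the override (the spider name, url and form fields are placeholders):

```python
import scrapy

class PostDemoSpider(scrapy.Spider):
    name = 'post_demo'
    start_urls = ['https://www.example.com/api']   # placeholder post url

    def start_requests(self):
        for url in self.start_urls:
            # send a post request instead of the default get request
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata={'kw': 'spider'})

    def parse(self, response):
        print(response.text)
```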
(3) When are request pass-through parameters used?

- When the data values that need to be parsed are not all on the same page, they must be handled by passing parameters with the request.
- Implementation (see the sketch after this list):

```python
yield scrapy.Request(url, callback, meta)
```

The meta parameter (a dictionary) can be used to pass the item object to the callback method specified by callback.
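A hedged sketch of passing an item between two callbacks via meta (the spider name, field names and xpaths are assumptions):

```python
import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['https://www.example.com/list']   # placeholder

    def parse(self, response):
        item = {'title': response.xpath('//h1/text()').extract_first()}
        detail_url = response.xpath('//a[@class="detail"]/@href').extract_first()
        # hand the half-filled item to the next callback through meta
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']                 # take the item back out of meta
        item['content'] = response.xpath('//div[@class="content"]//text()').extract()
        yield item
```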
(4) UA pool and proxy pool

Blog address: https://www.cnblogs.com/bobo-zhang/p/10013011.html

- Purpose of using a UA pool and a proxy pool: to counter the anti-crawler strategies of the target website.
- UA pool (see the middleware sketch after this list):
  - (1) Intercept the requests in the download middleware
  - (2) Tamper with (disguise) the UA in the intercepted request's header information
  - (3) Enable the middleware in the configuration file
- Proxy pool:
  - (1) Intercept the requests in the download middleware
  - (2) Change the IP of the intercepted request to a proxy IP
  - (3) Enable the middleware in the configuration file
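A minimal sketch of such a download middleware (the class name and the ua/proxy lists are assumptions):

```python
import random

ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15',
]                                                   # assumed UA pool
proxy_http = ['http://1.2.3.4:8888']                # assumed proxy pools
proxy_https = ['https://5.6.7.8:8888']

class RandomUAProxyMiddleware(object):
    def process_request(self, request, spider):
        # disguise the UA of every intercepted request
        request.headers['User-Agent'] = random.choice(ua_list)
        # swap the request's IP for a proxy IP
        if request.url.startswith('https'):
            request.meta['proxy'] = random.choice(proxy_https)
        else:
            request.meta['proxy'] = random.choice(proxy_http)
        return None
```

Remember to enable the middleware in DOWNLOADER_MIDDLEWARES in settings.py.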
(5) Application of selenium in scrapy

Blog address: https://www.cnblogs.com/bobo-zhang/p/10013045.html

Principle analysis:

When the engine submits the request for the news section url to the downloader, the downloader downloads the page data and encapsulates it in a response, which is returned to the engine and then forwarded to the spider. The page data stored in that response object contains no dynamically loaded news data. To obtain the dynamically loaded news data, the response object that the downloader hands to the engine must be intercepted in the download middleware, the page data stored inside it must be tampered with so that it carries the dynamically loaded news data, and the tampered response object is finally handed to the spider for parsing.

The use process of selenium in scrapy:

- Override the spider file's constructor (__init__) and use selenium to instantiate a browser object in it (the browser object only needs to be instantiated once, which is why it goes in the constructor).
- Override the spider file's closed(self, spider) method and close the browser object in it; this method is called when the spider ends.
- Override the download middleware's process_response method, letting it intercept the response object and tamper with the page data stored in the response (see the sketch after this list).
- Enable the download middleware in the configuration file.
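A minimal sketch of the process_response step (the spider attribute name `bro` for the browser object is an assumption):

```python
from scrapy.http import HtmlResponse

class SeleniumDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro           # the selenium browser instantiated in the spider's __init__
        bro.get(request.url)       # let the browser load the page, including dynamic data
        page_text = bro.page_source
        # replace the original response with one that carries the rendered page
        return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
```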
(6) Implementation steps of a distributed crawler:

- Import the package:

```python
from scrapy_redis.spiders import RedisCrawlSpider
```

- Change the parent class of the spider class to RedisCrawlSpider (a minimal spider sketch follows this list).
- Comment out the start_urls list and add a redis_key attribute (the name of the scheduler queue).
- Configure the redis database configuration file (redisxxx.conf):

```
# bind 127.0.0.1
protected-mode no
```

- Configure the settings.py of the project:
  - a) Configure the IP and port number of the redis server:

    ```python
    REDIS_HOST = 'redis service ip address'
    REDIS_PORT = 6379
    # REDIS_PARAMS = {'password': '123456'}
    ```

  - b) Configure the scheduler of the scrapy-redis component:

    ```python
    # Use the deduplication queue of the scrapy-redis component
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler that comes with the scrapy-redis component
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether pausing and resuming is allowed
    SCHEDULER_PERSIST = True
    ```

  - c) Configure the pipeline of the scrapy-redis component:

    ```python
    ITEM_PIPELINES = {
        # 'wangyiPro.pipelines.WangyiproPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    ```

  - d) Start the redis database service: redis-server <configuration file>
  - e) Run the spider file: scrapy runspider wangyi.py
  - f) Push a starting url into the scheduler's queue:
    - i. Open the redis client
    - ii. lpush <name of the scheduler queue (redis_key)> <starting url>
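A minimal sketch of a spider modified as described above (the names, the redis_key value and the xpaths are assumptions):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class WangyiSpider(RedisCrawlSpider):
    name = 'wangyi'
    # start_urls is commented out; the starting url is pushed into redis with lpush instead
    redis_key = 'wangyi_queue'        # assumed name of the scheduler queue
    rules = (
        Rule(LinkExtractor(allow=r'/news/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}                     # or a scrapy.Item subclass
        item['title'] = response.xpath('//h1/text()').extract_first()
        yield item
```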
6. Frequently Asked Questions
1. How to significantly improve the efficiency of the crawler
1. Use a machine with better performance
2. Use an optical fiber network
3. Multi-process
4. Multi-thread
5. Distributed crawling
2. Multi-threaded crawling based on the requests module

For multi-threaded crawling based on the requests module, the thread pool `multiprocessing.dummy.Pool` is recommended; the crawling efficiency improves significantly.

Code display:

```python
import requests
from bs4 import BeautifulSoup
# import the thread pool
from multiprocessing.dummy import Pool

pool = Pool()  # instantiate a thread pool object

# request the home page
page_text = requests.get(url='xxx').text

# use bs4 to parse all of the a tags on the home page
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('a')

# join the href attribute of each a tag with the domain name to form complete urls
url_list = ['www.xxx.com/' + url['href'] for url in a_list]

# request function: returns the page data corresponding to a url
request_page_text = lambda link: requests.get(link).text

# use the thread pool's map method to send the requests asynchronously
# and collect the page data returned by the responses
page_text_list = pool.map(request_page_text, url_list)

# data parsing method
def parse(data):
    pass

# use the thread pool to perform the parsing asynchronously
get_data = lambda data: parse(data)
pool.map(get_data, page_text_list)
```
3. How to improve the crawling efficiency of scrapy
(1) Increase concurrency

By default scrapy allows 16 concurrent requests, which can be increased appropriately. In the settings configuration file, set CONCURRENT_REQUESTS = 100 to raise the concurrency to 100.
(2) Reduce the log level

Running scrapy produces a lot of log output. To reduce CPU usage, set the log level to INFO or ERROR. In the configuration file write: LOG_LEVEL = 'INFO'

(3) Forbid cookies

If cookies are not really needed, they can be disabled while scrapy crawls, which reduces CPU usage and improves crawling efficiency. In the configuration file write: COOKIES_ENABLED = False

(4) Forbid retries

Re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be forbidden. In the configuration file write: RETRY_ENABLED = False

(5) Reduce the download timeout

When crawling very slow links, reducing the download timeout lets stuck requests be abandoned quickly, which improves efficiency. In the configuration file write: DOWNLOAD_TIMEOUT = 10 (a 10 s timeout).
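Putting the five options together, a hedged settings.py sketch (the values are the ones suggested above):

```python
# settings.py
CONCURRENT_REQUESTS = 100   # (1) increase concurrency
LOG_LEVEL = 'INFO'          # (2) reduce the log level (INFO or ERROR)
COOKIES_ENABLED = False     # (3) forbid cookies
RETRY_ENABLED = False       # (4) forbid retries
DOWNLOAD_TIMEOUT = 10       # (5) reduce the download timeout (seconds)
```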
4. Commonly used proxy IP websites and their proxy IP prices
i. National proxy IP network:
- Dynamic proxy IP
- Long-term proxy IP
ii. Xici proxy network:
- Package price:
iii. Fast proxy network:
- Package price:
7. Writing crawler projects in your resume
1. Questions about crawling data types
Common types of crawling data
- E-commerce product information
  - a) Data source websites: Jingdong, Tmall
- News and information data
  - a) Netease News, Tencent News, Toutiao, etc.
- Music lyrics, song titles, author information
  - a) Netease Cloud Music, QQ Music, etc.
- Parameter information of medical devices
  - a) 3618 Medical Device Network: http://www.3618med.com/product
  - b) China Medical Information Network: http://cmdi.gov.cn/publish/default/
- Meteorological and weather information
  - a) China air quality online monitoring and analysis platform: https://www.aqistudy.cn/
< h2 id="The problem of crawling data level">2. The problem of crawling data level
- For millions of data:
- a) Based on The requests module is crawled, and the duration is about 2 hours.
- b) Based on a distributed cluster (3 units), the duration is about 0.3 hours.
- c) In the production process of a general company, data If the amount is more than one million, it will generally use distributed data crawling
3. Relevant crawler projects in the resume (Demo)

(1) Project name: Fashion clothing matching data collection

(2) Project description:

This project uses the scrapy framework to crawl the sub-links in all categories and sub-categories under the navigation pages of websites such as Collocation.com and Dressing.com, together with the content of the linked pages, writes the data into the database, and provides it to the company as reference data.

(3) Responsibility description:

a. Responsible for crawling the information data
b. Responsible for analyzing the data crawling process
c. Responsible for analyzing the websites' anti-crawling techniques and providing anti-anti-crawling strategies
d. Used a thread pool for data crawling and collected about 130,000 records
e. Responsible for analyzing the collected data
4. Project name: xxx news recommendation system data collection

(1) Project description:

This project uses a distributed crawler based on RedisCrawlSpider to crawl the news data under the major sections of websites such as Netease News, Toutiao and Sina News, calls the Baidu AI interface on the crawled data for keyword extraction and article classification, and designs the database tables for data storage.

(2) Responsibility description:

a. Built a distributed cluster of 5 machines for news data crawling
b. Analyzed the data crawling process and designed anti-anti-crawling strategies
c. Performed data cleaning and outlier filtering on the nearly 300,000 records crawled by the distributed cluster
d. Called Baidu AI to extract keywords from and classify the news data