Contents
- 1. Introduction to crawlers
  - 1. What is a crawler
- 2. HTTP protocol format
- 3. Common libraries
- 4. Common parsing syntax
- 5. Common anti-crawling methods
1. Introduction to crawlers

1. What is a crawler
A crawler is an application that imitates the behavior of a browser: it sends requests to a server and collects the response data.

Process: initiate a request ===> get the data ===> parse the data ===> store the data

- Common request library: requests
- Common parsing syntax: see section 4
- Common database: MongoDB
- Common packet-capture tools: the browser's Network panel, Fiddler, mitmproxy
- Console: document.charset (view the page's encoding from JS)
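As a minimal sketch of that four-step flow, assuming a page with a `<title>` tag (the URL, the naive title extraction, and the output file name are placeholders for illustration):

```python
import requests

# 1. initiate a request (a browser-like user-agent avoids the most basic blocking)
response = requests.get(
    "https://example.com",
    headers={"user-agent": "Mozilla/5.0"},
    timeout=5,
)

# 2. get the data
html = response.text

# 3. parse the data (naive string slicing here; real parsing is covered in section 4)
title = html.split("<title>")[1].split("</title>")[0]

# 4. store the data
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```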
2. HTTP protocol format
Request:
- Request URL
- Request method:
  - "GET": parameters spliced onto the URL (?key=value)
  - "POST": carries a request body: 1. form data 2. json 3. files
- Request headers:
  - cookies: saved state, mainly used to record the user's login status
  - user-agent: identifies the client
  - referer: tells the server where the request came from (used for anti-hotlinking)
  - server-specific fields

Response:
- Status code
- Response headers:
  - location: the redirect URL
  - set-cookie: sets a cookie
  - server-specific fields
- Response body: 1. html 2. binary: image, video, audio 3. json 4. jsonp: allows cross-domain requests
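To see these request and response parts on a live exchange, here is a small sketch using requests against httpbin.org, a public echo service chosen purely for illustration (the parameter values are made up):

```python
import requests

resp = requests.post(
    "https://httpbin.org/post",
    params={"q": "spider"},                    # spliced onto the URL as ?q=spider
    data={"user": "alice"},                    # request body sent as form data
    headers={"user-agent": "Mozilla/5.0",
             "referer": "https://example.com"},
)

print(resp.status_code)                 # status code, e.g. 200
print(resp.headers["Content-Type"])     # one of the response headers
print(resp.json()["form"])              # response body parsed as JSON
```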
3. Common libraries
Request library: requests. Installation: pip install requests

GET request: requests.get() with parameters:
- url
- headers={"user-agent": "xxx", ...}
- params={}: URL query parameters
- cookies={} (cookies given in headers take precedence)
- proxies={"http": "http://..."}: proxy IP
- timeout=0.5: timeout in seconds
- allow_redirects=True: allow redirection

POST request: requests.post() with parameters:
- the first 6 are the same as for a GET request
- data={}: form data
- json={} or '123' (note: pass only one of json and data)
- files={"files": open("file path", "rt", encoding="utf-8")}

Response: response = requests.get(...)
- response.url: the request URL
- response.text: the response body as text
- response.encoding = 'gbk': set the text encoding
- response.content: the binary body
- response.json(): parse the body as JSON, equivalent to json.loads(response.text)
- response.status_code: the status code
- response.headers: the response headers
- response.history: [response object 1, ...], the responses from any redirects

Requests that automatically keep cookies: session = requests.session(), then r = session.get(...) or r = session.post(...)

Supplement (saving cookies locally): import http.cookiejar as cookielib, set session.cookies = cookielib.LWPCookieJar(), then session.cookies.save(filename='1.txt') saves the cookies to a local file and session.cookies.load(filename='1.txt') loads them back.
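Putting the session and cookie-saving pieces together, here is a sketch of persisting login state across runs; the login URL, form fields, and file name are hypothetical placeholders:

```python
import os
import http.cookiejar as cookielib
import requests

session = requests.session()
# swap in a cookie jar that can be written to / read from disk
session.cookies = cookielib.LWPCookieJar(filename="1.txt")

if os.path.exists("1.txt"):
    # reuse the login state saved by a previous run
    session.cookies.load(filename="1.txt", ignore_discard=True)
else:
    # log in once (hypothetical endpoint), then keep the resulting cookies locally
    session.post("https://example.com/login",
                 data={"user": "alice", "password": "..."})
    session.cookies.save(filename="1.txt", ignore_discard=True)

# later requests on this session carry the saved cookies automatically
r = session.get("https://example.com/profile")
```

ignore_discard=True is needed because session cookies without an expiry are otherwise dropped when the jar is written to disk.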
4. Common parsing syntax
CSS selectors:
- Class selector (.classname)
- id selector (#id)
- Tag selector (tag)
- Descendant selector (ancestor descendant): selects all descendants under the ancestor
- Child selector (parent > child): selects all direct children of the parent tag
- Sibling selector (A ~ B): selects B tags that follow A at the same level
- Attribute selectors: [attr] and [attr=value] for exact matches (the value must not contain spaces); [attr^=value], [attr$=value], [attr*=value] for prefix, suffix, and substring matches
- Group selector (selector1, selector2): means OR
- Compound selector (selector1selector2): means AND
- Pseudo-class selector (selector:active/:hover/:first-child/:visited): selects the element only when it is in the given state

XPath selectors: omitted.
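A short sketch of these selectors in code, assuming BeautifulSoup (pip install beautifulsoup4) as the parsing library and a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
  <a class="link" href="https://example.com/a">first</a>
  <a class="link" href="/b">second</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select(".link"))              # class selector
print(soup.select("#box"))               # id selector
print(soup.select("div > a"))            # child selector
print(soup.select("a ~ a"))              # sibling selector
print(soup.select('a[href^="https"]'))   # attribute selector, prefix match
```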
5. Common anti-crawling methods

- Detecting browser headers
- IP blocking
- Image CAPTCHAs
- Slider CAPTCHAs with JS mouse-track checks
- Front-end anti-debugging
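Against the first two (header detection and IP blocking), the simplest countermeasures are rotating user-agents and routing requests through proxies; a sketch with placeholder values (the user-agent strings are truncated and the proxy address is not a working proxy):

```python
import random
import requests

# a small pool of desktop user-agent strings (truncated placeholders)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

resp = requests.get(
    "https://example.com",
    headers={"user-agent": random.choice(USER_AGENTS)},  # vary the UA per request
    proxies={"http": "http://127.0.0.1:8080",            # placeholder proxy address
             "https": "http://127.0.0.1:8080"},
    timeout=5,
)
print(resp.status_code)
```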