Basic knowledge of web crawlers

One: What is a crawler

A crawler is a program written to simulate a browser surfing the Internet and then grab data from it.
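As a minimal illustration of this idea, the sketch below uses Python's standard-library urllib to fetch a page the way a browser's initial request would (the target URL is only a placeholder):

```python
from urllib.request import urlopen

# Fetch a page, just like a browser's initial GET request would.
url = "http://example.com"  # placeholder target
with urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html[:200])  # show the beginning of the downloaded page
```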

Two: Classification of crawlers

1. General crawlers: simply put, a general crawler downloads as many web pages on the Internet as possible to local servers to form a backup. After these pages are processed (keywords extracted, advertisements removed), a search interface is provided to users.

2. Focused crawlers: a focused crawler grabs specified data from the network according to specified requirements. For example: get the name and reviews of a movie on Douban, instead of getting every data value on the entire page.

Three: Anti-crawling

– Portal websites use corresponding strategies and technical means to prevent crawlers from crawling website data.

Four: Anti-anti-crawling

– The crawler program uses corresponding strategies and technical means to defeat the portal website's anti-crawling measures and thereby obtain the target data.

Five: The HTTP protocol

1 Concept

HTTP stands for Hyper Text Transfer Protocol, a transfer protocol used to transfer hypertext from World Wide Web (WWW) servers to a local browser.

2 Features

  • HTTP is based on TCP/IP: the HTTP protocol is an application-layer protocol built on top of the TCP/IP protocol suite.
  • HTTP is connectionless: connectionless means that each connection handles only one request. After the server has processed the client's request and received the client's reply, it disconnects. This saves transmission time.
  • HTTP follows a request-response model: the HTTP protocol stipulates that the client sends a request and the server responds to that request and returns the result.
  • HTTP is stateless: statelessness means the protocol has no memory of previous transactions. If earlier information is needed for later processing, it must be retransmitted, which may increase the amount of data sent per connection. On the other hand, when the server does not need previous information, it responds faster.
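To make the request-response model concrete, here is a minimal sketch that speaks raw HTTP over a TCP socket (example.com is a stand-in host); the "Connection: close" header mirrors the connectionless behaviour described above:

```python
import socket

# Open a TCP connection to the web server (HTTP rides on top of TCP/IP).
host = "example.com"
with socket.create_connection((host, 80)) as sock:
    # One request per connection; ask the server to close it afterwards.
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))

    # Read the full response until the server closes the connection.
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

print(response.split(b"\r\n\r\n", 1)[0].decode())  # status line + headers
```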

3 HTTP URL

HTTP uses Uniform Resource Identifiers (URIs) to transmit data and establish connections. A URL is a special type of URI that contains enough information to locate a particular resource.

Example analysis: http://www.aspxfans.com:8080/news/index.asp?boardID=5&ID=24618&page=1#name

  • Protocol part: the protocol part of this URL is “http:”;
  • Domain name part: the domain name part of this URL is “www.aspxfans.com”;
  • Port part: the port follows the domain name, with “:” used as the separator between them. The port in this example is “8080”. The port is not a required part of a URL; if it is omitted, the default port is used;
  • Virtual directory part: from the first “/” after the domain name to the last “/” is the virtual directory part. The virtual directory is not a required part of a URL. The virtual directory in this example is “/news/”;
  • File name part: from the last “/” after the domain name to the “?” is the file name part; if there is no “?”, it runs from the last “/” to the “#”; if there is neither “?” nor “#”, it runs from the last “/” to the end of the URL. The file name in this example is “index.asp”. The file name part is not a required part of a URL; if it is omitted, the default file name is used;
  • Anchor part: from “#” to the end of the URL is the anchor part. The anchor in this example is “name”. The anchor part is not a required part of a URL;
  • Parameter part: the part between “?” and “#” is the parameter part, also known as the search or query part. The parameters in this example are “boardID=5&ID=24618&page=1”. Multiple parameters are allowed, separated from each other by “&”.
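As a quick check of the breakdown above, Python's standard-library urllib.parse splits the example URL into these same parts:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.aspxfans.com:8080/news/index.asp?boardID=5&ID=24618&page=1#name"
parts = urlparse(url)

print(parts.scheme)    # 'http'              -> protocol part
print(parts.hostname)  # 'www.aspxfans.com'  -> domain name part
print(parts.port)      # 8080                -> port part
print(parts.path)      # '/news/index.asp'   -> virtual directory + file name
print(parts.query)     # 'boardID=5&ID=24618&page=1' -> parameter part
print(parts.fragment)  # 'name'              -> anchor part
print(parse_qs(parts.query))  # {'boardID': ['5'], 'ID': ['24618'], 'page': ['1']}
```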

4 HTTP Request


Common request headers:

  • Accept: the browser tells the server which data types it can accept;
  • Accept-Charset: the browser tells the server which character sets it supports;
  • Accept-Encoding: the browser tells the server which compression formats it supports;
  • Accept-Language: the browser tells the server its language environment;
  • Host: the browser tells the server which host it wants to visit;
  • If-Modified-Since: the browser tells the server when its cached copy of the data was obtained;
  • Referer: the browser tells the server which page the client came from (used for anti-hotlinking);
  • Connection: the browser tells the server whether to disconnect or keep the connection alive after the request;
  • X-Requested-With: a value of XMLHttpRequest indicates the request was made via Ajax;
  • User-Agent: identifies the client making the request.
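The sketch below shows how a crawler might set some of these request headers itself. It assumes the third-party requests library, and the User-Agent string and target URL are placeholders:

```python
import requests

url = "http://example.com"  # placeholder target
headers = {
    # Pretend to be a regular desktop browser (placeholder UA string).
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://example.com/",  # pretend we navigated from this page
    "Connection": "close",             # do not keep the connection alive
}

response = requests.get(url, headers=headers)
print(response.request.headers)  # the headers that were actually sent
```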

5 HTTP Response


Common response header information:

  • Location: the server tells the browser where to redirect;
  • Server: the server tells the browser what server software it is running;
  • Content-Encoding: the server tells the browser the compression format of the data;
  • Content-Length: the server tells the browser the length of the data;
  • Content-Language: the server tells the browser the language environment;
  • Content-Type: the server tells the browser the type of the returned data;
  • Refresh: the server tells the browser to refresh at a given interval;
  • Content-Disposition: the server tells the browser to open the data in download mode;
  • Transfer-Encoding: the server tells the browser that the data is sent back in chunks;
  • Expires: -1, Cache-Control: no-cache and Pragma: no-cache tell the browser not to cache the response.
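A crawler can inspect these response headers directly. A small sketch with the third-party requests library (example.com as a placeholder target):

```python
import requests

response = requests.get("http://example.com")  # placeholder target

# response.headers is a case-insensitive dict of the response headers.
print(response.headers.get("Content-Type"))    # e.g. 'text/html; charset=UTF-8'
print(response.headers.get("Content-Length"))  # may be absent for chunked responses
print(response.headers.get("Server"))          # server software, if disclosed
print(response.headers.get("Cache-Control"))   # caching directives, if any
```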


Response status code

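Every response also carries a status code that the crawler should check before parsing the body. A minimal sketch (third-party requests library, placeholder URL):

```python
import requests

response = requests.get("http://example.com")  # placeholder target

# Status code classes: 1xx informational, 2xx success, 3xx redirection,
# 4xx client error, 5xx server error.
print(response.status_code)  # e.g. 200

if response.ok:                  # True for status codes below 400
    html = response.text
else:
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
```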

Six: The HTTPS protocol

HTTPS (Secure Hypertext Transfer Protocol) is the secure hypertext transfer protocol. HTTPS builds an SSL encryption layer on top of HTTP and encrypts the transmitted data; it is the secure version of the HTTP protocol.
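When a crawler fetches an HTTPS URL with the third-party requests library, the SSL/TLS handshake and certificate verification are handled automatically; the sketch below (placeholder URL) just makes the default explicit:

```python
import requests

# verify=True (the default) checks the server's SSL certificate.
response = requests.get("https://example.com", verify=True)
print(response.url)          # the final URL after any redirects
print(response.status_code)

# Passing verify=False would skip certificate checking, e.g. for a
# self-signed test server, but it disables the protection HTTPS provides.
```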


Seven: The basic crawler workflow

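In general a crawler follows a few steps: send a request, receive the response, parse the content you care about, and store the result. A minimal end-to-end sketch using only the standard library (the URL and the title-extraction regex are illustrative assumptions):

```python
import re
from urllib.request import Request, urlopen

# 1. Send the request (pretend to be a browser via the User-Agent header).
url = "http://example.com"  # placeholder target
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# 2. Receive the response.
with urlopen(request) as response:
    html = response.read().decode("utf-8")

# 3. Parse the content: here, just pull out the page title with a regex.
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else ""

# 4. Store the data.
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(f"{url}\t{title}\n")

print("saved:", title)
```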

