The concept and role of crawlers

Contents

01 The concept of crawlers

02 Crawler process

03 HTTP protocol

04 WEBSOCKET

  • The concept of crawlers

Crawler concept

The more formal name for crawling is data collection; in English a crawler is often called a spider. It is a program that collects data from the Internet fully automatically. A search engine, for example, is a kind of crawler. What a crawler does is simulate a normal network request: clicking a link on a website, for example, sends a network request.

The role of crawlers

Now that the era of big data has arrived, web crawling has become an indispensable part of it. Companies need data to analyze user behavior, find the shortcomings of their own products, and learn about competitors, and the precondition for all of this is data collection. Companies such as Toutiao are well known for using crawlers.

  • The crawling process

The essence of crawlers

The essence of a crawler is to automatically simulate the network request a normal human user would initiate, and then obtain the data that request returns. In essence this is no different from a person manually clicking a link and loading a web page to get data.
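As a minimal sketch of that idea, the function below sends a single request and returns the body as text, using only Python's standard library. The name `fetch_text` and the default timeout are illustrative assumptions, not something the article prescribes.

```python
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Send one request to `url` and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text("https://example.com")` (a placeholder address) would return the same HTML a browser receives when a person opens that page.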

The difficulties of crawlers

The difficulties of crawlers lie mainly in two directions.

Data acquisition: generally speaking, the websites we want to crawl do not want their data crawled, so they take anti-crawler measures to stop us. We in turn need corresponding measures to get past those anti-crawler measures.

Crawling speed: the volume of data we target is sometimes very large, tens of millions or even billions of records, and some of it must be updated in real time, so crawling speed also matters a great deal. We generally use concurrency and distributed crawling to solve speed problems.
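The concurrency mentioned here can be sketched with a thread pool from the standard library. The helper name `crawl_concurrently`, the `fetch` callable, and the worker count are assumptions made for illustration; a distributed setup would spread the same idea across several machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def crawl_concurrently(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every URL with a pool of worker threads; map URL -> result."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(url_list, pool.map(fetch, url_list)))
```

Threads suit this workload because a crawler spends most of its time waiting on the network. In this sketch an exception raised by `fetch` propagates to the caller, so a real crawler would likely catch errors per URL.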

  • Network request

A network request is simply a transfer of data over the Internet. So that data can reach the target host correctly and quickly across a complex network, many network protocols have been defined, that is, rules for how data is transmitted. Of these protocols, the ones we use most are HTTP and HTTPS.

HTTP protocol header

The headers that matter most to a crawler are User-Agent and Referer.
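A small sketch of setting these two headers with the standard library follows. The header values and the name `build_request` are hypothetical; real values would be copied from an actual browser session.

```python
from urllib.request import Request

# Hypothetical values: copy the real strings from your own browser session.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
    "Referer": "https://example.com/",
}


def build_request(url: str) -> Request:
    """Attach the two headers to a request object without sending it."""
    return Request(url, headers=BROWSER_HEADERS)
```

User-Agent tells the server which client is asking, and Referer tells it which page the request came from; passing the returned object to `urllib.request.urlopen` would send it.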

HTTP request method

In practice, GET and POST are the methods used most.
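The difference between the two methods can be sketched by building one request of each kind. The function names and the example URLs in the usage note are assumptions for illustration only.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(url: str, params: dict) -> Request:
    """GET: parameters travel in the query string of the URL."""
    return Request(f"{url}?{urlencode(params)}", method="GET")


def build_post(url: str, form: dict) -> Request:
    """POST: parameters travel in the request body as form-encoded bytes."""
    body = urlencode(form).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")
```

For example, `build_get("https://example.com/search", {"q": "web crawler"})` produces a request for `https://example.com/search?q=web+crawler`, while `build_post` keeps the same kind of data out of the URL and in the body.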

Status code

200 The request succeeded, and the response headers or body the request asked for are returned with this response.
302 The requested resource temporarily resides under a different URI. Because the redirection is temporary, the client should keep sending future requests to the original address. The response is cacheable only if Cache-Control or Expires says so. The new temporary URI should be returned in the Location field of the response. Unless this was a HEAD request, the response body should include a hyperlink to the new URI and a short description. If the request was neither GET nor HEAD, the browser must not redirect automatically unless the user confirms, because the conditions of the request may have changed. Note: although RFC 1945 and RFC 2068 do not allow the client to change the request method on redirection, many existing browsers treat a 302 response as a 303 response.
404 The request failed because the requested resource was not found on the server. Nothing tells the user whether the condition is temporary or permanent. If the server knows that the old resource is permanently unavailable because of some internal configuration mechanism and has no address to redirect to, it should use the 410 status code instead. The 404 status code is also widely used when the server does not want to reveal why the request was rejected, or when no other response fits.
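One way a crawler might act on these codes is sketched below. The action names ("parse", "follow", "skip", "retry_later") and the function `next_step` are hypothetical choices for this example, not part of HTTP.

```python
from http import HTTPStatus
from typing import Mapping, Optional, Tuple


def next_step(status: int, headers: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (action, url) describing what a crawler might do next."""
    if status == HTTPStatus.OK:
        return ("parse", None)
    if status == HTTPStatus.FOUND:
        # 302: temporary redirect; the new address is in the Location header.
        return ("follow", headers.get("Location"))
    if status in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        # 404 or 410: nothing to collect at this address.
        return ("skip", None)
    return ("retry_later", None)
```

Note that `urllib.request.urlopen` follows redirects on its own by default, so a function like this is mainly useful when redirects are handled manually or logged.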

Cookie and Session

A cookie stores its data on the client, in the browser, while session data is kept on the server.
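To illustrate the client side of this, the sketch below turns a `Set-Cookie` value received from a server into the `Cookie` value a client sends back on later requests. The function name and the sample cookie in the usage note are made up for the example.

```python
from http.cookies import SimpleCookie


def cookie_header_from(set_cookie_value: str) -> str:
    """Turn one Set-Cookie response value into a Cookie request value."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Attributes such as Path or HttpOnly stay on the client; only the
    # name=value pairs are sent back to the server.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

Given `"sessionid=abc123; Path=/; HttpOnly"`, the function returns `"sessionid=abc123"`. The server would typically use that identifier to look up the session data it holds on its side.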

HTTPS and WebSocket protocols

Disadvantages of HTTP protocol

The biggest disadvantage of the HTTP protocol is plaintext transmission. A large part of the path that data travels on the network is exposed to the public environment, so the data is very easy to leak.

HTTPS protocol

Because of this shortcoming of HTTP, the HTTPS protocol was introduced, which runs HTTP over SSL or TLS.

Before an HTTPS request is sent, the client and the server perform a handshake in which the server presents its certificate. The data HTTPS transmits is then encrypted with keys established during that handshake, which avoids the risk of data leakage.
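On the client side, certificate checking can be sketched with Python's `ssl` module. The function name `strict_tls_context` is an assumption for this example; the two settings it applies are already the defaults of `ssl.create_default_context()` and are spelled out only to make them visible.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a context that verifies the server certificate and host name."""
    context = ssl.create_default_context()
    # Both settings are already the defaults; stated explicitly for the reader.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context
```

The context can be passed to `urllib.request.urlopen(url, context=strict_tls_context())`, in which case the request fails if the certificate cannot be verified.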

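HTTPS guarantees security but also reduces transmission speed: the handshake adds round trips when a connection is set up, and encryption adds processing work on both ends. How much slower it is than HTTP depends on the workload, and the cost is smallest when connections are reused.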

WebSocket protocol

WebSocket is a protocol, introduced alongside HTML5, for full-duplex communication over a single TCP connection.

HTTP is a stateless, connectionless, one-way application-layer protocol. It uses a request/response model: communication can only be initiated by the client, and the server responds to the request.

This communication model has a drawback: with HTTP, the server cannot initiate a message to the client.

Because requests flow only one way, it is very troublesome for the client to learn about continuous state changes on the server. Many web applications work around this with polling: either frequent repeated requests, or long polling, where a request is held open until the server has something to report. Both are inefficient and waste resources, because they cost either many mostly empty requests or an HTTP connection that has to be kept open.
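The cost of polling can be seen in a small sketch. Here `read_state` stands in for one HTTP request to the server; that callable, the function name, and the default interval are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def poll_for_change(
    read_state: Callable[[], T],
    interval: float = 1.0,
    max_requests: int = 10,
) -> Tuple[Optional[T], int]:
    """Ask repeatedly until the state differs from the first answer.

    Returns (new_state, requests_made); new_state is None when nothing
    changed within max_requests.
    """
    initial = read_state()
    requests_made = 1
    while requests_made < max_requests:
        time.sleep(interval)
        current = read_state()
        requests_made += 1
        if current != initial:
            return current, requests_made
    return None, requests_made
```

Every request that returns the unchanged state is wasted work, which is the inefficiency that a server-initiated message avoids.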

With the WebSocket API, the browser and the server only need to complete one handshake, after which a fast channel exists between them and data can be transferred directly in both directions.

The browser sends a request to the server through JavaScript to establish a WebSocket connection. Once the connection is established, the client and server can exchange data directly over the TCP connection.

Because WebSocket appeared relatively recently and its range of applications is comparatively narrow, it is used less often than HTTP.
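One concrete part of that handshake can be sketched in a few lines. Under RFC 6455, the client sends a `Sec-WebSocket-Key` header and the server proves it understood the upgrade by answering with a `Sec-WebSocket-Accept` value derived from it. The function name below is an assumption; the GUID constant is the fixed value defined by the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's key."""
    combined = (sec_websocket_key + WEBSOCKET_GUID).encode("ascii")
    digest = hashlib.sha1(combined).digest()
    return base64.b64encode(digest).decode("ascii")
```

With the sample key from the RFC, `websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")` returns `"s3pPLMBiTxaQ9kYGzzhZRbK+xOo="`. A crawler that speaks WebSocket directly would check this value before treating the connection as open.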
