Contents
01 The concept of crawlers
02 Crawler process
03 HTTP protocol
04 WebSocket
Crawler concept
A more formal name for web crawling is data collection.
1. Bug found: while crawling the trial publication details pages on chinadrugtrials, the program was found to break in several places, as follows:
After investigation,
One: Introduction to Scrapy's core components. 1: Engine (Scrapy engine): responsible for processing the data flow of the entire system and for triggering events (the core).
2: Scheduler: accepts requests sent over by the engine, puts them into a queue, and returns them when the engine asks for the next request.
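The engine/scheduler split described above can be sketched as a toy in plain Python. All class and function names below (`Scheduler`, `Engine`, `fake_downloader`) are hypothetical illustrations of the idea, not Scrapy's actual API:

```python
from collections import deque

class Scheduler:
    """Toy scheduler: holds pending requests (URLs) in a FIFO queue."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:  # de-duplicate, as Scrapy's scheduler also does
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None

class Engine:
    """Toy engine: pulls requests from the scheduler, dispatches them to a
    downloader, and feeds newly discovered URLs back to the scheduler."""
    def __init__(self, scheduler, downloader):
        self.scheduler = scheduler
        self.downloader = downloader

    def crawl(self, start_url):
        self.scheduler.enqueue(start_url)
        results = []
        while (url := self.scheduler.next_request()) is not None:
            body, new_urls = self.downloader(url)  # "download" the page
            results.append(body)
            for u in new_urls:                     # schedule links found on the page
                self.scheduler.enqueue(u)
        return results

# Fake downloader standing in for real HTTP: url -> (page body, links on page)
def fake_downloader(url):
    pages = {"a": ("page-a", ["b", "c"]), "b": ("page-b", ["c"]), "c": ("page-c", [])}
    return pages[url]

print(Engine(Scheduler(), fake_downloader).crawl("a"))  # ['page-a', 'page-b', 'page-c']
```

In real Scrapy, the downloader, spiders, and item pipelines are additional components wired to the same engine; this toy only shows how the engine drives the scheduler's queue.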
1. Baidu search keyword submission
The format of Baidu search path is: http://www.baidu.com/s?wd=keyword
import requests

keyword = "Python"
try:
    kv = {'wd': keyword}
    url = "http://www.baidu.com/s"
    r = requests.get(url, params=kv)  # requests assembles ?wd=Python onto the URL
    r.raise_for_status()
    print(len(r.text))
except Exception:
    print("crawl failed")
from urllib.parse import urlencode, quote
from oauthlib.common import urldecode

def decodeUrl(url):
    """
    :param url: pass in a link to be decoded
    :return: output tuple (url, a dictionary containing parameters)
    """
    # Body reconstructed as a sketch: split off the query string and decode it
    base, _, query = url.partition('?')
    params = dict(urldecode(query)) if query else {}
    return base, params
Previous situation summary:
The request header is one way of disguising who is operating the request, because the request header carries a great deal of identifying information;
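A minimal sketch of setting request headers with `requests`. The header values below are illustrative assumptions (any real User-Agent string would do); the request is only prepared locally, not sent, so the headers can be inspected without network access:

```python
import requests

# Illustrative header values; the exact User-Agent string is an assumption
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.example.com/",
}

# Prepare the request locally (no network needed) to see what would be sent
req = requests.Request("GET", "https://www.example.com/page", headers=headers)
prepared = req.prepare()
print(prepared.headers["User-Agent"])
```

Passing the same `headers` dict to `requests.get(url, headers=headers)` sends them for real; sites often inspect User-Agent and Referer to distinguish browsers from scripts.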
The company's R&D department cannot access the Internet, but the company hopes its R&D colleagues can still follow the news, stay on top of technology hotspots, and keep up with the trends.