Crawler introduction and the Requests module

1 Introduction to crawlers

Overview

In recent years, as network applications have gradually expanded and deepened, obtaining online data efficiently has become the pursuit of countless companies and individuals. In the era of big data, whoever holds more data can obtain greater benefits, and web crawlers are the most commonly used means of collecting data from the Internet.
A web crawler, or Web Spider, is a very vivid name. If the Internet is likened to a spider web, then the Spider is a spider crawling around on that web. Web spiders find web pages through their link addresses: starting from some page of a website (usually the homepage), the spider reads the page content, finds the other link addresses it contains, and then follows those links to the next pages, looping this way until every page of the site has been crawled.

The value of crawlers

The most valuable thing on the Internet is data, such as the product information on Tmall, the rental listings on Lianjia.com, or the securities and investment information on Xueqiu.com. This data represents real money in every industry; it is fair to say that whoever holds the first-hand data of an industry becomes the master of that industry. If the data of the entire Internet is likened to a treasure, then this crawler course teaches you how to mine that treasure efficiently. Once you master crawling skills, you effectively become the boss behind every Internet information company; in other words, they are all providing you with valuable data for free.

robots.txt protocol

If you do not want the data on certain pages of your portal site to be crawled, you can constrain crawlers by publishing a robots.txt file. The format of the robots protocol can be observed in Taobao's robots.txt (just visit www.taobao.com/robots.txt). Note, however, that this protocol is only the equivalent of a verbal agreement and is not enforced by any technical means, so it guards against gentlemen but not against villains. The crawlers we write while learning can ignore the robots protocol for now.

Basic workflow of a crawler


Prerequisite knowledge

The HTTP protocol

2 The requests module

Requests is an HTTP library written in Python on top of urllib and released under the Apache2 License. It is more convenient than urllib and saves us a lot of work. In a word, requests is the simplest and easiest-to-use HTTP library implemented in Python, and it is the recommended library for crawlers. A default Python installation does not include the requests module; it has to be installed separately through pip.
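For example, from the command line:

pip install requests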

2.1 Basic syntax

Request methods supported by the requests module

import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

GET requests

1 Basic request

import requests

response = requests.get('https://www.jd.com/')

with open("jd.html", "wb") as f:
    f.write(response.content)

2 Request with parameters

import requests

response = requests.get('https://s.taobao.com/search?q=mobile')
response = requests.get('https://s.taobao.com/search', params={"q": "beauties"})
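As a quick sanity check (not part of the original code), httpbin can be used to see how the params dict is percent-encoded into the query string:

import requests

response = requests.get('http://httpbin.org/get', params={"q": "beauties"})
print(response.url)   # http://httpbin.org/get?q=beauties  -- params are appended as a query string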

3 Request with headers

import requests

response = requests.get('https://dig.chouti.com/',
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })

4 Requests with cookies

import uuid
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))

res = requests.get(url, cookies=cookies)
print(res.text)

POST requests

1 The data parameter
The usage of requests.post() is exactly the same as requests.get(); the only difference is that requests.post() takes an extra data parameter, which holds the data of the request body.

response = requests.post("http://httpbin.org/post", params={"a": "10"}, data={"name": "yuan"})
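Because httpbin echoes the request back, you can see (a quick check, not in the original) that params end up in the query string while data ends up in the form-encoded body:

import requests

response = requests.post("http://httpbin.org/post", params={"a": "10"}, data={"name": "yuan"})
body = response.json()
print(body["args"])   # {'a': '10'}      -> query-string parameters
print(body["form"])   # {'name': 'yuan'} -> form-encoded request body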

2 Sending JSON data

import requests

res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})   # no request header specified; default Content-Type: application/x-www-form-urlencoded
print(res1.json())

res2 = requests.post(url='http://httpbin.org/post', json={'age': "22"})      # default Content-Type: application/json
print(res2.json())
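Continuing the snippet above, httpbin also echoes the request headers, so the Content-Type chosen for each call can be verified (a quick check, not part of the original):

print(res1.json()["headers"]["Content-Type"])   # application/x-www-form-urlencoded
print(res2.json()["headers"]["Content-Type"])   # application/json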

The response object

(1) Common attributes

import requests

response = requests.get('https://sh.lianjia.com/ershoufang/')

# common response attributes
print(response.text)                  # body decoded to str using response.encoding
print(response.content)               # raw body bytes
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)
print(response.encoding)

(2) Encoding issues

import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = 'gbk'   # Autohome returns gb2312-encoded pages, while requests defaults to ISO-8859-1; without setting gbk the Chinese characters are garbled

with open("res.html", "w") as f:
    f.write(response.text)
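Instead of hard-coding gbk, requests can also guess the encoding from the page content itself via response.apparent_encoding; a minimal sketch, assuming the detected encoding is good enough for this page:

import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = response.apparent_encoding   # detect the encoding from the response body instead of hard-coding 'gbk'
with open("res.html", "w", encoding="utf-8") as f:
    f.write(response.text)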

(3) Downloading binary files (images, videos, audio)

import requests

response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg')

with open("res.png", "wb") as f:
    # f.write(response.content)   # for example, when downloading a 100 GB video, loading response.content and writing it all at once is unreasonable
    for line in response.iter_content():
        f.write(line)
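For genuinely large files, a common pattern (a sketch, not part of the original code) is to stream the response with stream=True so the body is never held in memory all at once, and read it in fixed-size chunks:

import requests

url = 'http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg'
response = requests.get(url, stream=True)                        # defer downloading the body
with open("res_streamed.png", "wb") as f:
    for chunk in response.iter_content(chunk_size=64 * 1024):    # read 64 KB at a time
        if chunk:                                                 # skip keep-alive chunks
            f.write(chunk)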

(4) Parsing JSON data

import requests
import json

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)   # too cumbersome
res2 = response.json()             # get the JSON data directly
print(res1 == res2)

(5) Redirection and History

By default, requests automatically handles all redirects except for HEAD. You can use the history attribute of the response object to track redirects: response.history is a list of the Response objects that were created in order to complete the request, sorted from the oldest to the most recent response.

>>> r = requests.get('http://github.com')

>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
In addition, you can also disable redirect processing through the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]

2.2 Advanced usage of requests

Proxies

Some websites take anti-crawling measures. For example, many sites count how many times a given IP visits within a certain period of time; if the visit frequency is too high to look like a normal visitor, the site may ban that IP. So we need to set up some proxy servers and switch to a different proxy every so often; even if one IP gets banned, we can simply change IP and keep crawling.

res = requests.get('http://httpbin.org/ip', proxies={'http': '110.83.40.27:9999'}).json()
print(res)
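To switch proxies every so often, a minimal sketch is to pick one at random from a pool for each request (the addresses in the pool below are hypothetical placeholders):

import random
import requests

proxy_pool = ['110.83.40.27:9999', '120.83.49.90:8118']   # hypothetical proxies; replace with working ones
proxy = random.choice(proxy_pool)
res = requests.get('http://httpbin.org/ip', proxies={'http': 'http://' + proxy}).json()
print(res)   # httpbin echoes the IP the request appears to come from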

Free proxies

2.3 Crawler cases

Douban movie Top 250
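The Douban code itself is not included in this excerpt; a minimal sketch, assuming the Top 250 page keeps each film title in a <span class="title"> element (an assumption about its markup):

import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}
response = requests.get('https://movie.douban.com/top250', headers=headers)

# assumed markup: <span class="title">film name</span>; the [^<&]+ pattern skips the "&nbsp;/ alternative title" spans
titles = re.findall('<span class="title">([^<&]+)</span>', response.text)
print(titles)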

GitHub home page (log in with a session)

import requests
import re

# Step 1: request the login page to obtain the authenticity_token, so that the subsequent POST passes validation
session = requests.session()
res = session.get("https://github.com/login")

authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', res.text)[0]
print(authenticity_token)

# Step 2: build the POST request data
data = {
    "login": "[email protected]",
    "password": "yuanchenqi0316",
    "commit": "Sign in",
    "utf8": "?",
    "authenticity_token": authenticity_token
}

# the session object carries the cookies set in step 1, so they do not need to be passed explicitly
res = session.post("https://github.com/session", data=data)

with open("github.html", "wb") as f:
    f.write(res.content)
