1 Introduction to crawlers
Overview
In recent years, with the gradual expansion and deepening of network applications, efficiently obtaining online data has become the pursuit of countless companies and individuals. In the era of big data, whoever has more data can obtain greater benefits, and web crawlers are the most commonly used means of collecting data from the Internet.
A web crawler, or Web Spider, is a very vivid name. If the Internet is likened to a spider web, then the Spider is a spider crawling around that web. Web spiders find web pages through their link addresses: starting from one page of a website (usually the homepage), the crawler reads the content of the page, finds the other link addresses it contains, and then follows those links to the next pages, looping like this until all the pages of the site have been crawled.
The value of crawlers
The most valuable thing on the Internet is data: the product listings of Tmall, the rental listings of Lianjia.com, the securities and investment information of Xueqiu.com, and so on. These data represent real money in every industry; it is fair to say that whoever holds the first-hand data of an industry becomes the master of that industry. If the data of the whole Internet is likened to a treasure, then this crawler course teaches you how to dig it out efficiently. Once you master crawling skills, you effectively become the boss behind every Internet information company; in other words, they are all providing you with valuable data for free.
robots.txt protocol
If you do not want the data on certain pages of your site to be crawled, you can constrain crawlers by writing a robots.txt file. The format of the robots protocol can be seen in Taobao's robots file (just visit www.taobao.com/robots.txt). Note, however, that this protocol is only the equivalent of a verbal agreement; nothing technically enforces it, so it guards against gentlemen but not against villains. The crawlers we write while learning can ignore the robots protocol for now.
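Python's standard library can read a robots.txt for you, which is handy once you do want to respect it. Below is a minimal sketch; the "*" user agent and the example path are arbitrary choices, not anything prescribed by Taobao:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")   # point the parser at the site's robots file
rp.read()                                         # fetch and parse it
# can_fetch() answers whether the given user agent may crawl the given URL
print(rp.can_fetch("*", "https://www.taobao.com/some/page"))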
Basic workflow of a crawler
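Whatever the target site, the process described in the overview boils down to the same loop: send a request, read the response, extract data and new link addresses, then repeat. A rough sketch of that loop; the start URL and the link pattern are placeholders only:
import re
import requests

to_visit = ["https://example.com/"]        # start page, usually the homepage (placeholder)
seen = set()
while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url).text          # 1. send the request and read the page content
    # 2. extract whatever data you need from html here
    # 3. collect further link addresses and queue them for the next round
    for link in re.findall(r'href="(https://example\.com/[^"]*)"', html):
        to_visit.append(link)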
Preliminary knowledge
The HTTP protocol
2 The requests module
Requests is an HTTP library written in Python on top of urllib and released under the Apache2 License. It is more convenient than urllib and saves us a lot of work. In a word, requests is the simplest and most usable HTTP library implemented in Python, and it is the recommended library for crawlers. A default Python installation does not include the requests module; it has to be installed separately with pip.
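Installing it is a single command:
pip install requests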
2.1 Basic syntax
Request methods supported by the requests module
import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post ")
requests.put("http://httpbin.org/put ")
requests.delete("http://httpbin.org/delete ")
requests.head("http://httpbin.org/get ")
requests.options("http://httpbin.org/get ")
GET requests
1 Basic request
import requests
response = requests.get('https://www.jd.com/')
with open("jd.html", "wb") as f:
    f.write(response.content)
2 Request with parameters
import requests
response=requests.get('https://s.taobao.com/search?q=Mobile')
response=requests.get('https://s.taobao.com/search',params={"q":"beauties"})
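requests URL-encodes the params dict and appends it to the URL for you; if you want to see the exact URL that was sent (useful when the keyword contains non-ASCII characters), check response.url:
print(response.url)   # the final URL, with the query string appended and percent-encoded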
3 Request with headers
import requests
response=requests.get('https://dig.chouti.com/',
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}
)
4 Requests with cookies
import uuid
import requests
url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))
res = requests.get(url, cookies=cookies)
print(res.text)
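A plain requests.get() forgets cookies as soon as the call returns. When cookies have to survive across several requests (a login, for instance), a Session object stores them automatically. A small sketch against httpbin; the cookie name and value are arbitrary:
import requests

session = requests.session()
session.get("http://httpbin.org/cookies/set?foo=bar")   # the server sets a cookie; the session keeps it
print(session.get("http://httpbin.org/cookies").text)   # the same session sends the cookie back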
POST requests
1 The data parameter
requests.post() is used exactly like requests.get(); the only difference is that requests.post() takes an extra data parameter that holds the request body.
response=requests.post("http://httpbin.org/post",params={"a":"10"}, data={"name":"yuan"})
2 Sending JSON data
import requests
res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})  # no request header specified; default Content-Type: application/x-www-form-urlencoded
print(res1.json())
res2 = requests.post(url='http://httpbin.org/post', json={'age': "22"})  # default Content-Type: application/json
print(res2.json())
The response object
(1) Common attributes
import requests
response = requests.get('https://sh.lianjia.com/ershoufang/')
# commonly used response attributes
print(response.text)                  # body decoded to str
print(response.content)               # raw body as bytes
print(response.status_code)           # HTTP status code, e.g. 200
print(response.headers)               # response headers
print(response.cookies)               # cookies as a RequestsCookieJar
print(response.cookies.get_dict())    # cookies as a plain dict
print(response.cookies.items())       # cookies as (name, value) pairs
print(response.url)                   # the final URL of the request
print(response.history)               # the redirects that led here
print(response.encoding)              # encoding used to decode .text
(2) Encoding issues
import requests
response=requests.get('http://www.autohome.com/news')
response.encoding = 'gbk'  # Autohome pages are gb2312-encoded, while requests defaults to ISO-8859-1; without this the Chinese text is garbled
with open("res.html", "w", encoding='gbk') as f:
    f.write(response.text)
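Instead of hard-coding 'gbk' for every site, you can let requests guess the encoding from the page body itself via apparent_encoding. The detection can occasionally be wrong, so treat this as a convenience rather than a guarantee:
import requests

response = requests.get('http://www.autohome.com/news')
response.encoding = response.apparent_encoding          # use the encoding detected from the content
with open("res.html", "w", encoding=response.encoding) as f:
    f.write(response.text)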
(3) Downloading binary files (images, videos, audio)
import requests
response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg')
with open("res.png", "wb") as f:
    # f.write(response.content)  # writing response.content in one go is unreasonable for huge files, e.g. a 100 GB video
    for line in response.iter_content():
        f.write(line)
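For the huge files mentioned in the comment above, it is better to pass stream=True and read the body in chunks, so that only one chunk sits in memory at a time. A sketch reusing the same image URL; the 64 KB chunk size is an arbitrary choice:
import requests

response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg', stream=True)  # stream=True defers the download until iter_content() is consumed
with open("res.png", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024 * 64):   # read 64 KB at a time
        if chunk:
            f.write(chunk)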
(4) Parsing JSON data
import requests
import json
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text) #too troublesome
res2=response.json() #Get json data directly
print(res1==res2)
(5) Redirection and History
By default, Requests automatically handles all redirects except for HEAD requests. You can use the history attribute of the response object to track redirects: response.history is the list of Response objects that were created in order to complete the request, sorted from the oldest to the most recent.
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
In addition, you can also disable redirect processing through the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
2.2 Advanced usage of requests
Proxies
Some websites take anti-crawling measures. For example, many sites count how many times a given IP visits within a certain period; if the access frequency is too high to look like a normal visitor, they may ban that IP. So we set up proxy servers and switch to a different proxy every so often: even if one IP gets banned, we can change IP and keep crawling.
res = requests.get('http://httpbin.org/ip', proxies={'http': '110.83.40.27:9999'}).json()
print(res)
Free proxies
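In practice you keep a pool of proxy addresses and switch whenever one fails or gets banned. A minimal sketch; the addresses in the pool are made-up placeholders and almost certainly dead, so a real pool would come from a paid service or a free proxy list:
import random
import requests

proxy_pool = ["110.83.40.27:9999", "118.24.52.95:8080"]        # placeholder proxies

def fetch(url):
    proxy = random.choice(proxy_pool)                          # pick a different proxy each call
    try:
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
    except requests.RequestException:
        return None                                            # this proxy failed; the caller can retry

res = fetch("http://httpbin.org/ip")
if res is not None:
    print(res.json())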
2.3 Crawler examples
Douban Top 250 movies
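A rough sketch of the Douban Top 250 case using requests plus a regular expression. Douban rejects requests without a browser-like User-Agent, and the regex assumes that titles still sit inside <span class="title"> tags, which may have changed since this was written:
import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36"}
for start in range(0, 250, 25):                                 # the list is paginated, 25 movies per page
    url = "https://movie.douban.com/top250?start=%s" % start
    html = requests.get(url, headers=headers).text
    for title in re.findall(r'<span class="title">([^&<]+)</span>', html):   # skip the "&nbsp;/ alias" entries
        print(title)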
GitHub home page (logging in with a session)
import requests
import re
# Step 1: request the login page to obtain the authenticity_token that the POST will be verified against
session=requests.session()
res=session.get("https://github.com/login")
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', res.text)[0]
print(authenticity_token)
# Step 2: build the POST form data
data = {
    "login": "[email protected]",            # account name/email (redacted placeholder)
    "password": "yuanchenqi0316",
    "commit": "Sign in",
    "utf8": "✓",                             # GitHub's login form submits a UTF-8 check mark here
    "authenticity_token": authenticity_token
}
res = session.post("https://github.com/session", data=data)   # the session already carries the cookies obtained in step 1
with open("github.html","wb< span style="color: #800000;">") as f:
f.write(res.content)