1. Briefly explain the principle of crawling
If we compare the Internet to a large spider web, then data is stored at the nodes of the web, and a crawler is a small spider that crawls along the web to catch its prey (the data). A crawler is a program that sends requests to a website, fetches the returned resources, and then analyzes them and extracts useful data.
From a technical perspective, a crawler simulates the browser's behavior of requesting a site: it downloads the HTML code / JSON data / binary data (pictures, videos) that the site returns to the local machine, then extracts the data it needs and stores it for later use.
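As a minimal sketch of this request, extract, store pipeline (the URL and the title regex below are illustrative assumptions, not part of the original text):

import re
import requests

resp = requests.get('http://www.cnblogs.com/')  # 1. request the site like a browser would
resp.encoding = 'utf-8'
match = re.search(r'<title>(.*?)</title>', resp.text, re.S)  # 2. extract the data you need
if match:
    with open('title.txt', 'w', encoding='utf-8') as f:      # 3. store it for later use
        f.write(match.group(1))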
2. Understand the crawler development process
1). Briefly explain the working principle of the browser;
In essence, the browser works by carrying out HTTP communication. The process is roughly divided into three stages:
Connection: the server listens on a port through a ServerSocket class object; when a connection arrives, it accepts it and opens a socket virtual file.
Request: the browser creates stream objects bound to the socket connection and sends a GET request; the server reads the name of the requested html file from the request information.
Response: after receiving the request, the server looks up the file in the relevant directory. If it does not exist, it returns an error message; if it does, it reads the html file, adds http headers, and returns the response to the browser. The browser then parses the html file; if the page also references resources such as pictures and videos, the browser requests them from the web server again, assembles everything, and displays the page to the user.
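To make the three stages concrete, here is a minimal socket-level sketch of one GET round trip (example.com and the hard-coded headers are assumptions for illustration):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('example.com', 80))                      # stage 1: connection
s.sendall(b'GET / HTTP/1.1\r\n'
          b'Host: example.com\r\n'
          b'Connection: close\r\n\r\n')             # stage 2: request (a GET)
response = b''
while True:
    chunk = s.recv(4096)                            # stage 3: response
    if not chunk:
        break
    response += chunk
s.close()
print(response.decode('utf-8', errors='replace')[:200])  # status line + headers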
2). Use the requests library to fetch website data;
requests.get(url) fetches the HTML code of a page, here the cnblogs homepage:

import requests

nice = 'http://www.cnblogs.com/'
luck = requests.get(nice)
luck.encoding = 'utf-8'   # decode the response as UTF-8
print(luck.text)          # the page's HTML source
Run screenshot
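Besides .text, the response object returned by requests.get() exposes other useful fields; a small sketch reusing the URL above:

import requests

luck = requests.get('http://www.cnblogs.com/')
luck.encoding = 'utf-8'
print(luck.status_code)                     # 200 on success
print(luck.headers.get('Content-Type'))     # e.g. text/html; charset=utf-8
print(len(luck.content))                    # raw bytes (useful for pictures/videos)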
3). Understand the web page
Write a simple HTML file that contains multiple tags, classes, and ids.
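The file itself survives only as fragments in the recovered text (an HTML 4.01 Transitional DOCTYPE, lang="en", a UTF-8 charset, a Title element, and a "file box outside the form" hidden from display), plus the h1 and links parsed in section 4) below. A minimal reconstruction, with anything beyond those fragments filled in as an assumption:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1 id="title">china</h1>
    <a href="#link1" class="link" id="link1">one</a>
    <a href="#link2" class="link" id="link2">tow</a>
    <!-- "File box outside the form": a hidden file input (reconstruction guess) -->
    <input type="file" style="display:none" />
</body>
</html>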
Run screenshot:
4). Use Beautiful Soup to parse web pages;
Parse the above html file into a DOM tree with BeautifulSoup(html_sample, 'html.parser'), then use select(selector) to locate data:
Find html elements with a specific tag
Find html elements with a specific class name
Find html elements with a specific id
from bs4 import BeautifulSoup

html = '<html><body>' \
       '<h1 id="title">china</h1>' \
       '<a href="#link1" class="link" id="link1">one</a>' \
       '<a href="#link2" class="link" id="link2">tow</a>' \
       '</body></html>'

soups = BeautifulSoup(html, 'html.parser')
a1 = soups.a                   # the first <a> tag
a = soups.select('a')          # elements with the tag a
print(a)
nice = soups.select('h1')      # elements with the tag h1
print(nice)
luck = soups.select('#title')  # the element with id "title"
print(luck)
Run screenshot
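The class-name lookup listed above works the same way with a .class selector, and each result is a Tag whose attributes and text are directly accessible; a short self-contained sketch reusing the same sample:

from bs4 import BeautifulSoup

html = '<html><body><h1 id="title">china</h1>' \
       '<a href="#link1" class="link" id="link1">one</a>' \
       '<a href="#link2" class="link" id="link2">tow</a></body></html>'
soups = BeautifulSoup(html, 'html.parser')

for link in soups.select('.link'):   # elements with the class name "link"
    print(link['href'], link.text)   # attribute access and text extraction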
3. Extract the title, release time, publishing unit, author, number of clicks, content and other information of a campus news article
For example, url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'
The release time is required to be of datetime type, the number of clicks of numeric type, and the others of string type.
import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime

get = requests.get('http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0329/11104.html')
get.encoding = 'utf-8'
soup = BeautifulSoup(get.text, 'html.parser')

title = soup.select('.show-title')[0].text
head = soup.select('.show-info')[0].text.split()

# head[0] holds the release date after a 5-character label; parse date + time into a datetime
release_time = datetime.strptime(head[0][5:] + ' ' + head[1], '%Y-%m-%d %H:%M:%S')

# The click count comes from a separate counter API; cast to int per the requirement
# (the id 11086 is kept as in the original, although the article above is 11104)
clicks = int(re.findall(r'\d+', requests.get(
    'http://oa.gzcc.cn/api.php?op=count&id=11086&modelid=80').text.split(';')[3])[0])

content = soup.select('.show-content')[0].text

print('title:' + title)
print('Release time:' + str(release_time))
print(head[4])   # further fields from the info line (publishing unit / author)
print(head[2])
print('clicks:' + str(clicks))
print(content)
Run results:
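The same extraction logic generalizes to any article URL; a hedged sketch wrapping it into a function (deriving the click-count id from the trailing number of the URL is an assumption about the site's URL scheme):

import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

def crawl_news(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('.show-title')[0].text
    head = soup.select('.show-info')[0].text.split()
    release_time = datetime.strptime(head[0][5:] + ' ' + head[1], '%Y-%m-%d %H:%M:%S')
    news_id = re.search(r'/(\d+)\.html', url).group(1)   # assumed id-in-URL scheme
    api = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(news_id)
    clicks = int(re.findall(r'\d+', requests.get(api).text.split(';')[3])[0])
    content = soup.select('.show-content')[0].text
    return title, release_time, clicks, content

print(crawl_news('http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'))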