Understand the principle of crawlers

1. Briefly explain the principle of crawling

If we compare the Internet to a large spider web, the data is stored at the nodes of the web, and a crawler is a small spider that moves along the web to catch its prey (the data). In other words, a crawler is a program that sends a request to a website, obtains the resources returned, and then analyzes them to extract useful data.

From a technical perspective, a crawler is a program that simulates a browser requesting a site, downloads the HTML code, JSON data, or binary data (pictures, videos) that the site returns, and then extracts the data you need and stores it for later use.
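The three steps described above (request, extract, store) can be sketched as a small pipeline. This is only a minimal illustration built on the standard library, not a complete crawler: the names fetch, extract_title, store, and crawl are made up for this example, and pulling out just the <title> with a regular expression stands in for real parsing.

```python
import json
import re
from urllib.request import urlopen


def fetch(url: str) -> str:
    """Step 1: request the page and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html: str) -> str:
    """Step 2: pull one piece of data (the <title> text) out of the HTML."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def store(record: dict, path: str) -> None:
    """Step 3: save the extracted data locally as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)


def crawl(url: str, path: str, fetcher=fetch) -> dict:
    """Run the three steps in order; `fetcher` can be swapped out for testing."""
    html = fetcher(url)
    record = {"url": url, "title": extract_title(html)}
    store(record, path)
    return record
```

Passing the fetch function as a parameter keeps the extraction and storage steps usable on HTML that is already on disk, which is handy when trying selectors without hitting the site repeatedly.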

2. Understand the crawler development process

1). Briefly explain the working principle of the browser;

In essence, a browser works by communicating over the HTTP protocol. The process can be roughly divided into three stages:
Connection: the server listens on a port through a ServerSocket object. When a client connects, the server accepts the connection and opens a socket (a virtual file) for it.
Request: once the stream objects tied to the established socket connection have been created, the browser builds its request (a GET request here), which names the HTML file being accessed, and sends it to the server.
Response: after receiving the request, the server looks for the file in the relevant directory. If the file does not exist, it returns an error message. If it exists, the server reads the HTML file, adds the HTTP headers, and sends the response to the browser. The browser then parses the HTML. If the page also references resources such as pictures and videos, the browser requests those from the web server as well, then assembles everything and displays it to the user.
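The three stages can be seen from the client side with a plain socket. The sketch below is an illustration under simplifying assumptions, not how a real browser is built: it speaks unencrypted HTTP on port 80, asks the server to close the connection after one response, and does not decode chunked bodies. The function names are made up for this example.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Request stage: format a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw: bytes) -> tuple:
    """Response stage: split raw bytes into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def http_get(host: str, path: str = "/", port: int = 80) -> tuple:
    """Connection stage plus the two above: connect, send, read until close."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while chunk := conn.recv(4096):
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling http_get with a host name performs one full round trip; a status code other than 200 corresponds to the error case described in the response stage.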

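2). Use the requests library to fetch website data;

requests.get(url) fetches the HTML code of a page. The task is stated for the campus news homepage; the code here requests http://www.cnblogs.com/.

import requests

# Page to fetch
nice = 'http://www.cnblogs.com/'

# Send a GET request, as a browser would
luck = requests.get(nice)

# Decode the response body as UTF-8
luck.encoding = 'utf-8'

# Print the HTML source of the page
print(luck.text)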
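Setting the encoding before reading the text matters because the body arrives as bytes and must be decoded with the right codec. According to the Requests documentation, when a text response declares no charset in its headers the library falls back to ISO-8859-1, which garbles UTF-8 pages. The snippet below reproduces that effect with plain bytes, so no network is needed; the sample string (Chinese for "campus news") is just an illustration.

```python
# Bytes as they might arrive from a server that sends UTF-8 text
raw = "校园新闻".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently produces mojibake
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page was written in recovers the text
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

The wrong decoding yields one garbage character per byte, which is the typical symptom of a missing or incorrect encoding setting.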

3). Understand the web page

Write a simple HTML file that contains several tags, classes, and ids.

The file is a page with lang "en", charset UTF-8, and the title "Title", and it carries an HTML 4.01 Transitional doctype. Its body contains a form with a text input txtUserName labeled "Username:" and a password input txtUserPass labeled "Password:", followed by a file box placed outside the form.

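A shorter sample with tags, classes, and ids, stored in a Python string named html_sample:

html_sample = ' \
<html> \
<body> \
<h1 id="title">china</h1> \
<a href="#" class="link">one</a>\
<a href="# link2" class="link" id="link2">two</a>\
</body> \
</html> \
'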

4). Use Beautiful Soup to parse web pages;

BeautifulSoup(html_sample, 'html.parser') parses the HTML sample above into a DOM tree.

select(selector) then locates data in that tree:

Find the HTML elements with a specific tag
Find the HTML elements with a specific class name
Find the HTML element with a specific id

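from bs4 import BeautifulSoup

html = ' \
<html> \
<body> \
<h1 id="title">china</h1> \
<a href="#" class="link">one</a>\
<a href="# link2" class="link" id="link2">two</a>\
</body> \
</html> \
'

# Parse the string into a DOM tree
soups = BeautifulSoup(html, 'html.parser')

# Attribute access returns only the first <a> element
a1 = soups.a

# Select by tag name: every <a>, then every <h1>
a = soups.select('a')
print(a)
nice = soups.select('h1')
print(nice)

# Select by class name: every element with class "link"
links = soups.select('.link')
print(links)

# Select by id: the element whose id is "title"
luck = soups.select('#title')
print(luck)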
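To make the idea of a DOM tree more concrete, the sketch below prints the nesting of a small page as an indented outline. It is not Beautiful Soup: it uses Python's built-in html.parser module directly (the same parser that the 'html.parser' argument selects), and the class OutlineParser, the helper outline, and the sample string are made up for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not deepen the tree
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Collect one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(html):
    """Return the indented outline of `html` as a list of lines."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


# Stand-in markup for the example, similar in shape to the sample above
sample = '<html><body><h1 id="title">china</h1><a href="#" class="link">one</a></body></html>'
print("\n".join(outline(sample)))
```

Each level of indentation is one step down the tree: a selector such as '#title' or '.link' is a way of naming nodes in this structure by their attributes rather than by their position.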

3. Extract the title, release time, publishing unit, author, number of clicks, content, and other information of a campus news article

For example, url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'

The release time must be of datetime type, the number of clicks must be numeric, and the other fields are strings.

import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# Fetch the news page and parse it
url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0329/11104.html'
get = requests.get(url)
get.encoding = 'utf-8'
soup = BeautifulSoup(get.text, 'html.parser')

# Title, and the whitespace-separated fields of the info line
title = soup.select('.show-title')[0].text
head = soup.select('.show-info')[0].text.split()

# head[0] has a 5-character label before the date; head[1] is the time
release_time = datetime.strptime(head[0][5:] + ' ' + head[1], '%Y-%m-%d %H:%M:%S')

# The click count comes from a separate API; id is the news id from the page URL
count_text = requests.get('http://oa.gzcc.cn/api.php?op=count&id=11104&modelid=80').text
clicks = int(re.findall(r'\d+', count_text.split(';')[3])[0])

content = soup.select('.show-content')[0].text

print('title:' + title)
print('Release time:' + str(release_time))
# Remaining fields of the info line
print(head[4])
print(head[2])
print('clicks:' + str(clicks))
print(content)
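The type requirements (datetime for the release time, a number for the clicks) can be checked in isolation, without requesting the site. The helpers below are a sketch: the function names are made up, and the input strings are hypothetical stand-ins, not real page text or real API responses. The click helper mirrors the approach of the code above, taking the digits from the fourth ';'-separated field.

```python
import re
from datetime import datetime


def parse_release_time(date_part: str, time_part: str) -> datetime:
    """Combine a 'YYYY-MM-DD' date and an 'HH:MM:SS' time into a datetime."""
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")


def parse_click_count(api_text: str) -> int:
    """Return the number found in the fourth ';'-separated field as an int."""
    field = api_text.split(";")[3]
    return int(re.findall(r"\d+", field)[0])


# Hypothetical inputs, made up for illustration
released = parse_release_time("2019-03-29", "10:15:00")
clicks = parse_click_count("a('1');b('2');c('3');d('456');")

print(type(released).__name__, released)
print(type(clicks).__name__, clicks)
```

Converting the click count with int() is what makes it numeric; re.findall alone returns strings, which can be concatenated for printing but not used in arithmetic or comparisons.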

