Crawler Tips

First of all, which Python crawler modules have you used? I believe most people would answer requests or scrapy. But for simple crawlers we habitually reach for requests, or its upgraded sibling requests_html; pulling out scrapy for a job like this would be using a sledgehammer to crack a nut.

Now for a simple task: grab the table data from this webpage, http://www.air-level.com/air/beijing/, and save it.
I suspect many people would pick up requests and start typing right away. Since the code is fairly simple, let's talk through the idea instead.
First we need to fetch the page successfully, then parse the table on it, then store the data; here we will simply save it as a CSV. With the plan laid out, we could write the code ourselves, but if you are not fluent in XPath the parsing alone will eat time, and life is short. Why waste it on such a simple task?
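For contrast, here is roughly what the manual requests-plus-XPath route looks like. This is a sketch, not tested against this exact page: it assumes the data sits in an ordinary <table> of <tr>/<td> rows, so the XPath may need adjusting.

import csv
import requests
from lxml import etree

resp = requests.get("http://www.air-level.com/air/beijing/")
resp.encoding = 'utf-8'
tree = etree.HTML(resp.text)

# Assumption: the data lives in a plain <table> of <tr> rows
rows = []
for tr in tree.xpath("//table//tr"):
    cells = [cell.xpath("string(.)").strip() for cell in tr.xpath("./th|./td")]
    if cells:
        rows.append(cells)

with open("tq.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

Workable, but that is already a fair amount of boilerplate for one table.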

After some digging, I found a better way to handle this kind of static, single page...

pandas module

Introduction

pandas is usually mentioned for its data-analysis features, but when I checked its API I found this method: read_html.
Here are the function and its parameters:

pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, tupleize_cols=None, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)

https://pandas.pydata.org/
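For this job only a few of those parameters matter: header=0 tells pandas that the first row holds the column names, while match and attrs help single out one table when a page contains several. A small illustrative call (the class attribute here is hypothetical; adjust it to the actual page):

import pandas as pd

tables = pd.read_html(
    "http://www.air-level.com/air/beijing/",
    match="AQI",                # keep only tables whose text matches this
    attrs={"class": "table"},   # hypothetical attribute filter
    header=0,
)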

Installation

pip3 install pandas
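One caveat: behind the scenes read_html needs an HTML parser, either lxml or BeautifulSoup4 with html5lib. If pandas complains about a missing parser, install one, for example:

pip3 install lxml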

Crawler code

import pandas as pd

# read_html fetches the page and returns a list of DataFrames, one per <table>
df = pd.read_html("http://www.air-level.com/air/beijing/", encoding='utf-8', header=0)[0]
# transpose, then convert to one dict per row
results = df.T.to_dict().values()

print(results)

Running it prints the data as a collection of dict-type records:

dict_values([
  {'monitoring station': 'Beijing Tiantan', 'AQI': 177, 'air quality level': 'moderate pollution', 'PM2.5': '134 μg/m3', 'PM10': '176 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Shunyi New Town', 'AQI': 167, 'air quality level': 'moderate pollution', 'PM2.5': '127 μg/m3', 'PM10': '163 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Agricultural Exhibition Hall', 'AQI': 155, 'air quality level': 'moderate pollution', 'PM2.5': '118 μg/m3', 'PM10': '170 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Olympic Sports Center', 'AQI': 152, 'air quality level': 'moderate pollution', 'PM2.5': '116 μg/m3', 'PM10': '132 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Dongsi', 'AQI': 150, 'air quality level': 'light pollution', 'PM2.5': '115 μg/m3', 'PM10': '145 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Wanliu, Haidian District, Beijing', 'AQI': 142, 'air quality level': 'light pollution', 'PM2.5': '109 μg/m3', 'PM10': '143 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Wanshou West Palace', 'AQI': 142, 'air quality level': 'light pollution', 'PM2.5': '109 μg/m3', 'PM10': '143 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Gucheng', 'AQI': 137, 'air quality level': 'light pollution', 'PM2.5': '105 μg/m3', 'PM10': '120 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Guanyuan', 'AQI': 137, 'air quality level': 'light pollution', 'PM2.5': '105 μg/m3', 'PM10': '144 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Huairou Town', 'AQI': 121, 'air quality level': 'light pollution', 'PM2.5': '92 μg/m3', 'PM10': '143 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Dingling', 'AQI': 114, 'air quality level': 'light pollution', 'PM2.5': '86 μg/m3', 'PM10': '92 μg/m3', 'primary pollutant': 'PM2.5'},
  {'monitoring station': 'Beijing Changping Town', 'AQI': 104, 'air quality level': 'light pollution', 'PM2.5': '78 μg/m3', 'PM10': '109 μg/m3', 'primary pollutant': 'PM2.5'}
])

The code is very simple, but what it does is not. The first line imports the pandas package.
In the second line, read_html does the core work: it fetches the page, parses every td inside the table tags, and returns a list in which each element is a DataFrame, one per table on the page. Indexing with [0] takes the first DataFrame, and since it is a DataFrame we get all the usual DataFrame methods for free.
The third line transposes the frame, converts it to a dict of dicts, and takes the values, which we then print. A toy illustration of that transpose trick follows; after that, let's store the result for real.
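Here is a minimal sketch of the transpose trick on made-up data (not from the site):

import pandas as pd

toy = pd.DataFrame({"station": ["A", "B"], "AQI": [177, 167]})
# .T turns rows into columns, so to_dict() is keyed by the row numbers;
# .values() then yields one dict per original row
print(toy.T.to_dict().values())
# dict_values([{'station': 'A', 'AQI': 177}, {'station': 'B', 'AQI': 167}])

Incidentally, toy.to_dict("records") produces the same records directly.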

Save as CSV

df = pd.read_html("http://www.air-level.com/air/beijing/", encoding='utf-8', header=0)[0]
df.to_csv("tq.csv", index=False)  # write the table to tq.csv, dropping the index column

After running the code, tq.csv is generated; open it and it is exactly the data we want.
Let's compare it against the webpage:

(screenshot: the table on the webpage)

And here is the CSV we saved:

(screenshot: tq.csv)

You can see that we successfully captured the table data from the page.

Note that read_html can only parse static pages; content rendered by JavaScript after the page loads will not be picked up.
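Even for static pages, some sites reject Python's default client. A common workaround, sketched here with no guarantee for this particular site, is to fetch the HTML yourself with requests (setting whatever headers you need) and hand the string to read_html; wrapping it in StringIO keeps newer pandas versions happy:

from io import StringIO

import pandas as pd
import requests

# Assumption: the site accepts a browser-like User-Agent
resp = requests.get(
    "http://www.air-level.com/air/beijing/",
    headers={"User-Agent": "Mozilla/5.0"},
)
resp.encoding = 'utf-8'

# read_html also accepts raw HTML, not just a URL
df = pd.read_html(StringIO(resp.text), header=0)[0]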

How about that? It couldn't be much easier. Give it a try.
