Recently my company had a new requirement: crawl the air ticket data for a given day. I started with Ctrip and Qunar. Ctrip was relatively simple, but Qunar gave me trouble. At first I used the requests module to crawl Qunar. The request headers contain some random values that are different on every request, and even if you reuse the random values from the previous request, fake data is returned. I generated some random strings myself and sent those, but still got fake data back. In the end I had no choice but to fall back on selenium.
While using selenium, I found that Qunar also has anti-selenium measures:
1. It detects selenium with JavaScript
After I selected the location and date and clicked search, the flight information page opened. Everything else on the page was displayed, but the core data kept loading forever. My computer froze several times.
It was the first time I had run into this kind of anti-crawling measure, so I searched Baidu and eventually found a solution in this expert's post:
https://www.cnblogs.com/xieqiankun/p/hide-webdriver.html
Solution:
Add an experimental option to the browser options so that Chrome no longer starts with the enable-automation switch (the linked post describes this as developer mode). I use the Chrome browser, so:
option.add_experimental_option('excludeSwitches', ['enable-automation'])
After that, the data was displayed successfully.
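For context, here is a minimal sketch of how that option could fit into a Chrome setup with selenium's Python bindings. The helper name build_chrome_options and the constant EXCLUDED_SWITCHES are my own, for illustration only; the only setting taken from the post above is excludeSwitches with enable-automation. Whether this alone is enough depends on what the site checks at the time.

```python
# Switches that Chrome should NOT be started with (from the solution above).
EXCLUDED_SWITCHES = ["enable-automation"]


def build_chrome_options():
    """Return ChromeOptions with the enable-automation switch excluded.

    Hypothetical helper; selenium is imported here so the module can be
    loaded even on a machine where selenium is not installed.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Same call as in the post: exclude the automation switch.
    options.add_experimental_option("excludeSwitches", EXCLUDED_SWITCHES)
    return options
```

The returned object can then be passed to the driver, for example webdriver.Chrome(options=build_chrome_options()).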
2. The ticket price is scrambled with CSS so that you cannot crawl the real price
Just when I thought I was about to succeed, I found a new problem. The data matched with xpath contained a lot of scrambled values, with interfering data mixed in:
<em class="rel">
<span>
<b style="width:48px;left:-48px">
<i style="width: 16px;">6</i>
<i style="width: 16px;">5</i>
<i style="width: 16px;">6</i>
</b>
<b style="width: 16px;left:-48px">7</b>
</span>
</em>
Real price: 756
Rules:
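It mainly depends on the CSS width.
In the example above:
The total width is 48px, and each i tag is 16px wide, so there are three digit positions.
The first b tag holds the i tags, and their text is the starting value (656). Every other b tag carries a replacement: its text replaces the text of one of the i tags, and the result after all replacements is the real value.
The left value in the style of each replacement b tag indicates the position: with left: -48px it replaces the first i, and with left: -16px it replaces the third i. Here 7 replaces the first 6, which gives 756.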
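The rule above can be sketched in Python roughly as follows. This is not the code from my repository, just a small illustration using only the standard library. The names SAMPLE, px and decode_price are made up for this example. It assumes the markup is well formed as in the sample, that the first b tag is the one holding the i tags, and that each replacement b tag covers exactly one digit. The index formula (total width + left) / digit width is inferred from the two cases described above.

```python
import re

# Sample markup in the same shape as the example above.
SAMPLE = """
<em class="rel"><span>
<b style="width:48px;left:-48px">
<i style="width: 16px;">6</i>
<i style="width: 16px;">5</i>
<i style="width: 16px;">6</i>
</b>
<b style="width: 16px;left:-48px">7</b>
</span></em>
"""


def px(style, name):
    """Read an integer pixel value such as width or left from a style string."""
    match = re.search(name + r"\s*:\s*(-?\d+)px", style)
    return int(match.group(1))


def decode_price(html):
    """Rebuild the real price from the scrambled b/i tags."""
    # Each entry is (style attribute, inner content) of one b tag.
    blocks = re.findall(r'<b style="([^"]*)">(.*?)</b>', html, flags=re.S)
    base_style, base_inner = blocks[0]
    # The starting digits come from the i tags of the first b tag.
    digits = re.findall(r"<i[^>]*>\s*(\d)\s*</i>", base_inner)
    total_width = px(base_style, "width")
    digit_width = total_width // len(digits)
    # Every other b tag overwrites one digit; left picks the position.
    for style, inner in blocks[1:]:
        index = (total_width + px(style, "left")) // digit_width
        digits[index] = inner.strip()
    return int("".join(digits))


print(decode_price(SAMPLE))  # prints 756
```

With left: -48px the index is (48 - 48) / 16 = 0, the first digit; with left: -16px it is (48 - 16) / 16 = 2, the third digit.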
That is the general idea. I put the code on GitHub: https://github.com/bobos008/Airticket