Qunar.com online flight ticket crawler

   Recently my company had a new requirement: crawl the flight ticket data for a given day. I started with Ctrip and Qunar. Ctrip was relatively simple, but Qunar gave me trouble. At first I used the requests module to crawl Qunar. The request headers contain some random values that are different on every request, and even if you reuse the random values from the previous request, fake data is returned. I also tried generating random strings myself, but still got fake data back. In the end there was no other way, and I had to fall back on selenium.

While using selenium, I found that Qunar.com also has anti-selenium measures:

  1. Selenium is detected with JavaScript

   After I selected the location and date and clicked search, the flight information page opened. The rest of the page displayed normally, but the core data kept loading forever. My computer crashed several times.

   This was the first time I had run into this kind of anti-crawling measure, so all I could do was search Baidu. I finally found a solution in this expert's post:

https://www.cnblogs.com/xieqiankun/p/hide-webdriver.html

  Solution:

  Add a parameter to option that turns off Chrome's automation (developer) mode. I used the Chrome browser, so:

option.add_experimental_option('excludeSwitches', ['enable-automation'])

   After that, the data displayed successfully.

    

  2. Ticket prices are scrambled with CSS so that you cannot scrape the real price

Just when I thought I was about to succeed, I found a new problem. The data matched with XPath was scrambled and contained interfering values:

<em class="rel">
<b style="width:48px;left:-48px">
<i style="width: 16px;">6</i>
<i style="width: 16px;">5</i>
<i style="width: 16px;">6</i>
</b>
<b style="width: 16px;left:-48px">7</b>
</em>

Real price: 756

Rules:

Decoding mainly depends on the width and left values in the CSS.

In the example above:

  The total width is 48px and each i tag is 16px wide, so the price has three digits.

   The first b tag holds the displayed digits as the text of its i tags. Every other b tag carries a replacement: its text overwrites the text of one of the i tags, and the result is the real price.

   The left value in the style of each replacement b tag gives the position. With left: -48px it replaces the first i; with left: -16px it replaces the third i. Here the 7 replaces the first 6, so 656 becomes 756.

   That is the general idea. I put the code on GitHub: https://github.com/bobos008/Airticket

