How a crawler handles a website bug: less-than signs not converted into entities

1. Bug found

   While crawling the details pages on which chinadrugtrials publishes trial information, the program was found to break in some places, as follows:

  [Image: the broken output]

After investigation, this turned out to be a bug in the webpage itself: on a very small number of details pages, a less-than sign in some of the text is not converted into a character entity and is immediately followed by letters such as UL or A (uppercase), so it ends up looking like a tag. This causes bs4 to treat it as the opening angle bracket of a tag when parsing, and the "tag" is automatically completed.

The bug can be simplified as follows:

[Image: simplified markup that triggers the bug]

Output:

[Image: the parsed output]
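The parsing behaviour described above can be reproduced without the original page. The sketch below is not the author's code: it uses only Python's standard html.parser module (which bs4's "html.parser" backend builds on), and the sample markup, the TagLogger class and the start_tags helper are all made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Tag names are reported in lowercase by HTMLParser.
        self.seen.append(tag)


def start_tags(markup):
    """Return the list of start tags found in the given markup."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.seen


# Made-up sample: the "<" before "A" is meant as a less-than sign.
broken = "<p>dose <A per day</p>"
# The same text with the less-than sign written as an entity.
escaped = "<p>dose &lt;A per day</p>"

print(start_tags(broken))   # includes a bogus "a" tag
print(start_tags(escaped))  # only the genuine "p" tag
```

With the unescaped sign, the parser reports an "a" start tag that the page author never wrote, which is the kind of phantom tag that ends up in the parsed tree.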

2. Solution

  Because bs4 is needed to parse the page structure, the less-than signs cannot simply all be replaced by a blanket rule (the normal tags would change too). Since the situation is relatively rare, the unconverted less-than signs can instead be converted into the entity &lt; before parsing with BeautifulSoup.

The code changes are as follows:

[Image: the code changes]
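One possible way to do this pre-processing step is sketched below. It is not the author's actual change, which is not recoverable here. It rests on a loud assumption: that the genuine tags on the page are written in lowercase, so a "<" followed by anything other than a lowercase letter, "/", "!" or "?" is a stray less-than sign. The STRAY_LT pattern, the escape_stray_lt function and the sample markup are hypothetical.

```python
import re

# Assumption: real tags in the crawled markup are lowercase. A "<" that is not
# followed by a lowercase letter, "/", "!" or "?" is treated as a stray sign.
STRAY_LT = re.compile(r"<(?![a-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace stray less-than signs with the &lt; entity before parsing."""
    return STRAY_LT.sub("&lt;", markup)


# Made-up sample containing two stray signs and several genuine tags.
broken = '<p>dose <A per day, <span style="color: #ff0000;">limit <UL</span></p>'
fixed = escape_stray_lt(broken)
print(fixed)
```

The cleaned string can then be handed to the parser as usual, for example BeautifulSoup(fixed, "html.parser"). If the target page also contains genuine uppercase tags, this rule would escape them too, so the pattern would need to be adapted to the page at hand.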

3. Useful character entities in HTML

[Image: table of useful HTML character entities]
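Which entities the original table listed is not known, so the following is only an illustrative selection of standard named entities, printed with Python's built-in html module. The names list and the table dictionary are made up for this sketch.

```python
import html

# Illustrative selection only; not taken from the original table.
names = ["lt", "gt", "amp", "quot", "nbsp", "copy"]

# Map each entity to the character it stands for.
table = {f"&{name};": html.unescape(f"&{name};") for name in names}
for entity, char in table.items():
    print(f"{entity:8} -> {char!r}")

# html.escape goes the other way for the characters that are special in HTML.
escaped = html.escape('a < b & "c"')
print(escaped)
```

html.escape covers only the handful of characters that are special in markup (<, >, &, and quotes), which is enough for the less-than problem discussed here.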
