1. Bug found
While crawling the trial information published on the chinadrugtrials details pages, I found that the program broke on some pages, as follows:
After investigation, this turned out to be a bug in the web page itself: on a very small number of details pages, the less-than sign in some text is not converted into a character entity. Where it is immediately followed by UL or A (uppercase), it forms something that looks like a tag, which causes bs4 to treat it as the opening angle bracket of a tag when parsing and to complete the tag automatically.
The bug can be reduced to the following:
Output:
2. Solution:
Because bs4 is still needed to parse the page structure, the less-than sign cannot simply be replaced everywhere with a regular expression (the normal tags would be changed as well). Since the situation is relatively rare, the unescaped less-than signs can instead be converted into the character entity &lt; before the page is parsed with BeautifulSoup.
The code changes are as follows:
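As a rough illustration of this kind of pre-processing (not the original change), the sketch below escapes only those less-than signs that do not look like the start of a real tag. It assumes that the genuine tags on the page are all lowercase, which may not hold for every site. The function name `escape_stray_lt`, the regular expression, and the sample string are made up for the example.

```python
import re

# Assumption: real tags on the page are lowercase (<p>, <ul>, <a>, ...),
# so a "<" followed by anything else is treated as literal text.
# "/", "!" and "?" are kept so closing tags, comments/doctypes and
# processing instructions are left untouched.
_STRAY_LT = re.compile(r"<(?![a-z/!?])")


def escape_stray_lt(html_text: str) -> str:
    """Replace less-than signs that do not start a tag with &lt;."""
    return _STRAY_LT.sub("&lt;", html_text)


if __name__ == "__main__":
    raw = "<p>ALT<ULN, dose <5 mg</p>"  # made-up sample text
    print(escape_stray_lt(raw))
    # prints: <p>ALT&lt;ULN, dose &lt;5 mg</p>
```

The cleaned string can then be handed to BeautifulSoup as usual. If the page also contains uppercase tags, the pattern would need to be narrowed, for example to the specific sequences that actually occur in the text.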
3. Useful character entities in HTML
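For a quick check of the most common entities, Python's standard `html` module can convert in both directions. This is only a small sketch, and the variable names are illustrative.

```python
import html

# Escape the characters that are special in HTML text.
escaped = html.escape("a < b & c > d")
print(escaped)  # a &lt; b &amp; c &gt; d

# Convert entities back to the characters they stand for.
restored = html.unescape("&lt;UL &amp; &gt; &quot;x&quot;")
print(restored)  # <UL & > "x"

# &nbsp; is the non-breaking space, U+00A0.
nbsp = html.unescape("&nbsp;")
print(repr(nbsp))  # '\xa0'
```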