Basic XPath usage
1. Install the lxml package
pip install lxml
2. Usage:
from lxml import etree  # import the package
import requests

response = requests.get('https://www.baidu.com')  # the URL needs a scheme
# Generate an HTML element tree
# html = etree.parse(html document)  # the parameter is an HTML document (file)
html = etree.HTML(response.text)  # the parameter is a string of HTML text
div = html.xpath('xpath expression')  # returns a list (elements, or strings for text()/attribute expressions)
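The same flow can be tried offline. The following is a minimal sketch, assuming a small hand-written HTML string (sample_html and its tags are hypothetical, not taken from any real page) in place of response.text. It shows which kind of list each kind of expression returns:

```python
from lxml import etree

# Hypothetical sample markup standing in for response.text
sample_html = """
<html><body>
  <div class="post"><p>Hello</p><a href="/next">Next page</a></div>
</body></html>
"""

html = etree.HTML(sample_html)      # parse a string into an element tree
links = html.xpath('//a')           # element expression: list of elements
hrefs = html.xpath('//a/@href')     # attribute expression: list of strings
texts = html.xpath('//p/text()')    # text() expression: list of strings

print(links[0].tag, hrefs, texts)
```

Because xpath() returns a list even for a single match, the result usually has to be indexed (for example links[0]) before calling xpath() on it again.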
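1. Select the outermost tag, traverse the sub-tags inside it, and get the tag text. Note that .extract() is a Scrapy selector method; with lxml, xpath() already returns the list of strings and there is no .extract() call.
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]/text()').extract()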
2. Remove all tags with the regular expression <.*?>, using re.compile() and sub()
content_html = div.xpath('.//div[@class="d_post_content j_d_post_content "]').extract()[0]  # sub() needs a string, not a list
pattern = re.compile(r'<.*?>', re.S)
content = pattern.sub('', content_html)
3. /text() gets the text directly inside the tag; //text() gets the text of the tag and all of its sub-tags
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]//text()').extract()
4. Use xpath('string(.)') to get all the text and concatenate it into one string
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]').xpath('string(.)').extract()[0] + '\n'
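The four approaches above can be compared side by side in plain lxml. This is a rough sketch under stated assumptions: the markup and the class name "content" are made up for illustration, and etree.tostring() is used here to obtain the HTML string for the regex step, since lxml elements have no .extract().

```python
import re
from lxml import etree

# Hypothetical markup; the class name is made up for this sketch
sample_html = '<div class="content">Hello <b>bold</b> world<p>Second</p></div>'
root = etree.HTML(sample_html)
node = root.xpath('//div[@class="content"]')[0]  # xpath() returns a list

# 1. /text(): only the text nodes that are direct children of the div
direct = node.xpath('./text()')

# 2. Regex: serialize the element to a string, then strip every tag
raw = etree.tostring(node, encoding='unicode')
no_tags = re.sub(r'<.*?>', '', raw, flags=re.S)

# 3. //text(): text nodes of the div and of all its descendants
deep = node.xpath('.//text()')

# 4. string(.): one concatenated string instead of a list
joined = node.xpath('string(.)')

print(direct, deep, joined, no_tags, sep='\n')
```

In this sample, methods 1 and 3 return lists while methods 2 and 4 return a single string, which is the main practical difference when choosing between them.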
After obtaining the text, use print(content_list) to check it. If the format needs cleaning up, process it as follows:
remove = re.compile(r'\s')
content = ''
for string in content_list:
    string = remove.sub('', string)
    content += string
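The cleanup loop can also be written as a single join. A small sketch, assuming content_list is a list of text fragments such as //text() might return (the fragments below are invented for illustration):

```python
import re

# Hypothetical text fragments, similar to what //text() might return
content_list = ['\n  First line ', '\t', 'second\u3000line\n']

remove = re.compile(r'\s')  # matches spaces, tabs, newlines, full-width spaces
content = ''.join(remove.sub('', s) for s in content_list)

print(content)
```

Keep in mind that \s removes every whitespace character, including the spaces between words. That suits Chinese text well; for English text, collapsing runs of whitespace into one space may be the better choice.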
string() method:
content = div.xpath('string(.//div[@class="content"])').strip()  # Get all the text under the div as a single string
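One detail worth checking with this form: in XPath 1.0, string() applied to a node-set uses only the first node in document order. The sketch below assumes hypothetical markup with two matching divs and shows a per-element alternative when every block is needed:

```python
from lxml import etree

# Hypothetical markup with two divs sharing the same class
sample_html = '''
<div class="content">  first <b>block</b>  </div>
<div class="content">second block</div>
'''
root = etree.HTML(sample_html)

# string(node-set) takes the string value of the first match only
first_only = root.xpath('string(.//div[@class="content"])').strip()

# To get every block, select the elements and call string(.) on each
all_blocks = [d.xpath('string(.)').strip()
              for d in root.xpath('.//div[@class="content"]')]

print(first_only)
print(all_blocks)
```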