XPath Basic use

1. Install lxml package

pip install lxml

2. Basic usage

from lxml import etree  # import the package
import requests

response = requests.get('https://www.baidu.com')  # the URL must include the scheme
# Generate an HTML element tree
# html = etree.parse(html_document, etree.HTMLParser())  # the parameter is an HTML document (file)
html = etree.HTML(response.text)  # the parameter is a string of HTML text
div = html.xpath('xpath expression')  # returns a list of matches (elements or strings)
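As a quick offline check of the steps above, the sketch below parses an inline HTML string instead of a live page. The sample markup and the names `sample_html`, `root`, `divs`, and `texts` are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical sample markup, standing in for response.text
sample_html = (
    '<html><body>'
    '<div class="post">Hello <b>world</b></div>'
    '<div class="post">Bye</div>'
    '</body></html>'
)

root = etree.HTML(sample_html)  # parse a string into an element tree

# An expression that selects elements returns a list of element objects
divs = root.xpath('//div[@class="post"]')

# An expression ending in text() returns a list of strings
texts = root.xpath('//div[@class="post"]/text()')

print(len(divs))  # 2
print(texts)      # ['Hello ', 'Bye']
```

Note that `xpath()` always returns a list for these expressions, so a single element has to be picked out by index before calling `xpath()` on it again.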

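1. Get the outermost tag, traverse all the child tags inside it, and get the tag text

# div is assumed to be a single element here, for example one item of the list returned above.
# lxml's xpath() already returns a list of strings; with a Scrapy selector, append .extract().
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]/text()')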

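2. Strip all tags with a regular expression: re.compile(r'<.*?>').sub()

import re

content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]')
pattern = re.compile(r'<.*?>', re.S)
# sub() needs a string, so serialize the first matched element back to markup first
content = pattern.sub('', etree.tostring(content_list[0], encoding='unicode'))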
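The regex approach can be tried on a small fragment. This is a minimal sketch, assuming lxml elements (not Scrapy selectors); the markup and the names `node`, `markup`, and `tag_pattern` are hypothetical.

```python
import re
from lxml import etree

# Hypothetical fragment with nested and self-closing tags
sample_html = '<div class="post">Hello <b>big</b> world<br/>bye</div>'
node = etree.HTML(sample_html).xpath('//div[@class="post"]')[0]

# Serialize the element back to a markup string
markup = etree.tostring(node, encoding='unicode')

# Non-greedy match of anything between < and >; re.S lets . span newlines
tag_pattern = re.compile(r'<.*?>', re.S)
content = tag_pattern.sub('', markup)

print(content)  # Hello big worldbye
```

Because the tags are simply deleted, text on either side of a tag such as `<br/>` is joined with no separator.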

3. /text() gets the text of the tag itself; //text() gets the text of the tag and all of its descendant tags

# With a Scrapy selector, append .extract() to get the same list of strings
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]//text()')
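The difference between the two forms is easiest to see side by side. The sketch below uses a made-up fragment; `direct` and `nested` are illustrative names.

```python
from lxml import etree

# Hypothetical fragment: text, a child tag, then more text
sample_html = '<div class="post">Hello <b>big</b> world</div>'
root = etree.HTML(sample_html)

# /text(): only text nodes that are direct children of the div
direct = root.xpath('//div[@class="post"]/text()')

# //text(): text nodes of the div and of every descendant, in document order
nested = root.xpath('//div[@class="post"]//text()')

print(direct)  # ['Hello ', ' world']
print(nested)  # ['Hello ', 'big', ' world']
```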

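4. Use xpath('string(.)') to get all the text and concatenate it

# With lxml, index the matched element first; string(.) then returns a single string.
# With a Scrapy selector the chained form is .xpath('string(.)').extract()[0].
content_list = div.xpath('.//div[@class="d_post_content j_d_post_content "]')[0].xpath('string(.)') + '\n'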
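A small sketch of `string(.)` on an lxml element, again with hypothetical markup and names. It also shows that joining the `//text()` results gives the same string.

```python
from lxml import etree

sample_html = '<div class="post">Hello <b>big</b> world</div>'
node = etree.HTML(sample_html).xpath('//div[@class="post"]')[0]

# string(.) concatenates every descendant text node into one string
joined = node.xpath('string(.)')

# Equivalent result built by hand from the list of text nodes
same = ''.join(node.xpath('.//text()'))

print(joined)  # Hello big world
```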

Once the text has been obtained, print(content_list) shows the content. If the formatting needs cleaning up, strip the whitespace as follows:

remove = re.compile(r'\s')
content = ''
for string in content_list:
    string = remove.sub('', string)
    content += string
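The cleanup loop can be checked on a hand-written list. The fragments below are hypothetical, and the second variant is a swapped-in alternative (not from the original) for cases where the spaces between words should survive.

```python
import re

# Hypothetical fragments, shaped like what //text() might return
content_list = ['\n  Hello ', 'big', ' world\t\n']

# Same idea as the loop above: delete every whitespace character
remove = re.compile(r'\s')
content = ''.join(remove.sub('', s) for s in content_list)

# Alternative: collapse runs of whitespace to single spaces instead
spaced = ' '.join(' '.join(content_list).split())

print(content)  # Hellobigworld
print(spaced)   # Hello big world
```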

String method:

content = div.xpath('string(.//div[@class="content"])').strip()  # get all the text under the div as one string
