快速上手 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml' ) print(soup.prettify()) print(soup.title.string)
标签选择器
属性
功能
使用eg
title
选择元素
soup.title
head
选择元素
soup.head
p
选择元素
soup.p
name
获取名称
soup.title.name
attrs
获取属性
soup.p.attrs[‘name’]或soup.p[‘name’]
string
获取内容
soup.p.string或者soup.p.text
contents
获取子节点,返回list
soup.p.contents
children
获取子节点,返回迭代器
soup.p.children
descendants
获取子孙节点,返回迭代器
soup.p.descendants
parent
获取父节点,返回列表
soup.a.parent
parents
获取祖先节点,返回列表
soup.a.parents
next_siblings
获取后面的兄弟节点,返回列表
soup.a.next_siblings
previous_siblings
获取前面的兄弟节点,返回列表
soup.a.previous_siblings
标准选择器
find_all(name,attrs,recursive,text,**kwargs)
可根据标签名、属性、内容查找文档1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> ''' from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml' ) print(soup.find_all(attrs={'id' : 'list-1' })) print(soup.find_all(attrs={'name' : 'elements' })) print(soup.find_all(id='list-1' )) print(soup.find_all(class_='element' )) print(soup.find_all(text='Foo' )
find( name , attrs , recursive , text , **kwargs )
find返回单个元素,find_all返回所有元素
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 html=""" <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> """ from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml' ) print(soup.find('ul' )) print(type(soup.find('ul' ))) print(soup.find('page' )) >>>>> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <class 'bs4 .element .Tag '> None
方法
描述
prettify
格式化html代码
find_parents()
返回所有祖先节点
find_parent()
返回直接父节点
find_next_siblings()
返回后面所有兄弟节点
find_next_sibling()
返回后面第一个兄弟节点
find_previous_siblings()
返回前面所有兄弟节点
find_previous_sibling()
返回前面第一个兄弟节点
find_all_next()
返回节点
后所有符合条件的节点
find_next()
返回第一个符合条件的节点
find_all_previous()
返回节点
后所有符合条件的节点
find_previous()
返回第一个符合条件的节点
css选择器 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 html=''' <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul> </div> </div> ''' from bs4 import BeautifulSoupsoup = BeautifulSoup(html, 'lxml' ) print(soup.select('.panel .panel-heading' )) print(soup.select('ul li' )) print(soup.select('#list-2 .element' )) print(type(soup.select('ul' )[0 ])) for li in soup.select('li' ): print(li.get_text())
总结
推荐使用lxml解析库,必要时使用html.parser
标签选择筛选功能弱但是速度快
建议使用find()、find_all() 查询匹配单个结果或者多个结果
如果对CSS选择器熟悉建议使用select()
记住常用的获取属性和文本值的方法
参考文档