crawler-06 Beautiful Soup


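I. Introduction

  Beautiful Soup (literally "beautiful soup") parses HTML and XML documents and extracts information from them. It does not download pages itself: you hand it markup obtained elsewhere (for example with requests), and it turns that markup into a tree you can search and traverse.

  The idea: treat the target document as a pot of soup, then "cook" it, that is, parse it into a tag tree.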

II. Practical Use

1. Installation

  If you use PyCharm, you can install bs4 from Settings > Project Interpreter. You can also install it from the command line:

pip install beautifulsoup4

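2. A quick test

1) Open the link below, then right-click and view the page source:

https://python123.io/ws/demo.html

2) The page source looks like this:

[Image: UTOOLS1576466743044.png]

3) Fetch the page source with requests

  The REPL echoes the string's repr, so line breaks show up as \r\n escapes and the whole page appears squeezed onto one line:

>>> import requests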
>>> url="https://python123.io/ws/demo.html"
>>> r=requests.get(url)
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>>
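The interactive session above assumes the request succeeds. As a minimal sketch of a more defensive fetch, the helper below checks the HTTP status and lets requests guess the text encoding from the body. The name fetch_html and the 10-second timeout are assumptions made for illustration; they are not part of requests or of the original walkthrough.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded page text, or None if the request fails.

    fetch_html is a hypothetical helper name, not a requests API.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad URLs, and bad status codes.
        return None


# Typical use (needs network access):
# demo = fetch_html("https://python123.io/ws/demo.html")
```

Returning None keeps the calling code simple; a real crawler might prefer to log the exception or retry instead.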

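4) Process the text with the bs4 library

  • The package you install is beautifulsoup4, but you import it under the short name bs4;
  • Python is case-sensitive: the B and S in BeautifulSoup must be capitalized;
  • html.parser is Python's built-in HTML parser;
  • soup=BeautifulSoup(demo,"html.parser") "cooks the soup": it parses demo and binds the result to the name soup.

>>> demo=r.text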
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  (remaining output omitted)
>>>

  The printed markup is now much easier to read than the single run of text above.
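One detail is easy to miss here: prettify is a method, so it has to be called. The sketch below uses a small inline snippet (an assumption, chosen so it runs without network access) to contrast the bare attribute with the call.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>demo</b></p>', "html.parser")

method = soup.prettify      # no parentheses: only the bound method object
pretty = soup.prettify()    # with parentheses: the formatted markup as a str

print(callable(method))     # the attribute itself is just something callable
print(pretty)               # one tag or string per line, indented by depth
```

Printing the bare attribute shows a "bound method" description rather than formatted markup, which is a quick way to spot the missing parentheses.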

5) Using BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')

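III. Basic Elements of the BeautifulSoup Library

  Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree". As the snippet below shows, tags nest inside one another, and those parent/child relationships make up the tag tree. Any document structured as a tag tree can be parsed with BeautifulSoup.

<html>
<body>
<p class="title"> … </p>
</body>
</html>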

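1) How Beautiful Soup views a tag is shown in the figure below:

[Image: UTOOLS1576466847344.png]

2) Basic elements of the BeautifulSoup class:

[Image: UTOOLS1576466806700.png]

1. Importing

  The Beautiful Soup library is also known as beautifulsoup4 or bs4. The conventional imports are shown below; most code uses only the BeautifulSoup class:

from bs4 import BeautifulSoup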
import bs4

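2. Understanding the object

  An HTML document, its tag tree, and a BeautifulSoup object can be treated as equivalent: a BeautifulSoup object represents the entire content of one HTML/XML document. In the example below, soup is the "cooked" (parsed) HTML document.

soup = BeautifulSoup('<p>data</p>','html.parser')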
soup2 = BeautifulSoup(open("D://demo.html"),'html.parser')
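The two lines above build a soup from a string and from an open file. A slightly more careful sketch of the file case uses a with-block, so the handle is closed after parsing, and states the encoding explicitly. The temporary file stands in for D://demo.html and is an assumption made so the example runs anywhere.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small HTML file so the example does not depend on a local path.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<p>data</p>")

# Parse from a string.
soup = BeautifulSoup("<p>data</p>", "html.parser")

# Parse from an open file; the with-block closes the handle afterwards.
with open(path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

print(soup.p.string, soup2.p.string)
```

Both objects represent the same document, so they can be used interchangeably from here on.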

3. Parsers available to Beautiful Soup

Parser                 Usage                              Requirement
bs4's HTML parser      BeautifulSoup(mk,'html.parser')    install the bs4 library
lxml's HTML parser     BeautifulSoup(mk,'lxml')           pip install lxml
lxml's XML parser      BeautifulSoup(mk,'xml')            pip install lxml
html5lib's parser      BeautifulSoup(mk,'html5lib')       pip install html5lib

  Here mk stands for the markup being parsed.
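Because lxml and html5lib are optional installs, code that prefers one of them may want a fallback. The sketch below tries a preferred parser and falls back to the built-in html.parser when bs4 reports that the requested parser is unavailable. The helper name make_soup is an assumption made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers can build slightly different trees from the same broken markup, so pinning one parser per project keeps results reproducible.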

4. Inspecting the basic elements of the BeautifulSoup class

1) The name and attrs elements

>>> from bs4 import BeautifulSoup
>>> import requests
>>> url="https://python123.io/ws/demo.html"
>>> r=requests.get(url)
>>> demo=r.text
>>> soup=BeautifulSoup(demo,"html.parser")

# Display the title tag directly
>>> soup.title
<title>This is a python demo page</title>

# Bind the a tag to the name tag, then display it
# The document contains two a tags, but soup.a returns only the first one
>>> tag=soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>>

# Display the tag's name, then the names of its parent and grandparent
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>>

# View the tag's attributes
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>>

# Look up individual values in the attribute dictionary
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'

# Check the type of the attributes: tag.attrs is a dict
>>> type(tag.attrs)
<class 'dict'>
>>>

# Check the type of the tag itself
>>> type(tag)
<class 'bs4.element.Tag'>
>>>
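Since tag.attrs is an ordinary dict, a Tag also supports dictionary-style access directly. The sketch below, run on an inline copy of the first a tag from the demo page, shows the shortcuts and the safe lookup for attributes that may be missing.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

print(tag.name)           # the tag's name
print(tag["href"])        # shortcut for tag.attrs["href"]
print(tag.get("class"))   # class is multi-valued, so it comes back as a list
print(tag.get("title"))   # missing attribute: None instead of a KeyError
print(tag.has_attr("id")) # membership test for an attribute
```

Using tag.get() is usually the safer choice when scraping real pages, where an attribute may be absent on some tags.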

2) The NavigableString element (continuing in the same IDLE session, to avoid re-cooking the soup)

  This is the string inside a tag, as shown below:

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'

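# In the example below, <b> sits inside the p tag but is not printed: when a tag has exactly one child, .string looks through that child, so it returns the text without the nested tags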
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>>
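The .string shortcut only works when a tag has a single child. The sketch below uses two small inline paragraphs (assumptions modeled on the demo page) to show that .string is None once a tag mixes text and nested tags, and that get_text() collects all the text in that case.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    '<p class="title"><b>one string</b></p>'
    '<p class="course">text <a>link</a> more</p>',
    "html.parser",
)
title, course = soup.find_all("p")

print(isinstance(title.string, NavigableString))  # single nested child: a string
print(title.string)
print(course.string)       # several children, so .string is None
print(course.get_text())   # concatenates every string under the tag
```

Checking for None before using .string avoids surprises on paragraphs that contain links or other inline tags.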

3) The Comment element

  .string returns the text of a comment without the surrounding <!-- and --> markers:

>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b>","html.parser")
>>> newsoup.b.string
'This is a comment'

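# Its type is bs4.element.Comment, a special kind of string
# When analyzing a document you sometimes need to tell comments from ordinary text, and this type is how; it is rarely needed, so just be aware of it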
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>>
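Because .string hides the comment markers, a comment and ordinary text look the same when printed. A minimal sketch of telling them apart with isinstance follows; the second tag is an assumption added for contrast.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

for node in (soup.b.string, soup.p.string):
    # Comment is a subclass of NavigableString, so test for it explicitly.
    kind = "comment" if isinstance(node, Comment) else "text"
    print(kind, repr(str(node)))
```

This check is the usual way to skip commented-out markup when extracting visible text.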

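IV. Traversing HTML Content with bs4

1. Basic HTML structure

[Image: UTOOLS1576466880682.png]

2. Traversal directions

[Image: UTOOLS1576466971049.png]

3. Downward traversal of the tag tree

Attribute       Description
.contents       List of child nodes: all direct children of the tag, stored in a list
.children       Iterator over child nodes; like .contents, used to loop over direct children
.descendants    Iterator over all descendant nodes (children, grandchildren, and so on), used in loops

# Setup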
>>> import requests
>>> url="http://python123.io/ws/demo.html"
>>> r=requests.get(url)
>>> demo=r.text

# Try downward traversal
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>

# View the children of head; .contents returns a list
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>>

# Count the children of body (the newline strings count as children too)
>>> len(soup.body.contents)
5
>>>

# contents[0] is the newline string '\n', so contents[1] is the first p tag
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>>

Traversing with a for loop

# Iterate over direct children
for child in soup.body.children:
    print(child)

# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
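Both loops also visit the newline strings between tags, which is why body reports five children. A common follow-up, sketched here on a shortened inline copy of the demo page (an assumption, so no network is needed), is to keep only Tag nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>demo</title></head>\n"
    "<body>\n"
    '<p class="title"><b>bold text</b></p>\n'
    '<p class="course">intro <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Direct children that are tags, skipping the whitespace strings.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# Every tag below body, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(soup.body.contents), child_tags, descendant_tags)
```

Filtering by type keeps later code from calling tag-only attributes on plain strings.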

4. Upward traversal of the tag tree

Attribute    Description
.parent      The node's parent tag
.parents     Iterator over the node's ancestor tags, used to loop over ancestors

# The parent of the title tag is the head tag
>>> soup.title.parent
<head><title>This is a python demo page</title></head>

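# The parent of the html tag is the BeautifulSoup object (the whole document), so the output looks the same as html itself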
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

# soup is a special top-level object; its parent is None, so nothing is echoed
>>> soup.parent
>>>

# Walk all ancestors, including the soup object itself; the None check is a defensive guard
>>> for parent in soup.a.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]
>>> soup.name
'[document]'
>>>
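The same walk can be written as a list comprehension, which makes the chain of ancestors easy to inspect. This is a minimal sketch on a shortened inline copy of the demo page (an assumption, so it runs offline).

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>demo</title></head>\n"
    "<body>\n"
    '<p class="course">intro <a id="link1">Basic Python</a></p>\n'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Names of every ancestor of the first a tag, nearest first.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

# The html tag's parent is the soup object; the soup object has no parent.
print(soup.html.parent is soup, soup.parent is None)
```

The last entry, '[document]', is the name of the BeautifulSoup object itself rather than a real tag.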

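5. Sibling (parallel) traversal of the tag tree

Attribute             Description
.next_sibling         The next sibling node in HTML document order
.previous_sibling     The previous sibling node in HTML document order
.next_siblings        Iterator over all following sibling nodes in HTML document order
.previous_siblings    Iterator over all preceding sibling nodes in HTML document order

[Image: UTOOLS1576467014760.png]

# A sibling is not necessarily a tag; it can be a string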
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>>

Sibling traversal with a for loop

# Iterate over following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
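Siblings include plain strings such as ' and ', so loops that only care about tags usually filter by type. The sketch below works on an inline copy of the course paragraph (shortened, as an assumption) and uses the plural iterators.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p class="course">intro <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Following siblings that are tags (skips the ' and ' and '.' strings).
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# Everything before the first a tag inside the same parent.
earlier = [str(s) for s in first.previous_siblings]

print(later_ids, earlier)
```

Note the plural names: iterating over .next_sibling (singular) would loop over a single node, and for a string node that means its individual characters.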

V. Formatting HTML Output with bs4

  How can an HTML page be displayed in a more "friendly" way? "Friendly" here means friendly to both people and programs.

1. The prettify() method

  prettify() returns the markup with one tag or string per line, indented by nesting depth. Remember the parentheses: soup.prettify without them is only the method object.

>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  (remaining output omitted)
>>>

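2. Encoding

  bs4 converts documents to Unicode and uses UTF-8 by default for output, the same default as Python 3, so Chinese text displays without extra work. On Python 2 this was painful and required endless encoding conversions. The following is Python 3:

>>> soup1=BeautifulSoup("<p>中文</p>","html.parser")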
>>> soup1.p.string
'中文'
>>> print(soup1.prettify())
<p>
 中文
</p>
>>>
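The example above starts from a Python str, which is already Unicode. When the input is raw bytes in another encoding, bs4 can be told which encoding to use through the from_encoding argument. The sketch below assumes GBK-encoded bytes, chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Bytes in a non-UTF-8 encoding, as a page might arrive from the network.
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes into Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

print(soup.p.string)            # an ordinary Python str
print(soup.p.encode("utf-8"))   # the tag re-encoded as UTF-8 bytes
```

Inside the tree everything is Unicode, so the original encoding only matters at the point of input and, if bytes are needed, at output.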