Python Scraping Libraries: Using BeautifulSoup

Source: IT派

ID: it_pai

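Beautiful Soup is a Python library for extracting data from HTML or XML files. Put simply, it parses an HTML document into a tree structure, which makes it easy to reach a given tag and read its attributes.

With Beautiful Soup you can pass a class or id value as an argument and directly retrieve the data of the matching tags, which keeps the code short and clear.

At the time of writing, the latest Beautiful Soup release was 4.4.0, and Beautiful Soup 3 was no longer maintained.

The Beautiful Soup 4 releases of that period run on both Python 2.7 and Python 3; the examples in this article use Python 2.7.

I work on a Mac and installed the library directly from the command line: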

sudo easy_install beautifulsoup4

Once the installation finishes, try importing the library:

from bs4 import BeautifulSoup

If no error is raised, the library is installed correctly.

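Getting Started

The examples in this article all use the page http://reeoo.com.

Initializing a BeautifulSoup Object

Pass a document to the BeautifulSoup constructor and you get back a document object. In the code below, the document is fetched by requesting a URL: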

#coding:utf-8
from bs4 import BeautifulSoup
import urllib2
url = 'http://reeoo.com'
request = urllib2.Request(url)
response = urllib2.urlopen(request, timeout=20)
content = response.read()
soup = BeautifulSoup(content, 'html.parser')

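The request has no error handling; we ignore that for now. The second argument to the BeautifulSoup constructor names the parser. If you omit it, BeautifulSoup picks the most suitable parser available on its own, but it prints a warning.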
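For experimenting without a network connection, the constructor also accepts a plain string. The snippet below is only a minimal sketch: sample_html is a made-up, heavily simplified stand-in for the real page, and html.parser is the parser that ships with Python's standard library. It uses print() with parentheses so that it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the reeoo.com markup.
sample_html = """
<html><head><title>Reeoo - web design inspiration and website gallery</title></head>
<body><article class="box"><div id="main"></div></article></body></html>
"""

# Naming the parser explicitly avoids the parser warning described above.
soup = BeautifulSoup(sample_html, 'html.parser')

print(soup.title.string)          # text of the <title> tag
print(soup.article.div['id'])     # id attribute of the first <div> inside <article>
```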

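You can also initialize from a file handle. Save the HTML source to reo.html in the same directory, then pass the opened file object as the argument:

soup = BeautifulSoup(open('reo.html'), 'html.parser')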

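If you print soup, the output is essentially identical to the HTML text. Internally it is now a complex tree structure in which every node is a Python object.

P.S. The soup used in all of the following examples is this one.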

Tag

A Tag object corresponds to a tag in the original HTML document and can be accessed directly by its name:

tag = soup.title
print tag

Output:

<title>Reeoo - web design inspiration and website gallery</title>

Name

The name attribute of a Tag object gives the tag's name:

print tag.name
# title

Attributes

A tag may have many attributes, such as id and class. Tag attributes are accessed the same way as dictionary entries.

For example, the page has an article tag that contains the thumbnail area:

...
<article class="box">
    <div id="main">
    <ul id="list">
        <li id="sponsor"><div class="sponsor_tips"></div>
            <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" id="_carbonads_js"></script>
        </li>
...

Get the value of its class attribute:

tag = soup.article
c = tag['class']
print c     
# [u'box']

You can also get all of the attributes at once through .attrs:

tag = soup.article
attrs = tag.attrs
print attrs
# {u'class': [u'box']}

P.S. Because class is a multi-valued attribute, its value is a list.
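The difference between multi-valued and single-valued attributes is easier to see side by side. The following is a small sketch on an invented snippet (the class and id values are made up for illustration); it also shows .get(), which returns None for a missing attribute where square brackets would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the article tag shown above.
html = '<article class="box wide" id="gallery"><p>text</p></article>'
tag = BeautifulSoup(html, 'html.parser').article

classes = tag['class']        # multi-valued attribute: a list with one entry per class
tag_id = tag['id']            # single-valued attribute: a plain string
missing = tag.get('title')    # absent attribute: None instead of a KeyError
has_id = tag.has_attr('id')   # True when the attribute is present
```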

Strings Inside a Tag

Use the .string attribute to get the string contained in a tag:

tag = soup.title
s = tag.string
print s
# Reeoo - web design inspiration and website gallery
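One detail worth knowing is that .string only yields a value when the tag has exactly one child; otherwise it is None. The sketch below uses an invented snippet to show this, together with get_text(), which joins all nested strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <title> holds one string, <p> holds a string plus a <b> tag.
html = '<title>Reeoo</title><p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, 'html.parser')

single = soup.title.string   # exactly one string child, so the string is returned
mixed = soup.p.string        # two children (a string and a tag), so this is None
text = soup.p.get_text()     # concatenates every string below <p>
```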

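Navigating the Document Tree

A Tag may contain several strings or other Tags; all of these are children of that Tag. Beautiful Soup provides many attributes for working with and iterating over children.

Children

Accessing a tag name as an attribute of a Tag returns the matching tag, and chaining such accesses lets you reach matching tags further down among the children.

Suppose we want the li inside the article tag: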

tag = soup.article.div.ul.li
print tag

Output:

<li id="sponsor"><div class="sponsor_tips"></div>
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" type="text/javascript"></script>
</li>

You can also leave out some of the intermediate nodes and get the same result:

tag = soup.article.li

Attribute-style access only returns the first matching tag. To get all of the li tags, use the find_all() method:

ls = soup.article.div.ul.find_all('li')

The result is a list containing all of the li tags.
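To make the contrast concrete, here is a minimal sketch on an invented list (the li contents are made up): attribute access stops at the first match, while find_all() collects every match in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the list markup shown earlier.
html = """
<article class="box"><div id="main"><ul id="list">
<li id="sponsor">ad</li>
<li>first site</li>
<li>second site</li>
</ul></div></article>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.article.li                  # only the first <li>
same = soup.article.div.ul.li            # the full path reaches the same tag
all_items = soup.article.find_all('li')  # every <li> under <article>
texts = [li.get_text() for li in all_items]
```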

A tag's .contents attribute returns its children as a list:

tag = soup.article.div.ul
contents = tag.contents

Printing contents shows that the list holds not only the li tags but also the newline strings '\n'.

A tag's .children generator lets you loop over its children:

tag = soup.article.div.ul
children = tag.children
print children
for child in children:
    print child

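Printing children shows that it is an iterator object rather than a list.

.contents and .children cover only a tag's direct children. To walk the children of those children as well, use the .descendants attribute; it works like the other two, so it is not listed here.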
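Since .descendants is only mentioned in passing, a short sketch may help. It runs on an invented nested list and compares the direct children with the full set of descendants; the isinstance checks separate tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup with one level of nesting inside the first <li>.
html = '<ul><li><a href="#">one</a></li><li>two</li></ul>'
ul = BeautifulSoup(html, 'html.parser').ul

# Direct children only: the two <li> tags.
direct = [child.name for child in ul.children]

# All descendants, depth first: tags and the strings inside them.
deep_tags = [node.name for node in ul.descendants if isinstance(node, Tag)]
deep_strings = [node for node in ul.descendants if isinstance(node, NavigableString)]
```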

Parent Nodes

Use the .parent attribute to get an element's parent. The parent of article is body.

tag = soup.article
print tag.parent.name
# body

Or use the .parents attribute to iterate over all of its ancestors:

tag = soup.article
for p in tag.parents:
    print p.name
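As an illustration of what that loop yields, the sketch below collects the ancestor names for a tag in an invented document. The last entry is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = '<html><body><article><ul><li>x</li></ul></article></body></html>'
soup = BeautifulSoup(html, 'html.parser')

li = soup.li
parent_name = li.parent.name               # the immediate parent
ancestors = [p.name for p in li.parents]   # from the parent up to the document root
```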

Sibling Nodes

The .next_sibling and .previous_sibling attributes look up sibling nodes; they are used in the same way as the other navigation attributes.
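A common surprise with siblings is that whitespace between tags counts as a node. The following sketch, on an invented two-item list, shows .next_sibling returning the newline string, and find_next_sibling() skipping ahead to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; note the newline between the two <li> tags.
html = '<ul><li id="a">one</li>\n<li id="b">two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li', id='a')
raw_next = first.next_sibling             # the newline string, not the second <li>
next_tag = first.find_next_sibling('li')  # skips text nodes and returns the next <li>
no_prev = first.previous_sibling          # nothing precedes the first child, so None
```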

Searching the Document Tree

Searching the document tree for specific elements is the most common operation when scraping.

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)

The name Argument

Finds all tags whose name is name:

soup.find_all('title')
# [<title>Reeoo - web design inspiration and website gallery</title>]
soup.find_all('footer')
# [<footer id="footer">\n<div class="box">\n<p> ... </div>\n</footer>]

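Keyword Arguments

If an argument's name is not one of the built-in parameter names (name, attrs, recursive, string, limit), it is treated as a tag attribute to filter on. If no tag name is given, all tags are searched.

For example, search for all tags whose id is footer: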

soup.find_all(id='footer')
# [<footer id="footer">\n<div class="box">\n<p> ... </div>\n</footer>]

Adding the tag name as well:

soup.find_all('footer', id='footer')
# [<footer id="footer">\n<div class="box">\n<p> ... </div>\n</footer>]
# No div tag has the id 'footer', so the result is an empty list
soup.find_all('div', id='footer')
# []

Get all of the thumbnail div tags; thumbnails are marked with the class thumb:

soup.find_all('div', class_='thumb')

Note that because class is a reserved keyword in Python, the argument is written with a trailing underscore: class_.
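The keyword-argument searches above can be tried on a small invented fragment (the class names other than thumb, title, and box are made up). Note that class_ matches a tag when any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment loosely modeled on the page structure described above.
html = """
<div class="thumb featured"><a href="/a">A</a></div>
<div class="title"><a href="/a">A</a></div>
<footer id="footer"><div class="box"></div></footer>
"""
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='footer')             # any tag with id="footer"
thumbs = soup.find_all('div', class_='thumb')  # <div> whose class list includes "thumb"
no_match = soup.find_all('div', id='footer')   # the id belongs to <footer>, not a <div>
```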

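The value of an attribute argument can be a string, a regular expression, a list, or True/False.

True/False

Tests whether the given attribute exists.

Search for all tags that have a target attribute: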

soup.find_all(target=True)

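Search for all tags that lack a target attribute. (Look closely and you will still see tags with target in the printed results. They are not matches themselves; they appear because they are children of matched tags that lack target, so keep this in mind.)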

soup.find_all(target=False)
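The caveat about target=False is easier to follow on a tiny example. In this sketch (invented markup), the p tag matches because it has no target attribute itself, yet printing it still shows the nested link that does have one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with target, one without, both inside a <p>.
html = '<p><a href="/x" target="_blank">x</a> <a href="/y">y</a></p>'
soup = BeautifulSoup(html, 'html.parser')

with_target = soup.find_all(target=True)         # tags that carry a target attribute
without_target = soup.find_all(target=False)     # tags that do not: <p> and the second <a>
links_without = soup.find_all('a', target=False) # restricting the tag name avoids the <p>
```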

Several arguments can be combined as filter conditions. For example, the markup for the thumbnail section of the page looks like this:

...
<li>
    <div class="thumb">
        <a href="http://reeoo.com/aim-creative-studios"><img class="lazy" src="..." data-original="..." alt="AIM Creative Studios"></a>
    </div>
    <div class="title">
        <a href="http://reeoo.com/aim-creative-studios">AIM Creative Studios</a>
    </div>
</li>
...

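Search for tags whose src attribute contains the string reeoo.com and whose class is lazy (the regular expression requires importing the re module):

import re
soup.find_all(src=re.compile("reeoo.com"), class_='lazy')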

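The result is all of the thumbnail img tags.

Some attributes cannot be used as keyword arguments, for example data-* attributes. In the example above, data-original cannot be passed as a keyword argument; trying to do so fails with SyntaxError: keyword can't be an expression.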

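The attrs Argument

Pass a dictionary to search for tags with the corresponding attributes. This goes some way toward solving the problem above, where certain attributes cannot be used as keyword arguments.

For example, search for tags that have a data-original attribute: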

print soup.find_all(attrs={'data-original': True})

Search for tags whose data-original attribute contains the string reeoo.com:

soup.find_all(attrs={'data-original': re.compile("reeoo.com")})

Search for tags whose data-original attribute equals a given value:

soup.find_all(attrs={'data-original': 'http://media.reeoo.com/Bersi Serlini Franciacorta.png!page'})
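The attrs searches can be reproduced offline with a few invented img tags. The image URLs and alt texts below are placeholders, not values from the real page; the point is only that a dictionary key may contain a hyphen where a keyword argument may not.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical lazy-loaded images; every URL here is made up for illustration.
html = """
<img class="lazy" src="placeholder.gif" data-original="http://media.reeoo.com/one.png" alt="One">
<img class="lazy" src="placeholder.gif" data-original="http://example.com/two.png" alt="Two">
<img src="logo.png" alt="Logo">
"""
soup = BeautifulSoup(html, 'html.parser')

# Attribute merely present.
lazy = soup.find_all(attrs={'data-original': True})
# Attribute value matched by a regular expression (dot escaped to match literally).
from_reeoo = soup.find_all(attrs={'data-original': re.compile(r'reeoo\.com')})
# Attribute value matched exactly.
exact = soup.find_all(attrs={'data-original': 'http://example.com/two.png'})
```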

The string Argument

Works like the name argument, but matches against the string content of the document.

Search for strings that contain Reeoo:

soup.find_all(string=re.compile("Reeoo"))

Printing the result shows three elements, each of which is the text content of a matching tag.
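Because a string search returns text nodes rather than tags, it is often followed by a step up to .parent. The sketch below shows this on invented markup with two matching strings.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; two of the three strings mention Reeoo.
html = '<title>Reeoo gallery</title><p>About Reeoo</p><p>Other</p>'
soup = BeautifulSoup(html, 'html.parser')

found = soup.find_all(string=re.compile('Reeoo'))  # the matching strings themselves
owners = [s.parent.name for s in found]            # the tags that contain them
```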

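The limit Argument

find_all() returns results from the entire document, so on a large document the search can take a long time. With limit, the search stops and returns as soon as the number of results reaches the limit.

Search for div tags with the class thumb, stopping after three: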

soup.find_all('div', class_='thumb', limit=3)

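The output is a list of three elements, even though more than three tags in the document match.

The recursive Argument

find_all() searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.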
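A nested list makes the effect of limit and recursive visible. This is a minimal sketch on invented markup in which one li contains a second ul.

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: the first <li> contains an inner <ul> with one <li>.
html = '<ul><li>1<ul><li>1.1</li></ul></li><li>2</li><li>3</li></ul>'
outer = BeautifulSoup(html, 'html.parser').ul

all_li = outer.find_all('li')                      # every <li>, at any depth
direct_li = outer.find_all('li', recursive=False)  # only direct children of the outer <ul>
first_two = outer.find_all('li', limit=2)          # stops after two matches
```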

find()

find(name, attrs, recursive, string, **kwargs)

find() takes essentially the same arguments as find_all(), but it returns only the first result that matches, which is equivalent to calling find_all() with limit set to 1.

soup.find_all('div', class_='thumb', limit=1)
soup.find('div', class_='thumb')

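Both calls find the same tag; the only difference is that find_all() returns a list, while find() returns a single element.

When nothing matches, find() returns None, whereas find_all() returns an empty list.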
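The None return value matters in practice, because chaining an attribute access onto a failed find() raises an error. The sketch below, on invented markup, shows one way to guard against that; the class name missing is deliberately absent from the snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a single thumbnail block.
html = '<div class="thumb"><a href="/a">A</a></div>'
soup = BeautifulSoup(html, 'html.parser')

hit = soup.find('div', class_='thumb')          # a single Tag
miss = soup.find('div', class_='missing')       # None, nothing matches
misses = soup.find_all('div', class_='missing') # an empty list

# Check for None before drilling further into the result.
href = hit.a['href'] if hit is not None else None
missing_href = miss.a['href'] if miss is not None else None
```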

CSS Selectors

Pass a string to the select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

The semantics match CSS. Search for li tags inside ul tags under the article tag:

print soup.select('article ul li')

Search by class name. Both lines below give the same result, finding tags whose class is thumb:

soup.select('.thumb')
soup.select('[class~=thumb]')

Search by id. Find the tag whose id is sponsor:

soup.select('#sponsor')

Search by whether an attribute exists. Find li tags that have an id attribute:

soup.select('li[id]')

Search by attribute value. Find the li tag whose id is sponsor:

soup.select('li[id="sponsor"]')
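The selectors above can be checked together on a small invented list. The markup is a simplified stand-in for the page, and the extra class big is made up. select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified list markup.
html = """<article><ul>
<li id="sponsor"><div class="sponsor_tips"></div></li>
<li><div class="thumb big"></div></li>
<li><div class="thumb"></div></li>
</ul></article>"""
soup = BeautifulSoup(html, 'html.parser')

items = soup.select('article ul li')           # descendant combinator
by_class = soup.select('.thumb')               # class selector
by_attr_word = soup.select('[class~=thumb]')   # same match, attribute-word form
by_id = soup.select('#sponsor')                # id selector
with_id = soup.select('li[id]')                # attribute presence
sponsor = soup.select_one('li[id="sponsor"]')  # attribute value, first match only
```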

Other Methods

Other search methods include:

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

…

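Their arguments work much like those of find_all() and find(), so their usage is not listed here. Between them, find_all() and find() already cover the vast majority of search needs.

Other methods modify the document tree. For a scraper, most of the work is retrieving information from a page, and changes to the page source are rarely needed, so that material is not covered here either.

For full details, see the official Beautiful Soup documentation.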
