Hands-On | A Complete Walkthrough of Text Classification with TensorFlow and Python - 知識星球

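This tutorial builds a neural network that classifies movie reviews as positive or negative based on the text of the review. This is an example of binary (two-class) classification, an important and widely applicable kind of machine learning problem.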

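We will use the IMDB (Internet Movie Database) dataset, which contains the text of 50,000 movie reviews, split into a training set of 25,000 reviews and a test set of 25,000 reviews. Both sets are balanced: each contains an equal number of positive and negative reviews.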

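This tutorial uses tf.keras, a high-level API for building and training models in TensorFlow. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide. The following Python code imports Keras: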

import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

Output:

1.11.0

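Download the IMDB dataset

The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed: each review (a sequence of words) has been converted to a sequence of integers, where each integer represents a specific word in a dictionary.

The following code downloads the IMDB dataset (or reads it from the local cache if you have already downloaded it):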

imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Output:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step

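The argument num_words=10000 keeps the 10,000 most frequently occurring words in the dataset. Rarer words are dropped from the vocabulary to keep the size of the data manageable; in the encoded reviews they appear as the reserved "unknown" index, 2.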
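To make the effect of the vocabulary cap concrete, the sketch below mimics it in plain Python. It illustrates the idea only and is not the Keras implementation: the function name cap_vocabulary and the sample indices are made up for this example, and the out-of-vocabulary index 2 is assumed from the reserved "unknown" index used in this tutorial.

```python
def cap_vocabulary(sequence, num_words=10000, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_index for idx in sequence]


# Hypothetical encoded review: 20481 falls outside the 10,000-word vocabulary.
review = [1, 14, 22, 20481, 9999]
capped = cap_vocabulary(review)
print(capped)
```

Lowering num_words shrinks the embedding table the model has to learn, at the cost of collapsing more words into the single unknown index.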

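Explore the data

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each review is an array of integers that stand in for the words of the original review. Each review has a label, an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.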

print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

Output:

Training entries: 25000, labels: 25000

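The text of each review has been converted to an array of integers, each representing a specific word in a dictionary. Here is what the first review looks like: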

print(train_data[0])

Output:

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

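Movie reviews may differ in length, but the inputs to a neural network must all have the same length, so we will need to resolve this later. The code below shows the number of words in the first and second reviews: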

len(train_data[0]), len(train_data[1])

Output:

(218, 189)

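Convert the integers back to words:

It can be useful to know how to convert the integers back to text. The code below fetches a dictionary object that maps words to integer indices, inverts it, and defines a helper function that decodes a review: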

# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Output:

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
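The index shifting in the code above is easier to follow on a toy vocabulary. The sketch below is a self-contained illustration, not part of the original tutorial: the three-word dictionary and the names toy_word_index, toy_reverse_index, and decode are hypothetical, and the four reserved indices are given the token names <PAD>, <START>, <UNK>, and <UNUSED>.

```python
# Toy stand-in for imdb.get_word_index(); the three words are hypothetical.
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Shift every index by 3 to make room for the reserved tokens.
toy_word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2  # unknown
toy_word_index["<UNUSED>"] = 3

# Invert the mapping: integer index -> word.
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}


def decode(sequence):
    """Join the words for each index; '?' marks indices missing from the map."""
    return " ".join(toy_reverse_index.get(idx, "?") for idx in sequence)


decoded = decode([1, 4, 5, 2, 6, 0])
print(decoded)
```

Because of the shift, the most frequent word lands on index 4 rather than 1, which is why indices 0 through 3 never collide with real words.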

Now we can use the decode_review function to display the decoded text of the first review:

decode_review(train_data[0])

Output:

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

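Prepare the data

The reviews, which are arrays of integers, must be converted to tensors before being fed into the neural network. This conversion can be done in one of two ways: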

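Method 1: One-hot encode the arrays to convert them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones. The first layer of the network could then be a fully connected (Dense) layer, which can handle floating-point vector data. This approach is memory intensive, though: it requires a matrix of size num_words * num_reviews.
Method 2: Pad the arrays so that they all have the same length, then create an integer tensor of shape max_length * num_reviews. An embedding layer, which can handle this shape, can then serve as the first layer of the network.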

This tutorial uses the second approach.
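To see why the first approach is the more memory-hungry one, the following sketch encodes the [3, 5] example from the text and compares the number of entries each representation needs. It is a plain-Python illustration under stated assumptions: the helper name multi_hot is hypothetical, the vector is shrunk to 10 dimensions so it can be printed, and max_length = 256 is taken from the maxlen used in this tutorial's padding code.

```python
def multi_hot(sequence, dimension):
    """Return a vector of zeros with ones at the positions listed in sequence."""
    vector = [0.0] * dimension
    for idx in sequence:
        vector[idx] = 1.0
    return vector


# The example from the text, shrunk to 10 dimensions so it can be printed.
encoded = multi_hot([3, 5], dimension=10)
print(encoded)

# Rough size comparison for the 25,000 training reviews.
num_reviews, num_words, max_length = 25000, 10000, 256
one_hot_entries = num_words * num_reviews   # entries in the one-hot matrix
padded_entries = max_length * num_reviews   # entries in the padded tensor
print(one_hot_entries, padded_entries, one_hot_entries // padded_entries)
```

Under these assumptions the one-hot matrix holds roughly 39 times as many entries as the padded integer tensor, and it also throws away word order and repetition.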

Since the movie reviews must all be the same length, we use the pad_sequences function to standardize their lengths:

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)

Let's look at the lengths of the reviews now:

len(train_data[0]), len(train_data[1])

Output:

(256, 256)

And inspect the first review, now padded:

print(train_data[0])

Output:

[   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
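The padded output above can be reproduced on small lists with a few lines of plain Python. This is a minimal sketch of post-padding, not the Keras implementation; the helper name pad_post is hypothetical. It keeps the last maxlen items of an over-long sequence, which is what pad_sequences does when its truncating argument is left at the default 'pre', as in the calls above.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end up to maxlen; keep the last maxlen items if too long."""
    trimmed = sequence[-maxlen:]  # mirrors the default truncating='pre'
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([1, 14, 22], maxlen=5)
long = pad_post([1, 14, 22, 16, 43, 530], maxlen=4)
print(short)
print(long)
```

Note the asymmetry: padding is added at the end, but truncation removes the beginning, so reviews longer than 256 words lose their opening words rather than their closing ones.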

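Build the model

A neural network is created by stacking layers, which requires two main architectural decisions:

How many layers should the model use?
How many hidden units should each layer use?

In this example, the input data consists of arrays of word indices, and the labels to predict are either 0 or 1. We can build the following model for this problem: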

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

Output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
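The parameter counts in the summary can be checked by hand. The short calculation below is a sketch that assumes each Dense layer has one weight per input-output pair plus one bias per output, and that the embedding table stores one 16-dimensional vector per vocabulary entry; the variable names are chosen for this example.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim                # one vector per word
pooling_params = 0                                           # averaging has no weights
dense_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```

Almost all of the 160,289 parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the size of this model.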

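In this model, the following four layers are stacked sequentially to build the classifier:

The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. They add a dimension to the output array, so the resulting dimensions are (batch, sequence, embedding).
Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each review by averaging over the sequence dimension. This lets the model handle inputs of variable length in the simplest way possible.
This fixed-length output vector is passed through a fully connected (Dense) layer with 16 hidden units.
The last layer is a Dense layer with a single output node. With the sigmoid activation function, its output is a floating-point value between 0 and 1 that represents a probability, or confidence level.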
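Two of the operations in this stack, averaging over the sequence dimension and the sigmoid activation, are simple enough to write out directly. The sketch below uses plain Python lists in place of tensors; the function names and the 2-dimensional toy embeddings are hypothetical and only illustrate what the layers compute.

```python
import math


def global_average_pool(embedded_sequence):
    """Average a list of equal-length vectors over the sequence dimension."""
    length = len(embedded_sequence)
    return [sum(column) / length for column in zip(*embedded_sequence)]


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical 2-dimensional embeddings for a 3-word review.
embedded_review = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded_review)
print(pooled)        # one fixed-length vector, whatever the review length
print(sigmoid(0.0))  # a score of 0 maps to a probability of 0.5
```

The pooled vector has the embedding dimension regardless of how many words went in, which is what lets the Dense layers that follow have a fixed input size.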

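Hidden units:

The model above has two intermediate, or "hidden", layers between the input and the output. The number of outputs of a layer (units, nodes, or neurons) is the dimension of that layer's representational space; in other words, the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and may lead it to learn unwanted patterns: patterns that improve performance on the training data but not on the test data. This is called overfitting, and we explore it later.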

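Loss function and optimizer:

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we use the binary_crossentropy loss function.

This is not the only choice of loss function; you could, for instance, choose mean_squared_error. But binary_crossentropy is generally better suited to probabilities: it measures the "distance" between probability distributions or, in our case, between the ground-truth distribution and the predictions.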
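The difference between the two losses shows up most clearly on a single example. The sketch below computes both for one positive review at three confidence levels; it is a per-example illustration in plain Python, with hypothetical function names, and it assumes the predicted probability lies strictly between 0 and 1.

```python
import math


def binary_crossentropy(y_true, p_pred):
    """Loss for one example: y_true is 0 or 1, p_pred is strictly in (0, 1)."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))


def squared_error(y_true, p_pred):
    """Squared difference between the label and the predicted probability."""
    return (y_true - p_pred) ** 2


# A positive review (label 1) scored with increasing confidence.
for p in (0.01, 0.5, 0.99):
    print(p, binary_crossentropy(1, p), squared_error(1, p))
```

Cross-entropy grows without bound as a confident prediction turns out wrong, while squared error can never exceed 1, so cross-entropy penalizes confident mistakes far more heavily. An undecided prediction of 0.5 costs ln 2, about 0.693, which is close to the loss of 0.6914 reported for the first epoch in the training log.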

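Later, when we explore regression problems (say, predicting the price of a house), we will see how to use another loss function called mean squared error.

Now, configure the model with an optimizer and a loss function: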

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

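Create a validation set

When training, we want to check the model's accuracy on data it has not seen before. We therefore create a validation set by setting apart 10,000 reviews from the original training data. (Why not use the test set now? Our goal is to develop and tune the model using only the training data, then use the test data just once to evaluate its accuracy.)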

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
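The same slicing can be tried on stand-in data to confirm that the two parts do not overlap. In this sketch the "reviews" are just their positions in a list, and the toy_ names are hypothetical; it only checks the bookkeeping of the split, not anything about the real data.

```python
# Stand-in for the 25,000 training reviews: each "review" is just its position.
toy_train_data = list(range(25000))

toy_x_val = toy_train_data[:10000]            # first 10,000 -> validation
toy_partial_x_train = toy_train_data[10000:]  # remaining 15,000 -> training

print(len(toy_x_val), len(toy_partial_x_train))
# The two slices share no element and together cover the original data.
print(set(toy_x_val).isdisjoint(toy_partial_x_train))
```

Slicing the data and the labels at the same index, as the tutorial code does, keeps every review paired with its own label.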

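Train the model

The model is trained with mini-batch gradient descent for 40 epochs, in mini-batches of 512 samples (reviews). That is 40 passes over all the samples in the partial_x_train and partial_y_train tensors. During training, the model's loss and accuracy on the 10,000 samples of the validation set are recorded as well.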

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Output:

Train on 15000 samples, validate on 10000 samples
Epoch 1/40
15000/15000 [==============================] - 1s 57us/step - loss: 0.6914 - acc: 0.5662 - val_loss: 0.6886 - val_acc: 0.6416
Epoch 2/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6841 - acc: 0.7016 - val_loss: 0.6792 - val_acc: 0.6751
Epoch 3/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6706 - acc: 0.7347 - val_loss: 0.6627 - val_acc: 0.7228
Epoch 4/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.6481 - acc: 0.7403 - val_loss: 0.6376 - val_acc: 0.7774
Epoch 5/40
15000/15000 [==============================] - 1s 40us/step - loss: 0.6150 - acc: 0.7941 - val_loss: 0.6017 - val_acc: 0.7862
Epoch 6/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.5719 - acc: 0.8171 - val_loss: 0.5596 - val_acc: 0.7996
Epoch 7/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.5230 - acc: 0.8400 - val_loss: 0.5145 - val_acc: 0.8266
Epoch 8/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.4738 - acc: 0.8559 - val_loss: 0.4717 - val_acc: 0.8407
Epoch 9/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.4288 - acc: 0.8671 - val_loss: 0.4343 - val_acc: 0.8500
Epoch 10/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.3889 - acc: 0.8794 - val_loss: 0.4034 - val_acc: 0.8558
Epoch 11/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.3558 - acc: 0.8875 - val_loss: 0.3805 - val_acc: 0.8607
Epoch 12/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.3285 - acc: 0.8942 - val_loss: 0.3585 - val_acc: 0.8675
Epoch 13/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.3039 - acc: 0.9001 - val_loss: 0.3432 - val_acc: 0.8707
Epoch 14/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2836 - acc: 0.9056 - val_loss: 0.3299 - val_acc: 0.8739
Epoch 15/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2661 - acc: 0.9102 - val_loss: 0.3197 - val_acc: 0.8766
Epoch 16/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2512 - acc: 0.9145 - val_loss: 0.3114 - val_acc: 0.8780
Epoch 17/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.2368 - acc: 0.9196 - val_loss: 0.3046 - val_acc: 0.8800
Epoch 18/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.2244 - acc: 0.9235 - val_loss: 0.2991 - val_acc: 0.8820
Epoch 19/40
15000/15000 [==============================] - 1s 44us/step - loss: 0.2129 - acc: 0.9279 - val_loss: 0.2950 - val_acc: 0.8825
Epoch 20/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.2027 - acc: 0.9313 - val_loss: 0.2912 - val_acc: 0.8826
Epoch 21/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1929 - acc: 0.9357 - val_loss: 0.2884 - val_acc: 0.8836
Epoch 22/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1840 - acc: 0.9394 - val_loss: 0.2868 - val_acc: 0.8843
Epoch 23/40
15000/15000 [==============================] - 1s 40us/step - loss: 0.1758 - acc: 0.9429 - val_loss: 0.2856 - val_acc: 0.8840
Epoch 24/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1677 - acc: 0.9475 - val_loss: 0.2842 - val_acc: 0.8850
Epoch 25/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1606 - acc: 0.9503 - val_loss: 0.2838 - val_acc: 0.8847
Epoch 26/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1535 - acc: 0.9526 - val_loss: 0.2839 - val_acc: 0.8853
Epoch 27/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.1475 - acc: 0.9547 - val_loss: 0.2851 - val_acc: 0.8841
Epoch 28/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1414 - acc: 0.9571 - val_loss: 0.2848 - val_acc: 0.8862
Epoch 29/40
15000/15000 [==============================] - 1s 39us/step - loss: 0.1356 - acc: 0.9585 - val_loss: 0.2859 - val_acc: 0.8860
Epoch 30/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1307 - acc: 0.9617 - val_loss: 0.2877 - val_acc: 0.8864
Epoch 31/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1248 - acc: 0.9645 - val_loss: 0.2893 - val_acc: 0.8856
Epoch 32/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1202 - acc: 0.9660 - val_loss: 0.2916 - val_acc: 0.8844
Epoch 33/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1149 - acc: 0.9685 - val_loss: 0.2936 - val_acc: 0.8853
Epoch 34/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1107 - acc: 0.9695 - val_loss: 0.2971 - val_acc: 0.8845
Epoch 35/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.1069 - acc: 0.9707 - val_loss: 0.2987 - val_acc: 0.8854
Epoch 36/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.1021 - acc: 0.9731 - val_loss: 0.3019 - val_acc: 0.8842
Epoch 37/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.0984 - acc: 0.9747 - val_loss: 0.3050 - val_acc: 0.8833
Epoch 38/40
15000/15000 [==============================] - 1s 42us/step - loss: 0.0951 - acc: 0.9753 - val_loss: 0.3089 - val_acc: 0.8826
Epoch 39/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.0911 - acc: 0.9773 - val_loss: 0.3111 - val_acc: 0.8829
Epoch 40/40
15000/15000 [==============================] - 1s 41us/step - loss: 0.0876 - acc: 0.9795 - val_loss: 0.3149 - val_acc: 0.8829
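The batch size and epoch count also determine how many weight updates the run above performed. The arithmetic below is a small sketch that assumes one update per mini-batch and that the final, smaller batch of each epoch is still used; the variable names are chosen for this example.

```python
import math

num_train_samples = 15000  # size of partial_x_train
batch_size = 512
epochs = 40

steps_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

Under these assumptions each epoch consists of 30 updates, the last of them on only 152 reviews, for 1,200 updates over the whole run.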

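Evaluate the model

Now let's see how the model performs on the test set. The evaluation returns two values: the loss (a number that represents our error; lower is better) and the accuracy.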

results = model.evaluate(test_data, test_labels)

print(results)

Output:

25000/25000 [==============================] - 1s 36us/step
[0.33615295355796815, 0.87196]

This fairly naive approach achieves an accuracy of about 87%. With more advanced approaches, the model should get closer to 95%.
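For a model with a single sigmoid output, accuracy comes down to thresholding the predicted probability and counting matches. The sketch below shows that calculation on made-up values; the function name, the toy labels, and the toy probabilities are hypothetical, and the 0.5 threshold is an assumption about how the metric is computed.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    matches = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return matches / len(labels)


# Hypothetical labels and sigmoid outputs for four reviews.
toy_labels = [1, 0, 1, 0]
toy_probabilities = [0.91, 0.20, 0.35, 0.65]
toy_accuracy = binary_accuracy(toy_labels, toy_probabilities)
print(toy_accuracy)

# The reported test accuracy corresponds to this many correct reviews.
correct_reviews = round(0.87196 * 25000)
print(correct_reviews)
```

Read this way, the reported accuracy of 0.87196 means 21,799 of the 25,000 test reviews were classified correctly.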

Plot accuracy and loss over time

model.fit() returns a History object, which contains a dictionary recording everything that happened during training:

history_dict = history.history
history_dict.keys()

Output:

dict_keys(['acc', 'val_loss', 'loss', 'val_acc'])

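There are four entries, one for each metric monitored during training and validation. We can use them to plot the training and validation loss, and the training and validation accuracy, for comparison: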

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

Output: [figure: training and validation loss plotted against epochs]

plt.clf()   # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

Output: [figure: training and validation accuracy plotted against epochs]

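In the two plots above, the dots represent the training loss and accuracy, and the solid lines represent the validation loss and accuracy.

The training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization, which should minimize the desired quantity on every iteration.

This is not the case for the validation loss and accuracy, which stop improving after about twenty epochs: in the training log above, the validation loss bottoms out at 0.2838 in epoch 25 and then rises, while the validation accuracy levels off at around 0.88. This is an example of overfitting: the model performs better on the training data than on data it has never seen before. Beyond this point, the model over-optimizes and learns representations specific to the training data that do not generalize to the test data.

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. A later tutorial shows how to do this automatically with a callback.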
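The stopping rule such a callback applies can be sketched in plain Python using the validation losses from the training log above. Keras provides keras.callbacks.EarlyStopping (with monitor and patience arguments) for this purpose; the function below is not that implementation, only a minimal illustration of the patience idea, and its name and the choice of patience=3 are assumptions made for this example.

```python
# Validation loss per epoch, copied from the training log above.
val_loss = [
    0.6886, 0.6792, 0.6627, 0.6376, 0.6017, 0.5596, 0.5145, 0.4717,
    0.4343, 0.4034, 0.3805, 0.3585, 0.3432, 0.3299, 0.3197, 0.3114,
    0.3046, 0.2991, 0.2950, 0.2912, 0.2884, 0.2868, 0.2856, 0.2842,
    0.2838, 0.2839, 0.2851, 0.2848, 0.2859, 0.2877, 0.2893, 0.2916,
    0.2936, 0.2971, 0.2987, 0.3019, 0.3050, 0.3089, 0.3111, 0.3149,
]


def early_stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss seen so far; if that never happens, it runs to the end.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch


stop_epoch, best_epoch = early_stopping_epoch(val_loss, patience=3)
print(stop_epoch, best_epoch)
```

With a patience of 3, this rule would have halted the run at epoch 28, three epochs after the best validation loss at epoch 25, instead of training for all 40 epochs.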

# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.