Common Python Tools for Data Analysis and Data Mining

Author: 深度沉迷學習, columnist for the Python愛好者社群 (Python enthusiasts' community)

Jianshu page: https://www.jianshu.com/u/d76c6535dbc5

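The Python language:

A brief summary of the Python features most often used in data analysis and data mining:

Lists (mutable) and tuples (immutable)
Dictionaries (key-value structures)
Sets (the same concept as a mathematical set)
Functional programming (mainly lambda expressions, map(), filter(), and functools.reduce())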
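
As a quick illustration of the features listed above, here is a minimal sketch. The variable names and values are made up for the example; note that in Python 3, reduce is imported from functools.

```python
from functools import reduce

nums = [3, 1, 2]                 # list: mutable
nums.append(4)                   # in-place modification is allowed
point = (1, 2)                   # tuple: immutable; point[0] = 9 would raise TypeError
ages = {"ann": 30, "bob": 25}    # dict: maps keys to values
unique = {1, 2, 2, 3}            # set: duplicates collapse, as in a mathematical set

squares = list(map(lambda v: v * v, nums))        # apply a function to every element
evens = list(filter(lambda v: v % 2 == 0, nums))  # keep elements that pass a test
total = reduce(lambda acc, v: acc + v, nums, 0)   # fold the list into a single value

print(nums, unique, squares, evens, total)
```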

Common Python libraries for data analysis:

Python extension libraries for data mining

NumPy

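NumPy provides true arrays, which are much faster than Python's built-in lists. It is also a dependency of SciPy, Matplotlib, Pandas, and other libraries. Its built-in functions process data at C-level speed, so prefer them over hand-written Python loops wherever possible.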

Example: basic NumPy operations

import numpy as np  # np is the conventional alias

a = np.array([2, 0, 1, 5])
print(a)
print(a[:3])
print(a.min())
a.sort()  # sorts in place, so a itself is modified
print(a)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b*b)

Output:

[2 0 1 5]
[2 0 1]
0
[0 1 2 5]
[[ 1  4  9]
 [16 25 36]]
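
To make the advice about preferring built-in functions concrete, the sketch below computes the same sum of squares twice: once with a plain Python loop and once with vectorized NumPy calls. The array contents are made up for illustration, and no timing is shown because it depends on the machine.

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Pure-Python loop: every element is handled by the interpreter.
loop_total = 0
for v in values:
    loop_total += int(v) * int(v)

# Vectorized: the squaring and the summation both run inside NumPy's compiled code.
vec_total = int(np.sum(values ** 2))

print(loop_total, vec_total)
```

Both approaches give the same result; on large arrays the vectorized form is usually much faster because the loop runs in compiled code.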

SciPy

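Together, NumPy and SciPy give Python a MATLAB-like feel. SciPy depends on NumPy. NumPy provides multidimensional arrays, but these are general-purpose arrays rather than matrices: multiplying two arrays with * simply multiplies corresponding elements. SciPy builds on this with matrix-oriented functionality and a large collection of objects and functions based on matrix operations.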
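
The difference between element-wise and matrix multiplication can be sketched as follows. The matrices are made up for illustration; the @ operator and scipy.linalg.solve are used here as one way to do matrix arithmetic, not necessarily what the author had in mind.

```python
import numpy as np
from scipy import linalg

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b      # multiplies corresponding elements
matrix_product = a @ b   # true matrix product (rows times columns)

# Solve the linear system a @ x = rhs with SciPy's linear algebra module.
rhs = np.array([5, 11])
x = linalg.solve(a, rhs)

print(elementwise)
print(matrix_product)
print(x)
```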

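SciPy covers common computations such as optimization, linear algebra, integration, interpolation, curve fitting, special functions, fast Fourier transforms, signal processing, image processing, and solving ordinary differential equations.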

Example: solving a system of nonlinear equations and numerical integration with SciPy

# Solve the system of equations
from scipy.optimize import fsolve

def f(x):
    x1 = x[0]
    x2 = x[1]
    return [2 * x1 - x2 ** 2 - 1, x1 ** 2 - x2 - 2]


result = fsolve(f, [1, 1])
print(result)

# Integration
from scipy import integrate

def g(x):  # define the integrand
    return (1 - x ** 2) ** 0.5

pi_2, err = integrate.quad(g, -1, 1)  # returns the integral value and an error estimate
print(pi_2 * 2, err)

Output:

[ 1.91963957  1.68501606]
3.141592653589797 1.0002356720661965e-09
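
Fitting, another item in the feature list above, can be sketched with scipy.optimize.curve_fit. The model function line and the synthetic, noise-free data are assumptions made up for this illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, slope, intercept):
    """Hypothetical model to fit: a straight line."""
    return slope * x + intercept

x = np.linspace(0, 4, 9)
y = 2.0 * x + 1.0  # synthetic data generated from slope=2, intercept=1

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(line, x, y)
slope, intercept = params

print(slope, intercept)
```

Because the data lie exactly on a line, the fitted parameters should recover the slope and intercept used to generate them, up to floating-point tolerance.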

Matplotlib

Matplotlib is a well-known Python plotting library. It is mainly used for 2D plots and can also produce simple 3D plots.

Example: basic plotting with Matplotlib

import matplotlib.pyplot as plt
import numpy as np

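x = np.linspace(0, 10, 10000)  # independent variable x; 10000 is the number of points
y = np.sin(x) + 1  # dependent variable y
z = np.cos(x ** 2) + 1  # dependent variable z

plt.figure(figsize=(8, 4))  # set the figure size
# plt.rcParams['font.sans-serif'] = 'SimHei'  # set a font if the labels contain Chinese characters
# plt.rcParams['axes.unicode_minus'] = False  # add this line if minus signs render incorrectly in saved figures

# Two curves (raw strings keep the LaTeX backslashes intact)
plt.plot(x, y, label=r'$\sin x+1$', color='red', linewidth=2)  # set the label, line color, and line width
plt.plot(x, z, 'b--', label=r'$\cos x^2+1$')

plt.xlim(0, 10)  # x-axis range
plt.ylim(0, 2.5)  # y-axis range

plt.xlabel("Time(s)")  # x-axis label
plt.ylabel("Volt")  # y-axis label
plt.title("Matplotlib Sample")  # figure title

plt.legend()  # show the legend
plt.show()  # display the figure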

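Output: [Figure: "Matplotlib Sample", showing sin x + 1 as a red solid line and cos x^2 + 1 as a blue dashed line, with Time(s) on the x-axis and Volt on the y-axis]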
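
For the simple 3D plotting mentioned above, a minimal sketch could look like this. The surface function is made up for illustration, and the Agg backend plus the in-memory buffer are assumptions chosen so the script also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np

# Build a grid and evaluate a made-up surface z = sin(sqrt(x^2 + y^2)).
x = np.linspace(-3, 3, 50)
y = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(projection="3d")  # request 3D axes
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render to memory; pass a filename instead to save to disk
plt.close(fig)
```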

Pandas

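Pandas is a very powerful data analysis tool for Python. Built on top of NumPy, it supports SQL-like insert, delete, update, and query operations, offers a rich set of data-processing functions, supports time series analysis, and handles missing data flexibly.

Pandas' basic data structures are Series and DataFrame. A Series is a sequence, similar to a one-dimensional array. A DataFrame is a two-dimensional table, similar to a two-dimensional array, in which each column is a Series. To locate elements in a Series, Pandas provides the Index object, which plays a role similar to a primary key.

A DataFrame is essentially a container of Series.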
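
The relationship between DataFrame, Series, and Index can be checked directly. This is a minimal sketch with made-up column names and row labels.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [2, 5]}, index=["r1", "r2"])

col = df["a"]                              # selecting one column yields a Series
is_series = isinstance(col, pd.Series)
same_index = col.index.equals(df.index)    # the column shares the frame's Index
value = df.loc["r2", "b"]                  # label-based lookup through the Index

print(is_series, same_index, value)
```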

Example: simple Pandas operations

import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]], columns=['a', 'b', 'c'])
d2 = pd.DataFrame(s)

print(s)
print(d.head())  # preview the first 5 rows
print(d.describe())

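# Read a file (avoid non-ASCII characters in the path if possible)
# By default the first row is used as the header; pass header=None if the file has no header row
df = pd.read_csv("G:\\data.csv", encoding="utf-8")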
print(df)

Output:

a    1
b    2
c    3
dtype: int64
    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15
               a          b          c
count   6.000000   6.000000   6.000000
mean    8.500000   9.500000  10.500000
std     5.612486   5.612486   5.612486
min     1.000000   2.000000   3.000000
25%     4.750000   5.750000   6.750000
50%     8.500000   9.500000  10.500000
75%    12.250000  13.250000  14.250000
max    16.000000  17.000000  18.000000
Empty DataFrame
Columns: [1068, 12, 蔬果, 1201, 蔬菜, 120104, 花果, 20150430, 201504, DW-1201040010, 散稱, 生鮮, 千克, 0.973, 5.43, 2.58, 否]
Index: []
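
The SQL-like operations and missing-data handling mentioned in the Pandas introduction can be sketched as below. The table contents are made up for illustration, and the SQL labels in the comments are loose analogies rather than exact equivalents.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90.0, np.nan, 70.0]})

df["passed"] = df["score"] >= 80               # add a column (comparisons with NaN are False)
df.loc[df["name"] == "c", "score"] = 75.0      # "UPDATE ... WHERE name = 'c'"
high = df[df["score"] > 80]                    # "SELECT ... WHERE score > 80"
df = df.drop(columns=["passed"])               # remove a column

missing = int(df["score"].isna().sum())          # count missing values
filled = df["score"].fillna(df["score"].mean())  # replace NaN with the column mean

print(high)
print(missing)
print(filled)
```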

Scikit-Learn

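Scikit-Learn depends on NumPy, SciPy, and Matplotlib. It is a powerful machine learning library for Python that provides data preprocessing, classification, regression, clustering, prediction, and model analysis.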

Example: creating a linear regression model

from sklearn.linear_model import LinearRegression
model = LinearRegression()
print(model)

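Interface provided by all models:

model.fit(): trains the model; fit(X, y) for supervised models, fit(X) for unsupervised models

Interfaces provided by supervised models:

model.predict(X_new): predicts new samples
model.predict_proba(X_new): predicts class probabilities; only available for some models (e.g. logistic regression)

Interfaces provided by unsupervised models:

model.transform(): maps data into the new "basis space" learned from the data
model.fit_transform(): learns a new basis from the data and transforms that same data according to it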
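
The interfaces listed above can be exercised on the bundled iris data. This is a minimal sketch: LogisticRegression and PCA are chosen here only as convenient examples of a supervised and an unsupervised estimator, and max_iter=1000 is an assumption to help the solver converge.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit(X, y), then predict / predict_proba on samples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X[:2])
probs = clf.predict_proba(X[:2])   # one row per sample, one column per class

# Unsupervised: fit_transform(X) learns a new basis and projects the data onto it.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels, probs.shape, X_2d.shape)
```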

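Scikit-Learn ships with several datasets, such as the iris flower dataset and a handwritten digit image dataset. Take the iris dataset as an example: each sample has 4 features (sepal length and width, petal length and width) and a class label for one of three species.

Example: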

from sklearn import datasets  # import the datasets module
from sklearn import svm

iris = datasets.load_iris()  # load the dataset
clf = svm.LinearSVC()  # create a linear SVM classifier
clf.fit(iris.data, iris.target)  # train the model on the data
print(clf.predict([[5, 3, 1, 0.2], [5.0, 3.6, 1.3, 0.25]]))

Output:

[0 0]
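
The example above trains and predicts on the same data. For a rough idea of how the classifier generalizes, one option is to hold out part of the data, as sketched below. The 30% split, random_state=0, and max_iter=10000 are arbitrary choices made for this illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Keep 30% of the samples aside for testing, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = svm.LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of held-out samples classified correctly

print(accuracy)
```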

Keras

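Keras is a deep learning library that originally ran on top of Theano. Besides ordinary neural networks, it can build a variety of deep learning models, such as autoencoders, recurrent neural networks, recursive neural networks, and convolutional neural networks. It runs fast, simplifies the steps needed to build neural network models, lets ordinary users easily build deep networks with hundreds of input nodes, and is highly customizable.

Example: a simple MLP (multilayer perceptron)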

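# Written against the Keras 2+ API; older releases used Dense(input_dim, output_dim) and nb_epoch
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, Activation
from keras.optimizers import SGD

# Placeholder random data (not from the original article): 20 input features, 1 target per sample
x_train, y_train = np.random.rand(160, 20), np.random.randint(0, 2, size=(160, 1))
x_test, y_test = np.random.rand(32, 20), np.random.randint(0, 2, size=(32, 1))

model = Sequential()  # initialize the model
model.add(Input(shape=(20,)))  # input layer (20 nodes)
model.add(Dense(64))  # connection from the input layer to the first hidden layer (64 nodes)
model.add(Activation('tanh'))  # the first hidden layer uses tanh as its activation function
model.add(Dropout(0.5))  # use Dropout to reduce overfitting
model.add(Dense(64))  # connection from the first hidden layer to the second hidden layer (64 nodes)
model.add(Activation('tanh'))  # the second hidden layer uses tanh as its activation function
model.add(Dense(1))  # connection from the second hidden layer to the output layer (1 node)
model.add(Activation('sigmoid'))  # the output layer uses sigmoid as its activation function

sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)  # define the optimizer (the decay=1e-6 argument of older Keras versions is omitted)
model.compile(loss='mean_squared_error', optimizer=sgd)  # compile the model; the loss is mean squared error
model.fit(x_train, y_train, epochs=20, batch_size=16)  # train the model
score = model.evaluate(x_test, y_test, batch_size=16)  # evaluate the model on the test data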
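
To show what the stacked layers above compute, here is a NumPy-only sketch of the forward pass for the same 20-64-64-1 layout. The weights and inputs are random placeholders, and this is an illustration of the arithmetic rather than how Keras implements it. Dropout is left out because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized weights with the layer sizes from the example: 20 -> 64 -> 64 -> 1
W1, b1 = rng.normal(scale=0.1, size=(20, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 64)), np.zeros(64)
W3, b3 = rng.normal(scale=0.1, size=(64, 1)), np.zeros(1)

x = rng.normal(size=(5, 20))      # 5 made-up samples with 20 features each

h1 = np.tanh(x @ W1 + b1)         # first hidden layer: dense connection + tanh
h2 = np.tanh(h1 @ W2 + b2)        # second hidden layer: dense connection + tanh
out = sigmoid(h2 @ W3 + b3)       # output layer: dense connection + sigmoid

print(out.shape)
```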

References:

Keras Chinese documentation
How to Compute the Similarity of Two Documents (Part 2)

Gensim

Gensim is mainly used for language-processing tasks, such as text similarity computation, LDA, and Word2Vec.

Example:

import logging
from gensim import models

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

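sentences = [['first', 'sentence'], ['second', 'sentence']]  # tokenized sentences, passed as a list of lists
model = models.Word2Vec(sentences, min_count=1)  # train a word-vector model on the sentences above
print(model.wv['sentence'])  # print the word vector for 'sentence' (vectors are accessed through model.wv)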

Output:

2017-10-24 19:02:40,785 : INFO : collecting all words and their counts
2017-10-24 19:02:40,785 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-24 19:02:40,785 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2017-10-24 19:02:40,785 : INFO : Loading a fresh vocabulary
2017-10-24 19:02:40,785 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-10-24 19:02:40,785 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-10-24 19:02:40,786 : INFO : deleting the raw counts dictionary of 3 items
2017-10-24 19:02:40,786 : INFO : sample=0.001 downsamples 3 most-common words
2017-10-24 19:02:40,786 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-10-24 19:02:40,786 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-10-24 19:02:40,786 : INFO : resetting layer weights
2017-10-24 19:02:40,786 : INFO : training model with 3 workers on 3 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-10-24 19:02:40,788 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-10-24 19:02:40,789 : INFO : training on 20 raw words (0 effective words) took 0.0s, 0 effective words/s
2017-10-24 19:02:40,789 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
[ -1.54225400e-03 -2.45212857e-03 -2.20486755e-03 -3.64410551e-03 -2.28137174e-03 -1.70348200e-03 -1.05830852e-03 -4.37875278e-03 -4.97106137e-03 3.93485563e-04 -1.97932171e-03 -3.40653211e-03 1.54990738e-03 8.97102174e-04 2.94041773e-03 3.45200230e-03 -4.60584508e-03 3.81468004e-03 3.07120802e-03 2.85422982e-04 7.01598416e-04 2.69670971e-03 4.17246483e-03 -6.48593705e-04 1.11404411e-03 4.02203249e-03
-2.34672683e-03 2.35153269e-03 2.32632101e-05 3.76200466e-03 -3.95653257e-03 3.77303245e-03 8.48884694e-04 1.61545759e-03 2.53374409e-03 -4.25464474e-03 -2.06338940e-03 -6.84972096e-04 -6.92955102e-04 -2.27969326e-03 -2.13766913e-03 3.95324081e-03 3.52649018e-03 1.29243149e-03 4.29229392e-03 -4.34781052e-03 2.42843386e-03 3.12117115e-03 -2.99768522e-03 -1.17538485e-03 6.67148328e-04 -6.86432002e-04 -3.58940102e-03 2.40547652e-03 -4.18888079e-03 -3.12567432e-03 -2.51603196e-03 2.53451476e-03 3.65199335e-03 3.35336081e-03 -2.50071986e-04 4.15537134e-03 -3.89242987e-03 4.88173496e-03 -3.34603712e-03 3.18462006e-03 1.57053335e-04 3.51517834e-03 -1.20337342e-03 -1.81524854e-04 3.57784083e-05 -2.36600707e-03 -3.77405947e-03 -1.70441647e-03 -4.51521482e-03 -9.47134569e-04 4.53894213e-03 1.55767589e-03 8.57840874e-04 -1.12304837e-03 -3.95945460e-03 5.37869288e-04 -2.04461766e-03 5.24829782e-04 3.76719423e-03 -4.38512256e-03 4.81262803e-03 -4.20147832e-03 -3.87057988e-03 1.67581497e-03 1.51928759e-03 -1.31744961e-03 3.28474329e-03 -3.28777428e-03 -9.67226923e-04 4.62622894e-03 1.34165725e-03 3.60148447e-03 4.80416557e-03 -1.98963983e-03]
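
Text similarity, the first task mentioned for this library, often comes down to comparing vectors with cosine similarity. The sketch below uses plain bag-of-words counts and only the standard library. It is a simplified stand-in for illustration, not the library's own implementation, and the function name cosine_similarity is hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the bag-of-words count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

sim_same = cosine_similarity(['first', 'sentence'], ['first', 'sentence'])
sim_half = cosine_similarity(['first', 'sentence'], ['second', 'sentence'])
sim_none = cosine_similarity(['first'], ['second'])

print(sim_same, sim_half, sim_none)
```

Identical token lists score close to 1, lists sharing one of two words score about 0.5, and lists with no words in common score 0.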

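References:

How to Compute the Similarity of Two Documents (Part 2)

These notes give a brief introduction to tools commonly used in data analysis and data mining. Detailed usage will be covered in later notes.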

Repost notice: this article is reposted from 「Python愛好者社群」 (search for 「python_shequ」 to follow it).
