Using Nginx well: the one point that matters most

Source: 運維部落

ID:linux178

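Many people know that nginx can act as a reverse proxy and load balancer, but far fewer understand its health check (health_check) mechanism.

In fact, the health check provided by the community (open source) edition of nginx is quite weak. It is driven mainly by the max_fails and fail_timeout parameters set on the servers in an upstream block, and this article takes a close look at how that mechanism behaves.

There are better options, of course, such as the commercial nginx plus or Alibaba's tengine, whose health checks are more complete and efficient. If you insist on the community edition of nginx, you can also write your own module or compile in a third-party one.

First, my test environment: CentOS release 6.4 (Final) + nginx_1.6.0 + two tomcat8.0.15 instances as backend servers. (Note: all configuration below is for testing only and does not represent what runs in production; a real production environment needs more configuration and tuning.)

The nginx configuration is as follows: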

#user nobody;
worker_processes 1;
#pid logs/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log logs/access.log main;

    sendfile on;
    keepalive_timeout 65;

    # Two tomcat backends; one failure marks a server unavailable for 40s.
    upstream backend {
        server localhost:9090 max_fails=1 fail_timeout=40s;
        server localhost:9191 max_fails=1 fail_timeout=40s;
    }

    server {
        listen 80;
        server_name localhost;

        location / {
            proxy_pass http://backend;
            # Give up on a backend after 1 second (values are in seconds).
            proxy_connect_timeout 1;
            proxy_read_timeout 1;
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root html;
        }
    }
}

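I won't go over the basic nginx and tomcat configuration; see the official documentation for that.

As you can see, the upstream block defines two servers, each with its own max_fails and fail_timeout values.

Now start nginx, then start the two backend servers. I deliberately added a 10-minute sleep in a Tomcat Listener, so tomcat takes about 10 minutes to start: the port is open, but no requests are being served. Then visit http://localhost/response/ and watch how nginx behaves. (response is a simple servlet I wrote in tomcat: if the server on port 9090 handles the request it returns 9090, and if the server on port 9191 handles it, it returns 9191.)

Let's look at the nginx logs.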

access.log

192.168.42.254 - - [29/Dec/2014:11:24:23 +0800] "GET /response/ HTTP/1.1" 504 537 720 380 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 2.004 host:health.iflytek.com

192.168.42.254 - - [29/Dec/2014:11:24:24 +0800] "GET /favicon.ico HTTP/1.1" 502 537 715 311 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.000 host:health.iflytek.com

error.log

2014/12/29 11:24:22 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9090/response/", host: "health.iflytek.com"

2014/12/29 11:24:23 [error] 6318#0: *4785892017 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9191/response/", host: "health.iflytek.com"

2014/12/29 11:24:24 [error] 6318#0: *4785892017 no live upstreams while connecting to upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /favicon.ico HTTP/1.1", upstream: "http://health/favicon.ico", host: "health.iflytek.com"

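(Why sleep for 10 minutes in the listener? Our service needs to warm its cache, so the 10 minutes simulates a server that is unavailable for 10 minutes while it starts up.)

The logs show that while both tomcats are starting, a single request makes nginx automatically retry every backend server, and the following request is rejected with the error "no live upstreams while connecting to upstream". This counts as one way nginx performs a health check. One point deserves special emphasis: proxy_read_timeout is set to 1 second. This parameter matters a lot, because a starting tomcat accepts the connection but never answers, so the read timeout is what decides when an attempt counts as failed. Here each attempt gives up after about one second (the failed request above shows 2.004, presumably the request time, for two backends); with the default of 60 seconds the client would wait far longer before nginx moved on.

Wait 40 seconds. By now the 9090 server has finished starting, but 9191 is still starting. Watch the nginx logs.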

access.log

192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.210 host:health.iflytek.com

192.168.42.254 - - [29/Dec/2014:11:54:18 +0800] "GET /favicon.ico HTTP/1.1" 404 453 674 311 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 0.212 host:health.iflytek.com

error.log

No errors were logged.

The browser shows 9090, so nginx handled the request normally.

Let's send one more request.

access.log

192.168.42.254 - - [29/Dec/2014:13:43:13 +0800] "GET /response/ HTTP/1.1" 200 19 194 423 "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" 1.005 host:health.iflytek.com

The request succeeded and again returned 9090.

error.log

2014/12/29 13:43:13 [error] 6323#0: *4801368618 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.42.254, server: health.iflytek.com, request: "GET /response/ HTTP/1.1", upstream: "http://192.168.42.249:9191/response/", host: "health.iflytek.com"

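The nginx error.log gained one "upstream timed out" line, yet the client still received a normal response. By default an upstream balances requests round-robin, so this request was first forwarded to 9191. Because 9191 was still starting, that attempt failed, and nginx then retried the request on 9090.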
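The failover within a single request can be modeled in a few lines of Python. This is only a sketch for illustration, not nginx's implementation: the function name `proxy_with_failover` and the `backends` mapping are hypothetical, and a backend that is still starting is represented by a callable that raises `TimeoutError`.

```python
def proxy_with_failover(order, backends):
    """Try each backend in `order` once; return (response, attempts).

    `backends` maps a name to a zero-argument callable that either returns
    a response string or raises TimeoutError (a hypothetical stand-in for
    proxy_read_timeout expiring).
    """
    attempts = []
    for name in order:
        attempts.append(name)
        try:
            return backends[name](), attempts
        except TimeoutError:
            continue  # comparable to nginx passing the request to the next server
    return None, attempts  # every server failed


def still_starting():
    # Models a tomcat whose port is open but which never answers in time.
    raise TimeoutError("upstream timed out")


backends = {"9090": lambda: "9090", "9191": still_starting}
# Assume round-robin happens to pick 9191 first for this request.
response, attempts = proxy_with_failover(["9191", "9090"], backends)
print(response, attempts)
```

The client only sees the final response; the failed first attempt shows up solely in the list of attempts, much as it shows up only in error.log in the test above.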

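OK, but what does fail_timeout=40s actually mean? Shall we reproduce its effect?

Let's go!

Now just sit back and wait for the 9191 server to finish starting, then send a few more requests. Sure enough, 9191 starts answering with 9191.

fail_timeout=40s means that once requests find 9191 unable to respond normally (max_fails failures within the fail_timeout window), the server is treated as unavailable for 40 seconds. Once the 40 seconds have passed, requests are forwarded to it again, whether or not it has actually recovered.

So you can see how weak the health check in the community edition of nginx is: it is nothing more than a delayed block, repeated over and over.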
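To make the "delayed block" concrete, here is a toy Python model of the per-server state. It is a simplified sketch under stated assumptions, not nginx's source: the `Peer` class and its method names are invented for illustration, and time is passed in explicitly as simulated seconds.

```python
class Peer:
    """Toy model of one upstream server under max_fails / fail_timeout."""

    def __init__(self, name, max_fails=1, fail_timeout=40.0):
        self.name = name
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.last_fail = None  # simulated time of the most recent failure

    def available(self, now):
        if self.fails < self.max_fails:
            return True
        # Blocked only until fail_timeout has elapsed; nothing checks
        # whether the server has really recovered.
        return now - self.last_fail > self.fail_timeout

    def report(self, ok, now):
        if ok:
            self.fails = 0
            return
        # Failures outside the fail_timeout window start a fresh count.
        if self.last_fail is not None and now - self.last_fail > self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now


peer = Peer("9191", max_fails=1, fail_timeout=40.0)
peer.report(ok=False, now=0.0)     # a request times out at t=0
print(peer.available(now=10.0))    # inside the 40 s block
print(peer.available(now=41.0))    # eligible again, healthy or not
```

Note that `available` never probes the server. The only way the model learns that 9191 is still down is by sending it a real request after the block expires, which is exactly the weakness described above.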

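If you have used nginx plus, you will find that its health check mechanism is much more capable. A few keywords to look up yourself: zone, slow_start, health_check, match. slow_start in particular addresses the cache warm-up problem: when a server becomes available again (for example after a restart), nginx ramps its weight up from zero to the configured value over the slow_start period instead of immediately sending it a full share of requests, which gives the cache time to warm.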
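The idea behind slow_start, a recovered server's weight growing from zero back to its configured value over a set period, can be sketched as follows. The linear ramp is an assumption made for illustration only, since the exact curve nginx plus uses is not stated here, and `effective_weight` is a hypothetical helper, not an nginx API.

```python
def effective_weight(weight, slow_start, elapsed):
    """Weight of a recovered server `elapsed` seconds after it came back.

    Assumes a linear ramp from 0 to `weight` over `slow_start` seconds.
    """
    if slow_start <= 0 or elapsed >= slow_start:
        return weight
    return weight * max(elapsed, 0.0) / slow_start


# With a 600 s ramp, a server with weight 10 receives a growing share
# of traffic during its first 10 minutes back.
for t in (0, 150, 300, 600):
    print(t, effective_weight(weight=10, slow_start=600, elapsed=t))
```

Under this assumption, a server that needs about 10 minutes of cache warm-up gets only a trickle of requests at first, rather than a full share the moment it is considered available.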
