Dynamic Target Discovery and Relabeling in Prometheus

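Using Consul as an example, this article introduces Prometheus service discovery, which lets Prometheus find Targets dynamically when monitoring cloud and container platforms. It also shows how Prometheus relabeling can aggregate monitoring data across multiple data centers and select or filter the Targets to scrape.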

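Jobs and Instances in Prometheus

Prometheus consists mainly of the following components:

Prometheus Server: scrapes monitoring data and exposes PromQL for querying and aggregating it;
Exporters: expose an endpoint from which Prometheus Server collects data; Prometheus polls these Exporters, then collects and stores the data;
AlertManager and other components (not relevant to this article, so not covered here)

In the Prometheus Server configuration file, scrape_configs defines what to scrape: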

scrape_configs:
- job_name: prometheus
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090

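Each scrape_config object corresponds to one scrape Job, and each Job can have multiple Instances, which are the targets in the configuration file. The Prometheus UI shows this relationship more directly.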
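The Job and Instance relationship can be traced with a small Python sketch. It is only an illustration of the mapping, not Prometheus code: the function name `expand_static_config` and the dictionary layout are assumptions made for this example. It expands one scrape_config into one entry per target, each carrying the `job` and `instance` labels that Prometheus attaches to scraped series.

```python
from typing import Dict, List


def expand_static_config(scrape_config: Dict) -> List[Dict[str, str]]:
    """Expand one scrape_config into one entry per target (hypothetical helper)."""
    scheme = scrape_config.get("scheme", "http")
    path = scrape_config.get("metrics_path", "/metrics")
    targets = []
    for group in scrape_config.get("static_configs", []):
        for address in group["targets"]:
            targets.append({
                "job": scrape_config["job_name"],  # one Job per scrape_config
                "instance": address,               # one Instance per target
                "url": f"{scheme}://{address}{path}",
            })
    return targets


# The same values as the YAML configuration above, written as a dict.
config = {
    "job_name": "prometheus",
    "metrics_path": "/metrics",
    "scheme": "http",
    "static_configs": [{"targets": ["localhost:9090"]}],
}

for target in expand_static_config(config):
    print(target)
```

Adding a second address to `targets` would yield a second Instance under the same Job.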

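Pull vs Push

In push-based systems such as Zabbix and Nagios, the collecting agent usually decides which monitoring server it talks to. In a pull-based platform such as Prometheus, the server side decides which targets are scraped.

Compared with a push system, a pull system has these advantages:

As long as the Exporter is running, you can set up your monitoring system anywhere (for example, on your local machine)
You can more easily check the health of an Instance and locate faults

From my personal point of view, a pull system also lends itself better to DevOps. Each team can build its own monitoring system, follow the metrics it cares about, and build its own DevOps dashboard.

For small-scale monitoring or local testing, static_configs is the most common way to configure scrape targets. On IaaS platforms (such as OpenStack) or CaaS platforms (such as Kubernetes), however, infrastructure, containers, and applications are created and destroyed much more frequently.

So how does a pull system like Prometheus discover these targets dynamically?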

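Service Discovery

Prometheus supports many service discovery mechanisms: file, DNS, Consul, Kubernetes, OpenStack, EC2, and others. The process is not complicated: through an interface provided by a third party, Prometheus retrieves the list of Targets to monitor, then polls those Targets for monitoring data.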
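As a rough illustration of that flow with Consul, the Python sketch below turns a catalog response into a Target list. The sample JSON is made up, but shaped like the output of Consul's `/v1/catalog/service/<name>` endpoint; the function name `targets_from_catalog` is hypothetical, and the exact endpoint Prometheus queries internally may differ by version. The `__meta_consul_tags` value is built with a leading and trailing comma, which is how Prometheus exposes Consul tags with the default tag separator.

```python
import json
from typing import Dict, List

# Made-up sample, shaped like a Consul catalog response for one service.
SAMPLE_RESPONSE = json.dumps([
    {
        "Node": "node0",
        "Address": "172.21.0.2",
        "Datacenter": "dc1",
        "ServiceID": "node_exporter-1",
        "ServiceName": "node_exporter",
        "ServiceAddress": "172.21.0.3",
        "ServicePort": 9100,
        "ServiceTags": ["development"],
    }
])


def targets_from_catalog(body: str) -> List[Dict[str, str]]:
    """Convert a catalog response into label sets, one per discovered Target."""
    targets = []
    for entry in json.loads(body):
        # Fall back to the node address when the service has no address of its own.
        host = entry.get("ServiceAddress") or entry["Address"]
        tags = entry.get("ServiceTags") or []
        targets.append({
            "__address__": f"{host}:{entry['ServicePort']}",
            "__meta_consul_dc": entry.get("Datacenter", ""),
            "__meta_consul_service": entry["ServiceName"],
            "__meta_consul_tags": "," + ",".join(tags) + ",",
        })
    return targets


for target in targets_from_catalog(SAMPLE_RESPONSE):
    print(target)
```

In a real setup the response body would come from the Consul HTTP API rather than from a constant.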

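To try out Prometheus service discovery, we use Docker Compose to build a local test environment. gliderlabs/registrator watches the Docker daemon, and for every container that exposes a port it automatically registers that container's service address in Consul.

Node Exporter collects data about the current host, and cAdvisor collects container data.

The complete Docker Compose file is as follows: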

version: '2'
services:
  consul:
    image: consul
    ports:
      - 8400:8400
      - 8500:8500
      - 8600:53/udp
    command: agent -server -client=0.0.0.0 -dev -node=node0 -bootstrap-expect=1 -data-dir=/tmp/consul
    labels:
      SERVICE_IGNORE: 'true'
  registrator:
    image: gliderlabs/registrator
    depends_on:
      - consul
    volumes:
      - /var/run:/tmp:rw
    command: consul://consul:8500
  prometheus:
    image: quay.io/prometheus/prometheus
    ports:
      - 9090:9090
  node_exporter:
    image: quay.io/prometheus/node-exporter
    pid: "host"
    ports:
      - 9100:9100
  cadvisor:
    image: google/cadvisor:latest
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro 
      - /var/run:/var/run:rw
      - /var/lib/docker/:/var/lib/docker:ro

Start the application stack with Docker Compose. The registered services can then be seen in the Consul UI.

Create the Prometheus configuration file:

global:
  scrape_interval: 5s
  scrape_timeout: 5s
  evaluation_interval: 15s
scrape_configs:
  - job_name: consul_sd
    metrics_path: /metrics
    scheme: http
    consul_sd_configs:
      - server: consul:8500
        scheme: http
        services:
          - node_exporter
          - cadvisor

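This creates a Job named consul_sd and uses consul_sd_configs to define the service instances to fetch from Consul, where:

server: the address of the Consul server
services: the services registered in Consul that should be discovered

Mount the configuration file into the Prometheus Server container and restart Docker Compose.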

services:
  prometheus:
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml

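On the Targets page of the Prometheus UI, the discovered instances are now listed.

By registering the Exporters in Consul and configuring Prometheus to use Consul, the instances to scrape are discovered dynamically.

How to Filter and Select Targets? relabel

So far, any Node Exporter or cAdvisor instance registered in Consul is automatically added to the Prometheus Targets. Now consider the following scenario:

A production setup may be split into separate clusters: dev, stage, and prod. Each cluster runs multiple hosts, and each host runs one Node Exporter instance. The Node Exporter instances are registered automatically in the Consul service registry. Prometheus builds its Target list from the Node Exporter instances returned by Consul and polls those Targets for data.

So far so good.

However, we may also need answers to the following:

How do we aggregate monitoring data by environment (dev, stage, prod)?
As a development team, what if we only care about monitoring data from the dev environment?
What if each team runs its own Prometheus Server? How do we make each team's Prometheus Server scrape a different environment?

Question 1: How to Aggregate Monitoring Data by Environment? replace

By default, the host metrics collected from the Node Exporters of all environments look like this: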

node_cpu{cpu="cpu0",instance="172.21.0.3:9100",job="consul_sd",mode="guest"}

Here instance is the address of the Target. It lets us tell hosts apart, but not environments.

What we want the collected metrics to look like is:

node_cpu{cpu="cpu0",instance="172.21.0.3:9100",dc="dc1",job="consul_sd",mode="guest"}

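The dc (data center) label marks which environment each sample comes from, so we can aggregate by dc, for example sum(node_cpu{dc="dc1"}).

To achieve this we use the replace action of relabeling.

The official documentation describes relabeling as follows: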

Relabeling is a powerful tool to dynamically rewrite the label set of a target before it gets scraped. Multiple relabeling steps can be configured per scrape configuration. They are applied to the label set of each target in order of their appearance in the configuration file.

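Put simply, relabeling lets Prometheus rewrite label values dynamically from a Target's metadata before it scrapes the Target. Beyond that, the same metadata can be used to decide whether a Target is scraped or ignored.

Targets discovered through Consul carry the following metadata:

__meta_consul_address: the address of the Target
__meta_consul_dc: the data center the service belongs to in Consul
__meta_consul_metadata_<key>: metadata values, one label per key
__meta_consul_node: the Consul node the service runs on
__meta_consul_service_address: the service address
__meta_consul_service_id: the service ID
__meta_consul_service_port: the service port
__meta_consul_service: the service name
__meta_consul_tags: the tags attached to the service

The metadata of a Target can also be viewed directly in the Prometheus UI.

Here we use __meta_consul_dc to record the data center of the current Target. regex is matched against the value of source_labels, and replacement selects the match group captured by regex. action tells Prometheus to write the replacement into the target_label dc before scraping: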

...
scrape_configs:
  - job_name: consul_sd
    relabel_configs:
    - source_labels:  ["__meta_consul_dc"]
      regex: "(.*)"
      replacement: $1
      action: replace
      target_label: "dc"
...

When the label value is copied unchanged, the rule can be shortened to:

    - source_labels: ["__meta_consul_dc"]
      target_label: "dc"
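To make the replace semantics concrete, here is a minimal Python sketch of a single replace step. It is an approximation, not the Prometheus implementation: `relabel_replace` is a hypothetical name, and Python's `re` module stands in for the RE2 syntax Prometheus uses, so unusual patterns may behave differently. The defaults mirror the documented ones (regex `(.*)`, replacement `$1`, separator `;`), which is why the short form of the rule gives the same result as the long form.

```python
import re
from typing import Dict, List


def relabel_replace(labels: Dict[str, str], source_labels: List[str],
                    target_label: str, regex: str = "(.*)",
                    replacement: str = "$1", separator: str = ";") -> Dict[str, str]:
    """Apply one replace rule to a label set and return the new label set."""
    # Source label values are joined with the separator before matching.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex must match the whole joined value, not just a part of it.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Translate $1 / ${1} into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    new_value = match.expand(template)
    result = dict(labels)
    if new_value == "":
        result.pop(target_label, None)  # an empty result removes the label
    else:
        result[target_label] = new_value
    return result


discovered = {"__address__": "172.21.0.3:9100", "__meta_consul_dc": "dc1"}
print(relabel_replace(discovered, ["__meta_consul_dc"], "dc"))
```

Labels that start with a double underscore are dropped by Prometheus after relabeling, which is why only dc ends up on the stored metrics; the sketch does not model that step.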

Restart Prometheus and open the Target list in the UI.

The Labels column of each Target shows the labels of that Instance.

Querying Prometheus now shows that every metric carries the dc label of its data center:

node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="guest"}    0
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="guest_nice"}    0
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="idle"}    91933.77
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="iowait"}    56.8
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="irq"}    0
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="nice"}    0
node_cpu{cpu="cpu0",dc="dc1",instance="172.21.0.6:9100",job="consul_sd",mode="softirq"}    19.02

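Question 2: How to Select Which Targets to Scrape? keep/drop

In the first question, setting the action of relabel_configs to replace told Prometheus to write a new label onto every metric scraped from an instance. To filter Targets, we set action to keep or drop instead.

Add the following to the Job configuration. When the Target's metadata label __meta_consul_tags matches ".*,development,.*", the instance is kept: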

    relabel_configs:
    - source_labels: ["__meta_consul_tags"]
      regex: ".*,development,.*"
      action: keep
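The effect of keep and drop can be sketched in Python as a filter over the discovered label sets. As before, this is an illustration under assumptions: `relabel_filter` is a hypothetical name, Python's `re` approximates the regex engine, and the tag values are written the way Prometheus exposes Consul tags with the default separator, with a comma on both ends. That surrounding comma is what lets ".*,development,.*" match a service whose only tag is development.

```python
import re
from typing import Dict, List


def relabel_filter(targets: List[Dict[str, str]], source_labels: List[str],
                   regex: str, action: str = "keep",
                   separator: str = ";") -> List[Dict[str, str]]:
    """Return the Targets that survive one keep or drop rule."""
    if action not in ("keep", "drop"):
        raise ValueError("action must be 'keep' or 'drop'")
    survivors = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        matched = re.fullmatch(regex, value) is not None
        # keep retains matching Targets; drop retains non-matching ones.
        if (action == "keep") == matched:
            survivors.append(labels)
    return survivors


# Illustrative label sets for the two services used in this article.
discovered = [
    {"__meta_consul_service": "node_exporter", "__meta_consul_tags": ",development,"},
    {"__meta_consul_service": "cadvisor", "__meta_consul_tags": ",production,scraped,"},
]

kept = relabel_filter(discovered, ["__meta_consul_tags"], ".*,development,.*", "keep")
print([t["__meta_consul_service"] for t in kept])
```

Running the same rule with action set to drop would retain only cadvisor.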

To simulate this locally, we can use registrator's ability to register service tags automatically. Modify the Docker Compose file as follows:

version: '2'
services:
  consul:
    image: consul
    ports:
      - 8400:8400
      - 8500:8500
      - 8600:53/udp
    command: agent -server -client=0.0.0.0 -dev -node=node0 -bootstrap-expect=1 -data-dir=/tmp/consul
    labels:
      SERVICE_IGNORE: 'true'
  registrator:
    image: gliderlabs/registrator
    depends_on:
      - consul
    volumes:
      - /var/run:/tmp:rw
    command: consul://consul:8500
  prometheus:
    image: quay.io/prometheus/prometheus
    ports:
      - 9090:9090
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
  node_exporter:
    image: quay.io/prometheus/node-exporter
    pid: "host"
    ports:
      - 9100:9100
    labels:
      SERVICE_TAGS: "development" # register this service in Consul with the tag development
  cadvisor:
    image: google/cadvisor:latest
    ports:
      - 8080:8080
    volumes:
      - /:/rootfs:ro 
      - /var/run:/var/run:rw
      - /var/lib/docker/:/var/lib/docker:ro
    labels:
      SERVICE_TAGS: "production,scraped" # register this service in Consul with the tags production and scraped

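After restarting docker-compose, the tags of each service can be seen in Consul.

On the Targets page of the Prometheus UI, the only remaining Target instances are those whose __meta_consul_tags contains development; the other instances registered in Consul have been filtered out.

Summary

To sum up:

On cloud and container platforms, Prometheus service discovery (SD) can dynamically discover the instances to monitor
Relabeling can dynamically modify the labels of metrics before the metrics data is written
Relabeling can filter and select Target instances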

Original article: http://yunlzheng.github.io/2018/01/17/prometheus-sd-and-relabel/
