An Analysis of the Prometheus Storage System

Posted on 2018-06-27

DevOps

Prometheus offers both local on-disk storage and remote storage. This section explains how the two differ, as groundwork for understanding how to make data storage reliable.

Time Series Database

A time series database (TSDB) is a database system optimized for storing and managing time series data, with the goal of providing high-performance reads and writes. A TSDB generally indexes its data by time.

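A TSDB usually has the following characteristics:

The workload is dominated by writes.
Data is appended sequentially, and in most cases arrives ordered by time.
Data is rarely updated, and a write usually completes very quickly, within a few seconds.
Data is deleted in blocks; deleting the data for a single point in time is rare.
Reads are generally sequential, in ascending or descending order.
High concurrency and clustering are supported.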
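The characteristics above can be illustrated with a minimal in-memory sketch. It is not how any particular TSDB is implemented; the `Series` type and its methods are hypothetical names chosen for this example. It only shows time-ordered appends, sequential range reads, and bulk deletion.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one timestamped value (timestamp in milliseconds).
type Sample struct {
	T int64
	V float64
}

// Series is a hypothetical append-only series kept sorted by time.
type Series struct {
	samples []Sample
}

// Append adds a sample and rejects anything not newer than the latest one,
// which keeps the slice ordered without ever sorting it.
func (s *Series) Append(t int64, v float64) error {
	if n := len(s.samples); n > 0 && t <= s.samples[n-1].T {
		return fmt.Errorf("out-of-order sample at t=%d", t)
	}
	s.samples = append(s.samples, Sample{T: t, V: v})
	return nil
}

// Range returns the samples with mint <= T <= maxt in ascending order.
func (s *Series) Range(mint, maxt int64) []Sample {
	lo := sort.Search(len(s.samples), func(i int) bool { return s.samples[i].T >= mint })
	hi := sort.Search(len(s.samples), func(i int) bool { return s.samples[i].T > maxt })
	if lo >= hi {
		return nil
	}
	return s.samples[lo:hi]
}

// DropBefore deletes every sample older than t in one bulk operation.
func (s *Series) DropBefore(t int64) {
	i := sort.Search(len(s.samples), func(i int) bool { return s.samples[i].T >= t })
	s.samples = s.samples[i:]
}

func main() {
	var s Series
	for t := int64(0); t < 5; t++ {
		if err := s.Append(t*1000, float64(t)); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println(len(s.Range(1000, 3000))) // 3
	s.DropBefore(2000)
	fmt.Println(len(s.Range(0, 10000))) // 3
}
```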

For a list of common TSDBs, see TSDB Projects.

Local Storage

On-disk Layout

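Starting with version 2.0, Prometheus introduced a custom storage format for persisting scraped time series data to local disk. Locally stored samples are grouped into blocks of two hours by default, each block representing one sliding window. Every block contains all the sample data for its window (chunks), a metadata file (meta.json), and an index file (index).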

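Prometheus uses a write-ahead log (WAL) to protect the block currently being filled, whose freshly scraped samples are held in memory, against server crashes and restarts. When Prometheus restarts, it replays the WAL to recover that data. When time series are deleted through the API, the deletion records are stored in separate tombstone files rather than the data being removed from the block files immediately.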

The Prometheus data directory is laid out as follows:

# Start Prometheus with a short block duration to observe the layout
$ ./prometheus --storage.tsdb.min-block-duration=5m

# After 5m has passed, inspect the data directory with tree
$ tree data
data
├── 01CGDH53DJXHE1M1971QM0NHMD
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── lock
└── wal
    ├── 000001
    └── 000003

The TSDB storage format is described in TSDB format.

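As the listing shows, one block (one sliding window) is produced every five minutes. Storing all samples this way makes Prometheus queries more efficient: to query a given time range, only the blocks covering that range need to be read.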

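This layout also simplifies deleting historical data: once a block's time range falls entirely outside the configured retention period, the whole block is dropped.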
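Both benefits, reading only the relevant blocks and dropping expired blocks whole, come down to comparing time ranges. The following is a minimal sketch, assuming each block covers the half-open range [MinTime, MaxTime) in milliseconds; `BlockMeta`, `blocksForQuery`, and `applyRetention` are hypothetical names, and real blocks are identified by ULID rather than the short names used here.

```go
package main

import "fmt"

// BlockMeta holds the time range of one block, assumed to be [MinTime, MaxTime).
type BlockMeta struct {
	Name    string // hypothetical label for the example
	MinTime int64
	MaxTime int64
}

// blocksForQuery returns only the blocks that overlap the query range [mint, maxt].
func blocksForQuery(blocks []BlockMeta, mint, maxt int64) []BlockMeta {
	var out []BlockMeta
	for _, b := range blocks {
		if b.MinTime <= maxt && mint < b.MaxTime {
			out = append(out, b)
		}
	}
	return out
}

// applyRetention splits blocks into those still inside the retention period
// and those that lie entirely before the cutoff (now - retention).
func applyRetention(blocks []BlockMeta, now, retention int64) (kept, dropped []BlockMeta) {
	cutoff := now - retention
	for _, b := range blocks {
		if b.MaxTime <= cutoff {
			dropped = append(dropped, b)
		} else {
			kept = append(kept, b)
		}
	}
	return kept, dropped
}

func main() {
	const fiveMin = 5 * 60 * 1000
	blocks := []BlockMeta{
		{"a", 0, fiveMin},
		{"b", fiveMin, 2 * fiveMin},
		{"c", 2 * fiveMin, 3 * fiveMin},
	}

	// A query inside the second window touches only block b.
	for _, b := range blocksForQuery(blocks, fiveMin+1000, 2*fiveMin-1000) {
		fmt.Println("query reads block", b.Name)
	}

	// With five minutes of retention at t=15m, blocks a and b are dropped whole.
	kept, dropped := applyRetention(blocks, 3*fiveMin, fiveMin)
	fmt.Println(len(kept), "kept,", len(dropped), "dropped")
}
```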

Prometheus stores roughly 1 to 2 bytes per sample on average, so the capacity needed for a Prometheus server can be estimated with a simple formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

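To lower the storage requirement, work from this formula: if retention_time_seconds and bytes_per_sample stay fixed, the only remaining lever is ingested_samples_per_second, for example by scraping fewer targets or fewer series per target, or by lengthening the scrape interval.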

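Note: Local on-disk storage offers fast querying and retrieval, but it has limits. The stored data is neither clustered nor replicated, so a disk failure or node outage can lose the stored time series. If your durability requirements are not strict, local on-disk storage is still a reasonable place to keep sample data.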

Memory Usage

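Before version 2.0, Prometheus kept all chunks currently in use in memory, and additionally kept as many of the most recently used chunks in memory as possible. Because this caching can consume a lot of memory, Prometheus lets you set a target heap size (in bytes) to avoid running out of memory (OOM). When choosing the heap size, keep in mind that the physical memory Prometheus actually uses results from complex interactions between the Go runtime and the operating system, so it is hard to predict precisely. The recommendation is therefore to set the heap size to about 2/3 of physical memory (the default is 2G).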

For example, with a heap size of 2G, Prometheus will actually use about 3G of physical memory.

Remote Storage

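The scalability and durability of Prometheus local storage are limited by the single node. Rather than implementing its own clustered storage, Prometheus exposes a set of interfaces, defined with Protocol Buffers and carried over HTTP, for integrating with remote storage systems.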

Prometheus integrates with remote storage systems in two ways:

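Remote write: Prometheus can write samples to a remote URL in a standard format. Once the remote write URL is set in the configuration file, Prometheus sends sample data over HTTP to an adapter. Inside the adapter you can connect to any external service, such as a distributed storage system, a public cloud storage service, or a message queue.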
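Remote read: Prometheus can read sample data back from a remote URL in a standard format. As with remote write, an adapter handles the integration with the storage service. When a user issues a query, Prometheus sends a read request (matchers and time ranges) to the URL configured under remote_read. The adapter fetches the matching data from the third-party storage service, converts it into raw Prometheus samples, and returns it to the Prometheus server.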
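The selection step an adapter performs for remote read can be sketched with plain Go types. This is only an illustration: `Series`, `Matcher`, `Query`, and `Read` are hypothetical stand-ins rather than the prompb types, the store is an in-memory slice, and only equality matchers are handled, whereas the real protocol also carries not-equal and regular-expression matchers.

```go
package main

import "fmt"

// Sample is one raw sample (timestamp in milliseconds).
type Sample struct {
	TimestampMs int64
	Value       float64
}

// Series is a labeled list of samples held by the hypothetical backing store.
type Series struct {
	Labels  map[string]string
	Samples []Sample
}

// Matcher is an equality-only label matcher (a simplification for this sketch).
type Matcher struct {
	Name, Value string
}

// Query carries the time range and matchers of one read request.
type Query struct {
	StartMs, EndMs int64
	Matchers       []Matcher
}

// matches reports whether every matcher is satisfied by the label set.
func matches(labels map[string]string, ms []Matcher) bool {
	for _, m := range ms {
		if labels[m.Name] != m.Value {
			return false
		}
	}
	return true
}

// Read returns the series that match q, trimmed to the range [StartMs, EndMs].
func Read(store []Series, q Query) []Series {
	var out []Series
	for _, s := range store {
		if !matches(s.Labels, q.Matchers) {
			continue
		}
		res := Series{Labels: s.Labels}
		for _, sm := range s.Samples {
			if sm.TimestampMs >= q.StartMs && sm.TimestampMs <= q.EndMs {
				res.Samples = append(res.Samples, sm)
			}
		}
		if len(res.Samples) > 0 {
			out = append(out, res)
		}
	}
	return out
}

func main() {
	store := []Series{
		{Labels: map[string]string{"__name__": "up", "job": "node"},
			Samples: []Sample{{1000, 1}, {2000, 0}, {3000, 1}}},
		{Labels: map[string]string{"__name__": "up", "job": "api"},
			Samples: []Sample{{1000, 1}}},
	}
	q := Query{StartMs: 1500, EndMs: 3000, Matchers: []Matcher{{"job", "node"}}}
	for _, s := range Read(store, q) {
		fmt.Println(s.Labels["job"], len(s.Samples)) // node 2
	}
}
```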

The Protocol Buffers definitions for remote storage can be found in the Prometheus source code. Developers integrate a storage service by implementing handlers for these messages:

syntax = "proto3";
package prometheus;

option go_package = "prompb";

import "types.proto";

message WriteRequest {
  repeated prometheus.TimeSeries timeseries = 1;
}

message ReadRequest {
  repeated Query queries = 1;
}

message ReadResponse {
  // In same order as the request's queries.
  repeated QueryResult results = 1;
}

message Query {
  int64 start_timestamp_ms = 1;
  int64 end_timestamp_ms = 2;
  repeated prometheus.LabelMatcher matchers = 3;
  prometheus.ReadHints hints = 4;
}

message QueryResult {
  // Samples within a time series must be ordered by time.
  repeated prometheus.TimeSeries timeseries = 1;
}

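Because a remote storage endpoint is free to implement its own data-handling logic, an HTTP service that receives remote_write requests can decode the request body into a WriteRequest and process it however the developer sees fit. For example: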

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/common/model"

	"github.com/prometheus/prometheus/prompb"
)

func main() {
	http.HandleFunc("/receive", func(w http.ResponseWriter, r *http.Request) {
		// The body is a snappy-compressed, protobuf-encoded WriteRequest.
		compressed, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Undo the snappy compression.
		reqBuf, err := snappy.Decode(nil, compressed)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		// Decode the protobuf payload.
		var req prompb.WriteRequest
		if err := proto.Unmarshal(reqBuf, &req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		// Print each series' label set followed by its samples.
		for _, ts := range req.Timeseries {
			m := make(model.Metric, len(ts.Labels))
			for _, l := range ts.Labels {
				m[model.LabelName(l.Name)] = model.LabelValue(l.Value)
			}
			fmt.Println(m)

			for _, s := range ts.Samples {
				fmt.Printf("  %f %d\n", s.Value, s.Timestamp)
			}
		}
	})

	// Exit with an error if the server cannot start or stops unexpectedly.
	log.Fatal(http.ListenAndServe(":1234", nil))
}

The following remote storage settings can be added to the prometheus.yml configuration file:

remote_write:
  - url: <string>
    [ remote_timeout: <duration> | default = 30s ]
    write_relabel_configs:
      [ - <relabel_config> ... ]
    basic_auth:
      [ username: <string> ]
      [ password: <string> ]
    [ bearer_token: <string> ]
    [ bearer_token_file: /path/to/bearer/token/file ]
    tls_config:
      [ <tls_config> ]
    [ proxy_url: <string> ]

remote_read:
  - url: <string>
    required_matchers:
      [ <labelname>: <labelvalue> ... ]
    [ remote_timeout: <duration> | default = 30s ]
    [ read_recent: <boolean> | default = false ]
    basic_auth:
      [ username: <string> ]
      [ password: <string> ]
    [ bearer_token: <string> ]
    [ bearer_token_file: /path/to/bearer/token/file ]
    tls_config:
      [ <tls_config> ]
    [ proxy_url: <string> ]
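As a concrete instance of the template, the fragment below points remote write at the example receiver shown earlier in this post. It assumes that receiver is running on the same host as Prometheus, listening on port 1234 with the /receive path; the timeout simply restates the default.

```yaml
remote_write:
  - url: "http://localhost:1234/receive"
    remote_timeout: 30s
```

With this in place, each batch of scraped samples is also posted to the receiver, which prints the label set and samples of every series it gets.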

The Prometheus project also lists the third-party storage services that are already integrated; see Remote Endpoints and Storage.

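Note that storage services differ in which of write and read they implement. Some support only writes and others only reads, so check that the one you choose meets your needs.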

Prometheus Storage Flags

Prometheus supports the following storage-related flags:

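Flag	Default	Description
--storage.tsdb.path	data/	Path where metrics are stored
--storage.tsdb.retention	15d	How long stored samples are retained
--storage.tsdb.min-block-duration	2h	Minimum duration of a data block
--storage.tsdb.max-block-duration	36h	Maximum duration of a compacted block (defaults to 10% of the retention period)
--storage.tsdb.no-lockfile	false	When set, no lockfile is created in the data directory
--storage.remote.flush-deadline	1m	How long to wait for samples to be flushed on shutdown or configuration reload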

Author: KaiRen Bai
Post link: An Analysis of the Prometheus Storage System
Published: 2018-6-27 12:06
Copyright notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.