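Making InfluxDB Highly Available with influxdb-relay

Some legacy applications on the platform still store their monitoring data in InfluxDB. InfluxDB runs as a single instance, so a single-machine failure could happen at any time. Since InfluxDB will stay in service for a long while, it needs to be extended with an HA mechanism; the influxdb-relay approach is chosen here.

Note that relay does not synchronize data between the two InfluxDB instances. It only provides dual writes: when one instance fails, data can still be written to the other, healthy instance, so no data is lost. Getting the missed data onto the failed instance afterwards requires manual intervention.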
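
The dual-write behaviour described above can be illustrated with a small in-memory model. This is only a sketch, not the relay's actual code: `Backend` and `relay_write` are hypothetical names, and it assumes a relay that neither retries nor back-fills failed writes.

```python
class Backend:
    """In-memory stand-in for one InfluxDB instance (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.points = []

    def write(self, point):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.points.append(point)


def relay_write(backends, point):
    """Send the point to every backend; succeed if at least one accepted it."""
    accepted = []
    for backend in backends:
        try:
            backend.write(point)
            accepted.append(backend.name)
        except ConnectionError:
            # Assumption of this sketch: no retry and no later back-fill.
            pass
    if not accepted:
        raise ConnectionError("no backend accepted the write")
    return accepted


db1, db2 = Backend("db1"), Backend("db2")
relay_write([db1, db2], "cpu value=1")
db2.up = False                      # db2 fails
relay_write([db1, db2], "cpu value=2")
db2.up = True                       # db2 comes back
relay_write([db1, db2], "cpu value=3")

# Points written while db2 was down never reach it on their own.
missing_on_db2 = [p for p in db1.points if p not in db2.points]
print(missing_on_db2)
```

The write issued during the outage is kept by `db1` only, which is exactly the gap that has to be repaired by hand.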

Architecture

        ┌─────────────────┐
        │writes & queries │
        └─────────────────┘
                 │
                 ▼
         ┌───────────────┐
         │               │
┌────────│ Load Balancer │─────────┐
│        │               │         │
│        └──────┬─┬──────┘         │
│               │ │                │
│               │ │                │
│        ┌──────┘ └────────┐       │
│        │ ┌─────────────┐ │       │┌──────┐
│        │ │/write or UDP│ │       ││/query│
│        ▼ └─────────────┘ ▼       │└──────┘
│  ┌──────────┐      ┌──────────┐  │
│  │ InfluxDB │      │ InfluxDB │  │
│  │ Relay    │      │ Relay    │  │
│  └──┬────┬──┘      └────┬──┬──┘  │
│     │    |              |  │     │
│     |  ┌─┼──────────────┘  |     │
│     │  │ └──────────────┐  │     │
│     ▼  ▼                ▼  ▼     │
│  ┌──────────┐      ┌──────────┐  │
│  │          │      │          │  │
└─▶│ InfluxDB │      │ InfluxDB │◀─┘
   │          │      │          │
   └──────────┘      └──────────┘

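Three applications are involved:

  1. InfluxDB: the InfluxDB instances.
  2. influxdb-relay: proxies InfluxDB write traffic and uses dual writes to make sure data lands in both InfluxDB databases. Read traffic still goes from the load balancer straight to the InfluxDB instances and does not pass through relay.
  3. Load balancer: all read and write traffic for InfluxDB goes through this service, which forwards writes to influxdb-relay and queries to the InfluxDB instances. nginx is sufficient. Every other service that needs to access InfluxDB must be configured with this application's address.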

Adding a new instance

The InfluxDB version running in production is:

InfluxDB v1.7.4 (git: 1.7 ef77e72f435b71b1ad6da7d6a6a4c4a262439379)

It is deployed on: 192.168.1.5

A new InfluxDB instance needs to be deployed on another machine: 192.168.1.6

wget https://dl.influxdata.com/influxdb/releases/influxdb_1.7.4_amd64.deb
dpkg -i influxdb_1.7.4_amd64.deb

Both instances use the configuration file below; adjust individual parameters as needed:

cat /etc/influxdb/influxdb.conf

reporting-disabled = false
bind-address = ":8088"
[meta]
dir = "/data/influxdb/meta"
retention-autocreate = true
[data]
dir = "/data/influxdb/data"
wal-dir = "/data/influxdb/wal"
wal-fsync-delay = "50ms"
index-version = "inmem"
trace-logging-enabled = false
query-log-enabled = false
validate-keys = true
cache-max-memory-size = "4g"
compact-full-write-cold-duration = "6h"
max-concurrent-compactions = 0
compact-throughput = "64m"
compact-throughput-burst = "128m"
tsm-use-madv-willneed = false
max-series-per-database = 0
max-values-per-tag = 0
[coordinator]
write-timeout = "600s"
max-concurrent-queries = 0
query-timeout = "0s"
log-queries-after = "60s"
max-select-point = 0
max-select-series = 0
max-select-buckets = 0
[retention]
enabled = true
check-interval = "1h"
[shard-precreation]
enabled = true
check-interval = "30m"
advance-period = "30m"
[monitor]
store-enabled = true
store-database = "_internal"
store-interval = "60s"
[http]
enabled = true
flux-enabled = true
bind-address = ":8086"
log-enabled = false
write-tracing = false
pprof-enabled = false
debug-pprof-enabled = false
https-enabled = false
max-row-limit = 0
max-connection-limit = 0
unix-socket-enabled = false
max-body-size = 0
max-concurrent-write-limit = 0
max-enqueued-write-limit = 0
enqueued-write-timeout = 0
[logging]
format = "json"
level = "info"
suppress-logo = true
[subscriber]
enabled = false
[[graphite]]
enabled = true
database = "graphite"
retention-policy = "day_hour"
bind-address = ":2003"
protocol = "tcp"
consistency-level = "one"
batch-size = 1000
batch-pending = 50
batch-timeout = "1s"
[[udp]]
enabled = false
[continuous_queries]
enabled = true
log-enabled = true
query-stats-enabled = false
run-interval = "10s"
[tls]

Starting and stopping the instance:

systemctl start influxd.service
systemctl stop influxd.service

After the new InfluxDB instance is started, the production data has to be imported into it.

relay

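The relay service forwards the write traffic that arrives through the unified entry point. It can be deployed directly with Docker. There is currently one instance; it can be scaled out to two with identical configuration.

The configuration file is mounted into the container as a ConfigMap, with the following content: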

[[http]]
name = "relay-http"
bind-addr = ":9096"
output = [
{ name="db1", location = "http://192.168.1.5:8086/write" },
{ name="db2", location = "http://192.168.1.6:8086/write" },
]

Also create a Service (svc) for relay named influxdb-relay-headless.sensego, on port 9096.
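
As an illustration of what a write through the relay looks like from a client, the sketch below builds (without sending) a POST to the InfluxDB 1.x `/write` endpoint. The function name `build_write_request`, the database name `mytsd` and the sample point are assumptions made for the example; in this setup clients would normally point at the nginx address rather than at the relay Service directly.

```python
import urllib.parse
import urllib.request


def build_write_request(base_url, database, line):
    """Build (but do not send) a POST to the InfluxDB 1.x /write endpoint."""
    query = urllib.parse.urlencode({"db": database})
    return urllib.request.Request(
        url=f"{base_url}/write?{query}",
        data=line.encode("utf-8"),  # body is InfluxDB line protocol
        method="POST",
    )


# Hypothetical values: relay Service from this post, database "mytsd".
req = build_write_request(
    "http://influxdb-relay-headless.sensego:9096",
    "mytsd",
    "cpu,host=web01 value=0.64",
)
print(req.get_method(), req.full_url)

# To actually send it:
#   urllib.request.urlopen(req, timeout=5)
# A 204 status means the write was accepted.
```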

nginx

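As the architecture diagram shows, a proxy layer is needed in front of InfluxDB to handle its read and write traffic; nginx is used here.

nginx is deployed as a container, and one instance is enough.

The configuration file is as follows: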

worker_processes 8;

events {
    worker_connections 10240;
}

http {
    client_max_body_size 0;

    upstream relay {
        # Service (svc) of the relay instances
        server influxdb-relay-headless.sensego:9096;
    }

    upstream db {
        # Backend InfluxDB instance addresses. Health checks are recommended here,
        # so that nginx removes an InfluxDB node when it goes down
        ip_hash;
        server 192.168.1.5:8086 max_fails=1 fail_timeout=10s;
        server 192.168.1.6:8086 max_fails=1 fail_timeout=10s;
    }

    server {
        listen 9096;
        location /ping {
            proxy_pass http://db;
        }
        location /write {
            limit_except POST {
                deny all;
            }
            proxy_pass http://relay;
        }

        location /query {
            proxy_pass http://db;
        }
    }
}
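
The routing rules in this nginx configuration can be restated as a small lookup, which makes it easy to check where a given request ends up. This is only a sketch of the intent of the config, not of nginx's matching algorithm; `ROUTES` and `pick_upstream` are hypothetical names.

```python
# Location prefix -> upstream name, mirroring the three location blocks.
ROUTES = {
    "/ping": "db",
    "/write": "relay",
    "/query": "db",
}


def pick_upstream(method, path):
    """Return the upstream a request is proxied to, or None if it is not proxied."""
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix):
            if prefix == "/write" and method != "POST":
                return None  # limit_except POST { deny all; }
            return upstream
    return None  # no matching location block


print(pick_upstream("POST", "/write?db=mytsd"))
print(pick_upstream("GET", "/query?db=mytsd&q=SHOW+MEASUREMENTS"))
print(pick_upstream("GET", "/write"))
```

Writes reach both databases through relay, while `/query` and `/ping` are answered by a single InfluxDB instance chosen by `ip_hash`.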

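Data migration

InfluxDB only stores metrics data here, about 300 GB in total, and the data is not particularly sensitive, so the backup can be taken without downtime.

The approach is to take a full backup at a given point in time, and then use an incremental backup to bring over the data written while the operation was in progress.

Although InfluxDB supports remote backup, it is recommended to run the backup locally on 192.168.1.5 and then copy it to the new node.

Backup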

# Full backup first
influxd backup -host 192.168.1.5:8088 -portable /tmp/backup-all/influxdb
# A full backup must be restored into a brand-new InfluxDB instance
# A full backup includes retention policies.

# Incremental backup
influxd backup -portable -database mytsd -start 2017-04-28T06:49:00Z -end 2017-04-28T06:50:00Z /tmp/backup-ins/influxdb
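
The incremental window has to start when the full backup started and end at the moment of the switch-over, both written as RFC3339 UTC timestamps. A minimal sketch for generating that command is shown below; the helper names are hypothetical and the timestamps simply reuse the example values above.

```python
import shlex
from datetime import datetime, timezone


def rfc3339(ts):
    """Format an aware datetime as an RFC3339 UTC timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def incremental_backup_cmd(database, start, end, target_dir):
    """Build the influxd command line for a time-bounded portable backup."""
    if end <= start:
        raise ValueError("end must be later than start")
    args = [
        "influxd", "backup", "-portable",
        "-database", database,
        "-start", rfc3339(start),
        "-end", rfc3339(end),
        target_dir,
    ]
    return " ".join(shlex.quote(a) for a in args)


# Example values taken from the command above.
full_backup_started = datetime(2017, 4, 28, 6, 49, tzinfo=timezone.utc)
cutover = datetime(2017, 4, 28, 6, 50, tzinfo=timezone.utc)
cmd = incremental_backup_cmd(
    "mytsd", full_backup_started, cutover, "/tmp/backup-ins/influxdb"
)
print(cmd)
```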

Restore

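# Restore locally on the new node (192.168.1.6)
influxd restore -portable /tmp/backup-all/influxdb
# Then import the incremental backup. A portable restore fails if the target
# database already exists, so restore into a temporary database first
influxd restore -portable -db mytsd -newdb mytsd_inc /tmp/backup-ins/influxdb
# Copy the data into the live database, then drop the temporary one
influx -database mytsd_inc -execute 'SELECT * INTO "mytsd"..:MEASUREMENT FROM /.*/ GROUP BY *'
influx -execute 'DROP DATABASE "mytsd_inc"'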
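
After the restore it is worth checking that both instances hold the same data, for example by running the same count query against each one through `/query` and comparing the results. The sketch below only covers the comparison step: it parses two InfluxDB 1.x `/query` JSON responses and reports measurements whose counts differ. The function names and the sample responses are made up for illustration and are not real data.

```python
import json


def counts_by_measurement(query_response_text):
    """Map measurement name -> count from an InfluxDB 1.x /query JSON response."""
    body = json.loads(query_response_text)
    counts = {}
    for result in body.get("results", []):
        for series in result.get("series", []):
            # Each row looks like [time, count]; take the count column.
            counts[series["name"]] = series["values"][0][1]
    return counts


def diff_counts(old, new):
    """Return {measurement: (old_count, new_count)} where the two differ."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


# Hand-written sample responses for a query such as: SELECT count(value) FROM /.*/
old_resp = json.dumps({"results": [{"statement_id": 0, "series": [
    {"name": "cpu", "columns": ["time", "count"], "values": [["1970-01-01T00:00:00Z", 120]]},
    {"name": "mem", "columns": ["time", "count"], "values": [["1970-01-01T00:00:00Z", 80]]},
]}]})
new_resp = json.dumps({"results": [{"statement_id": 0, "series": [
    {"name": "cpu", "columns": ["time", "count"], "values": [["1970-01-01T00:00:00Z", 120]]},
    {"name": "mem", "columns": ["time", "count"], "values": [["1970-01-01T00:00:00Z", 75]]},
]}]})

mismatches = diff_counts(counts_by_measurement(old_resp), counts_by_measurement(new_resp))
print(mismatches)
```

An empty result means the two instances agree for the queried range; any entry points at a measurement that still needs to be back-filled.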

At this point InfluxDB has been expanded from a single node to two nodes, which removes the single point of failure.
