S3互換など – OSAKANA TAROのメモ帳

2026年1月21日2026年2月20日

2サーバでcephを組んで見た(未解決)

Proxmox VEの2サーバ+ corosync qdeviceサーバの3サーバ構成でProxmox VEクラスタを作った際に、cephを組めるのかな？と実験してみた

Proxmox VEサーバは CPU6コア、メモリ16GB、システムディスク120GBで作成し、ceph用ストレージとして16GBディスクを3個で稼働させた。

とりあえず動いてる

モニタとマネージャは各サーバに1個ずつ設定した。

ただ、１台止めた場合に、cephは使えなくなる状態である。

後述の２ノードに均等に同じデータを持たす設定としても、ceph環境での多数決で過半数を問えるようにするには、仮想でもいいのでもう１ノード立てないと実現できないので、どうしようかなぁ・・・という状態となっている。

とりあえずcephストレージに仮想マシンを配置した場合の動作確認には使えるので、とりあえずはこれでいいか、としているが、

実は２ノードだとデータがミラー構成になるので、確保したディスク容量の1/2以下しか使えないのに対して、３ノードであれば、1/2～2/3の間程度が使える計算となるのでそっちの構成の方が良かったかなぁ・・と思わなくもない

詳細確認

まず「ceph health」と「ceph health detail」を実行して確認

root@proxmoxa:~# ceph health
HEALTH_WARN clock skew detected on mon.proxmoxb; Degraded data redundancy: 28/90 objects degraded (31.111%), 25 pgs degraded, 128 pgs undersized
root@proxmoxa:~# ceph health detail
HEALTH_WARN clock skew detected on mon.proxmoxb; Degraded data redundancy: 39/123 objects degraded (31.707%), 35 pgs degraded, 128 pgs undersized
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.proxmoxb
    mon.proxmoxb clock skew 0.305298s > max 0.05s (latency 0.00675958s)
[WRN] PG_DEGRADED: Degraded data redundancy: 39/123 objects degraded (31.707%), 35 pgs degraded, 128 pgs undersized
    pg 2.0 is stuck undersized for 26m, current state active+undersized, last acting [3,1]
    pg 2.1 is stuck undersized for 26m, current state active+undersized, last acting [2,5]
    pg 2.2 is stuck undersized for 26m, current state active+undersized, last acting [5,1]
    pg 2.3 is stuck undersized for 26m, current state active+undersized, last acting [5,2]
    pg 2.4 is stuck undersized for 26m, current state active+undersized+degraded, last acting [1,4]
    pg 2.5 is stuck undersized for 26m, current state active+undersized, last acting [3,0]
    pg 2.6 is stuck undersized for 26m, current state active+undersized, last acting [1,3]
    pg 2.7 is stuck undersized for 26m, current state active+undersized+degraded, last acting [3,2]
    pg 2.8 is stuck undersized for 26m, current state active+undersized, last acting [3,0]
    pg 2.9 is stuck undersized for 26m, current state active+undersized, last acting [1,4]
    pg 2.a is stuck undersized for 26m, current state active+undersized+degraded, last acting [1,4]
    pg 2.b is stuck undersized for 26m, current state active+undersized, last acting [3,0]
    pg 2.c is stuck undersized for 26m, current state active+undersized, last acting [2,3]
    pg 2.d is stuck undersized for 26m, current state active+undersized, last acting [1,3]
    pg 2.e is stuck undersized for 26m, current state active+undersized+degraded, last acting [2,3]
    pg 2.f is stuck undersized for 26m, current state active+undersized, last acting [4,0]
    pg 2.10 is stuck undersized for 26m, current state active+undersized, last acting [2,4]
    pg 2.11 is stuck undersized for 26m, current state active+undersized, last acting [4,1]
    pg 2.1c is stuck undersized for 26m, current state active+undersized+degraded, last acting [4,2]
    pg 2.1d is stuck undersized for 26m, current state active+undersized, last acting [3,0]
    pg 2.1e is stuck undersized for 26m, current state active+undersized+degraded, last acting [2,5]
    pg 2.1f is stuck undersized for 26m, current state active+undersized+degraded, last acting [0,3]
    pg 2.20 is stuck undersized for 26m, current state active+undersized+degraded, last acting [5,1]
    pg 2.21 is stuck undersized for 26m, current state active+undersized, last acting [2,4]
    pg 2.22 is stuck undersized for 26m, current state active+undersized, last acting [3,2]
    pg 2.23 is stuck undersized for 26m, current state active+undersized, last acting [0,3]
    pg 2.24 is stuck undersized for 26m, current state active+undersized, last acting [5,1]
    pg 2.25 is stuck undersized for 26m, current state active+undersized, last acting [4,1]
    pg 2.26 is stuck undersized for 26m, current state active+undersized, last acting [5,2]
    pg 2.27 is stuck undersized for 26m, current state active+undersized, last acting [3,0]
    pg 2.28 is stuck undersized for 26m, current state active+undersized, last acting [2,3]
    pg 2.29 is stuck undersized for 26m, current state active+undersized+degraded, last acting [3,1]
    pg 2.2a is stuck undersized for 26m, current state active+undersized, last acting [5,0]
    pg 2.2b is stuck undersized for 26m, current state active+undersized, last acting [2,4]
    pg 2.2c is stuck undersized for 26m, current state active+undersized, last acting [2,5]
    pg 2.2d is stuck undersized for 26m, current state active+undersized, last acting [5,2]
    pg 2.2e is stuck undersized for 26m, current state active+undersized+degraded, last acting [5,0]
    pg 2.2f is stuck undersized for 26m, current state active+undersized+degraded, last acting [5,0]
    pg 2.30 is stuck undersized for 26m, current state active+undersized+degraded, last acting [4,0]
    pg 2.31 is stuck undersized for 26m, current state active+undersized, last acting [0,5]
    pg 2.32 is stuck undersized for 26m, current state active+undersized, last acting [5,1]
    pg 2.33 is stuck undersized for 26m, current state active+undersized, last acting [3,1]
    pg 2.34 is stuck undersized for 26m, current state active+undersized+degraded, last acting [5,0]
    pg 2.35 is stuck undersized for 26m, current state active+undersized, last acting [1,3]
    pg 2.36 is stuck undersized for 26m, current state active+undersized, last acting [1,4]
    pg 2.37 is stuck undersized for 26m, current state active+undersized, last acting [3,1]
    pg 2.38 is stuck undersized for 26m, current state active+undersized+degraded, last acting [0,5]
    pg 2.39 is stuck undersized for 26m, current state active+undersized, last acting [1,5]
    pg 2.7d is stuck undersized for 26m, current state active+undersized, last acting [0,4]
    pg 2.7e is stuck undersized for 26m, current state active+undersized+degraded, last acting [0,4]
    pg 2.7f is stuck undersized for 26m, current state active+undersized+degraded, last acting [4,1]
root@proxmoxa:~#

続いて「ceph -s」

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            clock skew detected on mon.proxmoxb
            Degraded data redundancy: 120/366 objects degraded (32.787%), 79 pgs degraded, 128 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 27m)
    mgr: proxmoxa(active, since 34m), standbys: proxmoxb
    osd: 6 osds: 6 up (since 28m), 6 in (since 29m); 1 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 122 objects, 436 MiB
    usage:   1.0 GiB used, 95 GiB / 96 GiB avail
    pgs:     120/366 objects degraded (32.787%)
             2/366 objects misplaced (0.546%)
             79 active+undersized+degraded
             49 active+undersized
             1  active+clean+remapped

  io:
    client:   15 KiB/s rd, 8.4 MiB/s wr, 17 op/s rd, 13 op/s wr

root@proxmoxa:~#

clock skew detected

まずは「clock skew detected」について確認

[WRN] MON_CLOCK_SKEW: clock skew detected on mon.proxmoxb
    mon.proxmoxb clock skew 0.305298s > max 0.05s (latency 0.00675958s)

「MON_CLOCK_SKEW」にある通りサーバ間の時刻に差がある、というもの

mon_clock_drift_allowed が標準では 0.05秒で設定されているものに対して「mon.proxmoxb clock skew 0.305298s」となっているため警告となっている。

今回の検証環境はESXi 8.0 Free版の上に立てているので、全体的な処理パワーが足りずに遅延になっているのではないかと思われるため無視する

設定として無視する場合はproxmox wikiの Ceph Configuration にあるように「ceph config コマンド」で行う

現在の値を確認

root@proxmoxa:~# ceph config get mon mon_clock_drift_allowed
0.050000
root@proxmoxa:~#

設定を変更、今回は0.5ぐらいにしておくか

root@proxmoxa:~# ceph config set mon mon_clock_drift_allowed 0.5
root@proxmoxa:~# ceph config get mon mon_clock_drift_allowed
0.500000
root@proxmoxa:~#

メッセージが消えたことを確認

コマンドを実行しても消えていることを確認

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 427/1287 objects degraded (33.178%), 124 pgs degraded, 128 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 40m)
    mgr: proxmoxa(active, since 47m), standbys: proxmoxb
    osd: 6 osds: 6 up (since 41m), 6 in (since 41m); 1 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 429 objects, 1.6 GiB
    usage:   3.4 GiB used, 93 GiB / 96 GiB avail
    pgs:     427/1287 objects degraded (33.178%)
             2/1287 objects misplaced (0.155%)
             124 active+undersized+degraded
             4   active+undersized
             1   active+clean+remapped

  io:
    client:   685 KiB/s rd, 39 KiB/s wr, 7 op/s rd, 2 op/s wr

root@proxmoxa:~#

ceph helth detailからも消えたことを確認

root@proxmoxa:~# ceph health
HEALTH_WARN Degraded data redundancy: 426/1284 objects degraded (33.178%), 124 pgs degraded, 128 pgs undersized
root@proxmoxa:~# ceph health detail
HEALTH_WARN Degraded data redundancy: 422/1272 objects degraded (33.176%), 124 pgs degraded, 128 pgs undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 422/1272 objects degraded (33.176%), 124 pgs degraded, 128 pgs undersized
    pg 2.0 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,1]
    pg 2.1 is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,5]
    pg 2.2 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,1]
    pg 2.3 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,2]
    pg 2.4 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,4]
    pg 2.5 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,0]
    pg 2.6 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,3]
    pg 2.7 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,2]
    pg 2.8 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,0]
    pg 2.9 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,4]
    pg 2.a is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,4]
    pg 2.b is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,0]
    pg 2.c is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,3]
    pg 2.d is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,3]
    pg 2.e is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,3]
    pg 2.f is stuck undersized for 39m, current state active+undersized+degraded, last acting [4,0]
    pg 2.10 is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,4]
    pg 2.11 is active+undersized+degraded, acting [4,1]
    pg 2.1c is stuck undersized for 39m, current state active+undersized+degraded, last acting [4,2]
    pg 2.1d is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,0]
    pg 2.1e is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,5]
    pg 2.1f is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,3]
    pg 2.20 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,1]
    pg 2.21 is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,4]
    pg 2.22 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,2]
    pg 2.23 is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,3]
    pg 2.24 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,1]
    pg 2.25 is stuck undersized for 39m, current state active+undersized+degraded, last acting [4,1]
    pg 2.26 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,2]
    pg 2.27 is stuck undersized for 39m, current state active+undersized, last acting [3,0]
    pg 2.28 is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,3]
    pg 2.29 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,1]
    pg 2.2a is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,0]
    pg 2.2b is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,4]
    pg 2.2c is stuck undersized for 39m, current state active+undersized+degraded, last acting [2,5]
    pg 2.2d is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,2]
    pg 2.2e is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,0]
    pg 2.2f is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,0]
    pg 2.30 is stuck undersized for 39m, current state active+undersized+degraded, last acting [4,0]
    pg 2.31 is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,5]
    pg 2.32 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,1]
    pg 2.33 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,1]
    pg 2.34 is stuck undersized for 39m, current state active+undersized+degraded, last acting [5,0]
    pg 2.35 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,3]
    pg 2.36 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,4]
    pg 2.37 is stuck undersized for 39m, current state active+undersized+degraded, last acting [3,1]
    pg 2.38 is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,5]
    pg 2.39 is stuck undersized for 39m, current state active+undersized+degraded, last acting [1,5]
    pg 2.7d is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,4]
    pg 2.7e is stuck undersized for 39m, current state active+undersized+degraded, last acting [0,4]
    pg 2.7f is stuck undersized for 39m, current state active+undersized+degraded, last acting [4,1]
root@proxmoxa:~#

PG_DEGRADED: Degraded data redundancy

たくさん出ているやつについて調査

まずは PG_DEGRADED を確認・・・

osdがdownしているわけではないので、参考にならなそう

とりあえず関連しそうな状態を確認

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 803/2415 objects degraded (33.251%), 128 pgs degraded, 128 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 89m)
    mgr: proxmoxa(active, since 96m), standbys: proxmoxb
    osd: 6 osds: 6 up (since 90m), 6 in (since 91m); 1 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 805 objects, 3.1 GiB
    usage:   6.4 GiB used, 90 GiB / 96 GiB avail
    pgs:     803/2415 objects degraded (33.251%)
             2/2415 objects misplaced (0.083%)
             128 active+undersized+degraded
             1   active+clean+remapped

  io:
    client:   20 KiB/s rd, 13 MiB/s wr, 15 op/s rd, 29 op/s wr

root@proxmoxa:~#

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  2/6 objects misplaced (33.333%)

pool cephpool id 2
  826/2478 objects degraded (33.333%)
  client io 14 KiB/s rd, 8.4 MiB/s wr, 14 op/s rd, 15 op/s wr

root@proxmoxa:~#

root@proxmoxa:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.06
pool 2 'cephpool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 39 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480 application rbd read_balance_score 1.41
        removed_snaps_queue [2~1]

root@proxmoxa:~#

現状のcephpoolは pg_num=128, pgp_num=128 で作成されている

autoscaleの設定を見てみる

root@proxmoxa:~# ceph osd pool autoscale-status
POOL        SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr      452.0k                3.0        98280M  0.0000                                  1.0       1              on         False
cephpool   2234M       20480M   3.0        98280M  0.6252                                  1.0     128              on         False
root@proxmoxa:~#

「How I Built a 2-Node HA Proxmox Cluster with Ceph, Podman, and a Raspberry Pi (Yes, It Works)」にやりたいことがあるっぽい

このページでは「ceph config set osd osd_default_size 2」と「ceph config set osd osd_default_min_size 1」を実行しているが、ceph config getで確認してみると、値はない模様

root@proxmoxa:~# ceph config get osd osd_default_size
Error ENOENT: unrecognized key 'osd_default_size'
root@proxmoxa:~# ceph config get osd osd_default_min_size
Error ENOENT: unrecognized key 'osd_default_min_size'
root@proxmoxa:~#

設定出来たりしないかを念のため確認してみたが、エラーとなった

root@proxmoxa:~# ceph config set osd osd_default_size 2
Error EINVAL: unrecognized config option 'osd_default_size'
root@proxmoxa:~# ceph config set osd osd_default_min_size 1
Error EINVAL: unrecognized config option 'osd_default_min_size'
root@proxmoxa:~#

osd_pool_default_sizeとosd_pool_default_min_sizeならばあるので、そちらを設定してみることにした

root@proxmoxa:~# ceph config get osd osd_pool_default_size
3
root@proxmoxa:~# ceph config get osd osd_pool_default_min_size
0
root@proxmoxa:~#

root@proxmoxa:~# ceph config set osd osd_pool_default_size 2
root@proxmoxa:~# ceph config set osd osd_pool_default_min_size 1
root@proxmoxa:~# ceph config get osd osd_pool_default_size
2
root@proxmoxa:~# ceph config get osd osd_pool_default_min_size
1
root@proxmoxa:~#

状態に変化はなさそう

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 885/2661 objects degraded (33.258%), 128 pgs degraded, 128 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 2h)
    mgr: proxmoxa(active, since 2h), standbys: proxmoxb
    osd: 6 osds: 6 up (since 2h), 6 in (since 2h); 1 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   7.1 GiB used, 89 GiB / 96 GiB avail
    pgs:     885/2661 objects degraded (33.258%)
             2/2661 objects misplaced (0.075%)
             128 active+undersized+degraded
             1   active+clean+remapped

root@proxmoxa:~# ceph osd pool autoscale-status
POOL        SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr      452.0k                3.0        98280M  0.0000                                  1.0       1              on         False
cephpool   2319M       20480M   3.0        98280M  0.6252                                  1.0     128              on         False
root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  2/6 objects misplaced (33.333%)

pool cephpool id 2
  885/2655 objects degraded (33.333%)
  client io 170 B/s wr, 0 op/s rd, 0 op/s wr

root@proxmoxa:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.06
pool 2 'cephpool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 39 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480 application rbd read_balance_score 1.41
        removed_snaps_queue [2~1]

root@proxmoxa:~#

ceph osd pool get コマンドで、各プールのsizeとmin_sizeを確認

root@proxmoxa:~# ceph osd pool get cephpool size
size: 3
root@proxmoxa:~# ceph osd pool get cephpool min_size
min_size: 2
root@proxmoxa:~#

設定を変更

root@proxmoxa:~# ceph osd pool set cephpool size 2
set pool 2 size to 2
root@proxmoxa:~# ceph osd pool set cephpool min_size 1
set pool 2 min_size to 1
root@proxmoxa:~# ceph osd pool get cephpool size
size: 2
root@proxmoxa:~# ceph osd pool get cephpool min_size
min_size: 1
root@proxmoxa:~#

状態確認すると、ceph health がHEALTH_OKになっている

root@proxmoxa:~# ceph health
HEALTH_OK
root@proxmoxa:~# ceph health detail
HEALTH_OK
root@proxmoxa:~#

他のステータスは？と確認してみると、問題無く見える

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 2h)
    mgr: proxmoxa(active, since 2h), standbys: proxmoxb
    osd: 6 osds: 6 up (since 2h), 6 in (since 2h); 1 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   7.1 GiB used, 89 GiB / 96 GiB avail
    pgs:     2/1776 objects misplaced (0.113%)
             128 active+clean
             1   active+clean+remapped

  io:
    recovery: 1.3 MiB/s, 0 objects/s

root@proxmoxa:~# ceph osd pool autoscale-status
POOL        SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.mgr      452.0k                3.0        98280M  0.0000                                  1.0       1              on         False
cephpool   3479M       20480M   2.0        98280M  0.4168                                  1.0     128              on         False
root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  2/6 objects misplaced (33.333%)

pool cephpool id 2
  nothing is going on

root@proxmoxa:~# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.06
pool 2 'cephpool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 42 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480 application rbd read_balance_score 1.17

root@proxmoxa:~#

GUIもHEALTH_OK

障害テスト

片側を停止してどうなるか？

PVEのクラスタ側は生きている

root@proxmoxa:~# ha-manager status
quorum OK
master proxmoxa (active, Wed Jan 21 17:43:39 2026)
lrm proxmoxa (active, Wed Jan 21 17:43:40 2026)
lrm proxmoxb (old timestamp - dead?, Wed Jan 21 17:43:08 2026)
service vm:100 (proxmoxa, started)
root@proxmoxa:~# pvecm status
Cluster information
-------------------
Name:             cephcluster
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Wed Jan 21 17:44:11 2026
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.3b
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.2.64 (local)
0x00000000          1            Qdevice
root@proxmoxa:~#

しかし、cephのステータスは死んでいる

「ceph helth」コマンドを実行してみると返事が返ってこない

root@proxmoxa:~# ceph health

ダメそうなので、停止したノードを復帰

ceph osd poolの.mgrについてもsizeとmin_sizeを変更

root@proxmoxa:~# ceph osd pool ls
.mgr
cephpool
root@proxmoxa:~# ceph osd pool get .mgr size
size: 3
root@proxmoxa:~# ceph osd pool get .mgr min_size
min_size: 2
root@proxmoxa:~# ceph osd pool set .mgr size 2
set pool 1 size to 2
root@proxmoxa:~# ceph osd pool set .mgr min_size 1
set pool 1 min_size to 1
root@proxmoxa:~# ceph osd pool get .mgr size
size: 2
root@proxmoxa:~# ceph osd pool get .mgr min_size
min_size: 1
root@proxmoxa:~#

で、先のブログにあるようにmonパラメータも変更するため、現在値を確認

root@proxmoxa:~# ceph config get mon mon_osd_min_down_reporters
2
root@proxmoxa:~# ceph config get mon mon_osd_down_out_interval
600
root@proxmoxa:~# ceph config get mon mon_osd_report_timeout
900
root@proxmoxa:~#

これをそれぞれ変更

root@proxmoxa:~# ceph config set mon mon_osd_min_down_reporters 1
root@proxmoxa:~# ceph config set mon mon_osd_down_out_interval 120
root@proxmoxa:~# ceph config set mon mon_osd_report_timeout 90
root@proxmoxa:~# ceph config get mon mon_osd_min_down_reporters
1
root@proxmoxa:~# ceph config get mon mon_osd_down_out_interval
120
root@proxmoxa:~# ceph config get mon mon_osd_report_timeout
90
root@proxmoxa:~#

・・・相変わらずceph -sで応答がなくなる

cephを維持するための3番目のノードをどう作成する？

先ほどの記事の「Faking a Third Node with a Containerized MON」にコンテナとして3つめのceph monを起動させる話が書いてあった

Proxmox VEフォーラムの「3rd Ceph MON on external QDevice (Podman) – 4-node / 2-site cluster」からProxmox VE wikiの「Stretch Cluster」ではceph monではなく「tie-breaker node」を立てるとある

またStretch Clusterでは、先ほど変更したOSDのsize=4, min_size=2 として、2つのノードに2個のレプリカを保証する設定としていた。

とりあえず、OSDのsize/min_sizeを変更する

root@proxmoxa:~# ceph osd pool ls
.mgr
cephpool
root@proxmoxa:~# ceph osd pool get .mgr size
size: 2
root@proxmoxa:~# ceph osd pool get .mgr min_size
min_size: 1
root@proxmoxa:~# ceph osd pool set .mgr size 4
set pool 1 size to 4
root@proxmoxa:~# ceph osd pool set .mgr min_size 2
set pool 1 min_size to 2
root@proxmoxa:~# ceph osd pool get .mgr size
size: 4
root@proxmoxa:~# ceph osd pool get .mgr min_size
min_size: 2
root@proxmoxa:~# ceph osd pool get cephpool size
size: 2
root@proxmoxa:~# ceph osd pool get cephpool min_size
min_size: 1
root@proxmoxa:~# ceph osd pool set cephpool size 4
set pool 2 size to 4
root@proxmoxa:~# ceph osd pool set cephpool min_size 2
set pool 2 min_size to 2
root@proxmoxa:~# ceph osd pool get cephpool size
size: 4
root@proxmoxa:~# ceph osd pool get cephpool min_size
min_size: 2
root@proxmoxa:~#

この状態でceph osd pool statsを取ると先ほどまで33.333%だったものが50.0% なった

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  4/8 objects degraded (50.000%)

pool cephpool id 2
  1770/3540 objects degraded (50.000%)

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 1774/3548 objects degraded (50.000%), 129 pgs degraded, 129 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 6h)
    mgr: proxmoxb(active, since 6h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 6h), 6 in (since 24h)

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   7.0 GiB used, 89 GiB / 96 GiB avail
    pgs:     1774/3548 objects degraded (50.000%)
             129 active+undersized+degraded

root@proxmoxa:~#

root@proxmoxa:~# ceph health
HEALTH_WARN Degraded data redundancy: 1774/3548 objects degraded (50.000%), 129 pgs degraded, 129 pgs undersized
root@proxmoxa:~# ceph health detail
HEALTH_WARN Degraded data redundancy: 1774/3548 objects degraded (50.000%), 129 pgs degraded, 129 pgs undersized
[WRN] PG_DEGRADED: Degraded data redundancy: 1774/3548 objects degraded (50.000%), 129 pgs degraded, 129 pgs undersized
    pg 1.0 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,0]
    pg 2.0 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,1]
    pg 2.1 is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,5]
    pg 2.2 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,1]
    pg 2.3 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,2]
    pg 2.4 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,4]
    pg 2.5 is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,0]
    pg 2.6 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,3]
    pg 2.7 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,2]
    pg 2.8 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,0]
    pg 2.9 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,4]
    pg 2.a is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,4]
    pg 2.b is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,0]
    pg 2.c is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,3]
    pg 2.d is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,3]
    pg 2.e is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,3]
    pg 2.f is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,0]
    pg 2.10 is active+undersized+degraded, acting [2,4]
    pg 2.1c is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,2]
    pg 2.1d is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,0]
    pg 2.1e is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,5]
    pg 2.1f is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,3]
    pg 2.20 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,1]
    pg 2.21 is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,4]
    pg 2.22 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,2]
    pg 2.23 is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,3]
    pg 2.24 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,1]
    pg 2.25 is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,2]
    pg 2.26 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,2]
    pg 2.27 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,0]
    pg 2.28 is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,3]
    pg 2.29 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,1]
    pg 2.2a is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,0]
    pg 2.2b is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,4]
    pg 2.2c is stuck undersized for 5m, current state active+undersized+degraded, last acting [2,5]
    pg 2.2d is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,2]
    pg 2.2e is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,0]
    pg 2.2f is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,0]
    pg 2.30 is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,0]
    pg 2.31 is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,5]
    pg 2.32 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,1]
    pg 2.33 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,1]
    pg 2.34 is stuck undersized for 5m, current state active+undersized+degraded, last acting [5,0]
    pg 2.35 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,3]
    pg 2.36 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,4]
    pg 2.37 is stuck undersized for 5m, current state active+undersized+degraded, last acting [3,1]
    pg 2.38 is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,5]
    pg 2.39 is stuck undersized for 5m, current state active+undersized+degraded, last acting [1,5]
    pg 2.7d is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,4]
    pg 2.7e is stuck undersized for 5m, current state active+undersized+degraded, last acting [0,4]
    pg 2.7f is stuck undersized for 5m, current state active+undersized+degraded, last acting [4,1]
root@proxmoxa:~#

このときのceph osd treeは下記の状態

root@proxmoxa:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         0.09357  root default
-5         0.04678      host proxmoxa
 3    ssd  0.01559          osd.3          up   1.00000  1.00000
 4    ssd  0.01559          osd.4          up   1.00000  1.00000
 5    ssd  0.01559          osd.5          up   1.00000  1.00000
-3         0.04678      host proxmoxb
 0    ssd  0.01559          osd.0          up   1.00000  1.00000
 1    ssd  0.01559          osd.1          up   1.00000  1.00000
 2    ssd  0.01559          osd.2          up   1.00000  1.00000
root@proxmoxa:~#

次にCRUSH Structureを2個作る

root@proxmoxa:~# ceph osd crush add-bucket room1 room
added bucket room1 type room to crush map
root@proxmoxa:~# ceph osd crush add-bucket room2 room
added bucket room2 type room to crush map
root@proxmoxa:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-8               0  room room2
-7               0  room room1
-1         0.09357  root default
-5         0.04678      host proxmoxa
 3    ssd  0.01559          osd.3          up   1.00000  1.00000
 4    ssd  0.01559          osd.4          up   1.00000  1.00000
 5    ssd  0.01559          osd.5          up   1.00000  1.00000
-3         0.04678      host proxmoxb
 0    ssd  0.01559          osd.0          up   1.00000  1.00000
 1    ssd  0.01559          osd.1          up   1.00000  1.00000
 2    ssd  0.01559          osd.2          up   1.00000  1.00000
root@proxmoxa:~#

で、移動？

root@proxmoxa:~# ceph osd crush move room1 root=default
moved item id -7 name 'room1' to location {root=default} in crush map
root@proxmoxa:~# ceph osd crush move room2 root=default
moved item id -8 name 'room2' to location {root=default} in crush map
root@proxmoxa:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-1         0.09357  root default
-5         0.04678      host proxmoxa
 3    ssd  0.01559          osd.3          up   1.00000  1.00000
 4    ssd  0.01559          osd.4          up   1.00000  1.00000
 5    ssd  0.01559          osd.5          up   1.00000  1.00000
-3         0.04678      host proxmoxb
 0    ssd  0.01559          osd.0          up   1.00000  1.00000
 1    ssd  0.01559          osd.1          up   1.00000  1.00000
 2    ssd  0.01559          osd.2          up   1.00000  1.00000
-7               0      room room1
-8               0      room room2
root@proxmoxa:~#

次にノードをそれぞれ別のroomに移動

root@proxmoxa:~# ceph osd crush move proxmoxa room=room1
moved item id -5 name 'proxmoxa' to location {room=room1} in crush map
root@proxmoxa:~# ceph osd crush move proxmoxb room=room2
moved item id -3 name 'proxmoxb' to location {room=room2} in crush map
root@proxmoxa:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.09357  root default
-7         0.04678      room room1
-5         0.04678          host proxmoxa
 3    ssd  0.01559              osd.3          up   1.00000  1.00000
 4    ssd  0.01559              osd.4          up   1.00000  1.00000
 5    ssd  0.01559              osd.5          up   1.00000  1.00000
-8         0.04678      room room2
-3         0.04678          host proxmoxb
 0    ssd  0.01559              osd.0          up   1.00000  1.00000
 1    ssd  0.01559              osd.1          up   1.00000  1.00000
 2    ssd  0.01559              osd.2          up   1.00000  1.00000
root@proxmoxa:~#

CRUSH ruleを作成

root@proxmoxa:~# ceph osd getcrushmap > crush.map.bin
25
root@proxmoxa:~# ls -l crush.map.bin
-rw-r--r-- 1 root root 1104 Jan 22 16:31 crush.map.bin
root@proxmoxa:~# crushtool -d crush.map.bin -o crush.map.txt
root@proxmoxa:~# ls -l crush.map*
-rw-r--r-- 1 root root 1104 Jan 22 16:31 crush.map.bin
-rw-r--r-- 1 root root 1779 Jan 22 16:31 crush.map.txt
root@proxmoxa:~#

crush.map.bin はバイナリファイルなので、crushtoolでテキストにしたものを作成

現状の内容は下記だった

root@proxmoxa:~# cat crush.map.txt
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host proxmoxa {
        id -5           # do not change unnecessarily
        id -6 class ssd         # do not change unnecessarily
        # weight 0.04678
        alg straw2
        hash 0  # rjenkins1
        item osd.3 weight 0.01559
        item osd.4 weight 0.01559
        item osd.5 weight 0.01559
}
room room1 {
        id -7           # do not change unnecessarily
        id -10 class ssd                # do not change unnecessarily
        # weight 0.04678
        alg straw2
        hash 0  # rjenkins1
        item proxmoxa weight 0.04678
}
host proxmoxb {
        id -3           # do not change unnecessarily
        id -4 class ssd         # do not change unnecessarily
        # weight 0.04678
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.01559
        item osd.1 weight 0.01559
        item osd.2 weight 0.01559
}
room room2 {
        id -8           # do not change unnecessarily
        id -9 class ssd         # do not change unnecessarily
        # weight 0.04678
        alg straw2
        hash 0  # rjenkins1
        item proxmoxb weight 0.04678
}
root default {
        id -1           # do not change unnecessarily
        id -2 class ssd         # do not change unnecessarily
        # weight 0.09357
        alg straw2
        hash 0  # rjenkins1
        item room1 weight 0.04678
        item room2 weight 0.04678
}

# rules
rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
root@proxmoxa:~#

テキストファイルの最後に replicated_stretch_rule を追加。idは、テキストを見て他にあるruleのidの次の番号を設定する

root@proxmoxa:~# cp crush.map.txt crush-new.map.txt
root@proxmoxa:~# vi crush-new.map.txt
root@proxmoxa:~# diff -u crush.map.txt crush-new.map.txt
--- crush.map.txt       2026-01-22 16:31:44.837979276 +0900
+++ crush-new.map.txt   2026-01-22 16:35:12.146622553 +0900
@@ -87,3 +87,13 @@
 }

 # end crush map
+
+rule replicated_stretch_rule {
+        id 1
+        type replicated
+        step take default
+        step choose firstn 0 type room
+        step chooseleaf firstn 2 type host
+        step emit
+}
+
root@proxmoxa:~#

作成したファイルをcephに読み込ませる

root@proxmoxa:~# ls -l
total 12
-rw-r--r-- 1 root root 1104 Jan 22 16:31 crush.map.bin
-rw-r--r-- 1 root root 1779 Jan 22 16:31 crush.map.txt
-rw-r--r-- 1 root root 1977 Jan 22 16:35 crush-new.map.txt
root@proxmoxa:~# crushtool -c crush-new.map.txt -o crush-new.map.bin
root@proxmoxa:~# ls -l
total 16
-rw-r--r-- 1 root root 1104 Jan 22 16:31 crush.map.bin
-rw-r--r-- 1 root root 1779 Jan 22 16:31 crush.map.txt
-rw-r--r-- 1 root root 1195 Jan 22 16:36 crush-new.map.bin
-rw-r--r-- 1 root root 1977 Jan 22 16:35 crush-new.map.txt
root@proxmoxa:~# ceph osd setcrushmap -i crush-new.map.bin
26
root@proxmoxa:~#

そうするとcrush ruleが追加される

root@proxmoxa:~# ceph osd crush rule ls
replicated_rule
replicated_stretch_rule
root@proxmoxa:~#

別にosd treeは変わってない模様

root@proxmoxa:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME              STATUS  REWEIGHT  PRI-AFF
-1         0.09354  root default
-7         0.04677      room room1
-5         0.04677          host proxmoxa
 3    ssd  0.01558              osd.3          up   1.00000  1.00000
 4    ssd  0.01558              osd.4          up   1.00000  1.00000
 5    ssd  0.01558              osd.5          up   1.00000  1.00000
-8         0.04677      room room2
-3         0.04677          host proxmoxb
 0    ssd  0.01558              osd.0          up   1.00000  1.00000
 1    ssd  0.01558              osd.1          up   1.00000  1.00000
 2    ssd  0.01558              osd.2          up   1.00000  1.00000
root@proxmoxa:~#

root@proxmoxa:~# ceph health
HEALTH_WARN Degraded data redundancy: 448/3548 objects degraded (12.627%), 32 pgs degraded, 91 pgs undersized
root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 448/3548 objects degraded (12.627%), 32 pgs degraded, 91 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 7h)
    mgr: proxmoxb(active, since 7h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 7h), 6 in (since 25h); 97 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   10 GiB used, 86 GiB / 96 GiB avail
    pgs:     448/3548 objects degraded (12.627%)
             1276/3548 objects misplaced (35.964%)
             59 active+undersized+remapped
             34 active+clean+remapped
             32 active+undersized+degraded
             4  active+clean

root@proxmoxa:~#

移動していかない？

以前と同じようにpgp_numを128から32に変えてみる

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  4/8 objects misplaced (50.000%)

pool cephpool id 2
  448/3540 objects degraded (12.655%)
  1272/3540 objects misplaced (35.932%)

root@proxmoxa:~# ceph osd pool get cephpool pgp_num
pgp_num: 128
root@proxmoxa:~# ceph osd pool set cephpool pgp_num 32
set pool 2 pgp_num to 32
root@proxmoxa:~# ceph osd pool get cephpool pgp_num
pgp_num: 128
root@proxmoxa:~#

かわっていかない・・・

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  4/8 objects misplaced (50.000%)

pool cephpool id 2
  448/3540 objects degraded (12.655%)
  1272/3540 objects misplaced (35.932%)
  client io 170 B/s wr, 0 op/s rd, 0 op/s wr

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 448/3548 objects degraded (12.627%), 32 pgs degraded, 91 pgs undersized
            1 pools have pg_num > pgp_num

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 7h)
    mgr: proxmoxb(active, since 7h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 7h), 6 in (since 25h); 97 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   10 GiB used, 86 GiB / 96 GiB avail
    pgs:     448/3548 objects degraded (12.627%)
             1276/3548 objects misplaced (35.964%)
             59 active+undersized+remapped
             34 active+clean+remapped
             32 active+undersized+degraded
             4  active+clean

  io:
    client:   170 B/s wr, 0 op/s rd, 0 op/s wr

root@proxmoxa:~#

「1 pools have pg_num > pgp_num」とでているなら、pg_numもかえてみるか？

root@proxmoxa:~# ceph osd pool set cephpool pg_num 32
set pool 2 pg_num to 32
root@proxmoxa:~#

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 448/3548 objects degraded (12.627%), 32 pgs degraded, 91 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 7h)
    mgr: proxmoxb(active, since 7h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 7h), 6 in (since 25h); 97 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   10 GiB used, 86 GiB / 96 GiB avail
    pgs:     448/3548 objects degraded (12.627%)
             1276/3548 objects misplaced (35.964%)
             59 active+undersized+remapped
             34 active+clean+remapped
             32 active+undersized+degraded
             4  active+clean

root@proxmoxa:~#

しばらく待ったものの変化はない

crush ruleが適用されているのか？

「6.3. CRUSH ルールが作成され、プールが正しい CRUSH ルールに設定されていることの確認」

現状のルールのrule idを確認

root@proxmoxa:~# ceph osd crush rule dump | grep -E "rule_(id|name)"
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "rule_id": 1,
        "rule_name": "replicated_stretch_rule",
root@proxmoxa:~#

実際のpoolに設定されているルールのIDを確認

root@proxmoxa:~# ceph osd dump|grep cephpool
pool 2 'cephpool' replicated size 4 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 133 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480 application rbd read_balance_score 1.22
root@proxmoxa:~#

「crush_rule 0」とあるので、変更されてないっぽい

既存poolにcrush ruleを適用する方法をRedHatドキュメントから

root@proxmoxa:~# ceph osd pool get cephpool crush_rule
crush_rule: replicated_rule
root@proxmoxa:~# ceph osd pool set cephpool crush_rule replicated_stretch_rule
set pool 2 crush_rule to replicated_stretch_rule
root@proxmoxa:~# ceph osd pool get cephpool crush_rule
crush_rule: replicated_stretch_rule
root@proxmoxa:~# ceph osd dump|grep cephpool
pool 2 'cephpool' replicated size 4 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 134 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_bytes 21474836480 application rbd read_balance_score 1.22
root@proxmoxa:~#

変更できた

うーん・・・・？

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  4/8 objects misplaced (50.000%)

pool cephpool id 2
  194/3540 objects degraded (5.480%)
  1562/3540 objects misplaced (44.124%)

root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 194/3548 objects degraded (5.468%), 14 pgs degraded, 83 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 8h)
    mgr: proxmoxb(active, since 8h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 8h), 6 in (since 26h); 115 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   11 GiB used, 85 GiB / 96 GiB avail
    pgs:     194/3548 objects degraded (5.468%)
             1566/3548 objects misplaced (44.138%)
             69 active+undersized+remapped
             45 active+clean+remapped
             14 active+undersized+degraded
             1  active+clean

root@proxmoxa:~# ceph osd pool stats
pool .mgr id 1
  4/8 objects misplaced (50.000%)

pool cephpool id 2
  194/3540 objects degraded (5.480%)
  1562/3540 objects misplaced (44.124%)
  client io 1.4 KiB/s wr, 0 op/s rd, 0 op/s wr

root@proxmoxa:~#

「手順: PG カウントの増加」にpg_numとpgp_numをかえる、という話があって、pgp_numを4にしてたので実行してみた

root@proxmoxa:~# ceph osd pool get cephpool pg_num
pg_num: 128
root@proxmoxa:~# ceph osd pool get cephpool pgp_num
pgp_num: 128
root@proxmoxa:~# ceph osd pool set cephpool pgp_num 4
set pool 2 pgp_num to 4
root@proxmoxa:~# ceph osd pool get cephpool pgp_num
pgp_num: 128
root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 194/3548 objects degraded (5.468%), 14 pgs degraded, 83 pgs undersized
            1 pools have pg_num > pgp_num

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 9h)
    mgr: proxmoxb(active, since 9h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 9h), 6 in (since 27h); 115 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   11 GiB used, 85 GiB / 96 GiB avail
    pgs:     194/3548 objects degraded (5.468%)
             1566/3548 objects misplaced (44.138%)
             69 active+undersized+remapped
             45 active+clean+remapped
             14 active+undersized+degraded
             1  active+clean

root@proxmoxa:~#

「1 pools have pg_num > pgp_num」という出力がでるようになってしまった

じゃあ、pg_num も4にしてみる

root@proxmoxa:~# ceph osd pool set cephpool pg_num 4
set pool 2 pg_num to 4
root@proxmoxa:~# ceph osd pool get cephpool pg_num
pg_num: 128
root@proxmoxa:~# ceph -s
  cluster:
    id:     26b59237-5bed-45fe-906e-aa3b13033b86
    health: HEALTH_WARN
            Degraded data redundancy: 194/3548 objects degraded (5.468%), 14 pgs degraded, 83 pgs undersized

  services:
    mon: 2 daemons, quorum proxmoxa,proxmoxb (age 9h)
    mgr: proxmoxb(active, since 9h), standbys: proxmoxa
    osd: 6 osds: 6 up (since 9h), 6 in (since 27h); 115 remapped pgs

  data:
    pools:   2 pools, 129 pgs
    objects: 887 objects, 3.4 GiB
    usage:   11 GiB used, 85 GiB / 96 GiB avail
    pgs:     194/3548 objects degraded (5.468%)
             1566/3548 objects misplaced (44.138%)
             69 active+undersized+remapped
             45 active+clean+remapped
             14 active+undersized+degraded
             1  active+clean

root@proxmoxa:~#

関係なさそう

2025年11月6日

HPE Morpheus VM EssentialsのHCI構成を組んでみた

vSphere代替とも言われるHPE Morpheus VM Essentials (HPE VME, HVM, HPE VM Essentails) で、クラスタをセットアップする際に下記の選択肢がある。

「HVM 1.2 HCI Ceph Cluster on HVM/Ubuntu 24.04」ということで、共有ストレージとしてCephを使用するHCI構成があるらしい。

どういう構成を組めばいいのかわからなかったのですが、ドキュメントを探すと [Infrastructure]-[Clusters]-[HVM Clusters]-[Base Cluster Details]という非常にわかりづらいところに、HCI構成の場合に求める仕様が書いてあった。

・物理サーバが最低3台
・CPUコア数1以上
・メモリ4GB以上。Cephを使う場合ディスク1個ごとに4GB追加
・OSディスク 20GB以上、Ceph用データディスク500GB以上

「Example Cluster Deployment」にサンプル構成がある

まずは、最低限のスペックでUbuntu 24.04+HVMをインストールして、Manager仮想マシンをセットアップした。

クラスタを作るところまでの手順は省略

で・・・私がはまった点の1つとして、レイアウト選択のバグ動作、というのがあります。

ver 8.0.10で実施したところ、「HVM 1.2 HCI Ceph Cluster on HVM/Ubuntu 24.04」を選択したところ下記の様にSSHホストが1つしか選択できない状態でした。

そういうものかと思って手順を進めて作成を開始すると指定してないサーバが2つ登場してプロセスが開始されるという状況に・・・

何度かクラスタレイアウトの選択をやり直すと、下記の様に3サーバ分の空欄が表示されるときがありました。この状態であればHCIクラスタの作成に成功しました。

下記の様に入力し作成を開始

作成完了

まあ、わかってしまえば構成自体は簡単だったのですが・・・

Cephのステータスを確認する画面が無いのはどうかと思うんですよ・・・

たとえば[インフラストラクチャ]-[クラスター]で作成したHCIクラスタを表示すると下記の様に「CEPH」って表示があります。

問題は「ステータス:WARN」って表示以上のことを調べるインタフェースが無い、ということ

[ストレージ]で確認できるものは下記の情報だけで、Cephの状態について確認できない

2025年2月27日2025年4月2日

Oracle Cloudのオブジェクトストレージにcyberduckで接続する

Oracle Cloud(OCI)のオブジェクトストレージに cyberduck を使って接続する際、なんかうまくいかなかったのでメモ書き

us-phoenix-1にあるので、OCI Object Storage (us-phoenix-1).cyberduckprofile を選択したのだが選択しても表示されるサーバ名は「s3.amazonaws.com」となっている。

cyberduckprofileの内容を確認すると下記なので、本来サーバに表示されるべきなのは「<namespace>.compat.objectstorage.us-phoenix-1.oraclecloud.com」なんじゃないのか？という疑問点が・・・

	<key>Protocol</key>
	<string>s3</string>
	<key>Vendor</key>
	<string>oracle-us-phoenix-1</string>
	<key>Description</key>
	<string>OCI Object Storage (us-phoenix-1)</string>
	<key>Hostname Placeholder</key>
	<string>&lt;namespace&gt;.compat.objectstorage.us-phoenix-1.oraclecloud.com</string>
	<key>Hostname Configurable</key>
	<true/>

なお、cyberduckのバージョンは 9.1.2 (42722) でした。

じゃあ、OCIにつなぐときはどう接続すればいいのか？

とりあえず、プロファイルはOCI Object Storageを選択する

サーバ名は「<namespace>.compat.objectstorage.us-phoenix-1.oraclecloud.com」

ここでいう<namespace>は、Oracle Cloudコンソールで[ストレージ]-[オブジェクトストレージ]-[バケット(bucket)]で表示される「ネームスペース」のこととなる。

cyberduckに入力すると以下のような感じになる。

続いてAccess KeyとSecret KeyをOracle Cloudコンソールで作成する。

[アイデンティティとセキュリティ]-[ドメイン]でドメイン「Default」を選択

[アイデンティティ・ドメイン]-[ユーザー]にて、API操作用のユーザを作成

作成後、該当ユーザを選択し、[リソース]-[顧客秘密キー]を開く

「秘密キーの生成」をクリック

ここで表示される「生成されたキー」が Secret Keyとなる。

書いてある通りに再表示できないので、間違えずに保存すること。

画面を閉じて一覧に戻ると、Access Keyが確認できる。

これで、Access KeyとSecret Keyが入手できたので、Cyberduck設定画面に入力する

これで接続をすると以下のようにOracle Cloud上のオブジェクトストレージのbucket内容が表示されるのを確認できた。

2024年11月14日

テスト用に作ったceph環境でOSDが落ちまくるので osd heartbeat grace を変更してみた(様子見中

Proxmox VEのcephストレージ環境の動作を確認するためESXi上に RAM 16GBの仮想マシンを4台作ってテスト中(+1台 corosync qnetdサーバがいてProxmox VEクラスタの維持に使用)

で、あるタイミングから、各ノード上のosdのdownが多発するようになった

root@pve37:~# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         0.62549         -  480 GiB  210 GiB  204 GiB  178 KiB  5.4 GiB  270 GiB  43.65  1.00    -          root default
-3         0.15637         -  160 GiB   59 GiB   58 GiB   47 KiB  1.4 GiB  101 GiB  36.92  0.85    -              host pve36
 0    hdd  0.03909   1.00000   40 GiB   15 GiB   14 GiB    7 KiB  403 MiB   25 GiB  36.27  0.83   36      up          osd.0
 1    hdd  0.03909   1.00000   40 GiB   18 GiB   17 GiB   13 KiB  332 MiB   22 GiB  44.03  1.01   52      up          osd.1
 2    hdd  0.03909   1.00000   40 GiB   11 GiB   10 GiB   18 KiB  337 MiB   29 GiB  26.46  0.61   27      up          osd.2
 3    hdd  0.03909   1.00000   40 GiB   16 GiB   16 GiB    9 KiB  393 MiB   24 GiB  40.91  0.94   48      up          osd.3
-5         0.15637         -  160 GiB   67 GiB   66 GiB   75 KiB  1.6 GiB   93 GiB  41.95  0.96    -              host pve37
 4    hdd  0.03909   1.00000   40 GiB   19 GiB   18 GiB   24 KiB  443 MiB   21 GiB  46.87  1.07   51      up          osd.4
 5    hdd  0.03909   1.00000   40 GiB   11 GiB   11 GiB   21 KiB  201 MiB   29 GiB  28.58  0.65   30      up          osd.5
 6    hdd  0.03909   1.00000   40 GiB   16 GiB   16 GiB   12 KiB  294 MiB   24 GiB  39.51  0.91   40      up          osd.6
 7    hdd  0.03909   1.00000   40 GiB   21 GiB   20 GiB   18 KiB  693 MiB   19 GiB  52.84  1.21   61      up          osd.7
-7         0.15637         -   80 GiB   49 GiB   47 GiB   36 KiB  1.3 GiB   31 GiB  60.91  1.40    -              host pve38
 8    hdd  0.03909         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.8
 9    hdd  0.03909         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.9
10    hdd  0.03909   1.00000   40 GiB   20 GiB   20 GiB   17 KiB  415 MiB   20 GiB  51.02  1.17   53      up          osd.10
11    hdd  0.03909   1.00000   40 GiB   28 GiB   27 GiB   19 KiB  922 MiB   12 GiB  70.80  1.62   73      up          osd.11
-9         0.15637         -   80 GiB   35 GiB   34 GiB   20 KiB  1.1 GiB   45 GiB  43.27  0.99    -              host pve39
12    hdd  0.03909   1.00000   40 GiB   20 GiB   20 GiB    7 KiB  824 MiB   20 GiB  50.81  1.16   63      up          osd.12
13    hdd  0.03909         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.13
14    hdd  0.03909   1.00000   40 GiB   14 GiB   14 GiB   13 KiB  303 MiB   26 GiB  35.72  0.82    0    down          osd.14
15    hdd  0.03909         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0    0    down          osd.15
                       TOTAL  480 GiB  210 GiB  204 GiB  183 KiB  5.4 GiB  270 GiB  43.65
MIN/MAX VAR: 0.61/1.62  STDDEV: 11.55
root@pve37:~#

pve38のosd.8とosd.9がdownになっているので、pve38にログインしてプロセスを確認すると、–id 8 と –id 9のceph-osdサービスが起動していないので、これらを再起動する

root@pve38:~# ps -ef|grep osd
ceph        1676       1  1 12:14 ?        00:02:01 /usr/bin/ceph-osd -f --cluster ceph --id 10 --setuser ceph --setgroup ceph
ceph        1681       1  2 12:14 ?        00:02:45 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph
root       30916   30893  0 14:10 pts/0    00:00:00 grep osd
root@pve38:~# systemctl restart ceph-osd@8
root@pve38:~# systemctl restart ceph-osd@9
root@pve38:~#

しばらく待つとupになる

root@pve38:~# ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         0.62549         -  600 GiB  227 GiB  221 GiB  229 KiB  5.6 GiB  373 GiB  37.84  1.00    -          root default
-3         0.15637         -  160 GiB   53 GiB   52 GiB   47 KiB  1.4 GiB  107 GiB  33.11  0.88    -              host pve36
 0    hdd  0.03909   1.00000   40 GiB   13 GiB   12 GiB    7 KiB  403 MiB   27 GiB  31.50  0.83   28      up          osd.0
 1    hdd  0.03909   1.00000   40 GiB   16 GiB   16 GiB   13 KiB  332 MiB   24 GiB  40.30  1.07   47      up          osd.1
 2    hdd  0.03909   1.00000   40 GiB  9.7 GiB  9.4 GiB   18 KiB  337 MiB   30 GiB  24.21  0.64   22      up          osd.2
 3    hdd  0.03909   1.00000   40 GiB   15 GiB   14 GiB    9 KiB  393 MiB   25 GiB  36.41  0.96   41      up          osd.3
-5         0.15637         -  160 GiB   61 GiB   59 GiB   75 KiB  1.6 GiB   99 GiB  37.89  1.00    -              host pve37
 4    hdd  0.03909   1.00000   40 GiB   16 GiB   15 GiB   24 KiB  443 MiB   24 GiB  39.75  1.05   41      up          osd.4
 5    hdd  0.03909   1.00000   40 GiB   10 GiB   10 GiB   21 KiB  201 MiB   30 GiB  25.52  0.67   26      up          osd.5
 6    hdd  0.03909   1.00000   40 GiB   14 GiB   13 GiB   12 KiB  278 MiB   26 GiB  34.26  0.91   32      up          osd.6
 7    hdd  0.03909   1.00000   40 GiB   21 GiB   20 GiB   18 KiB  693 MiB   19 GiB  52.02  1.37   52      up          osd.7
-7         0.15637         -  120 GiB   57 GiB   55 GiB   54 KiB  1.5 GiB   63 GiB  47.17  1.25    -              host pve38
 8    hdd  0.03909   1.00000   40 GiB   14 GiB   14 GiB   18 KiB  132 MiB   26 GiB  35.75  0.94   30      up          osd.8
 9    hdd  0.03909   1.00000      0 B      0 B      0 B      0 B      0 B      0 B      0     0   22      up          osd.9
10    hdd  0.03909   1.00000   40 GiB   18 GiB   18 GiB   17 KiB  419 MiB   22 GiB  45.92  1.21   42      up          osd.10
11    hdd  0.03909   1.00000   40 GiB   24 GiB   23 GiB   19 KiB  939 MiB   16 GiB  59.84  1.58   42      up          osd.11
-9         0.15637         -  160 GiB   57 GiB   56 GiB   53 KiB  1.2 GiB  103 GiB  35.51  0.94    -              host pve39
12    hdd  0.03909   1.00000   40 GiB   15 GiB   14 GiB    7 KiB  841 MiB   25 GiB  37.05  0.98   37      up          osd.12
13    hdd  0.03909   1.00000   40 GiB   16 GiB   16 GiB   16 KiB  144 MiB   24 GiB  39.70  1.05   42      up          osd.13
14    hdd  0.03909   1.00000   40 GiB   14 GiB   14 GiB   16 KiB   84 MiB   26 GiB  35.82  0.95   39      up          osd.14
15    hdd  0.03909   1.00000   40 GiB   12 GiB   12 GiB   14 KiB  127 MiB   28 GiB  29.48  0.78   30      up          osd.15
                       TOTAL  600 GiB  227 GiB  221 GiB  236 KiB  5.6 GiB  373 GiB  37.84
MIN/MAX VAR: 0/1.58  STDDEV: 12.91
root@pve38:~#

が・・・またしばらくすると、他のosdが落ちる、などしていた

RedHat Ceph Storage 7 トラブルシューティングガイドの「第5章 Ceph OSD のトラブルシューティング」5.1.7. OSDS のフラップを確認すると、osdに指定されているディスクが遅いから、ということになるようだ。

osd_heartbeat_grace_time というパラメータをデフォルトの20秒から変更すると、タイムアウトまでの値を緩和できるのかな、と思ったのだが、どうやって設定するのかが不明…

ceph.orgのOSD Setting を見ると /etc/ceph/ceph.conf (PVEの場合、 /etc/pve/ceph.conf )に追加すればいいのかな？というところなんだけど、OSD Config Reference , Configuring Monitor/OSD Interaction を見ても osd_heartbeat_grace_time というパラメータが無い…(osd_heartbeat_grace ならあった)

RedHatドキュメントの続きに書いてある「この問題を解決するには、以下を行います。」のところを見ると、「ceph osd set noup」「ceph osd set nodown」を設定して、OSDをdownおよびupとしてマークするのを停止する、とある。

試しにnoup,nodownの療法を設定してみたところ、OSDサービスを起動してもceph osd df treeで確認するとdownのままとなっていた。

まあ、upになったとしてもupのマークを付けないのが「noup」だから当然ですね・・・

そんなわけで、「ceph osd unset noup」「ceph osd set nodown」でdownにしない、という設定を入れてみた

設定を入れると「ceph osd stat」での状態確認で「flags nodown」と表示されるようになる。

root@pve38:~# ceph osd stat
16 osds: 16 up (since 62m), 16 in (since 62m); epoch: e4996
flags nodown
root@pve38:~#

とりあえず、これで一時的なごまかしはできた。

ただ、これは、OSDで使用しているディスクが壊れたとしても downにならない、ということでもある。

なので、「nodown」フラグを設定しっぱなしで使う、というのはとても不適切となる。

ちゃんとした対処を行うためには、具体的に何が問題になっているのかを「ceph health detail」を実行して、具体的にSlow OSD heartbeats がどれくらい遅いのかを確認する

root@pve38:~# ceph health detail
HEALTH_WARN nodown flag(s) set; Slow OSD heartbeats on back (longest 5166.450ms); Slow OSD heartbeats on front (longest 5467.151ms)
[WRN] OSDMAP_FLAGS: nodown flag(s) set
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 5166.450ms)
    Slow OSD heartbeats on back from osd.13 [] to osd.8 [] 5166.450 msec
    Slow OSD heartbeats on back from osd.13 [] to osd.0 [] 3898.044 msec
    Slow OSD heartbeats on back from osd.12 [] to osd.9 [] 3268.881 msec
    Slow OSD heartbeats on back from osd.10 [] to osd.3 [] 2610.064 msec possibly improving
    Slow OSD heartbeats on back from osd.12 [] to osd.8 [] 2588.321 msec
    Slow OSD heartbeats on back from osd.6 [] to osd.14 [] 2565.141 msec
    Slow OSD heartbeats on back from osd.8 [] to osd.7 [] 2385.851 msec possibly improving
    Slow OSD heartbeats on back from osd.13 [] to osd.11 [] 2324.505 msec
    Slow OSD heartbeats on back from osd.8 [] to osd.12 [] 2305.474 msec possibly improving
    Slow OSD heartbeats on back from osd.14 [] to osd.11 [] 2275.033 msec
    Truncated long network list.  Use ceph daemon mgr.# dump_osd_network for more information
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 5467.151ms)
    Slow OSD heartbeats on front from osd.13 [] to osd.8 [] 5467.151 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.0 [] 3956.364 msec
    Slow OSD heartbeats on front from osd.12 [] to osd.9 [] 3513.493 msec
    Slow OSD heartbeats on front from osd.12 [] to osd.8 [] 2657.999 msec
    Slow OSD heartbeats on front from osd.6 [] to osd.14 [] 2657.486 msec
    Slow OSD heartbeats on front from osd.10 [] to osd.3 [] 2610.558 msec possibly improving
    Slow OSD heartbeats on front from osd.8 [] to osd.7 [] 2436.661 msec possibly improving
    Slow OSD heartbeats on front from osd.14 [] to osd.11 [] 2351.914 msec
    Slow OSD heartbeats on front from osd.14 [] to osd.10 [] 2351.812 msec
    Slow OSD heartbeats on front from osd.13 [] to osd.11 [] 2335.698 msec
    Truncated long network list.  Use ceph daemon mgr.# dump_osd_network for more information
root@pve38:~#

osd.7のログが出てるpve37にログインして /var/log/ceph/ceph-osd.7.log から「no replay from」と「osd.8」でgrep をかけてログを確認

おそらく「Slow OSD heartbeats on front from osd.8 [] to osd.7 [] 2436.661 msec」に相当するあたりがコレなのかな？というところ


2024-11-14T14:46:05.457+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:02.037605+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:06.454+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:02.037605+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:07.467+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:07.338127+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:08.418+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:07.338127+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:09.371+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:09.038264+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:10.416+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:09.038264+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)
2024-11-14T14:46:11.408+0900 7e72364006c0 -1 osd.7 4996 heartbeat_check: no reply from 172.17.44.38:6802 osd.8 since back 2024-11-14T14:46:11.338592+0900 front 2024-11-14T14:45:41.850539+0900 (oldest deadline 2024-11-14T14:46:05.334473+0900)

oldset deadlineにある時刻と、その前にある時刻の差は20秒なので、 osd_heartbeat_grace もしくは osd_heartbeat_grace_time のデフォルト値 20 が効いてるんだろうなぁ、と推定できる

設定手法について記載を探してみたのだがなかなかない

Ceph Block Device 3rd Party Integration » Ceph iSCSI Gateway » iSCSI Gateway Requirements に下記のような設定例がある

[osd]
osd heartbeat grace = 20
osd heartbeat interval = 5

また、下記のように個別OSDに対して値を設定することも可能であるようだ

ceph tell osd.* config set osd_heartbeat_grace 20
ceph tell osd.* config set osd_heartbeat_interval 5

ceph daemon osd.0 config set osd_heartbeat_grace 20
ceph daemon osd.0 config set osd_heartbeat_interval 5

ceph tellの書式を確認すると「ceph tell osd.* config get osd_heartbeat_grace」で値がとれる模様

root@pve37:~# ceph tell osd.* config get osd_heartbeat_grace
osd.0: {
    "osd_heartbeat_grace": "20"
}
osd.1: {
    "osd_heartbeat_grace": "20"
}
osd.2: {
    "osd_heartbeat_grace": "20"
}
osd.3: {
    "osd_heartbeat_grace": "20"
}
osd.4: {
    "osd_heartbeat_grace": "20"
}
osd.5: {
    "osd_heartbeat_grace": "20"
}
osd.6: {
    "osd_heartbeat_grace": "20"
}
osd.7: {
    "osd_heartbeat_grace": "20"
}
osd.8: {
    "osd_heartbeat_grace": "20"
}
osd.9: {
    "osd_heartbeat_grace": "20"
}
osd.10: {
    "osd_heartbeat_grace": "20"
}
osd.11: {
    "osd_heartbeat_grace": "20"
}
osd.12: {
    "osd_heartbeat_grace": "20"
}
osd.13: {
    "osd_heartbeat_grace": "20"
}
osd.14: {
    "osd_heartbeat_grace": "20"
}
osd.15: {
    "osd_heartbeat_grace": "20"
}
root@pve37:~#

とりあえず「ceph tell osd.* config set osd_heartbeat_grace 30」と実行し、30に設定してみる

root@pve37:~# ceph tell osd.* config set osd_heartbeat_grace 30
osd.0: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.1: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.2: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.3: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.4: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.5: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.6: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.7: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.8: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.9: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.10: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.11: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.12: {
    "success": "osd_heartbeat_grace = '' (not observed, change may require restart) "
}
osd.13: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.14: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
osd.15: {
    "success": "osd_delete_sleep = '' osd_delete_sleep_hdd = '' osd_delete_sleep_hybrid = '' osd_delete_sleep_ssd = '' osd_heartbeat_grace = '' (not observed, change may require restart) osd_max_backfills = '' osd_pg_delete_cost = '' (not observed, change may require restart) osd_recovery_max_active = '' osd_recovery_max_active_hdd = '' osd_recovery_max_active_ssd = '' osd_recovery_sleep = '' osd_recovery_sleep_hdd = '' osd_recovery_sleep_hybrid = '' osd_recovery_sleep_ssd = '' osd_scrub_sleep = '' osd_snap_trim_sleep = '' osd_snap_trim_sleep_hdd = '' osd_snap_trim_sleep_hybrid = '' osd_snap_trim_sleep_ssd = '' "
}
root@pve37:~#

すべて「”success”」ではあるので設定変更は完了しているのだと思うが、応答が2種類あるのはなんなのだろうか？

設定が変更されたかどうかを確認

root@pve37:~# ceph tell osd.* config get osd_heartbeat_grace
osd.0: {
    "osd_heartbeat_grace": "30"
}
osd.1: {
    "osd_heartbeat_grace": "30"
}
osd.2: {
    "osd_heartbeat_grace": "30"
}
osd.3: {
    "osd_heartbeat_grace": "30"
}
osd.4: {
    "osd_heartbeat_grace": "30"
}
osd.5: {
    "osd_heartbeat_grace": "30"
}
osd.6: {
    "osd_heartbeat_grace": "30"
}
osd.7: {
    "osd_heartbeat_grace": "30"
}
osd.8: {
    "osd_heartbeat_grace": "30"
}
osd.9: {
    "osd_heartbeat_grace": "30"
}
osd.10: {
    "osd_heartbeat_grace": "30"
}
osd.11: {
    "osd_heartbeat_grace": "30"
}
osd.12: {
    "osd_heartbeat_grace": "30"
}
osd.13: {
    "osd_heartbeat_grace": "30"
}
osd.14: {
    "osd_heartbeat_grace": "30"
}
osd.15: {
    "osd_heartbeat_grace": "30"
}
root@pve37:~#

とはいえ、set時の出力に「(not observed, change may require restart)」とあるとおり、ceph-osdの再起動が必須であるようだ

/etc/pve/ceph.conf に変更したパラメータは反映されてない模様なので、 osd.4～osd.7があるサーバを再起動してからもう一度値を確認してみたら、20に戻っていた。

root@pve38:~# ceph tell osd.* config get osd_heartbeat_grace
osd.0: {
    "osd_heartbeat_grace": "30"
}
osd.1: {
    "osd_heartbeat_grace": "30"
}
osd.2: {
    "osd_heartbeat_grace": "30"
}
osd.3: {
    "osd_heartbeat_grace": "30"
}
osd.4: {
    "osd_heartbeat_grace": "20"
}
osd.5: {
    "osd_heartbeat_grace": "20"
}
osd.6: {
    "osd_heartbeat_grace": "20"
}
osd.7: {
    "osd_heartbeat_grace": "20"
}
osd.8: {
    "osd_heartbeat_grace": "30"
}
osd.9: {
    "osd_heartbeat_grace": "30"
}
osd.10: {
    "osd_heartbeat_grace": "30"
}
osd.11: {
    "osd_heartbeat_grace": "30"
}
osd.12: {
    "osd_heartbeat_grace": "30"
}
osd.13: {
    "osd_heartbeat_grace": "30"
}
osd.14: {
    "osd_heartbeat_grace": "30"
}
osd.15: {
    "osd_heartbeat_grace": "30"
}
root@pve38:~#

/etc/pve/ceph.conf の最後に下記を追加

[osd]
        osd heartbeat grace = 30

設定後、再起動してから確認すると、想定通り30になっているのを確認。そもそも、osd_heartbeat_grace についてはceph tellコマンドでの設定変更後、再起動しないでも大丈夫、というやつなんでは？

root@pve38:~# ceph tell osd.* config get osd_heartbeat_grace
osd.0: {
    "osd_heartbeat_grace": "30"
}
osd.1: {
    "osd_heartbeat_grace": "30"
}
osd.2: {
    "osd_heartbeat_grace": "30"
}
osd.3: {
    "osd_heartbeat_grace": "30"
}
osd.4: {
    "osd_heartbeat_grace": "30"
}
osd.5: {
    "osd_heartbeat_grace": "30"
}
osd.6: {
    "osd_heartbeat_grace": "30"
}
osd.7: {
    "osd_heartbeat_grace": "30"
}
osd.8: {
    "osd_heartbeat_grace": "30"
}
osd.9: {
    "osd_heartbeat_grace": "30"
}
osd.10: {
    "osd_heartbeat_grace": "30"
}
osd.11: {
    "osd_heartbeat_grace": "30"
}
osd.12: {
    "osd_heartbeat_grace": "30"
}
osd.13: {
    "osd_heartbeat_grace": "30"
}
osd.14: {
    "osd_heartbeat_grace": "30"
}
osd.15: {
    "osd_heartbeat_grace": "30"
}
root@pve38:~#

2024年11月11日2024年11月12日

Proxmox VEクラスタをUPSで停止する手法のメモ(調査段階

Proxmox VE環境でcephによるストレージ領域を作成して、物理的にディスクを共有していない状態で複数サーバ間を1ファイルシステムで運用できる環境についての試験中。

Proxmox VEクラスタをUPS連動で停止する場合の処理について確認しているのだが、公式ドキュメントにそのまま使えるようなものがないので、情報を集めているところ・・・

Proxmox VE公式ドキュメント:Shutdown Proxmox VE + Ceph HCI cluster
これはceph側の処理だけかかれていて、ceph側を止める前には仮想マシンを停止しなければならないのに、そこについて触れていない。

Proxmox VEフォーラムを探す
Clean shutdown of whole cluster (2023／01／16)
Shutdown of the Hyper-Converged Cluster (CEPH) (2020/04/05)

ここらのスクリプトが使えそうだが、仮想マシン/コンテナを停止するのは「各ノードで pvenode stopallを実行」ではなく、「APIを使ってすべてを停止する」が推奨される模様。

ここから試験

「pvesh get /nodes」でノードリストを作って、「ssh ホスト名 pvenode stopall」で仮想マシンを停止できるか試してみたが、管理Web上は「VMとコンテナの一括シャットダウン」と出力されるのだが、仮想マシンの停止が実行されなかった。

3.11.3. Bulk Guest Power Management を見ると止まっても良さそうなんだけど・・・

pvesh コマンドを調べるとこちらでも停止させることができる模様なので下記で実施

for hostname in `pvesh ls /nodes/|awk '{ print $2 }'`; do for vmid in `pvesh ls /nodes/$hostname/qemu/|awk '{ print $2 }'`; do pvesh create /nodes/$hostname/qemu/$vmid/status/shutdown; done; done

実行例

root@pve36:~# for hostname in `pvesh ls /nodes/|awk '{ print $2 }'`
> do for vmid in `pvesh ls /nodes/$hostname/qemu/|awk '{ print $2 }'`
>   do
>     pvesh create /nodes/$hostname/qemu/$vmid/status/shutdown
>   done
> done
VM 102 not running

Requesting HA stop for VM 102
UPID:pve36:00014A47:0018FC0A:6731A591:hastop:102:root@pam:
VM 100 not running
Requesting HA stop for VM 100"UPID:pve37:00013A71:0018FDDC:6731A597:hastop:100:root@pam:"
VM 101 not running
Requesting HA stop for VM 101"UPID:pve38:000144BF:00192F03:6731A59D:hastop:101:root@pam:"
Requesting HA stop for VM 103"UPID:pve38:000144DC:0019305D:6731A5A0:hastop:103:root@pam:"
root@pve36:~#

これで、停止することを確認

次に仮想マシンの停止確認

root@pve36:~# pvesh get /nodes/pve36/qemu
lqqqqqqqqqwqqqqqqwqqqqqqwqqqqqqwqqqqqqqqqqqwqqqqqqqqqqwqqqqqqqqqwqqqqqwqqqqqqqqqqqwqqqqqqqqqqqqqqqqqwqqqqqqqqqqqqqqwqqqqqqwqqqqqqqqk
x status  x vmid x cpus x lock x   maxdisk x   maxmem x name    x pid x qmpstatus x running-machine x running-qemu x tags x uptime x
tqqqqqqqqqnqqqqqqnqqqqqqnqqqqqqnqqqqqqqqqqqnqqqqqqqqqqnqqqqqqqqqnqqqqqnqqqqqqqqqqqnqqqqqqqqqqqqqqqqqnqqqqqqqqqqqqqqnqqqqqqnqqqqqqqqu
x stopped x  102 x    2 x      x 32.00 GiB x 2.00 GiB x testvm2 x     x           x                 x              x      x     0s x
mqqqqqqqqqvqqqqqqvqqqqqqvqqqqqqvqqqqqqqqqqqvqqqqqqqqqqvqqqqqqqqqvqqqqqvqqqqqqqqqqqvqqqqqqqqqqqqqqqqqvqqqqqqqqqqqqqqvqqqqqqvqqqqqqqqj
root@pve36:~#

ヘッダーを付けない形式で出力

root@pve36:~# pvesh get /nodes/pve36/qemu --noborder --noheader
stopped  102    2      32.00 GiB 2.00 GiB testvm2                                                 0s
root@pve36:~#

“stopped” となってない行があればまだ停止していない、ということになるので「pvesh get /nodes/pve36/qemu –noborder –noheader|grep -v “stopped”」の出力結果があるかどうかで判断できそう

また、これらはqemu仮想マシンについてのみなので、lxcコンテナについては含まれないので、そちらについても対応する

停止

for hostname in `pvesh ls /nodes/|awk '{ print $2 }'`; do for vmid in `pvesh ls /nodes/$hostname/lxc/|awk '{ print $2 }'`; do pvesh create /nodes/$hostname/lxc/$vmid/status/shutdown;done;done

仮想マシンが止まったかを判断するには、すべての仮想マシンの状態が”stopped”になっているか、で判定するなら下記

for hostname in `pvesh ls /nodes/|awk '{ print $2 }'`; do  echo "=== $hostname ===";    flag=0;  while [ $flag -eq 0 ];  do    pvesh get /nodes/$hostname/qemu --noborder --noheader|grep -v "stopped" > /dev/null;    flag=$?;    echo $flag;  done; done

すべての仮想マシンの状態で”running”がないことなら

for hostname in pvesh ls /nodes/|awk '{ print $2 }'; do echo “=== $hostname ===”; flag=0; while [ $flag -eq 0 ]; do pvesh get /nodes/$hostname/qemu –noborder –noheader|grep “running” > /dev/null ; flag=$?; echo $flag; done; done

どっちがいいかは悩むところ

Cephの停止についてはRedHatの「2.10. Red Hat Ceph Storage クラスターの電源をオフにして再起動」とProxmoxの「Shutdown Proxmox VE + Ceph HCI cluster 」を確認

Proxmox VE側だと下記だけ

ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set nodown
ceph osd set pause

RedHat側にはこれらを実行する前にcephfsを停止するための手順が追加されている。

ceph fs set FS_NAME max_mds 1
ceph mds deactivate FS_NAME:1 # rank 2 of 2
ceph status # wait for rank 1 to finish stopping
ceph fs set FS_NAME cluster_down true
ceph mds fail FS_NAME:0

ceph fs setで設定しているmax_mdsとcluster_downの値はどうなっているのかを確認

root@pve36:~# ceph fs get cephfs
Filesystem 'cephfs' (1)
fs_name cephfs
epoch   65
flags   12 joinable allow_snaps allow_multimds_snaps
created 2024-11-05T14:29:45.941671+0900
modified        2024-11-11T11:04:06.223151+0900
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
max_xattr_size  65536
required_client_features        {}
last_failure    0
last_failure_osd_epoch  3508
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=784113}
failed
damaged
stopped
data_pools      [3]
metadata_pool   4
inline_data     disabled
balancer
bal_rank_mask   -1
standby_count_wanted    1
[mds.pve36{0:784113} state up:active seq 29 addr [v2:172.17.44.36:6800/1472122357,v1:172.17.44.36:6801/1472122357] compat {c=[1],r=[1],i=[7ff]}]
root@pve36:~#

cluster_downはない？

root@pve36:~# ceph mds stat
cephfs:1 {0=pve36=up:active} 1 up:standby
root@pve36:~# ceph mds compat show
compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
root@pve36:~#

んー？　ceph fsのcluster_down を実行してみる

root@pve36:~# ceph fs set cephfs cluster_down true
cephfs marked not joinable; MDS cannot join as newly active. WARNING: cluster_down flag is deprecated and will be removed in a future version. Please use "joinable".
root@pve36:~#

joinableを使えとあるので、この記述は古いらしい

おや？と思って再度探したところRedHatのドキュメントが古かった。RedHat「2.5. Red Hat Ceph Storage クラスターの電源をオフにして再起動」、もしくはIBMの「Ceph File System ・クラスターの停止」

ceph fs set FS_NAME max_mds 1
ceph fs fail FS_NAME
ceph status
ceph fs set FS_NAME joinable false

IBM手順の方だとmax_mdsの操作は行わずに実施している

root@pve36:~# ceph fs status
cephfs - 4 clients
======
RANK  STATE    MDS      ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  pve36  Reqs:    0 /s    21     20     16     23
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   122G
  cephfs_data      data    31.8G   122G
STANDBY MDS
   pve37
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~# ceph fs fail cephfs
cephfs marked not joinable; MDS cannot join the cluster. All MDS ranks marked failed.
root@pve36:~# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   122G
  cephfs_data      data    31.8G   122G
STANDBY MDS
   pve37
   pve36
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~# ceph fs fail cephfs
cephfs marked not joinable; MDS cannot join the cluster. All MDS ranks marked failed.
root@pve36:~# echo $?
0
root@pve36:~#  ceph fs set cephfs joinable false
cephfs marked not joinable; MDS cannot join as newly active.
root@pve36:~# echo $?
0
root@pve36:~# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   122G
  cephfs_data      data    31.8G   122G
STANDBY MDS
   pve37
   pve36
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~#

…止めてしまったら、dfコマンドが実行できなくなるので注意

root@pve36:~# ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3599
root@pve36:~# ceph osd set noout
noout is set
root@pve36:~# ceph osd set norecover
norecover is set
root@pve36:~# ceph osd set norebalance
norebalance is set
root@pve36:~# ceph osd set nobackfill
nobackfill is set
root@pve36:~# ceph osd set nodown
nodown is set
root@pve36:~# ceph osd set pause
pauserd,pausewr is set
root@pve36:~# ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3605
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~#

ドキュメントに「重要上記の例は、OSD ノード内のサービスと各 OSD を停止する場合のみであり、各 OSD ノードで繰り返す必要があります。」とあるので、各サーバで確認してみたが、別に各サーバで実行する必要はなさそうである。

root@pve36:~# ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3605
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~# ssh pve37 ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3605
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~# ssh pve38 ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3605
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~# ssh pve39 ceph osd stat
16 osds: 16 up (since 6h), 16 in (since 3d); epoch: e3605
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~#

このあと、各サーバに対してshutdown -h nowを実行して止めた

起動後

root@pve36:~# ceph status
  cluster:
    id:     4647497d-17da-46f4-8e7b-231365d96e42
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

  services:
    mon: 3 daemons, quorum pve36,pve37,pve38 (age 41s)
    mgr: pve38(active, since 30s), standbys: pve37, pve36
    mds: 0/1 daemons up (1 failed), 2 standby
    osd: 16 osds: 16 up (since 48s), 16 in (since 3d)
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   4 pools, 193 pgs
    objects: 17.68k objects, 69 GiB
    usage:   206 GiB used, 434 GiB / 640 GiB avail
    pgs:     193 active+clean

root@pve36:~# ceph osd stat
16 osds: 16 up (since 69s), 16 in (since 3d); epoch: e3621
flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
root@pve36:~#

root@pve36:~# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
root@pve36:~# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   125G
  cephfs_data      data    31.8G   125G
STANDBY MDS
   pve37
   pve36
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~#

復帰のためのコマンド群1

root@pve36:~# ceph osd unset noout
noout is unset
root@pve36:~# ceph osd unset norecover
norecover is unset
root@pve36:~# ceph osd unset norebalance
norebalance is unset
root@pve36:~#  ceph osd unset nobackfill
nobackfill is unset
root@pve36:~# ceph osd unset nodown
nodown is unset
root@pve36:~# ceph osd unset pause
pauserd,pausewr is unset
root@pve36:~# ceph osd stat
16 osds: 16 up (since 100s), 16 in (since 3d); epoch: e3627
root@pve36:~# ceph status
  cluster:
    id:     4647497d-17da-46f4-8e7b-231365d96e42
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline

  services:
    mon: 3 daemons, quorum pve36,pve37,pve38 (age 102s)
    mgr: pve38(active, since 90s), standbys: pve37, pve36
    mds: 0/1 daemons up (1 failed), 2 standby
    osd: 16 osds: 16 up (since 108s), 16 in (since 3d)

  data:
    volumes: 0/1 healthy, 1 failed
    pools:   4 pools, 193 pgs
    objects: 17.68k objects, 69 GiB
    usage:   206 GiB used, 434 GiB / 640 GiB avail
    pgs:     193 active+clean

  io:
    client:   21 KiB/s rd, 0 B/s wr, 9 op/s rd, 1 op/s wr

root@pve36:~#

ファイルシステム再開

root@pve36:~# ceph fs set cephfs joinable true
cephfs marked joinable; MDS may join as newly active.
root@pve36:~# ceph fs status
cephfs - 4 clients
======
RANK    STATE     MDS   ACTIVITY   DNS    INOS   DIRS   CAPS
 0    reconnect  pve36              10     10      6      0
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   125G
  cephfs_data      data    31.8G   125G
STANDBY MDS
   pve37
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~# ceph osd stat
16 osds: 16 up (since 2m), 16 in (since 3d); epoch: e3627
root@pve36:~# cpeh status
-bash: cpeh: command not found
root@pve36:~# ceph status
  cluster:
    id:     4647497d-17da-46f4-8e7b-231365d96e42
    health: HEALTH_WARN
            1 filesystem is degraded

  services:
    mon: 3 daemons, quorum pve36,pve37,pve38 (age 2m)
    mgr: pve38(active, since 2m), standbys: pve37, pve36
    mds: 1/1 daemons up, 1 standby
    osd: 16 osds: 16 up (since 2m), 16 in (since 3d)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   4 pools, 193 pgs
    objects: 17.68k objects, 69 GiB
    usage:   206 GiB used, 434 GiB / 640 GiB avail
    pgs:     193 active+clean

root@pve36:~#
root@pve36:~# ceph fs status
cephfs - 4 clients
======
RANK  STATE    MDS   ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin  pve36              10     10      6      0
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   125G
  cephfs_data      data    31.8G   125G
STANDBY MDS
   pve37
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~# df
Filesystem           1K-blocks     Used Available Use% Mounted on
udev                   8156156        0   8156156   0% /dev
tmpfs                  1638000     1124   1636876   1% /run
/dev/mapper/pve-root  28074060 14841988  11780656  56% /
tmpfs                  8189984    73728   8116256   1% /dev/shm
tmpfs                     5120        0      5120   0% /run/lock
/dev/fuse               131072       36    131036   1% /etc/pve
tmpfs                  8189984       28   8189956   1% /var/lib/ceph/osd/ceph-2
tmpfs                  8189984       28   8189956   1% /var/lib/ceph/osd/ceph-0
tmpfs                  8189984       28   8189956   1% /var/lib/ceph/osd/ceph-1
tmpfs                  8189984       28   8189956   1% /var/lib/ceph/osd/ceph-3
tmpfs                  1637996        0   1637996   0% /run/user/0
root@pve36:~# ceph fs status
cephfs - 0 clients
======
RANK  STATE    MDS      ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  pve36  Reqs:    0 /s    10     10      6      0
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   244M   125G
  cephfs_data      data    31.8G   125G
STANDBY MDS
   pve37
MDS version: ceph version 18.2.4 (2064df84afc61c7e63928121bfdd74c59453c893) reef (stable)
root@pve36:~# df
Filesystem                               1K-blocks     Used Available Use% Mounted on
udev                                       8156156        0   8156156   0% /dev
tmpfs                                      1638000     1128   1636872   1% /run
/dev/mapper/pve-root                      28074060 14841992  11780652  56% /
tmpfs                                      8189984    73728   8116256   1% /dev/shm
tmpfs                                         5120        0      5120   0% /run/lock
/dev/fuse                                   131072       36    131036   1% /etc/pve
tmpfs                                      8189984       28   8189956   1% /var/lib/ceph/osd/ceph-2
tmpfs                                      8189984       28   8189956   1% /var/lib/ceph/osd/ceph-0
tmpfs                                      8189984       28   8189956   1% /var/lib/ceph/osd/ceph-1
tmpfs                                      8189984       28   8189956   1% /var/lib/ceph/osd/ceph-3
tmpfs                                      1637996        0   1637996   0% /run/user/0
172.17.44.36,172.17.44.37,172.17.44.38:/ 142516224 11116544 131399680   8% /mnt/pve/cephfs
root@pve36:~#

ただ、これだと通常のPVE起動プロセスで実行される「VMとコンテナの一括起動」で仮想マシンが実行されなかった。おや？と思ったら、設定が変わってた

root@pve36:~# ha-manager status
quorum OK
master pve39 (active, Mon Nov 11 18:24:13 2024)
lrm pve36 (idle, Mon Nov 11 18:24:15 2024)
lrm pve37 (idle, Mon Nov 11 18:24:18 2024)
lrm pve38 (idle, Mon Nov 11 18:24:18 2024)
lrm pve39 (idle, Mon Nov 11 18:24:15 2024)
service vm:100 (pve37, stopped)
service vm:101 (pve38, stopped)
service vm:102 (pve36, stopped)
service vm:103 (pve38, stopped)
root@pve36:~# ha-manager config
vm:100
        state stopped

vm:101
        state stopped

vm:102
        state stopped

vm:103
        state stopped

root@pve36:~#