Configuring KVM cluster with CEPH filesystem
General overview
This is how we have built a fully hyperconverged cluster with open source technologies.
We deployed on 3 hardware nodes (used): HP Gen10 servers with 10 GbE interfaces, 2 Gold CPUs (24 cores) each, 256 GB RAM, a 480 GB SSD for the system, and multiple 4 TB SSDs to share as OSDs for the CEPH filesystem.
OS: Rocky Linux 9 with the KVM hypervisor and tools preinstalled.
References
Installation:
https://kifarunix.com/how-to-deploy-ceph-storage-cluster-on-rocky-linux/#create-ceph-os-ds-from-attached-storage-devices
Block devices:
https://kifarunix.com/configure-and-use-ceph-block-device-on-linux-clients/
KVM pools:
https://blog.modest-destiny.com/posts/kvm-libvirt-add-ceph-rbd-pool/
Installation
Start from at least 3 nodes with SSH key-based access between them.
As in the cluster setup, add reciprocal name resolution entries in `/etc/hosts` on each node:
10.0.0.50 testkvm1.cifarelli.loc testkvm1
10.0.0.51 testkvm2.cifarelli.loc testkvm2
10.0.0.52 testkvm3.cifarelli.loc testkvm3
10.0.0.50 localceph localceph.cifarelli.loc
The last line differs for each host and is used by KVM to always initialize the POOL on the local address (connecting to the monitor that must run on ALL nodes), even when VMs are migrated.
This is a workaround: by default the monitor does not listen on localhost, and the VM must be configured so that it does not depend on other hosts. This way a VM keeps running even if another node fails.
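For example, on testkvm2 the `localceph` entry points to that node's own address (a sketch, assuming the same addressing as above):
# last line of /etc/hosts on testkvm2
10.0.0.51 localceph localceph.cifarelli.loc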
Install docker:
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
dnf install docker-ce
systemctl enable --now docker containerd
docker --version
Install the Ceph repository and cephadm:
CEPH_RELEASE=19.2.0
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install
which cephadm
Initialize the cephadm admin node:
cephadm bootstrap --mon-ip 10.0.0.50 --allow-fqdn-hostname
Warning! Among the output messages you should see the web interface credentials (note them down):
Ceph Dashboard is now available at:
URL: https://testkvm1.cifarelli.loc:8443/
User: admin
Password: tdocjdyd3d
This should:
- Create a monitor and manager daemon for the new cluster on the localhost.
- Generate a new SSH key for the Ceph cluster and add it to the root user’s /root/.ssh/authorized_keys file.
- Write a copy of the public key to /etc/ceph/ceph.pub.
- Write a minimal configuration file to /etc/ceph/ceph.conf.
- Write a copy of the client.admin administrative key to /etc/ceph/ceph.client.admin.keyring.
- Add the _admin label to the bootstrap host.
To see the containers created:
podman ps
To see the systemd units created and started:
systemctl list-units 'ceph*'
To install the CLI:
cephadm install ceph-common
Now the CLI is available. For example:
ceph -s
Copy the cluster SSH public key to the other hosts:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm3
Add the other hosts to the cluster (cephadm will then also be able to deploy monitor daemons on them):
ceph orch host add testkvm2.cifarelli.loc
ceph orch host add testkvm3.cifarelli.loc
To ensure ALL nodes have an active monitor (required by KVM):
ceph orch apply mon --placement="testkvm1 testkvm2 testkvm3"
(verify this).
Or:
ceph orch apply mon 3
This means: always keep 3 active monitor daemons, i.e., one per node.
Similarly:
ceph orch apply mgr 3
Ensures there are 3 manager nodes (1 active + 2 standby)
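To check how monitors and managers are actually placed, the orchestrator listing can be used (standard commands):
ceph orch ls mon
ceph orch ls mgr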
Add labels corresponding to the nodes:
ceph orch host label add testkvm1.cifarelli.loc mon-mgr-01
ceph orch host label add testkvm2.cifarelli.loc mon-mgr-02
ceph orch host label add testkvm3.cifarelli.loc mon-mgr-03
ceph orch host ls
On other nodes, it's better to install at least the CLI:
CEPH_RELEASE=19.2.0
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install ceph-common
Copy keyring and config from testkvm1:
scp /etc/ceph/* testkvm2:/etc/ceph
Now from testkvm2 try:
ceph -s
Add OSDs: check which devices are available:
ceph orch device ls --wide --refresh
(`--wide --refresh` is useful to rescan devices, especially if disks were just added). Example response:
HOST                    PATH      TYPE  TRANSPORT  RPM  DEVICE ID             SIZE   HEALTH  IDENT  FAULT  AVAILABLE  REFRESHED  REJECT REASONS
testkvm1.cifarelli.loc  /dev/sr0  hdd        QEMU_DVD-ROM_QM00001  2048   N/A  N/A  No   80s ago  Insufficient space (<5GB)
testkvm1.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  No   80s ago  Has BlueStore device label, Has a FileSystem
testkvm2.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  No   75s ago  Has BlueStore device label, Has a FileSystem
testkvm3.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  No   76s ago  Has BlueStore device label, Has a FileSystem
In the above case, the storage was not available. Once released:
HOST                    PATH      TYPE  TRANSPORT  RPM  DEVICE ID             SIZE   HEALTH  IDENT  FAULT  AVAILABLE  REFRESHED  REJECT REASONS
testkvm1.cifarelli.loc  /dev/sr0  hdd        QEMU_DVD-ROM_QM00001  2048   N/A  N/A  No   11s ago  Insufficient space (<5GB)
testkvm1.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  Yes  11s ago
testkvm2.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  Yes  7s ago
testkvm3.cifarelli.loc  /dev/vdb  hdd        50.0G  N/A  N/A  Yes  7s ago
We can see that there are 3 volumes of 50 GB, one attached to each node, all available. Add them all automatically:
ceph orch apply osd --all-available-devices --method raw
If some disks were not available, once freed they will appear as available; then re-run:
ceph orch apply osd --all-available-devices --method raw
To avoid automatic disk addition:
ceph orch apply osd --all-available-devices --unmanaged=true
Manual addition:
ceph orch daemon add osd <host>:<device-path>
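For example, to add a single OSD on a specific device (here `/dev/vdc` on testkvm2 is just a hypothetical, freshly attached disk):
ceph orch daemon add osd testkvm2.cifarelli.loc:/dev/vdc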
Now check status:
ceph -s
  cluster:
    id:     71446092-6e7f-11ef-b946-525400747546
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum testkvm1,testkvm2,testkvm3 (age 20m)
    mgr: testkvm2.ayridz(active, since 19m), standbys: testkvm1.wgrpdw
    osd: 4 osds: 3 up (since 3m), 4 in (since 68s)
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   81 MiB used, 150 GiB / 150 GiB avail
    pgs:     1 active+clean
The cluster is fine: it has 3 monitors and 4 OSDs
Copy the configs to the other nodes:
scp /etc/ceph/ceph* testkvm2.cifarelli.loc:/etc/ceph
scp /etc/ceph/ceph* testkvm3.cifarelli.loc:/etc/ceph
Install the shell on the other nodes:
CEPH_RELEASE=18.2.4
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install
cephadm install ceph-common
Now the shell and administration commands are available on every node
Deploy the manager daemon on all nodes:
ceph orch daemon add mgr testkvm2.cifarelli.loc
ceph orch daemon add mgr testkvm3.cifarelli.loc
Now all nodes are managers
Pools
To be able to delete pools:
ceph config set mon mon_allow_pool_delete true
To delete a pool:
ceph osd pool delete .mgr .mgr --yes-i-really-really-mean-it
To create a pool (with 128 placement groups):
ceph osd pool create rbdpool 128 128
ceph osd pool application enable rbdpool rbd
The same can be done via the web interface
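To verify the pool and its application tag from the CLI (standard commands):
ceph osd pool ls detail
ceph osd pool autoscale-status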
KVM pool
Each host should also be a monitor.
By default, the number of monitors is lower. To ensure every host is a monitor:
ceph orch apply mon --placement="KVM1 KVM2 KVM3 KVM4"
After the RBD pool is created, it can be added to KVM nodes.
First, create an auth key (required):
ceph auth get-or-create "client.libvirt" mon "profile rbd" osd "profile rbd pool=rbdpool"
Then define the corresponding secret on the KVM side. Create an `auth.xml` file as follows:
<secret ephemeral='no' private='no'>
<uuid>47f5d4b2-75da-4ecd-be13-a8cfac626b95</uuid>
<usage type='ceph'>
<name>client.libvirt secret</name>
</usage>
</secret>
Then ON EACH HOST:
virsh secret-define --file auth.xml
virsh secret-set-value --secret "47f5d4b2-75da-4ecd-be13-a8cfac626b95" --base64 "$(ceph auth get-key client.libvirt)"
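To verify that the secret was stored correctly on each host:
virsh secret-list
virsh secret-get-value 47f5d4b2-75da-4ecd-be13-a8cfac626b95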
Then create the pool from this `pool.xml` template:
<pool type="rbd">
<name>rpool01</name>
<source>
<name>rbdpool</name>
<host name='localceph'/>
<auth username='libvirt' type='ceph'>
<secret uuid='47f5d4b2-75da-4ecd-be13-a8cfac626b95'/>
</auth>
</source>
</pool>
virsh pool-define pool.xml
virsh pool-autostart rpool01
virsh pool-start rpool01
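To confirm the pool is defined, running, and reachable on every host:
virsh pool-list --all
virsh pool-info rpool01
virsh vol-list rpool01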
troubleshooting
monitor
If mon service won’t start, list daemons:
ceph orch ps --daemon_type=mon
Remove:
ceph orch daemon rm mon.testkvm2 --force
Redeploy:
ceph orch apply mon testkvm2.cifarelli.loc
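To confirm the monitor has rejoined the quorum:
ceph mon stat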
stray daemons
If nothing works:
ceph mgr fail
osd pg
Placement group (PG) states per OSD:
ceph pg dump
If stuck, e.g.:
pg 1.0 is stuck stale...
and
ceph pg 1.0 query
returns
Error ENOENT
Try:
ceph osd force-create-pg 2.19
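More generally, to list only the problematic PGs instead of dumping everything (standard command; `stale` can be replaced by `inactive`, `unclean`, etc.):
ceph pg dump_stuck stale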
osd balance
Use:
ceph osd df
To rebalance, e.g.
ceph osd crush reweight osd.0 1
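Alternatively, Ceph can compute the reweighting automatically; doing a dry run first is prudent (standard commands):
ceph osd test-reweight-by-utilization
ceph osd reweight-by-utilization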
osd
If OSD is failed or disk is dead:
ceph orch daemon rm osd.2 --force
OSD Removal (Soft Procedure)
ceph osd out {osd-num}
The corresponding disk begins to be drained and its placement groups are relocated. Monitor it until it no longer contains any PGs using
ceph osd df
Example output:
[root@KVM1 ~]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 ssd 0.70000 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
4 ssd 3.49309 1.00000 3.5 TiB 1.8 TiB 1.7 TiB 23 MiB 4.1 GiB 1.7 TiB 50.11 1.18 82 up
1 ssd 0.70000 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 2 up
5 ssd 3.49309 1.00000 3.5 TiB 1.5 TiB 1.5 TiB 17 MiB 3.2 GiB 2.0 TiB 42.75 1.01 86 up
3 ssd 3.49309 1.00000 3.5 TiB 1.4 TiB 1.4 TiB 20 MiB 4.9 GiB 2.1 TiB 39.17 0.92 80 up
2 ssd 3.49309 1.00000 3.5 TiB 1.3 TiB 1.3 TiB 19 MiB 4.8 GiB 2.2 TiB 37.44 0.88 89 up
TOTAL 14 TiB 5.9 TiB 5.9 TiB 78 MiB 17 GiB 8.1 TiB 42.37
osd.0 is now empty. Verify with
ceph osd safe-to-destroy osd.0
which returns
OSD(s) 0 are safe to destroy without reducing data durability.
You can now remove the service with
ceph orch daemon rm osd.0 --force
Remove the crush map entry with
ceph osd crush remove osd.0
Then it is recommended to go to the GUI in the OSD section and perform delete/purge
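The same delete/purge can also be done from the CLI (double-check the OSD id before running it):
ceph osd purge osd.0 --yes-i-really-mean-it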
recovery
To speed up recovery:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
Check:
ceph config show osd.0 | grep max
Reset:
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 0'
Alternative:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 5
Then:
ceph config set osd osd_mclock_override_recovery_settings false
wipe disk to reuse as OSD
vgdisplay
Take note of the VG name and run `vgremove` on it. Then:
wipefs -a /dev/vdd
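Alternatively, the orchestrator can zap the device in one step (this destroys all data on the device):
ceph orch device zap testkvm1.cifarelli.loc /dev/vdd --force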
image management
Ceph QEMU integration: https://docs.ceph.com/en/reef/rbd/qemu-rbd/
You can import and export images to and from the pool. Some operations (copy, delete) are also available via the web UI.
Example to copy image to pool:
rbd import --dest-pool rbdpool Rocky-9-GenericCloud-Base.latest.x86_64.raw rocky01.raw
If the image is in qcow2 format:
qemu-img convert -f qcow2 -O raw /var/lib/libvirt/images/rocky9.qcow2 rbd:rbdpool01/ROCKY9
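To check what ended up in the pool after importing:
rbd ls rbdpool
rbd info rbdpool/rocky01.raw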
