Configuring a KVM cluster with the Ceph filesystem
Created to test a full hyperconverged cluster with open source technologies.
We deployed on 3 hardware nodes: HP Gen10 servers with 10GbE interfaces, 256GB RAM, a 480GB SSD system disk, and multiple 4TB SSDs shared as OSDs for the Ceph filesystem.
References
Installation:
https://kifarunix.com/how-to-deploy-ceph-storage-cluster-on-rocky-linux/#create-ceph-os-ds-from-attached-storage-devices
Block devices:
https://kifarunix.com/configure-and-use-ceph-block-device-on-linux-clients/
KVM pools:
https://blog.modest-destiny.com/posts/kvm-libvirt-add-ceph-rbd-pool/
Installation
Start with at least 3 nodes with SSH key access between them.
As in the cluster, set up reciprocal name-resolution entries in `/etc/hosts`:
10.0.0.50 testkvm1.cifarelli.loc testkvm1
10.0.0.51 testkvm2.cifarelli.loc testkvm2
10.0.0.52 testkvm3.cifarelli.loc testkvm3
10.0.0.50 localceph localceph.cifarelli.loc
The last line differs on each host and is used by KVM to always initialize the pool on the local address (connecting to the monitor, which must run on ALL nodes), even when VMs are migrated.
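For example, on testkvm2 (10.0.0.51) the last line would presumably become:
10.0.0.51 localceph localceph.cifarelli.loc
while the first three entries stay identical on every host.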
Install docker:
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
dnf install docker-ce
systemctl enable --now docker containerd
docker --version
Install CEPH repo and ceph-admin:
CEPH_RELEASE=19.2.0
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install
which cephadm
Initialize ceph-admin node:
cephadm bootstrap --mon-ip 10.0.0.50 --allow-fqdn-hostname
**Warning!** Among the messages, you should see the web interface credentials (note them down):
Ceph Dashboard is now available at:
URL: https://testkvm1.cifarelli.loc:8443/
User: admin
Password: tdocjdyd3d
This should:
- Create a monitor and manager daemon for the new cluster on the localhost.
- Generate a new SSH key for the Ceph cluster and add it to the root user’s /root/.ssh/authorized_keys file.
- Write a copy of the public key to /etc/ceph/ceph.pub.
- Write a minimal configuration file to /etc/ceph/ceph.conf.
- Write a copy of the client.admin administrative key to /etc/ceph/ceph.client.admin.keyring.
- Add the _admin label to the bootstrap host.
To see the containers created:
podman ps
To see the systemd units created and started:
systemctl list-units 'ceph*'
To install the CLI:
cephadm install ceph-common
Now the CLI is available. For example:
ceph -s
Copy keys to other hosts:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm3
Add the other hosts to the cluster so the orchestrator can deploy daemons (including monitors) on them:
ceph orch host add testkvm2.cifarelli.loc
ceph orch host add testkvm3.cifarelli.loc
To ensure ALL nodes have an active monitor (required by KVM):
ceph orch apply mon --placement="testkvm1 testkvm2 testkvm3"
(verify this).
Or:
ceph orch apply mon 3
This means: always keep 3 active monitor services, i.e., one per node.
Similarly:
ceph orch apply mgr 3
Ensures there are 3 manager nodes (1 active + 2 standby)
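To verify that the requested mon/mgr placement was actually applied, the orchestrator listing can be checked (standard commands, output layout may vary by release):
ceph orch ls
ceph orch ps --daemon_type=mgr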
Add labels corresponding to the nodes:
ceph orch host label add testkvm1.cifarelli.loc mon-mgr-01
ceph orch host label add testkvm2.cifarelli.loc mon-mgr-02
ceph orch host label add testkvm3.cifarelli.loc mon-mgr-03
ceph orch host ls
On other nodes, it's better to install at least the CLI:
CEPH_RELEASE=19.2.0
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
chmod +x cephadm
./cephadm add-repo --release reef
./cephadm install ceph-common
Copy keyring and config from testkvm1:
scp /etc/ceph/* testkvm2:/etc/ceph
Now from testkvm2 try:
ceph -s
Add OSD: check what is available:
ceph orch device ls --wide --refresh
(`--wide --refresh` forces a rescan of the devices, useful especially if disks were just added).
If disks are listed as not available, they will show up as available once they have been freed (see "wipe disk to reuse as OSD" below). Then:
ceph orch apply osd --all-available-devices --method raw
To avoid automatic disk addition (disable automatic OSD creation on available devices):
ceph orch apply osd --all-available-devices --unmanaged=true
Manual addition:
ceph orch daemon add osd <host>:<device-path>
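For example, assuming `ceph orch device ls` reported a free /dev/sdb on testkvm2 (hypothetical device path, adjust to the actual hardware):
ceph orch daemon add osd testkvm2.cifarelli.loc:/dev/sdb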
Now check status:
ceph -s
...
pools
To be able to delete pools:
ceph config set mon mon_allow_pool_delete true
To delete a pool:
ceph osd pool delete .mgr .mgr --yes-i-really-really-mean-it
To create a pool (with 128 placement groups):
ceph osd pool create rbdpool 128 128
ceph osd pool application enable rbdpool rbd
The same can be done via the web interface
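To double-check the pool and its application tag from the CLI:
ceph osd pool ls detail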
KVM pool
Each host should also run a monitor. By default, the number of monitors may be lower than the number of hosts. To ensure every host runs a monitor (as done above, substituting this cluster's hostnames):
ceph orch apply mon --placement="KVM1 KVM2 KVM3 KVM4"
After the RBD pool is created, it can be added to KVM nodes.
First, create an auth key (required):
ceph auth get-or-create "client.libvirt" mon "profile rbd" osd "profile rbd pool=rbdpool"
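To review the generated key and its capabilities later:
ceph auth get client.libvirt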
Then register the auth secret on the KVM side. Create an `auth.xml` file such as:
<secret ephemeral='no' private='no'>
<uuid>47f5d4b2-75da-4ecd-be13-a8cfac626b95</uuid>
<usage type='ceph'>
<name>client.libvirt secret</name>
</usage>
</secret>
Then ON EACH HOST:
virsh secret-define --file auth.xml
virsh secret-set-value --secret "47f5d4b2-75da-4ecd-be13-a8cfac626b95" --base64 "$(ceph auth get-key client.libvirt)"
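To verify on each host that the secret is defined and actually holds the key:
virsh secret-list
virsh secret-get-value 47f5d4b2-75da-4ecd-be13-a8cfac626b95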
Then create the pool from this `pool.xml` template:
<pool type="rbd">
<name>rpool01</name>
<source>
<name>rbdpool</name>
<host name='localceph'/>
<auth username='libvirt' type='ceph'>
<secret uuid='47f5d4b2-75da-4ecd-be13-a8cfac626b95'/>
</auth>
</source>
</pool>
virsh pool-define pool.xml
virsh pool-autostart rpool01
virsh pool-start rpool01
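To confirm the pool is active and to list the RBD images libvirt sees in it (rpool01 being the pool name defined above):
virsh pool-list --all
virsh vol-list rpool01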
troubleshooting
monitor
If the mon service won’t start, list the monitor daemons:
ceph orch ps --daemon_type=mon
Remove:
ceph orch daemon rm mon.testkvm2 --force
Redeploy:
ceph orch apply mon testkvm2.cifarelli.loc
stray daemons
If stray-daemon warnings persist and nothing else works, fail the active manager so a standby takes over:
ceph mgr fail
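To see which daemons are reported as stray before (or after) failing the manager:
ceph health detail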
osd pg
Placement group (PG) states per OSD:
ceph pg dump
If stuck, e.g.:
pg 1.0 is stuck stale...
and
ceph pg 1.0 query
returns
Error ENOENT
Try (using the ID of the affected PG):
ceph osd force-create-pg 2.19
osd balance
Use:
ceph osd df
To rebalance, e.g.
ceph osd crush reweight osd.0 1
osd
If OSD is failed or disk is dead:
ceph orch daemon rm osd.2 --force
OSD removal (soft)
ceph osd out {osd-num}
Monitor with:
ceph osd df
Then:
ceph osd safe-to-destroy osd.0
ceph orch daemon rm osd.0 --force
ceph osd crush remove osd.0
Use the web GUI for the final delete/purge.
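To confirm the OSD is really gone from the CRUSH map after removal:
ceph osd tree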
recovery
To speed up recovery:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
Check:
ceph config show osd.0 | grep max
Reset:
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 0'
Alternative:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 5
Then:
ceph config set osd osd_mclock_override_recovery_settings false
wipe disk to reuse as OSD
vgdisplay
Note the VG name and run `vgremove` on it. Then:
wipefs -a /dev/vdd
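Once wiped, the device (here /dev/vdd, as in the example above) should show up again as available:
ceph orch device ls --refresh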
image management
Ceph QEMU integration: https://docs.ceph.com/en/reef/rbd/qemu-rbd/
You can manage images to and from the pool. Some operations are available via web UI (copy, delete).
Example to copy image to pool:
rbd import --dest-pool rbdpool Rocky-9-GenericCloud-Base.latest.x86_64.raw rocky01.raw
If image is in qcow2:
qemu-img convert -f qcow2 -O raw /var/lib/libvirt/images/rocky9.qcow2 rbd:rbdpool/ROCKY9
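To verify that the imported/converted images actually landed in the pool (rocky01.raw and ROCKY9 being the names used above):
rbd ls rbdpool
rbd info rbdpool/ROCKY9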
