Configuring KVM cluster with CEPH filesystem

General overview

This is how we built a fully hyperconverged cluster with open-source technologies.
We deployed on 3 (used) hardware nodes: HP Gen10 servers with 10 GbE interfaces, 2 Gold CPUs each (24 cores), 256 GB RAM, a 480 GB SSD disk for the system, and multiple 4 TB SSDs to share as OSDs for the CEPH filesystem.
OS: Rocky Linux 9 with the KVM hypervisor and tools preinstalled.

References

Installation: https://kifarunix.com/how-to-deploy-ceph-storage-cluster-on-rocky-linux/#create-ceph-os-ds-from-attached-storage-devices
Block devices: https://kifarunix.com/configure-and-use-ceph-block-device-on-linux-clients/
KVM pools: https://blog.modest-destiny.com/posts/kvm-libvirt-add-ceph-rbd-pool/

Installation

Start from at least 3 nodes with SSH key access.
On each node of the cluster, set up reciprocal DNS entries in `/etc/hosts`:

10.0.0.50   testkvm1.cifarelli.loc testkvm1  
10.0.0.51   testkvm2.cifarelli.loc testkvm2  
10.0.0.52   testkvm3.cifarelli.loc testkvm3  
10.0.0.50   localceph localceph.cifarelli.loc  

The last line differs on each host: it maps `localceph` to the host's own address, so that KVM always initializes the POOL through the local address (connecting to the monitor, which must therefore run on ALL nodes), even when VMs are migrated.
This is a workaround: by default the monitor does not listen on localhost, and the VM must be configured so that it does not depend on other hosts. This way a VM keeps running even if another node fails.
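
For example, following the addresses above, the last line on testkvm2 and testkvm3 would be, respectively:

10.0.0.51   localceph localceph.cifarelli.loc  
10.0.0.52   localceph localceph.cifarelli.loc  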


Install docker:

dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo  
dnf install docker-ce  
systemctl enable --now docker containerd  
docker --version

Install the CEPH repo and cephadm:

CEPH_RELEASE=19.2.0  
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm  
chmod +x cephadm  
./cephadm add-repo --release reef  
./cephadm install  
which cephadm

Initialize ceph-admin node:

cephadm bootstrap --mon-ip 10.0.0.50  --allow-fqdn-hostname
**Warning!** Among the messages, you should see the WEB interface credentials (note them down):
Ceph Dashboard is now available at:  

            URL: https://testkvm1.cifarelli.loc:8443/  
           User: admin  
       Password: tdocjdyd3d  

This should:

  • Create a monitor and manager daemon for the new cluster on the localhost.
  • Generate a new SSH key for the Ceph cluster and add it to the root user’s /root/.ssh/authorized_keys file.
  • Write a copy of the public key to /etc/ceph/ceph.pub.
  • Write a minimal configuration file to /etc/ceph/ceph.conf.
  • Write a copy of the client.admin administrative key to /etc/ceph/ceph.client.admin.keyring.
  • Add the _admin label to the bootstrap host.

To see the containers created:

podman ps  

To see the systemd units created and started:

systemctl list-units 'ceph*'  

To install the CLI:

cephadm install ceph-common  

Now the CLI is available. For example:

ceph -s  

Copy keys to other hosts:

ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm2  
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm3  

Add the other hosts to the cluster, so the orchestrator can deploy daemons (including monitors) on them:

ceph orch host add testkvm2.cifarelli.loc  
ceph orch host add testkvm3.cifarelli.loc  

To ensure ALL nodes have an active monitor (required by KVM):

ceph orch apply mon --placement="testkvm1 testkvm2 testkvm3"  

(verify this).
Or:

ceph orch apply mon 3  

This means: always keep 3 active monitor services, i.e., one per node.

Similarly:

ceph orch apply mgr 3  

Ensures there are 3 manager nodes (1 active + 2 standby)
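
To verify the resulting placement (an optional check with standard orchestrator commands):

ceph orch ls mon  
ceph orch ls mgr  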

Add labels corresponding to the nodes:

ceph orch host label add testkvm1.cifarelli.loc mon-mgr-01  
ceph orch host label add testkvm2.cifarelli.loc mon-mgr-02  
ceph orch host label add testkvm3.cifarelli.loc mon-mgr-03  
ceph orch host ls

On other nodes, it's better to install at least the CLI:

CEPH_RELEASE=19.2.0  
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm  
chmod +x cephadm  
./cephadm add-repo --release reef  
./cephadm install ceph-common

Copy keyring and config from testkvm1:

scp /etc/ceph/* testkvm2:/etc/ceph

Now, from testkvm2, try:

ceph -s

Add OSDs: check which devices are available:

ceph orch device ls --wide --refresh

(`--wide --refresh` rescans the devices, which is useful especially if disks were just added).

If disks show as not available, they appear as available once they are freed (see "wipe disk to reuse as OSD" below). Then:

ceph orch apply osd --all-available-devices --method raw

To avoid automatic addition of new disks, set the service to unmanaged:

ceph orch apply osd --all-available-devices --unmanaged=true  

Manual addition:

ceph orch daemon add osd <host>:<device-path>
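
For example (hypothetical host and device path, adapt to your layout):

ceph orch daemon add osd testkvm2.cifarelli.loc:/dev/sdb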

Now check status:

ceph -s

...

pools

To be able to delete pools:

ceph config set mon mon_allow_pool_delete true

To delete a pool:

ceph osd pool delete .mgr .mgr --yes-i-really-really-mean-it

To create a pool (with 128 placement groups):

ceph osd pool create rbdpool 128 128  
ceph osd pool application enable rbdpool rbd

The same can be done via the web interface
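
To check that the pool exists and that the rbd application is enabled (optional verification):

ceph osd pool ls detail  
rbd ls rbdpool  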

KVM pool

Each host should also be a monitor.
By default, the number of monitors may be lower than the number of hosts. To ensure every host runs a monitor:

ceph orch apply mon --placement="KVM1 KVM2 KVM3 KVM4"

After the RBD pool is created, it can be added to KVM nodes.
First, create an auth key (required):

ceph auth get-or-create "client.libvirt" mon "profile rbd" osd "profile rbd pool=rbdpool"

Then define the secret on the KVM side. Create an `auth.xml` file such as:

<secret ephemeral='no' private='no'>  
  <uuid>47f5d4b2-75da-4ecd-be13-a8cfac626b95</uuid>  
  <usage type='ceph'>  
    <name>client.libvirt secret</name>  
  </usage>  
</secret>

Then ON EACH HOST:

virsh secret-define --file auth.xml  
virsh secret-set-value --secret "47f5d4b2-75da-4ecd-be13-a8cfac626b95" --base64 "$(ceph auth get-key client.libvirt)"
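
To verify that the secret was stored on each host (optional check):

virsh secret-list  
virsh secret-get-value 47f5d4b2-75da-4ecd-be13-a8cfac626b95  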

Then create the pool from this `pool.xml` template:

<pool type="rbd">  
  <name>rpool01</name>  
  <source>  
    <name>rbdpool</name>  
    <host name='localceph'/>  
    <auth username='libvirt' type='ceph'>  
      <secret uuid='47f5d4b2-75da-4ecd-be13-a8cfac626b95'/>  
    </auth>  
  </source>  
</pool>

virsh pool-define pool.xml  
virsh pool-autostart rpool01  
virsh pool-start rpool01
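
If everything works, the pool is active and its RBD images show up as volumes (optional check):

virsh pool-info rpool01  
virsh vol-list rpool01  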

troubleshooting

monitor

If mon service won’t start, list daemons:

ceph orch ps --daemon_type=mon  

Remove:

ceph orch daemon rm mon.testkvm2 --force  

Redeploy:

ceph orch apply mon testkvm2.cifarelli.loc  

stray daemons

If stray-daemon warnings persist and nothing else works, fail over the active manager:

ceph mgr fail
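
To see which daemons are reported as stray before and after the failover (assuming the usual CEPHADM_STRAY_DAEMON health warning):

ceph health detail  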

osd pg

PG (placement group) states per disk:

ceph pg dump  

If stuck, e.g.:

pg 1.0 is stuck stale...  

and

ceph pg 1.0 query  

returns

Error ENOENT  

Try:

ceph osd force-create-pg 1.0

osd balance

Use:

ceph osd df  

To rebalance, for example:

ceph osd crush reweight osd.0 1
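
The current CRUSH weights can be inspected with:

ceph osd tree  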

osd

If OSD is failed or disk is dead:

ceph orch daemon rm osd.2 --force  

OSD removal (soft)

ceph osd out {osd-num}  

Monitor with:

ceph osd df  

Then:

ceph osd safe-to-destroy osd.0  
ceph orch daemon rm osd.0 --force  
ceph osd crush remove osd.0  

Use the GUI for the final delete/purge.
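
A CLI alternative to the GUI purge (only when the OSD is safe to destroy; this removes it permanently):

ceph osd purge osd.0 --yes-i-really-mean-it  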

recovery

To speed up recovery:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'  
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'  

Check:

ceph config show osd.0 | grep max  

Reset:

ceph tell 'osd.*' injectargs '--osd-max-backfills 1'  
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 0'

Alternative:

ceph config set osd osd_mclock_override_recovery_settings true  
ceph config set osd osd_max_backfills 5  

When recovery is done, restore the default:

ceph config set osd osd_mclock_override_recovery_settings false
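
To confirm the current values (optional check):

ceph config get osd osd_max_backfills  
ceph config get osd osd_mclock_override_recovery_settings  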

wipe disk to reuse as OSD

vgdisplay  

Note the VG name and run `vgremove` on it. Then:

wipefs -a /dev/vdd  
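
As an alternative, the orchestrator can zap the device for you (host name and device path are examples; this destroys all data on the disk):

ceph orch device zap testkvm2.cifarelli.loc /dev/vdd --force  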

image management

Ceph QEMU integration: https://docs.ceph.com/en/reef/rbd/qemu-rbd/
You can move images to and from the pool. Some operations are available via the web UI (copy, delete).
Example of importing an image into the pool:

rbd import --dest-pool rbdpool Rocky-9-GenericCloud-Base.latest.x86_64.raw rocky01.raw  

If the image is in qcow2 format, convert it directly into the RBD pool:

qemu-img convert -f qcow2 -O raw /var/lib/libvirt/images/rocky9.qcow2 rbd:rbdpool/ROCKY9
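
A VM can then attach the image directly over RBD. This is a minimal sketch of a libvirt disk element, assuming the `localceph` alias and the `client.libvirt` secret defined above (image name as in the previous example):

<disk type='network' device='disk'>  
  <driver name='qemu' type='raw'/>  
  <!-- pool/image inside the Ceph cluster -->  
  <source protocol='rbd' name='rbdpool/ROCKY9'>  
    <host name='localceph' port='6789'/>  
  </source>  
  <!-- same secret UUID defined with virsh secret-define -->  
  <auth username='libvirt'>  
    <secret type='ceph' uuid='47f5d4b2-75da-4ecd-be13-a8cfac626b95'/>  
  </auth>  
  <target dev='vda' bus='virtio'/>  
</disk>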