Configuring KVM cluster with CEPH filesystem

General overview

This is how we built a fully hyperconverged cluster with open source technologies.
We deployed on 3 (used) hardware nodes: HP Gen10 servers with 10 GbE interfaces, 2 Gold CPUs each (24 cores), 256 GB RAM, a 480 GB SSD for the system, and multiple 4 TB SSDs to share as OSDs for the CEPH filesystem.
OS: Rocky Linux 9 with the KVM hypervisor and tools preinstalled.

References

Installation: https://kifarunix.com/how-to-deploy-ceph-storage-cluster-on-rocky-linux/#create-ceph-os-ds-from-attached-storage-devices
Block devices: https://kifarunix.com/configure-and-use-ceph-block-device-on-linux-clients/
KVM pools: https://blog.modest-destiny.com/posts/kvm-libvirt-add-ceph-rbd-pool/

Installation

Start from at least 3 nodes with SSH key access.
On every node, set up reciprocal DNS entries in `/etc/hosts`, as you would for any cluster:

10.0.0.50   testkvm1.cifarelli.loc testkvm1  
10.0.0.51   testkvm2.cifarelli.loc testkvm2  
10.0.0.52   testkvm3.cifarelli.loc testkvm3  
10.0.0.50   localceph localceph.cifarelli.loc  

The last line differs on each host and is used by KVM to always initialize the POOL on the local address (connecting to the monitor, which must therefore run on ALL nodes), even when VMs are migrated.
This is a workaround: by default the monitor doesn't listen on localhost, and the VM must be configured so that it doesn't depend on other hosts. This way the VM keeps running even if another node fails.
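
For example, on testkvm2 the last line would point localceph to that host's own address (a sketch following the same pattern as above):

10.0.0.51   localceph localceph.cifarelli.loc  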


Install docker:

dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo  
dnf install docker-ce  
systemctl enable --now docker containerd  
docker --version

Install the CEPH repo and cephadm:

CEPH_RELEASE=19.2.0  
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm  
chmod +x cephadm  
./cephadm add-repo --release reef  
./cephadm install  
which cephadm

Initialize the Ceph admin node:

cephadm bootstrap --mon-ip 10.0.0.50  --allow-fqdn-hostname
Warning! Among the messages, you should see the web interface credentials (note them down):
Ceph Dashboard is now available at:  

            URL: https://testkvm1.cifarelli.loc:8443/  
           User: admin  
       Password: tdocjdyd3d  

This should:

  • Create a monitor and manager daemon for the new cluster on the localhost.
  • Generate a new SSH key for the Ceph cluster and add it to the root user’s /root/.ssh/authorized_keys file.
  • Write a copy of the public key to /etc/ceph/ceph.pub.
  • Write a minimal configuration file to /etc/ceph/ceph.conf.
  • Write a copy of the client.admin administrative key to /etc/ceph/ceph.client.admin.keyring.
  • Add the _admin label to the bootstrap host.

To see the containers created:

podman ps  

To see the systemd units created and started:

systemctl list-units 'ceph*'  

To install the CLI:

cephadm install ceph-common  

Now the CLI is available. For example:

ceph -s  

Copy keys to other hosts:

ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm2  
ssh-copy-id -f -i /etc/ceph/ceph.pub root@testkvm3  

Add the other hosts to the cluster so that daemons can be deployed on them:

ceph orch host add testkvm2.cifarelli.loc  
ceph orch host add testkvm3.cifarelli.loc  

To ensure ALL nodes have an active monitor (required by KVM):

ceph orch apply mon --placement="testkvm1 testkvm2 testkvm3"  

(verify this).
Or:

ceph orch apply mon 3  

which means: always keep 3 active monitor services, i.e., one per node.
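
To verify where the monitors are actually running (the same command is used in the troubleshooting section below):

ceph orch ps --daemon_type=mon  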

Similarly:

ceph orch apply mgr 3  

Ensures there are 3 manager nodes (1 active + 2 standby)

Add labels corresponding to the nodes:

ceph orch host label add testkvm1.cifarelli.loc mon-mgr-01  
ceph orch host label add testkvm2.cifarelli.loc mon-mgr-02  
ceph orch host label add testkvm3.cifarelli.loc mon-mgr-03  
ceph orch host ls

On other nodes, it's better to install at least the CLI:

CEPH_RELEASE=19.2.0  
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm  
chmod +x cephadm  
./cephadm add-repo --release reef  
./cephadm install ceph-common

Copy keyring and config from testkvm1:

scp /etc/ceph/* testkvm2:/etc/ceph

Now from testkvm2 try:

ceph -s

Add OSD: check what is available:

ceph orch device ls --wide --refresh

(`--wide --refresh` is useful to rescan devices, especially if disks were just added). Example response:

HOST                    PATH      TYPE  TRANSPORT  RPM  DEVICE ID              SIZE  HEALTH  IDENT  FAULT  AVAILABLE  REFRESHED  REJECT REASONS  
testkvm1.cifarelli.loc  /dev/sr0  hdd                   QEMU_DVD-ROM_QM00001  2048           N/A    N/A    No         80s ago    Insufficient space (<5GB)  
testkvm1.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    No         80s ago    Has BlueStore device label, Has a FileSystem  
testkvm2.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    No         75s ago    Has BlueStore device label, Has a FileSystem  
testkvm3.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    No         76s ago    Has BlueStore device label, Has a FileSystem  

In the above case, the storage was not available. Once released:

HOST                    PATH      TYPE  TRANSPORT  RPM  DEVICE ID              SIZE  HEALTH  IDENT  FAULT  AVAILABLE  REFRESHED  REJECT REASONS  
testkvm1.cifarelli.loc  /dev/sr0  hdd                   QEMU_DVD-ROM_QM00001  2048           N/A    N/A    No         11s ago    Insufficient space (<5GB)  
testkvm1.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    Yes        11s ago  
testkvm2.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    Yes        7s ago  
testkvm3.cifarelli.loc  /dev/vdb  hdd                                         50.0G          N/A    N/A    Yes        7s ago  

We can see that there is one 50 GB volume attached to each node (3 in total) and available. Add them all automatically:

ceph orch apply osd --all-available-devices --method raw  

To avoid automatic disk addition:

ceph orch apply osd --all-available-devices --unmanaged=true  

Manual addition:

ceph orch daemon add osd <host>:<device-path>
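
For example, adding one of the devices from the listing above on testkvm2:

ceph orch daemon add osd testkvm2.cifarelli.loc:/dev/vdb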

Now check status:

ceph -s


 cluster:  
   id:     71446092-6e7f-11ef-b946-525400747546  
   health: HEALTH_OK  
 services:  
   mon: 3 daemons, quorum testkvm1,testkvm2,testkvm3 (age 20m)  
   mgr: testkvm2.ayridz(active, since 19m), standbys: testkvm1.wgrpdw  
   osd: 4 osds: 3 up (since 3m), 4 in (since 68s)  
 data:  
   pools:   1 pools, 1 pgs  
   objects: 2 objects, 449 KiB  
   usage:   81 MiB used, 150 GiB / 150 GiB avail  
   pgs:     1 active+clean  

The cluster is fine: it has 3 monitors and 4 OSDs

Copy the configs to the other nodes:

scp /etc/ceph/ceph* testkvm2.cifarelli.loc:/etc/ceph  
scp /etc/ceph/ceph* testkvm3.cifarelli.loc:/etc/ceph  

Install the shell on the other nodes:

CEPH_RELEASE=18.2.4  
curl -sLO https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm  
chmod +x cephadm  
./cephadm add-repo --release reef  
./cephadm install  
cephadm install ceph-common  

Now the shell and administration commands are available on every node
Deploy the manager daemon on all nodes:

ceph orch daemon add mgr testkvm2.cifarelli.loc  
ceph orch daemon add mgr testkvm3.cifarelli.loc  

Now all nodes are managers

Pools

To be able to delete pools:

ceph config set mon mon_allow_pool_delete true

To delete a pool:

ceph osd pool delete .mgr .mgr --yes-i-really-really-mean-it

To create a pool (with 128 placement groups):

ceph osd pool create rbdpool 128 128  
ceph osd pool application enable rbdpool rbd

The same can be done via the web interface
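
To check the pool, its PG count, and its application tag from the CLI:

ceph osd pool ls detail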

KVM pool

Each host should also run a monitor.
By default, the number of monitors may be lower than the number of hosts. To ensure every host runs a monitor:

ceph orch apply mon --placement="KVM1 KVM2 KVM3 KVM4"

After the RBD pool is created, it can be added to KVM nodes.
First, create an auth key (required):

ceph auth get-or-create "client.libvirt" mon "profile rbd" osd "profile rbd pool=rbdpool"

Then create the corresponding secret on the KVM side. Write an `auth.xml` file like this:

<secret ephemeral='no' private='no'>  
  <uuid>47f5d4b2-75da-4ecd-be13-a8cfac626b95</uuid>  
  <usage type='ceph'>  
    <name>client.libvirt secret</name>  
  </usage>  
</secret>

Then ON EACH HOST:

virsh secret-define --file auth.xml  
virsh secret-set-value --secret "47f5d4b2-75da-4ecd-be13-a8cfac626b95" --base64 "$(ceph auth get-key client.libvirt)"
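
To check that the secret is defined and holds a value on each host (a quick sanity check):

virsh secret-list  
virsh secret-get-value 47f5d4b2-75da-4ecd-be13-a8cfac626b95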

Then create the pool from this `pool.xml` template:

<pool type="rbd">  
  <name>rpool01</name>  
  <source>  
    <name>rbdpool</name>  
    <host name='localceph'/>  
    <auth username='libvirt' type='ceph'>  
      <secret uuid='47f5d4b2-75da-4ecd-be13-a8cfac626b95'/>  
    </auth>  
  </source>  
</pool>

virsh pool-define pool.xml  
virsh pool-autostart rpool01  
virsh pool-start rpool01
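
A quick check that the pool is defined, running, and lists the RBD images:

virsh pool-info rpool01  
virsh vol-list rpool01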

troubleshooting

monitor

If mon service won’t start, list daemons:

ceph orch ps --daemon_type=mon  

Remove:

ceph orch daemon rm mon.testkvm2 --force  

Redeploy:

ceph orch apply mon testkvm2.cifarelli.loc  

stray daemons

If stray daemon warnings do not clear, force a manager failover:

ceph mgr fail

osd pg

PG (placement group) states per OSD:

ceph pg dump  

If stuck, e.g.:

pg 1.0 is stuck stale...  

and

ceph pg 1.0 query  

returns

Error ENOENT  

Try:

ceph osd force-create-pg 1.0

osd balance

Use:

ceph osd df  

To rebalance, e.g.

ceph osd crush reweight osd.0 1

osd

If an OSD has failed or its disk is dead:

ceph orch daemon rm osd.2 --force  

OSD Removal (Soft Procedure)

ceph osd out {osd-num}

The corresponding disk begins to be drained and its placement groups (PGs) are relocated. Monitor until it no longer contains any PGs using

ceph osd df

Example output:

[root@KVM1 ~]# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP    META     AVAIL    %USE   VAR   PGS  STATUS
 0    ssd  0.70000         0      0 B      0 B      0 B     0 B      0 B      0 B      0     0    0      up
 4    ssd  3.49309   1.00000  3.5 TiB  1.8 TiB  1.7 TiB  23 MiB  4.1 GiB  1.7 TiB  50.11  1.18   82      up
 1    ssd  0.70000         0      0 B      0 B      0 B     0 B      0 B      0 B      0     0    2      up
 5    ssd  3.49309   1.00000  3.5 TiB  1.5 TiB  1.5 TiB  17 MiB  3.2 GiB  2.0 TiB  42.75  1.01   86      up
 3    ssd  3.49309   1.00000  3.5 TiB  1.4 TiB  1.4 TiB  20 MiB  4.9 GiB  2.1 TiB  39.17  0.92   80      up
 2    ssd  3.49309   1.00000  3.5 TiB  1.3 TiB  1.3 TiB  19 MiB  4.8 GiB  2.2 TiB  37.44  0.88   89      up
                       TOTAL   14 TiB  5.9 TiB  5.9 TiB  78 MiB   17 GiB  8.1 TiB  42.37

osd.0 is now empty. Verify with

ceph osd safe-to-destroy osd.0

which returns

OSD(s) 0 are safe to destroy without reducing data durability.

You can now remove the service with

ceph orch daemon rm osd.0 --force

Remove the crush map entry with

ceph osd crush remove osd.0

Then it is recommended to go to the GUI in the OSD section and perform delete/purge

recovery

To speed up recovery:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'  
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'  

Check:

ceph config show osd.0 | grep max  

Reset:

ceph tell 'osd.*' injectargs '--osd-max-backfills 1'  
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 0'

Alternative:

ceph config set osd osd_mclock_override_recovery_settings true  
ceph config set osd osd_max_backfills 5  

Then, once recovery has finished, switch the override back off:

ceph config set osd osd_mclock_override_recovery_settings false

wipe disk to reuse as OSD

vgdisplay  

Take note of the VG name and run `vgremove` on it. Then:

wipefs -a /dev/vdd
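
A minimal sketch of the whole sequence, assuming the leftover volume group created by Ceph is named ceph-1a2b3c4d (hypothetical name; take the real one from vgdisplay) and the disk to wipe is /dev/vdd. Note that vgremove asks for confirmation before removing the logical volume it contains:

vgdisplay  
vgremove ceph-1a2b3c4d  
wipefs -a /dev/vdd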

image management

Ceph QEMU integration: https://docs.ceph.com/en/reef/rbd/qemu-rbd/
You can copy images to and from the pool. Some operations (copy, delete) are also available via the web UI.
Example to copy image to pool:

rbd import --dest-pool rbdpool Rocky-9-GenericCloud-Base.latest.x86_64.raw rocky01.raw  

If the image is in qcow2 format, convert and upload it in one step:

qemu-img convert -f qcow2 -O raw /var/lib/libvirt/images/rocky9.qcow2 rbd:rbdpool/ROCKY9
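
To verify that the images are now in the pool:

rbd ls rbdpool  
rbd info rbdpool/rocky01.raw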