Scope/Description
This article details the process of troubleshooting a monitor service experiencing slow/blocked operations. When a Ceph cluster encounters a slow or blocked operation, it logs the event and sets the cluster health to HEALTH_WARN.
Generally speaking, an OSD with slow requests is any OSD that cannot service the I/O operations per second (IOPS) in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
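The threshold can be inspected, and temporarily raised, at runtime. A minimal sketch, assuming a live cluster; osd.0 is an example daemon ID, and the ceph config form assumes Mimic or later:

```shell
# Read the current complaint threshold from a running OSD's admin socket
# (run on the host where that OSD lives; osd.0 is a placeholder ID):
ceph daemon osd.0 config get osd_op_complaint_time

# Read or change it cluster-wide via the centralized config (Mimic and later):
ceph config get osd osd_op_complaint_time
ceph config set osd osd_op_complaint_time 60
```

Raising the threshold only quiets the reporting; it does not address the underlying cause.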
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
- Problems with the network, which are usually connected with flapping OSDs. See Section 5.1.4, “Flapping OSDs” for details.
- System load
Prerequisites
- Root access to the cluster nodes
- A running Ceph cluster reporting HEALTH_WARN with slow ops
Steps
- Start to troubleshoot in this order:
- Look in the monitor logs (systemctl status ceph-mon@<mon-id>)
- Look in the OSD logs (systemctl status ceph-osd@<osd-id>)
- Check Disk Health (SMART)
- Check Network Health (Network diagnostic tools)
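For steps 3 and 4, hypothetical command examples; the device name, interface name, and peer host are placeholders, and smartctl comes from the smartmontools package:

```shell
# Disk health: overall SMART self-assessment of an OSD's backing drive
# (/dev/sda is a placeholder device)
smartctl -H /dev/sda

# Network health: basic reachability and latency to a peer cluster host
# ("mac" is a placeholder host name)
ping -c 5 mac

# Look for drops/errors on the cluster-facing interface (eth0 is a placeholder)
ip -s link show eth0
```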
Example
- Cluster shows health warning:
[root@dennis ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_WARN
            30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops
...
This gives us a clue where to look first: the monitor service running on the host named “mac” is reporting slow ops.
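The daemon name can also be pulled out of the health output programmatically. A minimal sketch; the sample text below stands in for a live `ceph health detail` call:

```shell
#!/bin/sh
# Extract which daemon is reporting slow ops. On a real cluster, replace the
# sample text with: health=$(ceph health detail)
health='HEALTH_WARN 30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops'

# Grab the "<type>.<id>" token that precedes "has slow ops" (e.g. mon.mac)
daemon=$(printf '%s\n' "$health" | grep -o '[a-z]*\.[a-z0-9]* has slow ops' | awk '{print $1}' | sort -u)
echo "$daemon"
```

With the sample input above, this prints mon.mac, telling us which service to inspect next.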
- Check the status of the monitor service on host mac
[root@mac ~]# systemctl status ceph-mon@mac
ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 09:36:34 ADT; 2 months 23 days ago
 Main PID: 2249 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─2249 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
Oct 15 10:59:46 mac: 2018-10-15 10:59:46.371 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:51 mac: 2018-10-15 10:59:51.372 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:56 mac: 2018-10-15 10:59:56.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:01 mac: 2018-10-15 11:00:01.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:06 mac: 2018-10-15 11:00:06.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:11 mac: 2018-10-15 11:00:11.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:16 mac: 2018-10-15 11:00:16.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:21 mac: 2018-10-15 11:00:21.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:26 mac: 2018-10-15 11:00:26.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:31 mac: 2018-10-15 11:00:31.375 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
In this case, it looks like mon.mac got hung up while removing a snapshot. Verify that all snapshots are correct, then restart the ceph-mon service on mac to clear the stale warning.
systemctl restart ceph-mon@mac
- Verify the service restarted correctly
[root@mac ~]# systemctl status ceph-mon@mac -l
ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-10-15 11:00:55 ADT; 9s ago
 Main PID: 542434 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─542434 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
Oct 15 11:00:55 mac.45lab.com systemd[1]: Started Ceph cluster monitor daemon.
Oct 15 11:00:55 mac.45lab.com systemd[1]: Starting Ceph cluster monitor daemon...
- And verify that the cluster is back to HEALTH_OK
[root@mac ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum charlie,dennis,mac
    mgr: mac(active), standbys: charlie, dennis
    mds: cephfs-1/1/1 up {0=mac=up:active}, 2 up:standby
    osd: 17 osds: 17 up, 17 in
  data:
    pools:   3 pools, 656 pgs
    objects: 12.27 M objects, 8.9 TiB
    usage:   28 TiB used, 34 TiB / 62 TiB avail
    pgs:     656 active+clean
  io:
    client: 122 KiB/s rd, 2.0 MiB/s wr, 2 op/s rd, 121 op/s wr
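The final check lends itself to scripting as well. A minimal sketch that greps the status for HEALTH_OK; the sample text stands in for a live `ceph -s` call:

```shell
#!/bin/sh
# On a real cluster, replace the sample text with: status=$(ceph -s)
status='cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_OK'

# HEALTH_OK present means the slow-ops warning has cleared
if printf '%s\n' "$status" | grep -q 'HEALTH_OK'; then
    echo "cluster healthy"
else
    echo "cluster needs attention"
fi
```

Dropped into cron or a monitoring hook, this catches the warning recurring, which would suggest the snapshot cleanup above treated a symptom rather than the cause.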