If your Ceph cluster encounters a slow or blocked operation, it logs the event and sets the cluster health to HEALTH_WARN.
As far as Ceph is concerned, "slow" and "blocked" ops are the same thing.
Generally speaking, an OSD with slow requests is one that cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
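If you want to confirm the threshold an OSD is using, or see which requests are currently stuck, the daemon's admin socket can be queried with the standard commands below. This is only a sketch: osd.0 is an example id, and the commands must be run on the host where that OSD lives.

ceph daemon osd.0 config get osd_op_complaint_time   # show the complaint threshold (defaults to 30 seconds)
ceph daemon osd.0 dump_ops_in_flight                 # list the requests currently stuck in osd.0's queue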
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
- Problems with the network, which are usually connected with flapping OSDs. See Section 5.1.4, "Flapping OSDs" for details.
- System load
Start troubleshooting in this order (example commands are sketched after the list):
- Look in the monitor logs (systemctl status ceph-mon@<hostname>)
- Look in the OSD logs (systemctl status ceph-osd@<osd-id>)
- Check disk health (SMART)
- Check network health (network diagnostic tools)
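As a rough sketch of what those checks can look like, the commands below use placeholder host, daemon, and device names that you will need to replace with your own; iperf3 is just one option for a throughput test and may need to be installed first.

journalctl -u ceph-mon@mac --since "1 hour ago"   # recent monitor log entries (hostname is an example)
journalctl -u ceph-osd@3 --since "1 hour ago"     # recent log entries for osd.3 (id is an example)
smartctl -a /dev/sda                              # SMART health of the OSD's data device (device is an example)
ping -c 5 <other-osd-host>                        # basic reachability between OSD hosts
iperf3 -c <other-osd-host>                        # throughput test; requires iperf3 -s running on the far end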
Example
Cluster shows health warning:
[root@dennis ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_WARN
            30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops
  ...
This gives us a clue where to look first: the monitor service running on the host named "mac".
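If the one-line summary does not name the daemon clearly, ceph health detail prints each slow-op warning in full and identifies which monitor or OSD is reporting it:

ceph health detail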
Check the status of the monitor service on host mac
[root@mac ~]# systemctl status ceph-mon@mac
● ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 09:36:34 ADT; 2 months 23 days ago
 Main PID: 2249 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─2249 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph

Oct 15 10:59:46 mac: 2018-10-15 10:59:46.371 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:51 mac: 2018-10-15 10:59:51.372 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:56 mac: 2018-10-15 10:59:56.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:01 mac: 2018-10-15 11:00:01.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:06 mac: 2018-10-15 11:00:06.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:11 mac: 2018-10-15 11:00:11.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:16 mac: 2018-10-15 11:00:16.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:21 mac: 2018-10-15 11:00:21.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:26 mac: 2018-10-15 11:00:26.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:31 mac: 2018-10-15 11:00:31.375 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
It looks like mon.mac got hung up removing a snapshot. Verify that all the snapshots are correct, then restart the ceph-mon service on the host to clear the stale warning.
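How you verify the snapshots depends on how they were created; for RBD-backed snapshots, one way is to list the snapshots per image (the pool and image names below are placeholders):

ceph osd pool ls                       # list pools to find the one the stuck snapshot removal belongs to
rbd ls <pool-name>                     # list images in that pool
rbd snap ls <pool-name>/<image-name>   # list snapshots on a specific image

Once the snapshots look sane, restart the monitor daemon: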
systemctl restart ceph-mon@mac
Verify the service restarted correctly
[root@mac ~]# systemctl status ceph-mon@mac -l
● ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-10-15 11:00:55 ADT; 9s ago
 Main PID: 542434 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─542434 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph

Oct 15 11:00:55 mac.45lab.com systemd[1]: Started Ceph cluster monitor daemon.
Oct 15 11:00:55 mac.45lab.com systemd[1]: Starting Ceph cluster monitor daemon...
Finally, verify that the cluster is back to a healthy status (HEALTH_OK):
[root@mac ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum charlie,dennis,mac
    mgr: mac(active), standbys: charlie, dennis
    mds: cephfs-1/1/1 up {0=mac=up:active}, 2 up:standby
    osd: 17 osds: 17 up, 17 in

  data:
    pools:   3 pools, 656 pgs
    objects: 12.27 M objects, 8.9 TiB
    usage:   28 TiB used, 34 TiB / 62 TiB avail
    pgs:     656 active+clean

  io:
    client:   122 KiB/s rd, 2.0 MiB/s wr, 2 op/s rd, 121 op/s wr