Scope/Description
This article details the process of troubleshooting a monitor service experiencing slow/blocked operations. When a Ceph cluster encounters a slow or blocked operation, it logs the event and sets the cluster health to HEALTH_WARN.
Generally speaking, an OSD with slow requests is any OSD that cannot service the I/O operations per second (IOPS) in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
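The threshold can be inspected, and temporarily raised, at runtime. A minimal sketch, assuming a live cluster; osd.0 is an example daemon ID, and the ceph config form assumes Mimic or later:

```shell
# Read the current complaint threshold from a running OSD's admin socket
# (run on the host where that OSD lives; osd.0 is a placeholder ID):
ceph daemon osd.0 config get osd_op_complaint_time

# Read or change it cluster-wide via the centralized config (Mimic and later):
ceph config get osd osd_op_complaint_time
ceph config set osd osd_op_complaint_time 60
```

Raising the threshold only quiets the reporting; it does not address the underlying cause.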
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
- Problems with the network, which are usually connected with flapping OSDs. See Section 5.1.4, “Flapping OSDs” for details.
- System load
Prerequisites
- Root access to the cluster nodes
- A running Ceph cluster reporting HEALTH_WARN with slow ops
Steps
- Start to troubleshoot in this order:
- Look in the monitor logs (systemctl status ceph-mon@<mon-id>)
- Look in the OSD logs (systemctl status ceph-osd@<osd-id>)
- Check Disk Health (SMART)
- Check Network Health (Network diagnostic tools)
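For steps 3 and 4, hypothetical command examples; the device name, interface name, and peer host are placeholders, and smartctl comes from the smartmontools package:

```shell
# Disk health: overall SMART self-assessment of an OSD's backing drive
# (/dev/sda is a placeholder device)
smartctl -H /dev/sda

# Network health: basic reachability and latency to a peer cluster host
# ("mac" is a placeholder host name)
ping -c 5 mac

# Look for drops/errors on the cluster-facing interface (eth0 is a placeholder)
ip -s link show eth0
```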
Example
- Cluster shows health warning:
[root@dennis ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_WARN
            30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops
...
This gives us a clue where to look first: the monitor service running on the host named “mac” is reporting slow ops.
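The daemon name can also be pulled out of the health output programmatically. A minimal sketch; the sample text below stands in for a live `ceph health detail` call:

```shell
#!/bin/sh
# Extract which daemon is reporting slow ops. On a real cluster, replace the
# sample text with: health=$(ceph health detail)
health='HEALTH_WARN 30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops'

# Grab the "<type>.<id>" token that precedes "has slow ops" (e.g. mon.mac)
daemon=$(printf '%s\n' "$health" | grep -o '[a-z]*\.[a-z0-9]* has slow ops' | awk '{print $1}' | sort -u)
echo "$daemon"
```

With the sample input above, this prints mon.mac, telling us which service to inspect next.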
- Check the status of the monitor service on host mac
[root@mac ~]# systemctl status ceph-mon@mac
ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-07-23 09:36:34 ADT; 2 months 23 days ago
 Main PID: 2249 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─2249 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
Oct 15 10:59:46 mac: 2018-10-15 10:59:46.371 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:51 mac: 2018-10-15 10:59:51.372 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 10:59:56 mac: 2018-10-15 10:59:56.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:01 mac: 2018-10-15 11:00:01.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:06 mac: 2018-10-15 11:00:06.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:11 mac: 2018-10-15 11:00:11.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:16 mac: 2018-10-15 11:00:16.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:21 mac: 2018-10-15 11:00:21.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:26 mac: 2018-10-15 11:00:26.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
Oct 15 11:00:31 mac: 2018-10-15 11:00:31.375 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
In this case, it looks like mon.mac got hung up while removing a snapshot. Verify that all snapshots are correct, then restart the ceph-mon service on mac to clear the stale warning.
systemctl restart ceph-mon@mac
- Verify the service restarted correctly
[root@mac ~]# systemctl status ceph-mon@mac -l
ceph-mon@mac.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-10-15 11:00:55 ADT; 9s ago
 Main PID: 542434 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
           └─542434 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
Oct 15 11:00:55 mac.45lab.com systemd[1]: Started Ceph cluster monitor daemon.
Oct 15 11:00:55 mac.45lab.com systemd[1]: Starting Ceph cluster monitor daemon...
- And verify that the cluster is back to HEALTH_OK
[root@mac ~]# ceph -s
  cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum charlie,dennis,mac
    mgr: mac(active), standbys: charlie, dennis
    mds: cephfs-1/1/1 up {0=mac=up:active}, 2 up:standby
    osd: 17 osds: 17 up, 17 in
  data:
    pools:   3 pools, 656 pgs
    objects: 12.27 M objects, 8.9 TiB
    usage:   28 TiB used, 34 TiB / 62 TiB avail
    pgs:     656 active+clean
  io:
    client: 122 KiB/s rd, 2.0 MiB/s wr, 2 op/s rd, 121 op/s wr
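The final check lends itself to scripting as well. A minimal sketch that greps the status for HEALTH_OK; the sample text stands in for a live `ceph -s` call:

```shell
#!/bin/sh
# On a real cluster, replace the sample text with: status=$(ceph -s)
status='cluster:
    id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
    health: HEALTH_OK'

# HEALTH_OK present means the slow-ops warning has cleared
if printf '%s\n' "$status" | grep -q 'HEALTH_OK'; then
    echo "cluster healthy"
else
    echo "cluster needs attention"
fi
```

Dropped into cron or a monitoring hook, this catches the warning recurring, which would suggest the snapshot cleanup above treated a symptom rather than the cause.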