KB450101 – Slow-Blocked Ops Troubleshooting

Last modified: October 23, 2019
    If your Ceph cluster encounters a slow or blocked operation, it logs the event and sets the cluster health to HEALTH_WARN.

    “Slow” and “blocked” ops are synonyms as far as Ceph is concerned; both terms refer to the same condition.

    Generally speaking, an OSD has slow requests when it cannot service the I/O operations (IOPS) in its queue within the time defined by the osd_op_complaint_time parameter, which defaults to 30 seconds.
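    As a toy sketch of that threshold (the 30-second default is real; the helper name and sample ages below are ours, not from a live cluster):

```shell
# An op counts as "slow" only once its age in the queue exceeds
# osd_op_complaint_time (30 seconds by default).
complaint_time=30
is_slow() { if [ "$1" -gt "$complaint_time" ]; then echo slow; else echo ok; fi; }

for age in 2 31 45; do
  echo "op aged ${age}s: $(is_slow "$age")"
done
# prints: op aged 2s: ok / op aged 31s: slow / op aged 45s: slow
```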

    The main causes of OSDs having slow requests are:

    • Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
    • Problems with the network, which usually manifest as flapping OSDs (see Section 5.1.4, “Flapping OSDs”, for details)
    • System load

    Start to troubleshoot in this order:

    1. Look in the monitor logs (systemctl status ceph-mon@<mon-id>)
    2. Look in the OSD logs (systemctl status ceph-osd@<osd-id>)
      1. Check Disk Health (SMART)
      2. Check Network Health (Network diagnostic tools)

    Example

    Cluster shows health warning:

    [root@dennis ~]# ceph -s
      cluster:
        id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
        health: HEALTH_WARN
                30 slow ops, oldest one blocked for 104335 sec, mon.mac has slow ops
    ...
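    The “blocked for 104335 sec” figure is easier to judge once converted to days, hours, minutes, and seconds; a small helper (the function name is ours):

```shell
# Break a blocked-op age in seconds (as reported by ceph -s) into d/h/m/s.
secs_to_dhms() {
  s=$1
  printf '%dd %dh %dm %ds\n' \
    $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

secs_to_dhms 104335   # prints: 1d 4h 58m 55s
```

    An op blocked for more than a day points at a stuck daemon rather than momentary load.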

    This gives us a clue about where to look first: the monitor service running on the host named “mac” is reporting slow ops.

    Check the status of the monitor service on host mac

    [root@mac ~]# systemctl status ceph-mon@mac
    ● ceph-mon@mac.service - Ceph cluster monitor daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
       Active: active (running) since Mon 2018-07-23 09:36:34 ADT; 2 months 23 days ago
     Main PID: 2249 (ceph-mon)
       CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
               └─2249 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
    
    Oct 15 10:59:46 mac: 2018-10-15 10:59:46.371 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 10:59:51 mac: 2018-10-15 10:59:51.372 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 10:59:56 mac: 2018-10-15 10:59:56.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:01 mac: 2018-10-15 11:00:01.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:06 mac: 2018-10-15 11:00:06.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:11 mac: 2018-10-15 11:00:11.373 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:16 mac: 2018-10-15 11:00:16.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:21 mac: 2018-10-15 11:00:21.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:26 mac: 2018-10-15 11:00:26.374 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
    Oct 15 11:00:31 mac: 2018-10-15 11:00:31.375 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)
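    With long runs of identical lines like these, a quick filter pulls out the slow-op count and the oldest op. A sketch against one of the sample lines above; in practice you would pipe the output of journalctl -u ceph-mon@mac through the same sed:

```shell
# Extract the slow-op count and the oldest op from a get_health_metrics line.
line='2018-10-15 10:59:46.371 7f11b56b9700 -1 mon.mac@2(peon) e3 get_health_metrics reporting 30 slow ops, oldest is remove_snaps({5=[270b,273e]} v0)'
echo "$line" | sed -n 's/.*reporting \([0-9]*\) slow ops, oldest is \(.*\)/count=\1 oldest=\2/p'
# prints: count=30 oldest=remove_snaps({5=[270b,273e]} v0)
```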

    It looks like mon.mac got hung up removing a snapshot. Verify that all snapshots are correct, then restart the ceph-mon service on mac to clear the stale warning.

    systemctl restart ceph-mon@mac

    Verify the service restarted correctly

    [root@mac ~]# systemctl status ceph-mon@mac -l
    ● ceph-mon@mac.service - Ceph cluster monitor daemon
       Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
       Active: active (running) since Mon 2018-10-15 11:00:55 ADT; 9s ago
     Main PID: 542434 (ceph-mon)
       CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@mac.service
               └─542434 /usr/bin/ceph-mon -f --cluster ceph --id mac --setuser ceph --setgroup ceph
    
    Oct 15 11:00:55 mac.45lab.com systemd[1]: Started Ceph cluster monitor daemon.
    Oct 15 11:00:55 mac.45lab.com systemd[1]: Starting Ceph cluster monitor daemon...

    And verify that your cluster is back to HEALTH_OK:

    [root@mac ~]# ceph -s
      cluster:
        id:     3a49db45-43a5-4c82-9327-4c568dd8fc92
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum charlie,dennis,mac
        mgr: mac(active), standbys: charlie, dennis
        mds: cephfs-1/1/1 up  {0=mac=up:active}, 2 up:standby
        osd: 17 osds: 17 up, 17 in
     
      data:
        pools:   3 pools, 656 pgs
        objects: 12.27 M objects, 8.9 TiB
        usage:   28 TiB used, 34 TiB / 62 TiB avail
        pgs:     656 active+clean
     
      io:
        client:   122 KiB/s rd, 2.0 MiB/s wr, 2 op/s rd, 121 op/s wr