45 Drives Knowledge Base
KB450156 - Replacing Failed OSDs
https://knowledgebase.45drives.com/kb/kb450156-replacing-failed-osds/

Posted on July 2, 2019 by Rob MacQueen
Last modified: June 14, 2021


Scope/Description:

This article details the process of locating and replacing a down/out OSD in a Ceph cluster.

Prerequisites:

An Octopus or Nautilus Ceph cluster

A replacement drive of the same capacity and speed on hand (if replacing)
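
To confirm which release the cluster is running, ceph versions lists the Ceph version reported by every daemon:

# all daemons should report a Nautilus (14.x) or Octopus (15.x) build
ceph versions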

Steps:

OSD location

  • In the UI, navigate to Cluster → OSDs and sort by Status.

  • Run this command in the terminal and look for OSDs that are down or out:
ceph osd tree
  • In this example, osd.23 and osd.17 have failed on host vosd01. (To filter a large tree, see the sketch below.)
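
On recent releases (Nautilus and later), the tree can be filtered to show only the problem OSDs; a minimal sketch:

# show only OSDs currently marked down
ceph osd tree down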

Resolve Linux Name of OSD devices

  • In the dashboard, select the failed OSD → Metadata and look for the devices field.

  • In the terminal:
ceph-volume lvm list
  • Find the failed OSDs in the output and note their device names.

  • In this example, osd.23 = /dev/sdi and osd.17 = /dev/sdg
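
The same mapping can also be read straight from the cluster: ceph osd metadata reports a devices field for each OSD.

# print the physical device backing osd.17
ceph osd metadata 17 | grep '"devices"'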

Find Physical Location of Failed Devices

  • Remote into the host(s) with failed OSDs.
ssh root@vosd01
  • Map the Linux device name to its physical slot alias.
ls -al /dev/ | grep sdg
lrwxrwxrwx.  1 root       root              3 Jan  9 23:58 1-7 -> sdg
brw-rw----.  1 root       disk         8,  96 Jan  9 23:58 sdg
ls -al  /dev/ | grep sdi
lrwxrwxrwx.  1 root       root              3 Jan  9 23:58 1-6 -> sdi
brw-rw----.  1 root       disk         8, 128 Jan  9 23:58 sdi
  • OSD 17 is in slot 1-7 on host vosd01
  • OSD 23 is in slot 1-6 on host vosd01 (to check every slot alias at once, see the sketch below)
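
If several drives are suspect, every slot-alias symlink can be listed in one pass. A sketch, assuming the 45Drives-style /dev slot aliases shown above:

# list each slot-alias symlink and the device it points to
ls -al /dev/ | grep -- '-> sd'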
  • DO NOT PHYSICALLY REMOVE THESE DRIVES YET; SAFELY REMOVE THEM FROM THE CLUSTER FIRST
  • DO NOT REMOVE FAILED OSDs FROM THE CLUSTER UNTIL YOU HAVE PHYSICALLY LOCATED THEM
  • Remote into the host(s) with the failed OSDs (if not already connected)
ssh root@vosd01
  • Destroy the OSDs
ceph osd destroy 17 --yes-i-really-mean-it
ceph osd destroy 23 --yes-i-really-mean-it
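
Before pulling any drives, it is worth confirming the OSDs are now flagged as destroyed:

# destroyed OSDs show a 'destroyed' status in the tree
ceph osd tree | grep destroyed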
  • Physically remove the failed disks from the system
  • Insert the new drives into the same slots. If you use different slots, take note of the new slot names and use them in the commands below
  • Wipe the new disks
ceph-volume lvm zap /dev/1-7
ceph-volume lvm zap /dev/1-6
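
If a new disk was previously used by Ceph and zap complains about existing LVM volumes, ceph-volume's --destroy flag tears those down as well; double-check the slot name before running it:

# also remove any leftover LVM and partition metadata on the device
ceph-volume lvm zap --destroy /dev/1-7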
  • Recreate the old OSD with the create command, passing the old OSD ID with the new disk present.
    If this command fails with "The osd ID # is already in use or does not exist," simply remove "--osd-id #" from the command.
ceph-volume lvm create --osd-id 17 --data /dev/1-7

ceph-volume lvm create --osd-id 23 --data /dev/1-6
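
Once created, the new OSDs should come back up and in under their old IDs; a quick check:

# both OSDs should report 'up' once they have booted
ceph osd tree | grep -E 'osd\.(17|23)'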
  • Observe the data migration
watch -n1 ceph -s
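
To watch data refill onto the new OSDs specifically, ceph osd df shows per-OSD utilization; the %USE on the replaced OSDs should climb as recovery proceeds:

# per-OSD utilization, refreshed every 5 seconds
watch -n5 ceph osd df tree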

Verification:

When the data migration is done, make sure the cluster is in a healthy state and that the number of OSDs in the cluster matches the number present before the failure.

ceph -s
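
Two other standard checks are useful here: ceph osd stat prints the up/in OSD counts on one line, and ceph health detail explains any remaining warnings.

# the up/in counts should match the pre-failure totals
ceph osd stat
ceph health detail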

Troubleshooting:
