Find Which OSDs Have Failed
OSD location
-
In the UI:
-
Navigate to Cluster → OSDs and sort by Status.
-
-
In the terminal:
-
ceph osd tree
-
Look for OSDs marked Down, Out, or both (a filtered form of this command is shown after the example below).
-
-
In this example, osd.23 and osd.17 have failed on host vosd01
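-
If the tree is large, you can narrow the output to just the failed OSDs; this is a minimal sketch (on Luminous or later a state filter such as "ceph osd tree down" may also be available):
-
ceph osd tree | grep down
-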
Resolve Linux Name of OSD Devices
-
In the UI:
-
Select the failed OSD → Metadata and look for the devices field.
-
-
In the terminal:
-
ceph-volume lvm list
-
Find the failed OSD in the output and read its device name (an alternative metadata query is shown after the example below).
-
-
-
In this example, osd.23 = /dev/sdi and osd.17 = /dev/sdg
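-
The same device names can also be read from any admin node; a minimal sketch using the OSD metadata query (the devices field names the backing disk):
-
ceph osd metadata 17 | grep devices
-
ceph osd metadata 23 | grep devices
-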
Find Physical Location of Failed Devices
-
Remote into the host(s) with failed OSDs.
-
ssh root@vosd01
-
-
Map the Linux device names to their physical slot aliases (a fallback for hosts without these aliases is shown below).
-
ls -al /dev/ | grep sdg
-
lrwxrwxrwx. 1 root root 3 Jan 9 23:58 1-7 -> sdg
brw-rw----. 1 root disk 8, 96 Jan 9 23:58 sdg
-
-
ls -al /dev/ | grep sdi
-
lrwxrwxrwx. 1 root root 3 Jan 9 23:58 1-6 -> sdi
brw-rw----. 1 root disk 8, 128 Jan 9 23:58 sdi
-
-
-
OSD 17 is in slot 1-7 on host vosd01
-
OSD 23 is in slot 1-6 on host vosd01
-
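The numbered slot aliases used above (1-6, 1-7) come from site-specific udev rules and may not exist on every host. As a fallback sketch, the kernel's persistent by-path links can help map a device to its controller port or slot:
-
ls -l /dev/disk/by-path/ | grep -w sdg
-
ls -l /dev/disk/by-path/ | grep -w sdi
-
-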
DO NOT PHYSICALLY REMOVE THESE DRIVES YET. SAFELY REMOVE THEM FROM THE CLUSTER FIRST
Shrink Cluster By Removing Failed OSDs
-
DO NOT REMOVE FAILED OSDs FROM THE CLUSTER UNTIL YOU HAVE PHYSICALLY LOCATED THEM FIRST
-
Remote into the host(s) with failed OSDs.
-
ssh root@vosd01
-
-
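Optionally, confirm first that the cluster can lose these OSDs without data loss; a minimal check, assuming a Luminous or later release (it returns an error while destruction would still be unsafe):
-
ceph osd safe-to-destroy 17
-
ceph osd safe-to-destroy 23
-
-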
Destroy the OSDs
-
ceph osd destroy 17 --yes-i-really-mean-it
-
ceph osd destroy 23 --yes-i-really-mean-it
-
-
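At this point the two IDs should appear with a destroyed status in the OSD tree, which keeps them reserved for reuse; a quick check:
-
ceph osd tree | grep destroyed
-
-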
Physically remove the failed disks from the system (slots 1-7 and 1-6 in this example)
-
Insert the new drives into the same slots. If you use different slots, take note of the new slot names and substitute them in the commands below.
-
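Before wiping anything, it is worth double-checking which Linux device each new drive received; a minimal sketch, assuming the same slot aliases are in place (lsblk follows the symlink to the underlying disk):
-
lsblk /dev/1-7
-
lsblk /dev/1-6
-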
Wipe the new disks
-
ceph-volume lvm zap /dev/1-7
-
ceph-volume lvm zap /dev/1-6
-
-
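The zap above assumes a blank replacement drive. If you are instead reusing a disk that previously held an OSD, add the --destroy flag so the old LVM volumes and partitions are removed as well, for example:
-
ceph-volume lvm zap --destroy /dev/1-7
-
-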
Recreate each OSD with the create command, reusing the old OSD ID now that the new disk is present.
If this command fails with "The osd ID # is already in use or does not exist", remove "--osd-id #" from the command and rerun it.
-
ceph-volume lvm create --osd-id 17 --data /dev/1-7
-
ceph-volume lvm create --osd-id 23 --data /dev/1-6
-
-
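Once created, the daemons should start and the two IDs should come back up and in; a quick check:
-
ceph osd tree | grep -w -e 'osd.17' -e 'osd.23'
-
-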
Observe data migration
-
watch -n1 ceph -s
-
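Recovery and backfill activity will show up in the status output; the cluster is healthy again once all placement groups report active+clean. For a more compact view of placement group states, if preferred:
-
ceph pg stat
-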