- This article details the process of locating and replacing a failed storage drive/OSD in a Ceph Cluster.
- SSH access to your Ceph Cluster
- Replacement Drive(s)
- First, we’ll need to determine which drive has failed. We can do this through either the Ceph Dashboard or the command line.
- In the Dashboard, under the tab “Cluster > OSDs”, you can see which drives have failed. You will need the OSD number to physically locate the drive in the server.
On the command line, run the ceph osd tree command and look for OSDs that are down or out.
ceph osd tree
In this example, osd.2 is shown as down in the OSD tree. It is located on host node OSD1.
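The check described above can be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the plain-text layout of `ceph osd tree`, in which a host row contains the word `host` followed by the host name and an OSD row contains `osd.N` followed directly by its status. The function name `list_down_osds` is hypothetical.

```shell
#!/bin/sh
# list_down_osds (hypothetical helper): reads `ceph osd tree` text on stdin
# and prints "<osd name> <host>" for every OSD whose status is "down".
list_down_osds() {
    awk '
    {
        for (i = 1; i < NF; i++) {
            # Remember the most recent host bucket seen while walking the tree.
            if ($i == "host") { host = $(i + 1) }
            # An OSD row has its name followed directly by its status.
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") { print $i, host }
        }
    }'
}

# Usage on a node with the ceph CLI:
#   ceph osd tree | list_down_osds
```

For the example in this article, the helper would report osd.2 together with its host, OSD1.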
- Next, physically locate the downed OSD in the server with the findosd tool. If you are running an older Ceph cluster that does not have this tool, see the troubleshooting section to locate the disk physically.
- Next, we’ll shrink the cluster by removing the failed OSD(s).
- SSH into the host(s) with the failed OSD(s).
Destroy the OSD(s)
ceph osd destroy 2 --yes-i-really-mean-it
Do not remove failed OSD(s) from the Ceph cluster until you have identified which drive each one was created on.
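As an optional safeguard that is not part of the original steps, the destroy command can be wrapped in a check. The sketch below assumes `ceph osd safe-to-destroy` is available on your Ceph release; it exits successfully only when no placement groups still depend on the OSD. The function name `destroy_failed_osd` is hypothetical.

```shell
#!/bin/sh
# destroy_failed_osd (hypothetical helper): destroys OSD <id> only after
# Ceph reports that no placement groups still depend on it.
destroy_failed_osd() {
    osd_id=$1
    if ceph osd safe-to-destroy "osd.${osd_id}"; then
        ceph osd destroy "${osd_id}" --yes-i-really-mean-it
    else
        echo "osd.${osd_id} is not yet safe to destroy; try again later" >&2
        return 1
    fi
}

# Usage, with the OSD id from the example in this article:
#   destroy_failed_osd 2
```

If the check fails, the cluster is probably still recovering data that lived on the failed OSD; waiting and retrying is usually preferable to forcing the destroy.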
Remove the failed disks physically from the system
Insert the new drive(s) into the same slot(s). If you use different slots, take note of the new slot names and use them in the commands below.
Wipe the new disk if it has any data on it
ceph-volume lvm zap /dev/1-1
Recreate the OSD on the new disk with the create command, reusing the old OSD ID.
If this command fails and reads “The osd ID # is already in use or does not exist,” simply remove “--osd-id #” from the command.
ceph-volume lvm create --osd-id 2 --data /dev/1-1
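The wipe, create, and fallback steps above can be combined into one helper. This is a sketch under stated assumptions: the function name `recreate_osd` is hypothetical, it zaps the device unconditionally (the article only wipes a disk that holds data), and it retries without `--osd-id` on any failure of the first create, not only on the "already in use" error.

```shell
#!/bin/sh
# recreate_osd (hypothetical helper): wipes <device>, then creates an OSD on
# it, first reusing <osd_id> and then letting Ceph pick an ID if that fails.
recreate_osd() {
    osd_id=$1
    device=$2
    # Assumption: zapping a blank replacement disk is harmless.
    ceph-volume lvm zap "${device}" || return 1
    if ! ceph-volume lvm create --osd-id "${osd_id}" --data "${device}"; then
        echo "could not reuse ID ${osd_id}; retrying without --osd-id" >&2
        ceph-volume lvm create --data "${device}"
    fi
}

# Usage, with the values from the example in this article:
#   recreate_osd 2 /dev/1-1
```

Check the error message from the first attempt before relying on the retry, since a failure unrelated to the OSD ID will most likely fail the second time as well.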
Observe the data migration
watch ceph -s
- When the data migration is done, make sure the cluster is in a healthy state and that the number of OSDs in the cluster matches the number that was there before the failure.
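The final check can also be expressed as a small script. The sketch below assumes that `ceph health` prints a status beginning with `HEALTH_OK` on a healthy cluster and that `ceph osd ls` prints one OSD ID per line. The function name `cluster_recovered` and the expected count passed to it are hypothetical.

```shell
#!/bin/sh
# cluster_recovered (hypothetical helper): succeeds when `ceph health`
# reports HEALTH_OK and the cluster lists exactly <expected> OSDs.
cluster_recovered() {
    expected=$1
    health=$(ceph health) || return 1
    # Count the OSD IDs known to the cluster, one per output line.
    count=$(ceph osd ls | awk 'END { print NR }')
    case "${health}" in
        HEALTH_OK*) [ "${count}" -eq "${expected}" ] ;;
        *) return 1 ;;
    esac
}

# Usage: poll until the cluster is healthy with, for example, 12 OSDs.
#   until cluster_recovered 12; do sleep 30; done
```

Replace 12 with the number of OSDs the cluster had before the drive failed.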
- Older clusters may not have the findosd tool. In that case, the steps below show how to find the physical location of a down OSD.
ceph-volume lvm list
- Find the failed OSD and read its device name.
In this example, osd.23 is on /dev/sdi and osd.17 is on /dev/sdg.
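Reading the device name out of a long listing can be automated. The sketch below is an assumption about the plain-text layout of `ceph-volume lvm list`: each OSD section starts with a line such as `====== osd.23 =======` and contains a `devices` row naming the underlying disk. If your release formats the listing differently, the patterns need adjusting. The function name `osd_devices` is hypothetical.

```shell
#!/bin/sh
# osd_devices (hypothetical helper): reads `ceph-volume lvm list` text on
# stdin and prints "<osd name>=<device>" for each "devices" row it finds.
osd_devices() {
    awk '
    # Section header, e.g. "====== osd.23 =======".
    $1 ~ /^=+$/ && $2 ~ /^osd\.[0-9]+$/ { osd = $2 }
    # Row naming the physical device behind the logical volume.
    $1 == "devices" && osd != "" { print osd "=" $2 }
    '
}

# Usage on the OSD host:
#   ceph-volume lvm list | osd_devices
```

An OSD with separate block and DB volumes may print more than one line, one per volume section.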
Find Physical Location of Failed Devices
Remote into the host(s) with failed OSDs.
Map the Linux device name to the physical alias name.
ls -al /dev/ | grep sdg
lrwxrwxrwx. 1 root root 3 Jan 9 23:58 1-7 -> sdg
brw-rw----. 1 root disk 8, 96 Jan 9 23:58 sdg
ls -al /dev/ | grep sdi
lrwxrwxrwx. 1 root root 3 Jan 9 23:58 1-6 -> sdi
brw-rw----. 1 root disk 8, 128 Jan 9 23:58 sdi
OSD 17 is in slot 1-7 on host vosd01
OSD 23 is in slot 1-6 on host vosd01
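The device-to-slot lookup shown in the listings above can be wrapped in a helper so it does not depend on reading the output by eye. This is a minimal sketch, assuming the slot aliases are symlinks in /dev that point at the kernel device name, as in the listings. The function name `slot_for_device` is hypothetical.

```shell
#!/bin/sh
# slot_for_device (hypothetical helper): reads `ls -al /dev/` output on stdin
# and prints the name of every symlink that points at <device>.
slot_for_device() {
    awk -v dev="$1" '
    # A symlink row ends with "<alias> -> <target>".
    NF >= 3 && $(NF - 1) == "->" && $NF == dev { print $(NF - 2) }
    '
}

# Usage on the OSD host:
#   ls -al /dev/ | slot_for_device sdg
```

Unlike `grep sdg`, this matches the symlink target exactly, so a device such as sdg1 does not produce a false hit.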