I need help with a RAID error
Hello, I've never had to deal with RAID issues before (I'm new to dedicated servers since we switched from VPSes).
These are the errors that I got by email:
A Fail event had been detected on md device /dev/md/2.

It could be related to component device /dev/nvme0n1p3.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md2 : active raid1 nvme1n1p3[1] nvme0n1p3[0](F)
      3745885504 blocks super 1.2 [2/1] [_U]
      bitmap: 3/28 pages [12KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      4189184 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      523264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
A Fail event had been detected on md device /dev/md/1.

It could be related to component device /dev/nvme0n1p2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md2 : active raid1 nvme1n1p3[1]
      3745885504 blocks super 1.2 [2/1] [_U]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0](F)
      4189184 blocks super 1.2 [2/1] [_U]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0](F)
      523264 blocks super 1.2 [2/1] [_U]

unused devices: <none>
and there's another email with:
A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/nvme0n1p1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md2 : active raid1 nvme1n1p3[1]
      3745885504 blocks super 1.2 [2/1] [_U]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid1 nvme1n1p1[1]
      4189184 blocks super 1.2 [2/1] [_U]

md1 : active raid1 nvme1n1p2[1]
      523264 blocks super 1.2 [2/1] [_U]

unused devices: <none>
What should I do? Thanks.
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
Comments
Here's more info that I've Googled
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
And the partitions:
The server is a Hetzner AX101,
and I did the partitioning following their tutorial for disks larger than 2 TB using installimage (run from the rescue system during the OS install); a rough sketch of that kind of config is below.
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
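For context, a Hetzner installimage setup for two NVMe drives in software RAID1 is driven by a small config file. The sketch below is only an illustration of that kind of config; the drive names and sizes are assumptions matched to the mdstat output above, not taken from the actual install:

# installimage config sketch (assumed values, not the real one used here)
DRIVE1 /dev/nvme0n1
DRIVE2 /dev/nvme1n1

# mirror everything across both drives with Linux software RAID (mdadm)
SWRAID 1
SWRAIDLEVEL 1

# three partitions, matching md0 (swap), md1 (/boot) and md2 (/) above;
# GPT is used for drives larger than 2 TB
PART swap  swap  4G
PART /boot ext3  512M
PART /     ext4  all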
Helped a bit via Discord... mdadm should be back in sync for now.
★ MyRoot.PW ★ Dedicated Servers ★ LIR-Services ★ Web-Hosting ★
★ Locations: Austria + Netherlands + USA ★ [email protected] ★
@SGraf Thanks for helping! Thanks also for posting about the resolution so the thread wouldn't be left just dangling.
@Amadex @SGraf Could you guys please post a brief note about
what the problem was,
what caused the problem, and
how you fixed it?
Best wishes and kindest regards from a clueless™ guy in the desert! 🏜️
Tom. 穆坦然. Not Oles. Happy New York City guy visiting Mexico! How is your 文言文?
The MetalVPS.com website runs very speedily on MicroLXC.net! Thanks to @Neoon!
You should get the NVMe that was removed from the array replaced ASAP. I had that happen too and put it back in the RAID array because a badblocks and SMART test came out clean. A few hours later, the node started behaving extremely weirdly (high iowait) and eventually crashed.
I don't understand why in 2021 people still do mdadm arrays... Logical Volume Groups, btrfs, or ZFS are the way to go. There are too many issues with write holes and desyncing on mdadm that require manual intervention for my tastes.
Cheap dedis are my drug, and I'm too far gone to turn back.
Sorry, how do we know that an NVMe was removed from the array?
Tom. 穆坦然. Not Oles. Happy New York City guy visiting Mexico! How is your 文言文?
The MetalVPS.com website runs very speedily on MicroLXC.net! Thanks to @Neoon!
Got an alert from our monitoring software, and an email too. cat /proc/mdstat will also show your array as degraded.
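For anyone following along, this is roughly how to check it yourself; the device names below are the ones from the emails above:

# [_U] means only one of the two mirror legs is active; (F) marks the failed member
cat /proc/mdstat

# more detail for a single array: state, which device failed or was removed, event counts
mdadm --detail /dev/md2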
LVM RAID uses the same underlying driver IIRC; btrfs is still not stable, and ZFS has a performance overhead / needs to be tuned properly. mdadm still works just fine out of the box and is a tested solution.
@SGraf helped me a lot. Thanks again! 🙌
@Not_Oles
Problem: I got an email that the RAID had failed
What caused the problem: dunno
Fixed: @SGraf was troubleshooting
Idk if I should replace the disk or keep it for now.
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
As I said in the chat, get that disk replaced. Just because we put the mdadm RAID back together for now doesn't mean it will be stable in the future.
We saw I/O write errors on the SSD before the system dropped it completely. The disk came back after a reboot and we re-added it and let it resync (rough sketch of the commands below).
★ MyRoot.PW ★ Dedicated Servers ★ LIR-Services ★ Web-Hosting ★
★ Locations: Austria + Netherlands + USA ★ [email protected] ★
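For reference, the re-add step generally looks something like the following. This is a generic sketch using the device names from the emails above, not the exact commands that were run:

# check which member is marked (F) = failed
cat /proc/mdstat

# drop the failed member from the array (repeat per affected array: md0, md1, md2)
mdadm /dev/md2 --remove /dev/nvme0n1p3

# try a re-add first; with the write-intent bitmap only the changed blocks get resynced
mdadm /dev/md2 --re-add /dev/nvme0n1p3

# if --re-add is refused, a plain --add works but triggers a full resync
mdadm /dev/md2 --add /dev/nvme0n1p3

# watch the rebuild
watch cat /proc/mdstat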
Run smartctl -a against your NVMes to get an idea of how worn out they are... @Falzo
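For NVMe drives, the wear and error counters that matter look roughly like this (the device name is just an example; adjust for your system):

# full SMART / health report for the first NVMe
smartctl -a /dev/nvme0

# key fields in the NVMe health section:
#   Percentage Used                  - wear estimate, roughly 0% on a new drive
#   Available Spare                  - should stay near 100%
#   Media and Data Integrity Errors  - should be 0
#   Error Information Log Entries    - a growing count points at a sick drive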
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
These NVMes are brand new. There is nothing to argue over for having them replaced.
Did you power off the server via the panel at some point after the first installation? Maybe a power loss led to the broken initial RAID sync in the first place...
I never did that. After the Plesk installation + CentOS 8 > CloudLinux 8 conversion I did a normal reboot via SSH. The server was bought on 27.09.2021 and everything was installed on that day + rebooted. Since then I've touched nothing.
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
Weird... However, I doubt that there is anything wrong with either of the NVMes at all, whatever hiccup that was.
I will wait and see if it happens again. Thanks, everyone, for the replies.
Amadex.com Domainer + IT Supporter | Brbljaona Balkan Chat Website | ICT Jobs Croatia
LVM at least has the benefit of self-healing in RAID1 scenarios, despite calling mdadm for the underlying RAID functionality, and offers greater flexibility than mdadm (see the sketch after this post).
btrfs is considered stable for RAID1: https://btrfs.wiki.kernel.org/index.php/Status - Performance tuning just needs to happen to the code and then it'll be a lot more viable, but that being said it is quite performant as it is now.
ZFS performance overhead is way overblown; the "1TB of storage needs 1GB of RAM" is only for enterprise level applications with many clients simultaneously reading and writing to the array, and the recommended tuning settings are well documented and understood.
ZFS performance tuning official recommendations: https://openzfs.github.io/openzfs-docs/Performance and Tuning/Workload Tuning.html
ZFS developer on the "1GB for 1TB" rule:
https://www.reddit.com/r/DataHoarder/comments/5u3385/linus_tech_tips_unboxes_1_pb_of_seagate/ddrh5iv/
https://www.reddit.com/r/DataHoarder/comments/5u3385/linus_tech_tips_unboxes_1_pb_of_seagate/ddrngar/
Cheap dedis are my drug, and I'm too far gone to turn back.
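To illustrate the self-healing point above: LVM RAID1 exposes scrubbing and repair directly through lvchange. The volume group and LV names below (vg, data) are made up for the example:

# create a two-leg mirror (LVM reuses the kernel's md RAID code via dm-raid)
lvcreate --type raid1 -m 1 -L 100G -n data vg

# scrub: read both legs and count mismatches
lvchange --syncaction check vg/data

# scrub and fix mismatches
lvchange --syncaction repair vg/data

# show sync progress, current scrub action and mismatch count
lvs -o name,sync_percent,raid_sync_action,raid_mismatch_count vg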
I'm not sure what you're referring to by LVM's self-healing.
I'm referring to BTRFS's stability: a few weeks back I had a friend lose data due to a power loss (OpenSUSE + Btrfs). This rarely happens with EXT4.
And yes, I'm not sure where that RAM requirement came from. I run a few large (100+ TB) Proxmox servers on ZFS (that's the only logical choice: XFS has apparently been having problems with files disappearing, and ext4 is limited) and they run on very little RAM. I find it really nice that the write-hole problem is addressed there, but there are a few quirks: iowait was much higher when a VM was in a zvol than when it was in a file in the same zpool the zvol was in. Another problem with ZFS is the lack of mainstream Linux support at the moment. Hopefully that will improve in the future.
I have high hopes for a project called bcachefs; it seems really cool.