Linus Torvalds + HN + ECC RAM + rasdaemon
One of the reasons I am renting a Hetzner AX51 server is that the AX51 is the least expensive server in the AX line that comes configured by default with error-correcting code (ECC) memory.
This morning rajesh-s posted on Hacker News (HN) about Linus Torvalds's opinion that "ECC absolutely matters."
The HN discussion was fascinating! All kinds of anecdotal and tech information about ECC and bit flips, their importance or lack of importance, relative costs, effects on hardware and software, influence from Intel and AMD marketing strategies, and use of ECC by various huge companies and in various kinds of equipment.
One HN comment by fortran77 stated that the rate of bit flips is about 1 per gigabyte per month.
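To get a feel for what that quoted rate would mean in practice, here is a rough back-of-the-envelope sketch. It takes the "1 per gigabyte per month" figure from the comment at face value (it is an anecdotal number, not a measurement of mine), and the 64 GB RAM size in the usage note is only a hypothetical example, not a statement about any particular server.

```python
def expected_bit_flips(ram_gb: float, months: float,
                       flips_per_gb_month: float = 1.0) -> float:
    """Estimate bit flips as RAM size x time x rate.

    The default rate of 1 flip per GB per month is the figure quoted
    in the HN comment; it is an assumption, not a verified value.
    """
    return ram_gb * months * flips_per_gb_month
```

For example, `expected_bit_flips(64, 12)` gives the estimate for a hypothetical 64 GB machine over one year at the quoted rate. The real rate likely varies a lot with hardware, altitude, and age, so treat the result as an order-of-magnitude guess at best.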
gsvelto, a Mozilla engineer, posted in the HN discussion a link to his excellent rasdaemon tutorial.
Whoa! @Not_Oles, the "clueless administrator," had never heard of rasdaemon before! What's rasdaemon? As gsvelto said, "rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory error." The initial "ras" in the name "rasdaemon" stands for Reliability, Availability and Serviceability (RAS).
I googled around a bit and checked GitHub for any rasdaemon problems with Proxmox. The weather seemed maybe pretty good. So I went ahead and installed rasdaemon on my server. Now maybe I can monitor ECC memory errors and also get those errors logged.
Here below is what the install looked like in case anybody is interested.
I had a fun day! I hope you did too! Greetings from Mexico!
root@hels ~ # apt-get update
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Hit:2 http://deb.debian.org/debian buster InRelease
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Hit:4 http://download.proxmox.com/debian/ceph-nautilus buster InRelease
Hit:5 http://download.proxmox.com/debian/pve buster InRelease
Hit:6 http://mirror.hetzner.de/debian/packages buster InRelease
Get:7 http://mirror.hetzner.de/debian/security buster/updates InRelease [65.4 kB]
Get:8 http://mirror.hetzner.de/debian/packages buster-updates InRelease [51.9 kB]
Fetched 235 kB in 1s (321 kB/s)
Reading package lists... Done
root@hels ~ # apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@hels ~ # apt-get install rasdaemon
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libdbd-sqlite3-perl libdbi-perl
Suggested packages:
  libmldbm-perl libnet-daemon-perl libsql-statement-perl
The following NEW packages will be installed:
  libdbd-sqlite3-perl libdbi-perl rasdaemon
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 1,030 kB of archives.
After this operation, 2,914 kB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://mirror.hetzner.de/debian/packages buster/main amd64 libdbi-perl amd64 1.642-1+deb10u1 [775 kB]
Get:2 http://mirror.hetzner.de/debian/packages buster/main amd64 libdbd-sqlite3-perl amd64 1.62-3 [177 kB]
Get:3 http://mirror.hetzner.de/debian/packages buster/main amd64 rasdaemon amd64 0.6.0-1.2 [78.2 kB]
Fetched 1,030 kB in 1s (2,023 kB/s)
Selecting previously unselected package libdbi-perl:amd64.
(Reading database ... 71868 files and directories currently installed.)
Preparing to unpack .../libdbi-perl_1.642-1+deb10u1_amd64.deb ...
Unpacking libdbi-perl:amd64 (1.642-1+deb10u1) ...
Selecting previously unselected package libdbd-sqlite3-perl:amd64.
Preparing to unpack .../libdbd-sqlite3-perl_1.62-3_amd64.deb ...
Unpacking libdbd-sqlite3-perl:amd64 (1.62-3) ...
Selecting previously unselected package rasdaemon.
Preparing to unpack .../rasdaemon_0.6.0-1.2_amd64.deb ...
Unpacking rasdaemon (0.6.0-1.2) ...
Setting up libdbi-perl:amd64 (1.642-1+deb10u1) ...
Setting up libdbd-sqlite3-perl:amd64 (1.62-3) ...
Setting up rasdaemon (0.6.0-1.2) ...
Created symlink /etc/systemd/system/multi-user.target.wants/ras-mc-ctl.service → /lib/systemd/system/ras-mc-ctl.service.
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /lib/systemd/system/rasdaemon.service.
Processing triggers for man-db (2.8.5-2) ...
root@hels ~ # systemctl enable rasdaemon
root@hels ~ # systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-01-04 01:44:06 UTC; 2min 42s ago
 Main PID: 15397 (rasdaemon)
    Tasks: 1 (limit: 4915)
   Memory: 10.9M
   CGroup: /system.slice/rasdaemon.service
           └─15397 /usr/sbin/rasdaemon -f -r

Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: ras:extlog_mem_event event enabled
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Enabled event ras:extlog_mem_event
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: ras:extlog_mem_event event enabled
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Listening to events for cpus 0 to 15
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: Enabled event ras:extlog_mem_event
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording mc_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording aer_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording extlog_event events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording mce_record events
Jan 04 01:44:06 hels.xxxxxxxxx.xxx rasdaemon[15397]: rasdaemon: Recording arm_event events
root@hels ~ # man rasdaemon
root@hels ~ # man ras-mc-ctl
root@hels ~ # ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASRockRack model B450D4U-V1L
root@hels ~ # ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
root@hels ~ #
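If checking the summary by hand gets tedious, something like the following sketch could flag when it stops being clean. It assumes only what the transcript above shows: that a healthy `ras-mc-ctl --summary` prints lines starting with "No ". I have not seen what the output looks like when errors exist, so the sketch simply treats any other non-empty line as worth a look. The function names `error_lines` and `read_summary` are my own, made up for illustration.

```python
import subprocess
from typing import List


def error_lines(summary: str) -> List[str]:
    """Return the non-empty summary lines that do NOT start with 'No '.

    Assumption: a clean report consists only of 'No ... errors.' lines,
    as in the transcript; anything else is returned for a human to read.
    """
    stripped = (line.strip() for line in summary.splitlines())
    return [line for line in stripped if line and not line.startswith("No ")]


def read_summary() -> str:
    """Run 'ras-mc-ctl --summary' and return its standard output as text."""
    result = subprocess.run(
        ["ras-mc-ctl", "--summary"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a machine with rasdaemon installed, `error_lines(read_summary())` should come back as an empty list while the summary looks like the one above. Run from cron, a non-empty list could trigger an email or whatever alerting one prefers.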
Tom. 穆坦然. Not Oles. Happy New York City guy visiting Mexico! How is your 文言文?
The MetalVPS.com website runs very speedily on MicroLXC.net! Thanks to @Neoon!
Comments
Might not be a big deal with docker containers or anything that could restart on failure.
But would be a disaster if the same happens on a DB server.
What about when the DB server is running in Docker? I think the isolation of volumes and immutable images are pushing DB servers into containers.
It might still be an issue, as Rowhammer attacks can be used to escalate privileges and thus pretty much render any sandboxing useless. Not an easy attack, and likely to get spotted if you carefully monitor your machines, but one should be aware that it is possible when enough energy is spent on it. Heck, there even was a proof of concept published in 2015 for privilege escalation in web browsers using JavaScript.
You have a point! I did not look at it from an attacker's point of view. Once inside, they have full access to the internal VPN and to config files to extract and connect to.
Yeah saw that - was an interesting read. Didn't know one could monitor it.
I think the relative importance of risks is also worth keeping in mind though. I'm not in charge of a datacenter, so my data losses are more likely to come from a bad config / backup / vulnerability not on point etc. than from Rowhammer or a cosmic bit flip.
Or sometimes just stupidity... deleted an SSH key the other day. Whoops.
When @david and I were setting up my old, now gone, OVH servers for our original giveaway program, we had a mysterious crash. We spent several days trying to figure out what might have happened. I got back into the server and retrieved the logs, but no joy. We then spent several weeks wondering about reliability and trying extended testing.
There were no more crashes at OVH and there have not been any crashes on the new Hetzner box. Nevertheless, when evaluating relative importance of ECC monitoring, maybe incorporating the time and trouble spent debugging into the comparison metric might be good.
Yeah, when offering a service to others for payment, I'd def expect ECC to be considered.
Sorry, the above comment wasn't really meant to be dismissive of your use case. I just think Torvalds's point is a little too broad and overstated, in that a big chunk of computers are used for FB and browsing cat pictures... not exactly something you need to guard against cosmic rays (though if you can for no additional cost, sure, why not).