A Cautionary Tale About Fanless Systems

I’ve had fanless systems for a few years, and I recently decided it was time for an upgrade. I started with a Protectli server a few years ago, and some time later I got a Yanling machine. This time, I wanted to upgrade once more because I want to start using Kubernetes in my home infrastructure. I know how annoying this can be, but I want to do this as a learning experience.

Both Yanling and Protectli machines were amazing, but it’s important to note that hardware-wise (memory and storage) they were very similar to each other.

The new one I bought has a 13th Gen i7 processor, support for up to 64 GB of DDR5 RAM, two M.2 2280 NVMe slots, and even an additional SATA port for yet another SSD.

I got two 4 TB NVMe SSDs (WD Black) and 64 GB of memory for it. I installed Proxmox on it and migrated the VMs I was already running to it. The process was a breeze. It’s important to note that, since I wanted this to be an improvement, I set up the SSDs as a ZFS pool (RAID 5) instead of using ext4.

After I got my VMs up and running, I replaced the old server with this one. Up to this point, everything was amazing. I had achieved everything with only 30 minutes of downtime. The new server started operating at around 01:00 AM.

The problems started the next morning. At some point, the server crashed (the server itself, not a VM). I analyzed the Proxmox logs and eventually noticed that one of the drives had reached 95°C; even the server’s chassis was burning hot. The overheating caused the drive to be disconnected from the array, which was then marked as DEGRADED. I put my original server back in place and started to investigate.
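A degraded pool like this can also be detected from a script instead of by reading logs. The following is only a minimal Python sketch, assuming the `zpool` command is available on the host; the function names and the sample pool names in the usage note are hypothetical and not taken from my setup.

```python
import subprocess


def parse_pool_health(output):
    """Map pool name to health from `zpool list -H -o name,health` output."""
    pools = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # With -H, zpool prints no header and one "name<TAB>health" pair per line.
        name, health = line.split()
        pools[name] = health
    return pools


def unhealthy_pools():
    """Return every pool whose health is anything other than ONLINE."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        check=True,
    )
    pools = parse_pool_health(result.stdout)
    return {name: health for name, health in pools.items() if health != "ONLINE"}
```

Calling `unhealthy_pools()` from a cron job and sending a notification when the result is not empty would have flagged the DEGRADED state as soon as the drive dropped out.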

At this point, I did some research. I suspected that ZFS might be too hard on the SSDs and that this could be the cause of the problem. However, most of the information and opinions I found said that the impact of ZFS should be negligible and should not affect the temperature significantly.

While debugging all this, I noticed that, for some reason, the date and time in Proxmox would not persist. I didn’t give this much thought and put it down to Proxmox not being able to reach the NTP server. This will be important later.

After checking my options, I decided to add heatsinks to both NVMe SSDs. After I did that, I left the server running (in a controlled environment). I wrote a quick bash script to monitor the temperature of both drives every 5 minutes.
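A monitor like this can be quite small. The block below is not the original bash script, only a minimal Python sketch of the same idea. It assumes a Linux kernel recent enough (5.5 or later) to expose NVMe temperatures through the hwmon interface under `/sys/class/hwmon`; the function names and the log file name are made up for illustration.

```python
import time
from datetime import datetime
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return the temperature in °C of every NVMe sensor found under hwmon_root."""
    temps = []
    for sensor in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = sensor / "name"
        temp_file = sensor / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        # The kernel's NVMe hwmon driver registers its sensors under the name "nvme".
        if name_file.read_text().strip() != "nvme":
            continue
        # temp1_input is reported in millidegrees Celsius.
        temps.append(int(temp_file.read_text().strip()) / 1000)
    return temps


def format_log_line(temps, now=None):
    """Build one log line: a timestamp followed by one rounded reading per drive."""
    now = now or datetime.now()
    readings = " ".join(str(round(t)) for t in temps)
    return f"{now:%Y-%m-%d %H:%M:%S} {readings}"


def monitor(log_path="nvme_temps.log", interval_seconds=300):
    """Append one reading to log_path every interval_seconds (5 minutes by default)."""
    while True:
        with open(log_path, "a") as log:
            log.write(format_log_line(read_nvme_temps()) + "\n")
        time.sleep(interval_seconds)
```

Calling `monitor()` starts the loop. Reading from sysfs avoids needing root or extra tools, at the cost of depending on the kernel driver being present.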

Although neither of them reached 95°C this time, they got dangerously close, particularly when the nightly backup cron job ran. Even after the job finished, the drives were slow to cool down.

Timestamp            Temperature 1 (°C)  Temperature 2 (°C)
2024-08-24 23:05:01  76                  79
2024-08-24 23:10:01  76                  78
2024-08-24 23:15:01  75                  78
2024-08-25 08:35:01  75                  77
2024-08-25 08:40:01  75                  77
2024-08-25 08:45:01  75                  77
2024-08-25 08:50:01  75                  77
2024-08-25 08:55:01  75                  77
2024-08-25 08:58:57  81                  84
2024-08-25 09:00:01  84                  86
2024-08-25 09:02:11  84                  85
2024-08-25 09:02:49  84                  86
2024-08-25 09:05:01  84                  91
2024-08-25 09:05:23  84                  90
2024-08-25 09:06:06  84                  89
2024-08-25 09:08:35  83                  87

It seemed heatsinks would not be enough for this to be within safe limits. Ideally, I was aiming for ~60°C without too much load and 70°C under stress.
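To make the gap between those targets and the measurements concrete, here is a small Python sketch that flags log lines above a limit. It assumes the log format shown above (date, time, then one reading per drive); the function and constant names are hypothetical, and the sample rows are copied from the log.

```python
IDLE_TARGET_C = 60    # target without much load, from the text
STRESS_TARGET_C = 70  # target under stress, from the text


def parse_log_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS t1 t2' into (timestamp, [t1, t2])."""
    date, clock, *readings = line.split()
    return f"{date} {clock}", [int(r) for r in readings]


def readings_over(lines, limit_c):
    """Return (timestamp, hottest reading) for each line whose hottest drive exceeds limit_c."""
    flagged = []
    for line in lines:
        timestamp, temps = parse_log_line(line)
        if max(temps) > limit_c:
            flagged.append((timestamp, max(temps)))
    return flagged


# A few rows copied from the log above.
sample = [
    "2024-08-24 23:05:01 76 79",
    "2024-08-25 08:55:01 75 77",
    "2024-08-25 09:05:01 84 91",
]

for timestamp, hottest in readings_over(sample, STRESS_TARGET_C):
    print(f"{timestamp}: {hottest}°C is {hottest - STRESS_TARGET_C}°C over the stress target")
```

Every sampled row exceeds even the 70°C stress target, including the ones taken while the drives were mostly idle. Passing `IDLE_TARGET_C` instead checks the same rows against the idle target.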

To achieve this, it seemed a fan was in order. I asked the vendor and was told that the system supports 8010 fans, so I ordered one. While waiting for it to arrive, I wanted to get the rest of the new machine ready.

It was time to find out what was wrong with the date. Eventually, I noticed that the time was wrong not only in the OS but also in the BIOS. I tried setting the correct date and time in the BIOS several times, but it would not persist after unplugging the server and plugging it back in.

This usually means that the CMOS battery is failing, so I bought a new one and replaced it. Even after replacing it, the battery was drained within a couple of hours (this was tested with a multimeter). The new battery went from 3V to 1V in a few hours. I tried with yet another battery, but it happened again.

I can’t be sure what happened, but I suspect that when one SSD heated up significantly, it also affected the mainboard, causing the CMOS batteries to drain quickly.

I wanted to share this so it doesn’t happen to someone else. If you get a fanless system, make sure it can properly deal with the amount of heat your components are producing.
