Hi all,
I am testing a ZFS mirror pool on a Raspberry Pi 5 (16 GB of RAM) in a Pironman 5 MAX case.
After a lot of testing I have found a consistent, reproducible problem that completely hangs Raspberry Pi OS (Debian 13 "trixie", the Lite version, terminal only).
With the two drives in the ZFS pool, both online, if I hammer the pool with fio like this:
fio --name=zfs-stability --directory=/mnt --rw=randrw --rwmixread=70 --bs=64k --ioengine=libaio --iodepth=32 --size=2G --numjobs=4 --runtime=3600s --time_based --group_reporting
Two to three minutes in, the Pi completely freezes, and dmesg shows the following:
[ 431.584404] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 431.584414] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 431.584417] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[ 431.925272] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 431.925275] nvme nvme1: Does your device have a faulty power saving mode enabled?
[ 431.925277] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[ 432.072403] nvme 0001:03:00.0: enabling device (0000 -> 0002)
[ 432.072420] nvme nvme0: Disabling device after reset failure: -19
Note that I am using a 5.1 V / 5 A power supply; dmesg | grep -i "voltage" shows nothing, and vcgencmd get_throttled returns throttled=0x0.
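For anyone repeating this check, a raw 0x0 is easy to misread when flags are actually set, so here is a minimal sketch of a decoder for the get_throttled bitmask (bit meanings as documented for Raspberry Pi firmware; the function name is mine):

```shell
#!/bin/sh
# Decode the vcgencmd get_throttled bitmask. Documented bits:
#   bit 0: under-voltage now          bit 16: under-voltage has occurred
#   bit 1: arm frequency capped now   bit 17: capping has occurred
#   bit 2: currently throttled        bit 18: throttling has occurred
decode_throttled() {
    val=$(( $1 ))
    [ $(( val & 0x1 ))     -ne 0 ] && echo "under-voltage detected now"
    [ $(( val & 0x2 ))     -ne 0 ] && echo "arm frequency capped now"
    [ $(( val & 0x4 ))     -ne 0 ] && echo "currently throttled"
    [ $(( val & 0x10000 )) -ne 0 ] && echo "under-voltage has occurred since boot"
    [ $(( val & 0x20000 )) -ne 0 ] && echo "frequency capping has occurred since boot"
    [ $(( val & 0x40000 )) -ne 0 ] && echo "throttling has occurred since boot"
    [ "$val" -eq 0 ] && echo "no power or thermal flags set"
    return 0
}

# On the Pi: decode_throttled "$(vcgencmd get_throttled | cut -d= -f2)"
decode_throttled 0x0
```

In my case it reports no flags at all, which is one more hint that the supply itself is not the problem.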
After the Pi hangs, I have to physically unplug the USB-C power, wait a few minutes, and power back on. If I just power-cycle it immediately, the two orange drive-activity LEDs on the HAT blink very faintly (not bright) and the disks are not recognized; I have to unplug it again.
I have been running repeated tests for days with all kinds of settings (including adding nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off to /boot/firmware/cmdline.txt, as the dmesg message suggests).
Nothing works.
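In case it helps anyone else trying the same workaround, here is a small sketch to confirm the parameters actually reached the running kernel (a classic pitfall is that cmdline.txt must stay on a single line, so anything after an accidental newline is silently ignored; the check_param helper is my own):

```shell
#!/bin/sh
# Verify the NVMe power-saving workarounds are active in the running kernel.
check_param() {
    if grep -qF "$1" /proc/cmdline 2>/dev/null; then
        echo "present: $1"
    else
        echo "MISSING: $1"
    fi
}

check_param "nvme_core.default_ps_max_latency_us=0"
check_param "pcie_aspm=off"
check_param "pcie_port_pm=off"

# If the first parameter took effect, this should print 0:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us 2>/dev/null || true
```

I verified all three show as present on my system, and the controllers still drop.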
Now… if I PHYSICALLY remove one of the disks, the ZFS mirror pool comes up degraded, and the fio test chugs along with no problem. It has been running for almost an hour now, with very healthy 400 MiB/s reads and 180 MiB/s writes.
Both disks are 1 TB Kingston KC3000/FURY Renegade NVMe SSDs [E18] (rev 01, prog-if 02 [NVM Express]), which are specifically listed in "Compatible NVMe SSDs" in the SunFounder Pironman 5 documentation. During the test I watch both drives with smartctl -a /dev/nvme0n1 | grep -i 'Temperature'; smartctl -a /dev/nvme1n1 | grep -i 'Temperature', and they stay cool, below 50 ºC the whole time (both have heatsinks, and the two small fans in the Pironman 5 MAX case are running).
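Since the hang kills the console, I also log the temperatures to disk so the last readings before a freeze survive the power cut. A crude sketch, assuming smartmontools and the device names above (stamp and log_once are my own helpers):

```shell
#!/bin/sh
# Timestamped NVMe temperature logger; sync after each pass so the last
# readings before a hard freeze are preserved on disk.
stamp() {
    # prefix each stdin line with a timestamp and a device label
    while IFS= read -r line; do
        printf '%s %s %s\n' "$(date '+%F %T')" "$1" "$line"
    done
}

log_once() {
    for dev in /dev/nvme0n1 /dev/nvme1n1; do
        smartctl -a "$dev" | grep -i '^Temperature:' | stamp "$dev"
    done
}

# Run in the background before starting fio, then inspect the tail after a hang:
#   while :; do log_once; sync; sleep 5; done >> /var/log/nvme-temps.log
```

The tail of the log after each crash shows nothing unusual: both drives are well under 50 ºC right up to the freeze.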
This is very frustrating.
It seems that with both disks active, the ASMedia switch resets after 1-2 minutes and drops both drives completely, and the Pi hangs. That makes my use case, a ZFS mirror for redundancy, non-viable. Given that fio and the ZFS pool chug along fine with a single disk, I do not think this is a ZFS issue.
Is this a limitation of the ASM1182e 2-port PCIe x1 Gen2 packet switch on the HAT?
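Worth noting: a Gen2 x1 upstream link tops out around 500 MB/s shared between both SSDs, so the drives also share a single upstream link through the switch. To see what each port actually negotiated, a small sketch that pulls the link state out of lspci -vv output (link_state is my own helper; the bus address below is an example):

```shell
#!/bin/sh
# Extract the negotiated PCIe link state (speed/width) from lspci -vv text.
link_state() {
    # expects `lspci -vv` output on stdin; prints the LnkSta details
    grep -o 'LnkSta:.*' | sed 's/LnkSta:[[:space:]]*//'
}

# On the Pi (find your own device addresses with plain lspci):
#   sudo lspci -vv -s 0000:01:00.0 | link_state
# Demo on a sample line:
printf 'LnkSta:\tSpeed 5GT/s, Width x1\n' | link_state
```

If either port reports a downgraded speed or width under load, that would point at the switch or cabling rather than the drives.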
I have already tried a LOT of things:
- Tried the other flat PCIe cable that comes with the Pironman 5 MAX case. Same results.
- I even purchased and tried ANOTHER dual-NVMe HAT, from Freenove (it takes 5 V from the Raspberry Pi the same way and uses the same ASMedia chip). It also comes with its own flat cable, which I used. Same results.
I understand this is a very hard one to crack.
Is there something that I can do?
If this is a limitation of the HAT… where is that stated? Mirroring drives, with ZFS or any other software RAID, should be a pretty common use case, right?