It's surprising that rebooting the Pi 4 fixed the Pi 5 server and you didn't have to reboot the Pi 5. Even so, I think it's more likely the Pi 5 was at fault.
Hi,
I've fallen into NFS hell again. I recently added a pi5 with 4xSATA hat to my homelab and moved the 2 SSDs I had previously connected to other linux servers via USB enclosures to this new one. It worked nicely for 1-2 weeks. Tonight, the weirdest of issues happened. This was ultimately resolved by a reboot, but I'd like to understand.
Setup:
- 1 pi 4 running many things including Kodi, with a projector connected via HDMI, and jellyfin
- 1 pi 5 with the SATA hat and the drives, running nfs-kernel-server and samba (the pi4 is connected to it via nfs)
- other linux servers (x86)
- all the above connected via a tp-link switch, itself connected to my ISP's modem/router/wifi point
- a macbook connected to the router over wifi
The issue as I went through it:
1. watching a movie over jellyfin from the macbook in the late afternoon worked, but was very slow to start, when normally everything is super snappy; didn't think too much of it at first
2. later, trying to watch another one from the pi4 with kodi, it would load extremely slowly, stutter, buffer... it was unusable, so I started wondering if something was wrong with the file server
3. connected to the pi5, I found thousands of dmesg log lines (typically ~50 within the same second) like
Code:
Oct 17 20:49:23 pi5 kernel: rpc-srv/tcp: nfsd: sent 1045898 when sending 1045896 bytes - shutting down socket
The exact same line, with those exact numbers, every time.
I tried to see what could be wrong on the server side. I restarted nfs, rebooted several times, tried to tune some parameters a bit randomly based on what I could see at https://serverfault.com/questions/88049 ... ng-1048708, apt upgrade'd (triggering a kernel upgrade) and rebooted once more, rebooted my ISP's router just in case... nothing solved it other than shutting down nfs-kernel-server, which wasn't very useful as all my linux servers depend on this.
4. it occurred to me to ssh into the pi4 as well, to see if it had interesting nfs logs on its side. It didn't. I had a number of HDMI CEC timeout errors, which is probably a separate issue that I need to solve. But nothing related to nfs. However, I tried ls-ing the nfs mount and it was very slow to respond (though it did), so even such a basic operation was being affected.
5. I thought of connecting to the other linux servers I have and doing the same ls command (on the same mount that they all share). It was super snappy, no issue on those!
6. That's when I thought: well, maybe I should reboot the pi4. I did. Problem solved, I watched the whole movie, not a single nfsd error line on the pi5.
So... rebooting the pi4, ONE of the several nfs clients connected to the pi5's nfs server, solved the issue that was not being logged anywhere on the pi4 itself but was being logged in the pi5's logs. The issue that apparently had also affected my macbook earlier on - but not the other linux servers. My brain explodes.
While googling the rpc-srv/tcp error, I found only fairly old posts (2016-2018). I found not a single one where, like in my case, the complaint was that the bytes sent were MORE than the bytes "sending" -- all the posts I found were about "sent ONLY x bytes when sending y bytes" where y > x. And nothing at all in that vein from the last few years.
=> does anyone have any idea of what happened here?
=> can I do anything to avoid this happening again in the future?
Feel free to ask for any additional details such as configurations etc. I can't think of anything relevant to include, to be honest.
Thanks!
P.
If NFS slows down again, I'd suggest rebooting the Pi 5 to see if that also fixes the errors.
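To tell whether a reboot really stopped the errors, it may help to count them rather than scroll through dmesg. Below is a minimal Python sketch, assuming the log lines look exactly like the one quoted above (syslog-style timestamp, hostname, then the rpc-srv/tcp message); other kernel versions may word the message differently. The function name `summarize` and the regular expression are my own illustration, not part of any NFS tool.

```python
import re
from collections import Counter

# Assumption: lines follow the format of the message quoted in the post, e.g.
# "Oct 17 20:49:23 pi5 kernel: rpc-srv/tcp: nfsd: sent X when sending Y bytes - ..."
NFSD_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"rpc-srv/tcp: nfsd: sent (?P<sent>\d+) when sending (?P<expected>\d+) bytes"
)


def summarize(lines):
    """Count matching nfsd errors per timestamp second and per byte difference.

    Returns two Counters: errors keyed by timestamp string, and errors keyed
    by (bytes sent - bytes expected).
    """
    per_second = Counter()
    deltas = Counter()
    for line in lines:
        match = NFSD_LINE.match(line)
        if match is None:
            continue  # unrelated log line
        per_second[match.group("stamp")] += 1
        deltas[int(match.group("sent")) - int(match.group("expected"))] += 1
    return per_second, deltas


# Small demonstration using the single line quoted in the post.
sample = [
    "Oct 17 20:49:23 pi5 kernel: rpc-srv/tcp: nfsd: sent 1045898 "
    "when sending 1045896 bytes - shutting down socket",
]
per_second, deltas = summarize(sample)
print("errors per second:", dict(per_second))
print("sent minus expected:", dict(deltas))
```

One way to use it would be to save the kernel log to a file (for example the output of `journalctl -k`) and pass the open file to `summarize`. Running it before and after rebooting either Pi shows whether the error rate actually dropped to zero, and whether the byte difference is always the same.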
Statistics: Posted by ejolson — Fri Oct 18, 2024 5:49 am