NFS issues since upgrading to 13-RELEASE

Chris Roose

2021-04-15 13:22:45 UTC

I posted this in -questions and someone suggested I post here as well.

I'm having NFS availability issues between my Proxmox client and FreeBSD server (10G link) since upgrading to 13-RELEASE. And unfortunately I upgraded my ZFS pool to v2.0.0 before I noticed the issue, so I'm kind of stuck.

Periodically, the NFS server (I've tried both v3 and v4.2 clients) will go unresponsive for several minutes. I never had this problem on 12.2, and as far as I can tell it's not a disk or network I/O issue. I'll get several "nfs: server not responding, still trying" messages on the client and a few minutes later it usually recovers. It's not clear to me yet what's causing the block. Restarting nfsd on the server will resolve the issue if it doesn't clear itself.

Any pointers for troubleshooting this? I've been looking through vmstat, gstat, top, etc. when the problem occurs, but I haven't been able to pinpoint the issue. I can get pcap, but it would be from the hosts, because I don't have a 10G tap or managed switch.
--
Chris
_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-***@freebsd.org"

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Allan Jude

2021-04-15 18:35:22 UTC

Run `nfsstat -d 1` and try to capture a few lines from before, during,
and after the stall; that may provide some insight.

Specifically, does the queue length grow, suggesting it is waiting on
the I/O subsystem, or does it just stop getting traffic altogether?
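
One way to make the before/during/after comparison easier is to timestamp each sample as it is captured. The following is a minimal Python sketch, not something from the thread: the `stamp` and `stamp_stream` helper names are made up for illustration, and it assumes the `nfsstat -d 1` output is piped into a small script built around them.

```python
from datetime import datetime, timezone


def stamp(line, now):
    """Prefix one output line with a UTC timestamp (hypothetical helper)."""
    return f"{now.strftime('%Y-%m-%dT%H:%M:%SZ')} {line.rstrip()}"


def stamp_stream(lines, clock=lambda: datetime.now(timezone.utc)):
    """Yield each incoming line with the time at which it was read."""
    for line in lines:
        yield stamp(line, clock())


# Example with a fixed clock so the result is reproducible.
fixed = datetime(2021, 4, 15, 19, 21, 7, tzinfo=timezone.utc)
for out in stamp_stream(["0.00 0 0.00\n"], clock=lambda: fixed):
    print(out)
```

In practice the input would be `sys.stdin`, for example `nfsstat -d 1 | python3 stamp.py > nfsstat.log` (the script name is an assumption). The timestamps record when each line was read, which may lag behind the sample itself if the output is buffered in the pipe.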
--
Allan Jude

Rick Macklem

2021-04-15 20:47:23 UTC

otis@ has run into a problem that sounds similar.
He sees a growing Recv-Q size on the server for the TCP connection from the client
when "netstat -a" is run on the server while the "hang" is occurring.
In his case, he is using a Linux client and it does not recover; however, other client
mounts continue to function.
I suspect the recovery after a few minutes is the client establishing a new TCP
connection.
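
A rough way to check for the growing Recv-Q described above is to compare two `netstat` snapshots taken some seconds apart. The Python sketch below is illustrative only: it assumes numeric output (`netstat -an` rather than the plain `netstat -a` mentioned above), the NFS port 2049, and the usual column order (protocol, Recv-Q, Send-Q, local address, foreign address, state). The addresses in the sample text are invented, and the function names are hypothetical.

```python
def recv_queues(netstat_text, port="2049"):
    """Map (local, foreign) to Recv-Q for established TCP sockets on `port`."""
    queues = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected fields: proto, Recv-Q, Send-Q, local, foreign, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED" or not fields[1].isdigit():
            continue
        local, foreign = fields[3], fields[4]
        # Assumption: addresses are printed as "address.port".
        if local.rsplit(".", 1)[-1] == port:
            queues[(local, foreign)] = int(fields[1])
    return queues


def growing(before, after):
    """Return connections whose Recv-Q is larger in the second snapshot."""
    return sorted(conn for conn, q in after.items() if q > before.get(conn, 0))


# Invented sample snapshots, for illustration only.
BEFORE = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 192.0.2.10.2049        192.0.2.20.851         ESTABLISHED
tcp4       0      0 192.0.2.10.2049        192.0.2.30.912         ESTABLISHED
tcp4       0      0 *.2049                 *.*                    LISTEN
"""
AFTER = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4  261340      0 192.0.2.10.2049        192.0.2.20.851         ESTABLISHED
tcp4       0      0 192.0.2.10.2049        192.0.2.30.912         ESTABLISHED
tcp4       0      0 *.2049                 *.*                    LISTEN
"""

print(growing(recv_queues(BEFORE), recv_queues(AFTER)))
```

A connection that keeps showing up in the result across several snapshots would match the symptom described here, while other clients' connections stay at zero.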

He has been running for almost a week with r367492 reverted and has not reported
seeing the problem again (he had reported that it has taken up to a week to recur, so
reverting r367492 *might* have fixed the problem and I'd guess we'll know in another
week?).

- If using svn to revert the patch is inconvenient, I've attached a patch that can be applied
to revert it.
- Alternately you can try rscheff@'s alternate proposed patch that is at
https://reviews.freebsd.og/D29690.
I have not yet had time to test this one, but since I cannot reproduce the hang, I can
only do testing of it to see that it is "no worse" than reverting r367492 for my
setup.

Please let us know which you choose and whether or not it fixes your problem.

If the revert of r367492 does not fix the problem, monitor the TCP connection(s)
via "netstat -a" and, if possible, capture packets via
tcpdump -s 0 -w hang.pcap host <nfs-client>
or similar, run on the server.

Ideally the tcpdump would be started before the "hang" occurs, but running
one while the hang is occurring (until after it recovers) could also be useful.
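
Since the capture ideally starts before the hang, it may have to run for a long time. The Python sketch below just assembles the `tcpdump` command line from the message above; the optional `-C`/`-W` file-rotation flags are an addition not mentioned in the thread (check tcpdump(1) on your system before relying on them), and the function name and the client address are placeholders.

```python
import shlex


def capture_argv(client, outfile="hang.pcap", rotate_mb=None, keep_files=None):
    """Build a tcpdump argument list (hypothetical helper)."""
    argv = ["tcpdump", "-s", "0", "-w", outfile]
    if rotate_mb is not None and keep_files is not None:
        # Assumption: rotate the output file roughly every rotate_mb
        # million bytes and keep at most keep_files files, to bound disk use.
        argv += ["-C", str(rotate_mb), "-W", str(keep_files)]
    argv += ["host", client]
    return argv


# Plain capture, as suggested in the message above.
print(shlex.join(capture_argv("192.0.2.20")))
# Variant with file rotation for a capture left running ahead of a hang.
print(shlex.join(capture_argv("192.0.2.20", rotate_mb=100, keep_files=10)))
```

The rotating variant trades completeness for bounded disk usage: older packets are overwritten, so the capture has to be stopped soon after the hang is noticed.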

Thanks for reporting this, rick

Rick Macklem

2021-04-15 21:05:24 UTC

I wrote:
[stuff snipped]

Post by Rick Macklem
https://reviews.freebsd.og/D29690.

Oops, that's
https:/reviews.freebsd.org/D29690

rick

Scheffenegger, Richard

2021-04-15 21:14:32 UTC

FWIW:

r367492 fixes an issue around "premature" transmission of an ACK caused by the incoming segment having been only partially processed at the time. It relates to in-kernel TCP consumers that use socket upcalls.

Rick mentioned that the NFS server (one in-kernel TCP user) has stringent requirements on the state of the socket during the upcall, so D29690 retains the lock on the socket buffer until TCP processing is finalized and the upcall can be done without any risk of transmitting outdated information back to the other end.

However, I have no proper way to verify/validate this interaction.

My ask would be to test the behavior with D29690 first, but if similar hangs keep recurring, then revert r367492 (which will also mean more severe surgery on the TCP processing flow).

Thanks.

Richard Scheffenegger

Rick Macklem

2021-04-15 21:07:21 UTC

Stupid Outlook...
I wrote:
[stuff snipped]

Post by Rick Macklem
https://reviews.freebsd.og/D29690.

Oops, that's
https:/reviews.freebsd.org/D29690

But you can figure out the link;-), rick

Juraj Lutter

2021-04-15 21:10:47 UTC

Post by Juraj Lutter
The machine it’s running on is definitely a slow or weak one (it’s dell r740xd with 2x CPU, 256GB RAM, 22xNVMe data zpool).

Is definitely *NOT* a slow or weak one :-)

otis

Olav Gjerde

2021-04-15 19:07:18 UTC

I have the same issue, using Ubuntu 20.10 with the Linux 5.8 kernel. The Linux
NFS client becomes unresponsive and, in my case, does not recover, even
if I restart NFS on FreeBSD. I upgraded from FreeBSD 12.1-RELEASE, though.

--
Kind Regards / Med Vennlig Hilsen

Olav Grønås Gjerde

BackupBay Gjerde
Madlaforen 35
4042 HAFRSFJORD
Norway
Phone: +47 918 000 59

Olav Gjerde

2021-04-15 19:21:07 UTC

Well, something does happen if I restart the NFS service on FreeBSD: it works for
about 10 seconds, then it becomes unresponsive again.

This is my output from `nfsstat -d 1`:

0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
8.00  1025   8.00  8.02 17170 134.54  2.01  72716 142.54  0.07  51  34
8.00  2273  17.76  7.99 31273 244.07  2.01 133267 261.83  0.14  20  82
8.03  4889  38.33  7.99 25885 202.07  2.06 119340 240.40  0.13  21  81
[===== Read =====] [===== Write ====] [=========== Total ============]
KB/t   tps   MB/s  KB/t   tps   MB/s  KB/t    tps   MB/s    ms  ql  %b
7.98  8811  68.64  8.00 12997 101.54  2.22  78396 170.18  0.15   1  80
7.99   922   7.20  8.00  3798  29.68  2.10  17965  36.87  0.09   0  11
8.07  2959  23.31  0.00     0   0.00  2.67   8938  23.31  0.86  32  72
7.97  7088  55.18  0.00     0   0.00  2.66  21233  55.18  1.05  16  98
7.98  4666  36.38  0.00     0   0.00  2.66  13986  36.38  0.36   9  29
8.00  4513  35.24  8.00  7662  59.86  2.20  44188  95.10  0.27  10  49
7.98  4799  37.40  8.00 11422  89.23  2.16  60076 126.63  0.19   0  51
8.00  4322  33.76  0.00     0   0.00  2.67  12967  33.76  0.89   0  42
8.02  4839  37.91  0.00     0   0.00  2.67  14550  37.91  0.54  17  41
8.01  4516  35.32  0.00     0   0.00  2.67  13569  35.32  0.57  27  38
7.95  4459  34.62  8.00  1195   9.34  2.49  18109  43.96  0.55   0  45
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
0.00     0   0.00  0.00     0   0.00  0.00      0   0.00  0.00   0   0
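
To see whether the queue grows or the traffic simply stops, output like the above can be summarized mechanically. This is only a sketch in Python: the column names in `FIELDS` are invented labels for the twelve columns under the Read/Write/Total header, the function names are hypothetical, and the sample is a short excerpt of the rows above.

```python
FIELDS = ("r_kbt", "r_tps", "r_mbs", "w_kbt", "w_tps", "w_mbs",
          "t_kbt", "t_tps", "t_mbs", "ms", "ql", "busy")


def parse_rows(text):
    """Parse numeric sample rows, skipping the periodically repeated headers."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue
        try:
            rows.append(dict(zip(FIELDS, map(float, parts))))
        except ValueError:
            continue  # column-name header line
    return rows


def idle_runs(rows):
    """Return (first_index, length) for each run of samples with zero total tps."""
    runs, start = [], None
    for i, row in enumerate(rows):
        if row["t_tps"] == 0:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(rows) - start))
    return runs


# Excerpt of the output posted above.
SAMPLE = """\
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
8.00 1025 8.00 8.02 17170 134.54 2.01 72716 142.54 0.07 51 34
8.00 2273 17.76 7.99 31273 244.07 2.01 133267 261.83 0.14 20 82
[===== Read =====] [===== Write ====] [=========== Total ============]
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s ms ql %b
7.95 4459 34.62 8.00 1195 9.34 2.49 18109 43.96 0.55 0 45
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
"""

rows = parse_rows(SAMPLE)
print("idle runs:", idle_runs(rows))
print("ql during idle samples:", [r["ql"] for r in rows if r["t_tps"] == 0])
```

In this excerpt the idle samples show `ql` at 0 rather than a growing queue, though a short excerpt cannot settle the question on its own.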

Olav Gjerde

2021-04-19 08:30:05 UTC

I tried the D29690 patch and reverting back to r367492 this weekend.
Neither made any difference for my system.
There is also a Reddit thread about this:
https://www.reddit.com/r/freebsd/comments/mqol4o/nfs_issues_since_upgrading_to_13release/

Rick Macklem

2021-04-19 15:03:35 UTC

Post by Olav Gjerde
I have tried D29690 patch and reverting back to r367492 this weekend. Neither made any difference for my system.

Just to clarify, I meant "revert the patch in r367492" and not
"revert to revision r367492". I've attached the patch that
backs out the changes made by the patch in r367492, which
should apply to a fairly recent main/13 kernel.

This should be done instead of applying D29690, not combined with it.
My testing of D29690 has suggested it is not yet mature, so
I would not recommend choosing that alternative yet.

If you have tried a kernel with the attached patch applied to it, but not
D29690 applied to it, then please:

Let us know if you still have Linux clients "hanging" with this kernel.

If still "hanging", try the following to see if they help:
- Use the "minorversion=1" mount option on the Linux clients, to ensure
  that they are not using NFSv4.2, to see if it is an NFSv4.2-specific
  issue.
- Try disabling TSO and LRO, and avoid jumbo frames for drivers that use
  jumbo mbufs when handling jumbo frames.

Collect the following info when it happens:
- "netstat -a", to see what the TCP connection is up to.
- "tcpdump -s 0 -w hang.pcap host <nfs-client>" run for several minutes
  on the server, to see what is going on on the wire. I use Wireshark to
  look at hang.pcap, since it knows NFS as well as TCP. You can also do
  the above with "host <nfs-server>" instead of "host <nfs-client>", run
  on the client.
- "ps axHl" on the server, to see what the nfsd threads are up to.

If none of the above contains confidential info, please send it to me, if
not the list.
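
The snapshot commands in the list above could be gathered in one step when a hang is noticed. The Python sketch below is one possible arrangement, not part of the original advice: the output directory, file names, and function names are assumptions, and only `netstat -a` and `ps axHl` are taken from the list (the packet capture is left out because it has to run for several minutes).

```python
import subprocess
from pathlib import Path

# Commands taken from the list above; the file names are arbitrary.
COMMANDS = {
    "netstat.txt": ["netstat", "-a"],
    "ps.txt": ["ps", "axHl"],
}


def plan(outdir, tag):
    """Return (output path, argv) pairs for one snapshot (hypothetical helper)."""
    return [(Path(outdir) / f"{tag}-{name}", argv)
            for name, argv in COMMANDS.items()]


def collect(outdir, tag, run=subprocess.run):
    """Run each command and save its standard output to a file."""
    written = []
    for path, argv in plan(outdir, tag):
        result = run(argv, capture_output=True, text=True, check=False)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(result.stdout)
        written.append(path)
    return written


# Show what would be collected, without running anything.
for path, argv in plan("nfs-hang", "20210419T1503"):
    print(path, "<-", " ".join(argv))
```

Calling `collect("nfs-hang", tag)` on the server during a hang would write both files; the tag is any label, such as a timestamp, that keeps successive snapshots apart.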

Good luck with it, rick
ps: Yea, I started this post and then realized I had hit
reply instead of reply all.

There is also a reddit thread about this https://www.reddit.com/r/freebsd/comments/mqol4o/nfs_issues_since_upgrading_to_13release/

On Sat, Apr 17, 2021 at 1:10 AM Rick Macklem <***@uoguelph.ca<mailto:***@uoguelph.ca>> wrote:
Just fyi, I just got a "recursed on non-recursed mutex" panic in
socantrcvmore() with the D29690 patch, so you might not
want to test with that one yet.

rick

________________________________________
From: owner-freebsd-***@freebsd.org<mailto:owner-freebsd-***@freebsd.org> <owner-freebsd-***@freebsd.org<mailto:owner-freebsd-***@freebsd.org>> on behalf of Olav Gjerde <***@backupbay.com<mailto:***@backupbay.com>>
Sent: Thursday, April 15, 2021 3:21 PM
To: Allan Jude
Cc: freebsd-***@freebsd.org<mailto:freebsd-***@freebsd.org>
Subject: Re: NFS issues since upgrading to 13-RELEASE

CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to ***@uoguelph.ca<mailto:***@uoguelph.ca>

Well something do happen if I restart NFS Service on FreeBSD , it works for
like 10 seconds then it gets unresponsive again.

This is my output from `nfsstat -d 1`

0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
8.00 1025 8.00 8.02 17170 134.54 2.01 72716 142.54 0.07 51 34
8.00 2273 17.76 7.99 31273 244.07 2.01 133267 261.83 0.14 20 82
8.03 4889 38.33 7.99 25885 202.07 2.06 119340 240.40 0.13 21 81
[===== Read =====] [===== Write ====] [=========== Total ============]
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s ms ql %b
7.98 8811 68.64 8.00 12997 101.54 2.22 78396 170.18 0.15 1 80
7.99 922 7.20 8.00 3798 29.68 2.10 17965 36.87 0.09 0 11
8.07 2959 23.31 0.00 0 0.00 2.67 8938 23.31 0.86 32 72
7.97 7088 55.18 0.00 0 0.00 2.66 21233 55.18 1.05 16 98
7.98 4666 36.38 0.00 0 0.00 2.66 13986 36.38 0.36 9 29
8.00 4513 35.24 8.00 7662 59.86 2.20 44188 95.10 0.27 10 49
7.98 4799 37.40 8.00 11422 89.23 2.16 60076 126.63 0.19 0 51
8.00 4322 33.76 0.00 0 0.00 2.67 12967 33.76 0.89 0 42
8.02 4839 37.91 0.00 0 0.00 2.67 14550 37.91 0.54 17 41
8.01 4516 35.32 0.00 0 0.00 2.67 13569 35.32 0.57 27 38
7.95 4459 34.62 8.00 1195 9.34 2.49 18109 43.96 0.55 0 45
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0

Post by Olav Gjerde
I have the same issue, using Ubuntu 20.10 with Linux 5.8 kernel. The Linux
NFS client will get unresponsive and it does not recover in my case, even
if I restart NFS on FreeBSD. I upgraded from FreeBSD 12.1-RELEASE though.

Post by Allan Jude

Post by Chris Roose
I posted this in -questions and someone suggested I post here as well.
I'm having NFS availability issues between my Proxmox client and FreeBSD
server (10G link) since upgrading to 13-RELEASE. And unfortunately I
upgraded my ZFS pool to v2.0.0 before I noticed the issue, so I'm kind of
stuck.
Periodically, the NFS server (I've tried both v3 and v4.2 clients) will go
unresponsive for several minutes. I never had this problem on 12.2, and as
far as I can tell it's not a disk or network I/O issue. I'll get several
"nfs: server not responding, still trying" messages on the client and a few
minutes later it usually recovers. It's not clear to me yet what's causing
the block. Restarting nfsd on the server will resolve the issue if it
doesn't clear itself.
Any pointers for troubleshooting this? I've been looking through vmstat,
gstat, top, etc. when the problem occurs, but I haven't been able to
pinpoint the issue. I can get pcap, but it would be from the hosts, because
I don't have a 10G tap or managed switch.

run `nfsstat -d 1` and try to capture a few lines from before, during,
and after the stall, and that may provide some insight.
Specifically, does the queue length grow, suggesting it is waiting on
the I/O subsystem, or does it just stop getting traffic altogether.
--
Allan Jude

--
Kind Regards / Med Vennlig Hilsen
Olav Grønås Gjerde
BackupBay Gjerde
Madlaforen 35
4042 HAFRSFJORD
Norway
Phone: +47 918 000 59

Rick Macklem

2021-04-16 23:10:13 UTC

Just fyi, I just got a "recursed on non-recursed mutex" panic in
socantrcvmore() with the D29690 patch, so you might not
want to test with that one yet.

rick

________________________________________
From: owner-freebsd-***@freebsd.org <owner-freebsd-***@freebsd.org> on behalf of Olav Gjerde <***@backupbay.com>
Sent: Thursday, April 15, 2021 3:21 PM
To: Allan Jude
Cc: freebsd-***@freebsd.org
Subject: Re: NFS issues since upgrading to 13-RELEASE

Well, something does happen if I restart the NFS service on FreeBSD: it works
for about 10 seconds, then it gets unresponsive again.

This is my output from `nfsstat -d 1`

0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
8.00 1025 8.00 8.02 17170 134.54 2.01 72716 142.54 0.07 51 34
8.00 2273 17.76 7.99 31273 244.07 2.01 133267 261.83 0.14 20 82
8.03 4889 38.33 7.99 25885 202.07 2.06 119340 240.40 0.13 21 81
[===== Read =====] [===== Write ====] [=========== Total ============]
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s ms ql %b
7.98 8811 68.64 8.00 12997 101.54 2.22 78396 170.18 0.15 1 80
7.99 922 7.20 8.00 3798 29.68 2.10 17965 36.87 0.09 0 11
8.07 2959 23.31 0.00 0 0.00 2.67 8938 23.31 0.86 32 72
7.97 7088 55.18 0.00 0 0.00 2.66 21233 55.18 1.05 16 98
7.98 4666 36.38 0.00 0 0.00 2.66 13986 36.38 0.36 9 29
8.00 4513 35.24 8.00 7662 59.86 2.20 44188 95.10 0.27 10 49
7.98 4799 37.40 8.00 11422 89.23 2.16 60076 126.63 0.19 0 51
8.00 4322 33.76 0.00 0 0.00 2.67 12967 33.76 0.89 0 42
8.02 4839 37.91 0.00 0 0.00 2.67 14550 37.91 0.54 17 41
8.01 4516 35.32 0.00 0 0.00 2.67 13569 35.32 0.57 27 38
7.95 4459 34.62 8.00 1195 9.34 2.49 18109 43.96 0.55 0 45
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
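
One way to read a capture like the one above is to find the row where total throughput drops to zero and look at the queue length (`ql`) just before it, which is the question Allan raised. The following is a minimal Python sketch, not part of the original thread. The column names and the helper functions are made up for illustration, and the sample is a short excerpt of the rows above.

```python
# Hypothetical column names for the 12 fields printed by `nfsstat -d 1`.
COLS = ["r_kbt", "r_tps", "r_mbs", "w_kbt", "w_tps", "w_mbs",
        "t_kbt", "t_tps", "t_mbs", "ms", "ql", "busy"]

SAMPLE = """\
[===== Read =====] [===== Write ====] [=========== Total ============]
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s ms ql %b
8.00 4322 33.76 0.00 0 0.00 2.67 12967 33.76 0.89 0 42
8.02 4839 37.91 0.00 0 0.00 2.67 14550 37.91 0.54 17 41
8.01 4516 35.32 0.00 0 0.00 2.67 13569 35.32 0.57 27 38
7.95 4459 34.62 8.00 1195 9.34 2.49 18109 43.96 0.55 0 45
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0
"""


def parse_rows(text):
    """Return the numeric rows as dicts; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLS):
            continue  # banner line or blank line
        try:
            values = [float(p) for p in parts]
        except ValueError:
            continue  # column-name header line
        rows.append(dict(zip(COLS, values)))
    return rows


def stall_onsets(rows):
    """Indexes of rows where total tps drops from nonzero to zero."""
    return [i for i in range(1, len(rows))
            if rows[i]["t_tps"] == 0 and rows[i - 1]["t_tps"] > 0]


def queue_before(rows, onset, window=3):
    """Queue lengths in the `window` rows just before a stall onset."""
    return [int(r["ql"]) for r in rows[max(0, onset - window):onset]]


rows = parse_rows(SAMPLE)
for onset in stall_onsets(rows):
    print("stall at row", onset, "ql before:", queue_before(rows, onset))
```

In this excerpt the queue length is back at 0 in the last active row before all counters go to zero, which looks more like traffic stopping than a queue backing up, though a few rows are not enough to be sure.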

Juraj Lutter

2021-04-19 15:13:10 UTC

I have tried D29690 patch and reverting back to r367492 this weekend. Neither made any difference for my system.

For me, reverting the patch in r367492 solved all the problems.
In addition, I also turned off LRO and TSO on the NICs comprising the lagg interface over which the NFS service is provided.

otis

Juraj Lutter

2021-04-15 21:09:26 UTC

Post by Rick Macklem

Post by Chris Roose
I posted this in -questions and someone suggested I post here as well.
I'm having NFS availability issues between my Proxmox client and FreeBSD server (10G link) since upgrading to 13-RELEASE. And unfortunately I upgraded my ZFS pool to v2.0.0 before I noticed the issue, so I'm kind of stuck.
Periodically, the NFS server (I've tried both v3 and v4.2 clients) will go unresponsive for several minutes. I never had this problem on 12.2, and as far as I can tell it's not a disk or network I/O issue. I'll get several "nfs: server not responding, still trying" messages on the client and a few minutes later it usually recovers. It's not clear to me yet what's causing the block. Restarting nfsd on the server will resolve the issue if it doesn't clear itself.

He sees a growing Recv-Q size on the server for the TCP connection from the client
when "netstat -a" is done on the server when the "hang" occurs.
In his case, he is using a Linux client and it does not recover, however other client
mounts continue to function.

Correct.
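
The Recv-Q check described above can be automated by comparing two `netstat` samples taken a few seconds apart. This is a minimal Python sketch, not from the original thread. It assumes numeric output (`netstat -an` rather than the `netstat -a` mentioned above), the standard NFS port 2049, and made-up addresses and queue sizes in the sample text.

```python
# Two hypothetical `netstat -an` samples; addresses and sizes are invented.
FIRST = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 10.0.0.1.2049          10.0.0.20.861          ESTABLISHED
tcp4       0      0 10.0.0.1.2049          10.0.0.21.912          ESTABLISHED
tcp4       0      0 10.0.0.1.22            10.0.0.30.51514        ESTABLISHED
"""
SECOND = """\
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4  148320      0 10.0.0.1.2049          10.0.0.20.861          ESTABLISHED
tcp4       0      0 10.0.0.1.2049          10.0.0.21.912          ESTABLISHED
tcp4       0      0 10.0.0.1.22            10.0.0.30.51514        ESTABLISHED
"""


def nfs_recv_queues(netstat_text, port="2049"):
    """Map foreign address -> Recv-Q for TCP sockets with local port `port`."""
    queues = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header line or non-TCP socket
        recv_q, local, foreign = fields[1], fields[3], fields[4]
        if not recv_q.isdigit():
            continue
        # FreeBSD prints addresses as host.port, so the port is the last field.
        if local.rsplit(".", 1)[-1] == port:
            queues[foreign] = int(recv_q)
    return queues


def growing(before, after):
    """Peers whose Recv-Q is larger in the second sample than in the first."""
    return sorted(peer for peer, q in after.items() if q > before.get(peer, 0))


stuck = growing(nfs_recv_queues(FIRST), nfs_recv_queues(SECOND))
print("connections with a growing Recv-Q:", stuck)
```

A peer that shows up here repeatedly while other mounts keep working would match the pattern described above, where only one client's connection hangs.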

Post by Rick Macklem
I suspect the recovery after a few minutes is the client establishing a new TCP
connection.
He has been running for almost a week with r367492 reverted and has not reported
seeing the problem again (he had reported that it has taken up to a week to recur, so
reverting r367492 *might* have fixed the problem and I'd guess we'll know in another
week?).

We have now been running for 4 days without interruption. Before r367492 was reverted, it was
unpredictable when it would lock up. The best result we achieved was 7 days.

The machine it’s running on is definitely not a slow or weak one (it’s a Dell R740xd with 2x CPU, 256GB RAM, 22x NVMe data zpool).

otis

Jason Unovitch

2021-04-17 13:23:35 UTC

Olav,

Does anything change if you set -tso -lro on the serving NIC on your
FreeBSD server side? Do the Linux clients remain responsive then?

I had seen something similar with an Ubuntu 20.04 client going to a
13.0-CURRENT/STABLE server but chalked it up to it starting around the time
of the last Chelsio firmware update. I haven't had the time to dig into it
and the temporary triage of setting '-tso -lro' at boot hasn't been as
temporary as I hoped. That did stabilize things on my side. Curious what you
see and if that helps add another data point to understanding the issue.

Jason

Chris Roose

2021-04-20 22:35:17 UTC

Post by Jason Unovitch
Does anything change if you set -tso -lro on the serving NIC on your
FreeBSD server side? Do the Linux clients remain responsive then?

Thank you, Jason. This seems to have cleared the problem up for me.
Since disabling TSO and LRO on the server NIC last night, I haven't seen
any timeouts.
--
Chris

Rick Macklem

2021-04-25 13:06:11 UTC

Post by Chris Roose

Post by Jason Unovitch
Does anything change if you set -tso -lro on the serving NIC on your
FreeBSD server side? Do the Linux clients remain responsive then?

Thank you, Jason. This seems to have cleared the problem up for me.
Since disabling TSO and LRO on the server NIC last night, I haven't seen
any timeouts.

I think there might be a couple of reasons that disabling TSO resolves this:
1 - The obvious one is that the net chip/driver is broken for certain TSO
segments. Often the culprit is an NFS read reply of just less than 64K,
that is made up of a chain of 33 mbufs with a total length just under
64K. Then the driver adds a MAC layer header that bumps the size up
to greater than 64K.
--> This can happen if the driver does not set the TSO sizing parameters
quite correctly, among other things.

2 - TSO does work correctly, but results in different timing of the TCP
segments transmitted for the segment compared with non-TSO.
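
The size arithmetic behind the first reason can be sketched in a few lines of Python. This is an illustration only, not from the original thread: the 14-byte figure is the standard untagged Ethernet header, the limit is taken literally as 64K (65536 bytes), and the segment length is a made-up value "just under 64K". Whether a given NIC driver actually mishandles such a frame depends on the driver.

```python
ETHER_HDR_LEN = 14        # untagged Ethernet (MAC layer) header, in bytes
LIMIT_64K = 64 * 1024     # the "64K" limit, taken literally as 65536 bytes


def frame_len(segment_len, l2_hdr_len=ETHER_HDR_LEN):
    """Length of the TSO segment once the MAC layer header is prepended."""
    return segment_len + l2_hdr_len


def exceeds_limit(segment_len, limit=LIMIT_64K):
    """True if the segment fits under the limit alone but not with the header."""
    return segment_len < limit and frame_len(segment_len) > limit


segment = 65532  # hypothetical TSO segment length, just under 64K
print(segment, "->", frame_len(segment), "over limit:", exceeds_limit(segment))
```

Any segment within 14 bytes of the limit crosses it once the header is added, which is why only replies of a particular size would trigger the problem.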

I believe that, for otis@, disabling TSO reduced the frequency of Linux
client hangs, but did not stop them.
--> reverting the patch in r367492 (this patch is not in FreeBSD12) has
fixed the problem for him.

rick

Juraj Lutter

2021-04-25 13:18:05 UTC

Post by Rick Macklem
2 - TSO does work correctly, but results in different timing of the TCP
segments transmitted for the segment compared with non-TSO.

I believe that, for otis@, disabling TSO reduced the frequency of Linux
client hangs, but did not stop them.
--> reverting the patch in r367492 (this patch is not in FreeBSD12) has
fixed the problem for him.

Correct. Reverting the patch in r367492 has made the system stable and usable (thanks, Rick!).
We have also disabled LRO and TSO on the interfaces serving the NFS traffic, which might also have contributed to the stability.

There is some more work going on in Phabricator (D29690) that we also want to test.

otis

—
Juraj Lutter
***@FreeBSD.org
