Discussion:
Strange network/socket anomalies since about a month ago
Alexander Leidinger
2024-04-22 07:26:59 UTC
Hi,

I have been seeing a higher failure rate of socket/network-related stuff
for a while now. The failures are transient: directly executing the same
thing again may succeed or fail. I'm not able to reproduce this at will;
sometimes they just show up.

Examples:
- poudriere runs with the sccache overlay (like ccache, but it also works
for rust) sometimes fail to create the communication socket, and as such
the build fails. I have 3 different poudriere bulk runs after each other
in my build script (see the sketch below), and when the first one fails,
the second and third still run. If the first fails due to the sccache
issue, the second and third may or may not fail. Sometimes the first
fails and the rest is ok. Sometimes all fail, and if I then run one by
hand it works (the script does the same as the manual run; it is simply
a "for type in A B C; do poudriere bulk -O sccache -j $type -f
${type}.pkglist; done" which I execute from the same shell, and the
script doesn't do any env-sanitizing).
- A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx
(webmail service) -> php -> imap) sometimes sees intermittent issues.
Opening the same email again directly afterwards normally works. I've
also seen transient issues with pgp signing (webmail interface -> gnupg
/ gpg-agent on the server); simply hitting send again after a failure
works fine.
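
For reference, here is a cleaned-up sketch of that build loop; the jail
names and pkglist file names are placeholders, not my real ones:

---snip---
#!/bin/sh
# Build the package set for each jail in sequence; the sccache overlay
# is enabled for all of them. Jail names and pkglist files are placeholders.
for type in A B C; do
	poudriere bulk -O sccache -j "${type}" -f "${type}.pkglist"
done
---snip---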

Gleb, could this be related to the socket stuff you did 2 weeks ago? My
world is from 2024-04-17-112537. I have noticed this since at least then,
but I'm not sure if the issues were there before that and I simply didn't
notice them. They are surely recent; I haven't seen this amount of issues
in January. The last two updates of -current I did before the last one
were on 2024-03-31-120210 and 2024-04-08-112551.

I could also imagine that some memory related transient failure could
cause this, but with >3 GB free I do not expect this. Important here may
be that I have https://reviews.freebsd.org/D40575 in my tree, which is
memory related, but it's only a metric to quantify memory fragmentation.

Any ideas how to track this down more easily than running the entire
poudriere build under ktrace (e.g. a hint or script for which dtrace
probes to use)?

Bye,
Alexander.
--
http://www.Leidinger.net ***@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org ***@FreeBSD.org : PGP 0x8F31830F9F2772BF
Paul Mather
2024-04-22 15:35:16 UTC
Post by Alexander Leidinger
Hi,
I see a higher failure rate of socket/network related stuff since a while. Those failures are transient. Directly executing the same thing again may or may not result in success/failure. I'm not able to reproduce this at will. Sometimes they show up.
- poudriere runs with the sccache overlay (like ccache but also works for rust) sometimes fail to create the communication socket and as such the build fails. I have 3 different poudriere bulk runs after each other in my build script, and when the first one fails, the second and third still run. If the first fails due to the sccache issue, the second and 3rd may or may not fail. Sometimes the first fails and the rest is ok. Sometimes all fail, and if I then run one by hand it works (the script does the same as the manual run, the script is simply a "for type in A B C; do; poudriere bulk -O sccache -j $type -f ${type}.pkglist; done" which I execute from the same shell, and the script doesn't do env-sanityzing).
- A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx (webmail service) -> php -> imap) sees intermittent issues sometimes. Opening the same email directly again afterwards normally works. I've also seen transient issues with pgp signing (webmail interface -> gnupg / gpg-agent on the server), simply hitting send again after a failure works fine.
Gleb, could this be related to the socket stuff you did 2 weeks ago? My world is from 2024-04-17-112537. I do notice this since at least then, but I'm not sure if they where there before that and I simply didn't notice them. They are surely "new recently", that amount of issues I haven's seen in January. The last two updates of current I did before the last one where on 2024-03-31-120210 and 2024-04-08-112551.
I could also imagine that some memory related transient failure could cause this, but with >3 GB free I do not expect this. Important here may be that I have https://reviews.freebsd.org/D40575 in my tree, which is memory related, but it's only a metric to quantify memory fragmentation.
Any ideas how to track this down more easily than running the entire poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?
No answers, I'm afraid, just a "me too."

I have the same problem as you describe when using ports-mgmt/sccache-overlay when building packages with Poudriere. In my case, I'm using FreeBSD 14-STABLE (stable/14-13952fbca).

I actually stopped using ports-mgmt/sccache-overlay because it got to the point where it failed more often than it worked. Then, a few months ago, I decided on a whim to start using it again, and it worked reliably for me. Starting a few weeks ago, it has reverted to the behaviour you describe above. It is not as bad right now as it was when I quit using it: sometimes it will fail, but it will succeed when I re-run the "poudriere bulk" run.

I'd love it to go back to when it was working 100% of the time.

Cheers,

Paul.



Gleb Smirnoff
2024-04-22 16:12:49 UTC
Alexander,

On Mon, Apr 22, 2024 at 09:26:59AM +0200, Alexander Leidinger wrote:
A> I see a higher failure rate of socket/network related stuff since a while.
A> Those failures are transient. Directly executing the same thing again may
A> or may not result in success/failure. I'm not able to reproduce this at
A> will. Sometimes they show up.
A>
A> Examples:
A> - poudriere runs with the sccache overlay (like ccache but also works for
A> rust) sometimes fail to create the communication socket and as such the
A> build fails. I have 3 different poudriere bulk runs after each other in my
A> build script, and when the first one fails, the second and third still run.
A> If the first fails due to the sccache issue, the second and 3rd may or may
A> not fail. Sometimes the first fails and the rest is ok. Sometimes all fail,
A> and if I then run one by hand it works (the script does the same as the
A> manual run, the script is simply a "for type in A B C; do; poudriere bulk
A> -O sccache -j $type -f ${type}.pkglist; done" which I execute from the
A> same shell, and the script doesn't do env-sanityzing).
A> - A webmail interface (inet / local net -> nginx (rev-proxy) -> nginx
A> (webmail service) -> php -> imap) sees intermittent issues sometimes.
A> Opening the same email directly again afterwards normally works. I've also
A> seen transient issues with pgp signing (webmail interface -> gnupg /
A> gpg-agent on the server), simply hitting send again after a failure works
A> fine.
A>
A> Gleb, could this be related to the socket stuff you did 2 weeks ago? My
A> world is from 2024-04-17-112537. I do notice this since at least then, but
A> I'm not sure if they where there before that and I simply didn't notice
A> them. They are surely "new recently", that amount of issues I haven's seen
A> in January. The last two updates of current I did before the last one where
A> on 2024-03-31-120210 and 2024-04-08-112551.

The stuff I pushed 2 weeks ago was a large rewrite of unix/stream, but that was
reverted, as it appears it needs more work with respect to aio(4) and nfs/rpc, and
it also appeared that sendfile(2) over unix(4) has some non-zero use.

There were several preparatory commits that were not reverted, and one of them
had a bug. The bug manifested itself as a failure to send(2) zero bytes over
unix/stream. It was fixed with e6a4b57239dafc6c944473326891d46d966c0264. Can
you please check that you have this revision? Other than that, there are no known
bugs left.
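
If your src tree is a git checkout, something along these lines should
tell you (just a sketch; adjust the path to wherever your checkout lives):

---snip---
# Run inside the src checkout the kernel was built from.
# Exits 0 (and prints "fix present") if the commit is an ancestor of HEAD.
cd /usr/src && \
    git merge-base --is-ancestor e6a4b57239dafc6c944473326891d46d966c0264 HEAD \
    && echo "fix present" || echo "fix missing"
---snip---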

A> I could also imagine that some memory related transient failure could cause
A> this, but with >3 GB free I do not expect this. Important here may be that
A> I have https://reviews.freebsd.org/D40575 in my tree, which is memory
A> related, but it's only a metric to quantify memory fragmentation.
A>
A> Any ideas how to track this down more easily than running the entire
A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?

I don't have any better idea than ktrace over the failing application. Yep, I
understand that poudriere will produce a lot. But first we need to determine
which syscall fails and on what type of socket. After that we can scope down to
using dtrace on very particular functions.
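
Roughly something like this, if you can reproduce it by starting sccache
by hand inside the affected builder (just a sketch; file names are examples):

---snip---
# -i makes children inherit tracing, -f selects the trace output file.
ktrace -i -f /tmp/sccache.ktrace sccache --start-server

# Then pick out the socket-related syscalls and any error returns.
kdump -f /tmp/sccache.ktrace | \
    grep -E 'socket|bind|connect|listen|accept|sendto|recvfrom|errno'
---snip---
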
--
Gleb Smirnoff


Alexander Leidinger
2024-04-24 10:01:26 UTC
Post by Gleb Smirnoff
There were several preparatory commits that were not reverted and one of them
had a bug. The bug manifested itself as failure to send(2) zero bytes over
unix/stream. It was fixed with
e6a4b57239dafc6c944473326891d46d966c0264. Can
you please check you have this revision? Other than that there are no known
bugs left.
Yes, I have this fix in my running kernel.
Post by Gleb Smirnoff
A> Any ideas how to track this down more easily than running the entire
A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?
I don't have any better idea than ktrace over failing application.
Yep, I
understand that poudriere will produce a lot. But first we need to determine
Yes, it does. 4.4 GB just for the start of poudriere until the first
package build fails due to a failed sccache start (luckily in the first
builder, but poudriere had already spun up at least 2 builders
automatically by the time I validated the failure in the logs and
disabled the tracing).
Post by Gleb Smirnoff
what syscall fails and on what type of socket. After that we can scope down to
using dtrace on very particular functions.
I'm not sure I managed to find the cause of the failure... the only thing
which remotely looks like an issue is "Resource temporarily
unavailable", but this is from the process which waits for the server to
have started:
---snip---
58406 sccache 1713947887.504834367 RET __sysctl 0
58406 sccache 1713947887.505521884 CALL
rfork(0x80000000<><invalid>2147483648)
58406 sccache 1713947887.505757777 CAP system call not allowed:
rfork
58406 sccache 1713947887.505774176 RET rfork 58426/0xe43a
58406 sccache 1713947887.507304865 CALL
compat11.kevent(0x3,0x371d360f89e8,0x2,0x371d360f89e8,0x2,0)
58406 sccache 1713947887.507657906 STRU struct freebsd11_kevent[] = {
{ ident=11, filter=EVFILT_READ, flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0, udata=0x0 }
{ ident=11, filter=EVFILT_WRITE,
flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, fflags=0, data=0, udata=0x0 } }
58406 sccache 1713947887.507689980 STRU struct freebsd11_kevent[] = {
{ ident=11, filter=EVFILT_READ, flags=0x4000<EV_ERROR>, fflags=0,
data=0, udata=0x0 }
{ ident=11, filter=EVFILT_WRITE, flags=0x4000<EV_ERROR>,
fflags=0, data=0, udata=0x0 } }
58406 sccache 1713947887.507977155 RET compat11.kevent 2
58406 sccache 1713947887.508015751 CALL write(0x5,0x371515685bcc,0x1)
58406 sccache 1713947887.508086434 GIO fd 5 wrote 1 byte
0x0000 01 |.|

58406 sccache 1713947887.508145930 RET write 1
58406 sccache 1713947887.508183140 CALL
compat11.kevent(0x7,0,0,0x5a5689ab0c40,0x400,0)
58406 sccache 1713947887.508396614 STRU struct freebsd11_kevent[] = {
}
58406 sccache 1713947887.508156537 STRU struct freebsd11_kevent[] = {
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0x1, udata=0xffffffffffffffff } }
58406 sccache 1713947887.508530888 RET compat11.kevent 1
58406 sccache 1713947887.508563736 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947887.508729102 GIO fd 4 read 1 byte
0x0000 01 |.|

58406 sccache 1713947887.508967661 RET read 1
58406 sccache 1713947887.508996753 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947887.509028311 RET read -1 errno 35 Resource
temporarily unavailable
58406 sccache 1713947887.509068838 CALL
compat11.kevent(0x3,0,0,0x5a5689a97540,0x400,0x371d3a2887c8)
..
58406 sccache 1713947897.514352552 CALL
_umtx_op(0x5a5689a3d290,0x10<UMTX_OP_WAKE_PRIVATE>,0x7fffffff,0,0)
58406 sccache 1713947897.514383653 RET _umtx_op 0
58406 sccache 1713947897.514421273 CALL write(0x5,0x371515685bcc,0x1)
58406 sccache 1713947897.515050967 STRU struct freebsd11_kevent[] = {
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0x1, udata=0xffffffffffffffff } }
58406 sccache 1713947897.515146151 RET compat11.kevent 1
58406 sccache 1713947897.515178978 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947897.515368070 GIO fd 4 read 1 byte
0x0000 01 |.|

58406 sccache 1713947897.515396600 RET read 1
58406 sccache 1713947897.515426523 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947897.515457073 RET read -1 errno 35 Resource
temporarily unavailable
58406 sccache 1713947897.515004494 GIO fd 5 wrote 1 byte
0x0000 01 |.|
---snip---

https://www.leidinger.net/test/sccache.tar.bz2 contains the parts of the
trace of the sccache processes (in case someone wants to have a look).

The poudriere run had several builders in parallel; at least 2 were
running at that point in time. What the overlay does is start up the
sccache server process (sccache --start-server; it forks so that it can
return to the command line), which creates a file system socket, and
then it queries the stats (sccache --show-stats). So some of the traces
in the tarball are the server start (those with "Timed out waiting for
server startup" are maybe the processes which fork to start the server
and wait for it to be started), some are the stat query, and some seem
to be a successful start in another poudriere builder (those with a
successful /root/.ccache/sccache/5/4/<hashvalue> access look like they
are from a successful start in another jail). Maybe there is also a
--stop-server from poudriere somewhere.
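
In shell terms my understanding of the relevant part boils down to
roughly this (my paraphrase of the overlay, not its literal code):

---snip---
# Rough paraphrase of what the overlay does per builder, not its literal code.
export SCCACHE_DIR=/root/.ccache/sccache   # the cache dir seen in the traces
sccache --start-server   # forks, creates the local socket, returns to the shell
sccache --show-stats     # queries the freshly started server over that socket
# ... the package build then uses sccache as a compiler wrapper ...
sccache --stop-server    # presumably issued by poudriere at the end of the build
---snip---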

What I noticed (apart from the fact that printing the new CAP stuff for
non-CAP-enabled processes by default is disturbing) is that compat11
stuff is called (it seems the rust ecosystem is not keeping up with our
speed of development...). I'm not sure if it matters here that some
compat stuff is called.

Bye,
Alexander.
--
http://www.Leidinger.net ***@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org ***@FreeBSD.org : PGP 0x8F31830F9F2772BF
Dag-Erling Smørgrav
2024-04-24 17:27:52 UTC
Post by Alexander Leidinger
Post by Gleb Smirnoff
I don't have any better idea than ktrace over failing application.
Yep, I understand that poudriere will produce a lot.
Yes, it does. 4.4 GB just for the start of poudriere until the first
package build fails due to a failed sccache start [...]
Using `ktrace -tcnpstuy` instead of just `ktrace` should greatly reduce
the size of the trace file.
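
I.e. something like the following; the idea is to drop the I/O data
records (the 'i' trace point), which is what bloats the file:

---snip---
# Trace everything except the I/O data (no 'i' in the -t string); syscalls,
# namei, signals etc. are still recorded, so failing calls remain visible.
ktrace -i -tcnpstuy -f /tmp/sccache.ktrace sccache --start-server
kdump -f /tmp/sccache.ktrace | less
---snip---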

(remind me to modify ktrace and kdump so this can be written as `-t-i`
or `-tI` instead...)

DES
--
Dag-Erling Smørgrav - ***@FreeBSD.org

