I'm seeing a problem where a client process does a TCP connect(2) at the same time that a SIGKILL is sent to the process doing listen(2). Occasionally, seemingly depending on exact timing (a race), the connect() succeeds, but the client is never notified that the server closed the connection (no FIN/RST packet), and a poll()/select() in the client waits indefinitely.

The behaviour is as if the code that handles the process kill first walks the list of existing connections (including the listen(2) backlog) and sends FIN to each client, and then shuts down the listening socket, after which a SYN is answered with RST. This leaves a small time window in between where new connections are still acknowledged with SYN/ACK, but are no longer shut down by either FIN or RST.

This is all on the 127.0.0.1 loopback, so no network/packet loss issues are involved.

$ uname -a
Linux urd 5.10.0-8-amd64 #1 SMP Debian 5.10.46-5 (2021-09-23) x86_64 GNU/Linux

Also reproduced on several other kernel versions and on RISC-V.

Attached is a perl script that reproduces the problem, also available here:

  https://knielsen-hq.org/test_listen_backlog_on_server_kill.pl

The script repeatedly forks a server process, establishes some connections, does kill -9 of the server, tries to re-connect at the same time, and tests whether the reconnect is handled correctly (either refused, or notified of close). For me, the problem usually occurs within a few hundred iterations.

Here is an example output where it triggers the error, and the corresponding tcpdump output:

-----------------------------------------------------------------------
AHA! select() on extra connection timed out on iteration 67!
Extra connection fd=19 port=59404
14:49:55.066435 lo In IP localhost.59404 > localhost.2345: Flags [S], seq 4284834695, win 65495, options [mss 65495,sackOK,TS val 152268719 ecr 0,nop,wscale 7], length 0
14:49:55.066465 lo In IP localhost.2345 > localhost.59404: Flags [S.], seq 3024130858, ack 4284834696, win 65483, options [mss 65495,sackOK,TS val 152268719 ecr 152268719,nop,wscale 7], length 0
14:49:55.066491 lo In IP localhost.59404 > localhost.2345: Flags [.], ack 1, win 512, options [nop,nop,TS val 152268719 ecr 152268719], length 0
14:50:05.077150 lo In IP localhost.59404 > localhost.2345: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 152278730 ecr 152268719], length 0
14:50:05.077183 lo In IP localhost.2345 > localhost.59404: Flags [R], seq 3024130859, win 0, length 0
-----------------------------------------------------------------------

We see the connection being established with SYN/ACK, but no FIN is sent when the server process exits. Only 10 seconds later, when the script times out the poll()/select(), does the client send FIN, which is answered with RST as there is no longer a listening socket on port 2345.

Occasionally another behaviour is seen: the client's initial SYN packet gets no reply, causing a client retransmission (which is then answered with RST):

-----------------------------------------------------------------------
Oops, connect() took 1.008663 seconds!
(connect=No)
14:57:19.389914 lo In IP localhost.43856 > localhost.2345: Flags [S], seq 2851822367, win 65495, options [mss 65495,sackOK,TS val 152713043 ecr 0,nop,wscale 7], length 0
14:57:20.398363 lo In IP localhost.43856 > localhost.2345: Flags [S], seq 2851822367, win 65495, options [mss 65495,sackOK,TS val 152714051 ecr 0,nop,wscale 7], length 0
14:57:20.398415 lo In IP localhost.2345 > localhost.43856: Flags [R.], seq 0, ack 2851822368, win 0, length 0
-----------------------------------------------------------------------

A practical consequence of this bug is that if a server dies, a client may seemingly re-establish its connection successfully, believe that it is again connected to the (restarted) server, and wait for data. In reality the client's connection is dead, and the client can wait indefinitely for data or for an EOF/close notification on the socket.

This problem originates from the test suite / continuous integration of MariaDB, a relational database. The test suite checks the correctness of various scenarios where the server process crashes. These tests very occasionally fail due to a timeout on the re-established server connection, which is caused by this bug.

Original MariaDB bug for reference:

  https://jira.mariadb.org/browse/MDEV-30232

Any ideas? Is this a known issue?

 - Kristian.
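
For readers who want the shape of the check without opening the attached script: the per-connection test described above (a reconnect must end up either refused or notified of close; anything else is the bug) can be sketched roughly as follows. This is a minimal Python sketch, not the Perl script itself. The function name classify_reconnect and the outcome labels are made up for illustration, and it covers only the client-side classification, not the fork/kill loop.

```python
import select
import socket


def classify_reconnect(host, port, timeout):
    """Classify one reconnect attempt (hypothetical helper, for illustration).

    Returns one of:
      "refused" - connect() failed with ECONNREFUSED (SYN answered with RST)
      "closed"  - connected, then saw EOF (FIN) or a reset (RST)
      "data"    - connected and the peer actually sent something
      "hung"    - connected, but neither data nor FIN/RST arrived in time
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        try:
            sock.connect((host, port))
        except ConnectionRefusedError:
            # No listening socket any more: an acceptable outcome.
            return "refused"

        # Wait for the socket to become readable; EOF counts as readable.
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            # Handshake completed, but the client was never told that
            # the server went away: the symptom described in the report.
            return "hung"

        try:
            data = sock.recv(1)
        except ConnectionResetError:
            return "closed"
        return "closed" if data == b"" else "data"
    finally:
        sock.close()
```

For example, classify_reconnect("127.0.0.1", 2345, 10.0) mirrors the port and the 10-second timeout visible in the traces; a "hung" result corresponds to the "select() on extra connection timed out" case.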