All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
@ 2023-12-08 18:20 Salvatore Dipietro
  2023-12-09 11:03 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Salvatore Dipietro @ 2023-12-08 18:20 UTC (permalink / raw
  To: edumazet, davem, dsahern, kuba, pabeni
  Cc: netdev, blakgeof, alisaidi, benh, dipietro.salvatore,
	Salvatore Dipietro

Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
and packets are sent as soon as possible. However in the `tcp_push` function
where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
considered which can trigger unexpected corking of packets and induce delays.

For example, if two packets are generated as part of a server's reply, if the
first one is not transmitted on the wire quickly enough, the second packet can
trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
possible. It will either wait for additional packets to be coalesced or an ACK
from the client before transmitting the corked packet. This can interact badly
if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
completion times. It is not always possible to control who has delayed acks
set, but it is possible to adjust when and how autocorking is triggered.
Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.

Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}

Load was applied using  wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance.  With the current auto-corking behavior and TCP_NODELAY
set an additional 40ms latency from P99.99+ values are observed.  With the
patch applied we see no occurrences of 40ms latencies. The patch has also been
tested with iperf and uperf benchmarks and no regression was observed.

# No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
  ...
 50.000%    0.91ms
 75.000%    1.12ms
 90.000%    1.46ms
 99.000%    1.73ms
 99.900%    1.96ms
 99.990%   43.62ms   <<< 40+ ms extra latency
 99.999%   48.32ms
100.000%   49.34ms

# With patch
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
  ...
 50.000%    0.89ms
 75.000%    1.13ms
 90.000%    1.44ms
 99.000%    1.67ms
 99.900%    1.78ms
 99.990%    2.27ms   <<< no 40+ ms extra latency
 99.999%    3.71ms
100.000%    4.57ms

Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d3456cf840de..87751a2a6fff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -716,7 +716,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
 
 	tcp_mark_urg(tp, flags);
 
-	if (tcp_should_autocork(sk, skb, size_goal)) {
+	if (!nonagle && tcp_should_autocork(sk, skb, size_goal)) {
 
 		/* avoid atomic op if TSQ_THROTTLED bit is already set */
 		if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-08 18:20 [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set Salvatore Dipietro
@ 2023-12-09 11:03 ` Eric Dumazet
  2023-12-11 15:58   ` Hazem Mohamed Abuelfotoh
  2023-12-12 10:57 ` Paolo Abeni
  2023-12-13 10:40 ` [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set patchwork-bot+netdevbpf
  2 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2023-12-09 11:03 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: davem, dsahern, kuba, pabeni, netdev, blakgeof, alisaidi, benh,
	dipietro.salvatore

On Fri, Dec 8, 2023 at 7:23 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
>
> Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
> and packets are sent as soon as possible. However in the `tcp_push` function
> where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
> considered which can trigger unexpected corking of packets and induce delays.
>
> For example, if two packets are generated as part of a server's reply, if the
> first one is not transmitted on the wire quickly enough, the second packet can
> trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
> possible. It will either wait for additional packets to be coalesced or an ACK
> from the client before transmitting the corked packet. This can interact badly
> if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
> completion times. It is not always possible to control who has delayed acks
> set, but it is possible to adjust when and how autocorking is triggered.
> Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
>
> Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> Apache Tomcat 9.0.83 running the basic servlet below:
>
> import java.io.IOException;
> import java.io.OutputStreamWriter;
> import java.io.PrintWriter;
> import javax.servlet.ServletException;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
>
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> ---

Very nice catch.

I suggest this patch lands in net tree, what about we add a Fixes: tag ?

Fixes: f54b311142a9 ("tcp: auto corking")
Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-09 11:03 ` Eric Dumazet
@ 2023-12-11 15:58   ` Hazem Mohamed Abuelfotoh
  2023-12-12 10:41     ` Paolo Abeni
  0 siblings, 1 reply; 28+ messages in thread
From: Hazem Mohamed Abuelfotoh @ 2023-12-11 15:58 UTC (permalink / raw
  To: edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, pabeni

It will be good to submit another version changing the documentation https://docs.kernel.org/networking/timestamping.html editing "It can prevent the situation by always flushing the TCP stack in between requests, for instance by enabling TCP_NODELAY and disabling TCP_CORK and autocork." to "It can prevent the situation by always flushing the TCP stack in between requests, for instance by enabling TCP_NODELAY and disabling TCP_CORK."


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-11 15:58   ` Hazem Mohamed Abuelfotoh
@ 2023-12-12 10:41     ` Paolo Abeni
  2023-12-12 11:29       ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Abeni @ 2023-12-12 10:41 UTC (permalink / raw
  To: Hazem Mohamed Abuelfotoh
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, edumazet

On Mon, 2023-12-11 at 15:58 +0000, Hazem Mohamed Abuelfotoh wrote:
> It will be good to submit another version changing the documentation
> https://docs.kernel.org/networking/timestamping.html editing "It can
> prevent the situation by always flushing the TCP stack in between
> requests, for instance by enabling TCP_NODELAY and disabling TCP_CORK
> and autocork." to "It can prevent the situation by always flushing
> the TCP stack in between requests, for instance by enabling
> TCP_NODELAY and disabling TCP_CORK."

The documentation update could be a follow-up.

Please fix your email client, the message you sent was not properly
trimmed.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-08 18:20 [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set Salvatore Dipietro
  2023-12-09 11:03 ` Eric Dumazet
@ 2023-12-12 10:57 ` Paolo Abeni
  2023-12-13 14:10   ` Eric Dumazet
  2023-12-13 10:40 ` [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set patchwork-bot+netdevbpf
  2 siblings, 1 reply; 28+ messages in thread
From: Paolo Abeni @ 2023-12-12 10:57 UTC (permalink / raw
  To: Salvatore Dipietro, edumazet, davem, dsahern, kuba
  Cc: netdev, blakgeof, alisaidi, benh, dipietro.salvatore

On Fri, 2023-12-08 at 10:20 -0800, Salvatore Dipietro wrote:
> Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
> and packets are sent as soon as possible. However in the `tcp_push` function
> where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
> considered which can trigger unexpected corking of packets and induce delays.
> 
> For example, if two packets are generated as part of a server's reply, if the
> first one is not transmitted on the wire quickly enough, the second packet can
> trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
> possible. It will either wait for additional packets to be coalesced or an ACK
> from the client before transmitting the corked packet. This can interact badly
> if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
> completion times. It is not always possible to control who has delayed acks
> set, but it is possible to adjust when and how autocorking is triggered.
> Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
> 
> Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> Apache Tomcat 9.0.83 running the basic servlet below:
> 
> import java.io.IOException;
> import java.io.OutputStreamWriter;
> import java.io.PrintWriter;
> import javax.servlet.ServletException;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
> 
> public class HelloWorldServlet extends HttpServlet {
>     @Override
>     protected void doGet(HttpServletRequest request, HttpServletResponse response)
>       throws ServletException, IOException {
>         response.setContentType("text/html;charset=utf-8");
>         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
>         String s = "a".repeat(3096);
>         osw.write(s,0,s.length());
>         osw.flush();
>     }
> }
> 
> Load was applied using  wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> c6i.8xlarge instance.  With the current auto-corking behavior and TCP_NODELAY
> set an additional 40ms latency from P99.99+ values are observed.  With the
> patch applied we see no occurrences of 40ms latencies. The patch has also been
> tested with iperf and uperf benchmarks and no regression was observed.
> 
> # No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
>   ...
>  50.000%    0.91ms
>  75.000%    1.12ms
>  90.000%    1.46ms
>  99.000%    1.73ms
>  99.900%    1.96ms
>  99.990%   43.62ms   <<< 40+ ms extra latency
>  99.999%   48.32ms
> 100.000%   49.34ms
> 
> # With patch
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
>   ...
>  50.000%    0.89ms
>  75.000%    1.13ms
>  90.000%    1.44ms
>  99.000%    1.67ms
>  99.900%    1.78ms
>  99.990%    2.27ms   <<< no 40+ ms extra latency
>  99.999%    3.71ms
> 100.000%    4.57ms
> 
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> ---
>  net/ipv4/tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index d3456cf840de..87751a2a6fff 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -716,7 +716,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
>  
>  	tcp_mark_urg(tp, flags);
>  
> -	if (tcp_should_autocork(sk, skb, size_goal)) {
> +	if (!nonagle && tcp_should_autocork(sk, skb, size_goal)) {

It looks like the above disables autocorking even after the userspace
sets TCP_CORK. Am I reading it correctly? Is that expected?

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-12 10:41     ` Paolo Abeni
@ 2023-12-12 11:29       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2023-12-12 11:29 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Hazem Mohamed Abuelfotoh, alisaidi, benh, blakgeof, davem,
	dipietro.salvatore, dipiets, dsahern, kuba, netdev

On Tue, Dec 12, 2023 at 11:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2023-12-11 at 15:58 +0000, Hazem Mohamed Abuelfotoh wrote:
> > It will be good to submit another version changing the documentation
> > https://docs.kernel.org/networking/timestamping.html editing "It can
> > prevent the situation by always flushing the TCP stack in between
> > requests, for instance by enabling TCP_NODELAY and disabling TCP_CORK
> > and autocork." to "It can prevent the situation by always flushing
> > the TCP stack in between requests, for instance by enabling
> > TCP_NODELAY and disabling TCP_CORK."
>
> The documentation update could be a follow-up.

About timestamping documentation I sent something  cleaner I think

https://patchwork.kernel.org/project/netdevbpf/patch/20231212110608.3673677-1-edumazet@google.com/




>
> Please fix your email client, the message you sent was not properly
> trimmed.
>
> Cheers,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-08 18:20 [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set Salvatore Dipietro
  2023-12-09 11:03 ` Eric Dumazet
  2023-12-12 10:57 ` Paolo Abeni
@ 2023-12-13 10:40 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 28+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-12-13 10:40 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: edumazet, davem, dsahern, kuba, pabeni, netdev, blakgeof,
	alisaidi, benh, dipietro.salvatore

Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri, 8 Dec 2023 10:20:49 -0800 you wrote:
> Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
> and packets are sent as soon as possible. However in the `tcp_push` function
> where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
> considered which can trigger unexpected corking of packets and induce delays.
> 
> For example, if two packets are generated as part of a server's reply, if the
> first one is not transmitted on the wire quickly enough, the second packet can
> trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
> possible. It will either wait for additional packets to be coalesced or an ACK
> from the client before transmitting the corked packet. This can interact badly
> if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
> completion times. It is not always possible to control who has delayed acks
> set, but it is possible to adjust when and how autocorking is triggered.
> Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
> 
> [...]

Here is the summary with links:
  - tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
    https://git.kernel.org/netdev/net/c/f3f32a356c0d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-12 10:57 ` Paolo Abeni
@ 2023-12-13 14:10   ` Eric Dumazet
  2023-12-13 19:00     ` Jakub Kicinski
  2023-12-13 21:30     ` Salvatore Dipietro
  0 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2023-12-13 14:10 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Salvatore Dipietro, davem, dsahern, kuba, netdev, blakgeof,
	alisaidi, benh, dipietro.salvatore

On Tue, Dec 12, 2023 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2023-12-08 at 10:20 -0800, Salvatore Dipietro wrote:
> > Based on the tcp man page, if TCP_NODELAY is set, it disables Nagle's algorithm
> > and packets are sent as soon as possible. However in the `tcp_push` function
> > where autocorking is evaluated the `nonagle` value set by TCP_NODELAY is not
> > considered which can trigger unexpected corking of packets and induce delays.
> >
> > For example, if two packets are generated as part of a server's reply, if the
> > first one is not transmitted on the wire quickly enough, the second packet can
> > trigger the autocorking in `tcp_push` and be delayed instead of sent as soon as
> > possible. It will either wait for additional packets to be coalesced or an ACK
> > from the client before transmitting the corked packet. This can interact badly
> > if the receiver has tcp delayed acks enabled, introducing 40ms extra delay in
> > completion times. It is not always possible to control who has delayed acks
> > set, but it is possible to adjust when and how autocorking is triggered.
> > Patch prevents autocorking if the TCP_NODELAY flag is set on the socket.
> >
> > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > Apache Tomcat 9.0.83 running the basic servlet below:
> >
> > import java.io.IOException;
> > import java.io.OutputStreamWriter;
> > import java.io.PrintWriter;
> > import javax.servlet.ServletException;
> > import javax.servlet.http.HttpServlet;
> > import javax.servlet.http.HttpServletRequest;
> > import javax.servlet.http.HttpServletResponse;
> >
> > public class HelloWorldServlet extends HttpServlet {
> >     @Override
> >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> >       throws ServletException, IOException {
> >         response.setContentType("text/html;charset=utf-8");
> >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> >         String s = "a".repeat(3096);
> >         osw.write(s,0,s.length());
> >         osw.flush();
> >     }
> > }
> >
> > Load was applied using  wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > c6i.8xlarge instance.  With the current auto-corking behavior and TCP_NODELAY
> > set an additional 40ms latency from P99.99+ values are observed.  With the
> > patch applied we see no occurrences of 40ms latencies. The patch has also been
> > tested with iperf and uperf benchmarks and no regression was observed.
> >
> > # No patch with tcp_autocorking=1 and TCP_NODELAY set on all sockets
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
> >   ...
> >  50.000%    0.91ms
> >  75.000%    1.12ms
> >  90.000%    1.46ms
> >  99.000%    1.73ms
> >  99.900%    1.96ms
> >  99.990%   43.62ms   <<< 40+ ms extra latency
> >  99.999%   48.32ms
> > 100.000%   49.34ms
> >
> > # With patch
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.49.177:8080/hello/hello'
> >   ...
> >  50.000%    0.89ms
> >  75.000%    1.13ms
> >  90.000%    1.44ms
> >  99.000%    1.67ms
> >  99.900%    1.78ms
> >  99.990%    2.27ms   <<< no 40+ ms extra latency
> >  99.999%    3.71ms
> > 100.000%    4.57ms
> >
> > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > ---
> >  net/ipv4/tcp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index d3456cf840de..87751a2a6fff 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -716,7 +716,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> >
> >       tcp_mark_urg(tp, flags);
> >
> > -     if (tcp_should_autocork(sk, skb, size_goal)) {
> > +     if (!nonagle && tcp_should_autocork(sk, skb, size_goal)) {
>
> It looks like the above disables autocorking even after the userspace
> sets TCP_CORK. Am I reading it correctly?Sal Is that expected?
>

Yes, it seems the patch went too far.

Also I wonder about these 40ms delays, TCP small queue handler should
kick when the prior skb is TX completed.

It seems the issue is on the driver side ?

Salvatore, which driver are you using ?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-13 14:10   ` Eric Dumazet
@ 2023-12-13 19:00     ` Jakub Kicinski
  2023-12-13 19:29       ` Eric Dumazet
  2023-12-13 21:30     ` Salvatore Dipietro
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2023-12-13 19:00 UTC (permalink / raw
  To: Eric Dumazet
  Cc: Paolo Abeni, Salvatore Dipietro, davem, dsahern, netdev, blakgeof,
	alisaidi, benh, dipietro.salvatore

On Wed, 13 Dec 2023 15:10:00 +0100 Eric Dumazet wrote:
> > It looks like the above disables autocorking even after the userspace
> > sets TCP_CORK. Am I reading it correctly?Sal Is that expected?
> 
> Yes, it seems the patch went too far.

Reverted to avoid it getting to Linus in the meantime, FWIW.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-13 19:00     ` Jakub Kicinski
@ 2023-12-13 19:29       ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2023-12-13 19:29 UTC (permalink / raw
  To: Jakub Kicinski
  Cc: Paolo Abeni, Salvatore Dipietro, davem, dsahern, netdev, blakgeof,
	alisaidi, benh, dipietro.salvatore

On Wed, Dec 13, 2023 at 8:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Dec 2023 15:10:00 +0100 Eric Dumazet wrote:
> > > It looks like the above disables autocorking even after the userspace
> > > sets TCP_CORK. Am I reading it correctly?Sal Is that expected?
> >
> > Yes, it seems the patch went too far.
>
> Reverted to avoid it getting to Linus in the meantime, FWIW.

Thanks Jakub.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-13 14:10   ` Eric Dumazet
  2023-12-13 19:00     ` Jakub Kicinski
@ 2023-12-13 21:30     ` Salvatore Dipietro
  2023-12-14  8:40       ` Eric Dumazet
  2023-12-14 14:08       ` Eric Dumazet
  1 sibling, 2 replies; 28+ messages in thread
From: Salvatore Dipietro @ 2023-12-13 21:30 UTC (permalink / raw
  To: edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, pabeni

> It looks like the above disables autocorking even after the userspace
> sets TCP_CORK. Am I reading it correctly? Is that expected?

I have tested a new version of the patch which can target only TCP_NODELAY.
Results using previous benchmark are identical. I will submit it in a new 
patch version.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -716,7 +716,8 @@

        tcp_mark_urg(tp, flags);

-       if (tcp_should_autocork(sk, skb, size_goal)) {
+       if (!(nonagle & TCP_NAGLE_OFF) &&
+           tcp_should_autocork(sk, skb, size_goal)) {

                /* avoid atomic op if TSQ_THROTTLED bit is already set */
                if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {



> Also I wonder about these 40ms delays, TCP small queue handler should
> kick when the prior skb is TX completed.
>
> It seems the issue is on the driver side ?
>
> Salvatore, which driver are you using ?

I am using ENA driver.

Eric can you please clarify where do you think the problem is?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-13 21:30     ` Salvatore Dipietro
@ 2023-12-14  8:40       ` Eric Dumazet
  2023-12-14  9:05         ` Eric Dumazet
  2023-12-14 14:08       ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2023-12-14  8:40 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dsahern,
	kuba, netdev, pabeni

On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
>
> > It looks like the above disables autocorking even after the userspace
> > sets TCP_CORK. Am I reading it correctly? Is that expected?
>
> I have tested a new version of the patch which can target only TCP_NODELAY.
> Results using previous benchmark are identical. I will submit it in a new
> patch version.

Well, I do not think we will accept a patch there, because you
basically are working around the root cause
for a certain variety of workloads.

Issue would still be there for applications not using TCP_NODELAY

>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -716,7 +716,8 @@
>
>         tcp_mark_urg(tp, flags);
>
> -       if (tcp_should_autocork(sk, skb, size_goal)) {
> +       if (!(nonagle & TCP_NAGLE_OFF) &&
> +           tcp_should_autocork(sk, skb, size_goal)) {
>
>                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
>                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
>
>
>
> > Also I wonder about these 40ms delays, TCP small queue handler should
> > kick when the prior skb is TX completed.
> >
> > It seems the issue is on the driver side ?
> >
> > Salvatore, which driver are you using ?
>
> I am using ENA driver.
>
> Eric can you please clarify where do you think the problem is?

The problem is that TSQ logic is not working properly, probably
because the driver
holds a packet that has been sent.

TX completion seems to be delayed until the next transmit happens on
the transmit queue.

I suspect some kind of missed interrupt or a race.

virtio_net is known to have a similar issue (not sure if this has been
fixed lately)

ena_io_poll() and ena_intr_msix_io() logic, playing with
ena_napi->interrupts_masked seem
convoluted/risky to me.

ena_start_xmit() also seems to have bugs vs xmit_more logic, but this
is orthogonal.

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..5282e718699ba9e64765bea2435e1c5a55aaa89b
100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -3235,6 +3235,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff
*skb, struct net_device *dev)

 error_drop_packet:
        dev_kfree_skb(skb);
+       /* Make sure to ring the doorbell. */
+       ena_ring_tx_doorbell(tx_ring);
        return NETDEV_TX_OK;
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-14  8:40       ` Eric Dumazet
@ 2023-12-14  9:05         ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2023-12-14  9:05 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dsahern,
	kuba, netdev, pabeni

On Thu, Dec 14, 2023 at 9:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
> >
> > > It looks like the above disables autocorking even after the userspace
> > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> >
> > I have tested a new version of the patch which can target only TCP_NODELAY.
> > Results using previous benchmark are identical. I will submit it in a new
> > patch version.
>
> Well, I do not think we will accept a patch there, because you
> basically are working around the root cause
> for a certain variety of workloads.
>
> Issue would still be there for applications not using TCP_NODELAY
>
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -716,7 +716,8 @@
> >
> >         tcp_mark_urg(tp, flags);
> >
> > -       if (tcp_should_autocork(sk, skb, size_goal)) {
> > +       if (!(nonagle & TCP_NAGLE_OFF) &&
> > +           tcp_should_autocork(sk, skb, size_goal)) {
> >
> >                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
> >                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> >
> >
> >
> > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > kick when the prior skb is TX completed.
> > >
> > > It seems the issue is on the driver side ?
> > >
> > > Salvatore, which driver are you using ?
> >
> > I am using ENA driver.
> >
> > Eric can you please clarify where do you think the problem is?
>
> The problem is that TSQ logic is not working properly, probably
> because the driver
> holds a packet that has been sent.
>
> TX completion seems to be delayed until the next transmit happens on
> the transmit queue.
>
> I suspect some kind of missed interrupt or a race.
>
> virtio_net is known to have a similar issue (not sure if this has been
> fixed lately)
>
> ena_io_poll() and ena_intr_msix_io() logic, playing with
> ena_napi->interrupts_masked seem
> convoluted/risky to me.
>
> ena_start_xmit() also seems to have bugs vs xmit_more logic, but this
> is orthogonal.
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..5282e718699ba9e64765bea2435e1c5a55aaa89b
> 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3235,6 +3235,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff
> *skb, struct net_device *dev)
>
>  error_drop_packet:
>         dev_kfree_skb(skb);
> +       /* Make sure to ring the doorbell. */
> +       ena_ring_tx_doorbell(tx_ring);
>         return NETDEV_TX_OK;
>  }

ena_io_poll() has a race against
u64_stats_update_begin(&tx_ring->syncp);/u64_stats_update_end(&tx_ring->syncp);

This should be done by this thread, while it still owns the NAPI SCHED bit.

Doing anything that might be racy after napi_complete_done() is a bug.
In this case, this could brick foreverer ena_get_stats64().

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index c44c44e26ddfe74a93b7f1fb3c3ca90f978909e2..e3464adfd0b791af621c92a651125ced2ad2de8a
100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2017,7 +2017,6 @@ static int ena_io_poll(struct napi_struct *napi,
int budget)
        int tx_work_done;
        int rx_work_done = 0;
        int tx_budget;
-       int napi_comp_call = 0;
        int ret;

        tx_ring = ena_napi->tx_ring;
@@ -2038,6 +2037,11 @@ static int ena_io_poll(struct napi_struct
*napi, int budget)
        if (likely(budget))
                rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);

+       u64_stats_update_begin(&tx_ring->syncp);
+       tx_ring->tx_stats.tx_poll++;
+       u64_stats_update_end(&tx_ring->syncp);
+       WRITE_ONCE(tx_ring->tx_stats.last_napi_jiffies, jiffies);
+
        /* If the device is about to reset or down, avoid unmask
         * the interrupt and return 0 so NAPI won't reschedule
         */
@@ -2047,7 +2051,9 @@ static int ena_io_poll(struct napi_struct *napi,
int budget)
                ret = 0;

        } else if ((budget > rx_work_done) && (tx_budget > tx_work_done)) {
-               napi_comp_call = 1;
+               u64_stats_update_begin(&tx_ring->syncp);
+               tx_ring->tx_stats.napi_comp++;
+               u64_stats_update_end(&tx_ring->syncp);

                /* Update numa and unmask the interrupt only when schedule
                 * from the interrupt context (vs from sk_busy_loop)
@@ -2071,13 +2077,6 @@ static int ena_io_poll(struct napi_struct
*napi, int budget)
                ret = budget;
        }

-       u64_stats_update_begin(&tx_ring->syncp);
-       tx_ring->tx_stats.napi_comp += napi_comp_call;
-       tx_ring->tx_stats.tx_poll++;
-       u64_stats_update_end(&tx_ring->syncp);
-
-       tx_ring->tx_stats.last_napi_jiffies = jiffies;
-
        return ret;
 }

@@ -3235,6 +3234,8 @@ static netdev_tx_t ena_start_xmit(struct sk_buff
*skb, struct net_device *dev)

 error_drop_packet:
        dev_kfree_skb(skb);
+       /* Make sure to ring the doorbell. */
+       ena_ring_tx_doorbell(tx_ring);
        return NETDEV_TX_OK;
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-13 21:30     ` Salvatore Dipietro
  2023-12-14  8:40       ` Eric Dumazet
@ 2023-12-14 14:08       ` Eric Dumazet
  2023-12-14 15:52         ` Geoff Blake
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2023-12-14 14:08 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dsahern,
	kuba, netdev, pabeni

On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
>
> > It looks like the above disables autocorking even after the userspace
> > sets TCP_CORK. Am I reading it correctly? Is that expected?
>
> I have tested a new version of the patch which can target only TCP_NODELAY.
> Results using previous benchmark are identical. I will submit it in a new
> patch version.
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -716,7 +716,8 @@
>
>         tcp_mark_urg(tp, flags);
>
> -       if (tcp_should_autocork(sk, skb, size_goal)) {
> +       if (!(nonagle & TCP_NAGLE_OFF) &&
> +           tcp_should_autocork(sk, skb, size_goal)) {
>
>                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
>                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
>
>
>
> > Also I wonder about these 40ms delays, TCP small queue handler should
> > kick when the prior skb is TX completed.
> >
> > It seems the issue is on the driver side ?
> >
> > Salvatore, which driver are you using ?
>
> I am using ENA driver.
>
> Eric can you please clarify where do you think the problem is?
>

Following bpftrace program could double check if ena driver is
possibly holding TCP skbs too long:

bpftrace -e 'k:dev_hard_start_xmit {
 $skb = (struct sk_buff *)arg0;
 if ($skb->fclone == 2) {
  @start[$skb] = nsecs;
 }
}
k:__kfree_skb {
 $skb = (struct sk_buff *)arg0;
 if ($skb->fclone == 2 && @start[$skb]) {
  @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
  delete(@start[$skb]);
}
} END { clear(@start); }'

iroa21:/home/edumazet# ./trace-tx-completion.sh
Attaching 3 probes...
^C


@tx_compl_usecs:
[2, 4)                13 |                                                    |
[4, 8)               182 |                                                    |
[8, 16)          2379007 |@@@@@@@@@@@@@@@                                     |
[16, 32)         7865369 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)         6040939 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[64, 128)         199255 |@                                                   |
[128, 256)          9235 |                                                    |
[256, 512)            89 |                                                    |
[512, 1K)             37 |                                                    |
[1K, 2K)              19 |                                                    |
[2K, 4K)              56 |                                                    |

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-14 14:08       ` Eric Dumazet
@ 2023-12-14 15:52         ` Geoff Blake
  2023-12-14 16:07           ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Geoff Blake @ 2023-12-14 15:52 UTC (permalink / raw
  To: Eric Dumazet
  Cc: Salvatore Dipietro, alisaidi, benh, davem, dipietro.salvatore,
	dsahern, kuba, netdev, pabeni

[-- Attachment #1: Type: text/plain, Size: 3780 bytes --]

Thanks for helping dig in here Eric, but what is supposed to happen on TX 
completion? We're unfamiliar with TCP small queues beside finding your old 
LKML listing that states a tasklet is supposed to run if there is pending 
data.  So need a bit more guidance if you could.

I think its supposed to call tcp_free() when the skb is destructed and 
that invokes the tasklet?  There is also sock_wfree(), it does not appear 
to have the linkage to the tasklet by design.

We did attach probes at one point to look at whether there was a chance an 
interrupt went missing (but don't have them on-hand anymore), but we 
always saw the TX completion happen. When the 40ms latency happened 
we'd see that the completion had happened just after the other packet decided to 
be corked.  But it certainly doesn't hurt to double check.  

- Geoff Blake

On Thu, 14 Dec 2023, Eric Dumazet wrote:

> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
> >
> > > It looks like the above disables autocorking even after the userspace
> > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> >
> > I have tested a new version of the patch which can target only TCP_NODELAY.
> > Results using previous benchmark are identical. I will submit it in a new
> > patch version.
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -716,7 +716,8 @@
> >
> >         tcp_mark_urg(tp, flags);
> >
> > -       if (tcp_should_autocork(sk, skb, size_goal)) {
> > +       if (!(nonagle & TCP_NAGLE_OFF) &&
> > +           tcp_should_autocork(sk, skb, size_goal)) {
> >
> >                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
> >                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> >
> >
> >
> > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > kick when the prior skb is TX completed.
> > >
> > > It seems the issue is on the driver side ?
> > >
> > > Salvatore, which driver are you using ?
> >
> > I am using ENA driver.
> >
> > Eric can you please clarify where do you think the problem is?
> >
> 
> Following bpftrace program could double check if ena driver is
> possibly holding TCP skbs too long:
> 
> bpftrace -e 'k:dev_hard_start_xmit {
>  $skb = (struct sk_buff *)arg0;
>  if ($skb->fclone == 2) {
>   @start[$skb] = nsecs;
>  }
> }
> k:__kfree_skb {
>  $skb = (struct sk_buff *)arg0;
>  if ($skb->fclone == 2 && @start[$skb]) {
>   @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
>   delete(@start[$skb]);
> }
> } END { clear(@start); }'
> 
> iroa21:/home/edumazet# ./trace-tx-completion.sh
> Attaching 3 probes...
> ^C
> 
> 
> @tx_compl_usecs:
> [2, 4)                13 |                                                    |
> [4, 8)               182 |                                                    |
> [8, 16)          2379007 |@@@@@@@@@@@@@@@                                     |
> [16, 32)         7865369 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32, 64)         6040939 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> [64, 128)         199255 |@                                                   |
> [128, 256)          9235 |                                                    |
> [256, 512)            89 |                                                    |
> [512, 1K)             37 |                                                    |
> [1K, 2K)              19 |                                                    |
> [2K, 4K)              56 |                                                    |
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  2023-12-14 15:52         ` Geoff Blake
@ 2023-12-14 16:07           ` Eric Dumazet
  2024-01-17 21:26             ` [PATCH v2] tcp: Add memory barrier to tcp_push() Salvatore Dipietro
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2023-12-14 16:07 UTC (permalink / raw
  To: Geoff Blake
  Cc: Salvatore Dipietro, alisaidi, benh, davem, dipietro.salvatore,
	dsahern, kuba, netdev, pabeni

On Thu, Dec 14, 2023 at 4:52 PM Geoff Blake <blakgeof@amazon.com> wrote:
>
> Thanks for helping dig in here Eric, but what is supposed to happen on TX
> completion? We're unfamiliar with TCP small queues beside finding your old
> LKML listing that states a tasklet is supposed to run if there is pending
> data.  So need a bit more guidance if you could.
>
> I think its supposed to call tcp_free() when the skb is destructed and
> that invokes the tasklet?  There is also sock_wfree(), it does not appear
> to have the linkage to the tasklet by design.
>
> We did attach probes at one point to look at whether there was a chance an
> interrupt went missing (but don't have them on-hand anymore), but we
> always saw the TX completion happen. When the 40ms latency happened
> we'd see that the completion had happened just after the other packet decided to
> be corked.  But it certainly doesn't hurt to double check.

When TX completion happens, while autocorking was setup, TSQ_THROTTLED
bit was set on sk->sk_tsq_flags, so TSQ logic should call
tcp_tsq_handler() -> tcp_tsq_write() -> tcp_write_xmit()

tcp_write_xmit() should send the pending packet (if CWND and other
constraints allows this)

autocorking is all about giving chance to the application to
add more bytes on the pending skb before TX completion happens
(typically in less  than 100 usec on an idle qdisc/nic)

If your life depends on not waiting for this delay, you have two options :

1) use MSG_EOR
2) disable autocorking (/proc/sys/net/ipv4/tcp_autocorking



>
> - Geoff Blake
>
> On Thu, 14 Dec 2023, Eric Dumazet wrote:
>
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> >
> >
> >
> > On Wed, Dec 13, 2023 at 10:30 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
> > >
> > > > It looks like the above disables autocorking even after the userspace
> > > > sets TCP_CORK. Am I reading it correctly? Is that expected?
> > >
> > > I have tested a new version of the patch which can target only TCP_NODELAY.
> > > Results using previous benchmark are identical. I will submit it in a new
> > > patch version.
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -716,7 +716,8 @@
> > >
> > >         tcp_mark_urg(tp, flags);
> > >
> > > -       if (tcp_should_autocork(sk, skb, size_goal)) {
> > > +       if (!(nonagle & TCP_NAGLE_OFF) &&
> > > +           tcp_should_autocork(sk, skb, size_goal)) {
> > >
> > >                 /* avoid atomic op if TSQ_THROTTLED bit is already set */
> > >                 if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
> > >
> > >
> > >
> > > > Also I wonder about these 40ms delays, TCP small queue handler should
> > > > kick when the prior skb is TX completed.
> > > >
> > > > It seems the issue is on the driver side ?
> > > >
> > > > Salvatore, which driver are you using ?
> > >
> > > I am using ENA driver.
> > >
> > > Eric can you please clarify where do you think the problem is?
> > >
> >
> > Following bpftrace program could double check if ena driver is
> > possibly holding TCP skbs too long:
> >
> > bpftrace -e 'k:dev_hard_start_xmit {
> >  $skb = (struct sk_buff *)arg0;
> >  if ($skb->fclone == 2) {
> >   @start[$skb] = nsecs;
> >  }
> > }
> > k:__kfree_skb {
> >  $skb = (struct sk_buff *)arg0;
> >  if ($skb->fclone == 2 && @start[$skb]) {
> >   @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> >   delete(@start[$skb]);
> > }
> > } END { clear(@start); }'
> >
> > iroa21:/home/edumazet# ./trace-tx-completion.sh
> > Attaching 3 probes...
> > ^C
> >
> >
> > @tx_compl_usecs:
> > [2, 4)                13 |                                                    |
> > [4, 8)               182 |                                                    |
> > [8, 16)          2379007 |@@@@@@@@@@@@@@@                                     |
> > [16, 32)         7865369 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32, 64)         6040939 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> > [64, 128)         199255 |@                                                   |
> > [128, 256)          9235 |                                                    |
> > [256, 512)            89 |                                                    |
> > [512, 1K)             37 |                                                    |
> > [1K, 2K)              19 |                                                    |
> > [2K, 4K)              56 |                                                    |
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2] tcp: Add memory barrier to tcp_push()
  2023-12-14 16:07           ` Eric Dumazet
@ 2024-01-17 21:26             ` Salvatore Dipietro
  2024-01-17 21:58               ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Salvatore Dipietro @ 2024-01-17 21:26 UTC (permalink / raw
  To: edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, pabeni

On CPUs with weak memory models, reads and updates performed by tcp_push to the
sk variables can get reordered leaving the socket throttled when it should not.
The tasklet running tcp_wfree() may also not observe the memory updates in time
and will skip flushing any packets throttled by tcp_push(), delaying the sending.
This can pathologically cause 40ms extra latency due to bad interactions with
delayed acks.

Modeling the memory access behavior of tcp_push() (P0) and tcp_wfree() (P1)
using the herd7 simulator, proves this behavior can occur. Below is the litmus
model which describes the functions:
```
C MP+tcp
{
  [flag] = 0;
  [sk] = 5;
  [corked] = 0;
}

P0(int *flag, int *sk, int *corked){
    int r0;
    int r1;
    int r2;

    r1 = READ_ONCE(*sk);
    if (r1 == 5) {

        r0 = READ_ONCE(*flag);
        if (r0 == 0) {
            WRITE_ONCE(*flag, 1);
        }

        // memory barrier added in this patch,
        // original code does not order the reads/writes
        smp_mb();

        r2 = READ_ONCE(*sk);
        if (r2 == 5 ) {
            WRITE_ONCE(*corked,1);
        }
    }
}

P1(int *flag, int *sk, int *corked){
    int r0;
    int r1;

    r1 = READ_ONCE(*sk);
    smp_store_release(sk, 0);

    r0 = smp_load_acquire(flag);
    if (r0 == 1) {
        smp_store_release(flag, 0);
    }
}
locations [0:r0; 0:r1; 0:r2; 1:r0; 1:r1; flag; sk; corked; ]
exists ( flag=1 /\ corked=1 )
```

Adding the memory barrier removes the positive witness from the memory model.
smp_mb__after_atomic() is used to not incur in unnecessary overhead on x86
since not affected.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:
```
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}
```
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.

# No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.91ms
 75.000%    1.13ms
 90.000%    1.46ms
 99.000%    1.74ms
 99.900%    1.89ms
 99.990%   41.95ms  <<< 40+ ms extra latency
 99.999%   48.32ms
100.000%   48.96ms

# With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.90ms
 75.000%    1.13ms
 90.000%    1.45ms
 99.000%    1.72ms
 99.900%    1.83ms
 99.990%    2.11ms  <<< no 40+ ms extra latency
 99.999%    2.53ms
100.000%    2.62ms

Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.

Fixes: f54b311142a9 ("tcp: auto corking")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
---
 net/ipv4/tcp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..ab9e3922393c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
 		/* It is possible TX completion already happened
 		 * before we set TSQ_THROTTLED.
 		 */
+		smp_mb__after_atomic();
 		if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
 			return;
 	}
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] tcp: Add memory barrier to tcp_push()
  2024-01-17 21:26             ` [PATCH v2] tcp: Add memory barrier to tcp_push() Salvatore Dipietro
@ 2024-01-17 21:58               ` Eric Dumazet
  2024-01-17 23:16                 ` [PATCH v3] " Salvatore Dipietro
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2024-01-17 21:58 UTC (permalink / raw
  To: Salvatore Dipietro
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dsahern,
	kuba, netdev, pabeni

On Wed, Jan 17, 2024 at 10:29 PM Salvatore Dipietro <dipiets@amazon.com> wrote:
>
> On CPUs with weak memory models, reads and updates performed by tcp_push to the
> sk variables can get reordered leaving the socket throttled when it should not.
> The tasklet running tcp_wfree() may also not observe the memory updates in time
> and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> This can pathologically cause 40ms extra latency due to bad interactions with
> delayed acks.
>
> Modeling the memory access behavior of tcp_push() (P0) and tcp_wfree() (P1)
> using the herd7 simulator, proves this behavior can occur. Below is the litmus
> model which describes the functions:
> ```
> C MP+tcp
> {
>   [flag] = 0;
>   [sk] = 5;
>   [corked] = 0;
> }
>
> P0(int *flag, int *sk, int *corked){
>     int r0;
>     int r1;
>     int r2;
>
>     r1 = READ_ONCE(*sk);
>     if (r1 == 5) {
>
>         r0 = READ_ONCE(*flag);
>         if (r0 == 0) {
>             WRITE_ONCE(*flag, 1);
>         }
>
>         // memory barrier added in this patch,
>         // original code does not order the reads/writes
>         smp_mb();
>
>         r2 = READ_ONCE(*sk);
>         if (r2 == 5 ) {
>             WRITE_ONCE(*corked,1);
>         }
>     }
> }
>

Interesting.

Quite frankly I doubt this herd7 stuff needs to be in a changelog,
this is too verbose/pedantic.

What about instead referring to similar commit bf06200e732d
("tcp: tsq: fix nonagle handling")

This was exactly the same issue, but at that time tcp_push() was not
re-reading sk->sk_wmem_alloc




> P1(int *flag, int *sk, int *corked){
>     int r0;
>     int r1;
>
>     r1 = READ_ONCE(*sk);
>     smp_store_release(sk, 0);
>
>     r0 = smp_load_acquire(flag);
>     if (r0 == 1) {
>         smp_store_release(flag, 0);
>     }
> }
> locations [0:r0; 0:r1; 0:r2; 1:r0; 1:r1; flag; sk; corked; ]
> exists ( flag=1 /\ corked=1 )
> ```
>
> Adding the memory barrier removes the positive witness from the memory model.
> smp_mb__after_atomic() is used to not incur in unnecessary overhead on x86
> since not affected.
> Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> Apache Tomcat 9.0.83 running the basic servlet below:
> ```
> import java.io.IOException;
> import java.io.OutputStreamWriter;
> import java.io.PrintWriter;
> import javax.servlet.ServletException;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
>
> public class HelloWorldServlet extends HttpServlet {
>     @Override
>     protected void doGet(HttpServletRequest request, HttpServletResponse response)
>       throws ServletException, IOException {
>         response.setContentType("text/html;charset=utf-8");
>         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
>         String s = "a".repeat(3096);
>         osw.write(s,0,s.length());
>         osw.flush();
>     }
> }
> ```
> Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> values is observed while, with the patch, the extra latency disappears.
>
> # No patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.91ms
>  75.000%    1.13ms
>  90.000%    1.46ms
>  99.000%    1.74ms
>  99.900%    1.89ms
>  99.990%   41.95ms  <<< 40+ ms extra latency
>  99.999%   48.32ms
> 100.000%   48.96ms
>
> # With patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.90ms
>  75.000%    1.13ms
>  90.000%    1.45ms
>  99.000%    1.72ms
>  99.900%    1.83ms
>  99.990%    2.11ms  <<< no 40+ ms extra latency
>  99.999%    2.53ms
> 100.000%    2.62ms
>
> Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> affected by this issue and the patch doesn't introduce any additional
> delay.
>
> Fixes: f54b311142a9 ("tcp: auto corking")

Bug came with a181ceb501b3 ("tcp: autocork should not hold first
packet in write queue")



> Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> ---
>  net/ipv4/tcp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..ab9e3922393c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
>                 /* It is possible TX completion already happened
>                  * before we set TSQ_THROTTLED.
>                  */
> +               smp_mb__after_atomic();
>                 if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
>                         return;
>         }
> --
> 2.42.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-17 21:58               ` Eric Dumazet
@ 2024-01-17 23:16                 ` Salvatore Dipietro
  2024-01-18 10:38                   ` Paolo Abeni
  0 siblings, 1 reply; 28+ messages in thread
From: Salvatore Dipietro @ 2024-01-17 23:16 UTC (permalink / raw
  To: edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, pabeni

On CPUs with weak memory models, reads and updates performed by tcp_push to the
sk variables can get reordered leaving the socket throttled when it should not.
The tasklet running tcp_wfree() may also not observe the memory updates in time
and will skip flushing any packets throttled by tcp_push(), delaying the sending.
This can pathologically cause 40ms extra latency due to bad interactions with
delayed acks.

Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
on x86 since not affected.

Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
Apache Tomcat 9.0.83 running the basic servlet below:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}

Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.

# No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.91ms
 75.000%    1.13ms
 90.000%    1.46ms
 99.000%    1.74ms
 99.900%    1.89ms
 99.990%   41.95ms  <<< 40+ ms extra latency
 99.999%   48.32ms
100.000%   48.96ms

# With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.90ms
 75.000%    1.13ms
 90.000%    1.45ms
 99.000%    1.72ms
 99.900%    1.83ms
 99.990%    2.11ms  <<< no 40+ ms extra latency
 99.999%    2.53ms
100.000%    2.62ms

Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.

Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
queue")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
---
 net/ipv4/tcp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..ab9e3922393c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
 		/* It is possible TX completion already happened
 		 * before we set TSQ_THROTTLED.
 		 */
+		smp_mb__after_atomic();
 		if (refcount_read(&sk->sk_wmem_alloc) > skb->truesize)
 			return;
 	}
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-17 23:16                 ` [PATCH v3] " Salvatore Dipietro
@ 2024-01-18 10:38                   ` Paolo Abeni
  2024-01-18 10:42                     ` Eric Dumazet
  2024-01-18 17:46                     ` Geoff Blake
  0 siblings, 2 replies; 28+ messages in thread
From: Paolo Abeni @ 2024-01-18 10:38 UTC (permalink / raw
  To: Salvatore Dipietro, edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dsahern,
	kuba, netdev

On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> On CPUs with weak memory models, reads and updates performed by tcp_push to the
> sk variables can get reordered leaving the socket throttled when it should not.
> The tasklet running tcp_wfree() may also not observe the memory updates in time
> and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> This can pathologically cause 40ms extra latency due to bad interactions with
> delayed acks.
> 
> Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> on x86 since not affected.
> 
> Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> Apache Tomcat 9.0.83 running the basic servlet below:
> 
> import java.io.IOException;
> import java.io.OutputStreamWriter;
> import java.io.PrintWriter;
> import javax.servlet.ServletException;
> import javax.servlet.http.HttpServlet;
> import javax.servlet.http.HttpServletRequest;
> import javax.servlet.http.HttpServletResponse;
> 
> public class HelloWorldServlet extends HttpServlet {
>     @Override
>     protected void doGet(HttpServletRequest request, HttpServletResponse response)
>       throws ServletException, IOException {
>         response.setContentType("text/html;charset=utf-8");
>         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
>         String s = "a".repeat(3096);
>         osw.write(s,0,s.length());
>         osw.flush();
>     }
> }
> 
> Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> values is observed while, with the patch, the extra latency disappears.
> 
> # No patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.91ms
>  75.000%    1.13ms
>  90.000%    1.46ms
>  99.000%    1.74ms
>  99.900%    1.89ms
>  99.990%   41.95ms  <<< 40+ ms extra latency
>  99.999%   48.32ms
> 100.000%   48.96ms
> 
> # With patch and tcp_autocorking=1
> ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
>   ...
>  50.000%    0.90ms
>  75.000%    1.13ms
>  90.000%    1.45ms
>  99.000%    1.72ms
>  99.900%    1.83ms
>  99.990%    2.11ms  <<< no 40+ ms extra latency
>  99.999%    2.53ms
> 100.000%    2.62ms
> 
> Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> affected by this issue and the patch doesn't introduce any additional
> delay.
> 
> Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> queue")

Please read carefully the process documentation under
Documentation/process/ and specifically the netdev specific bits:

no resubmissions within the 24h grace period.

Please double-check your patch with checkpatch for formal errors: the
fixes tag must not be split across multiple lines.

And please do not sent new version in reply to a previous one: it will
confuse the bot.

> Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> ---
>  net/ipv4/tcp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..ab9e3922393c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
>  		/* It is possible TX completion already happened
>  		 * before we set TSQ_THROTTLED.
>  		 */
> +		smp_mb__after_atomic();

Out of sheer ignorance I'm wondering if moving such barrier inside the
above 'if' just after 'set_bit' would suffice?

Thanks!

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 10:38                   ` Paolo Abeni
@ 2024-01-18 10:42                     ` Eric Dumazet
  2024-01-18 16:40                       ` Jakub Kicinski
  2024-01-18 17:46                     ` Geoff Blake
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2024-01-18 10:42 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Salvatore Dipietro, alisaidi, benh, blakgeof, davem,
	dipietro.salvatore, dsahern, kuba, netdev

On Thu, Jan 18, 2024 at 11:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > sk variables can get reordered leaving the socket throttled when it should not.
> > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > This can pathologically cause 40ms extra latency due to bad interactions with
> > delayed acks.
> >
> > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > on x86 since not affected.
> >
> > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > Apache Tomcat 9.0.83 running the basic servlet below:
> >
> > import java.io.IOException;
> > import java.io.OutputStreamWriter;
> > import java.io.PrintWriter;
> > import javax.servlet.ServletException;
> > import javax.servlet.http.HttpServlet;
> > import javax.servlet.http.HttpServletRequest;
> > import javax.servlet.http.HttpServletResponse;
> >
> > public class HelloWorldServlet extends HttpServlet {
> >     @Override
> >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> >       throws ServletException, IOException {
> >         response.setContentType("text/html;charset=utf-8");
> >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> >         String s = "a".repeat(3096);
> >         osw.write(s,0,s.length());
> >         osw.flush();
> >     }
> > }
> >
> > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > values is observed while, with the patch, the extra latency disappears.
> >
> > # No patch and tcp_autocorking=1
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> >   ...
> >  50.000%    0.91ms
> >  75.000%    1.13ms
> >  90.000%    1.46ms
> >  99.000%    1.74ms
> >  99.900%    1.89ms
> >  99.990%   41.95ms  <<< 40+ ms extra latency
> >  99.999%   48.32ms
> > 100.000%   48.96ms
> >
> > # With patch and tcp_autocorking=1
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> >   ...
> >  50.000%    0.90ms
> >  75.000%    1.13ms
> >  90.000%    1.45ms
> >  99.000%    1.72ms
> >  99.900%    1.83ms
> >  99.990%    2.11ms  <<< no 40+ ms extra latency
> >  99.999%    2.53ms
> > 100.000%    2.62ms
> >
> > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > affected by this issue and the patch doesn't introduce any additional
> > delay.
> >
> > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > queue")
>
> Please read carefully the process documentation under
> Documentation/process/ and specifically the netdev specific bits:
>
> no resubmissions within the 24h grace period.
>
> Please double-check your patch with checkpatch for formal errors: the
> fixes tag must not be split across multiple lines.
>
> And please do not sent new version in reply to a previous one: it will
> confuse the bot.
>
> > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > ---
> >  net/ipv4/tcp.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index ff6838ca2e58..ab9e3922393c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> >               /* It is possible TX completion already happened
> >                * before we set TSQ_THROTTLED.
> >                */
> > +             smp_mb__after_atomic();
>
> Out of sheer ignorance I'm wondering if moving such barrier inside the
> above 'if' just after 'set_bit' would suffice?

I think this would work just fine.

>
> Thanks!
>
> Paolo
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 10:42                     ` Eric Dumazet
@ 2024-01-18 16:40                       ` Jakub Kicinski
  0 siblings, 0 replies; 28+ messages in thread
From: Jakub Kicinski @ 2024-01-18 16:40 UTC (permalink / raw
  To: Eric Dumazet
  Cc: Paolo Abeni, Salvatore Dipietro, alisaidi, benh, blakgeof, davem,
	dipietro.salvatore, dsahern, netdev

On Thu, 18 Jan 2024 11:42:40 +0100 Eric Dumazet wrote:
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index ff6838ca2e58..ab9e3922393c 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > >               /* It is possible TX completion already happened
> > >                * before we set TSQ_THROTTLED.
> > >                */
> > > +             smp_mb__after_atomic();  
> >
> > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > above 'if' just after 'set_bit' would suffice?  
> 
> I think this would work just fine.

Sorry, "this" as in Paolo's suggestion or "this" as in the v3 patch 
as posted? :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 10:38                   ` Paolo Abeni
  2024-01-18 10:42                     ` Eric Dumazet
@ 2024-01-18 17:46                     ` Geoff Blake
  2024-01-18 18:30                       ` Paolo Abeni
  1 sibling, 1 reply; 28+ messages in thread
From: Geoff Blake @ 2024-01-18 17:46 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Salvatore Dipietro, edumazet, alisaidi, benh, davem,
	dipietro.salvatore, dsahern, kuba, netdev



On Thu, 18 Jan 2024, Paolo Abeni wrote:

> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > sk variables can get reordered leaving the socket throttled when it should not.
> > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > This can pathologically cause 40ms extra latency due to bad interactions with
> > delayed acks.
> >
> > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > on x86 since not affected.
> >
> > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > Apache Tomcat 9.0.83 running the basic servlet below:
> >
> > import java.io.IOException;
> > import java.io.OutputStreamWriter;
> > import java.io.PrintWriter;
> > import javax.servlet.ServletException;
> > import javax.servlet.http.HttpServlet;
> > import javax.servlet.http.HttpServletRequest;
> > import javax.servlet.http.HttpServletResponse;
> >
> > public class HelloWorldServlet extends HttpServlet {
> >     @Override
> >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> >       throws ServletException, IOException {
> >         response.setContentType("text/html;charset=utf-8");
> >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> >         String s = "a".repeat(3096);
> >         osw.write(s,0,s.length());
> >         osw.flush();
> >     }
> > }
> >
> > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > values is observed while, with the patch, the extra latency disappears.
> >
> > # No patch and tcp_autocorking=1
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> >   ...
> >  50.000%    0.91ms
> >  75.000%    1.13ms
> >  90.000%    1.46ms
> >  99.000%    1.74ms
> >  99.900%    1.89ms
> >  99.990%   41.95ms  <<< 40+ ms extra latency
> >  99.999%   48.32ms
> > 100.000%   48.96ms
> >
> > # With patch and tcp_autocorking=1
> > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> >   ...
> >  50.000%    0.90ms
> >  75.000%    1.13ms
> >  90.000%    1.45ms
> >  99.000%    1.72ms
> >  99.900%    1.83ms
> >  99.990%    2.11ms  <<< no 40+ ms extra latency
> >  99.999%    2.53ms
> > 100.000%    2.62ms
> >
> > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > affected by this issue and the patch doesn't introduce any additional
> > delay.
> >
> > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > queue")
> 
> Please read carefully the process documentation under
> Documentation/process/ and specifically the netdev specific bits:
> 
> no resubmissions within the 24h grace period.
> 
> Please double-check your patch with checkpatch for formal errors: the
> fixes tag must not be split across multiple lines.
> 
> And please do not sent new version in reply to a previous one: it will
> confuse the bot.
> 
> > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > ---
> >  net/ipv4/tcp.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index ff6838ca2e58..ab9e3922393c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> >               /* It is possible TX completion already happened
> >                * before we set TSQ_THROTTLED.
> >                */
> > +             smp_mb__after_atomic();
> 
> Out of sheer ignorance I'm wondering if moving such barrier inside the
> above 'if' just after 'set_bit' would suffice?

According to the herd7 modeling tool, the answer is no for weak memory 
models.  If we put the smp_mb inside the if, it allows the machine to 
reorder the two reads to sk_wmem_alloc and we can get to the bad state 
this patch is fixing.  Placing it outside the if ensures 
the ordering between those two reads as well as ordering the write to the 
flags variable.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 17:46                     ` Geoff Blake
@ 2024-01-18 18:30                       ` Paolo Abeni
  2024-01-18 19:30                         ` Geoff Blake
  0 siblings, 1 reply; 28+ messages in thread
From: Paolo Abeni @ 2024-01-18 18:30 UTC (permalink / raw
  To: Geoff Blake
  Cc: Salvatore Dipietro, edumazet, alisaidi, benh, davem,
	dipietro.salvatore, dsahern, kuba, netdev

On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> 
> On Thu, 18 Jan 2024, Paolo Abeni wrote:
> 
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > 
> > 
> > 
> > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > sk variables can get reordered leaving the socket throttled when it should not.
> > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > delayed acks.
> > > 
> > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > on x86 since not affected.
> > > 
> > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > 
> > > import java.io.IOException;
> > > import java.io.OutputStreamWriter;
> > > import java.io.PrintWriter;
> > > import javax.servlet.ServletException;
> > > import javax.servlet.http.HttpServlet;
> > > import javax.servlet.http.HttpServletRequest;
> > > import javax.servlet.http.HttpServletResponse;
> > > 
> > > public class HelloWorldServlet extends HttpServlet {
> > >     @Override
> > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > >       throws ServletException, IOException {
> > >         response.setContentType("text/html;charset=utf-8");
> > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > >         String s = "a".repeat(3096);
> > >         osw.write(s,0,s.length());
> > >         osw.flush();
> > >     }
> > > }
> > > 
> > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > values is observed while, with the patch, the extra latency disappears.
> > > 
> > > # No patch and tcp_autocorking=1
> > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > >   ...
> > >  50.000%    0.91ms
> > >  75.000%    1.13ms
> > >  90.000%    1.46ms
> > >  99.000%    1.74ms
> > >  99.900%    1.89ms
> > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > >  99.999%   48.32ms
> > > 100.000%   48.96ms
> > > 
> > > # With patch and tcp_autocorking=1
> > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > >   ...
> > >  50.000%    0.90ms
> > >  75.000%    1.13ms
> > >  90.000%    1.45ms
> > >  99.000%    1.72ms
> > >  99.900%    1.83ms
> > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > >  99.999%    2.53ms
> > > 100.000%    2.62ms
> > > 
> > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > affected by this issue and the patch doesn't introduce any additional
> > > delay.
> > > 
> > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > queue")
> > 
> > Please read carefully the process documentation under
> > Documentation/process/ and specifically the netdev specific bits:
> > 
> > no resubmissions within the 24h grace period.
> > 
> > Please double-check your patch with checkpatch for formal errors: the
> > fixes tag must not be split across multiple lines.
> > 
> > And please do not sent new version in reply to a previous one: it will
> > confuse the bot.
> > 
> > > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > > ---
> > >  net/ipv4/tcp.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index ff6838ca2e58..ab9e3922393c 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > >               /* It is possible TX completion already happened
> > >                * before we set TSQ_THROTTLED.
> > >                */
> > > +             smp_mb__after_atomic();
> > 
> > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > above 'if' just after 'set_bit' would suffice?
> 
> According to the herd7 modeling tool, the answer is no for weak memory 
> models.  If we put the smp_mb inside the if, it allows the machine to 
> reorder the two reads to sk_wmem_alloc and we can get to the bad state 
> this patch is fixing.  Placing it outside the if ensures 
> the ordering between those two reads as well as ordering the write to the 
> flags variable.

For the records, I asked because reading the documentation  I assumed
that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
operation (in this case 'set_bit'). Is this a wrong interpretation?

Now I wonder if the patch is enough. Specifically, would it work
correctly when the TSQ_THROTTLED bit is already set and there is no RMW
operation in between the two refcount_read()?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 18:30                       ` Paolo Abeni
@ 2024-01-18 19:30                         ` Geoff Blake
  2024-01-19 13:33                           ` Eric Dumazet
  2024-04-17  7:33                           ` Eric Dumazet
  0 siblings, 2 replies; 28+ messages in thread
From: Geoff Blake @ 2024-01-18 19:30 UTC (permalink / raw
  To: Paolo Abeni
  Cc: Salvatore Dipietro, edumazet, alisaidi, benh, davem,
	dipietro.salvatore, dsahern, kuba, netdev



On Thu, 18 Jan 2024, Paolo Abeni wrote:

> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> >
> > On Thu, 18 Jan 2024, Paolo Abeni wrote:
> >
> > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > >
> > >
> > >
> > > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > > sk variables can get reordered leaving the socket throttled when it should not.
> > > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > > delayed acks.
> > > >
> > > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > > on x86 since not affected.
> > > >
> > > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > >
> > > > import java.io.IOException;
> > > > import java.io.OutputStreamWriter;
> > > > import java.io.PrintWriter;
> > > > import javax.servlet.ServletException;
> > > > import javax.servlet.http.HttpServlet;
> > > > import javax.servlet.http.HttpServletRequest;
> > > > import javax.servlet.http.HttpServletResponse;
> > > >
> > > > public class HelloWorldServlet extends HttpServlet {
> > > >     @Override
> > > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > > >       throws ServletException, IOException {
> > > >         response.setContentType("text/html;charset=utf-8");
> > > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > > >         String s = "a".repeat(3096);
> > > >         osw.write(s,0,s.length());
> > > >         osw.flush();
> > > >     }
> > > > }
> > > >
> > > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > > values is observed while, with the patch, the extra latency disappears.
> > > >
> > > > # No patch and tcp_autocorking=1
> > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > >   ...
> > > >  50.000%    0.91ms
> > > >  75.000%    1.13ms
> > > >  90.000%    1.46ms
> > > >  99.000%    1.74ms
> > > >  99.900%    1.89ms
> > > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > > >  99.999%   48.32ms
> > > > 100.000%   48.96ms
> > > >
> > > > # With patch and tcp_autocorking=1
> > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > >   ...
> > > >  50.000%    0.90ms
> > > >  75.000%    1.13ms
> > > >  90.000%    1.45ms
> > > >  99.000%    1.72ms
> > > >  99.900%    1.83ms
> > > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > > >  99.999%    2.53ms
> > > > 100.000%    2.62ms
> > > >
> > > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > > affected by this issue and the patch doesn't introduce any additional
> > > > delay.
> > > >
> > > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > > queue")
> > >
> > > Please read carefully the process documentation under
> > > Documentation/process/ and specifically the netdev specific bits:
> > >
> > > no resubmissions within the 24h grace period.
> > >
> > > Please double-check your patch with checkpatch for formal errors: the
> > > fixes tag must not be split across multiple lines.
> > >
> > > And please do not sent new version in reply to a previous one: it will
> > > confuse the bot.
> > >
> > > > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > > > ---
> > > >  net/ipv4/tcp.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index ff6838ca2e58..ab9e3922393c 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > > >               /* It is possible TX completion already happened
> > > >                * before we set TSQ_THROTTLED.
> > > >                */
> > > > +             smp_mb__after_atomic();
> > >
> > > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > > above 'if' just after 'set_bit' would suffice?
> >
> > According to the herd7 modeling tool, the answer is no for weak memory
> > models.  If we put the smp_mb inside the if, it allows the machine to
> > reorder the two reads to sk_wmem_alloc and we can get to the bad state
> > this patch is fixing.  Placing it outside the if ensures
> > the ordering between those two reads as well as ordering the write to the
> > flags variable.
> 
> For the records, I asked because reading the documentation  I assumed
> that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
> operation (in this case 'set_bit'). Is this a wrong interpretation?
> 
> Now I wonder if the patch is enough. Specifically, would it work
> correctly when the TSQ_THROTTLED bit is already set and there is no RMW
> operation in between the two refcount_read()?
> 

I fat fingered the model in the previous reply, sorry for the confusion.  
It appears we can put the barrier inside the if check and be fine.  For 
the x86 case reads won't pass other reads according to my understanding 
of their memory model which makes this work as is without the barrier to
begin with for when TSQ_THROTTLED is already 
set or not set.
  
For arm64, we need to generate a dmb.ish instruction after the store.  Can 
run a test to make sure the model is correct and can move the barrier 
to inside the if after the atomic bitset.

Thanks,
Geoff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 19:30                         ` Geoff Blake
@ 2024-01-19 13:33                           ` Eric Dumazet
  2024-04-17  7:33                           ` Eric Dumazet
  1 sibling, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2024-01-19 13:33 UTC (permalink / raw
  To: Geoff Blake
  Cc: Paolo Abeni, Salvatore Dipietro, alisaidi, benh, davem,
	dipietro.salvatore, dsahern, kuba, netdev

On Thu, Jan 18, 2024 at 8:30 PM Geoff Blake <blakgeof@amazon.com> wrote:
>
>
>
> On Thu, 18 Jan 2024, Paolo Abeni wrote:
>
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> >
> >
> >
> > On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> > >
> > > On Thu, 18 Jan 2024, Paolo Abeni wrote:
> > >
> > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > >
> > > >
> > > >
> > > > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > > > sk variables can get reordered leaving the socket throttled when it should not.
> > > > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > > > delayed acks.
> > > > >
> > > > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > > > on x86 since not affected.
> > > > >
> > > > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > > >
> > > > > import java.io.IOException;
> > > > > import java.io.OutputStreamWriter;
> > > > > import java.io.PrintWriter;
> > > > > import javax.servlet.ServletException;
> > > > > import javax.servlet.http.HttpServlet;
> > > > > import javax.servlet.http.HttpServletRequest;
> > > > > import javax.servlet.http.HttpServletResponse;
> > > > >
> > > > > public class HelloWorldServlet extends HttpServlet {
> > > > >     @Override
> > > > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > > > >       throws ServletException, IOException {
> > > > >         response.setContentType("text/html;charset=utf-8");
> > > > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > > > >         String s = "a".repeat(3096);
> > > > >         osw.write(s,0,s.length());
> > > > >         osw.flush();
> > > > >     }
> > > > > }
> > > > >
> > > > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > > > values is observed while, with the patch, the extra latency disappears.
> > > > >
> > > > > # No patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.91ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.46ms
> > > > >  99.000%    1.74ms
> > > > >  99.900%    1.89ms
> > > > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > > > >  99.999%   48.32ms
> > > > > 100.000%   48.96ms
> > > > >
> > > > > # With patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.90ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.45ms
> > > > >  99.000%    1.72ms
> > > > >  99.900%    1.83ms
> > > > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > > > >  99.999%    2.53ms
> > > > > 100.000%    2.62ms
> > > > >
> > > > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > > > affected by this issue and the patch doesn't introduce any additional
> > > > > delay.
> > > > >
> > > > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > > > queue")
> > > >
> > > > Please read carefully the process documentation under
> > > > Documentation/process/ and specifically the netdev specific bits:
> > > >
> > > > no resubmissions within the 24h grace period.
> > > >
> > > > Please double-check your patch with checkpatch for formal errors: the
> > > > fixes tag must not be split across multiple lines.
> > > >
> > > > And please do not sent new version in reply to a previous one: it will
> > > > confuse the bot.
> > > >
> > > > > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index ff6838ca2e58..ab9e3922393c 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > > > >               /* It is possible TX completion already happened
> > > > >                * before we set TSQ_THROTTLED.
> > > > >                */
> > > > > +             smp_mb__after_atomic();
> > > >
> > > > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > > > above 'if' just after 'set_bit' would suffice?
> > >
> > > According to the herd7 modeling tool, the answer is no for weak memory
> > > models.  If we put the smp_mb inside the if, it allows the machine to
> > > reorder the two reads to sk_wmem_alloc and we can get to the bad state
> > > this patch is fixing.  Placing it outside the if ensures
> > > the ordering between those two reads as well as ordering the write to the
> > > flags variable.
> >
> > For the records, I asked because reading the documentation  I assumed
> > that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
> > operation (in this case 'set_bit'). Is this a wrong interpretation?
> >
> > Now I wonder if the patch is enough. Specifically, would it work
> > correctly when the TSQ_THROTTLED bit is already set and there is no RMW
> > operation in between the two refcount_read()?
> >
>
> I fat fingered the model in the previous reply, sorry for the confusion.
> It appears we can put the barrier inside the if check and be fine.  For
> the x86 case reads won't pass other reads according to my understanding
> of their memory model which makes this work as is without the barrier to
> begin with for when TSQ_THROTTLED is already
> set or not set.
>
> For arm64, we need to generate a dmb.ish instruction after the store.  Can
> run a test to make sure the model is correct and can move the barrier
> to inside the if after the atomic bitset.

Great, Salvatore can you send a v4, with the Fixes: tag properly formatted
and including Paolo suggestion ?

Thanks a lot.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-01-18 19:30                         ` Geoff Blake
  2024-01-19 13:33                           ` Eric Dumazet
@ 2024-04-17  7:33                           ` Eric Dumazet
  2024-04-22 16:16                             ` Salvatore Dipietro
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2024-04-17  7:33 UTC (permalink / raw
  To: Geoff Blake
  Cc: Paolo Abeni, Salvatore Dipietro, alisaidi, benh, davem,
	dipietro.salvatore, dsahern, kuba, netdev

On Thu, Jan 18, 2024 at 8:30 PM Geoff Blake <blakgeof@amazon.com> wrote:
>
>
>
> On Thu, 18 Jan 2024, Paolo Abeni wrote:
>
> > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> >
> >
> >
> > On Thu, 2024-01-18 at 11:46 -0600, Geoff Blake wrote:
> > >
> > > On Thu, 18 Jan 2024, Paolo Abeni wrote:
> > >
> > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > >
> > > >
> > > >
> > > > On Wed, 2024-01-17 at 15:16 -0800, Salvatore Dipietro wrote:
> > > > > On CPUs with weak memory models, reads and updates performed by tcp_push to the
> > > > > sk variables can get reordered leaving the socket throttled when it should not.
> > > > > The tasklet running tcp_wfree() may also not observe the memory updates in time
> > > > > and will skip flushing any packets throttled by tcp_push(), delaying the sending.
> > > > > This can pathologically cause 40ms extra latency due to bad interactions with
> > > > > delayed acks.
> > > > >
> > > > > Adding a memory barrier in tcp_push before the sk_wmem_alloc read removes the
> > > > > bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle
> > > > > handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead
> > > > > on x86 since not affected.
> > > > >
> > > > > Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and
> > > > > Apache Tomcat 9.0.83 running the basic servlet below:
> > > > >
> > > > > import java.io.IOException;
> > > > > import java.io.OutputStreamWriter;
> > > > > import java.io.PrintWriter;
> > > > > import javax.servlet.ServletException;
> > > > > import javax.servlet.http.HttpServlet;
> > > > > import javax.servlet.http.HttpServletRequest;
> > > > > import javax.servlet.http.HttpServletResponse;
> > > > >
> > > > > public class HelloWorldServlet extends HttpServlet {
> > > > >     @Override
> > > > >     protected void doGet(HttpServletRequest request, HttpServletResponse response)
> > > > >       throws ServletException, IOException {
> > > > >         response.setContentType("text/html;charset=utf-8");
> > > > >         OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
> > > > >         String s = "a".repeat(3096);
> > > > >         osw.write(s,0,s.length());
> > > > >         osw.flush();
> > > > >     }
> > > > > }
> > > > >
> > > > > Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
> > > > > c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
> > > > > values is observed while, with the patch, the extra latency disappears.
> > > > >
> > > > > # No patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.91ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.46ms
> > > > >  99.000%    1.74ms
> > > > >  99.900%    1.89ms
> > > > >  99.990%   41.95ms  <<< 40+ ms extra latency
> > > > >  99.999%   48.32ms
> > > > > 100.000%   48.96ms
> > > > >
> > > > > # With patch and tcp_autocorking=1
> > > > > ./wrk -t32 -c128 -d40s --latency -R10000  http://172.31.60.173:8080/hello/hello
> > > > >   ...
> > > > >  50.000%    0.90ms
> > > > >  75.000%    1.13ms
> > > > >  90.000%    1.45ms
> > > > >  99.000%    1.72ms
> > > > >  99.900%    1.83ms
> > > > >  99.990%    2.11ms  <<< no 40+ ms extra latency
> > > > >  99.999%    2.53ms
> > > > > 100.000%    2.62ms
> > > > >
> > > > > Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
> > > > > affected by this issue and the patch doesn't introduce any additional
> > > > > delay.
> > > > >
> > > > > Fixes: a181ceb501b3 ("tcp: autocork should not hold first packet in write
> > > > > queue")
> > > >
> > > > Please read carefully the process documentation under
> > > > Documentation/process/ and specifically the netdev specific bits:
> > > >
> > > > no resubmissions within the 24h grace period.
> > > >
> > > > Please double-check your patch with checkpatch for formal errors: the
> > > > fixes tag must not be split across multiple lines.
> > > >
> > > > And please do not sent new version in reply to a previous one: it will
> > > > confuse the bot.
> > > >
> > > > > Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index ff6838ca2e58..ab9e3922393c 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -726,6 +726,7 @@ void tcp_push(struct sock *sk, int flags, int mss_now,
> > > > >               /* It is possible TX completion already happened
> > > > >                * before we set TSQ_THROTTLED.
> > > > >                */
> > > > > +             smp_mb__after_atomic();
> > > >
> > > > Out of sheer ignorance I'm wondering if moving such barrier inside the
> > > > above 'if' just after 'set_bit' would suffice?
> > >
> > > According to the herd7 modeling tool, the answer is no for weak memory
> > > models.  If we put the smp_mb inside the if, it allows the machine to
> > > reorder the two reads to sk_wmem_alloc and we can get to the bad state
> > > this patch is fixing.  Placing it outside the if ensures
> > > the ordering between those two reads as well as ordering the write to the
> > > flags variable.
> >
> > For the records, I asked because reading the documentation  I assumed
> > that smp_mb__after_atomic() enforces the ordering WRT the previous RMW
> > operation (in this case 'set_bit'). Is this a wrong interpretation?
> >
> > Now I wonder if the patch is enough. Specifically, would it work
> > correctly when the TSQ_THROTTLED bit is already set and there is no RMW
> > operation in between the two refcount_read()?
> >
>
> I fat fingered the model in the previous reply, sorry for the confusion.
> It appears we can put the barrier inside the if check and be fine.  For
> the x86 case reads won't pass other reads according to my understanding
> of their memory model which makes this work as is without the barrier to
> begin with for when TSQ_THROTTLED is already
> set or not set.
>
> For arm64, we need to generate a dmb.ish instruction after the store.  Can
> run a test to make sure the model is correct and can move the barrier
> to inside the if after the atomic bitset.

I read again this code (in 'fixed' kernels) and could not convince
myself it is bug free/.

Without RMW with return value or barriers, this all boils to bare reads,
and all these reads could be ordered differently.

I am thinking of making the following change to be on the cautious side.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f23b97777ea5e93fc860d3f225e464f9376cfe09..2e768aab39128cc8aebc2dbd65406f3bccf449f7
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -722,12 +722,9 @@ void tcp_push(struct sock *sk, int flags, int mss_now,

        if (tcp_should_autocork(sk, skb, size_goal)) {

-               /* avoid atomic op if TSQ_THROTTLED bit is already set */
-               if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
+               if (!test_and_set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags))
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
-                       set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
-                       smp_mb__after_atomic();
-               }
+
                /* It is possible TX completion already happened
                 * before we set TSQ_THROTTLED.
                 */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] tcp: Add memory barrier to tcp_push()
  2024-04-17  7:33                           ` Eric Dumazet
@ 2024-04-22 16:16                             ` Salvatore Dipietro
  0 siblings, 0 replies; 28+ messages in thread
From: Salvatore Dipietro @ 2024-04-22 16:16 UTC (permalink / raw
  To: edumazet
  Cc: alisaidi, benh, blakgeof, davem, dipietro.salvatore, dipiets,
	dsahern, kuba, netdev, pabeni

We have tested the proposed solution and it doesn't introduce any regression in our benchmark. 
Moreover, looking to the assembly code, the `test_and_set_bit` generates an `ldsetal` and `dmb ish` instructions which are correct for the ARM architecture.
We do not see any concern with the proposed patch.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-04-22 16:17 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-08 18:20 [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set Salvatore Dipietro
2023-12-09 11:03 ` Eric Dumazet
2023-12-11 15:58   ` Hazem Mohamed Abuelfotoh
2023-12-12 10:41     ` Paolo Abeni
2023-12-12 11:29       ` Eric Dumazet
2023-12-12 10:57 ` Paolo Abeni
2023-12-13 14:10   ` Eric Dumazet
2023-12-13 19:00     ` Jakub Kicinski
2023-12-13 19:29       ` Eric Dumazet
2023-12-13 21:30     ` Salvatore Dipietro
2023-12-14  8:40       ` Eric Dumazet
2023-12-14  9:05         ` Eric Dumazet
2023-12-14 14:08       ` Eric Dumazet
2023-12-14 15:52         ` Geoff Blake
2023-12-14 16:07           ` Eric Dumazet
2024-01-17 21:26             ` [PATCH v2] tcp: Add memory barrier to tcp_push() Salvatore Dipietro
2024-01-17 21:58               ` Eric Dumazet
2024-01-17 23:16                 ` [PATCH v3] " Salvatore Dipietro
2024-01-18 10:38                   ` Paolo Abeni
2024-01-18 10:42                     ` Eric Dumazet
2024-01-18 16:40                       ` Jakub Kicinski
2024-01-18 17:46                     ` Geoff Blake
2024-01-18 18:30                       ` Paolo Abeni
2024-01-18 19:30                         ` Geoff Blake
2024-01-19 13:33                           ` Eric Dumazet
2024-04-17  7:33                           ` Eric Dumazet
2024-04-22 16:16                             ` Salvatore Dipietro
2023-12-13 10:40 ` [PATCH] tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.