* [PATCH v2 0/3] keepalive bugfixes
@ 2023-04-24 23:22 Uday Shankar
  2023-04-24 23:22 ` [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Uday Shankar @ 2023-04-24 23:22 UTC
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Sagi Grimberg,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

While reviewing the Linux KATO implementation in an attempt to better
understand the current NVMe Keep Alive specification, we found a
few issues in the host implementation that could contribute to spurious
Keep Alive timeouts being detected by controllers.

Changes since v1 (https://lore.kernel.org/linux-nvme/20230417225558.2890062-1-ushankar@purestorage.com/):
- Patch 2: fix indentation, set start_time even when stats disabled

Uday Shankar (3):
  nvme: double KA polling frequency to avoid KATO with TBKAS on
  nvme: check IO start time when deciding to defer KA
  nvme: improve handling of long keep alives

 drivers/nvme/host/core.c | 38 +++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  4 ++--
 2 files changed, 37 insertions(+), 5 deletions(-)

-- 
2.25.1




* [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on
  2023-04-24 23:22 [PATCH v2 0/3] keepalive bugfixes Uday Shankar
@ 2023-04-24 23:22 ` Uday Shankar
  2023-05-18  8:45   ` Sagi Grimberg
  2023-04-24 23:22 ` [PATCH v2 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Uday Shankar @ 2023-04-24 23:22 UTC
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Sagi Grimberg,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

With TBKAS on, the completion of one command can defer sending a
keep alive for up to twice the delay between successive runs of
nvme_keep_alive_work. The current delay of KATO / 2 thus makes it
possible for one command to defer sending a keep alive for up to
KATO, which can result in the controller detecting a keep alive
timeout. The following trace demonstrates the issue, taking KATO = 8
for simplicity:

1. t = 0: run nvme_keep_alive_work, no keep-alive sent
2. t = ε: I/O completion seen, set comp_seen = true
3. t = 4: run nvme_keep_alive_work, see comp_seen == true,
          skip sending keep-alive, set comp_seen = false
4. t = 8: run nvme_keep_alive_work, see comp_seen == false,
          send a keep-alive command.

Here, there is a delay of 8 - ε between the host receiving a command
completion and sending the next command. With ε small, the controller
is likely to detect a keep alive timeout.

Fix this by running nvme_keep_alive_work with a delay of KATO / 4
whenever TBKAS is on. Going through the above trace now gives us a
worst-case delay of 4 - ε, which is in line with the NVMe
specification's recommendation to send a command every KATO / 2.
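
For illustration, the worst-case arithmetic can be checked with a
minimal standalone C sketch (not part of the patch; plain integer
seconds stand in for jiffies):

#include <stdio.h>

/*
 * Illustrative only: with TBKAS on, a completion arriving just after
 * a run of nvme_keep_alive_work can defer the keep alive by the rest
 * of that polling period plus the entire next one, i.e. up to two
 * full periods.
 */
static int worst_case_gap(int kato, int divisor)
{
	return 2 * (kato / divisor);
}

int main(void)
{
	int kato = 8;

	/* Polling at KATO / 2: the worst case equals KATO itself. */
	printf("KATO/2 polling: worst-case gap = %d\n",
	       worst_case_gap(kato, 2));
	/* Polling at KATO / 4: the worst case drops to KATO / 2. */
	printf("KATO/4 polling: worst-case gap = %d\n",
	       worst_case_gap(kato, 4));
	return 0;
}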

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6c1e7d6709e0..1298c7b9bffb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1150,10 +1150,16 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
  * 
  *   The host should send Keep Alive commands at half of the Keep Alive Timeout
  *   accounting for transport roundtrip times [..].
+ * 
+ * When TBKAS is on, we need to run nvme_keep_alive_work at twice this
+ * frequency, as one command completion can postpone sending a keep alive
+ * command by up to twice the delay between runs.
  */
 static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 {
-	queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ / 2);
+	unsigned long delay = (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) ?
+		ctrl->kato * HZ / 4 : ctrl->kato * HZ / 2;
+	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
 static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
-- 
2.25.1




* [PATCH v2 2/3] nvme: check IO start time when deciding to defer KA
  2023-04-24 23:22 [PATCH v2 0/3] keepalive bugfixes Uday Shankar
  2023-04-24 23:22 ` [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
@ 2023-04-24 23:22 ` Uday Shankar
  2023-05-18  8:52   ` Sagi Grimberg
  2023-04-24 23:22 ` [PATCH v2 3/3] nvme: improve handling of long keep alives Uday Shankar
  2023-05-17 22:08 ` [PATCH v2 0/3] keepalive bugfixes Uday Shankar
  3 siblings, 1 reply; 9+ messages in thread
From: Uday Shankar @ 2023-04-24 23:22 UTC
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Sagi Grimberg,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

When a command completes, we set a flag which will skip sending a
keep alive at the next run of nvme_keep_alive_work when TBKAS is on.
However, if the command was submitted long ago, the controller may
have restarted its keep alive timer (as a result of receiving the
command) equally long ago. The following trace
demonstrates the issue, assuming TBKAS is on and KATO = 8 for
simplicity:

1. t = 0: submit I/O commands A, B, C, D, E
2. t = 0.5: commands A, B, C, D, E reach controller, restart its keep
            alive timer
3. t = 1: A completes
4. t = 2: run nvme_keep_alive_work, see recent completion, do nothing
5. t = 3: B completes
6. t = 4: run nvme_keep_alive_work, see recent completion, do nothing
7. t = 5: C completes
8. t = 6: run nvme_keep_alive_work, see recent completion, do nothing
9. t = 7: D completes
10. t = 8: run nvme_keep_alive_work, see recent completion, do nothing
11. t = 9: E completes

At this point, 8.5 seconds have passed without restarting the
controller's keep alive timer, so the controller will detect a keep
alive timeout.

Fix this by checking the IO start time when deciding to defer sending a
keep alive command. Only set comp_seen if the command started after the
most recent run of nvme_keep_alive_work. With this change, the
completions of B, C, and D will not set comp_seen and the run of
nvme_keep_alive_work at t = 4 will send a keep alive.
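
The rule can be sketched as standalone C (illustrative only; the
patch operates on jiffies and struct nvme_ctrl state):

#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative only: a completion counts as recent traffic, and may
 * defer the next keep alive, only if the command was submitted at or
 * after the most recent run of nvme_keep_alive_work.
 */
static bool defers_keep_alive(unsigned long start_time,
			      unsigned long last_check_time)
{
	return start_time >= last_check_time;
}

int main(void)
{
	/* Replaying the trace: A..E are all submitted at t = 0. */
	printf("A completes at t=1 (last check t=0): defer=%d\n",
	       defers_keep_alive(0, 0));	/* 1: defers */
	printf("B completes at t=3 (last check t=2): defer=%d\n",
	       defers_keep_alive(0, 2));	/* 0: does not defer */
	return 0;
}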

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c | 14 +++++++++++++-
 drivers/nvme/host/nvme.h |  4 ++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1298c7b9bffb..0215dcf643ef 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -397,7 +397,14 @@ void nvme_complete_rq(struct request *req)
 	trace_nvme_complete_rq(req);
 	nvme_cleanup_cmd(req);
 
-	if (ctrl->kas)
+	/*
+	 * Completions of long-running commands should not be able to
+	 * defer sending of periodic keep alives, since the controller
+	 * may have completed processing such commands a long time ago
+	 * (arbitrarily close to command submission time).
+	 */
+	if (ctrl->kas && !ctrl->comp_seen &&
+	    nvme_req(req)->start_time >= ctrl->ka_last_check_time)
 		ctrl->comp_seen = true;
 
 	switch (nvme_decide_disposition(req)) {
@@ -1178,6 +1185,8 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		return RQ_END_IO_NONE;
 	}
 
+	WRITE_ONCE(ctrl->ka_last_check_time, jiffies);
+	smp_wmb();
 	ctrl->comp_seen = false;
 	spin_lock_irqsave(&ctrl->lock, flags);
 	if (ctrl->state == NVME_CTRL_LIVE ||
@@ -1196,6 +1205,9 @@ static void nvme_keep_alive_work(struct work_struct *work)
 	bool comp_seen = ctrl->comp_seen;
 	struct request *rq;
 
+	WRITE_ONCE(ctrl->ka_last_check_time, jiffies);
+	smp_wmb();
+
 	if ((ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) && comp_seen) {
 		dev_dbg(ctrl->device,
 			"reschedule traffic based keep-alive timer\n");
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1..b7f9a93b2d33 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -162,9 +162,7 @@ struct nvme_request {
 	u8			retries;
 	u8			flags;
 	u16			status;
-#ifdef CONFIG_NVME_MULTIPATH
 	unsigned long		start_time;
-#endif
 	struct nvme_ctrl	*ctrl;
 };
 
@@ -323,6 +321,7 @@ struct nvme_ctrl {
 	struct delayed_work ka_work;
 	struct delayed_work failfast_work;
 	struct nvme_command ka_cmd;
+	unsigned long ka_last_check_time;
 	struct work_struct fw_act_work;
 	unsigned long events;
 
@@ -1028,6 +1027,7 @@ static inline void nvme_start_request(struct request *rq)
 {
 	if (rq->cmd_flags & REQ_NVME_MPATH)
 		nvme_mpath_start_request(rq);
+	nvme_req(rq)->start_time = jiffies;
 	blk_mq_start_request(rq);
 }
 
-- 
2.25.1




* [PATCH v2 3/3] nvme: improve handling of long keep alives
  2023-04-24 23:22 [PATCH v2 0/3] keepalive bugfixes Uday Shankar
  2023-04-24 23:22 ` [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
  2023-04-24 23:22 ` [PATCH v2 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
@ 2023-04-24 23:22 ` Uday Shankar
  2023-05-18  9:01   ` Sagi Grimberg
  2023-05-17 22:08 ` [PATCH v2 0/3] keepalive bugfixes Uday Shankar
  3 siblings, 1 reply; 9+ messages in thread
From: Uday Shankar @ 2023-04-24 23:22 UTC
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Sagi Grimberg,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: Uday Shankar, linux-nvme

Upon keep alive completion, nvme_keep_alive_work is scheduled with the
same delay every time. If keep alive commands are completing slowly,
this may cause a keep alive timeout. The following trace illustrates the
issue, taking KATO = 8 and TBKAS off for simplicity:

1. t = 0: run nvme_keep_alive_work, send keep alive
2. t = ε: keep alive reaches controller, controller restarts its keep
          alive timer
3. t = 4: host receives keep alive completion, schedules
          nvme_keep_alive_work with delay 4
4. t = 8: run nvme_keep_alive_work, send keep alive

Here, a keep alive with an RTT of 4 causes a gap of at least 8 - ε
between successive keep alives arriving at the controller. With ε
small, the controller is likely to detect a keep alive timeout.

Fix this by calculating the RTT of the keep alive command, and adjusting
the scheduling delay of the next keep alive work accordingly.
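
The adjustment amounts to the following, sketched as standalone C
(illustrative only; in the patch, the period and RTT are in jiffies):

#include <stdio.h>

/*
 * Illustrative only: subtract the observed keep alive RTT from the
 * nominal polling period so keep alives keep arriving at the
 * controller on the desired cadence; clamp at zero when the RTT
 * exceeds the period.
 */
static unsigned long next_ka_delay(unsigned long period, unsigned long rtt)
{
	return rtt <= period ? period - rtt : 0;
}

int main(void)
{
	/* KATO = 8 with TBKAS off gives a nominal period of 4. */
	printf("rtt=4: next delay = %lu\n", next_ka_delay(4, 4)); /* 0 */
	printf("rtt=1: next delay = %lu\n", next_ka_delay(4, 1)); /* 3 */
	return 0;
}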

Reported-by: Costa Sapuntzakis <costa@purestorage.com>
Reported-by: Randy Jennings <randyj@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0215dcf643ef..2028fa050bcf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1162,10 +1162,15 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
  * frequency, as one command completion can postpone sending a keep alive
  * command by up to twice the delay between runs.
  */
+static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
+{
+	return (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) ?
+		(ctrl->kato * HZ / 4) : (ctrl->kato * HZ / 2);
+}
+
 static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 {
-	unsigned long delay = (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) ?
-		ctrl->kato * HZ / 4 : ctrl->kato * HZ / 2;
+	unsigned long delay = nvme_keep_alive_work_period(ctrl);
 	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 }
 
@@ -1175,6 +1180,15 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 	struct nvme_ctrl *ctrl = rq->end_io_data;
 	unsigned long flags;
 	bool startka = false;
+	unsigned long rtt = jiffies - nvme_req(rq)->start_time;
+	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+
+	/* Subtract off the keepalive RTT so nvme_keep_alive_work runs
+	 * at the desired frequency. */
+	if (rtt <= delay)
+		delay -= rtt;
+	else
+		delay = 0;
 
 	blk_mq_free_request(rq);
 
@@ -1194,7 +1208,7 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		startka = true;
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	if (startka)
-		nvme_queue_keep_alive_work(ctrl);
+		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
 	return RQ_END_IO_NONE;
 }
 
-- 
2.25.1




* Re: [PATCH v2 0/3] keepalive bugfixes
  2023-04-24 23:22 [PATCH v2 0/3] keepalive bugfixes Uday Shankar
                   ` (2 preceding siblings ...)
  2023-04-24 23:22 ` [PATCH v2 3/3] nvme: improve handling of long keep alives Uday Shankar
@ 2023-05-17 22:08 ` Uday Shankar
  3 siblings, 0 replies; 9+ messages in thread
From: Uday Shankar @ 2023-05-17 22:08 UTC
  To: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Sagi Grimberg,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: linux-nvme

Ping?



* Re: [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on
  2023-04-24 23:22 ` [PATCH v2 1/3] nvme: double KA polling frequency to avoid KATO with TBKAS on Uday Shankar
@ 2023-05-18  8:45   ` Sagi Grimberg
  0 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2023-05-18  8:45 UTC
  To: Uday Shankar, Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



* Re: [PATCH v2 2/3] nvme: check IO start time when deciding to defer KA
  2023-04-24 23:22 ` [PATCH v2 2/3] nvme: check IO start time when deciding to defer KA Uday Shankar
@ 2023-05-18  8:52   ` Sagi Grimberg
  0 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2023-05-18  8:52 UTC
  To: Uday Shankar, Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: linux-nvme

This makes sense to me.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



* Re: [PATCH v2 3/3] nvme: improve handling of long keep alives
  2023-04-24 23:22 ` [PATCH v2 3/3] nvme: improve handling of long keep alives Uday Shankar
@ 2023-05-18  9:01   ` Sagi Grimberg
  2023-05-19 17:42     ` Uday Shankar
  0 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2023-05-18  9:01 UTC
  To: Uday Shankar, Costa Sapuntzakis, Randy Jennings, Hannes Reinecke,
	Keith Busch, Christoph Hellwig, Jens Axboe
  Cc: linux-nvme


> Upon keep alive completion, nvme_keep_alive_work is scheduled with the
> same delay every time. If keep alive commands are completing slowly,
> this may cause a keep alive timeout. The following trace illustrates the
> issue, taking KATO = 8 and TBKAS off for simplicity:
> 
> 1. t = 0: run nvme_keep_alive_work, send keep alive
> 2. t = ε: keep alive reaches controller, controller restarts its keep
>            alive timer
> 3. t = 4: host receives keep alive completion, schedules
>            nvme_keep_alive_work with delay 4
> 4. t = 8: run nvme_keep_alive_work, send keep alive
> 
> Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε
> between the controller receiving successive keep alives. With ε small,
> the controller is likely to detect a keep alive timeout.
> 
> Fix this by calculating the RTT of the keep alive command, and adjusting
> the scheduling delay of the next keep alive work accordingly.
> 
> Reported-by: Costa Sapuntzakis <costa@purestorage.com>
> Reported-by: Randy Jennings <randyj@purestorage.com>
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/core.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0215dcf643ef..2028fa050bcf 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1162,10 +1162,15 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
>    * frequency, as one command completion can postpone sending a keep alive
>    * command by up to twice the delay between runs.
>    */
> +static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
> +{
> +	return (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) ?
> +		(ctrl->kato * HZ / 4) : (ctrl->kato * HZ / 2);
> +}
> +
>   static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
>   {
> -	unsigned long delay = (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS) ?
> -		ctrl->kato * HZ / 4 : ctrl->kato * HZ / 2;
> +	unsigned long delay = nvme_keep_alive_work_period(ctrl);
>   	queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);
>   }
>   
> @@ -1175,6 +1180,15 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>   	struct nvme_ctrl *ctrl = rq->end_io_data;
>   	unsigned long flags;
>   	bool startka = false;
> +	unsigned long rtt = jiffies - nvme_req(rq)->start_time;
> +	unsigned long delay = nvme_keep_alive_work_period(ctrl);
> +
> +	/* Subtract off the keepalive RTT so nvme_keep_alive_work runs
> +	 * at the desired frequency. */
> +	if (rtt <= delay)
> +		delay -= rtt;
> +	else
> +		delay = 0;
This should probably log a warning or something... The controller has
such a slow response time to keep-alive that it exceeds the delay...



* Re: [PATCH v2 3/3] nvme: improve handling of long keep alives
  2023-05-18  9:01   ` Sagi Grimberg
@ 2023-05-19 17:42     ` Uday Shankar
  0 siblings, 0 replies; 9+ messages in thread
From: Uday Shankar @ 2023-05-19 17:42 UTC
  To: Sagi Grimberg
  Cc: Costa Sapuntzakis, Randy Jennings, Hannes Reinecke, Keith Busch,
	Christoph Hellwig, Jens Axboe, linux-nvme

On Thu, May 18, 2023 at 12:01:33PM +0300, Sagi Grimberg wrote:
> > +	/* Subtract off the keepalive RTT so nvme_keep_alive_work runs
> > +	 * at the desired frequency. */
> > +	if (rtt <= delay)
> > +		delay -= rtt;
> > +	else
> > +		delay = 0;
> This should probably log a warning or something... The controller has
> such a slow response time to keep-alive that it exceeds the delay...

Good idea. Added it in v3.
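
For illustration, such a warning might take roughly the following
form (a hypothetical sketch in the context of the patch above, not
the actual v3 code):

	/* Hypothetical: warn when the RTT consumed the whole period. */
	if (delay == 0)
		dev_warn(ctrl->device,
			 "long keepalive RTT (%u ms)\n",
			 jiffies_to_msecs(rtt));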


