Linux-Block Archive mirror
* [PATCH v2 0/2] Fix failover to non integrity NVMe path
@ 2023-04-24 22:54 Max Gurtovoy
  2023-04-24 22:54 ` [PATCH v2 1/2] block: bio-integrity: export bio_integrity_free func Max Gurtovoy
  2023-04-24 22:54 ` [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns Max Gurtovoy
From: Max Gurtovoy @ 2023-04-24 22:54 UTC
  To: martin.petersen, hch, sagi, linux-nvme
  Cc: hare, kbusch, axboe, linux-block, oren, oevron, israelr,
	Max Gurtovoy


Hi Christoph/Sagi/Martin,

We encountered a crash while testing failover between NVMe-oF/RDMA
paths to a target that exposes a namespace with metadata. The scenario
is the following: configure one initiator/host path on a PI offload
capable port (e.g. a ConnectX-5 device) and a second initiator/host
path on a non PI offload capable port (e.g. a ConnectX-3 device).

On failover, the original rq/bio is still protected with an integrity
context, but the failover port is not integrity capable and did not
allocate metadata resources for the request. Thus we get a NULL
dereference when blk_integrity_rq(rq) returns true while
req->metadata_sgl is NULL.
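
A minimal sketch of the failing pattern (simplified from
drivers/nvme/host/rdma.c, not the exact code):

	if (blk_integrity_rq(rq)) {
		/* metadata_sgl was never allocated on the non-PI path,
		 * so this stores through a NULL pointer */
		req->metadata_sgl->sg_table.sgl =
			(struct scatterlist *)(req->metadata_sgl + 1);
		...
	}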

Below is a snippet of the trace:

[Tue Feb 14 18:48:25 2023] mlx5_core 0000:02:00.0 ens2f0np0: Link down
[Tue Feb 14 18:48:32 2023] nvme nvme0: starting error recovery
[Tue Feb 14 18:48:32 2023] BUG: kernel NULL pointer dereference, address: 0000000000000010
[Tue Feb 14 18:48:32 2023] #PF: supervisor read access in kernel mode
[Tue Feb 14 18:48:32 2023] #PF: error_code(0x0000) - not-present page
[Tue Feb 14 18:48:32 2023] PGD 0 P4D 0 
[Tue Feb 14 18:48:32 2023] Oops: 0000 [#1] PREEMPT SMP PTI
[Tue Feb 14 18:48:32 2023] CPU: 17 PID: 518 Comm: kworker/17:1H Tainted: G S          E      6.2.0-rc4+ #224
[Tue Feb 14 18:48:32 2023] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 2.0 12/17/2015
[Tue Feb 14 18:48:32 2023] Workqueue: kblockd nvme_requeue_work [nvme_core]
...
...

[Tue Feb 14 18:48:32 2023] Call Trace:
[Tue Feb 14 18:48:32 2023]  <TASK>
[Tue Feb 14 18:48:32 2023]  nvme_rdma_queue_rq+0x194/0xa20 [nvme_rdma]
[Tue Feb 14 18:48:32 2023]  ? __blk_mq_try_issue_directly+0x13f/0x1a0
[Tue Feb 14 18:48:32 2023]  __blk_mq_try_issue_directly+0x13f/0x1a0
[Tue Feb 14 18:48:32 2023]  blk_mq_try_issue_directly+0x15/0x50
[Tue Feb 14 18:48:32 2023]  blk_mq_submit_bio+0x539/0x580
[Tue Feb 14 18:48:32 2023]  __submit_bio+0xfa/0x170
[Tue Feb 14 18:48:32 2023]  submit_bio_noacct_nocheck+0xe1/0x2a0
[Tue Feb 14 18:48:32 2023]  nvme_requeue_work+0x4e/0x60 [nvme_core]

To solve this, we expose an API to release the integrity context from
a bio and call it for each bio during failover. Another way to solve
it would be to free the context during bio_integrity_prep().

I chose the first option because I thought it better to avoid that
check in the fast path, but the price is exporting a new API from the
block layer.
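
For completeness, option two would have looked roughly like this
sketch (not a proposed patch) at the top of bio_integrity_prep():

	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);

	if (!bi) {
		/* the new path has no integrity profile: drop any
		 * stale context left over from a previous path */
		if (bio_integrity(bio))
			bio_integrity_free(bio);
		return true;
	}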

In V1 there were some doubts regarding this setup configuration, but I
believe that we can and should support it. There are no provisions in
the specification that prohibit it. Coupling multipathing with
integrity/metadata appears to be a matter of implementation specifics
rather than a requirement.

If the host path lacks the ability to add protection information, it
is acceptable to ask the controller to do so instead by setting the
PRACT bit to 1 when the namespace is formatted with protection
information.
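
This is what the core driver already does for non-integrity requests
on a PI-formatted namespace (abridged from nvme_setup_rw() in
drivers/nvme/host/core.c):

	if (!blk_integrity_rq(req)) {
		/* no host-side PI buffer: have the controller generate
		 * and strip the protection information itself */
		control |= NVME_RW_PRINFO_PRACT;
	}
	...
	cmnd->rw.control = cpu_to_le16(control);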

Changes:
V2:
 - Updated cover letter with more motivation
 - Fixed build issue reported by the kernel test robot
 - Did not add Sagi's Reviewed-by tag for patch 2/2 since I think he
   is still not sure about the series

Max Gurtovoy (2):
  block: bio-integrity: export bio_integrity_free func
  nvme-multipath: fix path failover for integrity ns

 block/bio-integrity.c         | 1 +
 block/blk.h                   | 4 ----
 drivers/nvme/host/multipath.c | 9 +++++++++
 include/linux/bio.h           | 6 ++++++
 4 files changed, 16 insertions(+), 4 deletions(-)

-- 
2.18.1



* [PATCH v2 1/2] block: bio-integrity: export bio_integrity_free func
  2023-04-24 22:54 [PATCH v2 0/2] Fix failover to non integrity NVMe path Max Gurtovoy
@ 2023-04-24 22:54 ` Max Gurtovoy
  2023-04-24 22:54 ` [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns Max Gurtovoy
From: Max Gurtovoy @ 2023-04-24 22:54 UTC
  To: martin.petersen, hch, sagi, linux-nvme
  Cc: hare, kbusch, axboe, linux-block, oren, oevron, israelr,
	Max Gurtovoy

This function is the complement of bio_integrity_alloc(). Export it
for users that would like to free the integrity context explicitly.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 block/bio-integrity.c | 1 +
 block/blk.h           | 4 ----
 include/linux/bio.h   | 6 ++++++
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 4533eb491661..276ca86fc1d3 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -110,6 +110,7 @@ void bio_integrity_free(struct bio *bio)
 	bio->bi_integrity = NULL;
 	bio->bi_opf &= ~REQ_INTEGRITY;
 }
+EXPORT_SYMBOL(bio_integrity_free);
 
 /**
  * bio_integrity_add_page - Attach integrity metadata
diff --git a/block/blk.h b/block/blk.h
index cc4e8873dfde..644e5f30a983 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -179,7 +179,6 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
-void bio_integrity_free(struct bio *bio);
 static inline bool bio_integrity_endio(struct bio *bio)
 {
 	if (bio_integrity(bio))
@@ -245,9 +244,6 @@ static inline bool bio_integrity_endio(struct bio *bio)
 {
 	return true;
 }
-static inline void bio_integrity_free(struct bio *bio)
-{
-}
 static inline int blk_integrity_add(struct gendisk *disk)
 {
 	return 0;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d766be7152e1..64aabdacac24 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -700,6 +700,7 @@ static inline bool bioset_initialized(struct bio_set *bs)
 		bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
 
 extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
+extern void bio_integrity_free(struct bio *);
 extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern bool bio_integrity_prep(struct bio *);
 extern void bio_integrity_advance(struct bio *, unsigned int);
@@ -764,6 +765,11 @@ static inline void *bio_integrity_alloc(struct bio * bio, gfp_t gfp,
 	return ERR_PTR(-EINVAL);
 }
 
+static inline void bio_integrity_free(struct bio *bio)
+{
+	return;
+}
+
 static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 					unsigned int len, unsigned int offset)
 {
-- 
2.18.1



* [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns
  2023-04-24 22:54 [PATCH v2 0/2] Fix failover to non integrity NVMe path Max Gurtovoy
  2023-04-24 22:54 ` [PATCH v2 1/2] block: bio-integrity: export bio_integrity_free func Max Gurtovoy
@ 2023-04-24 22:54 ` Max Gurtovoy
  2023-04-25  2:12   ` Martin K. Petersen
From: Max Gurtovoy @ 2023-04-24 22:54 UTC
  To: martin.petersen, hch, sagi, linux-nvme
  Cc: hare, kbusch, axboe, linux-block, oren, oevron, israelr,
	Max Gurtovoy

If the integrity capabilities of the failed path and the failover
path don't match, we may run into a NULL dereference. Free the
integrity context during path failover and let the block layer prepare
it again, if needed, during submit_bio().

Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Tested-by: Ori Evron <oevron@nvidia.com>
Signed-off-by: Ori Evron <oevron@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/nvme/host/multipath.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9171452e2f6d..f439916f4447 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,7 @@
 #include <linux/backing-dev.h>
 #include <linux/moduleparam.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-integrity.h>
 #include <trace/events/block.h>
 #include "nvme.h"
 
@@ -106,6 +107,14 @@ void nvme_failover_req(struct request *req)
 			bio->bi_opf &= ~REQ_POLLED;
 			bio->bi_cookie = BLK_QC_T_NONE;
 		}
+		/*
+		 * If the failover path is not integrity capable, the bio
+		 * must not carry an integrity context.
+		 * If the failover path is integrity capable, the bio will
+		 * be prepared for integrity again.
+		 */
+		if (bio_integrity(bio))
+			bio_integrity_free(bio);
 	}
 	blk_steal_bios(&ns->head->requeue_list, req);
 	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
-- 
2.18.1



* Re: [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns
  2023-04-24 22:54 ` [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns Max Gurtovoy
@ 2023-04-25  2:12   ` Martin K. Petersen
  2023-04-25 10:24     ` Max Gurtovoy
From: Martin K. Petersen @ 2023-04-25  2:12 UTC
  To: Max Gurtovoy
  Cc: martin.petersen, hch, sagi, linux-nvme, hare, kbusch, axboe,
	linux-block, oren, oevron, israelr


Hi Max!

> If the integrity capabilities of the failed path and the failover
> path don't match, we may run into a NULL dereference. Free the
> integrity context during path failover and let the block layer prepare
> it again, if needed, during submit_bio().

This assumes that the protection information is just an ephemeral
checksum. However, that is not always the case. The application may
store values in the application or storage tags which must be returned
on a subsequent read.

In addition, in some overseas markets (financial, government), PI is a
regulatory requirement. It would be really bad for us to expose a
device claiming PI support and then have it turn out that the
protection isn't actually always active.

DM multipathing doesn't allow mismatched integrity profiles. I don't
think NVMe should either.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns
  2023-04-25  2:12   ` Martin K. Petersen
@ 2023-04-25 10:24     ` Max Gurtovoy
  2023-04-25 22:39       ` Keith Busch
  2023-04-26  1:22       ` Martin K. Petersen
From: Max Gurtovoy @ 2023-04-25 10:24 UTC
  To: Martin K. Petersen
  Cc: hch, sagi, linux-nvme, hare, kbusch, axboe, linux-block, oren,
	oevron, israelr



On 25/04/2023 5:12, Martin K. Petersen wrote:
> 
> Hi Max!
> 
>> If the integrity capabilities of the failed path and the failover
>> path don't match, we may run into a NULL dereference. Free the
>> integrity context during path failover and let the block layer prepare
>> it again, if needed, during submit_bio().
> 
> This assumes that the protection information is just an ephemeral
> checksum. However, that is not always the case. The application may
> store values in the application or storage tags which must be returned
> on a subsequent read.

Interesting.

Maybe you can point me to the API that allows applications to store an
application tag in the PI field?

I see that app_tag is 0 in Linux and we don't set nvme_cmd->apptag to
a non-zero value.
It's been a while since I worked on this, so I might be wrong here :).

I've noticed that t10_pi_generate() and ext_pi_crc64_generate() set it
to 0 as well.
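
For reference, the relevant lines (abridged from block/t10-pi.c; the
exact signature varies a bit between kernel versions):

	for (i = 0 ; i < iter->data_size ; i += iter->interval) {
		struct t10_pi_tuple *pi = iter->prot_buf;

		pi->guard_tag = fn(0, iter->data_buf, iter->interval);
		pi->app_tag = 0;	/* never a caller-supplied value */
		...
	}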

The way I see it now, and I might be wrong, is that the Linux kernel
does not support applications storing apptag values unless they use
some passthrough command.

> 
> In addition, in some overseas markets (financial, government), PI is a
> regulatory requirement. It would be really bad for us to expose a
> device claiming PI support and then have it turn out that the
> protection isn't actually always active.
> 
> DM multipathing doesn't allow mismatched integrity profiles. I don't
> think NVMe should either.
> 

AFAIU, DM multipath is not a specification but a Linux implementation.
NVMe multipathing follows a standard.


* Re: [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns
  2023-04-25 10:24     ` Max Gurtovoy
@ 2023-04-25 22:39       ` Keith Busch
  2023-04-26  1:22       ` Martin K. Petersen
From: Keith Busch @ 2023-04-25 22:39 UTC
  To: Max Gurtovoy
  Cc: Martin K. Petersen, hch, sagi, linux-nvme, hare, axboe,
	linux-block, oren, oevron, israelr

On Tue, Apr 25, 2023 at 01:24:31PM +0300, Max Gurtovoy wrote:
> 
> 
> On 25/04/2023 5:12, Martin K. Petersen wrote:
> > 
> > Hi Max!
> > 
> > > If the integrity capabilities of the failed path and the failover
> > > path don't match, we may run into a NULL dereference. Free the
> > > integrity context during path failover and let the block layer
> > > prepare it again, if needed, during submit_bio().
> > 
> > This assumes that the protection information is just an ephemeral
> > checksum. However, that is not always the case. The application may
> > store values in the application or storage tags which must be returned
> > on a subsequent read.
> 
> Interesting.
> 
> Maybe you can point me to the API that allows applications to store
> an application tag in the PI field?

I think Martin might be considering kernel modules as apps, since they
can access the integrity payload without a problem. I know of at least
one out-of-tree module (e.g. OpenCAS) that utilizes the apptag (not
that upstream should necessarily care about such modules..).
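
(A minimal sketch of that in-kernel route, for reference; prot_page
and prot_len here are placeholders for a caller-built buffer of
t10_pi_tuple entries with arbitrary app tags:)

	struct bio_integrity_payload *bip;

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip))
		return PTR_ERR(bip);
	bio_integrity_add_page(bio, prot_page, prot_len, 0);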

Outside of that, passthrough ioctls and io_uring_cmd can access the
fields. I don't think those interfaces apply to what you're changing,
though, since they bypass the submit_bio() interface.
 
> I see that app_tag is 0 in Linux and we don't set nvme_cmd->apptag to
> a non-zero value.
> It's been a while since I worked on this, so I might be wrong here :).
> 
> I've noticed that t10_pi_generate() and ext_pi_crc64_generate() set it
> to 0 as well.
> 
> The way I see it now, and I might be wrong, is that the Linux kernel
> does not support applications storing apptag values unless they use
> some passthrough command.

If we're considering only in-tree usage, I think you're correct.

> > In addition, in some overseas markets (financial, government), PI is a
> > regulatory requirement. It would be really bad for us to expose a
> > device claiming PI support and then have it turn out that the
> > protection isn't actually always active.
> > 
> > DM multipathing doesn't allow mismatched integrity profiles. I don't
> > think NVMe should either.
> > 
> 
> AFAIU, DM multipath is not a specification but a Linux implementation.
> NVMe multipathing follows a standard.

If we're talking about any of the nvme passthrough interfaces, which
format will user space need to construct its command for? This seems
confusing, since user space doesn't have direct control over which path
the command will be dispatched on, so it won't know which format it
needs to cater to, nor will it have deterministic behavior.


* Re: [PATCH v2 2/2] nvme-multipath: fix path failover for integrity ns
  2023-04-25 10:24     ` Max Gurtovoy
  2023-04-25 22:39       ` Keith Busch
@ 2023-04-26  1:22       ` Martin K. Petersen
From: Martin K. Petersen @ 2023-04-26  1:22 UTC
  To: Max Gurtovoy
  Cc: Martin K. Petersen, hch, sagi, linux-nvme, hare, kbusch, axboe,
	linux-block, oren, oevron, israelr


Hi Max!

> The way I see it now, and I might be wrong, is that the Linux kernel
> does not support applications storing apptag values unless they use
> some passthrough command.

As Keith mentioned, there are various ways.

But let's assume you are a multipathed storage device that says "Yes, I
support encryption". And then one of the paths doesn't and just lets
things go across the wire in plain text.

I think most people would agree that would be a *bad* thing.

The expectation is that when a device reports a capability as enabled,
that capability is actually effective. Protection information is no
different. The PI is part of the data and it needs to be validated by
the recipient.

Silently throwing away the protection information defies the very
premise of why data integrity protection was defined in the first place.

-- 
Martin K. Petersen	Oracle Linux Engineering

