* [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
@ 2024-03-15 23:10 Martin Wilck
From: Martin Wilck @ 2024-03-15 23:10 UTC
  To: Mike Snitzer, Mikulas Patocka, Alasdair G Kergon
  Cc: dm-devel, Martin Wilck, Ming Lei, Hannes Reinecke,
	Vasilis Liaskovitis

In a recent kernel dump analysis, we found that the kernel crashed because
dm_rq_target_io tio->ti was pointing to invalid memory in dm_end_request(),
in a situation where multipathd was doing map reloads because of a storage
failover. The map of the respective mapped_device had been replaced by a
different struct dm_table.

We observed this with a 5.3.18 distro kernel, but the code in question
hasn't changed much since then. Basically, we were only missing
b4459b11e840 ("dm rq: don't queue request to blk-mq during DM suspend"),
but even that commit doesn't guarantee that the race I'm thinking of
(see below) can't happen.

When a map is resumed after a table reload, the live table is swapped, and
the tio->ti member of any live request becomes stale.  __dm_resume() avoids
this by quiescing the queue and calling dm_wait_for_completion(), which
waits until blk_mq_queue_inflight() doesn't report any in-flight requests.

However, blk_mq_queue_inflight() counts only "started" requests. So, if a
request is dispatched before the queue was quiesced, but
dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this request
because of memory ordering effects, __dm_suspend() may finish successfully,
even though there is an active request, and the resume code path may
proceed through dm_swap_table() and dm_table_destroy(). The request that
sneaked through will then carry an invalid dm_target reference in
tio->ti. This is how I believe the above crash came to pass.

This patch attempts to fix the race by setting the "started" status of the
request inside an SRCU read-side critical section. In __dm_suspend(),
synchronize_srcu(&md->io_barrier) is called after setting
DMF_BLOCK_IO_FOR_SUSPEND. The read-side critical section is executed
completely, with all its memory effects, either before or after the
synchronize_srcu(). In the first case, dm_wait_for_completion() will
observe the in-flight status. Otherwise, dm_mq_queue_rq() will see
DMF_BLOCK_IO_FOR_SUSPEND and give up. In either case, there will be no
stale reference in tio->ti when the request finishes.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Vasilis Liaskovitis <vliaskovitis@suse.com>
---
 drivers/md/dm-rq.c | 51 ++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..7f6c83eb2e5b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -481,6 +481,10 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
 	struct dm_target *ti = md->immutable_target;
+	struct dm_table *map;
+	int srcu_idx;
+
+	map = dm_get_live_table(md, &srcu_idx);
 
 	/*
 	 * blk-mq's unquiesce may come from outside events, such as
@@ -488,25 +492,31 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * come during suspend, so simply ask for blk-mq to requeue it.
 	 */
 	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)))
-		return BLK_STS_RESOURCE;
+		goto out_put_table;
+
+	/*
+	 * If this request got dispatched before the queue was stopped in
+	 * __dm_suspend(), make sure dm_wait_for_completion() will
+	 * see this request as in flight. Otherwise this request may be actually
+	 * serviced while the table is swapped, and when it finishes, tio->ti
+	 * may reference a stale target.
+	 * As we're in a read-side critical section here, the synchronize_srcu()
+	 * in __dm_suspend() guarantees that if we haven't seen
+	 * DMF_BLOCK_IO_FOR_SUSPEND here, __dm_suspend() will observe the
+	 * in flight status.
+	 */
+	dm_start_request(md, rq);
 
 	if (unlikely(!ti)) {
-		int srcu_idx;
-		struct dm_table *map;
-
-		map = dm_get_live_table(md, &srcu_idx);
-		if (unlikely(!map)) {
-			dm_put_live_table(md, srcu_idx);
-			return BLK_STS_RESOURCE;
-		}
+		if (unlikely(!map))
+			goto out_put_table;
 		ti = dm_table_find_target(map, 0);
-		dm_put_live_table(md, srcu_idx);
 	}
+	dm_put_live_table(md, srcu_idx);
 
 	if (ti->type->busy && ti->type->busy(ti))
-		return BLK_STS_RESOURCE;
+		goto out_requeue;
 
-	dm_start_request(md, rq);
 
 	/* Init tio using md established in .init_request */
 	init_tio(tio, rq, md);
@@ -517,14 +527,19 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
-		/* Undo dm_start_request() before requeuing */
-		rq_end_stats(md, rq);
-		rq_completed(md);
-		return BLK_STS_RESOURCE;
-	}
+	if (map_request(tio) == DM_MAPIO_REQUEUE)
+		goto out_requeue;
 
 	return BLK_STS_OK;
+
+out_put_table:
+	dm_put_live_table(md, srcu_idx);
+
+out_requeue:
+	/* Undo dm_start_request() before requeuing */
+	rq_end_stats(md, rq);
+	rq_completed(md);
+	return BLK_STS_RESOURCE;
 }
 
 static const struct blk_mq_ops dm_mq_ops = {
-- 
2.43.2



* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Ming Lei @ 2024-03-19 13:04 UTC
  To: Martin Wilck
  Cc: Mike Snitzer, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Martin Wilck, Hannes Reinecke, Vasilis Liaskovitis, ming.lei

Hello Martin,

On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> In a recent kernel dump analysis, we found that the kernel crashed because
> dm_rq_target_io tio->ti was pointing to invalid memory in dm_end_request(),
> in a situation where multipathd was doing map reloads because of a storage
> failover. The map of the respective mapped_device had been replaced by a
> different struct dm_table.
> 
> We observed this with a 5.3.18 distro kernel, but the code in question
> hasn't changed much since then. Basically, we were only missing
> b4459b11e840 ("dm rq: don't queue request to blk-mq during DM suspend"),
> which doesn't guarantee that the race I'm thinking of (see below) can't
> happen.
> 
> When a map is resumed after a table reload, the live table is swapped, and
> the tio->ti member of any live request becomes stale.  __dm_resume() avoids
> this by quiescing the queue and calling dm_wait_for_completion(), which
> waits until blk_mq_queue_inflight() doesn't report any in-flight requests.
> 
> However, blk_mq_queue_inflight() counts only "started" requests. So, if a
> request is dispatched before the queue was quiesced, but
> dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this request
> because of memory ordering effects, __dm_suspend() may finish successfully,

Can you explain a bit about the exact memory ordering that causes
MQ_RQ_IN_FLIGHT not to be observed?

blk-mq quiesce includes synchronize_rcu(), which drains all in-flight
dispatch, so after blk_mq_quiesce_queue() returns,
if blk_mq_queue_inflight() returns 0, it does mean there aren't any
active in-flight requests.

If there is a bug in this pattern, I guess more drivers may have such
a 'risk'.

BTW, what are the underlying disks in your dm-mpath setting?

Thanks,
Ming



* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Martin Wilck @ 2024-03-19 15:41 UTC
  To: Ming Lei
  Cc: Mike Snitzer, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Hannes Reinecke, Vasilis Liaskovitis

Hello Ming,

On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> Hello Martin,
> 
> On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > In a recent kernel dump analysis, we found that the kernel crashed
> > because
> > dm_rq_target_io tio->ti was pointing to invalid memory in
> > dm_end_request(),
> > in a situation where multipathd was doing map reloads because of a
> > storage
> > failover. The map of the respective mapped_device had been replaced
> > by a
> > different struct dm_table.
> > 
> > We observed this with a 5.3.18 distro kernel, but the code in
> > question
> > hasn't changed much since then. Basically, we were only missing
> > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
> > suspend"),
> > which doesn't guarantee that the race I'm thinking of (see below)
> > can't
> > happen.
> > 
> > When a map is resumed after a table reload, the live table is
> > swapped, and
> > the tio->ti member of any live request becomes stale. 
> > __dm_resume() avoids
> > this by quiescing the queue and calling dm_wait_for_completion(),
> > which
> > waits until blk_mq_queue_inflight() doesn't report any in-flight
> > requests.
> > 
> > However, blk_mq_queue_inflight() counts only "started" requests.
> > So, if a
> > request is dispatched before the queue was quiesced, but
> > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > request
> > because of memory ordering effects, __dm_suspend() may finish
> > successfully,
> 
> Can you explain a bit about the exact memory order which causes
> MQ_RQ_IN_FLIGHT
> not observed?
> 
> blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> dispatch, so after blk_mq_quiesce_queue() returns,
> if blk_mq_queue_inflight() returns 0, it does mean there isn't any
> active inflight requests.
> 
> If there is bug in this pattern, I guess more drivers may have such
> 'risk'.

What we know for sure is that there was a bad dm_target reference in
(struct dm_rq_target_io *tio)->ti:

crash> struct -x dm_rq_target_io c00000245ca90128
struct dm_rq_target_io {
  md = 0xc0000031c66a4000,
  ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,

crash> struct -x dm_target  0xc0080000020d0080
struct dm_target struct: invalid kernel virtual address:
c0080000020d0080  type: "gdb_readmem_callback"

The question is how this could have come to pass. It can only happen
if tio->ti had been set before the map was reloaded. 
My theory is that the IO had been dispatched before the queue had been
quiesced, like this:

Task A                                 Task B
(dispatching IO)                       (executing a DM_SUSPEND ioctl to
                                       resume after DM_TABLE_LOAD)
                                       do_resume()
                                         dm_suspend()
                                           __dm_suspend()
dm_mq_queue_rq()                         
   struct dm_target *ti =                
       md->immutable_target;                  
                                              dm_stop_queue()
                                                 blk_mq_quiesce_queue()
       /* 
        * At this point, the queue is quiesced, but task A
        * has already entered dm_mq_queue_rq()
        */
                                              dm_wait_for_completion()
                                                 blk_mq_queue_inflight()
       /* 
        * blk_mq_queue_inflight() doesn't see Task A's 
        * request because it isn't started yet
        */
                                              set_bit(dmf_suspended_flag)
   dm_start_request(md, rq);             dm_swap_table()
                                            __bind()
                                              md->immutable_target = ...
                                         dm_table_destroy()
        /* the previous md->immutable_target is freed */
   init_tio(tio, rq, md);
        /* the stale ti pointer is assigned to tio->ti */
   tio->ti = ti;

dm_mq_queue_rq() contains no synchronization code if
md->immutable_target is set, so I think that this can happen, even
though it looks unlikely. With b4459b11e840 (which was not applied in
the customer kernel), there would be a
set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(),
but IMO the above would still be possible.
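To make the suspected interleaving concrete, it can be replayed step by step
in a toy userspace C model. All names are illustrative stand-ins for the
kernel functions in the diagram, and the explicit call ordering below takes
the place of the scheduler interleaving the two tasks.

```c
#include <assert.h>

/* toy replay of the suspected race; not the real kernel API */
struct toy_target { int alive; };

static struct toy_target old_tgt = { 1 }, new_tgt = { 1 };
static struct toy_target *immutable_target = &old_tgt;
static int started;	/* what blk_mq_queue_inflight() would count */

/* Task A, step 1: dm_mq_queue_rq() samples the target pointer */
struct toy_target *task_a_sample(void)
{
	return immutable_target;
}

/* Task B: __dm_suspend()/dm_swap_table() runs between A's two steps */
void task_b_suspend_and_swap(void)
{
	/* dm_wait_for_completion(): sees no started request, returns */
	assert(started == 0);
	immutable_target = &new_tgt;	/* __bind() */
	old_tgt.alive = 0;		/* dm_table_destroy(): old target "freed" */
}

/* Task A, step 2: dm_start_request() and the tio->ti assignment */
struct toy_target *task_a_start(struct toy_target *ti)
{
	started = 1;		/* dm_start_request(md, rq) */
	return ti;		/* tio->ti now holds the sampled pointer */
}
```

Running the three steps in this order leaves Task A holding a pointer to the
"freed" old target, which is the stale tio->ti reference from the crash.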

If this can't happen, I have no more ideas how the observed situation
came to pass. The customer who sent us the core claims to have seen
this multiple times already (but we have only this single core dump).

> BTW, what is the underlying disks in your dm-mpath setting?

It was ibmvfc.

Thanks for looking into this,
Martin


* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Mike Snitzer @ 2024-03-19 16:53 UTC
  To: Martin Wilck
  Cc: Ming Lei, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Hannes Reinecke, Vasilis Liaskovitis

On Tue, Mar 19 2024 at 11:41P -0400,
Martin Wilck <mwilck@suse.com> wrote:

> Hello Ming,
> 
> On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> > Hello Martin,
> > 
> > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > > In a recent kernel dump analysis, we found that the kernel crashed
> > > because
> > > dm_rq_target_io tio->ti was pointing to invalid memory in
> > > dm_end_request(),
> > > in a situation where multipathd was doing map reloads because of a
> > > storage
> > > failover. The map of the respective mapped_device had been replaced
> > > by a
> > > different struct dm_table.
> > > 
> > > We observed this with a 5.3.18 distro kernel, but the code in
> > > question
> > > hasn't changed much since then. Basically, we were only missing
> > > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
> > > suspend"),
> > > which doesn't guarantee that the race I'm thinking of (see below)
> > > can't
> > > happen.
> > > 
> > > When a map is resumed after a table reload, the live table is
> > > swapped, and
> > > the tio->ti member of any live request becomes stale. 
> > > __dm_resume() avoids
> > > this by quiescing the queue and calling dm_wait_for_completion(),
> > > which
> > > waits until blk_mq_queue_inflight() doesn't report any in-flight
> > > requests.
> > > 
> > > However, blk_mq_queue_inflight() counts only "started" requests.
> > > So, if a
> > > request is dispatched before the queue was quiesced, but
> > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > > request
> > > because of memory ordering effects, __dm_suspend() may finish
> > > successfully,
> > 
> > Can you explain a bit about the exact memory order which causes
> > MQ_RQ_IN_FLIGHT
> > not observed?
> > 
> > blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> > dispatch, so after blk_mq_quiesce_queue() returns,
> > if blk_mq_queue_inflight() returns 0, it does mean there isn't any
> > active inflight requests.
> > 
> > If there is bug in this pattern, I guess more drivers may have such
> > 'risk'.
> 
> What we know for sure is that there was a bad dm_target reference in
> (struct dm_rq_target_io *tio)->ti:
> 
> crash> struct -x dm_rq_target_io c00000245ca90128
> struct dm_rq_target_io {
>   md = 0xc0000031c66a4000,
>   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> 
> crash> struct -x dm_target  0xc0080000020d0080
> struct dm_target struct: invalid kernel virtual address:
> c0080000020d0080  type: "gdb_readmem_callback"
> 
> The question is how this could have come to pass. It can only happen
> if tio->ti had been set before the map was reloaded. 
> My theory is that the IO had been dispatched before the queue had been
> quiesced, like this:
> 
> Task A                                 Task B
> (dispatching IO)                       (executing a DM_SUSPEND ioctl to
>                                        resume after DM_TABLE_LOAD)
>                                        do_resume()
>                                          dm_suspend()
>                                            __dm_suspend()
> dm_mq_queue_rq()                         
>    struct dm_target *ti =                
>        md->immutable_target;                  
>                                               dm_stop_queue()
>                                                  blk_mq_quiesce_queue()
>        /* 
>         * At this point, the queue is quiesced, but task A
>         * has alreadyentered dm_mq_queue_rq()
>         */
>                                               dm_wait_for_completion()
>                                                  blk_mq_queue_inflight()
>        /* 
>         * blk_mq_queue_inflight() doesn't see Task A's 
>         * request because it isn't started yet
>         */
>                                               set_bit(dmf_suspended_flag)
>    dm_start_request(md, rq);             dm_swap_table()
>                                             __bind()
>                                               md->immutable_target = ...
>                                          dm_target_destroy()
>         /* the previous md->immutable_target is freed */
>    init_tio(tio, rq, md);
>         /* the stale ti pointer is assigned to tio->ti */
>    tio->ti = ti;
> 
> dm_mq_queue_rq() contains no synchronization code if 
> md->immutable_target is set, so I think that this can happen, even
> though it looks unlikely. With b4459b11e840 (which was not applied in
> the customer kernel), there would be a
> set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(),
> but IMO that the above would still be possible.
> 
> If this can't happen, I have no more ideas how the observed situation
> came to pass. The customer who sent us the core claims that
> he has seen this multiple times already (but we have only this single
> core dump).

While I appreciate you making us aware of this crash, I'm really not
interested in going down the rabbit hole of reasoning through fairly
complex concerns unless you have gotten your customer a kernel with the
latest fixes (e.g. commit b4459b11e840) and they _still_ crash.

Has that happened?

NOTE: There is also 85067747cf98 ("dm: do not use waitqueue for
request-based DM"), but you probably have it, since you said that,
other than missing the changes from commit b4459b11e840, your dm-rq.c
was the same.

Mike


* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Martin Wilck @ 2024-03-19 21:01 UTC
  To: Mike Snitzer
  Cc: Ming Lei, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Hannes Reinecke, Vasilis Liaskovitis

Mike,

On Tue, 2024-03-19 at 12:53 -0400, Mike Snitzer wrote:
> 
> While I appreciate you making us aware of this crash, I'm really not
> interested in going down the rabbit hole of reasoning through fairly
> complex concerns unless you have gotten your customer a kernel with
> latest fixes (e.g. commit b4459b11e840) and they _still_ crash.
> 
> Has that happened?

Not yet; I wanted to bring up the issue upstream first. But what you're
saying makes sense. We will ask the customer to test with a kernel that
has b4459b11e840 added.

> NOTE: There is also 85067747cf98 ("dm: do not use waitqueue for
> request-based DM") but you probably have it since you said other than
> missing the changes from commit b4459b11e840 that your dm-rq.c was
> same.

This commit was already contained in the kernel that the customer had.

Thanks,
Martin




* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Ming Lei @ 2024-03-20  3:03 UTC
  To: Martin Wilck
  Cc: Mike Snitzer, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Hannes Reinecke, Vasilis Liaskovitis

On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote:
> Hello Ming,
> 
> On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> > Hello Martin,
> > 
> > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > > In a recent kernel dump analysis, we found that the kernel crashed
> > > because
> > > dm_rq_target_io tio->ti was pointing to invalid memory in
> > > dm_end_request(),
> > > in a situation where multipathd was doing map reloads because of a
> > > storage
> > > failover. The map of the respective mapped_device had been replaced
> > > by a
> > > different struct dm_table.
> > > 
> > > We observed this with a 5.3.18 distro kernel, but the code in
> > > question
> > > hasn't changed much since then. Basically, we were only missing
> > > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
> > > suspend"),
> > > which doesn't guarantee that the race I'm thinking of (see below)
> > > can't
> > > happen.
> > > 
> > > When a map is resumed after a table reload, the live table is
> > > swapped, and
> > > the tio->ti member of any live request becomes stale. 
> > > __dm_resume() avoids
> > > this by quiescing the queue and calling dm_wait_for_completion(),
> > > which
> > > waits until blk_mq_queue_inflight() doesn't report any in-flight
> > > requests.
> > > 
> > > However, blk_mq_queue_inflight() counts only "started" requests.
> > > So, if a
> > > request is dispatched before the queue was quiesced, but
> > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > > request
> > > because of memory ordering effects, __dm_suspend() may finish
> > > successfully,
> > 
> > Can you explain a bit about the exact memory order which causes
> > MQ_RQ_IN_FLIGHT
> > not observed?
> > 
> > blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> > dispatch, so after blk_mq_quiesce_queue() returns,
> > if blk_mq_queue_inflight() returns 0, it does mean there isn't any
> > active inflight requests.
> > 
> > If there is bug in this pattern, I guess more drivers may have such
> > 'risk'.
> 
> What we know for sure is that there was a bad dm_target reference in
> (struct dm_rq_target_io *tio)->ti:
> 
> crash> struct -x dm_rq_target_io c00000245ca90128
> struct dm_rq_target_io {
>   md = 0xc0000031c66a4000,
>   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> 
> crash> struct -x dm_target  0xc0080000020d0080
> struct dm_target struct: invalid kernel virtual address:
> c0080000020d0080  type: "gdb_readmem_callback"
> 
> The question is how this could have come to pass. It can only happen
> if tio->ti had been set before the map was reloaded. 
> My theory is that the IO had been dispatched before the queue had been
> quiesced, like this:
> 
> Task A                                 Task B
> (dispatching IO)                       (executing a DM_SUSPEND ioctl to
>                                        resume after DM_TABLE_LOAD)
>                                        do_resume()
>                                          dm_suspend()
>                                            __dm_suspend()
> dm_mq_queue_rq()                         
>    struct dm_target *ti =                
>        md->immutable_target;                  
>                                               dm_stop_queue()
>                                                  blk_mq_quiesce_queue()
>        /* 
>         * At this point, the queue is quiesced, but task A
>         * has alreadyentered dm_mq_queue_rq()
>         */

That shouldn't happen, blk_mq_quiesce_queue() drains all pending
dm_mq_queue_rq() and prevents new dm_mq_queue_rq() from being
called.

>                                               dm_wait_for_completion()
>                                                  blk_mq_queue_inflight()
>        /* 
>         * blk_mq_queue_inflight() doesn't see Task A's 
>         * request because it isn't started yet
>         */
>                                               set_bit(dmf_suspended_flag)
>    dm_start_request(md, rq);             dm_swap_table()
>                                             __bind()
>                                               md->immutable_target = ...
>                                          dm_target_destroy()
>         /* the previous md->immutable_target is freed */
>    init_tio(tio, rq, md);
>         /* the stale ti pointer is assigned to tio->ti */
>    tio->ti = ti;
> 
> dm_mq_queue_rq() contains no synchronization code if 
> md->immutable_target is set, so I think that this can happen, even
> though it looks unlikely. With b4459b11e840 (which was not applied in
> the customer kernel), there would be a
> set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(),
> but IMO that the above would still be possible.
> 
> If this can't happen, I have no more ideas how the observed situation
> came to pass. The customer who sent us the core claims that
> he has seen this multiple times already (but we have only this single
> core dump).

Your kernel is v5.3, which is too old, so as Mike suggested, please
try a kernel with b4459b11e840 ("dm rq: don't queue request to blk-mq
during DM suspend") applied, given that blk-mq has seen many changes here.

IMO, this issue shouldn't happen in an upstream kernel.

Thanks,
Ming



* Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
From: Martin Wilck @ 2024-03-20  9:51 UTC
  To: Ming Lei
  Cc: Mike Snitzer, Mikulas Patocka, Alasdair G Kergon, dm-devel,
	Hannes Reinecke, Vasilis Liaskovitis

On Wed, 2024-03-20 at 11:03 +0800, Ming Lei wrote:
> On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote:
> > 
> > What we know for sure is that there was a bad dm_target reference
> > in
> > (struct dm_rq_target_io *tio)->ti:
> > 
> > crash> struct -x dm_rq_target_io c00000245ca90128
> > struct dm_rq_target_io {
> >   md = 0xc0000031c66a4000,
> >   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> > 
> > crash> struct -x dm_target  0xc0080000020d0080
> > struct dm_target struct: invalid kernel virtual address:
> > c0080000020d0080  type: "gdb_readmem_callback"
> > 
> > The question is how this could have come to pass. It can only
> > happen
> > if tio->ti had been set before the map was reloaded. 
> > My theory is that the IO had been dispatched before the queue had
> > been
> > quiesced, like this:
> > 
> > Task A                                 Task B
> > (dispatching IO)                       (executing a DM_SUSPEND
> > ioctl to
> >                                        resume after DM_TABLE_LOAD)
> >                                        do_resume()
> >                                          dm_suspend()
> >                                            __dm_suspend()
> > dm_mq_queue_rq()                         
> >    struct dm_target *ti =                
> >        md->immutable_target;                  
> >                                               dm_stop_queue()
> >                                                 
> > blk_mq_quiesce_queue()
> >        /* 
> >         * At this point, the queue is quiesced, but task A
> >         * has alreadyentered dm_mq_queue_rq()
> >         */
> 
> That shouldn't happen, blk_mq_quiesce_queue() drains all pending
> dm_mq_queue_rq() and prevents new dm_mq_queue_rq() from being
> called.

Thanks for pointing this out. I'd been missing the fact that the
synchronization is achieved by the rcu_read_lock() in
__blk_mq_run_dispatch_ops(), which guards invocations of the
request dispatching code against the synchronize_rcu() in
blk_mq_wait_quiesce_done(). In our old kernel it was still in
hctx_lock(), but with the same effect.

This means that I no longer see how our dm_target reference could
have pointed to freed memory. For now, we'll follow Mike's advice.
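The guarantee described here can be sketched with a similar toy model in
userspace C: an atomic reader count stands in for the RCU read-side section
taken around dispatch, and a spin loop stands in for the synchronize_rcu()
in blk_mq_wait_quiesce_done(). This is an illustration, not the real API.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Why quiesce closes the window: the whole of ->queue_rq(), including
 * dm_start_request(), runs inside the read-side section taken by the
 * dispatch code, and the grace period in quiesce waits it out.
 */
static atomic_int rcu_readers;	/* dispatch code inside rcu_read_lock() */
static atomic_int in_flight;

void toy_dispatch(void)
{
	atomic_fetch_add(&rcu_readers, 1);	/* rcu_read_lock() */
	/* ... dm_mq_queue_rq(): sample ti, then dm_start_request() ... */
	atomic_fetch_add(&in_flight, 1);
	atomic_fetch_sub(&rcu_readers, 1);	/* rcu_read_unlock() */
}

int toy_quiesce_and_count(void)
{
	while (atomic_load(&rcu_readers))	/* synchronize_rcu() */
		;
	/*
	 * After the grace period, every dispatch that began before the
	 * quiesce has fully finished, so its request is counted here.
	 */
	return atomic_load(&in_flight);
}
```

Unlike the racy picture discussed earlier in the thread, there is no point
where a dispatch has sampled the target but is invisible to the counting
side.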

Thanks a lot,
Martin




