* Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
@ 2021-06-12 12:41 Andy Smith
  2021-06-12 13:39 ` Andy Smith
  2021-06-16  3:57 ` Guoqing Jiang
  0 siblings, 2 replies; 5+ messages in thread
From: Andy Smith @ 2021-06-12 12:41 UTC (permalink / raw)
  To: linux-raid

Hi,

I've been experiencing this problem intermittently since December of
last year after upgrading some existing servers to Debian stable
(buster). I can't reproduce it at will and it can sometimes take
several months to happen again, although it has just happened twice
in 3 days on one host.

What happens is that all IO to particular MD devices seems to
freeze. At this point I generally have no option but to power cycle
the server as an orderly shutdown can't be completed.

These servers are Xen hypervisors, and very occasionally I or a
guest administrator has been able to shut down a guest and then
things seem to become unblocked. I am aware that this could mean it
could be a Xen issue and I'm pursuing that angle as well. The
version of the Xen hypervisor in use did change as well as the OS
upgrade.

In terms of logging, this is the sort of thing I get:

Jun 12 12:04:40 clockwork kernel: [216427.246183] INFO: task md5_raid1:205 blocked for more than 120 seconds.
Jun 12 12:04:40 clockwork kernel: [216427.246995]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
Jun 12 12:04:40 clockwork kernel: [216427.247852] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 12:04:40 clockwork kernel: [216427.248674] md5_raid1       D 0   205      2 0x80000000
Jun 12 12:04:40 clockwork kernel: [216427.249534] Call Trace:
Jun 12 12:04:40 clockwork kernel: [216427.250368] __schedule+0x29f/0x840
Jun 12 12:04:40 clockwork kernel: [216427.251788]  ? _raw_spin_unlock_irqrestore+0x14/0x20
Jun 12 12:04:40 clockwork kernel: [216427.253078] schedule+0x28/0x80
Jun 12 12:04:40 clockwork kernel: [216427.253945] md_super_wait+0x6e/0xa0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.254812]  ? finish_wait+0x80/0x80
Jun 12 12:04:40 clockwork kernel: [216427.256139] md_bitmap_wait_writes+0x93/0xa0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.256994]  ? md_bitmap_get_counter+0x42/0xd0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.257787] md_bitmap_daemon_work+0x1f7/0x370 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.258608]  ? md_rdev_init+0xb0/0xb0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.259553] md_check_recovery+0x41/0x530 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.260304]  raid1d+0x5c/0xf10 [raid1]
Jun 12 12:04:40 clockwork kernel: [216427.261096]  ? lock_timer_base+0x67/0x80
Jun 12 12:04:40 clockwork kernel: [216427.261863]  ? _raw_spin_unlock_irqrestore+0x14/0x20
Jun 12 12:04:40 clockwork kernel: [216427.262659]  ? try_to_del_timer_sync+0x4d/0x80
Jun 12 12:04:40 clockwork kernel: [216427.263436]  ? del_timer_sync+0x37/0x40
Jun 12 12:04:40 clockwork kernel: [216427.264189]  ? schedule_timeout+0x173/0x3b0
Jun 12 12:04:40 clockwork kernel: [216427.264911]  ? md_rdev_init+0xb0/0xb0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.265664]  ? md_thread+0x94/0x150 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.266412]  ? process_checks+0x4a0/0x4a0 [raid1]
Jun 12 12:04:40 clockwork kernel: [216427.267124] md_thread+0x94/0x150 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.267842]  ? finish_wait+0x80/0x80
Jun 12 12:04:40 clockwork kernel: [216427.268539] kthread+0x112/0x130
Jun 12 12:04:40 clockwork kernel: [216427.269231]  ? kthread_bind+0x30/0x30
Jun 12 12:04:40 clockwork kernel: [216427.269903] ret_from_fork+0x35/0x40
Jun 12 12:04:40 clockwork kernel: [216427.270590] INFO: task md2_raid1:207 blocked for more than 120 seconds.
Jun 12 12:04:40 clockwork kernel: [216427.271260]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
Jun 12 12:04:40 clockwork kernel: [216427.271942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 12:04:40 clockwork kernel: [216427.272721] md2_raid1       D 0   207      2 0x80000000
Jun 12 12:04:40 clockwork kernel: [216427.273432] Call Trace:
Jun 12 12:04:40 clockwork kernel: [216427.274172] __schedule+0x29f/0x840
Jun 12 12:04:40 clockwork kernel: [216427.274869] schedule+0x28/0x80
Jun 12 12:04:40 clockwork kernel: [216427.275543] io_schedule+0x12/0x40
Jun 12 12:04:40 clockwork kernel: [216427.276208] wbt_wait+0x205/0x300
Jun 12 12:04:40 clockwork kernel: [216427.276861]  ? wbt_wait+0x300/0x300
Jun 12 12:04:40 clockwork kernel: [216427.277503] rq_qos_throttle+0x31/0x40
Jun 12 12:04:40 clockwork kernel: [216427.278193] blk_mq_make_request+0x111/0x530
Jun 12 12:04:40 clockwork kernel: [216427.278876] generic_make_request+0x1a4/0x400
Jun 12 12:04:40 clockwork kernel: [216427.279657]  ? try_to_wake_up+0x54/0x470
Jun 12 12:04:40 clockwork kernel: [216427.280400] submit_bio+0x45/0x130
Jun 12 12:04:40 clockwork kernel: [216427.281136]  ? md_super_write.part.63+0x90/0x120 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.281788] md_update_sb.part.65+0x3a8/0x8e0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.282480]  ? md_rdev_init+0xb0/0xb0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.283106] md_check_recovery+0x272/0x530 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.283738]  raid1d+0x5c/0xf10 [raid1]
Jun 12 12:04:40 clockwork kernel: [216427.284345]  ? __schedule+0x2a7/0x840
Jun 12 12:04:40 clockwork kernel: [216427.284939]  ? md_rdev_init+0xb0/0xb0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.285522]  ? schedule+0x28/0x80
Jun 12 12:04:40 clockwork kernel: [216427.286121]  ? schedule_timeout+0x26d/0x3b0
Jun 12 12:04:40 clockwork kernel: [216427.286702]  ? __schedule+0x2a7/0x840
Jun 12 12:04:40 clockwork kernel: [216427.287279]  ? md_rdev_init+0xb0/0xb0 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.287871]  ? md_thread+0x94/0x150 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.288458]  ? process_checks+0x4a0/0x4a0 [raid1]
Jun 12 12:04:40 clockwork kernel: [216427.289062] md_thread+0x94/0x150 [md_mod]
Jun 12 12:04:40 clockwork kernel: [216427.289663]  ? finish_wait+0x80/0x80
Jun 12 12:04:40 clockwork kernel: [216427.290288] kthread+0x112/0x130
Jun 12 12:04:40 clockwork kernel: [216427.290858]  ? kthread_bind+0x30/0x30
Jun 12 12:04:40 clockwork kernel: [216427.291433] ret_from_fork+0x35/0x40

Anyone seen anything like this before or have any suggestions for
what to try next?

It's not really feasible for me to try to see if it happens without
running as a Xen dom0 because even if it doesn't happen for 2 months
I won't have confidence…

Thanks,
Andy


* Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
  2021-06-12 12:41 Intermittent stalling of all MD IO, Debian buster (4.19.0-16) Andy Smith
@ 2021-06-12 13:39 ` Andy Smith
  2021-06-16  3:57 ` Guoqing Jiang
  1 sibling, 0 replies; 5+ messages in thread
From: Andy Smith @ 2021-06-12 13:39 UTC (permalink / raw)
  To: linux-raid

On Sat, Jun 12, 2021 at 12:41:57PM +0000, Andy Smith wrote:
> Hi,
> 
> I've been experiencing this problem intermittently since December of
> last year after upgrading some existing servers to Debian stable
> (buster). I can't reproduce it at will and it can sometimes take
> several months to happen again, although it has just happened twice
> in 3 days on one host.

I was in a bit of a rush when I dashed that email off. Here's some
more information about the typical configuration of these servers.

$ uname -a
Linux clockwork 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
$ mdadm --version
mdadm - v4.1 - 2018-10-01

Most of these servers have spent about 5 years running on earlier
versions of Debian, notably the full Debian jessie release cycle,
without issue. I've only started having issues after upgrading to
Debian buster.

I will omit details of all member devices as I'm not getting issues
with IO errors, dropouts etc. Most of the servers just have two SATA
SSDs although I am also seeing this on more complex setups.

$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]

md5 : active raid1 sdb5[1] sda5[0]
      3742779392 blocks super 1.2 [2/2] [UU]
      bitmap: 14/28 pages [56KB], 65536KB chunk

md1 : active raid1 sdb1[1] sda1[0]
      975296 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda2[0] sdb2[1]
      4878336 blocks super 1.2 [2/2] [UU]

md3 : active (auto-read-only) raid1 sdb3[1] sda3[0]
      1951744 blocks super 1.2 [2/2] [UU]

unused devices: <none>

$ sudo smartctl -i /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7KH3T8HALS-00005
Serial Number:    S47RNA0MC01657
LU WWN Device Id: 5 002538 e09c88bb3
Firmware Version: HXM7404Q
User Capacity:    3,840,755,982,336 bytes [3.84 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 12 13:36:47 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

$ sudo smartctl -i /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7KH3T8HALS-00005
Serial Number:    S47RNA0MC01656
LU WWN Device Id: 5 002538 e09c88b8a
Firmware Version: HXM7404Q
User Capacity:    3,840,755,982,336 bytes [3.84 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jun 12 13:36:54 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

At this point it would be really helpful if I could even narrow it
down to "Xen problem" or "dom0 kernel problem". :(
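
Next time it hangs I intend to capture a dump of all blocked tasks via sysrq
to help with that (a sketch; this assumes sysrq is enabled in the dom0 kernel
and that I can still get a shell or a console):

$ echo 1 | sudo tee /proc/sys/kernel/sysrq      # enable all sysrq functions
$ echo w | sudo tee /proc/sysrq-trigger         # dump tasks in uninterruptible (D) state
$ sudo dmesg | tail -n 200                      # collect the resulting stack traces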

Cheers,
Andy


* Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
  2021-06-12 12:41 Intermittent stalling of all MD IO, Debian buster (4.19.0-16) Andy Smith
  2021-06-12 13:39 ` Andy Smith
@ 2021-06-16  3:57 ` Guoqing Jiang
  2021-06-16 15:05   ` Andy Smith
  1 sibling, 1 reply; 5+ messages in thread
From: Guoqing Jiang @ 2021-06-16  3:57 UTC (permalink / raw)
  To: linux-raid

Hi,

On 6/12/21 8:41 PM, Andy Smith wrote:
> Hi,
>
> I've been experiencing this problem intermittently since December of
> last year after upgrading some existing servers to Debian stable
> (buster). I can't reproduce it at will and it can sometimes take
> several months to happen again, although it has just happened twice
> in 3 days on one host.
>
> What happens is that all IO to particular MD devices seems to
> freeze. At this point I generally have no option but to power cycle
> the server as an orderly shutdown can't be completed.
>
> These servers are Xen hypervisors, and very occasionally I or a
> guest administrator has been able to shut down a guest and then
> things seem to become unblocked. I am aware that this could mean it
> could be a Xen issue and I'm pursuing that angle as well. The
> version of the Xen hypervisor in use did change as well as the OS
> upgrade.
>
> In terms of logging, this is the sort of thing I get:
>
> Jun 12 12:04:40 clockwork kernel: [216427.246183] INFO: task md5_raid1:205 blocked for more than 120 seconds.
> Jun 12 12:04:40 clockwork kernel: [216427.246995]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
> Jun 12 12:04:40 clockwork kernel: [216427.247852] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 12 12:04:40 clockwork kernel: [216427.248674] md5_raid1       D 0   205      2 0x80000000
> Jun 12 12:04:40 clockwork kernel: [216427.249534] Call Trace:
> Jun 12 12:04:40 clockwork kernel: [216427.250368] __schedule+0x29f/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.251788]  ? _raw_spin_unlock_irqrestore+0x14/0x20
> Jun 12 12:04:40 clockwork kernel: [216427.253078] schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.253945] md_super_wait+0x6e/0xa0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.254812]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.256139] md_bitmap_wait_writes+0x93/0xa0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.256994]  ? md_bitmap_get_counter+0x42/0xd0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.257787] md_bitmap_daemon_work+0x1f7/0x370 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.258608]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.259553] md_check_recovery+0x41/0x530 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.260304]  raid1d+0x5c/0xf10 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.261096]  ? lock_timer_base+0x67/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.261863]  ? _raw_spin_unlock_irqrestore+0x14/0x20
> Jun 12 12:04:40 clockwork kernel: [216427.262659]  ? try_to_del_timer_sync+0x4d/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.263436]  ? del_timer_sync+0x37/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.264189]  ? schedule_timeout+0x173/0x3b0
> Jun 12 12:04:40 clockwork kernel: [216427.264911]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.265664]  ? md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.266412]  ? process_checks+0x4a0/0x4a0 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.267124] md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.267842]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.268539] kthread+0x112/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.269231]  ? kthread_bind+0x30/0x30
> Jun 12 12:04:40 clockwork kernel: [216427.269903] ret_from_fork+0x35/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.270590] INFO: task md2_raid1:207 blocked for more than 120 seconds.
> Jun 12 12:04:40 clockwork kernel: [216427.271260]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
> Jun 12 12:04:40 clockwork kernel: [216427.271942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 12 12:04:40 clockwork kernel: [216427.272721] md2_raid1       D 0   207      2 0x80000000
> Jun 12 12:04:40 clockwork kernel: [216427.273432] Call Trace:
> Jun 12 12:04:40 clockwork kernel: [216427.274172] __schedule+0x29f/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.274869] schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.275543] io_schedule+0x12/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.276208] wbt_wait+0x205/0x300
> Jun 12 12:04:40 clockwork kernel: [216427.276861]  ? wbt_wait+0x300/0x300
> Jun 12 12:04:40 clockwork kernel: [216427.277503] rq_qos_throttle+0x31/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.278193] blk_mq_make_request+0x111/0x530
> Jun 12 12:04:40 clockwork kernel: [216427.278876] generic_make_request+0x1a4/0x400
> Jun 12 12:04:40 clockwork kernel: [216427.279657]  ? try_to_wake_up+0x54/0x470
> Jun 12 12:04:40 clockwork kernel: [216427.280400] submit_bio+0x45/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.281136]  ? md_super_write.part.63+0x90/0x120 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.281788] md_update_sb.part.65+0x3a8/0x8e0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.282480]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.283106] md_check_recovery+0x272/0x530 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.283738]  raid1d+0x5c/0xf10 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.284345]  ? __schedule+0x2a7/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.284939]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.285522]  ? schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.286121]  ? schedule_timeout+0x26d/0x3b0
> Jun 12 12:04:40 clockwork kernel: [216427.286702]  ? __schedule+0x2a7/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.287279]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.287871]  ? md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.288458]  ? process_checks+0x4a0/0x4a0 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.289062] md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.289663]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.290288] kthread+0x112/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.290858]  ? kthread_bind+0x30/0x30
> Jun 12 12:04:40 clockwork kernel: [216427.291433] ret_from_fork+0x35/0x40

The above looks like the bio for the sb write was throttled by wbt, which
caused the first call trace. I am wondering whether there were intensive IOs
happening to the underlying devices of md5 which triggered wbt to throttle
the sb write. Can you access the underlying devices directly?
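
If it happens again, it might also be worth checking whether wbt is actually
engaged on the member devices, something like the below (a rough sketch; the
wbt_lat_usec attribute should exist on 4.19, writing 0 disables write-back
throttling and -1 restores the default):

$ cat /sys/block/sda/queue/wbt_lat_usec
$ cat /sys/block/sdb/queue/wbt_lat_usec
$ echo 0 | sudo tee /sys/block/sda/queue/wbt_lat_usec   # temporarily disable wbt on sda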

And there was a report [1] for raid5 which may be related to wbt throttling
as well; not sure whether the change [2] could help or not.

[1]. https://lore.kernel.org/linux-raid/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
[2]. https://lore.kernel.org/linux-raid/cb0f312e-55dc-cdc4-5d2e-b9b415de617f@gmail.com/

Thanks,
Guoqing


* Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
  2021-06-16  3:57 ` Guoqing Jiang
@ 2021-06-16 15:05   ` Andy Smith
  2021-06-18  5:35     ` Guoqing Jiang
  0 siblings, 1 reply; 5+ messages in thread
From: Andy Smith @ 2021-06-16 15:05 UTC (permalink / raw)
  To: linux-raid

Hi Guoqing,

Thanks for looking at this.

On Wed, Jun 16, 2021 at 11:57:33AM +0800, Guoqing Jiang wrote:
> The above looks like the bio for the sb write was throttled by wbt, which
> caused the first call trace. I am wondering whether there were intensive
> IOs happening to the underlying devices of md5 which triggered wbt to
> throttle the sb write. Can you access the underlying devices directly?

Next time it occurs I can check if I am able to read from the SSDs
that make up the MD device, if that information would be helpful.
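
For reference, this is roughly what I have in mind for that check (a sketch;
using O_DIRECT so the reads go to the member devices rather than the page
cache):

$ sudo dd if=/dev/sda of=/dev/null bs=1M count=64 iflag=direct
$ sudo dd if=/dev/sdb of=/dev/null bs=1M count=64 iflag=direct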

I have never been able to replicate the problem in a test
environment so it is likely that it needs to be under heavy load for
it to happen.

> And there was a report [1] for raid5 which may be related to wbt throttling
> as well; not sure whether the change [2] could help or not.
> 
> [1]. https://lore.kernel.org/linux-raid/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
> [2]. https://lore.kernel.org/linux-raid/cb0f312e-55dc-cdc4-5d2e-b9b415de617f@gmail.com/

All of my MD arrays tend to be RAID-1 or RAID-10, two devices, no
journal, internal bitmap. I see the reporter of this problem was
using RAID-6 with an external write journal. I can still build a
kernel with this patch and try it out, if you think it could possibly
help. The long time between incidents obviously makes things
extra challenging.

The next step I have taken is to put the buster-backports kernel
package (5.10.24-1~bpo10+1) on two test servers, and will also boot
the production hosts into this if they should experience the problem
again.

Thanks,
Andy


* Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
  2021-06-16 15:05   ` Andy Smith
@ 2021-06-18  5:35     ` Guoqing Jiang
  0 siblings, 0 replies; 5+ messages in thread
From: Guoqing Jiang @ 2021-06-18  5:35 UTC (permalink / raw)
  To: linux-raid

Hi Andy,

On 6/16/21 11:05 PM, Andy Smith wrote:
> Hi Guoqing,
>
> Thanks for looking at this.
>
> On Wed, Jun 16, 2021 at 11:57:33AM +0800, Guoqing Jiang wrote:
>> The above looks like the bio for the sb write was throttled by wbt, which
>> caused the first call trace. I am wondering whether there were intensive
>> IOs happening to the underlying devices of md5 which triggered wbt to
>> throttle the sb write. Can you access the underlying devices directly?
> Next time it occurs I can check if I am able to read from the SSDs
> that make up the MD device, if that information would be helpful.
>
> I have never been able to replicate the problem in a test
> environment so it is likely that it needs to be under heavy load for
> it to happen.

I guess so, and a reliable reproducer would definitely help us to analyse
the root cause.
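
If you want to try to provoke it, a sustained random write load against a
scratch filesystem on the array is the kind of pressure that tends to make
wbt throttle (a rough sketch with fio; the mount point, size and runtime are
only examples, keep it away from production data):

$ sudo fio --name=mdload --directory=/mnt/scratch --rw=randwrite --bs=4k \
      --size=4g --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 \
      --runtime=600 --time_based --group_reporting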

>> And there was a report [1] for raid5 which may be related to wbt throttling
>> as well; not sure whether the change [2] could help or not.
>>
>> [1]. https://lore.kernel.org/linux-raid/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
>> [2]. https://lore.kernel.org/linux-raid/cb0f312e-55dc-cdc4-5d2e-b9b415de617f@gmail.com/
> All of my MD arrays tend to be RAID-1 or RAID-10, two devices, no
> journal, internal bitmap. I see the reporter of this problem was
> using RAID-6 with an external write journal. I can still build a
> kernel with this patch and try it out, if you think it could possibly
> help.

Yes, because both issues have wbt-related call traces even though the RAID
level is different.

> The long time between incidents obviously makes things
> extra challenging.
>
> The next step I have taken is to put the buster-backports kernel
> package (5.10.24-1~bpo10+1) on two test servers, and will also boot
> the production hosts into this if they should experience the problem
> again.

Good luck :).

Thanks,
Guoqing

