Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: Guoqing Jiang <jgq516@gmail.com>
To: linux-raid@vger.kernel.org
Subject: Re: Intermittent stalling of all MD IO, Debian buster (4.19.0-16)
Date: Wed, 16 Jun 2021 11:57:33 +0800	[thread overview]
Message-ID: <336ddb45-990f-5d52-23b0-c097c124d022@gmail.com> (raw)
In-Reply-To: <20210612124157.hq6da5zouwd53ucy@bitfolk.com>

Hi,

On 6/12/21 8:41 PM, Andy Smith wrote:
> Hi,
>
> I've been experiencing this problem intermittently since December of
> last year after upgrading some existing servers to Debian stable
> (buster). I can't reproduce it at will and it can sometimes take
> several months to happen again, although it has just happened twice
> in 3 days on one host.
>
> What happens is that all IO to particular MD devices seems to
> freeze. At this point I generally have no option but to power cycle
> the server as an orderly shutdown can't be completed.
>
> These servers are Xen hypervisors, and very occasionally I or a
> guest administrator has been able to shut down a guest and then
> things seem to become unblocked. I am aware that this could mean it
> could be a Xen issue and I'm pursuing that angle as well. The
> version of the Xen hypervisor in use did change as well as the OS
> upgrade.
>
> In terms of logging, this is the sort of thing I get:
>
> Jun 12 12:04:40 clockwork kernel: [216427.246183] INFO: task md5_raid1:205 blocked for more than 120 seconds.
> Jun 12 12:04:40 clockwork kernel: [216427.246995]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
> Jun 12 12:04:40 clockwork kernel: [216427.247852] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 12 12:04:40 clockwork kernel: [216427.248674] md5_raid1       D 0   205      2 0x80000000
> Jun 12 12:04:40 clockwork kernel: [216427.249534] Call Trace:
> Jun 12 12:04:40 clockwork kernel: [216427.250368] __schedule+0x29f/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.251788]  ? _raw_spin_unlock_irqrestore+0x14/0x20
> Jun 12 12:04:40 clockwork kernel: [216427.253078] schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.253945] md_super_wait+0x6e/0xa0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.254812]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.256139] md_bitmap_wait_writes+0x93/0xa0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.256994]  ? md_bitmap_get_counter+0x42/0xd0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.257787] md_bitmap_daemon_work+0x1f7/0x370 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.258608]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.259553] md_check_recovery+0x41/0x530 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.260304]  raid1d+0x5c/0xf10 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.261096]  ? lock_timer_base+0x67/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.261863]  ? _raw_spin_unlock_irqrestore+0x14/0x20
> Jun 12 12:04:40 clockwork kernel: [216427.262659]  ? try_to_del_timer_sync+0x4d/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.263436]  ? del_timer_sync+0x37/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.264189]  ? schedule_timeout+0x173/0x3b0
> Jun 12 12:04:40 clockwork kernel: [216427.264911]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.265664]  ? md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.266412]  ? process_checks+0x4a0/0x4a0 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.267124] md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.267842]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.268539] kthread+0x112/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.269231]  ? kthread_bind+0x30/0x30
> Jun 12 12:04:40 clockwork kernel: [216427.269903] ret_from_fork+0x35/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.270590] INFO: task md2_raid1:207 blocked for more than 120 seconds.
> Jun 12 12:04:40 clockwork kernel: [216427.271260]       Not tainted 4.19.0-16-amd64 #1 Debian 4.19.181-1
> Jun 12 12:04:40 clockwork kernel: [216427.271942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 12 12:04:40 clockwork kernel: [216427.272721] md2_raid1       D 0   207      2 0x80000000
> Jun 12 12:04:40 clockwork kernel: [216427.273432] Call Trace:
> Jun 12 12:04:40 clockwork kernel: [216427.274172] __schedule+0x29f/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.274869] schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.275543] io_schedule+0x12/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.276208] wbt_wait+0x205/0x300
> Jun 12 12:04:40 clockwork kernel: [216427.276861]  ? wbt_wait+0x300/0x300
> Jun 12 12:04:40 clockwork kernel: [216427.277503] rq_qos_throttle+0x31/0x40
> Jun 12 12:04:40 clockwork kernel: [216427.278193] blk_mq_make_request+0x111/0x530
> Jun 12 12:04:40 clockwork kernel: [216427.278876] generic_make_request+0x1a4/0x400
> Jun 12 12:04:40 clockwork kernel: [216427.279657]  ? try_to_wake_up+0x54/0x470
> Jun 12 12:04:40 clockwork kernel: [216427.280400] submit_bio+0x45/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.281136]  ? md_super_write.part.63+0x90/0x120 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.281788] md_update_sb.part.65+0x3a8/0x8e0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.282480]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.283106] md_check_recovery+0x272/0x530 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.283738]  raid1d+0x5c/0xf10 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.284345]  ? __schedule+0x2a7/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.284939]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.285522]  ? schedule+0x28/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.286121]  ? schedule_timeout+0x26d/0x3b0
> Jun 12 12:04:40 clockwork kernel: [216427.286702]  ? __schedule+0x2a7/0x840
> Jun 12 12:04:40 clockwork kernel: [216427.287279]  ? md_rdev_init+0xb0/0xb0 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.287871]  ? md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.288458]  ? process_checks+0x4a0/0x4a0 [raid1]
> Jun 12 12:04:40 clockwork kernel: [216427.289062] md_thread+0x94/0x150 [md_mod]
> Jun 12 12:04:40 clockwork kernel: [216427.289663]  ? finish_wait+0x80/0x80
> Jun 12 12:04:40 clockwork kernel: [216427.290288] kthread+0x112/0x130
> Jun 12 12:04:40 clockwork kernel: [216427.290858]  ? kthread_bind+0x30/0x30
> Jun 12 12:04:40 clockwork kernel: [216427.291433] ret_from_fork+0x35/0x40

The above looks like the bio for sb write was throttled by wbt, which 
caused the first calltrace.
I am wondering if there  were intensive IOs happened to the underlying 
device of md5, which
triggered wbt to throttle sb write, or can you access the underlying 
device directly?

And there was a report [1] for raid5 which may related to wbt throttle 
as well, not sure if the
change [2] could help or not.

[1]. 
https://lore.kernel.org/linux-raid/d3fced3f-6c2b-5ffa-fd24-b24ec6e7d4be@xmyslivec.cz/
[2]. 
https://lore.kernel.org/linux-raid/cb0f312e-55dc-cdc4-5d2e-b9b415de617f@gmail.com/

Thanks,
Guoqing

next prev parent reply	other threads:[~2021-06-16  3:57 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-12 12:41 Intermittent stalling of all MD IO, Debian buster (4.19.0-16) Andy Smith
2021-06-12 13:39 ` Andy Smith
2021-06-16  3:57 ` Guoqing Jiang [this message]
2021-06-16 15:05   ` Andy Smith
2021-06-18  5:35     ` Guoqing Jiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=336ddb45-990f-5d52-23b0-c097c124d022@gmail.com \
    --to=jgq516@gmail.com \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.