From: "Luse, Paul E" <paul.e.luse@intel.com>
To: Xiao Ni <xni@redhat.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>,
Paul E Luse <paul.e.luse@linux.intel.com>,
"song@kernel.org" <song@kernel.org>,
"neilb@suse.com" <neilb@suse.com>, "shli@fb.com" <shli@fb.com>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"yi.zhang@huawei.com" <yi.zhang@huawei.com>,
"yangerkun@huawei.com" <yangerkun@huawei.com>,
"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()
Date: Tue, 27 Feb 2024 14:26:39 +0000
Message-ID: <813BAD45-4484-4B1E-BCD0-40C159DA62BA@intel.com>
In-Reply-To: <CALTww2_iPFJiX17ORbN2+ssdYWVk0=M4pCgJDoWh_-jJPn0bRA@mail.gmail.com>
> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@redhat.com> wrote:
>
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2024/02/27 10:23, Xiao Ni wrote:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
>>>> the 'choose next idle' case in read_balance():
>>>>
>>>> read_balance:
>>>>  for_each_rdev
>>>>   if (next_seq_sect == this_sector || dist == 0)
>>>>   -> sequential reads
>>>>    best_disk = disk;
>>>>    if (...)
>>>>     choose_next_idle = 1
>>>>     continue;
>>>>
>>>>  for_each_rdev
>>>>  -> iterate next rdev
>>>>   if (pending == 0)
>>>>    best_disk = disk;
>>>>    -> choose the next idle disk
>>>>    break;
>>>>
>>>>   if (choose_next_idle)
>>>>    -> keep using this rdev if there is no other idle disk
>>>>    continue
>>>>
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> removed the code:
>>>>
>>>> - /* If device is idle, use it */
>>>> - if (pending == 0) {
>>>> - best_disk = disk;
>>>> - break;
>>>> - }
>>>>
>>>> Hence 'choose next idle' never works now. Fix this problem as
>>>> follows:
>>>>
>>>> 1) don't set best_disk in this case; read_balance() will choose the best
>>>> disk after iterating over all the disks;
>>>> 2) increment 'pending' so that another idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>> are rotational, this disk will still be chosen;
>>>>
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>> drivers/md/raid1.c | 21 ++++++++++++---------
>>>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>  	unsigned int min_pending;
>>>>  	struct md_rdev *rdev;
>>>>  	int choose_first;
>>>> -	int choose_next_idle;
>>>>
>>>>  	/*
>>>>  	 * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>  	best_pending_disk = -1;
>>>>  	min_pending = UINT_MAX;
>>>>  	best_good_sectors = 0;
>>>> -	choose_next_idle = 0;
>>>>  	clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>>  	if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>  			int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>  			struct raid1_info *mirror = &conf->mirrors[disk];
>>>>
>>>> -			best_disk = disk;
>>>>  			/*
>>>>  			 * If buffered sequential IO size exceeds optimal
>>>>  			 * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>  			    mirror->next_seq_sect > opt_iosize &&
>>>>  			    mirror->next_seq_sect - opt_iosize >=
>>>>  			    mirror->seq_start) {
>>>> -				choose_next_idle = 1;
>>>> -				continue;
>>>> +				/*
>>>> +				 * Add 'pending' to avoid choosing this disk if
>>>> +				 * there is other idle disk.
>>>> +				 * Set 'dist' to 0, so that if there is no other
>>>> +				 * idle disk and all disks are rotational, this
>>>> +				 * disk will still be chosen.
>>>> +				 */
>>>> +				pending++;
>>>> +				dist = 0;
>>>> +			} else {
>>>> +				best_disk = disk;
>>>> +				break;
>>>>  			}
>>>> -			break;
>>>>  		}
>>>
>>> Hi Kuai
>>>
>>> I noticed something. Commit 12cee5a8a29e sets best_disk if it's
>>> a sequential read. If there are no other idle disks, it will read from
>>> the sequential disk. With this patch, it reads from
>>> best_pending_disk even if min_pending is not 0. That looks like wrong
>>> behaviour?
>>
>> Yes, nice catch, I hadn't noticed this yet... So there is a hidden
>> piece of logic: sequential IO has higher priority than the minimal
>> 'pending' selection; it only ranks below 'choose_next_idle' when an
>> idle disk exists.
>
> Yes.
>
>
>>
>> Looks like if we want to keep this behaviour, we can add a 'sequential
>> disk':
>>
>> if (is_sequential())
>>         if (!should_choose_next())
>>                 return disk;
>>         ctl.sequential_disk = disk;
>>
>> ...
>>
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>>         return ctl.sequential_disk;
>
> Agree with this, thanks :)
>
> Best Regards
> Xiao
Yup, agreed as well. This will certainly help with the follow-up to this series for sequential read improvements :)
>>
>> Thanks,
>> Kuai
>>
>>>
>>> Best Regards
>>> Xiao
>>>>
>>>> -		if (choose_next_idle)
>>>> -			continue;
>>>> -
>>>>  		if (min_pending > pending) {
>>>>  			min_pending = pending;
>>>>  			best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
>
>
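For anyone following along, here is a minimal userspace sketch of the selection order discussed above: a sequential disk wins outright unless it should hand over to an idle disk, and the minimal-'pending' disk only beats a recorded sequential disk when it is actually idle. This is not the kernel code. The names struct disk_state, struct read_balance_ctl and choose_read_disk() are made up for illustration, the 'dist'/rotational selection is left out entirely, and letting the sequential disk also take part in the min-pending comparison is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one mirror; not the kernel's struct. */
struct disk_state {
	unsigned int pending;    /* in-flight requests on this disk */
	bool sequential;         /* this read continues the disk's last read */
	bool should_choose_next; /* sequential run exceeded the optimal IO size */
};

/* Hypothetical context, loosely modelled on the 'ctl' in the thread. */
struct read_balance_ctl {
	unsigned int min_pending;
	int min_pending_disk;
	int sequential_disk;
};

/* Returns the index of the chosen disk, or -1 if there are no disks. */
static int choose_read_disk(const struct disk_state *disks, int ndisks)
{
	struct read_balance_ctl ctl = {
		.min_pending = UINT_MAX,
		.min_pending_disk = -1,
		.sequential_disk = -1,
	};

	assert(disks != NULL || ndisks == 0);

	for (int disk = 0; disk < ndisks; disk++) {
		const struct disk_state *d = &disks[disk];

		if (d->sequential) {
			/* Plain sequential read: stay on this disk. */
			if (!d->should_choose_next)
				return disk;
			/* Remember it in case no idle disk turns up. */
			ctl.sequential_disk = disk;
		}

		/* Assumption: the sequential disk is also a candidate here. */
		if (d->pending < ctl.min_pending) {
			ctl.min_pending = d->pending;
			ctl.min_pending_disk = disk;
		}
	}

	/* No idle disk found: sequential IO beats minimal 'pending'. */
	if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
		return ctl.sequential_disk;

	return ctl.min_pending_disk;
}
```

With two disks where disk 0 is sequential, has 3 requests pending and should hand over, the sketch picks disk 1 when disk 1 is idle, but stays on disk 0 when disk 1 has even one request pending. The second case is the behaviour Xiao pointed out that the patch as posted would lose.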