Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

Linux-Raid Archives mirror

From: "Luse, Paul E" <paul.e.luse@intel.com>
To: Xiao Ni <xni@redhat.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>,
	Paul E Luse <paul.e.luse@linux.intel.com>,
	"song@kernel.org" <song@kernel.org>,
	"neilb@suse.com" <neilb@suse.com>, "shli@fb.com" <shli@fb.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"yi.zhang@huawei.com" <yi.zhang@huawei.com>,
	"yangerkun@huawei.com" <yangerkun@huawei.com>,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()
Date: Tue, 27 Feb 2024 14:26:39 +0000
Message-ID: <813BAD45-4484-4B1E-BCD0-40C159DA62BA@intel.com>
In-Reply-To: <CALTww2_iPFJiX17ORbN2+ssdYWVk0=M4pCgJDoWh_-jJPn0bRA@mail.gmail.com>



> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@redhat.com> wrote:
> 
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>> 
>> Hi,
>> 
>> On 2024/02/27 10:23, Xiao Ni wrote:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>>> 
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>> 
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
>>>> the 'choose next idle' case to read_balance():
>>>> 
>>>> read_balance:
>>>>  for_each_rdev
>>>>   if(next_seq_sect == this_sector || disk == 0)
>>>>   -> sequential reads
>>>>    best_disk = disk;
>>>>    if (...)
>>>>     choose_next_idle = 1
>>>>     continue;
>>>> 
>>>>  for_each_rdev
>>>>  -> iterate next rdev
>>>>   if (pending == 0)
>>>>    best_disk = disk;
>>>>    -> choose the next idle disk
>>>>    break;
>>>> 
>>>>   if (choose_next_idle)
>>>>    -> keep using this rdev if there are no other idle disks
>>>>    continue
>>>> 
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> removed the code:
>>>> 
>>>> -               /* If device is idle, use it */
>>>> -               if (pending == 0) {
>>>> -                       best_disk = disk;
>>>> -                       break;
>>>> -               }
>>>> 
>>>> Hence 'choose next idle' can never take effect now. Fix this problem as
>>>> follows:
>>>> 
>>>> 1) don't set best_disk in this case; read_balance() will choose the best
>>>>    disk after iterating over all the disks;
>>>> 2) increment 'pending' so that another idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>>    are rotational, this disk will still be chosen;
>>>> 
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Paul Luse <paul.e.luse@linux.intel.com>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>>  drivers/md/raid1.c | 21 ++++++++++++---------
>>>>  1 file changed, 12 insertions(+), 9 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         unsigned int min_pending;
>>>>         struct md_rdev *rdev;
>>>>         int choose_first;
>>>> -       int choose_next_idle;
>>>> 
>>>>         /*
>>>>          * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         best_pending_disk = -1;
>>>>         min_pending = UINT_MAX;
>>>>         best_good_sectors = 0;
>>>> -       choose_next_idle = 0;
>>>>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>> 
>>>>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>                         struct raid1_info *mirror = &conf->mirrors[disk];
>>>> 
>>>> -                       best_disk = disk;
>>>>                         /*
>>>>                          * If buffered sequential IO size exceeds optimal
>>>>                          * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                             mirror->next_seq_sect > opt_iosize &&
>>>>                             mirror->next_seq_sect - opt_iosize >=
>>>>                             mirror->seq_start) {
>>>> -                               choose_next_idle = 1;
>>>> -                               continue;
>>>> +                               /*
>>>> +                                * Add 'pending' to avoid choosing this disk if
>>>> +                                * there is other idle disk.
>>>> +                                * Set 'dist' to 0, so that if there is no other
>>>> +                                * idle disk and all disks are rotational, this
>>>> +                                * disk will still be chosen.
>>>> +                                */
>>>> +                               pending++;
>>>> +                               dist = 0;
>>>> +                       } else {
>>>> +                               best_disk = disk;
>>>> +                               break;
>>>>                         }
>>>> -                       break;
>>>>                 }
>>> 
>>> Hi Kuai
>>> 
>>> I noticed something. Commit 12cee5a8a29e sets best_disk if it's a
>>> sequential read, so if there are no other idle disks, it reads from
>>> the sequential disk. With this patch, it reads from best_pending_disk
>>> even if min_pending is not 0. That looks like wrong behaviour, doesn't
>>> it?
>> 
>> Yes, nice catch, I hadn't noticed this... So there is hidden logic
>> here: sequential IO has a higher priority than the minimal 'pending'
>> selection, and is lower only than 'choose_next_idle' when an idle disk
>> exists.
> 
> Yes.
> 
> 
>> 
>> If we want to keep this behaviour, it looks like we can add a
>> 'sequential disk':
>> 
>> if (is_sequential())
>>  if (!should_choose_next())
>>   return disk;
>>  ctl.sequential_disk = disk;
>> 
>> ...
>> 
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>>  return ctl.sequential_disk;
> 
> Agree with this, thanks :)
> 
> Best Regards
> Xiao

Yup, agree as well. This will certainly help with the follow-up to this series for sequential read improvements :)
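To make Xiao's observation concrete, here is a minimal, simplified model of the candidate loop with the posted patch applied. `struct mirror_model` and `model_patched()` are hypothetical names for illustration only, not kernel code; bad blocks, failfast, write-mostly and IO_BLOCKED handling are left out, and the fallback after the loop is assumed to follow read_balance()'s "any non-rotational disk or min_pending == 0, use best_pending_disk, else best_dist_disk" rule.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified per-mirror state; not a kernel structure. */
struct mirror_model {
	unsigned int pending;	/* in-flight requests, like rdev->nr_pending */
	long dist;		/* |this_sector - head_position| */
	bool nonrot;		/* non-rotational device */
	bool sequential;	/* read continues this mirror's last read */
	bool choose_next;	/* sequential run already exceeds opt_iosize */
};

/* Models read_balance()'s candidate loop with the posted 03/10 patch. */
static int model_patched(const struct mirror_model *m, int n)
{
	int best_disk = -1, best_pending_disk = -1, best_dist_disk = -1;
	unsigned int min_pending = UINT_MAX;
	long best_dist = LONG_MAX;
	bool has_nonrot_disk = false;

	for (int disk = 0; disk < n; disk++) {
		unsigned int pending = m[disk].pending;
		long dist = m[disk].dist;

		has_nonrot_disk |= m[disk].nonrot;
		if (m[disk].sequential || dist == 0) {
			if (!m[disk].choose_next) {
				best_disk = disk;	/* stay on the sequential disk */
				break;
			}
			pending++;	/* an idle disk should beat this one */
			dist = 0;	/* win the distance race if all rotational */
		}
		if (min_pending > pending) {
			min_pending = pending;
			best_pending_disk = disk;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = disk;
		}
	}

	if (best_disk == -1) {
		/*
		 * With any non-rotational mirror, the least-pending disk wins
		 * even when min_pending != 0, so the sequential disk can lose
		 * to a disk that is merely less busy, not idle.
		 */
		if (has_nonrot_disk || min_pending == 0)
			best_disk = best_pending_disk;
		else
			best_disk = best_dist_disk;
	}
	return best_disk;
}
```

For example, with mirror 0 sequential, non-rotational, past opt_iosize and at pending 3, and mirror 1 at pending 1, `model_patched()` returns 1 even though no mirror is idle; before 2e52d449bcec the sequential mirror would have been kept in that case.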

>> 
>> Thanks,
>> Kuai
>> 
>>> 
>>> Best Regards
>>> Xiao
>>>> 
>>>> -               if (choose_next_idle)
>>>> -                       continue;
>>>> -
>>>>                 if (min_pending > pending) {
>>>>                         min_pending = pending;
>>>>                         best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>> 
>>>> 
>>> 
>>> .
>>> 
>> 
> 
>
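A sketch of the selection order Yu Kuai proposes above, under the same simplifying assumptions: `struct mirror_model` and `model_proposed()` are hypothetical, and the eventual kernel change may differ in detail. The sequential mirror still gets `pending++` so that an idle mirror wins, but it is remembered in `sequential_disk` and returned ahead of the least-pending and closest-distance fallbacks whenever `min_pending` is non-zero (no idle mirror). The `dist = 0` trick is dropped here because, once `sequential_disk` is set, the distance fallback is never reached.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified per-mirror state; not a kernel structure. */
struct mirror_model {
	unsigned int pending;	/* in-flight requests, like rdev->nr_pending */
	long dist;		/* |this_sector - head_position| */
	bool nonrot;		/* non-rotational device */
	bool sequential;	/* read continues this mirror's last read */
	bool choose_next;	/* sequential run already exceeds opt_iosize */
};

/* Models the selection order proposed in this thread (sequential_disk). */
static int model_proposed(const struct mirror_model *m, int n)
{
	int sequential_disk = -1, best_pending_disk = -1, best_dist_disk = -1;
	unsigned int min_pending = UINT_MAX;
	long best_dist = LONG_MAX;
	bool has_nonrot_disk = false;

	for (int disk = 0; disk < n; disk++) {
		unsigned int pending = m[disk].pending;
		long dist = m[disk].dist;

		has_nonrot_disk |= m[disk].nonrot;
		if (m[disk].sequential || dist == 0) {
			if (!m[disk].choose_next)
				return disk;	/* keep reading sequentially */
			pending++;		/* an idle disk should still win */
			sequential_disk = disk;	/* remember it as the fallback */
		}
		if (min_pending > pending) {
			min_pending = pending;
			best_pending_disk = disk;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = disk;
		}
	}

	/* No idle disk: sequential IO outranks the least-pending choice. */
	if (sequential_disk != -1 && min_pending != 0)
		return sequential_disk;
	if (has_nonrot_disk || min_pending == 0)
		return best_pending_disk;
	return best_dist_disk;
}
```

With mirror 0 sequential and past opt_iosize at pending 3 and mirror 1 at pending 1, this returns 0 (stay sequential); if mirror 1 is idle (pending 0), it returns 1, which matches the 12cee5a8a29e intent that only an idle disk may take over.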


Thread overview: 42+ messages
2024-02-22  7:57 [PATCH md-6.9 00/10] md/raid1: refactor read_balance() and some minor fix Yu Kuai
2024-02-22  7:57 ` [PATCH md-6.9 01/10] md: add a new helper rdev_has_badblock() Yu Kuai
2024-02-26  9:01   ` Xiao Ni
2024-02-22  7:57 ` [PATCH md-6.9 02/10] md: record nonrot rdevs while adding/removing rdevs to conf Yu Kuai
2024-02-26 13:12   ` Xiao Ni
2024-02-26 13:25     ` Yu Kuai
2024-02-26 13:27       ` Xiao Ni
2024-02-22  7:57 ` [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance() Yu Kuai
2024-02-26  8:55   ` Xiao Ni
2024-02-26  9:12     ` Yu Kuai
2024-02-26  9:24       ` Xiao Ni
2024-02-26  9:40         ` Yu Kuai
2024-02-26 13:20           ` Xiao Ni
2024-02-27  2:23   ` Xiao Ni
2024-02-27  2:38     ` Yu Kuai
2024-02-27  4:49       ` Xiao Ni
2024-02-27 14:26         ` Luse, Paul E [this message]
2024-02-22  7:58 ` [PATCH md-6.9 04/10] md/raid1-10: add a helper raid1_check_read_range() Yu Kuai
2024-02-26  9:15   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 05/10] md/raid1-10: factor out a new helper raid1_should_read_first() Yu Kuai
2024-02-26 13:46   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 06/10] md/raid1: factor out read_first_rdev() from read_balance() Yu Kuai
2024-02-26 14:16   ` Xiao Ni
2024-02-27  1:06     ` Yu Kuai
2024-02-27  1:23       ` Xiao Ni
2024-02-27  1:42         ` Xiao Ni
2024-02-27  1:43         ` Yu Kuai
2024-02-27  2:50           ` Xiao Ni
2024-02-27  2:50   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 07/10] md/raid1: factor out choose_slow_rdev() " Yu Kuai
2024-02-26 14:44   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 08/10] md/raid1: factor out choose_bb_rdev() " Yu Kuai
2024-02-27  1:50   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 09/10] md/raid1: factor out the code to manage sequential IO Yu Kuai
2024-02-27  2:04   ` Xiao Ni
2024-02-22  7:58 ` [PATCH md-6.9 10/10] md/raid1: factor out helpers to choose the best rdev from read_balance() Yu Kuai
2024-02-27  4:47   ` Xiao Ni
2024-02-22  8:40 ` [PATCH md-6.9 00/10] md/raid1: refactor read_balance() and some minor fix Paul Menzel
2024-02-22  9:08   ` Yu Kuai
2024-02-22 13:04     ` Luse, Paul E
2024-02-22 15:30       ` Paul Menzel
2024-02-27  0:27 ` Song Liu
