From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Josef Bacik <josef@toxicpanda.com>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 1/2] btrfs: do not clear page dirty inside extent_write_locked_range()
Date: Sun, 12 May 2024 18:26:18 +0930
Message-ID: <6882e4dc-bab7-4cde-8021-a571b8a2fb2b@gmx.com>
In-Reply-To: <20240510142525.GA90506@perftesting>



On 2024/5/10 23:55, Josef Bacik wrote:
> On Wed, Mar 06, 2024 at 09:05:33AM +1030, Qu Wenruo wrote:
>> [BUG]
>> For the subpage + zoned case, btrfs can easily hang with the following
>> workload, even with the previous subpage delalloc rework:
>>
>>   # mkfs.btrfs -s 4k -f $dev
>>   # mount $dev $mnt
>>   # xfs_io -f -c "pwrite 32k 128k" $mnt/foobar
>>   # umount $mnt
>>
>> The system would hang at unmount due to unfinished ordered extents.
>>
>> The above $dev is a tcmu-runner emulated zoned HDD, which has a max zone
>> append size of 64K, and the system has a 64K page size.
>>
>
> This whole explanation confuses me, because in the normal case we clear the
> whole page dirty before we enter into __extent_writepage_io(), so what we're
> doing here is no different than the regular case, so I don't get how this is a
> problem?

Although I have already sent out the updated version (in fact just my
github branch version), I'd still like to explain it a little more, as
there are several more hidden (existing) problems here.

The example I'll go with is the same page layout as before, with 64K
page size and 4K sector size:

     0      32K      64K     96K    128K    160K    192K
     I      |////////I//////////////I///////|       I

Where "I" is the page boundary.

Firstly, the biggest confusion comes from where we clear the PageDirty
flag. We do it in several locations (modeled in the sketch after this
list):

- folio_clear_dirty_for_io() inside extent_write_cache_pages()
   I'd say this is the worst location to call it.
   It is neither subpage aware nor done at the right timing.

   The only good news is that it's not causing real problems, so I'll
   clean it up later.

- btrfs_folio_clear_dirty() inside __extent_writepage_io()
   This one is subpage aware, and the timing is correct, as it happens
   after we have set the (sub)page range writeback.

- clear_page_dirty_for_io() inside extent_write_locked_range()
   This is the offending one we're discussing here.
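
To make the flag split concrete, here is a heavily simplified userspace
model of the two kinds of clearing. This is purely an illustration:
struct page_model, subpage_clear_dirty() and page_clear_dirty() are
made-up names, not the real kernel helpers (which live around
fs/btrfs/subpage.c and mm):

/* Simplified model: 64K page size, 4K sector size -> 16 dirty bits
 * per page, mirroring the btrfs subpage bitmap idea. */
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE   (64 * 1024)
#define MODEL_SECTORSIZE  (4 * 1024)

struct page_model {
        bool page_dirty;        /* the folio-level dirty flag */
        uint16_t subpage_dirty; /* one bit per 4K sector */
};

/* Subpage-aware clear, in the spirit of btrfs_folio_clear_dirty():
 * clear the bits of [start, start + len) and drop the folio flag
 * only once *all* sectors are clean. */
static void subpage_clear_dirty(struct page_model *p,
                                unsigned int start, unsigned int len)
{
        unsigned int first = start / MODEL_SECTORSIZE;
        unsigned int nr = len / MODEL_SECTORSIZE;

        p->subpage_dirty &= ~(((1u << nr) - 1) << first);
        if (!p->subpage_dirty)
                p->page_dirty = false;
}

/* Folio-level clear, in the spirit of clear_page_dirty_for_io():
 * ignores the subpage bits entirely. */
static void page_clear_dirty(struct page_model *p)
{
        p->page_dirty = false;
}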


For the "normal" subpage case, we go the following sequence:

1. folio_clear_dirty_for_io() on page [0, 64K)
    As said already, this is neither subpage aware nor done at the right
    timing.

2. writepage_delalloc() to allocate an OE for range [32K, 160K)

3. __extent_writepage_io() for page [0, 64K)
    This searches the subpage dirty bits (not yet cleared), submits the
    range [32K, 64K), and correctly clears the subpage dirty flags.

4. folio_clear_dirty_for_io() on page [64K, 128K)

5. __extent_writepage_io() for page [64K, 128K)
    This time it submits the range [64K, 128K), and correctly clears the
    subpage dirty flags.

6. folio_clear_dirty_for_io() on page [128K, 192K)

7. __extent_writepage_io() for page [128K, 192K)
    This time it submits the range [128K, 160K), and correctly clears the
    subpage dirty flags.

The point here is that __extent_writepage_io() is only called for the
exact page passed in from extent_write_cache_pages(); it is never called
on other pages.
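
In model form (reusing the sketch above; write_cache_pages_model() and
writepage_io_model() are again just made-up stand-ins, not the real
functions):

/* In the normal path, every page __extent_writepage_io() touches came
 * straight from this loop, so the folio flag and the subpage bits are
 * always cleared in lock step. */
static void writepage_io_model(struct page_model *p,
                               unsigned int start, unsigned int len)
{
        /* Submit the still-dirty subpage ranges inside
         * [start, start + len), then clear their subpage bits. */
        subpage_clear_dirty(p, start, len);
}

static void write_cache_pages_model(struct page_model *pages, int nr_pages)
{
        for (int i = 0; i < nr_pages; i++) {
                if (!pages[i].page_dirty)       /* folio_test_dirty() */
                        continue;
                page_clear_dirty(&pages[i]);    /* folio_clear_dirty_for_io() */
                /* __extent_writepage_io() covers the full page range. */
                writepage_io_model(&pages[i], 0, MODEL_PAGE_SIZE);
        }
}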

But for the subpage + zoned case, things go a little differently:

1. folio_clear_dirty_for_io() on page [0, 64K)
    This is still the same.

2. writepage_delalloc() to allocate an OE for range [32K, 96K)
    As our zoned append size limit is 64K.

3. __extent_writepage_io() for page [0, 64K)
    So far still no difference; it submits the range [32K, 64K).

4. clear_page_dirty_for_io() for page [64K, 128K) inside
    extent_write_locked_range().
    This is different now: we are handling a page other than the locked
    page. This will cause a problem later.

5. __extent_writepage_io() for page [64K, 128K)
    Now we only submit the range [64K, 96K), as our range is limited by
    extent_write_locked_range().

    For this page, the page dirty flag is cleared, but the subpage dirty
    flags are only cleared for [64K, 96K); the remaining half still has
    its subpage dirty flags set. I.e., a desync between the page flag
    and the subpage flags.

6. folio_test_dirty() for page [64K, 128K) from
    extent_write_cache_pages()
    Since that page no longer has the dirty flag (even though it still
    has subpage dirty flags), extent_write_cache_pages() simply skips
    the page, so the range [96K, 128K) is never submitted (see the
    sketch below).
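
The desync can be demonstrated with the same model (again purely
illustrative, not the kernel code):

#include <stdio.h>

int main(void)
{
        /* Page [64K, 128K), fully dirty: folio flag plus all 16 bits. */
        struct page_model second = {
                .page_dirty = true,
                .subpage_dirty = 0xffff,
        };

        /* Step 4: extent_write_locked_range() clears the folio flag
         * for the whole page. */
        page_clear_dirty(&second);
        /* Step 5: only [64K, 96K), i.e. the first 32K of this page,
         * is submitted and has its subpage bits cleared. */
        subpage_clear_dirty(&second, 0, 32 * 1024);
        /* Step 6: extent_write_cache_pages() checks only the folio
         * flag, so the page is skipped while [96K, 128K) is still
         * subpage dirty (prints 0xff00). */
        if (!second.page_dirty)
                printf("page skipped, subpage dirty bits left: 0x%x\n",
                       second.subpage_dirty);
        return 0;
}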

That's why this patch is required to fix the problem.

I hope my newer patch already explains it well enough, as it explicitly
demonstrates step 6) using a lot of trace_printk() calls.

And if you find the explanation here better, I'm happy to merge it into
the v4 patchset.

And thanks to your questions, I'm now more confident about doing further
cleanups of the page dirty handling.

Thanks,
Qu

>
>> [CAUSE]
>> There is a bug involved in extent_write_locked_range() (well, I'm
>> already surprised by how much subpage-incompatible code is inside that
>> function):
>>
>> - If @pages_dirty is true, we will clear the page dirty flag for the
>>    whole page
>>
>>    This means, for the above case, since the max zone append size is 64K,
>>    we got an ordered extent sized 64K, resulting in the following writeback
>>    range:
>>
>>    0               64K                 128K            192K
>>    |       |///////|///////////////////|/////////|
>>            32K               96K
>>             \       OE      /
>>
>>    |///| = subpage dirty range
>>
>>    And when we go into the __extent_writepage_io() call to submit [32K,
>>    64K), extent_write_locked_range() would find it's the locked page, and
>>    not clear its page dirty flag, so the submission goes without any
>>    problem.
>>
>>    But when we go into the [64K, 96K) range for the second half of the
>>    ordered extent, extent_write_locked_range() would clear the page dirty
>>    flag for the whole page range [64K, 128K), resulting in the following
>>    layout:
>>
>>    0               64K                 128K            192K
>>    |       |///////|         |         |/////////|
>>            32K               96K
>>             \       OE      /
>
> Huh?  We come into extent_write_locked_range(), the first page is the locked
> page so we don't clear dirty, because it's already been cleared dirty by the
> original start from extent_write_cache_pages() right?  So we would have
>
> Actual page
> [0, 64k) !PageDirty()
> [64k, 128k) PageDirty()
>
> Subpage
> [0, 64k) not dirty
> [64k, 160k) dirty
>
>>
>>    Then inside __extent_writepage_io(), since the page is no longer
>>    dirty, we skip the submission, causing only half of the ordered extent
>>    to be finished, thus hanging the whole system.
>
> Huh? We _always_ call __extent_writepage_io() with the actual page not dirty, so
> how are we skipping the submission?  We only clear the subpage range for the
> part we actually submitted.
>
> This needs a better explanation, because I'm super confused about how the page
> state matters here when the main path does exactly this.  Thanks,
>
> Josef
>
