Linux-BTRFS Archive mirror
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 0/3] btrfs: more explaination on extent_map members
Date: Fri, 12 Apr 2024 14:16:07 +0100	[thread overview]
Message-ID: <CAL3q7H7tPq9ZYN6e9ow43-sOatHEAv2D3bj0Co+_at2LT2JJCA@mail.gmail.com> (raw)
In-Reply-To: <2a688a2e-a956-4bec-a231-372ec7134003@suse.com>

On Fri, Apr 12, 2024 at 11:45 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2024/4/12 20:01, Filipe Manana 写道:
> > On Thu, Apr 11, 2024 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [REPO]
> >> https://github.com/adam900710/linux.git/ em_cleanup
> >>
> >> The first 3 patches only. As later patches are the huge rename part.
> >>
> >> [CHANGELOG]
> >> v4:
> >> - Add extra comments for the extent map merging
> >>    Since extent maps can be merged, the members are only matching
> >>    on-disk file extent items before merging.
> >>    This also means users of extent_map should not rely on it for
> >>    a reliable file extent item member.
> >>    (That's also why we use ordered_extent not extent_maps, to update
> >>     file extent items)
> >
> > So this isn't true.
> > We use information from the ordered extent to create the file extent
> > item, when finishing the ordered extent, because the ordered extent is
> > immediately accessible.
> > We could also use the extent map, we would just have to do a tree search for it.
> > The extent map created when starting delalloc or at the beginning of a
> > DIO write is pinned and in the list of modified extents, so it can't
> > be merged and it represents exactly one file extent item - the one
> > going to be inserted the inode's subvolume tree when finishing the
> > ordered extent.
>
> If you're doing COW writes, extent map is fine, as it's always new
> extents and they are all pinned thus won't be merged.
>
> For pure NOCOW, it's also fine because there would be no extent map
> inserted, nor updating file extent items.
>
> But for PREALLOC writes (NOCOW into preallocated writes), it's not the
> case as the target extent maps can be merged already.

Yes, but prealloc is an exception, as we don't insert a new file
extent item, just update an existing one or split and change the type
from prealloc to regular, etc.
My comment was obviously for COW, which is the only case that needs to
insert a new file extent item.

>
> Thus that's why for NOCOW we go with a subvolume tree search to grab the
> correct original disk_bytenr/disk_num_bytes/offset to go.
>
> If you use extent map instead of search subvolume tree, the following
> layout can screw up everything:
>
> Filepos 0:
> disk_bytenr X, disk_num_bytes 64K
> num_bytes 64K, offset 0, ram_bytes 64K
>
> Filepos 64K
> disk_bytenr X+64K, disk_num_bytes 64K
> num_bytes 64K, offset 0, ram_bytes 64K.
>
> The extent map would be (after merge):
>
> filepos 0, len 128K block_start X, block len 128K, orig_start 0.
>
> Even with my new disk_bytenr it would be no difference, the disk_bytenr
> would still be X, and disk_num_bytes would be merged 128K.
>
> So the merging behavior is affecting PREALLOC writes, and we're already
> avoiding the only pitfall.
>
> Thanks,
> Qu
>
> >
> >>
> >> v3:
> >> - Rebased to latest for-next branch
> >> - Further comments polishment
> >> - Coding style update to follow the guideline
> >>
> >> v2:
> >> - Add Filipe's cleanup on mod_start/mod_len
> >>    These two members are no longer utilized, saving me quite some time on
> >>    digging into their usage.
> >>
> >> - Update the comments of the extent_map structure
> >>    To make them more readable and less confusing.
> >>
> >> - Further cleanup for inline extent_map reading
> >>
> >> - A new patch to do extra sanity checks for create_io_em()
> >>    Firstly pure NOCOW writes should not call create_io_em(), secondly
> >>    with the new knowledge of extent_map, it's easier to do extra sanity
> >>    checks for the already pretty long parameter list.
> >>
> >> Btrfs uses extent_map to represent a in-memory file extent.
> >>
> >> There are severam members that are 1:1 mappe in on-disk file extent
> >> items and extent maps:
> >>
> >> - extent_map::start     ==      key.offset
> >> - extent_map::len       ==      file_extent_num_bytes
> >> - extent_map::ram_bytes ==      file_extent_ram_bytes
> >>
> >> But that's all, the remaining are pretty different:
> >>
> >> - Use block_start to indicate holes/inline extents
> >>    Meanwhile btrfs on-disk file extent items go with a dedicated type for
> >>    inline extents, and disk_bytenr 0 for holes.
> >>
> >> - Weird block_start/orig_block_len/orig_start
> >>    In theory we can directly go with the same file_extent_disk_bytenr,
> >>    file_extent_disk_num_bytes and file_extent_offset to calculate the
> >>    remaining members (block_start/orig_start/orig_block_len/block_len).
> >>
> >>    But for whatever reason, we didn't go that path and have a hell of
> >>    weird and inconsistent calculation for them.
> >>
> >> Qu Wenruo (3):
> >>    btrfs: add extra comments on extent_map members
> >>    btrfs: simplify the inline extent map creation
> >>    btrfs: add extra sanity checks for create_io_em()
> >>
> >>   fs/btrfs/extent_map.h | 58 ++++++++++++++++++++++++++++++++++++++++++-
> >>   fs/btrfs/file-item.c  | 20 +++++++--------
> >>   fs/btrfs/inode.c      | 40 ++++++++++++++++++++++++++++-
> >>   3 files changed, 106 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 2.44.0
> >>
> >>

      reply	other threads:[~2024-04-12 13:16 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-11 22:47 [PATCH v4 0/3] btrfs: more explaination on extent_map members Qu Wenruo
2024-04-11 22:47 ` [PATCH v4 1/3] btrfs: add extra comments " Qu Wenruo
2024-04-12 10:37   ` Filipe Manana
2024-04-11 22:47 ` [PATCH v4 2/3] btrfs: simplify the inline extent map creation Qu Wenruo
2024-04-11 22:47 ` [PATCH v4 3/3] btrfs: add extra sanity checks for create_io_em() Qu Wenruo
2024-04-12 10:31 ` [PATCH v4 0/3] btrfs: more explaination on extent_map members Filipe Manana
2024-04-12 10:45   ` Qu Wenruo
2024-04-12 13:16     ` Filipe Manana [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H7tPq9ZYN6e9ow43-sOatHEAv2D3bj0Co+_at2LT2JJCA@mail.gmail.com \
    --to=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).