Re: [PATCH v2] ocfs2: improve write IO performance when fragmentation is high

OCFS2-Devel Archive mirror
 help / color / mirror / Atom feed

From: Heming Zhao <heming.zhao@suse.com>
To: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev, ailiop@suse.com
Subject: Re: [PATCH v2] ocfs2: improve write IO performance when fragmentation is high
Date: Mon, 18 Mar 2024 15:47:25 +0800	[thread overview]
Message-ID: <0fc73316-6639-44aa-9cd2-b561a4caa93d@suse.com> (raw)
In-Reply-To: <75a0425a-1e31-42bf-9095-e4a3e00b0e01@linux.alibaba.com>

On 3/18/24 13:49, Joseph Qi wrote:
> 
> 
> On 3/18/24 1:29 PM, Heming Zhao wrote:
>> On 3/18/24 09:39, Joseph Qi wrote:
>>>
>>>
>>> On 3/16/24 11:44 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling huge
>>>> number small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy with looking up available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contigous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> Let's image below path.
>>>>
>>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>>
>>>> 2. when user delete a large file and trigger
>>>>      ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>>>      unconditionally.
>>>>
>>>> 3. a write IOs thread run and trigger the worst performance path
>>>>
>>>> ```
>>>> ocfs2_reserve_clusters_with_limit
>>>>    ocfs2_reserve_local_alloc_bits
>>>>     ocfs2_local_alloc_slide_window //[1]
>>>>      + ocfs2_local_alloc_reserve_for_window //[2]
>>>>      + ocfs2_local_alloc_new_window //[3]
>>>>         ocfs2_recalc_la_window
>>>> ```
>>>>
>>>> [1]:
>>>> will be called when la window bits used up.
>>>>
>>>> [2]:
>>>> under la state is ENABLED, and this func only check global_bitmap
>>>> free bits, it will succeed in general.
>>>>
>>>> [3]:
>>>> will use the default la window size to search clusters then fail.
>>>> ocfs2_recalc_la_window attempts other la window sizes.
>>>> the timing complexity is O(n^4), resulting in a significant time
>>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>>> in write I/Os (e.g., user space 'dd').
>>>>
>>>> i.e.
>>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>>> la window default size: 106MB.
>>>> The partition is fragmentation by creating & deleting huge mount of
>>>> small files.
>>>>
>>>> before this patch, the timing of [3] should be
>>>> (the number got from real world):
>>>> - la window size change order (size: MB):
>>>>     106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>>     only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>>>     ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>>>     runs in worst case.
>>>> - group chain number: 242
>>>>     ocfs2_claim_suballoc_bits calls for-loop 242 times
>>>> - each chain has 49 block group
>>>>     ocfs2_search_chain calls while-loop 49 times
>>>> - each bg has 32256 blocks
>>>>     ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>>     for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>>>     (32256/64) (this is not worst value) for timing calucation.
>>>>
>>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>>
>>>> In the worst case, user space writes 1MB data will trigger 42M scanning
>>>> times.
>>>>
>>>> under this patch, the timing is '7*242*49 = 83006', reduced by three
>>>> orders of magnitude.
>>>>
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>> v2:
>>>> 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32()
>>>>      for setting bg->bg_contig_free_bits.
>>>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>>>      'unsigned int'.
>>>> 3. restore ocfs2_block_group_set_bits() input parameters style, change
>>>>      parameter 'failure_path' to 'fastpath'.
>>>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>>>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>>>      'suballoc.h' to 'suballoc.c'.
>>>> 6. modify some code indent error.
>>>>
>>>> ---
>>>>    fs/ocfs2/move_extents.c |  2 +-
>>>>    fs/ocfs2/ocfs2_fs.h     |  4 +-
>>>>    fs/ocfs2/resize.c       |  7 +++
>>>>    fs/ocfs2/suballoc.c     | 99 +++++++++++++++++++++++++++++++++++++----
>>>>    fs/ocfs2/suballoc.h     |  6 ++-
>>>>    5 files changed, 106 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>>>> --- a/fs/ocfs2/move_extents.c
>>>> +++ b/fs/ocfs2/move_extents.c
>>>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>>>        }
>>>>          ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>>> -                     goal_bit, len);
>>>> +                     goal_bit, len, 0, 0);
>>>>        if (ret) {
>>>>            ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>>>                               le16_to_cpu(gd->bg_chain));
>>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>>> --- a/fs/ocfs2/ocfs2_fs.h
>>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>>>        __le16    bg_free_bits_count;     /* Free bits count */
>>>>        __le16   bg_chain;               /* What chain I am in. */
>>>>    /*10*/    __le32   bg_generation;
>>>> -    __le32    bg_reserved1;
>>>> +    __le32   bg_contig_free_bits;  /* max contig free bits length */
>>>
>>> I've found some parts you are using le32, while others using le16.
>>> It seems that le16 is enough here. So why not:
>>> __le16 bg_contig_free_bits;
>>> __le16 bg_reserved1;
>>>
>>> Joseph
>>
>> IIUC, From the comment before ocfs2_la_default_mb(), the max cluster
>> size is 1MB. And from 'struct ocfs2_group_desc', the offset of
>> bg_bitmap is 0x40. Therefore, the max contiguous bit of gd:
>> (1024*1024 - 0x40)*8 = 8388096 (23-bit length, bigger than __le16)
>>
> bg_bits and bg_free_bits_count are already using __le16 now.

You are right.
I rechecked the code. The gd block size should be blocksize not
clustersize. and the value of blocksize should not be bigger than
PAGE_SIZE, so the max value of bg_free_bits_count is
(4096-0x40)*8=32256, which is smaller than __le16.

I will modify all the code related to __le32.

-Heming

     prev parent reply	other threads:[~2024-03-18  7:47 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-16  3:44 [PATCH v2] ocfs2: improve write IO performance when fragmentation is high Heming Zhao
2024-03-18  1:39 ` Joseph Qi
2024-03-18  5:29   ` Heming Zhao
2024-03-18  5:49     ` Joseph Qi
2024-03-18  7:47       ` Heming Zhao [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0fc73316-6639-44aa-9cd2-b561a4caa93d@suse.com \
    --to=heming.zhao@suse.com \
    --cc=ailiop@suse.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=ocfs2-devel@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).