From: Heming Zhao <heming.zhao@suse.com>
To: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: ocfs2-devel@lists.linux.dev, ailiop@suse.com
Subject: Re: [PATCH v2] ocfs2: improve write IO performance when fragmentation is high
Date: Mon, 18 Mar 2024 15:47:25 +0800 [thread overview]
Message-ID: <0fc73316-6639-44aa-9cd2-b561a4caa93d@suse.com> (raw)
In-Reply-To: <75a0425a-1e31-42bf-9095-e4a3e00b0e01@linux.alibaba.com>
On 3/18/24 13:49, Joseph Qi wrote:
>
>
> On 3/18/24 1:29 PM, Heming Zhao wrote:
>> On 3/18/24 09:39, Joseph Qi wrote:
>>>
>>>
>>> On 3/16/24 11:44 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling huge
>>>> number small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy with looking up available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contigous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> Let's image below path.
>>>>
>>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>>
>>>> 2. when user delete a large file and trigger
>>>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>>> unconditionally.
>>>>
>>>> 3. a write IOs thread run and trigger the worst performance path
>>>>
>>>> ```
>>>> ocfs2_reserve_clusters_with_limit
>>>> ocfs2_reserve_local_alloc_bits
>>>> ocfs2_local_alloc_slide_window //[1]
>>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>>> + ocfs2_local_alloc_new_window //[3]
>>>> ocfs2_recalc_la_window
>>>> ```
>>>>
>>>> [1]:
>>>> will be called when la window bits used up.
>>>>
>>>> [2]:
>>>> under la state is ENABLED, and this func only check global_bitmap
>>>> free bits, it will succeed in general.
>>>>
>>>> [3]:
>>>> will use the default la window size to search clusters then fail.
>>>> ocfs2_recalc_la_window attempts other la window sizes.
>>>> the timing complexity is O(n^4), resulting in a significant time
>>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>>> in write I/Os (e.g., user space 'dd').
>>>>
>>>> i.e.
>>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>>> la window default size: 106MB.
>>>> The partition is fragmentation by creating & deleting huge mount of
>>>> small files.
>>>>
>>>> before this patch, the timing of [3] should be
>>>> (the number got from real world):
>>>> - la window size change order (size: MB):
>>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>>> runs in worst case.
>>>> - group chain number: 242
>>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>>> - each chain has 49 block group
>>>> ocfs2_search_chain calls while-loop 49 times
>>>> - each bg has 32256 blocks
>>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>>> (32256/64) (this is not worst value) for timing calucation.
>>>>
>>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>>
>>>> In the worst case, user space writes 1MB data will trigger 42M scanning
>>>> times.
>>>>
>>>> under this patch, the timing is '7*242*49 = 83006', reduced by three
>>>> orders of magnitude.
>>>>
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>> v2:
>>>> 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32()
>>>> for setting bg->bg_contig_free_bits.
>>>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>>> 'unsigned int'.
>>>> 3. restore ocfs2_block_group_set_bits() input parameters style, change
>>>> parameter 'failure_path' to 'fastpath'.
>>>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>>>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>>> 'suballoc.h' to 'suballoc.c'.
>>>> 6. modify some code indent error.
>>>>
>>>> ---
>>>> fs/ocfs2/move_extents.c | 2 +-
>>>> fs/ocfs2/ocfs2_fs.h | 4 +-
>>>> fs/ocfs2/resize.c | 7 +++
>>>> fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++----
>>>> fs/ocfs2/suballoc.h | 6 ++-
>>>> 5 files changed, 106 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>>>> --- a/fs/ocfs2/move_extents.c
>>>> +++ b/fs/ocfs2/move_extents.c
>>>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>>> }
>>>> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>>> - goal_bit, len);
>>>> + goal_bit, len, 0, 0);
>>>> if (ret) {
>>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>>> le16_to_cpu(gd->bg_chain));
>>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>>> --- a/fs/ocfs2/ocfs2_fs.h
>>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>>> __le16 bg_free_bits_count; /* Free bits count */
>>>> __le16 bg_chain; /* What chain I am in. */
>>>> /*10*/ __le32 bg_generation;
>>>> - __le32 bg_reserved1;
>>>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>>>
>>> I've found some parts you are using le32, while others using le16.
>>> It seems that le16 is enough here. So why not:
>>> __le16 bg_contig_free_bits;
>>> __le16 bg_reserved1;
>>>
>>> Joseph
>>
>> IIUC, From the comment before ocfs2_la_default_mb(), the max cluster
>> size is 1MB. And from 'struct ocfs2_group_desc', the offset of
>> bg_bitmap is 0x40. Therefore, the max contiguous bit of gd:
>> (1024*1024 - 0x40)*8 = 8388096 (23-bit length, bigger than __le16)
>>
> bg_bits and bg_free_bits_count are already using __le16 now.
You are right.
I rechecked the code. The gd block size should be blocksize not
clustersize. and the value of blocksize should not be bigger than
PAGE_SIZE, so the max value of bg_free_bits_count is
(4096-0x40)*8=32256, which is smaller than __le16.
I will modify all the code related to __le32.
-Heming
prev parent reply other threads:[~2024-03-18 7:47 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-16 3:44 [PATCH v2] ocfs2: improve write IO performance when fragmentation is high Heming Zhao
2024-03-18 1:39 ` Joseph Qi
2024-03-18 5:29 ` Heming Zhao
2024-03-18 5:49 ` Joseph Qi
2024-03-18 7:47 ` Heming Zhao [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0fc73316-6639-44aa-9cd2-b561a4caa93d@suse.com \
--to=heming.zhao@suse.com \
--cc=ailiop@suse.com \
--cc=joseph.qi@linux.alibaba.com \
--cc=ocfs2-devel@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).