OCFS2-Devel Archive mirror
 help / color / mirror / Atom feed
From: Heming Zhao <heming.zhao@suse.com>
To: joseph.qi@linux.alibaba.com
Cc: ocfs2-devel@lists.linux.dev, ailiop@suse.com
Subject: Re: [PATCH] ocfs2: improve write IO performance when fragmentation is high
Date: Tue, 19 Mar 2024 15:00:12 +0800	[thread overview]
Message-ID: <2dbad363-ec3c-491a-b53d-bf064e3aaf05@suse.com> (raw)
In-Reply-To: <20240319065738.6646-1-heming.zhao@suse.com>

Please ignore this mail, forgot to add 'v3' in patch.

On 3/19/24 14:57, Heming Zhao wrote:
> The group_search function ocfs2_cluster_group_search() should
> bypass groups with insufficient space to avoid unnecessary
> searches.
> 
> This patch is particularly useful when ocfs2 is handling huge
> number small files, and volume fragmentation is very high.
> In this case, ocfs2 is busy with looking up available la window
> from //global_bitmap.
> 
> This patch introduces a new member in the Group Description (gd)
> struct called 'bg_contig_free_bits', representing the max
> contigous free bits in this gd. When ocfs2 allocates a new
> la window from //global_bitmap, 'bg_contig_free_bits' helps
> expedite the search process.
> 
> Let's image below path.
> 
> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
> 
> 2. when user delete a large file and trigger
>     ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>     unconditionally.
> 
> 3. a write IOs thread run and trigger the worst performance path
> 
> ```
> ocfs2_reserve_clusters_with_limit
>   ocfs2_reserve_local_alloc_bits
>    ocfs2_local_alloc_slide_window //[1]
>     + ocfs2_local_alloc_reserve_for_window //[2]
>     + ocfs2_local_alloc_new_window //[3]
>        ocfs2_recalc_la_window
> ```
> 
> [1]:
> will be called when la window bits used up.
> 
> [2]:
> under la state is ENABLED, and this func only check global_bitmap
> free bits, it will succeed in general.
> 
> [3]:
> will use the default la window size to search clusters then fail.
> ocfs2_recalc_la_window attempts other la window sizes.
> the timing complexity is O(n^4), resulting in a significant time
> cost for scanning global bitmap. This leads to a dramatic slowdown
> in write I/Os (e.g., user space 'dd').
> 
> i.e.
> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
> la window default size: 106MB.
> The partition is fragmentation by creating & deleting huge mount of
> small files.
> 
> before this patch, the timing of [3] should be
> (the number got from real world):
> - la window size change order (size: MB):
>    106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>    only 0.8MB succeed, 0.8MB also triggers la window to disable.
>    ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>    runs in worst case.
> - group chain number: 242
>    ocfs2_claim_suballoc_bits calls for-loop 242 times
> - each chain has 49 block group
>    ocfs2_search_chain calls while-loop 49 times
> - each bg has 32256 blocks
>    ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>    for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>    (32256/64) (this is not worst value) for timing calucation.
> 
> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
> 
> In the worst case, user space writes 1MB data will trigger 42M scanning
> times.
> 
> under this patch, the timing is '7*242*49 = 83006', reduced by three
> orders of magnitude.
> 
> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
> v3:
> 1. Fix wrong var length for 'struct ocfs2_group_desc'
>     .bg_contig_free_bits, change from '__le32' to '__le16'.
> 2. change all related code to use '__le16' instead of '__le32'.
>     Please note: change ocfs2_find_max_contig_free_bits() input
>     parameters and return type to 'u16'.
> v2:
> 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32()
>     for setting bg->bg_contig_free_bits.
> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>     'unsigned int'.
> 3. restore ocfs2_block_group_set_bits() input parameters style, change
>     parameter 'failure_path' to 'fastpath'.
> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>     'suballoc.h' to 'suballoc.c'.
> 6. modify some code indent error.
> 
> ---
>   fs/ocfs2/move_extents.c |  2 +-
>   fs/ocfs2/ocfs2_fs.h     |  3 +-
>   fs/ocfs2/resize.c       |  7 +++
>   fs/ocfs2/suballoc.c     | 99 +++++++++++++++++++++++++++++++++++++----
>   fs/ocfs2/suballoc.h     |  6 ++-
>   5 files changed, 106 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
> index 1f9ed117e78b..f9d6a4f9ca92 100644
> --- a/fs/ocfs2/move_extents.c
> +++ b/fs/ocfs2/move_extents.c
> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>   	}
>   
>   	ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
> -					 goal_bit, len);
> +					 goal_bit, len, 0, 0);
>   	if (ret) {
>   		ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>   					       le16_to_cpu(gd->bg_chain));
> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
> index 7aebdbf5cc0a..c93689b568fe 100644
> --- a/fs/ocfs2/ocfs2_fs.h
> +++ b/fs/ocfs2/ocfs2_fs.h
> @@ -883,7 +883,8 @@ struct ocfs2_group_desc
>   	__le16	bg_free_bits_count;     /* Free bits count */
>   	__le16   bg_chain;               /* What chain I am in. */
>   /*10*/	__le32   bg_generation;
> -	__le32	bg_reserved1;
> +	__le16   bg_contig_free_bits;   /* max contig free bits length */
> +	__le16   bg_reserved1;
>   	__le64   bg_next_group;          /* Next group in my list, in
>   					   blocks */
>   /*20*/	__le64   bg_parent_dinode;       /* dinode which owns me, in
> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
> index d65d43c61857..8fbf203c2208 100644
> --- a/fs/ocfs2/resize.c
> +++ b/fs/ocfs2/resize.c
> @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>   	u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>   	u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>   	u16 old_bg_clusters;
> +	u16 contig_bits, old_bg_contig_free_bits;
>   
>   	trace_ocfs2_update_last_group_and_inode(new_clusters,
>   						first_new_cluster);
> @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>   		le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>   	}
>   
> +	contig_bits = ocfs2_find_max_contig_free_bits(group->bg_bitmap,
> +					group->bg_bits, 0);
> +	old_bg_contig_free_bits = group->bg_contig_free_bits;
> +	group->bg_contig_free_bits = cpu_to_le16(contig_bits);
> +
>   	ocfs2_journal_dirty(handle, group_bh);
>   
>   	/* update the inode accordingly. */
> @@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>   		le16_add_cpu(&group->bg_free_bits_count, backups);
>   		le16_add_cpu(&group->bg_bits, -1 * num_bits);
>   		le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
> +		group->bg_contig_free_bits = old_bg_contig_free_bits;
>   	}
>   out:
>   	if (ret)
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..7d7b063fad93 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -50,6 +50,10 @@ struct ocfs2_suballoc_result {
>   	u64		sr_blkno;	/* The first allocated block */
>   	unsigned int	sr_bit_offset;	/* The bit in the bg */
>   	unsigned int	sr_bits;	/* How many bits we claimed */
> +	unsigned int	sr_max_contig_bits; /* The length for contiguous
> +					     * free bits, only available
> +					     * for cluster group
> +					     */
>   };
>   
>   static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
> @@ -1272,6 +1276,26 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>   	return ret;
>   }
>   
> +u16 ocfs2_find_max_contig_free_bits(void *bitmap,
> +			 u16 total_bits, u16 start)
> +{
> +	u16 offset, free_bits;
> +	u16 contig_bits = 0;
> +
> +	while (start < total_bits) {
> +		offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
> +		if (offset == total_bits)
> +			break;
> +
> +		start = ocfs2_find_next_bit(bitmap, total_bits, offset);
> +		free_bits = start - offset;
> +		if (contig_bits < free_bits)
> +			contig_bits = free_bits;
> +	}
> +
> +	return contig_bits;
> +}
> +
>   static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>   					     struct buffer_head *bg_bh,
>   					     unsigned int bits_wanted,
> @@ -1280,6 +1304,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>   {
>   	void *bitmap;
>   	u16 best_offset, best_size;
> +	u16 prev_best_size = 0;
>   	int offset, start, found, status = 0;
>   	struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data;
>   
> @@ -1308,6 +1333,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>   			/* got a zero after some ones */
>   			found = 1;
>   			start = offset + 1;
> +			prev_best_size = best_size;
>   		}
>   		if (found > best_size) {
>   			best_size = found;
> @@ -1320,6 +1346,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
>   		}
>   	}
>   
> +	/* best_size will be allocated, we save prev_best_size */
> +	res->sr_max_contig_bits = prev_best_size;
>   	if (best_size) {
>   		res->sr_bit_offset = best_offset;
>   		res->sr_bits = best_size;
> @@ -1337,11 +1365,15 @@ int ocfs2_block_group_set_bits(handle_t *handle,
>   					     struct ocfs2_group_desc *bg,
>   					     struct buffer_head *group_bh,
>   					     unsigned int bit_off,
> -					     unsigned int num_bits)
> +					     unsigned int num_bits,
> +					     unsigned int max_contig_bits,
> +					     int fastpath)
>   {
>   	int status;
>   	void *bitmap = bg->bg_bitmap;
>   	int journal_type = OCFS2_JOURNAL_ACCESS_WRITE;
> +	unsigned int start = bit_off + num_bits;
> +	u16 contig_bits;
>   
>   	/* All callers get the descriptor via
>   	 * ocfs2_read_group_descriptor().  Any corruption is a code bug. */
> @@ -1373,6 +1405,28 @@ int ocfs2_block_group_set_bits(handle_t *handle,
>   	while(num_bits--)
>   		ocfs2_set_bit(bit_off++, bitmap);
>   
> +	/*
> +	 * this is optimize path, caller set old contig value
> +	 * in max_contig_bits to bypass finding action.
> +	 */
> +	if (fastpath) {
> +		bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
> +	} else if (ocfs2_is_cluster_bitmap(alloc_inode)) {
> +		/*
> +		 * Usually, the block group bitmap allocates only 1 bit
> +		 * at a time, while the cluster group allocates n bits
> +		 * each time. Therefore, we only save the contig bits for
> +		 * the cluster group.
> +		 */
> +		contig_bits = ocfs2_find_max_contig_free_bits(bitmap,
> +				    le16_to_cpu(bg->bg_bits), start);
> +		if (contig_bits > max_contig_bits)
> +			max_contig_bits = contig_bits;
> +		bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
> +	} else {
> +		bg->bg_contig_free_bits = 0;
> +	}
> +
>   	ocfs2_journal_dirty(handle, group_bh);
>   
>   bail:
> @@ -1486,7 +1540,12 @@ static int ocfs2_cluster_group_search(struct inode *inode,
>   
>   	BUG_ON(!ocfs2_is_cluster_bitmap(inode));
>   
> -	if (gd->bg_free_bits_count) {
> +	if (le16_to_cpu(gd->bg_contig_free_bits) &&
> +	    le16_to_cpu(gd->bg_contig_free_bits) < bits_wanted)
> +		return -ENOSPC;
> +
> +	/* ->bg_contig_free_bits may un-initialized, so compare again */
> +	if (le16_to_cpu(gd->bg_free_bits_count) >= bits_wanted) {
>   		max_bits = le16_to_cpu(gd->bg_bits);
>   
>   		/* Tail groups in cluster bitmaps which aren't cpg
> @@ -1555,7 +1614,7 @@ static int ocfs2_block_group_search(struct inode *inode,
>   	BUG_ON(min_bits != 1);
>   	BUG_ON(ocfs2_is_cluster_bitmap(inode));
>   
> -	if (bg->bg_free_bits_count) {
> +	if (le16_to_cpu(bg->bg_free_bits_count) >= bits_wanted) {
>   		ret = ocfs2_block_group_find_clear_bits(OCFS2_SB(inode->i_sb),
>   							group_bh, bits_wanted,
>   							le16_to_cpu(bg->bg_bits),
> @@ -1715,7 +1774,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
>   	}
>   
>   	ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
> -					 res->sr_bit_offset, res->sr_bits);
> +					 res->sr_bit_offset, res->sr_bits,
> +					 res->sr_max_contig_bits, 0);
>   	if (ret < 0) {
>   		ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
>   					       res->sr_bits,
> @@ -1849,7 +1909,9 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
>   					    bg,
>   					    group_bh,
>   					    res->sr_bit_offset,
> -					    res->sr_bits);
> +					    res->sr_bits,
> +					    res->sr_max_contig_bits,
> +					    0);
>   	if (status < 0) {
>   		ocfs2_rollback_alloc_dinode_counts(alloc_inode,
>   					ac->ac_bh, res->sr_bits, chain);
> @@ -2163,7 +2225,9 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
>   					 bg,
>   					 bg_bh,
>   					 res->sr_bit_offset,
> -					 res->sr_bits);
> +					 res->sr_bits,
> +					 res->sr_max_contig_bits,
> +					 0);
>   	if (ret < 0) {
>   		ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
>   					       ac->ac_bh, res->sr_bits, chain);
> @@ -2382,11 +2446,13 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
>   					struct buffer_head *group_bh,
>   					unsigned int bit_off,
>   					unsigned int num_bits,
> +					unsigned int max_contig_bits,
>   					void (*undo_fn)(unsigned int bit,
>   							unsigned long *bmap))
>   {
>   	int status;
>   	unsigned int tmp;
> +	u16 contig_bits;
>   	struct ocfs2_group_desc *undo_bg = NULL;
>   	struct journal_head *jh;
>   
> @@ -2433,6 +2499,20 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
>   				   num_bits);
>   	}
>   
> +	/*
> +	 * TODO: even 'num_bits == 1' (the worst case, release 1 cluster),
> +	 * we still need to rescan whole bitmap.
> +	 */
> +	if (ocfs2_is_cluster_bitmap(alloc_inode)) {
> +		contig_bits = ocfs2_find_max_contig_free_bits(bg->bg_bitmap,
> +				    le16_to_cpu(bg->bg_bits), 0);
> +		if (contig_bits > max_contig_bits)
> +			max_contig_bits = contig_bits;
> +		bg->bg_contig_free_bits = cpu_to_le16(max_contig_bits);
> +	} else {
> +		bg->bg_contig_free_bits = 0;
> +	}
> +
>   	if (undo_fn)
>   		spin_unlock(&jh->b_state_lock);
>   
> @@ -2459,6 +2539,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
>   	struct ocfs2_chain_list *cl = &fe->id2.i_chain;
>   	struct buffer_head *group_bh = NULL;
>   	struct ocfs2_group_desc *group;
> +	u16 old_bg_contig_free_bits = 0;
>   
>   	/* The alloc_bh comes from ocfs2_free_dinode() or
>   	 * ocfs2_free_clusters().  The callers have all locked the
> @@ -2483,9 +2564,11 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
>   
>   	BUG_ON((count + start_bit) > le16_to_cpu(group->bg_bits));
>   
> +	if (ocfs2_is_cluster_bitmap(alloc_inode))
> +		old_bg_contig_free_bits = group->bg_contig_free_bits;
>   	status = ocfs2_block_group_clear_bits(handle, alloc_inode,
>   					      group, group_bh,
> -					      start_bit, count, undo_fn);
> +					      start_bit, count, 0, undo_fn);
>   	if (status < 0) {
>   		mlog_errno(status);
>   		goto bail;
> @@ -2496,7 +2579,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
>   	if (status < 0) {
>   		mlog_errno(status);
>   		ocfs2_block_group_set_bits(handle, alloc_inode, group, group_bh,
> -				start_bit, count);
> +				start_bit, count, old_bg_contig_free_bits, 1);
>   		goto bail;
>   	}
>   
> diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
> index 9c74eace3adc..b481b834857d 100644
> --- a/fs/ocfs2/suballoc.h
> +++ b/fs/ocfs2/suballoc.h
> @@ -79,12 +79,16 @@ void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
>   			 struct buffer_head *di_bh,
>   			 u32 num_bits,
>   			 u16 chain);
> +u16 ocfs2_find_max_contig_free_bits(void *bitmap,
> +			 u16 total_bits, u16 start);
>   int ocfs2_block_group_set_bits(handle_t *handle,
>   			 struct inode *alloc_inode,
>   			 struct ocfs2_group_desc *bg,
>   			 struct buffer_head *group_bh,
>   			 unsigned int bit_off,
> -			 unsigned int num_bits);
> +			 unsigned int num_bits,
> +			 unsigned int max_contig_bits,
> +			 int fastpath);
>   
>   int ocfs2_claim_metadata(handle_t *handle,
>   			 struct ocfs2_alloc_context *ac,


  reply	other threads:[~2024-03-19  7:00 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-19  6:57 [PATCH] ocfs2: improve write IO performance when fragmentation is high Heming Zhao
2024-03-19  7:00 ` Heming Zhao [this message]
  -- strict thread matches above, loose matches on Subject: below --
2024-03-08  1:52 Heming Zhao
2024-03-11  9:18 ` Joseph Qi
2024-03-11 11:04 ` Joseph Qi
2024-03-12  0:54   ` Heming Zhao
2024-03-13  1:34   ` Heming Zhao
2024-03-13  9:09     ` Joseph Qi
2024-03-13 12:32       ` Heming Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2dbad363-ec3c-491a-b53d-bf064e3aaf05@suse.com \
    --to=heming.zhao@suse.com \
    --cc=ailiop@suse.com \
    --cc=joseph.qi@linux.alibaba.com \
    --cc=ocfs2-devel@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).