* [PATCH] Btrfs: fix race when cleaning unused block groups
@ 2014-11-05 19:56 Filipe Manana
2014-11-05 20:07 ` Josef Bacik
2014-11-05 20:33 ` Josef Bacik
0 siblings, 2 replies; 5+ messages in thread
From: Filipe Manana @ 2014-11-05 19:56 UTC (permalink / raw
To: linux-btrfs; +Cc: Filipe Manana
We have a race while deleting unused block groups that causes extents written
by past generations/transactions to be rewritten by the current transaction
before that transaction is committed. The steps that lead to this issue:
1) At transaction N one or more block groups became unused and we added them
to the list fs_info->unused_bgs;
2) While still at transaction N we write btree extents to block group X and the
transaction is committed;
3) The cleaner kthread is awaken and calls btrfs_delete_unused_bgs() to go through
the list fs_info->unused_bgs and remove unused block groups;
4) Transaction N + 1 starts;
5) At transaction N + 1, block group X becomes unused and is added to the list
fs_info->unused_bgs - this implies delayed refs were run, so we had the
following function calls: btrfs_run_delayed_refs() -> __btrfs_free_extent()
-> update_block_group(). The update_block_group() function grabs the lock
fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs and
releases that lock;
6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block group X
added by transaction N + 1 because it's doing a loop that finishes only when
the list fs_info->unused_bgs is empty and locks and unlocks the spinlock
fs_info->unused_bgs_lock on each iteration. So it deletes the block group
and its respective chunk is released. Even if it didn't do the lock/unlock
per iteration, it could still see block group X in the list, because the
cleaner kthread might call btrfs_delete_unused_bgs() multiple times (for
example if there are several snapshots to delete);
7) A new block group X' is created for data, and it's associated to the same chunk
that block group X was associated to;
8) Extents from block group X' are allocated for file data and for example an fsync
makes the file data be effectively written to disk;
9) A crash/reboot happens before transaction N + 1 is committed;
10) On the next mount, we will read extents from block group/chunk X but they no
longer have valid btree nodes/leafs - they have instead file data, and therefore
all sorts of errors will happen.
So fix this by ensuring the cleaner kthread can never delete a block group that
became unused in the current transaction, that is, only delete block groups that
were added to the unused_bgs list by past transactions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/disk-io.c | 1 +
fs/btrfs/extent-tree.c | 5 +++--
fs/btrfs/transaction.c | 5 +++++
4 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36f82ba..a5e471a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1726,6 +1726,7 @@ struct btrfs_fs_info {
spinlock_t unused_bgs_lock;
struct list_head unused_bgs;
+ struct list_head unused_bgs_to_clean;
/* For btrfs to record security options */
struct security_mnt_opts security_opts;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2409718..702bbdf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2243,6 +2243,7 @@ int open_ctree(struct super_block *sb,
INIT_LIST_HEAD(&fs_info->space_info);
INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
INIT_LIST_HEAD(&fs_info->unused_bgs);
+ INIT_LIST_HEAD(&fs_info->unused_bgs_to_clean);
btrfs_mapping_init(&fs_info->mapping_tree);
btrfs_init_block_rsv(&fs_info->global_block_rsv,
BTRFS_BLOCK_RSV_GLOBAL);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 744b580..bc1c0b7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8858,6 +8858,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
up_write(&info->commit_root_sem);
spin_lock(&info->unused_bgs_lock);
+ list_splice_init(&info->unused_bgs_to_clean, &info->unused_bgs);
while (!list_empty(&info->unused_bgs)) {
block_group = list_first_entry(&info->unused_bgs,
struct btrfs_block_group_cache,
@@ -9466,10 +9467,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
return;
spin_lock(&fs_info->unused_bgs_lock);
- while (!list_empty(&fs_info->unused_bgs)) {
+ while (!list_empty(&fs_info->unused_bgs_to_clean)) {
u64 start, end;
- block_group = list_first_entry(&fs_info->unused_bgs,
+ block_group = list_first_entry(&fs_info->unused_bgs_to_clean,
struct btrfs_block_group_cache,
bg_list);
space_info = block_group->space_info;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 396ae8b..86d7cf5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1937,6 +1937,11 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
btrfs_prepare_extent_commit(trans, root);
+ spin_lock(&root->fs_info->unused_bgs_lock);
+ list_splice_init(&root->fs_info->unused_bgs,
+ &root->fs_info->unused_bgs_to_clean);
+ spin_unlock(&root->fs_info->unused_bgs_lock);
+
cur_trans = root->fs_info->running_transaction;
btrfs_set_root_node(&root->fs_info->tree_root->root_item,
--
1.9.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] Btrfs: fix race when cleaning unused block groups
2014-11-05 19:56 [PATCH] Btrfs: fix race when cleaning unused block groups Filipe Manana
@ 2014-11-05 20:07 ` Josef Bacik
2014-11-05 20:33 ` Josef Bacik
1 sibling, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2014-11-05 20:07 UTC (permalink / raw
To: Filipe Manana, linux-btrfs
On 11/05/2014 02:56 PM, Filipe Manana wrote:
> We have a race while deleting unused block groups that causes extents written
> by past generations/transactions to be rewritten by the current transaction
> before that transaction is committed. The steps that lead to this issue:
>
> 1) At transaction N one or more block groups became unused and we added them
> to the list fs_info->unused_bgs;
>
> 2) While still at transaction N we write btree extents to block group X and the
> transaction is committed;
>
> 3) The cleaner kthread is awaken and calls btrfs_delete_unused_bgs() to go through
> the list fs_info->unused_bgs and remove unused block groups;
>
> 4) Transaction N + 1 starts;
>
> 5) At transaction N + 1, block group X becomes unused and is added to the list
> fs_info->unused_bgs - this implies delayed refs were run, so we had the
> following function calls: btrfs_run_delayed_refs() -> __btrfs_free_extent()
> -> update_block_group(). The update_block_group() function grabs the lock
> fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs and
> releases that lock;
>
> 6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block group X
> added by transaction N + 1 because it's doing a loop that finishes only when
> the list fs_info->unused_bgs is empty and locks and unlocks the spinlock
> fs_info->unused_bgs_lock on each iteration. So it deletes the block group
> and its respective chunk is released. Even if it didn't do the lock/unlock
> per iteration, it could still see block group X in the list, because the
> cleaner kthread might call btrfs_delete_unused_bgs() multiple times (for
> example if there are several snapshots to delete);
>
> 7) A new block group X' is created for data, and it's associated to the same chunk
> that block group X was associated to;
>
> 8) Extents from block group X' are allocated for file data and for example an fsync
> makes the file data be effectively written to disk;
>
> 9) A crash/reboot happens before transaction N + 1 is committed;
>
> 10) On the next mount, we will read extents from block group/chunk X but they no
> longer have valid btree nodes/leafs - they have instead file data, and therefore
> all sorts of errors will happen.
>
> So fix this by ensuring the cleaner kthread can never delete a block group that
> became unused in the current transaction, that is, only delete block groups that
> were added to the unused_bgs list by past transactions.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/disk-io.c | 1 +
> fs/btrfs/extent-tree.c | 5 +++--
> fs/btrfs/transaction.c | 5 +++++
> 4 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 36f82ba..a5e471a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1726,6 +1726,7 @@ struct btrfs_fs_info {
>
> spinlock_t unused_bgs_lock;
> struct list_head unused_bgs;
> + struct list_head unused_bgs_to_clean;
>
> /* For btrfs to record security options */
> struct security_mnt_opts security_opts;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2409718..702bbdf 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2243,6 +2243,7 @@ int open_ctree(struct super_block *sb,
> INIT_LIST_HEAD(&fs_info->space_info);
> INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
> INIT_LIST_HEAD(&fs_info->unused_bgs);
> + INIT_LIST_HEAD(&fs_info->unused_bgs_to_clean);
> btrfs_mapping_init(&fs_info->mapping_tree);
> btrfs_init_block_rsv(&fs_info->global_block_rsv,
> BTRFS_BLOCK_RSV_GLOBAL);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 744b580..bc1c0b7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8858,6 +8858,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
> up_write(&info->commit_root_sem);
>
> spin_lock(&info->unused_bgs_lock);
> + list_splice_init(&info->unused_bgs_to_clean, &info->unused_bgs);
> while (!list_empty(&info->unused_bgs)) {
> block_group = list_first_entry(&info->unused_bgs,
> struct btrfs_block_group_cache,
> @@ -9466,10 +9467,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
> return;
>
> spin_lock(&fs_info->unused_bgs_lock);
> - while (!list_empty(&fs_info->unused_bgs)) {
> + while (!list_empty(&fs_info->unused_bgs_to_clean)) {
> u64 start, end;
>
> - block_group = list_first_entry(&fs_info->unused_bgs,
> + block_group = list_first_entry(&fs_info->unused_bgs_to_clean,
> struct btrfs_block_group_cache,
> bg_list);
> space_info = block_group->space_info;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 396ae8b..86d7cf5 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1937,6 +1937,11 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
>
> btrfs_prepare_extent_commit(trans, root);
>
> + spin_lock(&root->fs_info->unused_bgs_lock);
> + list_splice_init(&root->fs_info->unused_bgs,
> + &root->fs_info->unused_bgs_to_clean);
> + spin_unlock(&root->fs_info->unused_bgs_lock);
> +
> cur_trans = root->fs_info->running_transaction;
>
> btrfs_set_root_node(&root->fs_info->tree_root->root_item,
>
Eesh that's horrible, thanks for catching it.
Josef
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Btrfs: fix race when cleaning unused block groups
2014-11-05 19:56 [PATCH] Btrfs: fix race when cleaning unused block groups Filipe Manana
2014-11-05 20:07 ` Josef Bacik
@ 2014-11-05 20:33 ` Josef Bacik
2014-11-05 21:03 ` Filipe David Manana
1 sibling, 1 reply; 5+ messages in thread
From: Josef Bacik @ 2014-11-05 20:33 UTC (permalink / raw
To: Filipe Manana, linux-btrfs
On 11/05/2014 02:56 PM, Filipe Manana wrote:
> We have a race while deleting unused block groups that causes extents written
> by past generations/transactions to be rewritten by the current transaction
> before that transaction is committed. The steps that lead to this issue:
>
> 1) At transaction N one or more block groups became unused and we added them
> to the list fs_info->unused_bgs;
>
> 2) While still at transaction N we write btree extents to block group X and the
> transaction is committed;
>
> 3) The cleaner kthread is awaken and calls btrfs_delete_unused_bgs() to go through
> the list fs_info->unused_bgs and remove unused block groups;
>
> 4) Transaction N + 1 starts;
>
> 5) At transaction N + 1, block group X becomes unused and is added to the list
> fs_info->unused_bgs - this implies delayed refs were run, so we had the
> following function calls: btrfs_run_delayed_refs() -> __btrfs_free_extent()
> -> update_block_group(). The update_block_group() function grabs the lock
> fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs and
> releases that lock;
>
> 6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block group X
> added by transaction N + 1 because it's doing a loop that finishes only when
> the list fs_info->unused_bgs is empty and locks and unlocks the spinlock
> fs_info->unused_bgs_lock on each iteration. So it deletes the block group
> and its respective chunk is released. Even if it didn't do the lock/unlock
> per iteration, it could still see block group X in the list, because the
> cleaner kthread might call btrfs_delete_unused_bgs() multiple times (for
> example if there are several snapshots to delete);
>
> 7) A new block group X' is created for data, and it's associated to the same chunk
> that block group X was associated to;
>
Actually this can't happen, we search the commit root for a free dev
extent, so if block group X` get's mapped to a dev extent that was
deleted in the same transaction as it was free'd in then that is a
different problem.
> 8) Extents from block group X' are allocated for file data and for example an fsync
> makes the file data be effectively written to disk;
>
Also if a new block group is allocated fsync() will trigger a full
transaction commit. So thinking about this more I'm not entirely sure
there is actually a problem here. Did you observe this issue? Are you
sure it's because of this change and not just exacerbated by it? Thanks,
Josef
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Btrfs: fix race when cleaning unused block groups
2014-11-05 20:33 ` Josef Bacik
@ 2014-11-05 21:03 ` Filipe David Manana
2014-11-05 21:13 ` Josef Bacik
0 siblings, 1 reply; 5+ messages in thread
From: Filipe David Manana @ 2014-11-05 21:03 UTC (permalink / raw
To: Josef Bacik; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org
On Wed, Nov 5, 2014 at 8:33 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 11/05/2014 02:56 PM, Filipe Manana wrote:
>>
>> We have a race while deleting unused block groups that causes extents
>> written
>> by past generations/transactions to be rewritten by the current
>> transaction
>> before that transaction is committed. The steps that lead to this issue:
>>
>> 1) At transaction N one or more block groups became unused and we added
>> them
>> to the list fs_info->unused_bgs;
>>
>> 2) While still at transaction N we write btree extents to block group X
>> and the
>> transaction is committed;
>>
>> 3) The cleaner kthread is awaken and calls btrfs_delete_unused_bgs() to go
>> through
>> the list fs_info->unused_bgs and remove unused block groups;
>>
>> 4) Transaction N + 1 starts;
>>
>> 5) At transaction N + 1, block group X becomes unused and is added to the
>> list
>> fs_info->unused_bgs - this implies delayed refs were run, so we had
>> the
>> following function calls: btrfs_run_delayed_refs() ->
>> __btrfs_free_extent()
>> -> update_block_group(). The update_block_group() function grabs the
>> lock
>> fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs
>> and
>> releases that lock;
>>
>> 6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block
>> group X
>> added by transaction N + 1 because it's doing a loop that finishes
>> only when
>> the list fs_info->unused_bgs is empty and locks and unlocks the
>> spinlock
>> fs_info->unused_bgs_lock on each iteration. So it deletes the block
>> group
>> and its respective chunk is released. Even if it didn't do the
>> lock/unlock
>> per iteration, it could still see block group X in the list, because
>> the
>> cleaner kthread might call btrfs_delete_unused_bgs() multiple times
>> (for
>> example if there are several snapshots to delete);
>>
>> 7) A new block group X' is created for data, and it's associated to the
>> same chunk
>> that block group X was associated to;
>>
>
> Actually this can't happen, we search the commit root for a free dev extent,
> so if block group X` get's mapped to a dev extent that was deleted in the
> same transaction as it was free'd in then that is a different problem.
Hum, yep I missed the detail that when looking for free device extents
we use the commit root.
In that case I don't think anymore that there's a problem.
Thanks for looking at it.
>
>> 8) Extents from block group X' are allocated for file data and for example
>> an fsync
>> makes the file data be effectively written to disk;
>>
>
> Also if a new block group is allocated fsync() will trigger a full
> transaction commit. So thinking about this more I'm not entirely sure there
> is actually a problem here. Did you observe this issue? Are you sure it's
> because of this change and not just exacerbated by it? Thanks,
>
> Josef
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Btrfs: fix race when cleaning unused block groups
2014-11-05 21:03 ` Filipe David Manana
@ 2014-11-05 21:13 ` Josef Bacik
0 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2014-11-05 21:13 UTC (permalink / raw
To: fdmanana; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org
On 11/05/2014 04:03 PM, Filipe David Manana wrote:
> On Wed, Nov 5, 2014 at 8:33 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 11/05/2014 02:56 PM, Filipe Manana wrote:
>>>
>>> We have a race while deleting unused block groups that causes extents
>>> written
>>> by past generations/transactions to be rewritten by the current
>>> transaction
>>> before that transaction is committed. The steps that lead to this issue:
>>>
>>> 1) At transaction N one or more block groups became unused and we added
>>> them
>>> to the list fs_info->unused_bgs;
>>>
>>> 2) While still at transaction N we write btree extents to block group X
>>> and the
>>> transaction is committed;
>>>
>>> 3) The cleaner kthread is awaken and calls btrfs_delete_unused_bgs() to go
>>> through
>>> the list fs_info->unused_bgs and remove unused block groups;
>>>
>>> 4) Transaction N + 1 starts;
>>>
>>> 5) At transaction N + 1, block group X becomes unused and is added to the
>>> list
>>> fs_info->unused_bgs - this implies delayed refs were run, so we had
>>> the
>>> following function calls: btrfs_run_delayed_refs() ->
>>> __btrfs_free_extent()
>>> -> update_block_group(). The update_block_group() function grabs the
>>> lock
>>> fs_info->unused_bgs_lock, adds block group X to fs_info->unused_bgs
>>> and
>>> releases that lock;
>>>
>>> 6) The cleaner kthread, while at btrfs_delete_unused_bgs(), sees block
>>> group X
>>> added by transaction N + 1 because it's doing a loop that finishes
>>> only when
>>> the list fs_info->unused_bgs is empty and locks and unlocks the
>>> spinlock
>>> fs_info->unused_bgs_lock on each iteration. So it deletes the block
>>> group
>>> and its respective chunk is released. Even if it didn't do the
>>> lock/unlock
>>> per iteration, it could still see block group X in the list, because
>>> the
>>> cleaner kthread might call btrfs_delete_unused_bgs() multiple times
>>> (for
>>> example if there are several snapshots to delete);
>>>
>>> 7) A new block group X' is created for data, and it's associated to the
>>> same chunk
>>> that block group X was associated to;
>>>
>>
>> Actually this can't happen, we search the commit root for a free dev extent,
>> so if block group X` get's mapped to a dev extent that was deleted in the
>> same transaction as it was free'd in then that is a different problem.
>
> Hum, yep I missed the detail that when looking for free device extents
> we use the commit root.
> In that case I don't think anymore that there's a problem.
>
> Thanks for looking at it.
>
I still think its a good idea just so we don't have a lot of churn, but
change up the commit message to make it sound less like the original
stuff would eat children. Thanks,
Josef
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2014-11-05 21:16 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-05 19:56 [PATCH] Btrfs: fix race when cleaning unused block groups Filipe Manana
2014-11-05 20:07 ` Josef Bacik
2014-11-05 20:33 ` Josef Bacik
2014-11-05 21:03 ` Filipe David Manana
2014-11-05 21:13 ` Josef Bacik
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.