Netdev Archive mirror
* [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
@ 2023-04-06 11:48 Liang Chen
  2023-04-06 15:25 ` Alexander H Duyck
  0 siblings, 1 reply; 6+ messages in thread
From: Liang Chen @ 2023-04-06 11:48 UTC (permalink / raw
  To: kuba, alexander.duyck, ilias.apalodimas, hawk
  Cc: davem, edumazet, pabeni, netdev, liangchen.linux

Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
recycling") allowed coalescing to proceed with non page pool page and
page pool page when @from is cloned, i.e.

to->pp_recycle    --> false
from->pp_recycle  --> true
skb_cloned(from)  --> true

However, it actually requires skb_cloned(@from) to hold true until
coalescing finishes in this situation. If the other cloned SKB is
released while the merge is in progress, from_shinfo->nr_frags is set
to 0 towards the end of the function, so the increment of the frag
page _refcount is unexpectedly skipped, leaving the reference counts
inconsistent. Later, when SKB(@to) is released, it frees the page
directly even though the page pool page is still in use, leading to
use-after-free or double-free errors.

So special handling is needed wherever the ref count could otherwise
be lost.
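To make the lost-reference scenario concrete, here is a small stand-alone model (plain userspace C, not kernel code; `struct model_page` and `coalesce_head()` are invented names) of the choice the patched head-stealing branch makes:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented miniature model, NOT kernel code: a page with a refcount and
 * the decision the patched head-stealing branch of skb_try_coalesce()
 * makes between stealing the head page and taking a reference. */
struct model_page { int refcount; };

static void get_page(struct model_page *p)
{
	p->refcount++;
}

/* Returns the value *fragstolen would get: stealing is only safe when
 * both skbs agree on pp_recycle; with mixed ownership we must hold our
 * own page reference so the page-pool page cannot be freed early. */
static bool coalesce_head(struct model_page *page, bool to_pp_recycle,
			  bool from_pp_recycle)
{
	if (to_pp_recycle != from_pp_recycle) {
		get_page(page);
		return false;
	}
	return true;
}
```

Without the `get_page()` in the mixed-ownership case, the refcount would stay at 1 while two SKBs reference the page, and the second release would hit the double-free shown in the trace below.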

The double-free error message below prompted us to investigate:
BUG: Bad page state in process swapper/1  pfn:0e0d1
page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
index:0x2 pfn:0xe0d1
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount

CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
Call Trace:
 <IRQ>
dump_stack_lvl+0x32/0x50
bad_page+0x69/0xf0
free_pcp_prepare+0x260/0x2f0
free_unref_page+0x20/0x1c0
skb_release_data+0x10b/0x1a0
napi_consume_skb+0x56/0x150
net_rx_action+0xf0/0x350
? __napi_schedule+0x79/0x90
__do_softirq+0xc8/0x2b1
__irq_exit_rcu+0xb9/0xf0
common_interrupt+0x82/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0xb/0x20

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
---
Changes from v1:
- deal with the ref count problem instead of bailing out, to give more opportunities to coalesce skbs.
---
 net/core/skbuff.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 050a875d09c5..77da8ce74a1e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 		skb_fill_page_desc(to, to_shinfo->nr_frags,
 				   page, offset, skb_headlen(from));
-		*fragstolen = true;
+
+		/* When @from is pp_recycle and @to isn't, coalescing is
+		 * allowed to proceed if @from is cloned. However, if
+		 * execution reaches this point, @from has already
+		 * transitioned to non-cloned because the other cloned skb
+		 * was released concurrently elsewhere. In this case, we
+		 * must increment the page ref count instead of stealing
+		 * the page directly from the page pool.
+		 */
+		if (to->pp_recycle != from->pp_recycle)
+			get_page(page);
+		else
+			*fragstolen = true;
 	} else {
 		if (to_shinfo->nr_frags +
 		    from_shinfo->nr_frags > MAX_SKB_FRAGS)
@@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	       from_shinfo->nr_frags * sizeof(skb_frag_t));
 	to_shinfo->nr_frags += from_shinfo->nr_frags;
 
-	if (!skb_cloned(from))
+	/* Same situation as above, where head data is present. When @from is
+	 * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
+	 * is cloned. However, @from may have transitioned to non-cloned
+	 * concurrently by this point. If that happens, we need to make sure
+	 * the ref count is properly incremented.
+	 */
+	if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
 		from_shinfo->nr_frags = 0;
 
 	/* if the skb is not cloned this does nothing
-- 
2.18.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
  2023-04-06 11:48 [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs Liang Chen
@ 2023-04-06 15:25 ` Alexander H Duyck
  2023-04-07  8:02   ` Ilias Apalodimas
  2023-04-07  8:06   ` Yunsheng Lin
  0 siblings, 2 replies; 6+ messages in thread
From: Alexander H Duyck @ 2023-04-06 15:25 UTC (permalink / raw
  To: Liang Chen, kuba, ilias.apalodimas, hawk; +Cc: davem, edumazet, pabeni, netdev

On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
> Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
> recycling") allowed coalescing to proceed with non page pool page and
> page pool page when @from is cloned, i.e.
> 
> to->pp_recycle    --> false
> from->pp_recycle  --> true
> skb_cloned(from)  --> true
> 
> However, it actually requires skb_cloned(@from) to hold true until
> coalescing finishes in this situation. If the other cloned SKB is
> released while the merging is in process, from_shinfo->nr_frags will be
> set to 0 towards the end of the function, causing the increment of frag
> page _refcount to be unexpectedly skipped resulting in inconsistent
> reference counts. Later when SKB(@to) is released, it frees the page
> directly even though the page pool page is still in use, leading to
> use-after-free or double-free errors.
> 
> So it needs to be specially handled at where the ref count may get lost.
> 
> The double-free error message below prompted us to investigate:
> BUG: Bad page state in process swapper/1  pfn:0e0d1
> page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
> index:0x2 pfn:0xe0d1
> flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
> raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
> page dumped because: nonzero _refcount
> 
> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
> Call Trace:
>  <IRQ>
> dump_stack_lvl+0x32/0x50
> bad_page+0x69/0xf0
> free_pcp_prepare+0x260/0x2f0
> free_unref_page+0x20/0x1c0
> skb_release_data+0x10b/0x1a0
> napi_consume_skb+0x56/0x150
> net_rx_action+0xf0/0x350
> ? __napi_schedule+0x79/0x90
> __do_softirq+0xc8/0x2b1
> __irq_exit_rcu+0xb9/0xf0
> common_interrupt+0x82/0xa0
> </IRQ>
> <TASK>
> asm_common_interrupt+0x22/0x40
> RIP: 0010:default_idle+0xb/0x20
> 
> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> ---
> Changes from v1:
> - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
> ---
>  net/core/skbuff.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 050a875d09c5..77da8ce74a1e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  
>  		skb_fill_page_desc(to, to_shinfo->nr_frags,
>  				   page, offset, skb_headlen(from));
> -		*fragstolen = true;
> +
> +		/* When @from is pp_recycle and @to isn't, coalescing is
> +		 * allowed to proceed if @from is cloned. However if the
> +		 * execution reaches this point, @from is already transitioned
> +		 * into non-cloned because the other cloned skb is released
> +		 * somewhere else concurrently. In this case, we need to make
> +		 * sure the ref count is incremented, not directly stealing
> +		 * from page pool.
> +		 */
> +		if (to->pp_recycle != from->pp_recycle)
> +			get_page(page);
> +		else
> +			*fragstolen = true;
>  	} else {
>  		if (to_shinfo->nr_frags +
>  		    from_shinfo->nr_frags > MAX_SKB_FRAGS)
> @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	       from_shinfo->nr_frags * sizeof(skb_frag_t));
>  	to_shinfo->nr_frags += from_shinfo->nr_frags;
>  
> -	if (!skb_cloned(from))
> +	/* Same situation as above where head data presents. When @from is
> +	 * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
> +	 * is cloned. However @from can be transitioned into non-cloned
> +	 * concurrently by this point. If it does happen, we need to make sure
> +	 * the ref count is properly incremented.
> +	 */
> +	if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
>  		from_shinfo->nr_frags = 0;
>  
>  	/* if the skb is not cloned this does nothing

So looking this over I believe this should resolve the issue you
pointed out while maintaining current functionality.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

One follow-on that we may want to do with this would be to look at
consolidating the 3 spots where we are checking for our combination of
pp_recycle comparison and skb_cloned and maybe pass one boolean flag
indicating that we have to transfer everything by taking page
references.

Also I think we can actually increase the number of cases where we
support coalescing if we were to take apart the skb_head_is_locked call
and move the skb_cloned check from it into your recycle check in the
portion where we are stealing from the header.
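A rough userspace sketch (invented names, not an actual kernel API) of the consolidation suggested above: compute the "must take page references" condition once, then reuse it instead of repeating the pp_recycle/skb_cloned combination at each of the three sites.

```c
#include <stdbool.h>

/* Hypothetical helper: true when @to and @from disagree on page-pool
 * ownership, i.e. every transferred frag must take a page reference
 * and nothing may be stolen from the page pool. */
static bool must_take_page_refs(bool to_pp_recycle, bool from_pp_recycle)
{
	return to_pp_recycle != from_pp_recycle;
}

/* With that flag computed up front, the tail of skb_try_coalesce()
 * reduces to a single condition: @from only gives up its frags when no
 * references were taken and it is not cloned. */
static int remaining_from_frags(bool page_refs_taken, bool from_cloned,
				int from_nr_frags)
{
	if (!page_refs_taken && !from_cloned)
		return 0;		/* ownership moved wholesale to @to */
	return from_nr_frags;		/* @from keeps its frag accounting */
}
```

This is only a sketch of the control flow under the assumptions stated above; the real consolidation would thread the flag through the existing branches of skb_try_coalesce().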


* Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
  2023-04-06 15:25 ` Alexander H Duyck
@ 2023-04-07  8:02   ` Ilias Apalodimas
  2023-04-10  7:26     ` Liang Chen
  2023-04-07  8:06   ` Yunsheng Lin
  1 sibling, 1 reply; 6+ messages in thread
From: Ilias Apalodimas @ 2023-04-07  8:02 UTC (permalink / raw
  To: Alexander H Duyck; +Cc: Liang Chen, kuba, hawk, davem, edumazet, pabeni, netdev

Hi all,

On Thu, 6 Apr 2023 at 18:25, Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
> > Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
> > recycling") allowed coalescing to proceed with non page pool page and
> > page pool page when @from is cloned, i.e.
> >
> > to->pp_recycle    --> false
> > from->pp_recycle  --> true
> > skb_cloned(from)  --> true
> >
> > However, it actually requires skb_cloned(@from) to hold true until
> > coalescing finishes in this situation. If the other cloned SKB is
> > released while the merging is in process, from_shinfo->nr_frags will be
> > set to 0 towards the end of the function, causing the increment of frag
> > page _refcount to be unexpectedly skipped resulting in inconsistent
> > reference counts. Later when SKB(@to) is released, it frees the page
> > directly even though the page pool page is still in use, leading to
> > use-after-free or double-free errors.
> >
> > So it needs to be specially handled at where the ref count may get lost.
> >
> > The double-free error message below prompted us to investigate:
> > BUG: Bad page state in process swapper/1  pfn:0e0d1
> > page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
> > index:0x2 pfn:0xe0d1
> > flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> > raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
> > raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
> > page dumped because: nonzero _refcount
> >
> > CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
> > Call Trace:
> >  <IRQ>
> > dump_stack_lvl+0x32/0x50
> > bad_page+0x69/0xf0
> > free_pcp_prepare+0x260/0x2f0
> > free_unref_page+0x20/0x1c0
> > skb_release_data+0x10b/0x1a0
> > napi_consume_skb+0x56/0x150
> > net_rx_action+0xf0/0x350
> > ? __napi_schedule+0x79/0x90
> > __do_softirq+0xc8/0x2b1
> > __irq_exit_rcu+0xb9/0xf0
> > common_interrupt+0x82/0xa0
> > </IRQ>
> > <TASK>
> > asm_common_interrupt+0x22/0x40
> > RIP: 0010:default_idle+0xb/0x20
> >
> > Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> > ---
> > Changes from v1:
> > - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
> > ---
> >  net/core/skbuff.c | 22 ++++++++++++++++++++--
> >  1 file changed, 20 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 050a875d09c5..77da8ce74a1e 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >
> >               skb_fill_page_desc(to, to_shinfo->nr_frags,
> >                                  page, offset, skb_headlen(from));
> > -             *fragstolen = true;
> > +
> > +             /* When @from is pp_recycle and @to isn't, coalescing is
> > +              * allowed to proceed if @from is cloned. However if the
> > +              * execution reaches this point, @from is already transitioned
> > +              * into non-cloned because the other cloned skb is released
> > +              * somewhere else concurrently. In this case, we need to make
> > +              * sure the ref count is incremented, not directly stealing
> > +              * from page pool.
> > +              */
> > +             if (to->pp_recycle != from->pp_recycle)
> > +                     get_page(page);
> > +             else
> > +                     *fragstolen = true;
> >       } else {
> >               if (to_shinfo->nr_frags +
> >                   from_shinfo->nr_frags > MAX_SKB_FRAGS)
> > @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >              from_shinfo->nr_frags * sizeof(skb_frag_t));
> >       to_shinfo->nr_frags += from_shinfo->nr_frags;
> >
> > -     if (!skb_cloned(from))
> > +     /* Same situation as above where head data presents. When @from is
> > +      * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
> > +      * is cloned. However @from can be transitioned into non-cloned
> > +      * concurrently by this point. If it does happen, we need to make sure
> > +      * the ref count is properly incremented.
> > +      */
> > +     if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
> >               from_shinfo->nr_frags = 0;
> >
> >       /* if the skb is not cloned this does nothing
>
> So looking this over I believe this should resolve the issue you
> pointed out while maintaining current functionality.
>
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

Yes, this would work, but I am not sure we really want this here.
Is coalescing page-pool-owned and slab-owned SKBs frequent enough to
justify the additional overhead? If the answer is yes, the patch seems fine.

Thanks
/Ilias
>
> One follow-on that we may want to do with this would be to look at
> consolidating the 3 spots where we are checking for our combination of
> pp_recycle comparison and skb_cloned and maybe pass one boolean flag
> indicating that we have to transfer everything by taking page
> references.
>
> Also I think we can actually increase the number of cases where we
> support coalescing if we were to take apart the skb_head_is_locked call
> and move the skb_cloned check from it into your recycle check in the
> portion where we are stealing from the header.


* Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
  2023-04-06 15:25 ` Alexander H Duyck
  2023-04-07  8:02   ` Ilias Apalodimas
@ 2023-04-07  8:06   ` Yunsheng Lin
  2023-04-07 15:01     ` Alexander Duyck
  1 sibling, 1 reply; 6+ messages in thread
From: Yunsheng Lin @ 2023-04-07  8:06 UTC (permalink / raw
  To: Alexander H Duyck, Liang Chen, kuba, ilias.apalodimas, hawk
  Cc: davem, edumazet, pabeni, netdev

On 2023/4/6 23:25, Alexander H Duyck wrote:
> On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
>> Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
>> recycling") allowed coalescing to proceed with non page pool page and
>> page pool page when @from is cloned, i.e.
>>
>> to->pp_recycle    --> false
>> from->pp_recycle  --> true
>> skb_cloned(from)  --> true
>>
>> However, it actually requires skb_cloned(@from) to hold true until
>> coalescing finishes in this situation. If the other cloned SKB is
>> released while the merging is in process, from_shinfo->nr_frags will be
>> set to 0 towards the end of the function, causing the increment of frag
>> page _refcount to be unexpectedly skipped resulting in inconsistent
>> reference counts. Later when SKB(@to) is released, it frees the page
>> directly even though the page pool page is still in use, leading to
>> use-after-free or double-free errors.
>>
>> So it needs to be specially handled at where the ref count may get lost.
>>
>> The double-free error message below prompted us to investigate:
>> BUG: Bad page state in process swapper/1  pfn:0e0d1
>> page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
>> index:0x2 pfn:0xe0d1
>> flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>> raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
>> raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
>> page dumped because: nonzero _refcount
>>
>> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
>> Call Trace:
>>  <IRQ>
>> dump_stack_lvl+0x32/0x50
>> bad_page+0x69/0xf0
>> free_pcp_prepare+0x260/0x2f0
>> free_unref_page+0x20/0x1c0
>> skb_release_data+0x10b/0x1a0
>> napi_consume_skb+0x56/0x150
>> net_rx_action+0xf0/0x350
>> ? __napi_schedule+0x79/0x90
>> __do_softirq+0xc8/0x2b1
>> __irq_exit_rcu+0xb9/0xf0
>> common_interrupt+0x82/0xa0
>> </IRQ>
>> <TASK>
>> asm_common_interrupt+0x22/0x40
>> RIP: 0010:default_idle+0xb/0x20
>>
>> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
>> ---
>> Changes from v1:
>> - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
>> ---
>>  net/core/skbuff.c | 22 ++++++++++++++++++++--
>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 050a875d09c5..77da8ce74a1e 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>>  
>>  		skb_fill_page_desc(to, to_shinfo->nr_frags,
>>  				   page, offset, skb_headlen(from));
>> -		*fragstolen = true;
>> +
>> +		/* When @from is pp_recycle and @to isn't, coalescing is
>> +		 * allowed to proceed if @from is cloned. However if the
>> +		 * execution reaches this point, @from is already transitioned
>> +		 * into non-cloned because the other cloned skb is released
>> +		 * somewhere else concurrently. In this case, we need to make
>> +		 * sure the ref count is incremented, not directly stealing
>> +		 * from page pool.
>> +		 */
>> +		if (to->pp_recycle != from->pp_recycle)
>> +			get_page(page);
>> +		else
>> +			*fragstolen = true;
>>  	} else {
>>  		if (to_shinfo->nr_frags +
>>  		    from_shinfo->nr_frags > MAX_SKB_FRAGS)
>> @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>>  	       from_shinfo->nr_frags * sizeof(skb_frag_t));
>>  	to_shinfo->nr_frags += from_shinfo->nr_frags;
>>  
>> -	if (!skb_cloned(from))
>> +	/* Same situation as above where head data presents. When @from is
>> +	 * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
>> +	 * is cloned. However @from can be transitioned into non-cloned
>> +	 * concurrently by this point. If it does happen, we need to make sure
>> +	 * the ref count is properly incremented.
>> +	 */
>> +	if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
>>  		from_shinfo->nr_frags = 0;
>>  
>>  	/* if the skb is not cloned this does nothing
> 
> So looking this over I believe this should resolve the issue you
> pointed out while maintaining current functionality.
> 
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> 
> One follow-on that we may want to do with this would be to look at
> consolidating the 3 spots where we are checking for our combination of
> pp_recycle comparison and skb_cloned and maybe pass one boolean flag
> indicating that we have to transfer everything by taking page
> references.
> 
> Also I think we can actually increase the number of cases where we
> support coalescing if we were to take apart the skb_head_is_locked call
> and move the skb_cloned check from it into your recycle check in the
> portion where we are stealing from the header.
While at it: since we have already added checks to handle the case
below,
 to->pp_recycle    --> false
 from->pp_recycle  --> true
 skb_cloned(from)  --> false

does it make sense to relax the check at the beginning so that this
case is allowed to coalesce from the start?

Also, does moving to a per-page marker make sense if we want to
remove those additional checks?

> .
> 


* Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
  2023-04-07  8:06   ` Yunsheng Lin
@ 2023-04-07 15:01     ` Alexander Duyck
  0 siblings, 0 replies; 6+ messages in thread
From: Alexander Duyck @ 2023-04-07 15:01 UTC (permalink / raw
  To: Yunsheng Lin
  Cc: Liang Chen, kuba, ilias.apalodimas, hawk, davem, edumazet, pabeni,
	netdev

On Fri, Apr 7, 2023 at 1:06 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/4/6 23:25, Alexander H Duyck wrote:
> > On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
> >> Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
> >> recycling") allowed coalescing to proceed with non page pool page and
> >> page pool page when @from is cloned, i.e.
> >>
> >> to->pp_recycle    --> false
> >> from->pp_recycle  --> true
> >> skb_cloned(from)  --> true
> >>
> >> However, it actually requires skb_cloned(@from) to hold true until
> >> coalescing finishes in this situation. If the other cloned SKB is
> >> released while the merging is in process, from_shinfo->nr_frags will be
> >> set to 0 towards the end of the function, causing the increment of frag
> >> page _refcount to be unexpectedly skipped resulting in inconsistent
> >> reference counts. Later when SKB(@to) is released, it frees the page
> >> directly even though the page pool page is still in use, leading to
> >> use-after-free or double-free errors.
> >>
> >> So it needs to be specially handled at where the ref count may get lost.
> >>
> >> The double-free error message below prompted us to investigate:
> >> BUG: Bad page state in process swapper/1  pfn:0e0d1
> >> page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
> >> index:0x2 pfn:0xe0d1
> >> flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >> raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
> >> raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
> >> page dumped because: nonzero _refcount
> >>
> >> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
> >> Call Trace:
> >>  <IRQ>
> >> dump_stack_lvl+0x32/0x50
> >> bad_page+0x69/0xf0
> >> free_pcp_prepare+0x260/0x2f0
> >> free_unref_page+0x20/0x1c0
> >> skb_release_data+0x10b/0x1a0
> >> napi_consume_skb+0x56/0x150
> >> net_rx_action+0xf0/0x350
> >> ? __napi_schedule+0x79/0x90
> >> __do_softirq+0xc8/0x2b1
> >> __irq_exit_rcu+0xb9/0xf0
> >> common_interrupt+0x82/0xa0
> >> </IRQ>
> >> <TASK>
> >> asm_common_interrupt+0x22/0x40
> >> RIP: 0010:default_idle+0xb/0x20
> >>
> >> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> >> ---
> >> Changes from v1:
> >> - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
> >> ---
> >>  net/core/skbuff.c | 22 ++++++++++++++++++++--
> >>  1 file changed, 20 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 050a875d09c5..77da8ce74a1e 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >>
> >>              skb_fill_page_desc(to, to_shinfo->nr_frags,
> >>                                 page, offset, skb_headlen(from));
> >> -            *fragstolen = true;
> >> +
> >> +            /* When @from is pp_recycle and @to isn't, coalescing is
> >> +             * allowed to proceed if @from is cloned. However if the
> >> +             * execution reaches this point, @from is already transitioned
> >> +             * into non-cloned because the other cloned skb is released
> >> +             * somewhere else concurrently. In this case, we need to make
> >> +             * sure the ref count is incremented, not directly stealing
> >> +             * from page pool.
> >> +             */
> >> +            if (to->pp_recycle != from->pp_recycle)
> >> +                    get_page(page);
> >> +            else
> >> +                    *fragstolen = true;
> >>      } else {
> >>              if (to_shinfo->nr_frags +
> >>                  from_shinfo->nr_frags > MAX_SKB_FRAGS)
> >> @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >>             from_shinfo->nr_frags * sizeof(skb_frag_t));
> >>      to_shinfo->nr_frags += from_shinfo->nr_frags;
> >>
> >> -    if (!skb_cloned(from))
> >> +    /* Same situation as above where head data presents. When @from is
> >> +     * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
> >> +     * is cloned. However @from can be transitioned into non-cloned
> >> +     * concurrently by this point. If it does happen, we need to make sure
> >> +     * the ref count is properly incremented.
> >> +     */
> >> +    if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
> >>              from_shinfo->nr_frags = 0;
> >>
> >>      /* if the skb is not cloned this does nothing
> >
> > So looking this over I believe this should resolve the issue you
> > pointed out while maintaining current functionality.
> >
> > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
> >
> > One follow-on that we may want to do with this would be to look at
> > consolidating the 3 spots where we are checking for our combination of
> > pp_recycle comparison and skb_cloned and maybe pass one boolean flag
> > indicating that we have to transfer everything by taking page
> > references.
> >
> > Also I think we can actually increase the number of cases where we
> > support coalescing if we were to take apart the skb_head_is_locked call
> > and move the skb_cloned check from it into your recycle check in the
> > portion where we are stealing from the header.
> While at it, as we have already add additional checks to handle the below
> case:
>  to->pp_recycle    --> false
>  from->pp_recycle  --> true
>  skb_cloned(from)  --> false
>
> Does it make sense to relax the checking at the beginning to allow
> the above case to support coalescing from the beginning?
>
> Also, dose moving to a per-page marker make sense if we want to
> remove those additional checking?

Actually, based on the comments on v2, it sounds like we may want to
go the other way instead and just limit the cases where we combine
frames, then see if anyone complains of a performance drop as a
result. The general idea is that avoiding the extra checks throughout
the code should improve performance in the general case.


* Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs
  2023-04-07  8:02   ` Ilias Apalodimas
@ 2023-04-10  7:26     ` Liang Chen
  0 siblings, 0 replies; 6+ messages in thread
From: Liang Chen @ 2023-04-10  7:26 UTC (permalink / raw
  To: Ilias Apalodimas
  Cc: Alexander H Duyck, kuba, hawk, davem, edumazet, pabeni, netdev

On Fri, Apr 7, 2023 at 4:03 PM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi all,
>
> On Thu, 6 Apr 2023 at 18:25, Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
> > > Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
> > > recycling") allowed coalescing to proceed with non page pool page and
> > > page pool page when @from is cloned, i.e.
> > >
> > > to->pp_recycle    --> false
> > > from->pp_recycle  --> true
> > > skb_cloned(from)  --> true
> > >
> > > However, it actually requires skb_cloned(@from) to hold true until
> > > coalescing finishes in this situation. If the other cloned SKB is
> > > released while the merging is in process, from_shinfo->nr_frags will be
> > > set to 0 towards the end of the function, causing the increment of frag
> > > page _refcount to be unexpectedly skipped resulting in inconsistent
> > > reference counts. Later when SKB(@to) is released, it frees the page
> > > directly even though the page pool page is still in use, leading to
> > > use-after-free or double-free errors.
> > >
> > > So it needs to be specially handled at where the ref count may get lost.
> > >
> > > The double-free error message below prompted us to investigate:
> > > BUG: Bad page state in process swapper/1  pfn:0e0d1
> > > page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
> > > index:0x2 pfn:0xe0d1
> > > flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> > > raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
> > > raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
> > > page dumped because: nonzero _refcount
> > >
> > > CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
> > > Call Trace:
> > >  <IRQ>
> > > dump_stack_lvl+0x32/0x50
> > > bad_page+0x69/0xf0
> > > free_pcp_prepare+0x260/0x2f0
> > > free_unref_page+0x20/0x1c0
> > > skb_release_data+0x10b/0x1a0
> > > napi_consume_skb+0x56/0x150
> > > net_rx_action+0xf0/0x350
> > > ? __napi_schedule+0x79/0x90
> > > __do_softirq+0xc8/0x2b1
> > > __irq_exit_rcu+0xb9/0xf0
> > > common_interrupt+0x82/0xa0
> > > </IRQ>
> > > <TASK>
> > > asm_common_interrupt+0x22/0x40
> > > RIP: 0010:default_idle+0xb/0x20
> > >
> > > Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> > > ---
> > > Changes from v1:
> > > - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
> > > ---
> > >  net/core/skbuff.c | 22 ++++++++++++++++++++--
> > >  1 file changed, 20 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 050a875d09c5..77da8ce74a1e 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >
> > >               skb_fill_page_desc(to, to_shinfo->nr_frags,
> > >                                  page, offset, skb_headlen(from));
> > > -             *fragstolen = true;
> > > +
> > > +             /* When @from is pp_recycle and @to isn't, coalescing is
> > > +              * allowed to proceed if @from is cloned. However if the
> > > +              * execution reaches this point, @from is already transitioned
> > > +              * into non-cloned because the other cloned skb is released
> > > +              * somewhere else concurrently. In this case, we need to make
> > > +              * sure the ref count is incremented, not directly stealing
> > > +              * from page pool.
> > > +              */
> > > +             if (to->pp_recycle != from->pp_recycle)
> > > +                     get_page(page);
> > > +             else
> > > +                     *fragstolen = true;
> > >       } else {
> > >               if (to_shinfo->nr_frags +
> > >                   from_shinfo->nr_frags > MAX_SKB_FRAGS)
> > > @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >              from_shinfo->nr_frags * sizeof(skb_frag_t));
> > >       to_shinfo->nr_frags += from_shinfo->nr_frags;
> > >
> > > -     if (!skb_cloned(from))
> > > +     /* Same situation as above where head data presents. When @from is
> > > +      * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
> > > +      * is cloned. However @from can be transitioned into non-cloned
> > > +      * concurrently by this point. If it does happen, we need to make sure
> > > +      * the ref count is properly incremented.
> > > +      */
> > > +     if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
> > >               from_shinfo->nr_frags = 0;
> > >
> > >       /* if the skb is not cloned this does nothing
> >
> > So looking this over I believe this should resolve the issue you
> > pointed out while maintaining current functionality.
> >
> > Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
>
> Yes, this would work, but I am not sure we really want this here.
> Is coalescing page-pool-owned and slab-owned SKBs frequent to justify
> the additional overhead? If the answer is yes, the patch seems fine
>
> Thanks
> /Ilias

We did some analysis of how frequently the following conditions were
met on entry to the function:

to->pp_recycle    --> false
from->pp_recycle  --> true
skb_cloned(from)  --> true

With our networking setup (MTU 1500), there were no such occurrences.
It was only when we limited the packet size down to 256 bytes that we
started to observe a few occurrences (less than 1%), and even those
were not significant enough to show any difference in our performance
tests.

So it seems the performance gain does not justify the additional
overhead. Could you please consider the v1 patch instead?

Thanks,
Liang
> >
> > One follow-on that we may want to do with this would be to look at
> > consolidating the 3 spots where we are checking for our combination of
> > pp_recycle comparison and skb_cloned and maybe pass one boolean flag
> > indicating that we have to transfer everything by taking page
> > references.
> >
> > Also I think we can actually increase the number of cases where we
> > support coalescing if we were to take apart the skb_head_is_locked call
> > and move the skb_cloned check from it into your recycle check in the
> > portion where we are stealing from the header.


end of thread (newest: 2023-04-10 7:26 UTC)

Thread overview: 6+ messages
-- links below jump to the message on this page --
2023-04-06 11:48 [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs Liang Chen
2023-04-06 15:25 ` Alexander H Duyck
2023-04-07  8:02   ` Ilias Apalodimas
2023-04-10  7:26     ` Liang Chen
2023-04-07  8:06   ` Yunsheng Lin
2023-04-07 15:01     ` Alexander Duyck
