* tc tp creation performance degradation since kernel 5.1
@ 2019-06-12 12:03 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-12 12:03 UTC (permalink / raw)
To: netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk
Hi.
I came across a serious performance degradation when adding many tps
(tc filter instances). I'm using the following script:
------------------------------------------------------------------------
#!/bin/bash
dev=testdummy
ip link add name $dev type dummy
ip link set dev $dev up
tc qdisc add dev $dev ingress
tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
pref_id=1
while [ $pref_id -lt 20000 ]
do
echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
((pref_id++))
done
start=$(date +"%s")
tc -b $tmp_file_name
stop=$(date +"%s")
echo "Insertion duration: $(($stop - $start)) sec"
rm -f $tmp_file_name
ip link del dev $dev
------------------------------------------------------------------------
On my testing vm, result on 5.1 kernel is:
Insertion duration: 3 sec
On net-next this is:
Insertion duration: 54 sec
I did some simple profiling using perf. Output on 5.1 kernel:
77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
0.60% tc_pref_scale.s libc-2.28.so [.] malloc
0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
0.37% tc [kernel.kallsyms] [k] idr_get_free
Output on net-next:
39.26% tc [kernel.vmlinux] [k] lock_is_held_type
33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
1.08% tc [kernel.vmlinux] [k] lock_acquire
0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
0.51% tc [kernel.vmlinux] [k] unwind_next_frame
0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
0.47% tc [kernel.vmlinux] [k] lock_release
I haven't investigated this any further yet. I fear that this might be
related to Vlad's changes in this area. Any ideas?
Thanks!
Jiri
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
@ 2019-06-12 12:30 ` Paolo Abeni
2019-06-13 4:50 ` Jiri Pirko
2019-06-12 12:34 ` Vlad Buslov
2019-06-13 8:11 ` Jiri Pirko
2 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2019-06-12 12:30 UTC (permalink / raw)
To: Jiri Pirko, netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk
Hi,
On Wed, 2019-06-12 at 14:03 +0200, Jiri Pirko wrote:
> I did some simple profiling using perf. Output on 5.1 kernel:
> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc
> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
> 0.37% tc [kernel.kallsyms] [k] idr_get_free
>
> Output on net-next:
> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
It looks like you have lockdep enabled here, but not on the 5.1 build.
That would explain such a large perf difference.
Can you please double-check?
thanks,
Paolo
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
@ 2019-06-12 12:34 ` Vlad Buslov
2019-06-13 5:49 ` Jiri Pirko
2019-06-13 8:11 ` Jiri Pirko
2 siblings, 1 reply; 10+ messages in thread
From: Vlad Buslov @ 2019-06-12 12:34 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev@vger.kernel.org, Vlad Buslov, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov
On Wed 12 Jun 2019 at 15:03, Jiri Pirko <jiri@resnulli.us> wrote:
> Hi.
>
> I came across a serious performance degradation when adding many tps. I'm
> using the following script:
>
> ------------------------------------------------------------------------
> #!/bin/bash
>
> dev=testdummy
> ip link add name $dev type dummy
> ip link set dev $dev up
> tc qdisc add dev $dev ingress
>
> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
> pref_id=1
>
> while [ $pref_id -lt 20000 ]
> do
> echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
> ((pref_id++))
> done
>
> start=$(date +"%s")
> tc -b $tmp_file_name
> stop=$(date +"%s")
> echo "Insertion duration: $(($stop - $start)) sec"
> rm -f $tmp_file_name
>
> ip link del dev $dev
> ------------------------------------------------------------------------
>
> On my testing vm, result on 5.1 kernel is:
> Insertion duration: 3 sec
> On net-next this is:
> Insertion duration: 54 sec
>
> I did some simple profiling using perf. Output on 5.1 kernel:
> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc
> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
> 0.37% tc [kernel.kallsyms] [k] idr_get_free
Are these results for the same config? Here I don't see any lockdep or
KASAN. However, in the next trace...
>
> Output on net-next:
> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
> 33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
> 12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
> 1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
> 1.08% tc [kernel.vmlinux] [k] lock_acquire
> 0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
> 0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
> 0.51% tc [kernel.vmlinux] [k] unwind_next_frame
> 0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
> 0.47% tc [kernel.vmlinux] [k] lock_release
... both lockdep and KASAN consume most of the CPU time.
BTW it takes 5 sec to execute your script on my system with net-next
(debug options disabled).
>
> I didn't investigate this any further now. I fear that this might be
> related to Vlad's changes in the area. Any ideas?
>
> Thanks!
>
> Jiri
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-12 12:30 ` Paolo Abeni
@ 2019-06-13 4:50 ` Jiri Pirko
0 siblings, 0 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 4:50 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk
Wed, Jun 12, 2019 at 02:30:37PM CEST, pabeni@redhat.com wrote:
>Hi,
>
>On Wed, 2019-06-12 at 14:03 +0200, Jiri Pirko wrote:
>> I did some simple profiling using perf. Output on 5.1 kernel:
>> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
>> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc
>> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
>> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
>> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
>> 0.37% tc [kernel.kallsyms] [k] idr_get_free
>>
>> Output on net-next:
>> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
>
>It looks like you have lockdep enabled here, but not on the 5.1 build.
>
>That would explain such a large perf difference.
>
>Can you please double check?
Will do.
>
>thanks,
>
>Paolo
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-12 12:34 ` Vlad Buslov
@ 2019-06-13 5:49 ` Jiri Pirko
0 siblings, 0 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 5:49 UTC (permalink / raw)
To: Vlad Buslov
Cc: netdev@vger.kernel.org, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov
Wed, Jun 12, 2019 at 02:34:02PM CEST, vladbu@mellanox.com wrote:
>
>On Wed 12 Jun 2019 at 15:03, Jiri Pirko <jiri@resnulli.us> wrote:
>> Hi.
>>
>> I came across a serious performance degradation when adding many tps. I'm
>> using the following script:
>>
>> ------------------------------------------------------------------------
>> #!/bin/bash
>>
>> dev=testdummy
>> ip link add name $dev type dummy
>> ip link set dev $dev up
>> tc qdisc add dev $dev ingress
>>
>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>> pref_id=1
>>
>> while [ $pref_id -lt 20000 ]
>> do
>> echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>> ((pref_id++))
>> done
>>
>> start=$(date +"%s")
>> tc -b $tmp_file_name
>> stop=$(date +"%s")
>> echo "Insertion duration: $(($stop - $start)) sec"
>> rm -f $tmp_file_name
>>
>> ip link del dev $dev
>> ------------------------------------------------------------------------
>>
>> On my testing vm, result on 5.1 kernel is:
>> Insertion duration: 3 sec
>> On net-next this is:
>> Insertion duration: 54 sec
>>
>> I did some simple profiling using perf. Output on 5.1 kernel:
>> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
>> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc
>> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
>> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
>> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
>> 0.37% tc [kernel.kallsyms] [k] idr_get_free
>
>Are these results for the same config? Here I don't see any lockdep or
>KASAN. However, in the next trace...
>
>>
>> Output on net-next:
>> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
>> 33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>> 12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
>> 1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
>> 1.08% tc [kernel.vmlinux] [k] lock_acquire
>> 0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
>> 0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
>> 0.51% tc [kernel.vmlinux] [k] unwind_next_frame
>> 0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>> 0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
>> 0.47% tc [kernel.vmlinux] [k] lock_release
>
>... both lockdep and KASAN consume most of the CPU time.
>
>BTW it takes 5 sec to execute your script on my system with net-next
>(debug options disabled).
You are right, my bad. Sorry for the noise.
>
>>
>> I didn't investigate this any further now. I fear that this might be
>> related to Vlad's changes in the area. Any ideas?
>>
>> Thanks!
>>
>> Jiri
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
2019-06-12 12:34 ` Vlad Buslov
@ 2019-06-13 8:11 ` Jiri Pirko
2019-06-13 10:09 ` Vlad Buslov
2 siblings, 1 reply; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 8:11 UTC (permalink / raw)
To: netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk, pabeni
I made a mistake during measurements, sorry about that.
This is the correct script:
-----------------------------------------------------------------------
#!/bin/bash
dev=testdummy
ip link add name $dev type dummy
ip link set dev $dev up
tc qdisc add dev $dev ingress
tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
pref_id=1
while [ $pref_id -lt 20000 ]
do
echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
#echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
((pref_id++))
done
start=$(date +"%s")
tc -b $tmp_file_name
stop=$(date +"%s")
echo "Insertion duration: $(($stop - $start)) sec"
rm -f $tmp_file_name
ip link del dev $dev
-----------------------------------------------------------------------
Note the commented-out matchall line: I don't see the regression with
matchall. However, I do see it with flower:
kernel 5.1
Insertion duration: 4 sec
kernel 5.2
Insertion duration: 163 sec
I don't see any significant difference in perf:
kernel 5.1
77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.93% tc [kernel.vmlinux] [k] idr_get_free
0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
0.69% tc [kernel.vmlinux] [k] finish_task_switch
0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
0.49% tc [kernel.vmlinux] [k] __memset
0.36% tc_pref_scale_o libc-2.28.so [.] malloc
0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
0.24% tc [kernel.vmlinux] [k] __memcpy
0.23% tc [cls_flower] [k] fl_change
0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
0.22% tc [kernel.vmlinux] [k] __slab_alloc
kernel 5.2
75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
0.86% ip [kernel.kallsyms] [k] finish_task_switch
0.67% tc [kernel.kallsyms] [k] memset
0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
0.52% tc_pref_scale_o libc-2.28.so [.] malloc
0.48% tc [kernel.kallsyms] [k] idr_get_free
0.46% tc [kernel.kallsyms] [k] fl_change
0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
Any ideas?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-13 8:11 ` Jiri Pirko
@ 2019-06-13 10:09 ` Vlad Buslov
2019-06-13 11:11 ` Jiri Pirko
0 siblings, 1 reply; 10+ messages in thread
From: Vlad Buslov @ 2019-06-13 10:09 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev@vger.kernel.org, Vlad Buslov, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov,
pabeni@redhat.com
On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote:
> I made a mistake during measurements, sorry about that.
>
> This is the correct script:
> -----------------------------------------------------------------------
> #!/bin/bash
>
> dev=testdummy
> ip link add name $dev type dummy
> ip link set dev $dev up
> tc qdisc add dev $dev ingress
>
> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
> pref_id=1
>
> while [ $pref_id -lt 20000 ]
> do
> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
> ((pref_id++))
> done
>
> start=$(date +"%s")
> tc -b $tmp_file_name
> stop=$(date +"%s")
> echo "Insertion duration: $(($stop - $start)) sec"
> rm -f $tmp_file_name
>
> ip link del dev $dev
> -----------------------------------------------------------------------
>
> Note the commented out matchall. I don't see the regression with
> matchall. However, I see that with flower:
> kernel 5.1
> Insertion duration: 4 sec
> kernel 5.2
> Insertion duration: 163 sec
>
> I don't see any significant difference in perf:
> kernel 5.1
> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 0.93% tc [kernel.vmlinux] [k] idr_get_free
> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.49% tc [kernel.vmlinux] [k] __memset
> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.24% tc [kernel.vmlinux] [k] __memcpy
> 0.23% tc [cls_flower] [k] fl_change
> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>
>
> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
> 0.67% tc [kernel.kallsyms] [k] memset
> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.48% tc [kernel.kallsyms] [k] idr_get_free
> 0.46% tc [kernel.kallsyms] [k] fl_change
> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>
> Any ideas?
Thanks for providing the reproduction script!
I've investigated the problem and found the root cause. First of all, I
noticed that CPU utilization during the problematic tc run is quite low
(<10%), so I decided to investigate why tc sleeps so much. I used bcc
and obtained the following off-CPU trace (uninteresting traces are omitted
for brevity):
~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
...
finish_task_switch
__sched_text_start
schedule
schedule_timeout
wait_for_completion
__wait_rcu_gp
synchronize_rcu
fl_change
tc_new_tfilter
rtnetlink_rcv_msg
netlink_rcv_skb
netlink_unicast
netlink_sendmsg
sock_sendmsg
___sys_sendmsg
__sys_sendmsg
do_syscall_64
entry_SYSCALL_64_after_hwframe
- tc (2069)
142284953
As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The
code is in the fl_create_new_mask() function:
err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
&newmask->ht_node, mask_ht_params);
if (err)
goto errout_destroy;
/* Wait until any potential concurrent users of mask are finished */
synchronize_rcu();
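That total is consistent with paying one RCU grace period (typically a few
milliseconds) for each of the ~20k inserted filters. A quick
back-of-the-envelope check, using the numbers from the trace and the script:

```shell
# Total off-CPU time spent in synchronize_rcu() (us, from the trace above),
# divided by the number of filters the script inserts (pref_id 1..19999).
total_us=142284953
n_filters=19999
echo "$(( total_us / n_filters )) us per insertion"   # prints: 7114 us per insertion
```

That is roughly 7 ms per filter, in line with a single synchronize_rcu()
grace period on an otherwise idle system.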
The justification for this is described in a comment in
fl_check_assign_mask() (a user of fl_create_new_mask()):
/* Insert mask as temporary node to prevent concurrent creation of mask
* with same key. Any concurrent lookups with same key will return
* -EAGAIN because mask's refcnt is zero. It is safe to insert
* stack-allocated 'mask' to masks hash table because we call
* synchronize_rcu() before returning from this function (either in case
* of error or after replacing it with heap-allocated mask in
* fl_create_new_mask()).
*/
fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
&mask->ht_node,
mask_ht_params);
The offending commit is part of my series that implements unlocked
flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
insertion")
The justification presented in it is no longer relevant since Ivan
Vecera changed the mask to be dynamically allocated in commit 2cddd2014782
("net/sched: cls_flower: allocate mask dynamically in fl_change()").
With this we can just change fl_change() to deallocate the temporary mask
after an RCU grace period and remove the offending synchronize_rcu() call.
Any other suggestions?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-13 10:09 ` Vlad Buslov
@ 2019-06-13 11:11 ` Jiri Pirko
2019-06-13 11:26 ` Vlad Buslov
0 siblings, 1 reply; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 11:11 UTC (permalink / raw)
To: Vlad Buslov
Cc: netdev@vger.kernel.org, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov,
pabeni@redhat.com
Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote:
>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote:
>> I made a mistake during measurements, sorry about that.
>>
>> This is the correct script:
>> -----------------------------------------------------------------------
>> #!/bin/bash
>>
>> dev=testdummy
>> ip link add name $dev type dummy
>> ip link set dev $dev up
>> tc qdisc add dev $dev ingress
>>
>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>> pref_id=1
>>
>> while [ $pref_id -lt 20000 ]
>> do
>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>> ((pref_id++))
>> done
>>
>> start=$(date +"%s")
>> tc -b $tmp_file_name
>> stop=$(date +"%s")
>> echo "Insertion duration: $(($stop - $start)) sec"
>> rm -f $tmp_file_name
>>
>> ip link del dev $dev
>> -----------------------------------------------------------------------
>>
>> Note the commented out matchall. I don't see the regression with
>> matchall. However, I see that with flower:
>> kernel 5.1
>> Insertion duration: 4 sec
>> kernel 5.2
>> Insertion duration: 163 sec
>>
>> I don't see any significant difference in perf:
>> kernel 5.1
>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>> 0.93% tc [kernel.vmlinux] [k] idr_get_free
>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>> 0.49% tc [kernel.vmlinux] [k] __memset
>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
>> 0.24% tc [kernel.vmlinux] [k] __memcpy
>> 0.23% tc [cls_flower] [k] fl_change
>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>>
>>
>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
>> 0.67% tc [kernel.kallsyms] [k] memset
>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
>> 0.48% tc [kernel.kallsyms] [k] idr_get_free
>> 0.46% tc [kernel.kallsyms] [k] fl_change
>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>>
>> Any ideas?
>
>Thanks for providing reproduction script!
>
>I've investigated the problem and found the root cause. First of all I
>noticed that CPU utilization during problematic tc run is quite low
>(<10%), so I decided to investigate why tc sleeps so much. I've used bcc
>and obtained following off-CPU trace (uninteresting traces are omitted
>for brevity):
>
>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>...
> finish_task_switch
> __sched_text_start
> schedule
> schedule_timeout
> wait_for_completion
> __wait_rcu_gp
> synchronize_rcu
> fl_change
> tc_new_tfilter
> rtnetlink_rcv_msg
> netlink_rcv_skb
> netlink_unicast
> netlink_sendmsg
> sock_sendmsg
> ___sys_sendmsg
> __sys_sendmsg
> do_syscall_64
> entry_SYSCALL_64_after_hwframe
> - tc (2069)
> 142284953
>
>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The
>code is in fl_create_new_mask() function:
>
> err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
> &newmask->ht_node, mask_ht_params);
> if (err)
> goto errout_destroy;
>
> /* Wait until any potential concurrent users of mask are finished */
> synchronize_rcu();
>
>The justification for this is described in comment in
>fl_check_assign_mask() (user of fl_create_new_mask()):
>
> /* Insert mask as temporary node to prevent concurrent creation of mask
> * with same key. Any concurrent lookups with same key will return
> * -EAGAIN because mask's refcnt is zero. It is safe to insert
> * stack-allocated 'mask' to masks hash table because we call
> * synchronize_rcu() before returning from this function (either in case
> * of error or after replacing it with heap-allocated mask in
> * fl_create_new_mask()).
> */
> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
> &mask->ht_node,
> mask_ht_params);
>
>The offending commit is part of my series that implements unlocked
>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>insertion")
>
>The justification presented in it is no longer relevant since Ivan
>Vecera changed mask to be dynamically allocated in commit 2cddd2014782
>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>With this we can just change fl_change() to deallocate temporary mask
>with rcu grace period and remove the offending synchronize_rcu() call.
>
>Any other suggestions?
So basically you just change synchronize_rcu() to kfree_rcu(mask),
correct?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-13 11:11 ` Jiri Pirko
@ 2019-06-13 11:26 ` Vlad Buslov
2019-06-13 14:18 ` Jiri Pirko
0 siblings, 1 reply; 10+ messages in thread
From: Vlad Buslov @ 2019-06-13 11:26 UTC (permalink / raw)
To: Jiri Pirko
Cc: Vlad Buslov, netdev@vger.kernel.org, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov,
pabeni@redhat.com
On Thu 13 Jun 2019 at 14:11, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote:
>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote:
>>> I made a mistake during measurements, sorry about that.
>>>
>>> This is the correct script:
>>> -----------------------------------------------------------------------
>>> #!/bin/bash
>>>
>>> dev=testdummy
>>> ip link add name $dev type dummy
>>> ip link set dev $dev up
>>> tc qdisc add dev $dev ingress
>>>
>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>> pref_id=1
>>>
>>> while [ $pref_id -lt 20000 ]
>>> do
>>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>> ((pref_id++))
>>> done
>>>
>>> start=$(date +"%s")
>>> tc -b $tmp_file_name
>>> stop=$(date +"%s")
>>> echo "Insertion duration: $(($stop - $start)) sec"
>>> rm -f $tmp_file_name
>>>
>>> ip link del dev $dev
>>> -----------------------------------------------------------------------
>>>
>>> Note the commented out matchall. I don't see the regression with
>>> matchall. However, I see that with flower:
>>> kernel 5.1
>>> Insertion duration: 4 sec
>>> kernel 5.2
>>> Insertion duration: 163 sec
>>>
>>> I don't see any significant difference in perf:
>>> kernel 5.1
>>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>>> 0.93% tc [kernel.vmlinux] [k] idr_get_free
>>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
>>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
>>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>> 0.49% tc [kernel.vmlinux] [k] __memset
>>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
>>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
>>> 0.24% tc [kernel.vmlinux] [k] __memcpy
>>> 0.23% tc [cls_flower] [k] fl_change
>>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
>>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>>>
>>>
>>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
>>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
>>> 0.67% tc [kernel.kallsyms] [k] memset
>>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
>>> 0.48% tc [kernel.kallsyms] [k] idr_get_free
>>> 0.46% tc [kernel.kallsyms] [k] fl_change
>>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
>>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>>>
>>> Any ideas?
>>
>>Thanks for providing reproduction script!
>>
>>I've investigated the problem and found the root cause. First of all I
>>noticed that CPU utilization during problematic tc run is quite low
>>(<10%), so I decided to investigate why tc sleeps so much. I've used bcc
>>and obtained following off-CPU trace (uninteresting traces are omitted
>>for brevity):
>>
>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>...
>> finish_task_switch
>> __sched_text_start
>> schedule
>> schedule_timeout
>> wait_for_completion
>> __wait_rcu_gp
>> synchronize_rcu
>> fl_change
>> tc_new_tfilter
>> rtnetlink_rcv_msg
>> netlink_rcv_skb
>> netlink_unicast
>> netlink_sendmsg
>> sock_sendmsg
>> ___sys_sendmsg
>> __sys_sendmsg
>> do_syscall_64
>> entry_SYSCALL_64_after_hwframe
>> - tc (2069)
>> 142284953
>>
>>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The
>>code is in fl_create_new_mask() function:
>>
>> err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>> &newmask->ht_node, mask_ht_params);
>> if (err)
>> goto errout_destroy;
>>
>> /* Wait until any potential concurrent users of mask are finished */
>> synchronize_rcu();
>>
>>The justification for this is described in comment in
>>fl_check_assign_mask() (user of fl_create_new_mask()):
>>
>> /* Insert mask as temporary node to prevent concurrent creation of mask
>> * with same key. Any concurrent lookups with same key will return
>> * -EAGAIN because mask's refcnt is zero. It is safe to insert
>> * stack-allocated 'mask' to masks hash table because we call
>> * synchronize_rcu() before returning from this function (either in case
>> * of error or after replacing it with heap-allocated mask in
>> * fl_create_new_mask()).
>> */
>> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>> &mask->ht_node,
>> mask_ht_params);
>>
>>The offending commit is part of my series that implements unlocked
>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>insertion")
>>
>>The justification presented in it is no longer relevant since Ivan
>>Vecera changed mask to be dynamically allocated in commit 2cddd2014782
>>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>>With this we can just change fl_change() to deallocate temporary mask
>>with rcu grace period and remove the offending synchronize_rcu() call.
>>
>>Any other suggestions?
>
> So basically you just change synchronize_rcu() to kfree_rcu(mask),
> correct?
Not really. I remove synchronize_rcu() and change all kfree(mask) calls in
fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
uses queue_rcu_work() internally. This allows us to deallocate
fl_flow_mask in the same manner on all code paths and doesn't require any
extensions to the fl_flow_mask struct (kfree_rcu() would require extending
it with an rcu_head).
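Roughly, the pattern would look like this (an untested sketch, not a patch;
it assumes fl_flow_mask has, or gains, a struct rcu_work rwork member, and
fl_mask_free_work is the new deferred worker):

```c
/* Sketch only: free the temporary mask after an RCU grace period via
 * tcf_queue_work(), the same way flower already defers freeing filters.
 */
static void fl_mask_free_work(struct work_struct *work)
{
	struct fl_flow_mask *mask = container_of(to_rcu_work(work),
						 struct fl_flow_mask, rwork);

	kfree(mask);
}

/* In fl_change()/fl_check_assign_mask() cleanup paths, instead of
 * synchronize_rcu(); ... kfree(mask); do:
 */
tcf_queue_work(&mask->rwork, fl_mask_free_work);
```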
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degradation since kernel 5.1
2019-06-13 11:26 ` Vlad Buslov
@ 2019-06-13 14:18 ` Jiri Pirko
0 siblings, 0 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 14:18 UTC
To: Vlad Buslov
Cc: netdev@vger.kernel.org, pablo@netfilter.org,
xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov,
pabeni@redhat.com
Thu, Jun 13, 2019 at 01:26:17PM CEST, vladbu@mellanox.com wrote:
>
>On Thu 13 Jun 2019 at 14:11, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote:
>>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> I made a mistake during measurements, sorry about that.
>>>>
>>>> This is the correct script:
>>>> -----------------------------------------------------------------------
>>>> #!/bin/bash
>>>>
>>>> dev=testdummy
>>>> ip link add name $dev type dummy
>>>> ip link set dev $dev up
>>>> tc qdisc add dev $dev ingress
>>>>
>>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>>> pref_id=1
>>>>
>>>> while [ $pref_id -lt 20000 ]
>>>> do
>>>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>>> ((pref_id++))
>>>> done
>>>>
>>>> start=$(date +"%s")
>>>> tc -b $tmp_file_name
>>>> stop=$(date +"%s")
>>>> echo "Insertion duration: $(($stop - $start)) sec"
>>>> rm -f $tmp_file_name
>>>>
>>>> ip link del dev $dev
>>>> -----------------------------------------------------------------------
>>>>
>>>> Note the commented out matchall. I don't see the regression with
>>>> matchall. However, I see that with flower:
>>>> kernel 5.1
>>>> Insertion duration: 4 sec
>>>> kernel 5.2
>>>> Insertion duration: 163 sec
>>>>
>>>> I don't see any significant difference in perf:
>>>> kernel 5.1
>>>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>>>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>>>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>>>> 0.93% tc [kernel.vmlinux] [k] idr_get_free
>>>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
>>>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch
>>>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>>> 0.49% tc [kernel.vmlinux] [k] __memset
>>>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
>>>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
>>>> 0.24% tc [kernel.vmlinux] [k] __memcpy
>>>> 0.23% tc [cls_flower] [k] fl_change
>>>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
>>>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>>>>
>>>>
>>>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
>>>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
>>>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
>>>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
>>>> 0.67% tc [kernel.kallsyms] [k] memset
>>>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
>>>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
>>>> 0.48% tc [kernel.kallsyms] [k] idr_get_free
>>>> 0.46% tc [kernel.kallsyms] [k] fl_change
>>>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
>>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
>>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
>>>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>>>>
>>>> Any ideas?
>>>
>>>Thanks for providing the reproduction script!
>>>
>>>I've investigated the problem and found the root cause. First of all, I
>>>noticed that CPU utilization during the problematic tc run is quite low
>>>(<10%), so I decided to investigate why tc sleeps so much. I used bcc
>>>and obtained the following off-CPU trace (uninteresting traces are
>>>omitted for brevity):
>>>
>>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>>...
>>> finish_task_switch
>>> __sched_text_start
>>> schedule
>>> schedule_timeout
>>> wait_for_completion
>>> __wait_rcu_gp
>>> synchronize_rcu
>>> fl_change
>>> tc_new_tfilter
>>> rtnetlink_rcv_msg
>>> netlink_rcv_skb
>>> netlink_unicast
>>> netlink_sendmsg
>>> sock_sendmsg
>>> ___sys_sendmsg
>>> __sys_sendmsg
>>> do_syscall_64
>>> entry_SYSCALL_64_after_hwframe
>>> - tc (2069)
>>> 142284953
>>>
>>>As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The
>>>code is in the fl_create_new_mask() function:
>>>
>>> err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>> &newmask->ht_node, mask_ht_params);
>>> if (err)
>>> goto errout_destroy;
>>>
>>> /* Wait until any potential concurrent users of mask are finished */
>>> synchronize_rcu();
>>>
>>>The justification for this is described in a comment in
>>>fl_check_assign_mask() (a user of fl_create_new_mask()):
>>>
>>> /* Insert mask as temporary node to prevent concurrent creation of mask
>>> * with same key. Any concurrent lookups with same key will return
>>> * -EAGAIN because mask's refcnt is zero. It is safe to insert
>>> * stack-allocated 'mask' to masks hash table because we call
>>> * synchronize_rcu() before returning from this function (either in case
>>> * of error or after replacing it with heap-allocated mask in
>>> * fl_create_new_mask()).
>>> */
>>> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>> &mask->ht_node,
>>> mask_ht_params);
>>>
>>>The offending commit is part of my series that implements unlocked
>>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>>insertion")
>>>
>>>The justification presented in it is no longer relevant since Ivan
>>>Vecera changed the mask to be dynamically allocated in commit 2cddd2014782
>>>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>>>With this we can simply change fl_change() to deallocate the temporary
>>>mask after an RCU grace period and remove the offending synchronize_rcu() call.
>>>
>>>Any other suggestions?
>>
>> So basically you just change synchronize_rcu() to kfree_rcu(mask),
>> correct?
>
>Not really. I remove synchronize_rcu() and change all kfree(mask) calls in
>fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
>uses queue_rcu_work() internally. This allows us to deallocate
>fl_flow_mask in the same manner on all code paths and doesn't require
>extending the fl_flow_mask struct (kfree_rcu() would require adding an
>rcu_head to it).
Got it. Makes sense to me. Thanks!
Thread overview: 10+ messages (newest: 2019-06-13 16:48 UTC)
2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
2019-06-13 4:50 ` Jiri Pirko
2019-06-12 12:34 ` Vlad Buslov
2019-06-13 5:49 ` Jiri Pirko
2019-06-13 8:11 ` Jiri Pirko
2019-06-13 10:09 ` Vlad Buslov
2019-06-13 11:11 ` Jiri Pirko
2019-06-13 11:26 ` Vlad Buslov
2019-06-13 14:18 ` Jiri Pirko