From: Adam Smith <hrotsvit@gmail.com>
To: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: brouer@redhat.com, xdp-newbies@vger.kernel.org, "Karlsson,
Magnus" <magnus.karlsson@intel.com>,
Federico Parola <federico.parola@polito.it>
Subject: Re: Questions about IRQ utilization and throughput with XDP_REDIRECT on Intel i40e
Date: Fri, 23 Sep 2022 16:28:31 -0500 [thread overview]
Message-ID: <CADx6jH7pxdvFsx2=DC=etkpUgjFe8rEPVvSQpHdWE09nvkxeyA@mail.gmail.com> (raw)
In-Reply-To: <4400b817-582a-842a-282b-4dd6d8a22f11@redhat.com>
On Fri, Sep 23, 2022 at 11:55 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 20/09/2022 19.37, Adam Smith wrote:
> > (answered inline, below)
> >
> > On Tue, Sep 20, 2022 at 3:17 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >> (answered inline, below)
> >>
> >> On 19/09/2022 22.55, Adam Smith wrote:
> >>> Hello,
> >>>
> >>> In trying to understand the differences in IRQ utilization and
> >>> throughput when performing XDP_REDIRECT in a simple netfilter bridge
> >>> on the Intel i40e, we have encountered behavior we are unable to
> >>> explain and we would like advice on where to investigate next.
> >>>
> >>> The two questions we are seeking guidance for are:
> >>> 1) Why does XDP in the i40e driver handle interrupts on multiple IRQs,
> >>> while the same flows are serviced by a single IRQ without XDP
> >>> (netfilter bridge)?
> >>>
> >>
> >> Remember IRQ smp affinity is configurable via /proc/irq/ files.
> >> Below bash code simply uses the queue number as the assigned CPU number.
> >>
> >> echo " --- Align IRQs: i40e ---"
> >> # i40e have driver name as starting prefix, making it easier to "catch"
> >> for F in /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list; do
> >> # Extract irqname e.g. "i40e-eth2-TxRx-1"
> >> irqname=$(basename $(dirname $(dirname $F))) ;
> >> # Substring pattern removal to extract Q-number
> >> hwq_nr=${irqname#*-*-*-}
> >> echo $hwq_nr > $F
> >> #grep . -H $F;
> >> done
> >>
> >> Thus we get this one-to-one mapping of Q-to-CPU number:
> >>
> >> $ grep -H . /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list
> >> /proc/irq/218/i40e-i40e1-TxRx-0/../smp_affinity_list:0
> >> /proc/irq/219/i40e-i40e1-TxRx-1/../smp_affinity_list:1
> >> /proc/irq/220/i40e-i40e1-TxRx-2/../smp_affinity_list:2
> >> /proc/irq/221/i40e-i40e1-TxRx-3/../smp_affinity_list:3
> >> /proc/irq/222/i40e-i40e1-TxRx-4/../smp_affinity_list:4
> >> /proc/irq/223/i40e-i40e1-TxRx-5/../smp_affinity_list:5
> >> /proc/irq/224/i40e-0000:04:00.0:fdir-TxRx-0/../smp_affinity_list:0
> >
> > Apologies, I should have mentioned that IRQ affinity was already
> > pinned via the recommended set_irq_affinity script from Intel driver
> > tools.
>
> It sounds like a bug (Cc Magnus) if you had pinned the IRQ affinity and
> then traffic migrates around anyway.
>
> There are two possible setup issues that could cause this:
>
> (#1) When loading XDP some drivers reset too much of the link
> config (like ixgbe), which could lead to smp_affinity getting reset
> to defaults. (Check setting with above grep -H . )
Verified that smp_affinity is not being reset.
> (#2) The i40e NICs hardware ATR/Flow-Director could be the one that
> moves traffic in unexpected ways. You can disable it via ethtool
> running this command:
>
> ethtool -K i40e1 ntuple-filters off
>
> (Cc Federico as they mention ATR/Flow-Director in their paper)
Disabling ntuple-filters did not change the behavior.
> >>> 2) Why does the i40e driver with XDP under load seemingly get faster
> >>> when tracing is attached to functions inside the driver’s napi_poll
> >>> loop?
> >>
> >> My theory is: Because you keep the CPU from going into sleep states.
> >>
> >>> Our working theory is that the i40e driver is not as efficient in
> >>> interrupt handling when XDP is enabled. Something in napi_poll is
> >>> looping too aggressively, and, when artificially slowed by attaching
> >>> to various kprobes and tracepoints, the slightly delayed code becomes
> >>> more efficient.
> >>>
> >>> Testing setup:
> >>>
> >>
> >> So, the test setup is basically a forwarding scenario using bridging.
> >> (It reminds me, we should add BPF bridge FIB lookup helpers... Cc lorenzo)
> >>
> >>> Without XDP, our iperf3 test utilizes almost 100% CPU on a single core
> >>> to achieve approximately 9.42 Gbits/sec. Total hard IRQs over 10
> >>> seconds is as follows:
> >>> i40e-enp1s0f1-TxRx-1 127k
> >>> Iperf3 retransmissions are roughly 0.
> >>
> >> The key here is that your test utilizes almost 100% CPU on a single
> >> core. From this info I know that the CPU isn't going into deep sleep
> >> states.
> >>
> >>
> >>> With simple XDP_REDIRECT programs installed on both interfaces, CPU
> >>> usage drops to ~43% on two different cores (one significantly higher
> >>> than the other), and hard IRQs over 10 seconds is as follows:
> >>> i40e-enp1s0f0-TxRx-1 169k
> >>> i40e-enp1s0f0-TxRx-2 82k
> >>
> >> To avoid the jumping between IRQs, you should configure the smp_affinity
> >> as described above, BUT it will not solve the drop issue.
> >
> > As stated above, IRQs were pinned, which is what led us to question
> > the difference between XDP & Linux bridge.
> >
> >>> i40e-enp1s0f1-TxRx-1 147k
> >>> i40e-enp1s0f1-TxRx-2 235k
> >>> Throughput in this case is only ~8.75 Gbits/sec, and iperf3
> >>> retransmissions number between 1k and 3k consistently.
> >>
> >> The XDP redirect is so fast that the CPU is bored and decides to dive
> >> into deep sleep state levels. If the time it takes to wakeup again +
> >> overhead of starting NAPI (hardirq->softirq) is too long, then packets
> >> will be dropped due to overflowing hardware RX-queue.
> >>
> >> You can directly see the time/latency it takes to wake up from these
> >> sleep states on your hardware from this grep command:
> >>
> >> $ grep -H . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> >> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> >> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> >> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> >> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:40
> >> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
> >>
> >> As explained in [1] you can calculate back how many bytes are able to
> >> arrive at a given link speed when sleeping e.g. 133 usec, and then based
> >> on the expected packet size figure out if the default 512 slots RX-queue
> >> for i40e is large enough.
> >>
> >> [1]
> >> https://github.com/torvalds/linux/blob/v6.0-rc6/samples/bpf/xdp_redirect_cpu_user.c#L331-L346
> >
> > RX-queue size was set to 4096 for our tests, which is the maximum
> > available on the X710.
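For reference, here is that back-of-envelope calculation applied to our
setup. This is just a sketch: the 10 Gbit/s link speed and the 133 usec
exit latency are taken from the numbers above, and the per-frame wire
overheads (preamble + inter-frame gap) are my own assumptions.

```shell
# Bytes that arrive on the wire while the CPU exits its deepest sleep
# state, and how many RX descriptors that can consume.
link_bps=10000000000   # 10 Gbit/s link (assumed from this thread)
wakeup_us=133          # cpuidle state4 exit latency from sysfs above

bytes=$(( link_bps / 8 * wakeup_us / 1000000 ))  # bytes arriving while asleep
big=$(( bytes / 1538 ))   # full 1500-MTU frames (1518B + 20B preamble/IFG)
small=$(( bytes / 84 ))   # minimum-size 64B frames (64B + 20B preamble/IFG)

echo "bytes=$bytes mtu_frames=$big min_frames=$small"
```

With full-size frames the default 512-slot ring would be plenty (~108
descriptors), but at minimum frame size roughly 1979 descriptors can queue
up during a single wakeup, which is part of why we had already raised the
ring size to 4096.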
> >
> >>> When we use bpftrace to attach multiple BPF programs to i40e functions
> >>> involved in XDP (e.g., `bpftrace -e 'tracepoint:i40e:i40e_clean_rx_irq
> >>> {} kprobe:i40e_xmit_xdp_ring {}'`), retransmissions drop to 0,
> >>> throughput increases to 9.4 Gbits/sec, and CPU utilization on the
> >>> busier CPU increases to ~73%. Hard IRQs are similar to the
> >>> XDP_REDIRECT IRQs above.
> >>>
> >>> Attaching traces should not logically result in a throughput increase.
> >>>
> >>> Any insight or guidance would be greatly appreciated!
> >>
> >> Solution#1: The sysadmin can configure the system to avoid deep sleep via:
> >>
> >> # tuned-adm profile network-latency
> >>
> >> Solution#2: Can be combined with increasing RX-queue size via:
> >>
> >> # ethtool -G i40e1 rx 2048
> >>
> >> --Jesper
> >
> > Thank you very much! Changing CPU sleep behaviors explained our 2nd
> > issue from above with retransmissions and slower speeds without
> > profiling attached.
> >
> > We are still at a loss as to the differences in number of IRQs used
> > between XDP & bridge mode, but performance is now aligned with our
> > expectations. In rechecking these numbers after tuning the CPU with
> > tuned-adm, we did notice that XDP generates roughly 10x the number of
> > hard irqs compared to non-XDP bridge mode, but only on one
> > interrupt/core. See:
> >
>
> Good to hear the tuned-adm trick worked for you.
>
> > Non-XDP Bridge
> > $ sudo hardirqs -C 10 1
>
> Assume this is 1 sec sampling.
This is a 10 second sample window printed 1 time.
> > Tracing hard irq events... Hit Ctrl-C to end.
> > HARDIRQ TOTAL_count
> > [...]
> > i40e-enp1s0f1-TxRx-1 118820
> >
>
> Kind of weird that enp1s0f0 doesn't show up.
It is worth mentioning that enp1s0f1 is the egress interface,
suggesting that these interrupts occur on transmit. I agree that it
seems odd that enp1s0f0 doesn't at least show similar transmit for the
returning ACKs.
> Processing 10Gbit/s with MTU 1500 byte packets means 833,333 packets per
> sec will be arriving (10*10^9/8/1500).
>
> Thus, NAPI poll must be doing some bulk processing, as your hardirqs
> are 118,820/sec.
Correcting for time factor, it's 11882/sec.
> I have a number of "napi_monitor" tools (that I've implemented myself) to see
> what level of bulking you get.
> The @napi_bulk output from this bpftrace script shows you bulking:
>
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/napi_monitor.bt
>
> The script does much more (latency measurements), so you likely just want to
> use this oneliner:
>
> bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work,
> 0, 64, 4); }'
>
>
>
> > XDP (same network flow)
> > $ sudo hardirqs -C 10 1
> > Tracing hard irq events... Hit Ctrl-C to end.
> > HARDIRQ TOTAL_count
> > [...]
> > i40e-enp1s0f0-TxRx-2 79071
> > i40e-enp1s0f0-TxRx-1 106929
> > i40e-enp1s0f1-TxRx-2 993162
> > i40e-enp1s0f1-TxRx-1 108362
> >
>
> Ignoring the weird RX-queue spread.
>
> Only looking at i40e-enp1s0f1-TxRx-2 with 993,162 IRQ/sec.
> This is way too high, as we just learned 833Kpps should be the packet
> numbers. As this is a TCP flow going though the bridging box, then we
> likely also need to account for the ACK packets flowing in the other
> direction, thus we can get above the 833Kpps.
>
> This could indicate that the NAPI loop only processes a single packet at a
> time, which is weird.
>
> Could you check what NAPI is doing, again with oneliner:
>
> bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work,
> 0, 64, 4); }'
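One correction to the rate first: those 993,162 IRQs were counted over the
10 second window, i.e. ~99,316 IRQ/sec. A rough packets-per-IRQ estimate
for that queue (a sketch; the 1-ACK-per-2-segments delayed-ACK ratio is an
assumption on my part):

```shell
# Rough bulking estimate for i40e-enp1s0f1-TxRx-2 under XDP.
# On the egress interface: ~833333 data pps out, plus returning ACKs in.
irqs=993162
window_s=10
pps_data=833333
pps_ack=$(( pps_data / 2 ))          # delayed-ACK approximation

irq_rate=$(( irqs / window_s ))
bulk=$(( (pps_data + pps_ack) / irq_rate ))

echo "irq_rate=$irq_rate bulk=$bulk"
```

So even counting ACKs, that queue sees on the order of 12 packets per IRQ,
still far below the ~70 per IRQ we see in non-XDP bridge mode.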
>
>
> > Is it possible that we are seeing hard interrupts from both the RX &
> > TX packets under XDP?
>
> NIC drivers often try to do the work of the TX DMA completion
> interrupts while processing RX packets anyhow, so the TX IRQ count is
> often significantly lower when this happens.
As mentioned above, enp1s0f1 is the egress interface. We see
approximately a 10 to 1 ratio of transmitted packets to return ACK
packets, which seems to line up with the interrupt counters we are
seeing.
Repeated from above for clarity, with added annotations:
XDP (same network flow)
$ sudo hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.
HARDIRQ TOTAL_count
[...]
i40e-enp1s0f0-TxRx-2 79071 *[should be data receive if XDP is
respecting CPU affinity for transmit]
i40e-enp1s0f0-TxRx-1 106929 *[should be ACK transmitting if XDP
is respecting CPU affinity for transmit]
i40e-enp1s0f1-TxRx-2 993162 *[assumed iperf data transmit]
i40e-enp1s0f1-TxRx-1 108362 *[assumed ACK responses]
TSO & GSO are both disabled, so we are unsure why there's asymmetry on
TxRx-2 between the 2 interfaces.
> > In non-XDP, we notice that we are only seeing
> > one network interface producing hard interrupts and we are assuming
> > that the other interface must be serviced fully by polling.
>
> That is kind of weird, not seeing any IRQs for enp1s0f0.
>
> You should use this improved version of the one-liner to also get ifnames:
>
> sudo bpftrace -e 'tracepoint:napi:napi_poll {
> @napi_bulk[str(args->dev_name)]=lhist(args->work, 0, 64, 4); }'
>
> --Jesper
We ran the above bpftrace command both with and without XDP enabled:
(bar charts removed)
XDP
------------------------------
@napi_bulk[enp1s0f0]:
[0, 4) 91335
[4, 8) 90
[8, 12) 43776
[12, 16) 31
[16, 20) 47939
[20, 24) 23
[24, 28) 10105
[28, 32) 28
[32, 36) 2350
[36, 40) 14
[40, 44) 1021
[44, 48) 12
[48, 52) 586
[52, 56) 24
[56, 60) 2285
[60, 64) 16
[64, 68) 110876
@napi_bulk[enp1s0f1]:
[0, 4) 1177723
[4, 8) 82
[8, 12) 7
[12, 16) 1
[16, 20) 3
[20, 24) 2
[24, 28) 1
[28, 32) 1
[32, 36) 0
[36, 40) 0
[40, 44) 0
[44, 48) 0
[48, 52) 1
[52, 56) 0
[56, 60) 0
[60, 64) 0
[64, 68) 41
No XDP (Linux bridge only)
-----------------------------
@napi_bulk[enp1s0f0]:
[64, 68) 122992
@napi_bulk[enp1s0f1]:
[0, 4) 2800
[4, 8) 1565
[8, 12) 91207
[12, 16) 820
[16, 20) 26599
*As a side note, @napi_bulk[str(args->dev_name)] wasn't recognized as
a valid arg by bpftrace, despite showing up in the format for the
tracepoint, so we used args->napi->dev->name for the key. I'm not sure
what I'm missing there, but it seemed worth mentioning.
Thanks!
Adam Smith
Thread overview: 5+ messages
2022-09-19 20:55 Questions about IRQ utilization and throughput with XDP_REDIRECT on Intel i40e Adam Smith
2022-09-20 8:17 ` Jesper Dangaard Brouer
2022-09-20 17:37 ` Adam Smith
2022-09-23 16:55 ` Jesper Dangaard Brouer
2022-09-23 21:28 ` Adam Smith [this message]