Netdev Archive mirror
* [RFC Optimizing veth xsk performance 00/10]
@ 2023-08-03 14:04 huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback huangjie.albert
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

AF_XDP is a kernel bypass technology that can greatly improve performance.
However, for virtual devices like veth, even with the use of AF_XDP sockets,
there are still many additional software paths that consume CPU resources. 
This patch series focuses on optimizing the performance of AF_XDP sockets 
for veth virtual devices. Patches 1 to 4 mainly involve preparatory work. 
Patch 5 introduces tx queue and tx napi for packet transmission, while 
patch 9 primarily implements zero-copy, and patch 10 adds support for 
batch sending of IPv4 UDP packets. These optimizations significantly reduce 
the software path and support checksum offload.

I tested these features with the following setup.
A typical topology is shown below:
veth<-->veth-peer                                    veth1-peer<--->veth1
	1       |                                                  |   7
	        |2                                                6|
	        |                                                  |
	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
                  3                    4                 5    
             (machine1)                              (machine2)    
AF_XDP sockets are attached to veth and veth1, and packets are sent out through the physical NIC (eth0).
(A minimal socket-setup sketch follows the test commands below.)
veth:(172.17.0.2/24)
bridge:(172.17.0.1/24)
eth0:(192.168.156.66/24)

eth1(172.17.0.2/24)
bridge1:(172.17.0.1/24)
eth0:(192.168.156.88/24)

After setting up the default route, SNAT, and DNAT, we can run tests
to get the performance results.

packets send from veth to veth1:
af_xdp test tool:
link:https://github.com/cclinuxer/libxudp
send:(veth)
./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
recv:(veth1)
./objs/xudpperf recv --src 172.17.0.2:6002

udp test tool:iperf3
send:(veth)
iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
recv:(veth1)
iperf3 -s -p 6002
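
For reference, a minimal sketch (not part of the patches) of how an AF_XDP
socket can be created and bound to veth with zero-copy requested, using the
libxdp xsk API; the buffer sizes, interface name and queue id below are
assumptions for illustration only:

#include <stdlib.h>
#include <sys/mman.h>
#include <linux/if_xdp.h>
#include <xdp/xsk.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

static struct xsk_socket *create_veth_xsk(void)
{
	struct xsk_ring_prod fq, tx;
	struct xsk_ring_cons cq, rx;
	struct xsk_umem *umem;
	struct xsk_socket *xsk;
	void *buf;

	/* anonymous memory backing the umem */
	buf = mmap(NULL, NUM_FRAMES * FRAME_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	if (xsk_umem__create(&umem, buf, NUM_FRAMES * FRAME_SIZE, &fq, &cq, NULL))
		return NULL;

	struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
		/* request zero-copy explicitly; bind fails if the driver lacks support */
		.bind_flags = XDP_ZEROCOPY | XDP_USE_NEED_WAKEUP,
	};

	/* bind to queue 0 of the veth interface */
	if (xsk_socket__create(&xsk, "veth", 0, umem, &rx, &tx, &cfg))
		return NULL;

	return xsk;
}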

performance (tested with the libxdp library):
UDP                              : 250 Kpps (with 100% cpu)
AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)

With af_xdp batching, the libxdp user-space program becomes the bottleneck,
so the softirq does not reach its limit.

This is just an RFC patch series, and some code details still need 
further consideration. Please review this proposal.

thanks!

huangjie.albert (10):
  veth: Implement ethtool's get_ringparam() callback
  xsk: add dma_check_skip for  skipping dma check
  veth: add support for send queue
  xsk: add xsk_tx_completed_addr function
  veth: use send queue tx napi to xmit xsk tx desc
  veth: add ndo_xsk_wakeup callback for veth
  sk_buff: add destructor_arg_xsk_pool for zero copy
  xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX
  veth: support zero copy for af xdp
  veth: af_xdp tx batch support for ipv4 udp

 drivers/net/veth.c          | 729 +++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   1 +
 include/net/xdp.h           |   1 +
 include/net/xdp_sock_drv.h  |   1 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk.c               |   6 +
 net/xdp/xsk_buff_pool.c     |   3 +-
 net/xdp/xsk_queue.h         |  11 +
 8 files changed, 751 insertions(+), 2 deletions(-)

-- 
2.20.1



* [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 20:41   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for skipping dma check huangjie.albert
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Menglong Dong, Yunsheng Lin, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

Some xsk libraries call the get_ringparam() API to get the queue length,
which is used to initialize the xsk umem.

Implement that in veth so those scenarios can work properly.
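
For reference, a sketch (not part of the patch) of how such a user-space
library might query the ring size through the ethtool ioctl; the interface
name is an assumption:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int get_tx_ring_size(const char *ifname, __u32 *tx_pending)
{
	struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
	struct ifreq ifr = {};
	int fd, err;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ring;

	err = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	if (err)
		return -1;

	/* with this patch, veth reports VETH_RING_SIZE here */
	*tx_pending = ring.tx_pending;
	return 0;
}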

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..c2b431a7a017 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -255,6 +255,17 @@ static void veth_get_channels(struct net_device *dev,
 static int veth_set_channels(struct net_device *dev,
 			     struct ethtool_channels *ch);
 
+static void veth_get_ringparam(struct net_device *dev,
+				  struct ethtool_ringparam *ring,
+				  struct kernel_ethtool_ringparam *kernel_ring,
+				  struct netlink_ext_ack *extack)
+{
+	ring->rx_max_pending = VETH_RING_SIZE;
+	ring->tx_max_pending = VETH_RING_SIZE;
+	ring->rx_pending = VETH_RING_SIZE;
+	ring->tx_pending = VETH_RING_SIZE;
+}
+
 static const struct ethtool_ops veth_ethtool_ops = {
 	.get_drvinfo		= veth_get_drvinfo,
 	.get_link		= ethtool_op_get_link,
@@ -265,6 +276,7 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_ts_info		= ethtool_op_get_ts_info,
 	.get_channels		= veth_get_channels,
 	.set_channels		= veth_set_channels,
+	.get_ringparam		= veth_get_ringparam,
 };
 
 /* general routines */
-- 
2.20.1



* [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for  skipping dma check
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 20:42   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue huangjie.albert
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Shmulik Ladkani, Kees Cook, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

For virtual net devices such as veth, there is
no need to do the DMA check if we support zero copy.

Add this flag after 'unaligned', because there is a 4-byte hole there, as shown by
pahole -V ./net/xdp/xsk_buff_pool.o:
-----------
...
	/* --- cacheline 3 boundary (192 bytes) --- */
	u32                        chunk_size;           /*   192     4 */
	u32                        frame_len;            /*   196     4 */
	u8                         cached_need_wakeup;   /*   200     1 */
	bool                       uses_need_wakeup;     /*   201     1 */
	bool                       dma_need_sync;        /*   202     1 */
	bool                       unaligned;            /*   203     1 */

	/* XXX 4 bytes hole, try to pack */

	void *                     addrs;                /*   208     8 */
	spinlock_t                 cq_lock;              /*   216     4 */
...
-----------

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 include/net/xsk_buff_pool.h | 1 +
 net/xdp/xsk_buff_pool.c     | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..fe31097dc11b 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -81,6 +81,7 @@ struct xsk_buff_pool {
 	bool uses_need_wakeup;
 	bool dma_need_sync;
 	bool unaligned;
+	bool dma_check_skip;
 	void *addrs;
 	/* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
 	 * NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b3f7b310811e..ed251b8e8773 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 		XDP_PACKET_HEADROOM;
 	pool->umem = umem;
 	pool->addrs = umem->addrs;
+	pool->dma_check_skip = false;
 	INIT_LIST_HEAD(&pool->free_list);
 	INIT_LIST_HEAD(&pool->xskb_list);
 	INIT_LIST_HEAD(&pool->xsk_tx_list);
@@ -202,7 +203,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (err)
 		goto err_unreg_pool;
 
-	if (!pool->dma_pages) {
+	if (!pool->dma_pages && !pool->dma_check_skip) {
 		WARN(1, "Driver did not DMA map zero-copy buffers");
 		err = -EINVAL;
 		goto err_unreg_xsk;
-- 
2.20.1



* [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for skipping dma check huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 20:44   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function huangjie.albert
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Kees Cook, Richard Gobert, Yunsheng Lin,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

In order to support native af_xdp for veth, we
need a send queue for NAPI tx.
An upcoming patch will make use of it.

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index c2b431a7a017..63c3ebe4c5d0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -56,6 +56,11 @@ struct veth_rq_stats {
 	struct u64_stats_sync	syncp;
 };
 
+struct veth_sq_stats {
+	struct veth_stats	vs;
+	struct u64_stats_sync	syncp;
+};
+
 struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,11 +74,25 @@ struct veth_rq {
 	struct page_pool	*page_pool;
 };
 
+struct veth_sq {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
+	struct xdp_mem_info	xdp_mem;
+	struct veth_sq_stats	stats;
+	u32 queue_index;
+	/* this is for xsk */
+	struct {
+		struct xsk_buff_pool __rcu *pool;
+		u32 last_cpu;
+	}xsk;
+};
+
 struct veth_priv {
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	struct bpf_prog		*_xdp_prog;
 	struct veth_rq		*rq;
+	struct veth_sq		*sq;
 	unsigned int		requested_headroom;
 };
 
@@ -1495,6 +1514,15 @@ static int veth_alloc_queues(struct net_device *dev)
 		u64_stats_init(&priv->rq[i].stats.syncp);
 	}
 
+	priv->sq = kcalloc(dev->num_tx_queues, sizeof(*priv->sq), GFP_KERNEL);
+	if (!priv->sq)
+		return -ENOMEM;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		priv->sq[i].dev = dev;
+		u64_stats_init(&priv->sq[i].stats.syncp);
+	}
+
 	return 0;
 }
 
@@ -1503,6 +1531,7 @@ static void veth_free_queues(struct net_device *dev)
 	struct veth_priv *priv = netdev_priv(dev);
 
 	kfree(priv->rq);
+	kfree(priv->sq);
 }
 
 static int veth_dev_init(struct net_device *dev)
-- 
2.20.1



* [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (2 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 20:46   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc huangjie.albert
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Yunsheng Lin, Menglong Dong, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

Return a descriptor to the cq (completion queue) by its address, rather than
only by count as xsk_tx_completed() does.
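
A minimal driver-side sketch (not part of the patch) contrasting the two
completion helpers; drv_complete_one_desc() is a hypothetical function:

#include <net/xdp_sock_drv.h>

static void drv_complete_one_desc(struct xsk_buff_pool *pool, u64 addr)
{
	/* existing API: completes the next N descriptors in order */
	/* xsk_tx_completed(pool, 1); */

	/* new API: completes one specific descriptor by its address,
	 * which the veth skb-destructor path in a later patch needs
	 * because completions can happen out of order there.
	 */
	xsk_tx_completed_addr(pool, addr);
}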

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 include/net/xdp_sock_drv.h |  1 +
 net/xdp/xsk.c              |  6 ++++++
 net/xdp/xsk_queue.h        | 11 +++++++++++
 3 files changed, 18 insertions(+)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..5220454bff5c 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -15,6 +15,7 @@
 #ifdef CONFIG_XDP_SOCKETS
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr);
 bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4f1e0599146e..b2b8aa7b0bcf 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -396,6 +396,12 @@ void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
 }
 EXPORT_SYMBOL(xsk_tx_completed);
 
+void xsk_tx_completed_addr(struct xsk_buff_pool *pool, u64 addr)
+{
+	xskq_prod_submit_addr(pool->cq, addr);
+}
+EXPORT_SYMBOL(xsk_tx_completed_addr);
+
 void xsk_tx_release(struct xsk_buff_pool *pool)
 {
 	struct xdp_sock *xs;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..a494d1dcb1c3 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -428,6 +428,17 @@ static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
 	smp_store_release(&q->ring->producer, idx); /* B, matches C */
 }
 
+
+static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+	u32 idx = q->ring->producer;
+
+	ring->desc[idx++ & q->ring_mask] = addr;
+
+	__xskq_prod_submit(q, idx);
+}
+
 static inline void xskq_prod_submit(struct xsk_queue *q)
 {
 	__xskq_prod_submit(q, q->cached_prod);
-- 
2.20.1



* [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (3 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 20:59   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth huangjie.albert
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Kees Cook, Menglong Dong, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 265 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 264 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 63c3ebe4c5d0..944761807ca4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,8 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <net/page_pool.h>
+#include <net/xdp_sock_drv.h>
+#include <net/xdp.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -1061,6 +1063,176 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	return done;
 }
 
+static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
+{
+	struct veth_priv *priv, *peer_priv;
+	struct net_device *dev, *peer_dev;
+	struct veth_rq *peer_rq;
+	struct veth_stats peer_stats = {};
+	struct veth_stats stats = {};
+	struct veth_xdp_tx_bq bq;
+	struct xdp_desc desc;
+	void *xdpf;
+	int done = 0;
+
+	bq.count = 0;
+	dev = sq->dev;
+	priv = netdev_priv(dev);
+	peer_dev = priv->peer;
+	peer_priv = netdev_priv(peer_dev);
+
+	/* todo: queue index must set before this */
+	peer_rq = &peer_priv->rq[sq->queue_index];
+
+	/* set xsk wake up flag, to do: where to disable */
+	if (xsk_uses_need_wakeup(xsk_pool))
+		xsk_set_tx_need_wakeup(xsk_pool);
+
+	while (budget-- > 0) {
+		unsigned int truesize = 0;
+		struct xdp_frame *p_frame;
+		struct page *page;
+		void *new_addr;
+		void *addr;
+
+		/*
+		* get a desc from xsk pool
+		*/
+		if (!xsk_tx_peek_desc(xsk_pool, &desc)) {
+			break;
+		}
+
+		/*
+		* Get a xmit addr
+		* desc.addr is a offset, so we should to convert to real virtual address
+		*/
+		addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
+
+		/* can not hold all data in a page */
+		truesize =  SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + desc.len + sizeof(struct xdp_frame);
+		if (truesize > PAGE_SIZE) {
+			stats.xdp_drops++;
+			xsk_tx_completed_addr(xsk_pool, desc.addr);
+			continue;
+		}
+
+		page = dev_alloc_page();
+		if (!page) {
+			/*
+			* error , release xdp frame and increase drops
+			*/
+			xsk_tx_completed_addr(xsk_pool, desc.addr);
+			stats.xdp_drops++;
+			break;
+		}
+		new_addr = page_to_virt(page);
+
+		p_frame = new_addr;
+		new_addr += sizeof(struct xdp_frame);
+		p_frame->data = new_addr;
+		p_frame->len = desc.len;
+
+		/* frame_sz should be set to the page size, because (struct skb_shared_info) is large:
+		 * if we build the skb in veth_xdp_rcv_one, skb->tail may grow past skb->end, which could trigger a skb_panic
+		 */
+		p_frame->headroom = 0;
+		p_frame->metasize = 0;
+		p_frame->frame_sz = PAGE_SIZE;
+		p_frame->flags = 0;
+		p_frame->mem.type = MEM_TYPE_PAGE_SHARED;
+		memcpy(p_frame->data, addr, p_frame->len);
+		xsk_tx_completed_addr(xsk_pool, desc.addr);
+
+		/* if peer have xdp prog, if it has ,just send to peer */
+		p_frame = veth_xdp_rcv_one(peer_rq, p_frame, &bq, &peer_stats);
+		/* if no xdp with this queue, convert to skb to xmit*/
+		if (p_frame) {
+			xdpf = p_frame;
+			veth_xdp_rcv_bulk_skb(peer_rq, &xdpf, 1, &bq, &peer_stats);
+			p_frame = NULL;
+		}
+
+		stats.xdp_bytes += desc.len;
+
+		done++;
+	}
+
+	/* release, move consumer,and wakeup the producer */
+	if (done) {
+		napi_schedule(&peer_rq->xdp_napi);
+		xsk_tx_release(xsk_pool);
+	}
+
+
+
+	/* just for peer rq */
+	if (peer_stats.xdp_tx > 0)
+		veth_xdp_flush(peer_rq, &bq);
+	if (peer_stats.xdp_redirect > 0)
+		xdp_do_flush();
+
+	/* update peer rq stats, or maybe we do not need to do this */
+	u64_stats_update_begin(&peer_rq->stats.syncp);
+	peer_rq->stats.vs.xdp_redirect += peer_stats.xdp_redirect;
+	peer_rq->stats.vs.xdp_packets += done;
+	peer_rq->stats.vs.xdp_bytes += stats.xdp_bytes;
+	peer_rq->stats.vs.xdp_drops += peer_stats.xdp_drops;
+	peer_rq->stats.vs.rx_drops += peer_stats.rx_drops;
+	peer_rq->stats.vs.xdp_tx += peer_stats.xdp_tx;
+	u64_stats_update_end(&peer_rq->stats.syncp);
+
+	/* update sq stats */
+	u64_stats_update_begin(&sq->stats.syncp);
+	sq->stats.vs.xdp_packets += done;
+	sq->stats.vs.xdp_bytes += stats.xdp_bytes;
+	sq->stats.vs.xdp_drops += stats.xdp_drops;
+	u64_stats_update_end(&sq->stats.syncp);
+
+	return done;
+}
+
+static int veth_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct veth_sq *sq = container_of(napi, struct veth_sq, xdp_napi);
+	struct xsk_buff_pool *pool;
+	int done = 0;
+	xdp_set_return_frame_no_direct();
+
+	sq->xsk.last_cpu = smp_processor_id();
+
+	/* xmit for tx queue */
+	rcu_read_lock();
+	pool = rcu_dereference(sq->xsk.pool);
+	if (pool) {
+		done  = veth_xsk_tx_xmit(sq, pool, budget);
+	}
+	rcu_read_unlock();
+
+	if (done < budget) {
+		/* if done < budget, the tx ring is no buffer */
+		napi_complete_done(napi, done);
+	}
+
+	xdp_clear_return_frame_no_direct();
+
+	return done;
+}
+
+
+static int veth_napi_add_tx(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_sq *sq = &priv->sq[i];
+		netif_napi_add(dev, &sq->xdp_napi, veth_poll_tx);
+		napi_enable(&sq->xdp_napi);
+	}
+
+	return 0;
+}
+
 static int veth_create_page_pool(struct veth_rq *rq)
 {
 	struct page_pool_params pp_params = {
@@ -1153,6 +1325,19 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 	}
 }
 
+static void veth_napi_del_tx(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_sq *sq = &priv->sq[i];
+
+		napi_disable(&sq->xdp_napi);
+		__netif_napi_del(&sq->xdp_napi);
+	}
+}
+
 static void veth_napi_del(struct net_device *dev)
 {
 	veth_napi_del_range(dev, 0, dev->real_num_rx_queues);
@@ -1360,7 +1545,7 @@ static void veth_set_xdp_features(struct net_device *dev)
 		struct veth_priv *priv_peer = netdev_priv(peer);
 		xdp_features_t val = NETDEV_XDP_ACT_BASIC |
 				     NETDEV_XDP_ACT_REDIRECT |
-				     NETDEV_XDP_ACT_RX_SG;
+				     NETDEV_XDP_ACT_RX_SG | NETDEV_XDP_ACT_XSK_ZEROCOPY;
 
 		if (priv_peer->_xdp_prog || veth_gro_requested(peer))
 			val |= NETDEV_XDP_ACT_NDO_XMIT |
@@ -1737,11 +1922,89 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return err;
 }
 
+static int veth_xsk_pool_enable(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid)
+{
+	struct veth_priv *peer_priv;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer_dev = priv->peer;
+	int err = 0;
+
+	if (qid >= dev->real_num_tx_queues)
+		return -EINVAL;
+
+	if(!peer_dev)
+		return -EINVAL;
+
+	/* no DMA here, so we can skip the DMA check for xsk zero copy */
+	pool->dma_check_skip = true;
+
+	peer_priv = netdev_priv(peer_dev);
+	/*
+	*  enable peer napi here; this side's
+	*  xdp is enabled by veth_xdp_set
+	*  to do: we need to check whether this side has already enabled xdp,
+	*  as it may not have an xdp prog
+	*/
+	if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+		/*  peer should enable napi*/
+		err = veth_napi_enable(peer_dev);
+		if (err)
+			return err;
+	}
+
+	/* Here is already protected by rtnl_lock, so rcu_assign_pointer
+	 * is safe.
+	 */
+	rcu_assign_pointer(priv->sq[qid].xsk.pool, pool);
+
+	veth_napi_add_tx(dev);
+
+	return err;
+}
+
+static int veth_xsk_pool_disable(struct net_device *dev, u16 qid)
+{
+	struct veth_priv *peer_priv;
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer_dev = priv->peer;
+	int err = 0;
+
+	if (qid >= dev->real_num_tx_queues)
+		return -EINVAL;
+
+	if(!peer_dev)
+		return -EINVAL;
+
+	peer_priv = netdev_priv(peer_dev);
+
+	/* to do: this may fail */
+	if (!(peer_priv->_xdp_prog) && (!veth_gro_requested(peer_dev))) {
+		/*  disable peer napi */
+		veth_napi_del(peer_dev);
+	}
+
+	veth_napi_del_tx(dev);
+
+	rcu_assign_pointer(priv->sq[qid].xsk.pool, NULL);
+	return err;
+}
+
+/* this is for xsk pool setup */
+static int veth_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	if (xdp->xsk.pool)
+		return veth_xsk_pool_enable(dev, xdp->xsk.pool, xdp->xsk.queue_id);
+	else
+		return veth_xsk_pool_disable(dev, xdp->xsk.queue_id);
+}
+
 static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_SETUP_XSK_POOL:
+		return veth_xsk_pool_setup(dev, xdp);
 	default:
 		return -EINVAL;
 	}
-- 
2.20.1



* [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (4 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 21:01   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 07/10] sk_buff: add destructor_arg_xsk_pool for zero copy huangjie.albert
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Shmulik Ladkani, Kees Cook, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 944761807ca4..600225e27e9e 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1840,6 +1840,45 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static void veth_xsk_remote_trigger_napi(void *info)
+{
+	struct veth_sq *sq = info;
+
+	napi_schedule(&sq->xdp_napi);
+}
+
+static int veth_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+	struct veth_priv *priv;
+	struct veth_sq *sq;
+	u32 last_cpu, cur_cpu;
+
+	if (!netif_running(dev))
+		return -ENETDOWN;
+
+	if (qid >= dev->real_num_rx_queues)
+		return -EINVAL;
+
+	priv = netdev_priv(dev);
+	sq = &priv->sq[qid];
+
+	if (napi_if_scheduled_mark_missed(&sq->xdp_napi))
+		return 0;
+
+	last_cpu = sq->xsk.last_cpu;
+	cur_cpu = get_cpu();
+
+	/*  raise a napi */
+	if (last_cpu == cur_cpu) {
+		napi_schedule(&sq->xdp_napi);
+	} else {
+		smp_call_function_single(last_cpu, veth_xsk_remote_trigger_napi, sq, true);
+	}
+
+	put_cpu();
+	return 0;
+}
+
 static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			struct netlink_ext_ack *extack)
 {
@@ -2054,6 +2093,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
+	.ndo_xsk_wakeup		= veth_xsk_wakeup,
 	.ndo_get_peer_dev	= veth_peer_dev,
 };
 
-- 
2.20.1



* [RFC Optimizing veth xsk performance 07/10] sk_buff: add destructor_arg_xsk_pool for zero copy
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (5 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 08/10] xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX huangjie.albert
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Shmulik Ladkani, Kees Cook, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

This member is added for dummy devices (such as veth) to support zero copy.
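
A minimal sketch (not part of this patch) of the intended use; the real users
are added in the veth zero-copy patch later in this series, and the
single-address layout of destructor_arg assumed here is only for illustration:

#include <linux/skbuff.h>
#include <net/xdp_sock_drv.h>

static void example_zc_destructor(struct sk_buff *skb)
{
	struct xsk_buff_pool *pool = skb_shinfo(skb)->destructor_arg_xsk_pool;
	u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;

	/* return the stashed descriptor to the pool's completion queue */
	xsk_tx_completed_addr(pool, addr);
}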

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 include/linux/skbuff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 16a49ba534e4..fa9577d233a4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -592,6 +592,7 @@ struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void *		destructor_arg;
+	void *		destructor_arg_xsk_pool; /*  just for dummy device xsk zero copy */
 
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
-- 
2.20.1



* [RFC Optimizing veth xsk performance 08/10] xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (6 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 07/10] sk_buff: add destructor_arg_xsk_pool for zero copy huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp huangjie.albert
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Kees Cook, Shmulik Ladkani, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

This type of xdp memory will be used for zero copy in
a later patch.

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 include/net/xdp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..cb1621b5a0c9 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -42,6 +42,7 @@ enum xdp_mem_type {
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
 	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_XSK_BUFF_POOL,
+	MEM_TYPE_XSK_BUFF_POOL_TX,
 	MEM_TYPE_MAX,
 };
 
-- 
2.20.1



* [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (7 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 08/10] xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 21:05   ` Simon Horman
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp huangjie.albert
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

The following conditions need to be satisfied to achieve zero-copy
(see the sketch after this list):
1. The tx desc has enough space to store the xdp_frame and skb_shared_info.
2. The memory address pointed to by the tx desc is within a page.
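
An illustrative sketch of that eligibility test, mirroring the checks added in
this patch; veth_desc_can_zc() is a hypothetical helper name, and 'pool' and
'desc' are as used in veth_xsk_tx_xmit() in drivers/net/veth.c:

static bool veth_desc_can_zc(struct xsk_buff_pool *pool, struct xdp_desc *desc)
{
	void *addr = xsk_buff_raw_get_data(pool, desc->addr);

	/* 1. the headroom must be able to hold the xdp_frame metadata */
	if (pool->headroom < sizeof(struct xdp_frame))
		return false;

	/* 2. the payload must not cross a page boundary */
	if (PAGE_SIZE - offset_in_page(addr) < desc->len)
		return false;

	return true;
}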

Tested zero copy with libxdp.
Performance:
                      | MSS (bytes) | Packet rate (pps)
AF_XDP                | 1300        | 480 K
AF_XDP with zero copy | 1300        | 540 K

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 207 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 178 insertions(+), 29 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 600225e27e9e..e4f1a8345f42 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -103,6 +103,11 @@ struct veth_xdp_tx_bq {
 	unsigned int count;
 };
 
+struct veth_seg_info {
+	u32 segs;
+	u64 desc[] ____cacheline_aligned_in_smp;
+};
+
 /*
  * ethtool interface
  */
@@ -645,6 +650,100 @@ static int veth_xdp_tx(struct veth_rq *rq, struct xdp_buff *xdp,
 	return 0;
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static void veth_xsk_destruct_skb(struct sk_buff *skb)
+{
+	struct veth_seg_info *seg_info = (struct veth_seg_info *)skb_shinfo(skb)->destructor_arg;
+	struct xsk_buff_pool *pool = (struct xsk_buff_pool *)skb_shinfo(skb)->destructor_arg_xsk_pool;
+	unsigned long flags;
+	u32 index = 0;
+	u64 addr;
+
+	/* release cq */
+	spin_lock_irqsave(&pool->cq_lock, flags);
+	for (index = 0; index < seg_info->segs; index++) {
+		addr = (u64)(long)seg_info->desc[index];
+		xsk_tx_completed_addr(pool, addr);
+	}
+	spin_unlock_irqrestore(&pool->cq_lock, flags);
+
+	kfree(seg_info);
+	skb_shinfo(skb)->destructor_arg = NULL;
+	skb_shinfo(skb)->destructor_arg_xsk_pool = NULL;
+}
+
+static struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xsk_buff_pool *pool,
+					      struct xdp_desc *desc)
+{
+	struct veth_seg_info *seg_info;
+	struct sk_buff *skb;
+	struct page *page;
+	void *hard_start;
+	u32 len, ts;
+	void *buffer;
+	int headroom;
+	u64 addr;
+	u32 index;
+
+	addr = desc->addr;
+	len = desc->len;
+	buffer = xsk_buff_raw_get_data(pool, addr);
+	ts = pool->unaligned ? len : pool->chunk_size;
+
+	headroom = offset_in_page(buffer);
+
+	/* offset in umem pool buffer */
+	addr = buffer - pool->addrs;
+
+	/* get the page of the desc */
+	page = pool->umem->pgs[addr >> PAGE_SHIFT];
+
+	/* in order to avoid to get freed by kfree_skb */
+	get_page(page);
+
+	hard_start = page_to_virt(page);
+
+	skb = veth_build_skb(hard_start, headroom, len, ts);
+	seg_info = kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS), GFP_KERNEL);
+	if (!seg_info) {
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	/* later we will support gso for this */
+	index = skb_shinfo(skb)->gso_segs;
+	seg_info->desc[index] = desc->addr;
+	seg_info->segs = ++index;
+
+	skb->truesize += ts;
+	skb->dev = dev;
+	skb_shinfo(skb)->destructor_arg = (void *)(long)seg_info;
+	skb_shinfo(skb)->destructor_arg_xsk_pool = (void *)(long)pool;
+	skb->destructor = veth_xsk_destruct_skb;
+
+	/* set the mac header */
+	skb->protocol = eth_type_trans(skb, dev);
+
+	/* to do, add skb to sock. may be there is no need to do for this
+	*  refcount_add(ts, &xs->sk.sk_wmem_alloc);
+	*/
+	return skb;
+}
+
 static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 					  struct xdp_frame *frame,
 					  struct veth_xdp_tx_bq *bq,
@@ -1063,6 +1162,20 @@ static int veth_poll(struct napi_struct *napi, int budget)
 	return done;
 }
 
+/* check whether the buffer is contained within a single page */
+static inline bool buffer_in_page(void *buffer, u32 len)
+{
+	u32 offset;
+
+	offset = offset_in_page(buffer);
+
+	if(PAGE_SIZE - offset >= len) {
+		return true;
+	} else {
+		return false;
+	}
+}
+
 static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
 {
 	struct veth_priv *priv, *peer_priv;
@@ -1073,6 +1186,9 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 	struct veth_xdp_tx_bq bq;
 	struct xdp_desc desc;
 	void *xdpf;
+	struct sk_buff *skb = NULL;
+	bool zc = xsk_pool->umem->zc;
+	u32 xsk_headroom = xsk_pool->headroom;
 	int done = 0;
 
 	bq.count = 0;
@@ -1102,12 +1218,6 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 			break;
 		}
 
-		/*
-		* Get a xmit addr
-		* desc.addr is a offset, so we should to convert to real virtual address
-		*/
-		addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
-
 		/* can not hold all data in a page */
 		truesize =  SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + desc.len + sizeof(struct xdp_frame);
 		if (truesize > PAGE_SIZE) {
@@ -1116,16 +1226,39 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 			continue;
 		}
 
-		page = dev_alloc_page();
-		if (!page) {
-			/*
-			* error , release xdp frame and increase drops
-			*/
-			xsk_tx_completed_addr(xsk_pool, desc.addr);
-			stats.xdp_drops++;
-			break;
+		/*
+		* Get a xmit addr
+		* desc.addr is a offset, so we should to convert to real virtual address
+		*/
+		addr = xsk_buff_raw_get_data(xsk_pool, desc.addr);
+
+		/*
+		 * in order to support zero copy, headroom must have enough space to hold xdp_frame
+		 */
+		if (zc && (xsk_headroom < sizeof(struct xdp_frame)))
+			zc = false;
+
+		/*
+		 * if the desc is not contained within one page, zero copy is not supported either
+		*/
+		if (!buffer_in_page(addr, desc.len))
+			zc = false;
+
+		if (zc) {
+			/* headroom is reserved for xdp_frame */
+			new_addr = addr - sizeof(struct xdp_frame);
+		} else {
+			page = dev_alloc_page();
+			if (!page) {
+				/*
+				* error , release xdp frame and increase drops
+				*/
+				xsk_tx_completed_addr(xsk_pool, desc.addr);
+				stats.xdp_drops++;
+				break;
+			}
+			new_addr = page_to_virt(page);
 		}
-		new_addr = page_to_virt(page);
 
 		p_frame = new_addr;
 		new_addr += sizeof(struct xdp_frame);
@@ -1137,19 +1270,37 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 		 */
 		p_frame->headroom = 0;
 		p_frame->metasize = 0;
-		p_frame->frame_sz = PAGE_SIZE;
 		p_frame->flags = 0;
-		p_frame->mem.type = MEM_TYPE_PAGE_SHARED;
-		memcpy(p_frame->data, addr, p_frame->len);
-		xsk_tx_completed_addr(xsk_pool, desc.addr);
-
-		/* if peer have xdp prog, if it has ,just send to peer */
-		p_frame = veth_xdp_rcv_one(peer_rq, p_frame, &bq, &peer_stats);
-		/* if no xdp with this queue, convert to skb to xmit*/
-		if (p_frame) {
-			xdpf = p_frame;
-			veth_xdp_rcv_bulk_skb(peer_rq, &xdpf, 1, &bq, &peer_stats);
-			p_frame = NULL;
+
+		if (zc) {
+			p_frame->frame_sz = xsk_pool->frame_len;
+			/* to do: if there is a xdp, how to recycle the tx desc */
+			p_frame->mem.type = MEM_TYPE_XSK_BUFF_POOL_TX;
+			/* no need to copy the data for af_xdp */
+			p_frame = veth_xdp_rcv_one(peer_rq, p_frame, &bq, &peer_stats);
+			if (p_frame) {
+				skb = veth_build_skb_zerocopy(peer_dev, xsk_pool, &desc);
+				if (skb) {
+					napi_gro_receive(&peer_rq->xdp_napi, skb);
+					skb = NULL;
+				} else {
+					xsk_tx_completed_addr(xsk_pool, desc.addr);
+				}
+			}
+		} else {
+			p_frame->frame_sz = PAGE_SIZE;
+			p_frame->mem.type = MEM_TYPE_PAGE_SHARED;
+			memcpy(p_frame->data, addr, p_frame->len);
+			xsk_tx_completed_addr(xsk_pool, desc.addr);
+
+			/* if peer have xdp prog, if it has ,just send to peer */
+			p_frame = veth_xdp_rcv_one(peer_rq, p_frame, &bq, &peer_stats);
+			/* if no xdp with this queue, convert to skb to xmit*/
+			if (p_frame) {
+				xdpf = p_frame;
+				veth_xdp_rcv_bulk_skb(peer_rq, &xdpf, 1, &bq, &peer_stats);
+				p_frame = NULL;
+			}
 		}
 
 		stats.xdp_bytes += desc.len;
@@ -1163,8 +1314,6 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 		xsk_tx_release(xsk_pool);
 	}
 
-
-
 	/* just for peer rq */
 	if (peer_stats.xdp_tx > 0)
 		veth_xdp_flush(peer_rq, &bq);
-- 
2.20.1



* [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (8 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp huangjie.albert
@ 2023-08-03 14:04 ` huangjie.albert
  2023-08-04 21:12   ` Simon Horman
  2023-08-03 14:20 ` [RFC Optimizing veth xsk performance 00/10] Paolo Abeni
  2023-08-03 15:01 ` Jesper Dangaard Brouer
  11 siblings, 1 reply; 22+ messages in thread
From: huangjie.albert @ 2023-08-03 14:04 UTC (permalink / raw
  To: davem, edumazet, kuba, pabeni
  Cc: huangjie.albert, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

A typical topology is shown below:
veth<--------veth-peer
	1       |
		|2
		|
	      bridge<------->eth0(such as mlnx5 NIC)

If you use af_xdp to send packets from veth to a physical NIC,
they have to traverse several software paths, so we can borrow from
the kernel GSO implementation: when af_xdp sends packets out
from veth, aggregate them and send one large packet
from the veth virtual NIC to the physical NIC (see the sketch below).
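
A conceptual sketch of the batching decision (veth_desc_joins_batch() is a
hypothetical helper; the real logic lives in veth_build_skb_zerocopy_gso()
added below): descriptors whose IPv4/UDP 5-tuple and payload length match the
current batch are appended to the same skb as page fragments, and a mismatch
or a full batch flushes it towards the peer as one UDP GSO skb:

static bool veth_desc_joins_batch(struct veth_gso_tuple *tuple,
				  struct iphdr *iph, struct udphdr *udph,
				  struct sk_buff *batch_skb)
{
	if (!batch_skb)
		return false;		/* nothing to join, start a new batch */

	if (skb_shinfo(batch_skb)->gso_segs >= MAX_SKB_FRAGS - 1)
		return false;		/* batch is full, flush it */

	/* same flow and same segment size? */
	return gso_segment_match(tuple, iph, udph);
}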

performance (tested with the libxdp library):
AF_XDP without batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with   batch : 1.5 Mpps (with ksoftirqd 15% cpu)

With af_xdp batching, the libxdp user-space program becomes the bottleneck,
so the softirq does not reach its limit.

Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
---
 drivers/net/veth.c | 264 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 249 insertions(+), 15 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index e4f1a8345f42..b0dbd21089c8 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -29,6 +29,7 @@
 #include <net/page_pool.h>
 #include <net/xdp_sock_drv.h>
 #include <net/xdp.h>
+#include <net/udp.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -103,6 +104,18 @@ struct veth_xdp_tx_bq {
 	unsigned int count;
 };
 
+struct veth_gso_tuple {
+	__u8	protocol;
+	__be32	saddr;
+	__be32	daddr;
+	__be16	source;
+	__be16	dest;
+	__be16	gso_size;
+	__be16	gso_segs;
+	bool gso_enable;
+	bool gso_flush;
+};
+
 struct veth_seg_info {
 	u32 segs;
 	u64 desc[] ____cacheline_aligned_in_smp;
@@ -650,6 +663,84 @@ static int veth_xdp_tx(struct veth_rq *rq, struct xdp_buff *xdp,
 	return 0;
 }
 
+static struct sk_buff *veth_build_gso_head_skb(struct net_device *dev, char *buff, u32 tot_len, u32 headroom, u32 iph_len, u32 th_len)
+{
+	struct sk_buff *skb = NULL;
+	int err = 0;
+
+	skb = alloc_skb(tot_len, GFP_KERNEL);
+	if (unlikely(!skb))
+		return NULL;
+
+	/* header room contains the eth header */
+	skb_reserve(skb, headroom - ETH_HLEN);
+
+	skb_put(skb, ETH_HLEN + iph_len + th_len);
+
+	skb_shinfo(skb)->gso_segs = 0;
+
+	err = skb_store_bits(skb, 0, buff, ETH_HLEN + iph_len + th_len);
+	if (unlikely(err)) {
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	skb->protocol = eth_type_trans(skb, dev);
+	skb->network_header = skb->mac_header + ETH_HLEN;
+	skb->transport_header = skb->network_header + iph_len;
+	skb->ip_summed = CHECKSUM_PARTIAL;
+
+	return skb;
+}
+
+static inline bool gso_segment_match(struct veth_gso_tuple *gso_tuple, struct iphdr *iph, struct udphdr *udph)
+{
+	if (gso_tuple->protocol == iph->protocol &&
+		gso_tuple->saddr == iph->saddr &&
+		gso_tuple->daddr == iph->daddr &&
+		gso_tuple->source == udph->source &&
+		gso_tuple->dest == udph->dest &&
+		gso_tuple->gso_size == ntohs(udph->len))
+	{
+		gso_tuple->gso_flush = false;
+		return true;
+	} else {
+		gso_tuple->gso_flush = true;
+		return false;
+	}
+}
+
+static inline void gso_tuple_init(struct veth_gso_tuple *gso_tuple, struct iphdr *iph, struct udphdr *udph)
+{
+	gso_tuple->protocol = iph->protocol;
+	gso_tuple->saddr = iph->saddr;
+	gso_tuple->daddr = iph->daddr;
+	gso_tuple->source = udph->source;
+	gso_tuple->dest = udph->dest;
+	gso_tuple->gso_flush = false;
+	gso_tuple->gso_size = ntohs(udph->len);
+	gso_tuple->gso_segs = 0;
+}
+
+/* only ipv4 udp support gso now */
+static inline bool ip_hdr_gso_check(unsigned char *buff, u32 len)
+{
+	struct iphdr *iph;
+
+	if (len <= (ETH_HLEN + sizeof(*iph)))
+		return false;
+
+	iph = (struct iphdr *)(buff + ETH_HLEN);
+
+	/*
+	 * check for ip headers, if the data support gso
+	 */
+	if (iph->ihl < 5 || iph->version != 4 || len < (iph->ihl * 4 + ETH_HLEN) || iph->protocol != IPPROTO_UDP)
+		return false;
+
+	return true;
+}
+
 static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 				      int buflen)
 {
@@ -686,8 +777,8 @@ static void veth_xsk_destruct_skb(struct sk_buff *skb)
 	skb_shinfo(skb)->destructor_arg_xsk_pool = NULL;
 }
 
-static struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xsk_buff_pool *pool,
-					      struct xdp_desc *desc)
+static struct sk_buff *veth_build_skb_zerocopy_normal(struct net_device *dev,
+		struct xsk_buff_pool *pool, struct xdp_desc *desc)
 {
 	struct veth_seg_info *seg_info;
 	struct sk_buff *skb;
@@ -698,45 +789,133 @@ static struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xs
 	int headroom;
 	u64 addr;
 	u32 index;
-
 	addr = desc->addr;
 	len = desc->len;
 	buffer = xsk_buff_raw_get_data(pool, addr);
 	ts = pool->unaligned ? len : pool->chunk_size;
-
 	headroom = offset_in_page(buffer);
-
 	/* offset in umem pool buffer */
 	addr = buffer - pool->addrs;
-
 	/* get the page of the desc */
 	page = pool->umem->pgs[addr >> PAGE_SHIFT];
-
 	/* in order to avoid to get freed by kfree_skb */
 	get_page(page);
-
 	hard_start = page_to_virt(page);
-
 	skb = veth_build_skb(hard_start, headroom, len, ts);
 	seg_info = kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS), GFP_KERNEL);
 	if (!seg_info) {
 		kfree_skb(skb);
 		return NULL;
 	}
-
 	/* later we will support gso for this */
 	index = skb_shinfo(skb)->gso_segs;
 	seg_info->desc[index] = desc->addr;
 	seg_info->segs = ++index;
-
 	skb->truesize += ts;
 	skb->dev = dev;
 	skb_shinfo(skb)->destructor_arg = (void *)(long)seg_info;
 	skb_shinfo(skb)->destructor_arg_xsk_pool = (void *)(long)pool;
 	skb->destructor = veth_xsk_destruct_skb;
-
 	/* set the mac header */
 	skb->protocol = eth_type_trans(skb, dev);
+	/* to do, add skb to sock. may be there is no need to do for this
+	*  refcount_add(ts, &xs->sk.sk_wmem_alloc);
+	*/
+	return skb;
+}
+
+static struct sk_buff *veth_build_skb_zerocopy_gso(struct net_device *dev, struct xsk_buff_pool *pool,
+					      struct xdp_desc *desc, struct veth_gso_tuple *gso_tuple, struct sk_buff *prev_skb)
+{
+	u32 hr, len, ts, index, iph_len, th_len, data_offset, data_len, tot_len;
+	struct veth_seg_info *seg_info;
+	void *buffer;
+	struct udphdr *udph;
+	struct iphdr *iph;
+	struct sk_buff *skb;
+	struct page *page;
+	int hh_len = 0;
+	u64 addr;
+
+	addr = desc->addr;
+	len = desc->len;
+
+	/* l2 reserved len */
+	hh_len = LL_RESERVED_SPACE(dev);
+	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(hh_len));
+
+	/* data points to eth header */
+	buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
+
+	iph = (struct iphdr *)(buffer + ETH_HLEN);
+	iph_len = iph->ihl * 4;
+
+	udph = (struct udphdr *)(buffer + ETH_HLEN + iph_len);
+	th_len = sizeof(struct udphdr);
+
+	if (gso_tuple->gso_flush)
+		gso_tuple_init(gso_tuple, iph, udph);
+
+	ts = pool->unaligned ? len : pool->chunk_size;
+
+	data_offset = offset_in_page(buffer) + ETH_HLEN + iph_len + th_len;
+	data_len = len - (ETH_HLEN + iph_len + th_len);
+
+	/* head is null or this is a new 5 tuple */
+	if (NULL == prev_skb || !gso_segment_match(gso_tuple, iph, udph)) {
+		tot_len = hr + iph_len + th_len;
+		skb = veth_build_gso_head_skb(dev, buffer, tot_len, hr, iph_len, th_len);
+		if (!skb) {
+			/* to do: handle here for skb */
+			return NULL;
+		}
+
+		/* store information for gso */
+		seg_info = (struct veth_seg_info *)kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS), GFP_KERNEL);
+		if (!seg_info) {
+			/* to do */
+			kfree_skb(skb);
+			return NULL;
+		}
+	} else {
+		skb = prev_skb;
+		skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4 | SKB_GSO_PARTIAL;
+		skb_shinfo(skb)->gso_size = data_len;
+		skb->ip_summed = CHECKSUM_PARTIAL;
+
+		/* max segment is MAX_SKB_FRAGS */
+		if(skb_shinfo(skb)->gso_segs >= MAX_SKB_FRAGS - 1) {
+			gso_tuple->gso_flush = true;
+		}
+		seg_info = (struct veth_seg_info *)skb_shinfo(skb)->destructor_arg;
+	}
+
+	/* offset in umem pool buffer */
+	addr = buffer - pool->addrs;
+
+	/* get the page of the desc */
+	page = pool->umem->pgs[addr >> PAGE_SHIFT];
+
+	/* in order to avoid to get freed by kfree_skb */
+	get_page(page);
+
+	/* the desc data must not span two pages */
+	skb_fill_page_desc(skb, skb_shinfo(skb)->gso_segs, page, data_offset, data_len);
+
+	skb->len += data_len;
+	skb->data_len += data_len;
+	skb->truesize += ts;
+	skb->dev = dev;
+
+	/* later we will support gso for this */
+	index = skb_shinfo(skb)->gso_segs;
+	seg_info->desc[index] = desc->addr;
+	seg_info->segs = ++index;
+	skb_shinfo(skb)->gso_segs++;
+
+	skb_shinfo(skb)->destructor_arg = (void *)(long)seg_info;
+	skb_shinfo(skb)->destructor_arg_xsk_pool = (void *)(long)pool;
+	skb->destructor = veth_xsk_destruct_skb;
 
 	/* to do, add skb to sock. may be there is no need to do for this
 	*  refcount_add(ts, &xs->sk.sk_wmem_alloc);
@@ -744,6 +923,22 @@ static struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xs
 	return skb;
 }
 
+static inline struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xsk_buff_pool *pool,
+					      struct xdp_desc *desc, struct veth_gso_tuple *gso_tuple, struct sk_buff *prev_skb)
+{
+	void *buffer;
+
+	buffer = xsk_buff_raw_get_data(pool, desc->addr);
+	if (ip_hdr_gso_check(buffer, desc->len)) {
+		gso_tuple->gso_enable = true;
+		return veth_build_skb_zerocopy_gso(dev, pool, desc, gso_tuple, prev_skb);
+	} else {
+		gso_tuple->gso_flush = false;
+		gso_tuple->gso_enable = false;
+		return veth_build_skb_zerocopy_normal(dev, pool, desc);
+	}
+}
+
 static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 					  struct xdp_frame *frame,
 					  struct veth_xdp_tx_bq *bq,
@@ -1176,16 +1371,33 @@ static inline bool buffer_in_page(void *buffer, u32 len)
 	}
 }
 
+static inline void veth_skb_gso_check_update(struct sk_buff *skb)
+{
+	struct iphdr *iph = ip_hdr(skb);
+	struct udphdr *uh = udp_hdr(skb);
+	int ip_tot_len = skb->len;
+	int udp_len = skb->len - (skb->transport_header - skb->network_header);
+	iph->tot_len = htons(ip_tot_len);
+	ip_send_check(iph);
+	uh->len = htons(udp_len);
+	uh->check = 0;
+
+	/* udp4 checksum update */
+	udp4_hwcsum(skb, iph->saddr, iph->daddr);
+}
+
 static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
 {
 	struct veth_priv *priv, *peer_priv;
 	struct net_device *dev, *peer_dev;
+	struct veth_gso_tuple gso_tuple;
 	struct veth_rq *peer_rq;
 	struct veth_stats peer_stats = {};
 	struct veth_stats stats = {};
 	struct veth_xdp_tx_bq bq;
 	struct xdp_desc desc;
 	void *xdpf;
+	struct sk_buff *prev_skb = NULL;
 	struct sk_buff *skb = NULL;
 	bool zc = xsk_pool->umem->zc;
 	u32 xsk_headroom = xsk_pool->headroom;
@@ -1200,6 +1412,8 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 	/* todo: queue index must set before this */
 	peer_rq = &peer_priv->rq[sq->queue_index];
 
+	memset(&gso_tuple, 0, sizeof(gso_tuple));
+
 	/* set xsk wake up flag, to do: where to disable */
 	if (xsk_uses_need_wakeup(xsk_pool))
 		xsk_set_tx_need_wakeup(xsk_pool);
@@ -1279,12 +1493,26 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 			/* no need to copy address for af+xdp */
 			p_frame = veth_xdp_rcv_one(peer_rq, p_frame, &bq, &peer_stats);
 			if (p_frame) {
-				skb = veth_build_skb_zerocopy(peer_dev, xsk_pool, &desc);
-				if (skb) {
+				skb = veth_build_skb_zerocopy(peer_dev, xsk_pool, &desc, &gso_tuple, prev_skb);
+				if (!gso_tuple.gso_enable) {
 					napi_gro_receive(&peer_rq->xdp_napi, skb);
 					skb = NULL;
 				} else {
-					xsk_tx_completed_addr(xsk_pool, desc.addr);
+					if (prev_skb && gso_tuple.gso_flush) {
+						veth_skb_gso_check_update(prev_skb);
+						napi_gro_receive(&peer_rq->xdp_napi, prev_skb);
+
+						if (prev_skb == skb) {
+							skb = NULL;
+							prev_skb = NULL;
+						} else {
+							prev_skb = skb;
+						}
+					} else if (NULL == skb){
+						xsk_tx_completed_addr(xsk_pool, desc.addr);
+					} else {
+						prev_skb = skb;
+					}
 				}
 			}
 		} else {
@@ -1308,6 +1536,12 @@ static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool,
 		done++;
 	}
 
+	/* gso skb */
+	if (NULL!=skb) {
+		veth_skb_gso_check_update(skb);
+		napi_gro_receive(&peer_rq->xdp_napi, skb);
+	}
+
 	/* release, move consumer,and wakeup the producer */
 	if (done) {
 		napi_schedule(&peer_rq->xdp_napi);
-- 
2.20.1



* Re: [RFC Optimizing veth xsk performance 00/10]
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (9 preceding siblings ...)
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp huangjie.albert
@ 2023-08-03 14:20 ` Paolo Abeni
  2023-08-04  4:16   ` [External] " 黄杰
  2023-08-03 15:01 ` Jesper Dangaard Brouer
  11 siblings, 1 reply; 22+ messages in thread
From: Paolo Abeni @ 2023-08-03 14:20 UTC (permalink / raw
  To: huangjie.albert, davem, edumazet, kuba
  Cc: Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Pavel Begunkov, Yunsheng Lin,
	Kees Cook, Richard Gobert, open list:NETWORKING DRIVERS,
	open list, open list:XDP (eXpress Data Path)

On Thu, 2023-08-03 at 22:04 +0800, huangjie.albert wrote:
> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However, for virtual devices like veth, even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources. 
> This patch series focuses on optimizing the performance of AF_XDP sockets 
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work. 
> Patch 5 introduces tx queue and tx napi for packet transmission, while 
> patch 9 primarily implements zero-copy, and patch 10 adds support for 
> batch sending of IPv4 UDP packets. These optimizations significantly reduce 
> the software path and support checksum offload.
> 
> I tested those feature with
> A typical topology is shown below:
> veth<-->veth-peer                                    veth1-peer<--->veth1
> 	1       |                                                  |   7
> 	        |2                                                6|
> 	        |                                                  |
> 	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>                   3                    4                 5    
>              (machine1)                              (machine2)    
> AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> veth:(172.17.0.2/24)
> bridge:(172.17.0.1/24)
> eth0:(192.168.156.66/24)
> 
> eth1(172.17.0.2/24)
> bridge1:(172.17.0.1/24)
> eth0:(192.168.156.88/24)
> 
> after set default route, snat, dnat. we can have a tests
> to get the performance results.
> 
> packets send from veth to veth1:
> af_xdp test tool:
> link:https://github.com/cclinuxer/libxudp
> send:(veth)
> ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> recv:(veth1)
> ./objs/xudpperf recv --src 172.17.0.2:6002
> 
> udp test tool:iperf3
> send:(veth)
> iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u

Should be: '-b 0' otherwise you will experience additional overhead.

And you would likely pin processes and irqs to ensure BH and US run on
different cores of the same numa node.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 00/10]
  2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
                   ` (10 preceding siblings ...)
  2023-08-03 14:20 ` [RFC Optimizing veth xsk performance 00/10] Paolo Abeni
@ 2023-08-03 15:01 ` Jesper Dangaard Brouer
  11 siblings, 0 replies; 22+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-03 15:01 UTC (permalink / raw
  To: huangjie.albert, davem, edumazet, kuba, pabeni, Maryam Tahhan,
	Keith Wiles, Liang Chen
  Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Yunsheng Lin, Kees Cook,
	Richard Gobert, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)



On 03/08/2023 16.04, huangjie.albert wrote:
> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However, for virtual devices like veth, even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources.
> This patch series focuses on optimizing the performance of AF_XDP sockets
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> Patch 5 introduces tx queue and tx napi for packet transmission, while
> patch 9 primarily implements zero-copy, and patch 10 adds support for
> batch sending of IPv4 UDP packets. These optimizations significantly reduce
> the software path and support checksum offload.
> 
> I tested those feature with
> A typical topology is shown below:
> veth<-->veth-peer                                    veth1-peer<--->veth1
> 	1       |                                                  |   7
> 	        |2                                                6|
> 	        |                                                  |
> 	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>                    3                    4                 5
>               (machine1)                              (machine2)
> AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> veth:(172.17.0.2/24)
> bridge:(172.17.0.1/24)
> eth0:(192.168.156.66/24)
> 
> eth1(172.17.0.2/24)
> bridge1:(172.17.0.1/24)
> eth0:(192.168.156.88/24)
> 
> after set default route、snat、dnat. we can have a tests
> to get the performance results.
> 
> packets send from veth to veth1:
> af_xdp test tool:
> link:https://github.com/cclinuxer/libxudp
> send:(veth)
> ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> recv:(veth1)
> ./objs/xudpperf recv --src 172.17.0.2:6002
> 
> udp test tool:iperf3
> send:(veth)
> iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
> recv:(veth1)
> iperf3 -s -p 6002
> 
> performance:
> performance:(test weth libxdp lib)
> UDP                              : 250 Kpps (with 100% cpu)
> AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
> AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
> AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)
> 
> With af_xdp batch, the libxdp user-space program reaches a bottleneck.

Do you mean libxdp [1] or libxudp ?

[1] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp

> Therefore, the softirq did not reach the limit.
> 
> This is just an RFC patch series, and some code details still need
> further consideration. Please review this proposal.
>

I find this performance work interesting as we have customer requests
(via Maryam (cc)) to improve AF_XDP performance both natively and on veth.

Our benchmark is stored at:
  https://github.com/maryamtahhan/veth-benchmark

Great to see other companies also interested in this area.

--Jesper

> thanks!
> 
> huangjie.albert (10):
>    veth: Implement ethtool's get_ringparam() callback
>    xsk: add dma_check_skip for  skipping dma check
>    veth: add support for send queue
>    xsk: add xsk_tx_completed_addr function
>    veth: use send queue tx napi to xmit xsk tx desc
>    veth: add ndo_xsk_wakeup callback for veth
>    sk_buff: add destructor_arg_xsk_pool for zero copy
>    xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX
>    veth: support zero copy for af xdp
>    veth: af_xdp tx batch support for ipv4 udp
> 
>   drivers/net/veth.c          | 729 +++++++++++++++++++++++++++++++++++-
>   include/linux/skbuff.h      |   1 +
>   include/net/xdp.h           |   1 +
>   include/net/xdp_sock_drv.h  |   1 +
>   include/net/xsk_buff_pool.h |   1 +
>   net/xdp/xsk.c               |   6 +
>   net/xdp/xsk_buff_pool.c     |   3 +-
>   net/xdp/xsk_queue.h         |  11 +
>   8 files changed, 751 insertions(+), 2 deletions(-)
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [RFC Optimizing veth xsk performance 00/10]
  2023-08-03 14:20 ` [RFC Optimizing veth xsk performance 00/10] Paolo Abeni
@ 2023-08-04  4:16   ` 黄杰
  0 siblings, 0 replies; 22+ messages in thread
From: 黄杰 @ 2023-08-04  4:16 UTC (permalink / raw
  To: Paolo Abeni
  Cc: davem, edumazet, kuba, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Björn Töpel,
	Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
	Pavel Begunkov, Yunsheng Lin, Kees Cook, Richard Gobert,
	open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

Paolo Abeni <pabeni@redhat.com> wrote on Thu, Aug 3, 2023 at 22:20:
>
> On Thu, 2023-08-03 at 22:04 +0800, huangjie.albert wrote:
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However, for virtual devices like veth, even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 9 primarily implements zero-copy, and patch 10 adds support for
> > batch sending of IPv4 UDP packets. These optimizations significantly reduce
> > the software path and support checksum offload.
> >
> > I tested those feature with
> > A typical topology is shown below:
> > veth<-->veth-peer                                    veth1-peer<--->veth1
> >       1       |                                                  |   7
> >               |2                                                6|
> >               |                                                  |
> >             bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> >                   3                    4                 5
> >              (machine1)                              (machine2)
> > AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> > veth:(172.17.0.2/24)
> > bridge:(172.17.0.1/24)
> > eth0:(192.168.156.66/24)
> >
> > eth1(172.17.0.2/24)
> > bridge1:(172.17.0.1/24)
> > eth0:(192.168.156.88/24)
> >
> > after set default route . snat . dnat. we can have a tests
> > to get the performance results.
> >
> > packets send from veth to veth1:
> > af_xdp test tool:
> > link:https://github.com/cclinuxer/libxudp
> > send:(veth)
> > ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> > recv:(veth1)
> > ./objs/xudpperf recv --src 172.17.0.2:6002
> >
> > udp test tool:iperf3
> > send:(veth)
> > iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
>
> Should be: '-b 0' otherwise you will experience additional overhead.
>

with -b 0:
performance: (test veth with libxdp lib)
UDP                              : 320 Kpps (with 100% cpu)
AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)

thanks.

> And you would likely pin processes and irqs to ensure BH and US run on
> different cores of the same numa node.
>
> Cheers,
>
> Paolo
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback huangjie.albert
@ 2023-08-04 20:41   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 20:41 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Menglong Dong, Yunsheng Lin,
	Richard Gobert, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:27PM +0800, huangjie.albert wrote:
> some xsk libary calls get_ringparam() API to get the queue length

nit: libary -> library

Please consider using checkpatch.pl --codespell

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for  skipping dma check
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for skipping dma check huangjie.albert
@ 2023-08-04 20:42   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 20:42 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Shmulik Ladkani, Kees Cook,
	Richard Gobert, Yunsheng Lin, open list:NETWORKING DRIVERS,
	open list, open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:28PM +0800, huangjie.albert wrote:
> for the virtual net device such as veth, there is
> no need to do dma check if we support zero copy.
> 
> add this flag after unaligned. beacause there are 4 bytes hole

nit: beacause there are 4 bytes hole
  -> Because there is a 4 byte hole.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue huangjie.albert
@ 2023-08-04 20:44   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 20:44 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Kees Cook, Richard Gobert,
	Yunsheng Lin, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:29PM +0800, huangjie.albert wrote:

...

> @@ -69,11 +74,25 @@ struct veth_rq {
>  	struct page_pool	*page_pool;
>  };
>  
> +struct veth_sq {
> +	struct napi_struct	xdp_napi;
> +	struct net_device	*dev;
> +	struct xdp_mem_info	xdp_mem;
> +	struct veth_sq_stats	stats;
> +	u32 queue_index;
> +	/* this is for xsk */
> +	struct {
> +		struct xsk_buff_pool __rcu *pool;
> +		u32 last_cpu;
> +	}xsk;

nit: '}xsk;' -> '} xsk;'

Please consider running checkpatch.pl

...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function huangjie.albert
@ 2023-08-04 20:46   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 20:46 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Yunsheng Lin, Menglong Dong,
	Richard Gobert, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:30PM +0800, huangjie.albert wrote:

...

> index 13354a1e4280..a494d1dcb1c3 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -428,6 +428,17 @@ static inline void __xskq_prod_submit(struct xsk_queue *q, u32 idx)
>  	smp_store_release(&q->ring->producer, idx); /* B, matches C */
>  }
>  
> +

nit: one blank line is enough

> +static inline void xskq_prod_submit_addr(struct xsk_queue *q, u64 addr)
> +{
> +	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
> +	u32 idx = q->ring->producer;
> +
> +	ring->desc[idx++ & q->ring_mask] = addr;
> +
> +	__xskq_prod_submit(q, idx);
> +}
> +
>  static inline void xskq_prod_submit(struct xsk_queue *q)
>  {
>  	__xskq_prod_submit(q, q->cached_prod);
> -- 
> 2.20.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc huangjie.albert
@ 2023-08-04 20:59   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 20:59 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Kees Cook, Menglong Dong,
	Richard Gobert, Yunsheng Lin, open list:NETWORKING DRIVERS,
	open list, open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:31PM +0800, huangjie.albert wrote:

Please include a patch description.

> Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>

Please consider formatting this as:

	... Albert Huang <huangjie.albert@bytedance.com>

> ---
>  drivers/net/veth.c | 265 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 264 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 63c3ebe4c5d0..944761807ca4 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -27,6 +27,8 @@
>  #include <linux/bpf_trace.h>
>  #include <linux/net_tstamp.h>
>  #include <net/page_pool.h>
> +#include <net/xdp_sock_drv.h>
> +#include <net/xdp.h>
>  
>  #define DRV_NAME	"veth"
>  #define DRV_VERSION	"1.0"

> @@ -1061,6 +1063,176 @@ static int veth_poll(struct napi_struct *napi, int budget)
>  	return done;
>  }
>  
> +static int veth_xsk_tx_xmit(struct veth_sq *sq, struct xsk_buff_pool *xsk_pool, int budget)
> +{
> +	struct veth_priv *priv, *peer_priv;
> +	struct net_device *dev, *peer_dev;
> +	struct veth_rq *peer_rq;
> +	struct veth_stats peer_stats = {};
> +	struct veth_stats stats = {};
> +	struct veth_xdp_tx_bq bq;
> +	struct xdp_desc desc;
> +	void *xdpf;
> +	int done = 0;

Please try to use reverse xmas tree ordering - longest line to shortest -
for local variable declarations in new Networking code.

https://github.com/ecree-solarflare/xmastree is your friend here.
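
For illustration, the declarations quoted above re-ordered that way
would look roughly like this (purely an ordering sketch, no functional
change intended):

	struct veth_priv *priv, *peer_priv;
	struct net_device *dev, *peer_dev;
	struct veth_stats peer_stats = {};
	struct veth_stats stats = {};
	struct veth_xdp_tx_bq bq;
	struct veth_rq *peer_rq;
	struct xdp_desc desc;
	void *xdpf;
	int done = 0;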

> +
> +	bq.count = 0;
> +	dev = sq->dev;
> +	priv = netdev_priv(dev);
> +	peer_dev = priv->peer;

Sparse seems a bit unhappy about this.

  .../veth.c:1081:18: warning: incorrect type in assignment (different address spaces)
  .../veth.c:1081:18:    expected struct net_device *peer_dev
  .../veth.c:1081:18:    got struct net_device [noderef] __rcu *peer

Looking over existing code in this file, perhaps this is appropriate:

	peer_dev = rtnl_dereference(priv->peer);

Likewise in a few other places in this patch.
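
If this path only ever runs from the tx NAPI context, where RTNL is not
necessarily held, the RCU read-side pattern already used in the veth
fast path might be the better fit. A rough sketch under that
assumption (not tested):

	rcu_read_lock();
	peer_dev = rcu_dereference(priv->peer);
	if (unlikely(!peer_dev)) {
		rcu_read_unlock();
		return 0;
	}
	/* ... use peer_dev ... */
	rcu_read_unlock();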

...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth huangjie.albert
@ 2023-08-04 21:01   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 21:01 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Shmulik Ladkani, Kees Cook,
	Richard Gobert, Yunsheng Lin, open list:NETWORKING DRIVERS,
	open list, open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:32PM +0800, huangjie.albert wrote:
> Signed-off-by: huangjie.albert <huangjie.albert@bytedance.com>
> ---
>  drivers/net/veth.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 944761807ca4..600225e27e9e 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1840,6 +1840,45 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>  	rcu_read_unlock();
>  }
>  
> +static void veth_xsk_remote_trigger_napi(void *info)
> +{
> +	struct veth_sq *sq = info;
> +
> +	napi_schedule(&sq->xdp_napi);
> +}
> +
> +static int veth_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
> +{
> +	struct veth_priv *priv;
> +	struct veth_sq *sq;
> +	u32 last_cpu, cur_cpu;
> +
> +	if (!netif_running(dev))
> +		return -ENETDOWN;
> +
> +	if (qid >= dev->real_num_rx_queues)
> +		return -EINVAL;
> +
> +	priv = netdev_priv(dev);
> +	sq = &priv->sq[qid];
> +
> +	if (napi_if_scheduled_mark_missed(&sq->xdp_napi))
> +		return 0;
> +
> +	last_cpu = sq->xsk.last_cpu;
> +	cur_cpu = get_cpu();
> +
> +	/*  raise a napi */
> +	if (last_cpu == cur_cpu) {
> +		napi_schedule(&sq->xdp_napi);
> +	} else {
> +		smp_call_function_single(last_cpu, veth_xsk_remote_trigger_napi, sq, true);
> +	}

nit: no need for braces in the above.

	if (last_cpu == cur_cpu)
		napi_schedule(&sq->xdp_napi);
	else
		smp_call_function_single(last_cpu, veth_xsk_remote_trigger_napi, sq, true);

...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp huangjie.albert
@ 2023-08-04 21:05   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 21:05 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Yunsheng Lin, Kees Cook,
	Richard Gobert, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:35PM +0800, huangjie.albert wrote:

...

> +static struct sk_buff *veth_build_skb_zerocopy(struct net_device *dev, struct xsk_buff_pool *pool,
> +					      struct xdp_desc *desc)
> +{
> +	struct veth_seg_info *seg_info;
> +	struct sk_buff *skb;
> +	struct page *page;
> +	void *hard_start;
> +	u32 len, ts;
> +	void *buffer;
> +	int headroom;
> +	u64 addr;
> +	u32 index;
> +
> +	addr = desc->addr;
> +	len = desc->len;
> +	buffer = xsk_buff_raw_get_data(pool, addr);
> +	ts = pool->unaligned ? len : pool->chunk_size;
> +
> +	headroom = offset_in_page(buffer);
> +
> +	/* offset in umem pool buffer */
> +	addr = buffer - pool->addrs;
> +
> +	/* get the page of the desc */
> +	page = pool->umem->pgs[addr >> PAGE_SHIFT];
> +
> +	/* in order to avoid to get freed by kfree_skb */
> +	get_page(page);
> +
> +	hard_start = page_to_virt(page);
> +
> +	skb = veth_build_skb(hard_start, headroom, len, ts);
> +	seg_info = (struct veth_seg_info *)kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS), GFP_KERNEL);

There is no need to explicitly cast the return value of kmalloc,
as it returns void *.

	seg_info = kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS),
			   GFP_KERNEL);

...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp
  2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp huangjie.albert
@ 2023-08-04 21:12   ` Simon Horman
  0 siblings, 0 replies; 22+ messages in thread
From: Simon Horman @ 2023-08-04 21:12 UTC (permalink / raw
  To: huangjie.albert
  Cc: davem, edumazet, kuba, pabeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Pavel Begunkov, Yunsheng Lin, Kees Cook,
	Richard Gobert, open list:NETWORKING DRIVERS, open list,
	open list:XDP (eXpress Data Path)

On Thu, Aug 03, 2023 at 10:04:36PM +0800, huangjie.albert wrote:

...

> @@ -103,6 +104,18 @@ struct veth_xdp_tx_bq {
>  	unsigned int count;
>  };
>  
> +struct veth_gso_tuple {
> +	__u8	protocol;
> +	__be32	saddr;
> +	__be32	daddr;
> +	__be16	source;
> +	__be16	dest;
> +	__be16	gso_size;
> +	__be16	gso_segs;
> +	bool gso_enable;
> +	bool gso_flush;
> +};
> +
>  struct veth_seg_info {
>  	u32 segs;
>  	u64 desc[] ____cacheline_aligned_in_smp;

...

> +static inline bool gso_segment_match(struct veth_gso_tuple *gso_tuple, struct iphdr *iph, struct udphdr *udph)
> +{
> +	if (gso_tuple->protocol == iph->protocol &&
> +		gso_tuple->saddr == iph->saddr &&
> +		gso_tuple->daddr == iph->daddr &&
> +		gso_tuple->source == udph->source &&
> +		gso_tuple->dest == udph->dest &&
> +		gso_tuple->gso_size == ntohs(udph->len))

The type of the gso_size field is __be16,
but it is being assigned a host byte order value.

> +	{
> +		gso_tuple->gso_flush = false;
> +		return true;
> +	} else {
> +		gso_tuple->gso_flush = true;
> +		return false;
> +	}
> +}
> +
> +static inline void gso_tuple_init(struct veth_gso_tuple *gso_tuple, struct iphdr *iph, struct udphdr *udph)
> +{
> +	gso_tuple->protocol = iph->protocol;
> +	gso_tuple->saddr = iph->saddr;
> +	gso_tuple->daddr = iph->daddr;
> +	gso_tuple->source = udph->source;
> +	gso_tuple->dest = udph->dest;
> +	gso_tuple->gso_flush = false;
> +	gso_tuple->gso_size = ntohs(udph->len);


Likewise, here.

As flagged by Sparse.

  .../veth.c:721:29: warning: incorrect type in assignment (different base types)
  .../veth.c:721:29:    expected restricted __be16 [usertype] gso_size
  .../veth.c:721:29:    got unsigned short [usertype]
  .../veth.c:703:26: warning: restricted __be16 degrades to integer

> +	gso_tuple->gso_segs = 0;
> +}
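
One way to keep the byte orders consistent would be to cache these
fields in host byte order and keep the ntohs() conversions on both the
init and match sides. A sketch only, assuming the cached values are
never written back to the wire:

	struct veth_gso_tuple {
		__u8	protocol;
		__be32	saddr;
		__be32	daddr;
		__be16	source;
		__be16	dest;
		__u16	gso_size;	/* host byte order */
		__u16	gso_segs;
		bool	gso_enable;
		bool	gso_flush;
	};

gso_tuple_init() and gso_segment_match() would then both compare
host-order values, which should also silence the Sparse warnings above.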

...

> +static struct sk_buff *veth_build_skb_zerocopy_gso(struct net_device *dev, struct xsk_buff_pool *pool,
> +					      struct xdp_desc *desc, struct veth_gso_tuple *gso_tuple, struct sk_buff *prev_skb)

Please consider constraining line length to 80 columns.

> +{
> +	u32 hr, len, ts, index, iph_len, th_len, data_offset, data_len, tot_len;
> +	struct veth_seg_info *seg_info;
> +	void *buffer;
> +	struct udphdr *udph;
> +	struct iphdr *iph;
> +	struct sk_buff *skb;
> +	struct page *page;
> +	int hh_len = 0;
> +	u64 addr;
> +
> +	addr = desc->addr;
> +	len = desc->len;
> +
> +	/* l2 reserved len */
> +	hh_len = LL_RESERVED_SPACE(dev);
> +	hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(hh_len));
> +
> +	/* data points to eth header */
> +	buffer = (unsigned char *)xsk_buff_raw_get_data(pool, addr);
> +
> +	iph = (struct iphdr *)(buffer + ETH_HLEN);
> +	iph_len = iph->ihl * 4;
> +
> +	udph = (struct udphdr *)(buffer + ETH_HLEN + iph_len);
> +	th_len = sizeof(struct udphdr);
> +
> +	if (gso_tuple->gso_flush)
> +		gso_tuple_init(gso_tuple, iph, udph);
> +
> +	ts = pool->unaligned ? len : pool->chunk_size;
> +
> +	data_offset = offset_in_page(buffer) + ETH_HLEN + iph_len + th_len;
> +	data_len = len - (ETH_HLEN + iph_len + th_len);
> +
> +	/* head is null or this is a new 5 tuple */
> +	if (NULL == prev_skb || !gso_segment_match(gso_tuple, iph, udph)) {
> +		tot_len = hr + iph_len + th_len;
> +		skb = veth_build_gso_head_skb(dev, buffer, tot_len, hr, iph_len, th_len);
> +		if (!skb) {
> +			/* to do: handle here for skb */
> +			return NULL;
> +		}
> +
> +		/* store information for gso */
> +		seg_info = (struct veth_seg_info *)kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS), GFP_KERNEL);

No need to cast the return value of kmalloc, its type is void *.

		seg_info = kmalloc(struct_size(seg_info, desc, MAX_SKB_FRAGS),
				   GFP_KERNEL);
> +		if (!seg_info) {
> +			/* to do */
> +			kfree_skb(skb);
> +			return NULL;
> +		}

...

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-08-04 21:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-03 14:04 [RFC Optimizing veth xsk performance 00/10] huangjie.albert
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback huangjie.albert
2023-08-04 20:41   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 02/10] xsk: add dma_check_skip for skipping dma check huangjie.albert
2023-08-04 20:42   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 03/10] veth: add support for send queue huangjie.albert
2023-08-04 20:44   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 04/10] xsk: add xsk_tx_completed_addr function huangjie.albert
2023-08-04 20:46   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 05/10] veth: use send queue tx napi to xmit xsk tx desc huangjie.albert
2023-08-04 20:59   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 06/10] veth: add ndo_xsk_wakeup callback for veth huangjie.albert
2023-08-04 21:01   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 07/10] sk_buff: add destructor_arg_xsk_pool for zero copy huangjie.albert
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 08/10] xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX huangjie.albert
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 09/10] veth: support zero copy for af xdp huangjie.albert
2023-08-04 21:05   ` Simon Horman
2023-08-03 14:04 ` [RFC Optimizing veth xsk performance 10/10] veth: af_xdp tx batch support for ipv4 udp huangjie.albert
2023-08-04 21:12   ` Simon Horman
2023-08-03 14:20 ` [RFC Optimizing veth xsk performance 00/10] Paolo Abeni
2023-08-04  4:16   ` [External] " 黄杰
2023-08-03 15:01 ` Jesper Dangaard Brouer
