* [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
@ 2024-01-17 21:56 Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

Hi, 

I am continuing the work on the eBPF-based Qdisc based on Cong’s previous
RFC. The following are some use cases of an eBPF Qdisc:

1. Allow customizing Qdiscs in an easier way, so that people don't
   have to write a complete Qdisc kernel module just to experiment
   with some new queueing theory.

2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
   runs before enqueue, so it is impossible to adjust those "tokens"
   after packets get dropped in enqueue. With an eBPF Qdisc, this is
   easily solved with a map shared between clsact and sch_bpf (see the
   sketch after this list).

3. Replace qevents, as now the user gains much more control over the
   skb and queues.

4. Provide a new way to reuse TC filters. Currently TC relies on filter
   chains and blocks to reuse TC filters, but they are too complicated
   to understand. With the kfunc bpf_skb_tc_classify(), we can invoke
   TC filters on _any_ Qdisc (even on a different netdev) to do the
   classification.

5. Potentially pave a way for ingress to queue packets, although
   current implementation is still only for egress.
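
For use case 2, a minimal sketch of what the shared state could look like
(names and fields here are illustrative, not taken from the samples in this
series): the clsact EDT program stamps the next departure time into the map,
and the sch_bpf enqueue program can roll it back when it drops a packet.

struct edt_flow {
	__u64 t_next_ns;	/* earliest departure time set by clsact */
	__u64 rate_bps;		/* pacing rate of the flow */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u32);		/* flow hash */
	__type(value, struct edt_flow);
} edt_state SEC(".maps");

Both the clsact program and the sch_bpf enqueue program would include this
map definition so that the same entries are visible on both sides.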

I’ve combed through previous comments and appreciate the feedback. The
major changes in this RFC are the use of a kptr to sk_buff to maintain the
validity of the skb during its lifetime in the Qdisc, the removal of rbtree
maps, and the inclusion of two examples.

Some questions for discussion:

1. We now pass a trusted kptr of sk_buff to the program instead of
   __sk_buff. This makes most helpers that take __sk_buff incompatible
   with the eBPF qdisc. An alternative is to still use __sk_buff in the
   context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
   this can only be applied to the enqueue program, since in the dequeue
   program skbs do not come from the ctx but from kptrs exchanged out of
   maps (i.e., there is no __sk_buff). Any suggestion for making the skb
   kptr and helper functions compatible?

2. The current patchset uses netlink. Do we also want to use bpf_link
   for attachment?

3. People have suggested struct_ops. We chose not to use struct_ops since
   users might want to create multiple bpf qdiscs with different
   implementations. The current struct_ops attachment model does not seem
   to support replacing only the functions of a specific instance of a
   module, but I might be wrong.

Todo:
  - Add selftest

  - Make bpf list/rbtree use list/rbnode in skb so that developers
    don't need to allocate bpf objects for storing skb kptrs.

Note:
  - This patchset requires bpf support for exchanging kptrs into allocated
    objects (local kptrs), which Dave Marchevsky is working on.

  - The user space programs in the samples are adapted from the example
    Peihao Yang wrote in the RFC v5 thread.

---
v7: Reference skb using kptr to sk_buff instead of __sk_buff
    Use the new bpf rbtree/list for skb queues
    Add reset and init programs
    Add a bpf fq qdisc sample
    Add a bpf netem qdisc sample

v6: switch to kptr based approach

v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
    implement flow map as map-in-map
    rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
    clean up eBPF qdisc program context

v4: get rid of PIFO, use rbtree directly

v3: move priority queue from sch_bpf to skb map
    introduce skb map and its helpers
    introduce bpf_skb_classify()
    use netdevice notifier to reset skb's
    Rebase on latest bpf-next

v2: Rebase on latest net-next
    Make the code more complete (but still incomplete)

Amery Hung (5):
  net_sched: Add reset program
  net_sched: Add init program
  tools/libbpf: Add support for BPF_PROG_TYPE_QDISC
  samples/bpf: Add an example of bpf fq qdisc
  samples/bpf: Add an example of bpf netem qdisc

Cong Wang (3):
  net_sched: Introduce eBPF based Qdisc
  net_sched: Add kfuncs for working with skb
  net_sched: Introduce kfunc bpf_skb_tc_classify()

 include/linux/bpf_types.h       |   4 +
 include/uapi/linux/bpf.h        |  23 +
 include/uapi/linux/pkt_sched.h  |  24 ++
 kernel/bpf/btf.c                |   5 +
 kernel/bpf/helpers.c            |   1 +
 kernel/bpf/syscall.c            |  10 +
 net/core/filter.c               | 100 +++++
 net/sched/Kconfig               |  15 +
 net/sched/Makefile              |   1 +
 net/sched/sch_bpf.c             | 729 ++++++++++++++++++++++++++++++++
 samples/bpf/Makefile            |  14 +-
 samples/bpf/bpf_experimental.h  | 134 ++++++
 samples/bpf/tc_clsact_edt.bpf.c | 103 +++++
 samples/bpf/tc_sch_fq.bpf.c     | 666 +++++++++++++++++++++++++++++
 samples/bpf/tc_sch_fq.c         | 321 ++++++++++++++
 samples/bpf/tc_sch_netem.bpf.c  | 256 +++++++++++
 samples/bpf/tc_sch_netem.c      | 347 +++++++++++++++
 tools/include/uapi/linux/bpf.h  |  23 +
 tools/lib/bpf/libbpf.c          |   4 +
 19 files changed, 2779 insertions(+), 1 deletion(-)
 create mode 100644 net/sched/sch_bpf.c
 create mode 100644 samples/bpf/bpf_experimental.h
 create mode 100644 samples/bpf/tc_clsact_edt.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.c
 create mode 100644 samples/bpf/tc_sch_netem.bpf.c
 create mode 100644 samples/bpf/tc_sch_netem.c

-- 
2.20.1



* [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-20 11:41   ` kernel test robot
                     ` (3 more replies)
  2024-01-17 21:56 ` [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb Amery Hung
                   ` (7 subsequent siblings)
  8 siblings, 4 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

From: Cong Wang <xiyou.wangcong@gmail.com>

Introduce a new Qdisc which is completely managed by eBPF programs
of type BPF_PROG_TYPE_QDISC. It accepts two eBPF programs of the
same type: one for enqueue and the other for dequeue.

It interacts with the Qdisc layer in two ways:
1) It relies on the Qdisc watchdog to handle throttling;
2) It can pass skb enqueue/dequeue down to child classes.

The context is used differently for enqueue and dequeue, as shown below:

┌──────────┬───────────────┬──────────────────────────────────┐
│ prog     │     input     │              output              │
├──────────┼───────────────┼──────────────────────────────────┤
│          │ ctx->skb      │ SCH_BPF_THROTTLE: ctx->expire    │
│          │               │                   ctx->delta_ns  │
│          │ ctx->classid  │                                  │
│          │               │ SCH_BPF_QUEUED: None             │
│          │               │                                  │
│          │               │ SCH_BPF_BYPASS: None             │
│ enqueue  │               │                                  │
│          │               │ SCH_BPF_STOLEN: None             │
│          │               │                                  │
│          │               │ SCH_BPF_DROP: None               │
│          │               │                                  │
│          │               │ SCH_BPF_CN: None                 │
│          │               │                                  │
│          │               │ SCH_BPF_PASS: ctx->classid       │
├──────────┼───────────────┼──────────────────────────────────┤
│          │ ctx->classid  │ SCH_BPF_THROTTLE: ctx->expire    │
│          │               │                   ctx->delta_ns  │
│          │               │                                  │
│ dequeue  │               │ SCH_BPF_DEQUEUED: None           │
│          │               │                                  │
│          │               │ SCH_BPF_DROP: None               │
│          │               │                                  │
│          │               │ SCH_BPF_PASS: ctx->classid       │
└──────────┴───────────────┴──────────────────────────────────┘
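
For illustration, a dequeue program written against this context could look
like the following sketch (values and the classid are made up; the kfuncs
for handing an skb back to the kernel are added in the next patch):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

__u64 next_tx_ns;	/* maintained elsewhere by the enqueue program */

SEC("qdisc/dequeue")
int dequeue_prog(struct bpf_qdisc_ctx *ctx)
{
	__u64 now = bpf_ktime_get_ns();

	if (next_tx_ns > now) {
		ctx->expire = next_tx_ns;	/* rearm the qdisc watchdog */
		ctx->delta_ns = 100000;		/* tolerated slack in ns */
		return SCH_BPF_THROTTLE;
	}

	/* alternatively, defer to a child class configured under this qdisc */
	ctx->classid = 0x10001;			/* illustrative 1:1 classid */
	return SCH_BPF_PASS;
}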

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/linux/bpf_types.h      |   4 +
 include/uapi/linux/bpf.h       |  21 ++
 include/uapi/linux/pkt_sched.h |  16 +
 kernel/bpf/btf.c               |   5 +
 kernel/bpf/helpers.c           |   1 +
 kernel/bpf/syscall.c           |   8 +
 net/core/filter.c              |  96 ++++++
 net/sched/Kconfig              |  15 +
 net/sched/Makefile             |   1 +
 net/sched/sch_bpf.c            | 537 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  21 ++
 11 files changed, 725 insertions(+)
 create mode 100644 net/sched/sch_bpf.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a4247377e951..3e35033a9126 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -83,6 +83,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
 BPF_PROG_TYPE(BPF_PROG_TYPE_NETFILTER, netfilter,
 	      struct bpf_nf_ctx, struct bpf_nf_ctx)
 #endif
+#ifdef CONFIG_NET
+BPF_PROG_TYPE(BPF_PROG_TYPE_QDISC, tc_qdisc,
+	      struct bpf_qdisc_ctx, struct bpf_qdisc_ctx)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0bb92414c036..df280bbb7c0d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -997,6 +997,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	BPF_PROG_TYPE_QDISC,
 };
 
 enum bpf_attach_type {
@@ -1056,6 +1057,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_UNIX_GETSOCKNAME,
 	BPF_NETKIT_PRIMARY,
 	BPF_NETKIT_PEER,
+	BPF_QDISC_ENQUEUE,
+	BPF_QDISC_DEQUEUE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -7357,4 +7360,22 @@ struct bpf_iter_num {
 	__u64 __opaque[1];
 } __attribute__((aligned(8)));
 
+struct bpf_qdisc_ctx {
+	__bpf_md_ptr(struct sk_buff *, skb);
+	__u32 classid;
+	__u64 expire;
+	__u64 delta_ns;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+	SCH_BPF_BYPASS,
+	SCH_BPF_STOLEN,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index f762a10bfb78..d05462309f5a 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1317,4 +1317,20 @@ enum {
 
 #define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
 
+#define TCA_SCH_BPF_FLAG_DIRECT _BITUL(0)
+enum {
+	TCA_SCH_BPF_UNSPEC,
+	TCA_SCH_BPF_ENQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_ENQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_ENQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_DEQUEUE_PROG_NAME,	/* string */
+	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	__TCA_SCH_BPF_MAX,
+};
+
+#define TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1)
+
 #endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 15d71d2986d3..ee8d6c127b04 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -217,6 +217,7 @@ enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_SOCKET_FILTER,
 	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_QDISC,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -5928,6 +5929,8 @@ static bool prog_args_trusted(const struct bpf_prog *prog)
 		return bpf_lsm_is_trusted(prog);
 	case BPF_PROG_TYPE_STRUCT_OPS:
 		return true;
+	case BPF_PROG_TYPE_QDISC:
+		return true;
 	default:
 		return false;
 	}
@@ -7865,6 +7868,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_LWT;
 	case BPF_PROG_TYPE_NETFILTER:
 		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_QDISC:
+		return BTF_KFUNC_HOOK_QDISC;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 56b0c1f678ee..d5e581ccd9a0 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2610,6 +2610,7 @@ static int __init kfunc_init(void)
 
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
 	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
 						  ARRAY_SIZE(generic_dtors),
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 13eb50446e7a..1838bddd8526 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2502,6 +2502,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		if (expected_attach_type == BPF_NETFILTER)
 			return 0;
 		return -EINVAL;
+	case BPF_PROG_TYPE_QDISC:
+		switch (expected_attach_type) {
+		case BPF_QDISC_ENQUEUE:
+		case BPF_QDISC_DEQUEUE:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	case BPF_PROG_TYPE_SYSCALL:
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
diff --git a/net/core/filter.c b/net/core/filter.c
index 383f96b0a1c7..f25a0b6b5d56 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8889,6 +8889,90 @@ static int tc_cls_act_btf_struct_access(struct bpf_verifier_log *log,
 	return ret;
 }
 
+static int tc_qdisc_prologue(struct bpf_insn *insn_buf, bool direct_write,
+			     const struct bpf_prog *prog)
+{
+	return bpf_unclone_prologue(insn_buf, direct_write, prog,
+				    SCH_BPF_DROP);
+}
+
+BTF_ID_LIST_SINGLE(tc_qdisc_ctx_access_btf_ids, struct, sk_buff)
+
+static bool tc_qdisc_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     const struct bpf_prog *prog,
+				     struct bpf_insn_access_aux *info)
+{
+	struct btf *btf;
+
+	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_qdisc_ctx, skb):
+		if (type == BPF_WRITE ||
+		    size != sizeof_field(struct bpf_qdisc_ctx, skb))
+			return false;
+
+		if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
+			return false;
+
+		btf = bpf_get_btf_vmlinux();
+		if (IS_ERR_OR_NULL(btf))
+			return false;
+
+		info->btf = btf;
+		info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
+		info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
+		return true;
+	case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
+		return size == sizeof_field(struct bpf_qdisc_ctx, classid);
+	case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
+		return size == sizeof_field(struct bpf_qdisc_ctx, expire);
+	case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
+		return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
+	default:
+		return false;
+	}
+
+	return false;
+}
+
+static int tc_qdisc_btf_struct_access(struct bpf_verifier_log *log,
+				      const struct bpf_reg_state *reg,
+				      int off, int size)
+{
+	const struct btf_type *skbt, *t;
+	size_t end;
+
+	skbt = btf_type_by_id(reg->btf, tc_qdisc_ctx_access_btf_ids[0]);
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t != skbt)
+		return -EACCES;
+
+	switch (off) {
+	case offsetof(struct sk_buff, cb) ...
+	     offsetofend(struct sk_buff, cb) - 1:
+		end = offsetofend(struct sk_buff, cb);
+		break;
+	case offsetof(struct sk_buff, tstamp):
+		end = offsetofend(struct sk_buff, tstamp);
+		break;
+	default:
+		bpf_log(log, "no write support to skb at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (off + size > end) {
+		bpf_log(log,
+			"write access at off %d with size %d beyond the member of sk_buff ended at %zu\n",
+			off, size, end);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
 static bool __is_valid_xdp_access(int off, int size)
 {
 	if (off < 0 || off >= sizeof(struct xdp_md))
@@ -10890,6 +10974,18 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops tc_qdisc_verifier_ops = {
+	.get_func_proto		= tc_cls_act_func_proto,
+	.is_valid_access	= tc_qdisc_is_valid_access,
+	.gen_prologue		= tc_qdisc_prologue,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+	.btf_struct_access	= tc_qdisc_btf_struct_access,
+};
+
+const struct bpf_prog_ops tc_qdisc_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops xdp_verifier_ops = {
 	.get_func_proto		= xdp_func_proto,
 	.is_valid_access	= xdp_is_valid_access,
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 470c70deffe2..e4ece091af4d 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -403,6 +403,21 @@ config NET_SCH_ETS
 
 	  If unsure, say N.
 
+config NET_SCH_BPF
+	tristate "eBPF based programmable queue discipline"
+	help
+	  This eBPF based queue discipline offers a way to program your
+	  own packet scheduling algorithm. This is a classful qdisc which
+	  also allows you to decide the hierarchy.
+
+	  Say Y here if you want to use the eBPF based programmable queue
+	  discipline.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_bpf.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b5fd49641d91..4e24c6c79cb8 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE)	+= sch_fq_pie.o
 obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 obj-$(CONFIG_NET_SCH_ETF)	+= sch_etf.o
 obj-$(CONFIG_NET_SCH_TAPRIO)	+= sch_taprio.o
+obj-$(CONFIG_NET_SCH_BPF)	+= sch_bpf.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
new file mode 100644
index 000000000000..56f3ab9c6059
--- /dev/null
+++ b/net/sched/sch_bpf.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Programmable Qdisc with eBPF
+ *
+ * Copyright (C) 2022, ByteDance, Cong Wang <cong.wang@bytedance.com>
+ */
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+#define ACT_BPF_NAME_LEN	256
+
+struct sch_bpf_prog {
+	struct bpf_prog *prog;
+	const char *name;
+};
+
+struct sch_bpf_class {
+	struct Qdisc_class_common common;
+	struct Qdisc *qdisc;
+
+	unsigned int drops;
+	unsigned int overlimits;
+	struct gnet_stats_basic_sync bstats;
+};
+
+struct bpf_sched_data {
+	struct tcf_proto __rcu *filter_list; /* optional external classifier */
+	struct tcf_block *block;
+	struct Qdisc_class_hash clhash;
+	struct sch_bpf_prog __rcu enqueue_prog;
+	struct sch_bpf_prog __rcu dequeue_prog;
+
+	struct qdisc_watchdog watchdog;
+};
+
+static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb,
+			     int name, int id, int tag)
+{
+	struct nlattr *nla;
+
+	if (prog->name &&
+	    nla_put_string(skb, name, prog->name))
+		return -EMSGSIZE;
+
+	if (nla_put_u32(skb, id, prog->prog->aux->id))
+		return -EMSGSIZE;
+
+	nla = nla_reserve(skb, tag, sizeof(prog->prog->tag));
+	if (!nla)
+		return -EMSGSIZE;
+
+	memcpy(nla_data(nla), prog->prog->tag, nla_len(nla));
+	return 0;
+}
+
+static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG))
+		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
+			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	return 0;
+}
+
+static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct Qdisc_class_common *clc;
+
+	clc = qdisc_class_find(&q->clhash, classid);
+	if (!clc)
+		return NULL;
+	return container_of(clc, struct sch_bpf_class, common);
+}
+
+static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			   struct sk_buff **to_free)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	unsigned int len = qdisc_pkt_len(skb);
+	struct bpf_qdisc_ctx ctx = {};
+	int res = NET_XMIT_SUCCESS;
+	struct sch_bpf_class *cl;
+	struct bpf_prog *enqueue;
+
+	enqueue = rcu_dereference(q->enqueue_prog.prog);
+	if (!enqueue)
+		return NET_XMIT_DROP;
+
+	ctx.skb = skb;
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(enqueue, &ctx);
+	switch (res) {
+	case SCH_BPF_THROTTLE:
+		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
+		qdisc_qstats_overlimit(sch);
+		fallthrough;
+	case SCH_BPF_QUEUED:
+		qdisc_qstats_backlog_inc(sch, skb);
+		return NET_XMIT_SUCCESS;
+	case SCH_BPF_BYPASS:
+		qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	case SCH_BPF_STOLEN:
+		__qdisc_drop(skb, to_free);
+		return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+	case SCH_BPF_CN:
+		return NET_XMIT_CN;
+	case SCH_BPF_PASS:
+		break;
+	default:
+		return qdisc_drop(skb, sch, to_free);
+	}
+
+	cl = sch_bpf_find(sch, ctx.classid);
+	if (!cl || !cl->qdisc)
+		return qdisc_drop(skb, sch, to_free);
+
+	res = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (res != NET_XMIT_SUCCESS) {
+		if (net_xmit_drop_count(res)) {
+			qdisc_qstats_drop(sch);
+			cl->drops++;
+		}
+		return res;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return res;
+}
+
+DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
+
+static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct bpf_qdisc_ctx ctx = {};
+	struct sk_buff *skb = NULL;
+	struct bpf_prog *dequeue;
+	struct sch_bpf_class *cl;
+	int res;
+
+	dequeue = rcu_dereference(q->dequeue_prog.prog);
+	if (!dequeue)
+		return NULL;
+
+	__this_cpu_write(bpf_skb_dequeue, NULL);
+	ctx.classid = sch->handle;
+	res = bpf_prog_run(dequeue, &ctx);
+	switch (res) {
+	case SCH_BPF_DEQUEUED:
+		skb = __this_cpu_read(bpf_skb_dequeue);
+		qdisc_bstats_update(sch, skb);
+		qdisc_qstats_backlog_dec(sch, skb);
+		break;
+	case SCH_BPF_THROTTLE:
+		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
+		qdisc_qstats_overlimit(sch);
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (cl)
+			cl->overlimits++;
+		return NULL;
+	case SCH_BPF_PASS:
+		cl = sch_bpf_find(sch, ctx.classid);
+		if (!cl || !cl->qdisc)
+			return NULL;
+		skb = qdisc_dequeue_peeked(cl->qdisc);
+		if (skb) {
+			bstats_update(&cl->bstats, skb);
+			qdisc_bstats_update(sch, skb);
+			qdisc_qstats_backlog_dec(sch, skb);
+			sch->q.qlen--;
+		}
+		break;
+	}
+
+	return skb;
+}
+
+static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	return cl->qdisc;
+}
+
+static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+			 struct Qdisc **old, struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+
+	if (new)
+		*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent,
+				  u32 classid)
+{
+	return 0;
+}
+
+static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle)
+{
+	return (unsigned long)sch_bpf_find(sch, handle);
+}
+
+static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl,
+					   struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return q->block;
+}
+
+static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
+	[TCA_SCH_BPF_ENQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_ENQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
+};
+
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+{
+	struct bpf_prog *fp, *old_fp;
+	char *prog_name = NULL;
+	u32 bpf_fd;
+
+	if (!fd)
+		return -EINVAL;
+	bpf_fd = nla_get_u32(fd);
+
+	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC);
+	if (IS_ERR(fp))
+		return PTR_ERR(fp);
+
+	if (name) {
+		prog_name = nla_memdup(name, GFP_KERNEL);
+		if (!prog_name) {
+			bpf_prog_put(fp);
+			return -ENOMEM;
+		}
+	}
+
+	prog->name = prog_name;
+
+	/* updates to prog->prog are prevented since the caller holds
+	 * sch_tree_lock
+	 */
+	old_fp = rcu_replace_pointer(prog->prog, fp, 1);
+	if (old_fp)
+		bpf_prog_put(old_fp);
+
+	return 0;
+}
+
+static void bpf_cleanup_prog(struct sch_bpf_prog *prog)
+{
+	struct bpf_prog *old_fp = NULL;
+
+	/* updates to prog->prog are prevented since the caller holds
+	 * sch_tree_lock
+	 */
+	old_fp = rcu_replace_pointer(prog->prog, old_fp, 1);
+	if (old_fp)
+		bpf_prog_put(old_fp);
+
+	kfree(prog->name);
+}
+
+static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_SCH_BPF_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt,
+					  sch_bpf_policy, NULL);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+	if (err)
+		goto failure;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+failure:
+	sch_tree_unlock(sch);
+	return err;
+}
+
+static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
+			struct netlink_ext_ack *extack)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	int err;
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+	if (opt) {
+		err = sch_bpf_change(sch, opt, extack);
+		if (err)
+			return err;
+	}
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	return qdisc_class_hash_init(&q->clhash);
+}
+
+static void sch_bpf_reset(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (cl->qdisc)
+				qdisc_reset(cl->qdisc);
+		}
+	}
+
+	qdisc_watchdog_cancel(&q->watchdog);
+}
+
+static void sch_bpf_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
+{
+	qdisc_put(cl->qdisc);
+	kfree(cl);
+}
+
+static void sch_bpf_destroy(struct Qdisc *sch)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	qdisc_watchdog_cancel(&q->watchdog);
+	tcf_block_put(q->block);
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			sch_bpf_destroy_class(sch, cl);
+		}
+	}
+
+	qdisc_class_hash_destroy(&q->clhash);
+
+	sch_tree_lock(sch);
+	bpf_cleanup_prog(&q->enqueue_prog);
+	bpf_cleanup_prog(&q->dequeue_prog);
+	sch_tree_unlock(sch);
+}
+
+static int sch_bpf_change_class(struct Qdisc *sch, u32 classid,
+				u32 parentid, struct nlattr **tca,
+				unsigned long *arg,
+				struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg;
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	if (!cl) {
+		if (classid == 0 || TC_H_MAJ(classid ^ sch->handle) != 0 ||
+		    sch_bpf_find(sch, classid))
+			return -EINVAL;
+
+		cl = kzalloc(sizeof(*cl), GFP_KERNEL);
+		if (!cl)
+			return -ENOBUFS;
+
+		cl->common.classid = classid;
+		gnet_stats_basic_sync_init(&cl->bstats);
+		qdisc_class_hash_insert(&q->clhash, &cl->common);
+	}
+
+	qdisc_class_hash_grow(sch, &q->clhash);
+	*arg = (unsigned long)cl;
+	return 0;
+}
+
+static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg,
+			  struct netlink_ext_ack *extack)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct bpf_sched_data *q = qdisc_priv(sch);
+
+	qdisc_class_hash_remove(&q->clhash, &cl->common);
+	if (cl->qdisc)
+		qdisc_put(cl->qdisc);
+	return 0;
+}
+
+static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg,
+			      struct sk_buff *skb, struct tcmsg *tcm)
+{
+	return 0;
+}
+
+static int
+sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d)
+{
+	struct sch_bpf_class *cl = (struct sch_bpf_class *)arg;
+	struct gnet_stats_queue qs = {
+		.drops = cl->drops,
+		.overlimits = cl->overlimits,
+	};
+	__u32 qlen = 0;
+
+	if (cl->qdisc)
+		qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog);
+	else
+		qlen = 0;
+
+	if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 ||
+	    gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0)
+		return -1;
+	return 0;
+}
+
+static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct sch_bpf_class *cl;
+	unsigned int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->clhash.hashsize; i++) {
+		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
+			if (arg->count < arg->skip) {
+				arg->count++;
+				continue;
+			}
+			if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+				arg->stop = 1;
+				return;
+			}
+			arg->count++;
+		}
+	}
+}
+
+static const struct Qdisc_class_ops sch_bpf_class_ops = {
+	.graft		=	sch_bpf_graft,
+	.leaf		=	sch_bpf_leaf,
+	.find		=	sch_bpf_search,
+	.change		=	sch_bpf_change_class,
+	.delete		=	sch_bpf_delete,
+	.tcf_block	=	sch_bpf_tcf_block,
+	.bind_tcf	=	sch_bpf_bind,
+	.unbind_tcf	=	sch_bpf_unbind,
+	.dump		=	sch_bpf_dump_class,
+	.dump_stats	=	sch_bpf_dump_class_stats,
+	.walk		=	sch_bpf_walk,
+};
+
+static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
+	.cl_ops		=	&sch_bpf_class_ops,
+	.id		=	"bpf",
+	.priv_size	=	sizeof(struct bpf_sched_data),
+	.enqueue	=	sch_bpf_enqueue,
+	.dequeue	=	sch_bpf_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	sch_bpf_init,
+	.reset		=	sch_bpf_reset,
+	.destroy	=	sch_bpf_destroy,
+	.change		=	sch_bpf_change,
+	.dump		=	sch_bpf_dump,
+	.dump_stats	=	sch_bpf_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init sch_bpf_mod_init(void)
+{
+	return register_qdisc(&sch_bpf_qdisc_ops);
+}
+
+static void __exit sch_bpf_mod_exit(void)
+{
+	unregister_qdisc(&sch_bpf_qdisc_ops);
+}
+
+module_init(sch_bpf_mod_init)
+module_exit(sch_bpf_mod_exit)
+MODULE_AUTHOR("Cong Wang");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("eBPF queue discipline");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0bb92414c036..df280bbb7c0d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -997,6 +997,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	BPF_PROG_TYPE_QDISC,
 };
 
 enum bpf_attach_type {
@@ -1056,6 +1057,8 @@ enum bpf_attach_type {
 	BPF_CGROUP_UNIX_GETSOCKNAME,
 	BPF_NETKIT_PRIMARY,
 	BPF_NETKIT_PEER,
+	BPF_QDISC_ENQUEUE,
+	BPF_QDISC_DEQUEUE,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -7357,4 +7360,22 @@ struct bpf_iter_num {
 	__u64 __opaque[1];
 } __attribute__((aligned(8)));
 
+struct bpf_qdisc_ctx {
+	__bpf_md_ptr(struct sk_buff *, skb);
+	__u32 classid;
+	__u64 expire;
+	__u64 delta_ns;
+};
+
+enum {
+	SCH_BPF_QUEUED,
+	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
+	SCH_BPF_DROP,
+	SCH_BPF_CN,
+	SCH_BPF_THROTTLE,
+	SCH_BPF_PASS,
+	SCH_BPF_BYPASS,
+	SCH_BPF_STOLEN,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.20.1



* [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify() Amery Hung
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

From: Cong Wang <cong.wang@bytedance.com>

This patch introduces four kfuncs available to developers:

struct sk_buff *bpf_skb_acquire(struct sk_buff *skb);
void bpf_skb_release(struct sk_buff *skb);
void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb);
u32 bpf_skb_get_hash(struct sk_buff *skb);

A kptr is used to ensure the validity of skbs throughout their lifetime
in eBPF qdiscs. First, in the enqueue program, bpf_skb_acquire() can be
used to acquire a referenced kptr to an skb from ctx->skb. Then, it can
be stored in bpf maps or allocated objects serving as queues. Otherwise,
the program should call bpf_skb_release() to release the reference.
Finally, in the dequeue program, an skb kptr can be exchanged out of
queues and passed to bpf_qdisc_set_skb_dequeue() to set the skb to be
dequeued. The kfunc will also release the reference.

Since the skb kptr is incompatible with helpers taking __sk_buff,
bpf_skb_get_hash() is added for now to ease the implementation of
flow-based queueing algorithms.
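
As a minimal sketch of the intended flow (map and program names are
illustrative; bpf_kptr_xchg() is the existing helper for moving referenced
kptrs in and out of map values):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern struct sk_buff *bpf_skb_acquire(struct sk_buff *skb) __ksym;
extern void bpf_skb_release(struct sk_buff *skb) __ksym;
extern void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb) __ksym;

struct skb_slot {
	struct sk_buff __kptr *skb;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct skb_slot);
} queue SEC(".maps");

SEC("qdisc/enqueue")
int enqueue_prog(struct bpf_qdisc_ctx *ctx)
{
	struct sk_buff *skb, *old;
	struct skb_slot *slot;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return SCH_BPF_DROP;

	skb = bpf_skb_acquire(ctx->skb);
	old = bpf_kptr_xchg(&slot->skb, skb);
	if (old)			/* slot was occupied; drop the old skb */
		bpf_skb_release(old);
	return SCH_BPF_QUEUED;
}

SEC("qdisc/dequeue")
int dequeue_prog(struct bpf_qdisc_ctx *ctx)
{
	struct sk_buff *skb;
	struct skb_slot *slot;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return SCH_BPF_DROP;

	skb = bpf_kptr_xchg(&slot->skb, NULL);
	if (!skb)
		return SCH_BPF_DROP;

	bpf_qdisc_set_skb_dequeue(skb);	/* also releases the reference */
	return SCH_BPF_DEQUEUED;
}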

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/sch_bpf.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 56f3ab9c6059..b0e7c3a19c30 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/filter.h>
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
@@ -520,9 +521,89 @@ static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
 	.owner		=	THIS_MODULE,
 };
 
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/* bpf_skb_acquire - Acquire a reference to an skb. An skb acquired by this
+ * kfunc which is not stored in a map as a kptr, must be released by calling
+ * bpf_skb_release().
+ * @skb: The skb on which a reference is being acquired.
+ */
+__bpf_kfunc struct sk_buff *bpf_skb_acquire(struct sk_buff *skb)
+{
+	return skb_get(skb);
+}
+
+/* bpf_skb_release - Release the reference acquired on an skb.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_release(struct sk_buff *skb)
+{
+	skb_unref(skb);
+}
+
+/* bpf_skb_destroy - Release an skb reference acquired and exchanged into
+ * an allocated object or a map.
+ * @skb: The skb on which a reference is being released.
+ */
+__bpf_kfunc void bpf_skb_destroy(struct sk_buff *skb)
+{
+	skb_unref(skb);
+	consume_skb(skb);
+}
+
+/* bpf_skb_get_hash - Get the flow hash of an skb.
+ * @skb: The skb to get the flow hash from.
+ */
+__bpf_kfunc u32 bpf_skb_get_hash(struct sk_buff *skb)
+{
+	return skb_get_hash(skb);
+}
+
+/* bpf_qdisc_set_skb_dequeue - Set the skb to be dequeued. This will also
+ * release the reference to the skb.
+ * @skb: The skb to be dequeued by the qdisc.
+ */
+__bpf_kfunc void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb)
+{
+	consume_skb(skb);
+	__this_cpu_write(bpf_skb_dequeue, skb);
+}
+
+__diag_pop();
+
+BTF_SET8_START(skb_kfunc_btf_ids)
+BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE)
+BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_skb_get_hash)
+BTF_ID_FLAGS(func, bpf_qdisc_set_skb_dequeue, KF_RELEASE)
+BTF_SET8_END(skb_kfunc_btf_ids)
+
+static const struct btf_kfunc_id_set skb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &skb_kfunc_btf_ids,
+};
+
+BTF_ID_LIST(skb_kfunc_dtor_ids)
+BTF_ID(struct, sk_buff)
+BTF_ID_FLAGS(func, bpf_skb_destroy, KF_RELEASE)
+
 static int __init sch_bpf_mod_init(void)
 {
-	return register_qdisc(&sch_bpf_qdisc_ops);
+	int ret;
+	const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = {
+		{
+			.btf_id       = skb_kfunc_dtor_ids[0],
+			.kfunc_btf_id = skb_kfunc_dtor_ids[1]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &skb_kfunc_set);
+	ret = ret ?: register_btf_id_dtor_kfuncs(skb_kfunc_dtors,
+						 ARRAY_SIZE(skb_kfunc_dtors),
+						 THIS_MODULE);
+	return ret ?: register_qdisc(&sch_bpf_qdisc_ops);
 }
 
 static void __exit sch_bpf_mod_exit(void)
-- 
2.20.1



* [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify()
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 4/8] net_sched: Add reset program Amery Hung
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

From: Cong Wang <cong.wang@bytedance.com>

Introduce a kfunc, bpf_skb_tc_classify(), to reuse existing TC filters
on *any* Qdisc to classify the skb.
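
For example, an enqueue program could use it roughly as follows (the
ifindex and qdisc handle here are illustrative, and TC_ACT_SHOT is spelled
out as its numeric value):

extern __u64 bpf_skb_tc_classify(struct sk_buff *skb, int ifindex,
				 __u32 handle) __ksym;

SEC("qdisc/enqueue")
int enqueue_prog(struct bpf_qdisc_ctx *ctx)
{
	/* run the filters attached to qdisc 1: on ifindex 2 */
	__u64 res = bpf_skb_tc_classify(ctx->skb, 2, 0x10000);
	__u32 action = (__u32)res;
	__u32 classid = res >> 32;

	if (action == 2 /* TC_ACT_SHOT */)
		return SCH_BPF_DROP;

	ctx->classid = classid;
	return SCH_BPF_PASS;
}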

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Co-developed-by: Amery Hung <amery.hung@bytedance.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 net/sched/sch_bpf.c | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index b0e7c3a19c30..1910a58a3352 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -571,6 +571,73 @@ __bpf_kfunc void bpf_qdisc_set_skb_dequeue(struct sk_buff *skb)
 	__this_cpu_write(bpf_skb_dequeue, skb);
 }
 
+/* bpf_skb_tc_classify - Classify an skb using an existing filter referred
+ * to by the specified handle on the net device of index ifindex.
+ * @skb: The skb to be classified.
+ * @handle: The handle of the filter to be referenced.
+ * @ifindex: The ifindex of the net device where the filter is attached.
+ *
+ * Returns a 64-bit integer containing the tc action verdict and the classid,
+ * created as classid << 32 | action.
+ */
+__bpf_kfunc u64 bpf_skb_tc_classify(struct sk_buff *skb, int ifindex,
+				    u32 handle)
+{
+	struct net *net = dev_net(skb->dev);
+	const struct Qdisc_class_ops *cops;
+	struct tcf_result res = {};
+	struct tcf_block *block;
+	struct tcf_chain *chain;
+	struct net_device *dev;
+	int result = TC_ACT_OK;
+	unsigned long cl = 0;
+	struct Qdisc *q;
+
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev)
+		goto out;
+	q = qdisc_lookup_rcu(dev, handle);
+	if (!q)
+		goto out;
+
+	cops = q->ops->cl_ops;
+	if (!cops)
+		goto out;
+	if (!cops->tcf_block)
+		goto out;
+	if (TC_H_MIN(handle)) {
+		cl = cops->find(q, handle);
+		if (cl == 0)
+			goto out;
+	}
+	block = cops->tcf_block(q, cl, NULL);
+	if (!block)
+		goto out;
+
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
+		struct tcf_proto *tp;
+
+		result = tcf_classify(skb, NULL, tp, &res, false);
+		if (result >= 0) {
+			switch (result) {
+			case TC_ACT_QUEUED:
+			case TC_ACT_STOLEN:
+			case TC_ACT_TRAP:
+				fallthrough;
+			case TC_ACT_SHOT:
+				rcu_read_unlock();
+				return result;
+			}
+		}
+	}
+out:
+	rcu_read_unlock();
+	return (res.class << 32 | result);
+}
+
 __diag_pop();
 
 BTF_SET8_START(skb_kfunc_btf_ids)
@@ -578,6 +645,7 @@ BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE)
 BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_skb_get_hash)
 BTF_ID_FLAGS(func, bpf_qdisc_set_skb_dequeue, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_skb_tc_classify)
 BTF_SET8_END(skb_kfunc_btf_ids)
 
 static const struct btf_kfunc_id_set skb_kfunc_set = {
-- 
2.20.1



* [RFC PATCH v7 4/8] net_sched: Add reset program
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (2 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify() Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 5/8] net_sched: Add init program Amery Hung
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

Allow developers to implement customized reset logic through an optional
reset program. The program also takes bpf_qdisc_ctx as context, but
currently cannot access any field.

To release skbs, the program can drop all references to the bpf lists or
rbtrees serving as skb queues (a sketch follows below). The destructor
kfunc bpf_skb_destroy() will be called by bpf_map_free_deferred(). This
prevents the qdisc from holding the sch_tree_lock for too long when there
are many packets in the qdisc.
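
Assuming the illustrative single-slot map from the sketch in patch 2, a
reset program could simply drop whatever is still queued:

SEC("qdisc/reset")
int reset_prog(struct bpf_qdisc_ctx *ctx)
{
	struct skb_slot *slot;
	struct sk_buff *skb;
	__u32 key = 0;

	slot = bpf_map_lookup_elem(&queue, &key);
	if (!slot)
		return 0;

	skb = bpf_kptr_xchg(&slot->skb, NULL);
	if (skb)
		bpf_skb_release(skb);
	return 0;
}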

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/uapi/linux/bpf.h       |  1 +
 include/uapi/linux/pkt_sched.h |  4 ++++
 kernel/bpf/syscall.c           |  1 +
 net/core/filter.c              |  3 +++
 net/sched/sch_bpf.c            | 30 ++++++++++++++++++++++++++----
 tools/include/uapi/linux/bpf.h |  1 +
 6 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index df280bbb7c0d..84669886a493 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1059,6 +1059,7 @@ enum bpf_attach_type {
 	BPF_NETKIT_PEER,
 	BPF_QDISC_ENQUEUE,
 	BPF_QDISC_DEQUEUE,
+	BPF_QDISC_RESET,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d05462309f5a..e9e1a83c22f7 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1328,6 +1328,10 @@ enum {
 	TCA_SCH_BPF_DEQUEUE_PROG_FD,	/* u32 */
 	TCA_SCH_BPF_DEQUEUE_PROG_ID,	/* u32 */
 	TCA_SCH_BPF_DEQUEUE_PROG_TAG,	/* data */
+	TCA_SCH_BPF_RESET_PROG_NAME,	/* string */
+	TCA_SCH_BPF_RESET_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_RESET_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_RESET_PROG_TAG,	/* data */
 	__TCA_SCH_BPF_MAX,
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1838bddd8526..9af6fa542f2e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2506,6 +2506,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		switch (expected_attach_type) {
 		case BPF_QDISC_ENQUEUE:
 		case BPF_QDISC_DEQUEUE:
+		case BPF_QDISC_RESET:
 			return 0;
 		default:
 			return -EINVAL;
diff --git a/net/core/filter.c b/net/core/filter.c
index f25a0b6b5d56..f8e17465377f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8905,6 +8905,9 @@ static bool tc_qdisc_is_valid_access(int off, int size,
 {
 	struct btf *btf;
 
+	if (prog->expected_attach_type == BPF_QDISC_RESET)
+		return false;
+
 	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
 		return false;
 
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 1910a58a3352..3f0f809dced6 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -42,6 +42,7 @@ struct bpf_sched_data {
 	struct Qdisc_class_hash clhash;
 	struct sch_bpf_prog __rcu enqueue_prog;
 	struct sch_bpf_prog __rcu dequeue_prog;
+	struct sch_bpf_prog __rcu reset_prog;
 
 	struct qdisc_watchdog watchdog;
 };
@@ -51,6 +52,9 @@ static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *sk
 {
 	struct nlattr *nla;
 
+	if (!prog->prog)
+		return 0;
+
 	if (prog->name &&
 	    nla_put_string(skb, name, prog->name))
 		return -EMSGSIZE;
@@ -81,6 +85,9 @@ static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME,
 			      TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG))
 		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->reset_prog, skb, TCA_SCH_BPF_RESET_PROG_NAME,
+			      TCA_SCH_BPF_RESET_PROG_ID, TCA_SCH_BPF_RESET_PROG_TAG))
+		goto nla_put_failure;
 
 	return nla_nest_end(skb, opts);
 
@@ -259,16 +266,21 @@ static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
 	[TCA_SCH_BPF_DEQUEUE_PROG_FD]	= { .type = NLA_U32 },
 	[TCA_SCH_BPF_DEQUEUE_PROG_NAME]	= { .type = NLA_NUL_STRING,
 					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_RESET_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_RESET_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
 };
 
-static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog)
+static int bpf_init_prog(struct nlattr *fd, struct nlattr *name,
+			 struct sch_bpf_prog *prog, bool optional)
 {
 	struct bpf_prog *fp, *old_fp;
 	char *prog_name = NULL;
 	u32 bpf_fd;
 
 	if (!fd)
-		return -EINVAL;
+		return optional ? 0 : -EINVAL;
+
 	bpf_fd = nla_get_u32(fd);
 
 	fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC);
@@ -327,11 +339,15 @@ static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
 	sch_tree_lock(sch);
 
 	err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD],
-			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog);
+			    tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog, false);
 	if (err)
 		goto failure;
 	err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD],
-			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog);
+			    tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog, false);
+	if (err)
+		goto failure;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_RESET_PROG_FD],
+			    tb[TCA_SCH_BPF_RESET_PROG_NAME], &q->reset_prog, true);
 failure:
 	sch_tree_unlock(sch);
 	return err;
@@ -360,7 +376,9 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
 static void sch_bpf_reset(struct Qdisc *sch)
 {
 	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct bpf_qdisc_ctx ctx = {};
 	struct sch_bpf_class *cl;
+	struct bpf_prog *reset;
 	unsigned int i;
 
 	for (i = 0; i < q->clhash.hashsize; i++) {
@@ -371,6 +389,9 @@ static void sch_bpf_reset(struct Qdisc *sch)
 	}
 
 	qdisc_watchdog_cancel(&q->watchdog);
+	reset = rcu_dereference(q->reset_prog.prog);
+	if (reset)
+		bpf_prog_run(reset, &ctx);
 }
 
 static void sch_bpf_destroy_class(struct Qdisc *sch, struct sch_bpf_class *cl)
@@ -398,6 +419,7 @@ static void sch_bpf_destroy(struct Qdisc *sch)
 	sch_tree_lock(sch);
 	bpf_cleanup_prog(&q->enqueue_prog);
 	bpf_cleanup_prog(&q->dequeue_prog);
+	bpf_cleanup_prog(&q->reset_prog);
 	sch_tree_unlock(sch);
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index df280bbb7c0d..84669886a493 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1059,6 +1059,7 @@ enum bpf_attach_type {
 	BPF_NETKIT_PEER,
 	BPF_QDISC_ENQUEUE,
 	BPF_QDISC_DEQUEUE,
+	BPF_QDISC_RESET,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.20.1



* [RFC PATCH v7 5/8] net_sched: Add init program
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (3 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 4/8] net_sched: Add reset program Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-17 21:56 ` [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Amery Hung
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

This patch adds another optional program to be called during the
creation of a qdisc to initialize data on the bpf side. The program
takes bpf_qdisc_ctx as context, but cannot access any field.
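
For example, an init program could be as small as the following sketch
(the global variable is illustrative):

__u64 default_rate_bps;

SEC("qdisc/init")
int init_prog(struct bpf_qdisc_ctx *ctx)
{
	/* seed bpf-side state before the first packet is enqueued */
	default_rate_bps = 100000000;	/* 100 Mbit/s */
	return 0;
}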

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 include/uapi/linux/bpf.h       |  1 +
 include/uapi/linux/pkt_sched.h |  4 ++++
 kernel/bpf/syscall.c           |  1 +
 net/core/filter.c              |  3 ++-
 net/sched/sch_bpf.c            | 23 ++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h |  1 +
 6 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 84669886a493..cad0788bef99 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1060,6 +1060,7 @@ enum bpf_attach_type {
 	BPF_QDISC_ENQUEUE,
 	BPF_QDISC_DEQUEUE,
 	BPF_QDISC_RESET,
+	BPF_QDISC_INIT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index e9e1a83c22f7..61f0cf4a088c 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1332,6 +1332,10 @@ enum {
 	TCA_SCH_BPF_RESET_PROG_FD,	/* u32 */
 	TCA_SCH_BPF_RESET_PROG_ID,	/* u32 */
 	TCA_SCH_BPF_RESET_PROG_TAG,	/* data */
+	TCA_SCH_BPF_INIT_PROG_NAME,	/* string */
+	TCA_SCH_BPF_INIT_PROG_FD,	/* u32 */
+	TCA_SCH_BPF_INIT_PROG_ID,	/* u32 */
+	TCA_SCH_BPF_INIT_PROG_TAG,	/* data */
 	__TCA_SCH_BPF_MAX,
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9af6fa542f2e..0959905044b9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2507,6 +2507,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		case BPF_QDISC_ENQUEUE:
 		case BPF_QDISC_DEQUEUE:
 		case BPF_QDISC_RESET:
+		case BPF_QDISC_INIT:
 			return 0;
 		default:
 			return -EINVAL;
diff --git a/net/core/filter.c b/net/core/filter.c
index f8e17465377f..5619a12c0d06 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8905,7 +8905,8 @@ static bool tc_qdisc_is_valid_access(int off, int size,
 {
 	struct btf *btf;
 
-	if (prog->expected_attach_type == BPF_QDISC_RESET)
+	if (prog->expected_attach_type == BPF_QDISC_RESET ||
+	    prog->expected_attach_type == BPF_QDISC_INIT)
 		return false;
 
 	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c
index 3f0f809dced6..925a131016f0 100644
--- a/net/sched/sch_bpf.c
+++ b/net/sched/sch_bpf.c
@@ -43,6 +43,7 @@ struct bpf_sched_data {
 	struct sch_bpf_prog __rcu enqueue_prog;
 	struct sch_bpf_prog __rcu dequeue_prog;
 	struct sch_bpf_prog __rcu reset_prog;
+	struct sch_bpf_prog __rcu init_prog;
 
 	struct qdisc_watchdog watchdog;
 };
@@ -88,6 +89,9 @@ static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (sch_bpf_dump_prog(&q->reset_prog, skb, TCA_SCH_BPF_RESET_PROG_NAME,
 			      TCA_SCH_BPF_RESET_PROG_ID, TCA_SCH_BPF_RESET_PROG_TAG))
 		goto nla_put_failure;
+	if (sch_bpf_dump_prog(&q->init_prog, skb, TCA_SCH_BPF_INIT_PROG_NAME,
+			      TCA_SCH_BPF_INIT_PROG_ID, TCA_SCH_BPF_INIT_PROG_TAG))
+		goto nla_put_failure;
 
 	return nla_nest_end(skb, opts);
 
@@ -269,6 +273,9 @@ static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = {
 	[TCA_SCH_BPF_RESET_PROG_FD]	= { .type = NLA_U32 },
 	[TCA_SCH_BPF_RESET_PROG_NAME]	= { .type = NLA_NUL_STRING,
 					    .len = ACT_BPF_NAME_LEN },
+	[TCA_SCH_BPF_INIT_PROG_FD]	= { .type = NLA_U32 },
+	[TCA_SCH_BPF_INIT_PROG_NAME]	= { .type = NLA_NUL_STRING,
+					    .len = ACT_BPF_NAME_LEN },
 };
 
 static int bpf_init_prog(struct nlattr *fd, struct nlattr *name,
@@ -348,6 +355,10 @@ static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt,
 		goto failure;
 	err = bpf_init_prog(tb[TCA_SCH_BPF_RESET_PROG_FD],
 			    tb[TCA_SCH_BPF_RESET_PROG_NAME], &q->reset_prog, true);
+	if (err)
+		goto failure;
+	err = bpf_init_prog(tb[TCA_SCH_BPF_INIT_PROG_FD],
+			    tb[TCA_SCH_BPF_INIT_PROG_NAME], &q->init_prog, true);
 failure:
 	sch_tree_unlock(sch);
 	return err;
@@ -357,6 +368,8 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
 			struct netlink_ext_ack *extack)
 {
 	struct bpf_sched_data *q = qdisc_priv(sch);
+	struct bpf_qdisc_ctx ctx = {};
+	struct bpf_prog *init;
 	int err;
 
 	qdisc_watchdog_init(&q->watchdog, sch);
@@ -370,7 +383,14 @@ static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
 	if (err)
 		return err;
 
-	return qdisc_class_hash_init(&q->clhash);
+	err = qdisc_class_hash_init(&q->clhash);
+	if (err < 0)
+		return err;
+
+	init = rcu_dereference(q->init_prog.prog);
+	if (init)
+		bpf_prog_run(init, &ctx);
+	return 0;
 }
 
 static void sch_bpf_reset(struct Qdisc *sch)
@@ -420,6 +440,7 @@ static void sch_bpf_destroy(struct Qdisc *sch)
 	bpf_cleanup_prog(&q->enqueue_prog);
 	bpf_cleanup_prog(&q->dequeue_prog);
 	bpf_cleanup_prog(&q->reset_prog);
+	bpf_cleanup_prog(&q->init_prog);
 	sch_tree_unlock(sch);
 }
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 84669886a493..cad0788bef99 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1060,6 +1060,7 @@ enum bpf_attach_type {
 	BPF_QDISC_ENQUEUE,
 	BPF_QDISC_DEQUEUE,
 	BPF_QDISC_RESET,
+	BPF_QDISC_INIT,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.20.1



* [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (4 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 5/8] net_sched: Add init program Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-23  0:17   ` Andrii Nakryiko
  2024-01-17 21:56 ` [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Amery Hung
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

While the eBPF qdisc uses netlink for attachment, expected_attach_type
is still required at load time to verify context access for the different
programs. This patch adds the corresponding section definitions.
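
On the user space side, a loader could then do something like the following
(skeleton and program names are illustrative; the netlink plumbing that
passes the fds via the TCA_SCH_BPF_*_PROG_FD attributes is omitted):

#include <bpf/libbpf.h>
#include "tc_sch_fq.skel.h"

static int load_fq_progs(int *enq_fd, int *deq_fd)
{
	struct tc_sch_fq *skel = tc_sch_fq__open_and_load();

	if (!skel)
		return -1;
	*enq_fd = bpf_program__fd(skel->progs.fq_enqueue);
	*deq_fd = bpf_program__fd(skel->progs.fq_dequeue);
	return 0;
}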

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 tools/lib/bpf/libbpf.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e067be95da3c..0541f85b4ce6 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8991,6 +8991,10 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("struct_ops.s+",	STRUCT_OPS, 0, SEC_SLEEPABLE),
 	SEC_DEF("sk_lookup",		SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE),
 	SEC_DEF("netfilter",		NETFILTER, BPF_NETFILTER, SEC_NONE),
+	SEC_DEF("qdisc/enqueue",	QDISC, BPF_QDISC_ENQUEUE, SEC_ATTACHABLE_OPT),
+	SEC_DEF("qdisc/dequeue",	QDISC, BPF_QDISC_DEQUEUE, SEC_ATTACHABLE_OPT),
+	SEC_DEF("qdisc/reset",		QDISC, BPF_QDISC_RESET, SEC_ATTACHABLE_OPT),
+	SEC_DEF("qdisc/init",		QDISC, BPF_QDISC_INIT, SEC_ATTACHABLE_OPT),
 };
 
 int libbpf_register_prog_handler(const char *sec,
-- 
2.20.1



* [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (5 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-24 10:29   ` Daniel Borkmann
  2024-01-17 21:56 ` [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc Amery Hung
  2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
  8 siblings, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

tc_sch_fq.bpf.c
A simple bpf fair queueing (fq) qdisc that gives each flow an equal chance
to transmit data. The qdisc respects the timestamp in an skb set by a
clsact rate limiter. When enabled, it can also inform the rate limiter
about packet drops so that timestamps can be adjusted. The implementation
does not prevent hash collisions between flows, nor does it recycle flows.

tc_sch_fq.c
A user space program to load and attach the eBPF-based fq qdisc. By
default it adds the bpf fq to the loopback device, but the qdisc can also
be added to another device and class with the '-d' and '-p' options.

To test the bpf fq qdisc with the EDT rate limiter:
$ tc qdisc add dev lo clsact
$ tc filter add dev lo egress bpf obj tc_clsact_edt.bpf.o sec classifier
$ ./tc_sch_fq -s
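
Whether the qdisc is attached and passing packets can then be checked with
ordinary tools, for example (illustrative only):
$ tc -s qdisc show dev lo
$ ping -c 3 127.0.0.1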

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 samples/bpf/Makefile            |   8 +-
 samples/bpf/bpf_experimental.h  | 134 +++++++
 samples/bpf/tc_clsact_edt.bpf.c | 103 +++++
 samples/bpf/tc_sch_fq.bpf.c     | 666 ++++++++++++++++++++++++++++++++
 samples/bpf/tc_sch_fq.c         | 321 +++++++++++++++
 5 files changed, 1231 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/bpf_experimental.h
 create mode 100644 samples/bpf/tc_clsact_edt.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.bpf.c
 create mode 100644 samples/bpf/tc_sch_fq.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 933f6c3fe6b0..ea516a00352d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -46,6 +46,7 @@ tprogs-y += xdp_fwd
 tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += tc_sch_fq
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -98,6 +99,7 @@ ibumad-objs := ibumad_user.o
 hbm-objs := hbm.o $(CGROUP_HELPERS)
 
 xdp_router_ipv4-objs := xdp_router_ipv4_user.o $(XDP_SAMPLE)
+tc_sch_fq-objs := tc_sch_fq.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -149,6 +151,7 @@ always-y += task_fd_query_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += tc_sch_fq.bpf.o
 
 TPROGS_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -195,6 +198,7 @@ TPROGLDLIBS_tracex4		+= -lrt
 TPROGLDLIBS_trace_output	+= -lrt
 TPROGLDLIBS_map_perf_test	+= -lrt
 TPROGLDLIBS_test_overhead	+= -lrt
+TPROGLDLIBS_tc_sch_fq		+= -lmnl
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 # make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang
@@ -306,6 +310,7 @@ $(obj)/$(TRACE_HELPERS) $(obj)/$(CGROUP_HELPERS) $(obj)/$(XDP_SAMPLE): | libbpf_
 .PHONY: libbpf_hdrs
 
 $(obj)/xdp_router_ipv4_user.o: $(obj)/xdp_router_ipv4.skel.h
+$(obj)/tc_sch_fq.o: $(obj)/tc_sch_fq.skel.h
 
 $(obj)/tracex5.bpf.o: $(obj)/syscall_nrs.h
 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
@@ -370,10 +375,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+tc_sch_fq.skel.h-deps := tc_sch_fq.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/samples/bpf/bpf_experimental.h b/samples/bpf/bpf_experimental.h
new file mode 100644
index 000000000000..fc39063e0322
--- /dev/null
+++ b/samples/bpf/bpf_experimental.h
@@ -0,0 +1,134 @@
+#ifndef __BPF_EXPERIMENTAL__
+#define __BPF_EXPERIMENTAL__
+
+#include "vmlinux.h"
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+#define __contains(name, node) __attribute__((btf_decl_tag("contains:" #name ":" #node)))
+
+/* Description
+ *	Allocates an object of the type represented by 'local_type_id' in
+ *	program BTF. User may use the bpf_core_type_id_local macro to pass the
+ *	type ID of a struct in program BTF.
+ *
+ *	The 'local_type_id' parameter must be a known constant.
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
+ * Returns
+ *	A pointer to an object of the type corresponding to the passed in
+ *	'local_type_id', or NULL on failure.
+ */
+extern void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_obj_new_impl */
+#define bpf_obj_new(type) ((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))
+
+/* Description
+ *	Free an allocated object. All fields of the object that require
+ *	destruction will be destructed before the storage is freed.
+ *
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
+ * Returns
+ *	Void.
+ */
+extern void bpf_obj_drop_impl(void *kptr, void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_obj_drop_impl */
+#define bpf_obj_drop(kptr) bpf_obj_drop_impl(kptr, NULL)
+
+/* Description
+ *	Increment the refcount on a refcounted local kptr, turning the
+ *	non-owning reference input into an owning reference in the process.
+ *
+ *	The 'meta' parameter is rewritten by the verifier, no need for BPF
+ *	program to set it.
+ * Returns
+ *	An owning reference to the object pointed to by 'kptr'
+ */
+extern void *bpf_refcount_acquire_impl(void *kptr, void *meta) __ksym;
+
+/* Convenience macro to wrap over bpf_refcount_acquire_impl */
+#define bpf_refcount_acquire(kptr) bpf_refcount_acquire_impl(kptr, NULL)
+
+/* Description
+ *	Add a new entry to the beginning of the BPF linked list.
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
+ * Returns
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a list
+ */
+extern int bpf_list_push_front_impl(struct bpf_list_head *head,
+				    struct bpf_list_node *node,
+				    void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_list_push_front_impl */
+#define bpf_list_push_front(head, node) bpf_list_push_front_impl(head, node, NULL, 0)
+
+/* Description
+ *	Add a new entry to the end of the BPF linked list.
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
+ * Returns
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a list
+ */
+extern int bpf_list_push_back_impl(struct bpf_list_head *head,
+				   struct bpf_list_node *node,
+				   void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_list_push_back_impl */
+#define bpf_list_push_back(head, node) bpf_list_push_back_impl(head, node, NULL, 0)
+
+/* Description
+ *	Remove the entry at the beginning of the BPF linked list.
+ * Returns
+ *	Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+extern struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+
+/* Description
+ *	Remove the entry at the end of the BPF linked list.
+ * Returns
+ *	Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+extern struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+
+/* Description
+ *	Remove 'node' from rbtree with root 'root'
+ * Returns
+ * 	Pointer to the removed node, or NULL if 'root' didn't contain 'node'
+ */
+extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
+					     struct bpf_rb_node *node) __ksym;
+
+/* Description
+ *	Add 'node' to rbtree with root 'root' using comparator 'less'
+ *
+ *	The 'meta' and 'off' parameters are rewritten by the verifier, no need
+ *	for BPF programs to set them
+ * Returns
+ *	0 if the node was successfully added
+ *	-EINVAL if the node wasn't added because it's already in a tree
+ */
+extern int bpf_rbtree_add_impl(struct bpf_rb_root *root, struct bpf_rb_node *node,
+			       bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b),
+			       void *meta, __u64 off) __ksym;
+
+/* Convenience macro to wrap over bpf_rbtree_add_impl */
+#define bpf_rbtree_add(head, node, less) bpf_rbtree_add_impl(head, node, less, NULL, 0)
+
+/* Description
+ *	Return the first (leftmost) node in input tree
+ * Returns
+ *	Pointer to the node, which is _not_ removed from the tree. If the tree
+ *	contains no nodes, returns NULL.
+ */
+extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
+
+#endif
diff --git a/samples/bpf/tc_clsact_edt.bpf.c b/samples/bpf/tc_clsact_edt.bpf.c
new file mode 100644
index 000000000000..f0b2ea84028d
--- /dev/null
+++ b/samples/bpf/tc_clsact_edt.bpf.c
@@ -0,0 +1,103 @@
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define ETH_P_IP	0x0800
+#define TC_ACT_OK	0
+#define NS_PER_SEC	1000000000ULL
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u32);
+	__type(value, __u64);
+	__uint(pinning, LIBBPF_PIN_BY_NAME);
+	__uint(max_entries, 16);
+} rate_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 16);
+	__type(key, u32);
+	__type(value, u64);
+	__uint(pinning, LIBBPF_PIN_BY_NAME);
+} tstamp_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 16);
+	__type(key, u32);
+	__type(value, u64);
+	__uint(pinning, LIBBPF_PIN_BY_NAME);
+} comp_map SEC(".maps");
+
+u64 last_ktime = 0;
+
+SEC("classifier")
+int prog(struct __sk_buff *skb)
+{
+	void *data_end = (void *)(unsigned long long)skb->data_end;
+	u64 *rate, *tstamp, delay_ns, tstamp_comp, tstamp_new, *comp, comp_ns, now, init_rate = 12500000;    /* 100 Mbits/sec */
+	void *data = (void *)(unsigned long long)skb->data;
+	struct iphdr *ip = data + sizeof(struct ethhdr);
+	struct ethhdr *eth = data;
+	u64 len = skb->len;
+	long ret;
+	u64 zero = 0;
+
+	now = bpf_ktime_get_ns();
+
+	if (data + sizeof(struct ethhdr) > data_end)
+		return TC_ACT_OK;
+	if (skb->protocol != bpf_htons(ETH_P_IP))
+		return TC_ACT_OK;
+	if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
+		return TC_ACT_OK;
+
+	rate = bpf_map_lookup_elem(&rate_map, &ip->daddr);
+	if (!rate) {
+		bpf_map_update_elem(&rate_map, &ip->daddr, &init_rate, BPF_ANY);
+		bpf_map_update_elem(&tstamp_map, &ip->daddr, &now, BPF_ANY);
+		bpf_map_update_elem(&comp_map, &ip->daddr, &zero, BPF_ANY);
+		return TC_ACT_OK;
+	}
+
+	delay_ns = skb->len * NS_PER_SEC / (*rate);
+
+	tstamp = bpf_map_lookup_elem(&tstamp_map, &ip->daddr);
+	if (!tstamp)	/* unlikely */
+		return TC_ACT_OK;
+
+	comp = bpf_map_lookup_elem(&comp_map, &ip->daddr);
+	if (!comp)	/* unlikely */
+		return TC_ACT_OK;
+
+	// Reset comp and tstamp when idle
+	if (now - last_ktime > 1000000000) {
+		__sync_lock_test_and_set(comp, 0);
+		__sync_lock_test_and_set(tstamp, now);
+	}
+	last_ktime = now;
+
+	comp_ns = __sync_lock_test_and_set(comp, 0);
+	tstamp_comp = *tstamp - comp_ns;
+	if (tstamp_comp < now) {
+		tstamp_new = tstamp_comp + delay_ns;
+		if (tstamp_new < now) {
+			__sync_fetch_and_add(comp, now - tstamp_new);
+			__sync_lock_test_and_set(tstamp, now);
+		} else {
+			__sync_fetch_and_sub(tstamp, comp_ns);
+			__sync_fetch_and_add(tstamp, delay_ns);
+		}
+		skb->tstamp = now;
+		return TC_ACT_OK;
+	}
+
+	__sync_fetch_and_sub(tstamp, comp_ns);
+	skb->tstamp = *tstamp;
+	__sync_fetch_and_add(tstamp, delay_ns);
+
+	return TC_ACT_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tc_sch_fq.bpf.c b/samples/bpf/tc_sch_fq.bpf.c
new file mode 100644
index 000000000000..b2287ea3e2b2
--- /dev/null
+++ b/samples/bpf/tc_sch_fq.bpf.c
@@ -0,0 +1,666 @@
+#include "vmlinux.h"
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+
+#define TC_PRIO_CONTROL  7
+#define TC_PRIO_MAX  15
+
+#define NS_PER_SEC 1000000000
+#define PSCHED_MTU (64 * 1024 + 14)
+
+#define NUM_QUEUE_LOG 10
+#define NUM_QUEUE (1 << NUM_QUEUE_LOG)
+#define PRIO_QUEUE (NUM_QUEUE + 1)
+#define COMP_DROP_PKT_DELAY 1
+#define THROTTLED 0xffffffffffffffff
+
+/* fq configuration */
+__u64 q_flow_refill_delay = 40 * 10000; //40us
+__u64 q_horizon = NS_PER_SEC * 10ULL;
+__u32 q_initial_quantum = 10 * PSCHED_MTU;
+__u32 q_quantum = 2 * PSCHED_MTU;
+__u32 q_orphan_mask = 1023;
+__u32 q_flow_plimit = 100;
+__u32 q_plimit = 10000;
+bool q_horizon_drop = true;
+
+bool q_compensate_tstamp;
+bool q_random_drop;
+
+unsigned long time_next_delayed_flow = ~0ULL;
+unsigned long unthrottle_latency_ns = 0ULL;
+unsigned long ktime_cache = 0;
+unsigned long dequeue_now;
+unsigned int fq_qlen = 0;
+
+struct fq_cb {
+	u32 plen;
+};
+
+struct skb_node {
+	u64 tstamp;
+	struct sk_buff __kptr *skb;
+	struct bpf_rb_node node;
+};
+
+struct fq_flow_node {
+	u32 hash;
+	int credit;
+	u32 qlen;
+	u32 socket_hash;
+	u64 age;
+	u64 time_next_packet;
+	struct bpf_list_node list_node;
+	struct bpf_rb_node rb_node;
+	struct bpf_rb_root queue __contains(skb_node, node);
+	struct bpf_spin_lock lock;
+	struct bpf_refcount refcount;
+};
+
+struct dequeue_nonprio_ctx {
+	bool dequeued;
+	u64 expire;
+};
+
+struct fq_stashed_flow {
+	struct fq_flow_node __kptr *flow;
+};
+
+/* [NUM_QUEUE] for TC_PRIO_CONTROL
+ * [0, NUM_QUEUE - 1] for other flows
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, __u32);
+	__type(value, struct fq_stashed_flow);
+	__uint(max_entries, NUM_QUEUE + 1);
+} fq_stashed_flows SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u32);
+	__type(value, __u64);
+	__uint(pinning, LIBBPF_PIN_BY_NAME);
+	__uint(max_entries, 16);
+} rate_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, __u32);
+	__type(value, __u64);
+	__uint(pinning, LIBBPF_PIN_BY_NAME);
+	__uint(max_entries, 16);
+} comp_map SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock fq_delayed_lock;
+private(A) struct bpf_rb_root fq_delayed __contains(fq_flow_node, rb_node);
+
+private(B) struct bpf_spin_lock fq_new_flows_lock;
+private(B) struct bpf_list_head fq_new_flows __contains(fq_flow_node, list_node);
+
+private(C) struct bpf_spin_lock fq_old_flows_lock;
+private(C) struct bpf_list_head fq_old_flows __contains(fq_flow_node, list_node);
+
+struct sk_buff *bpf_skb_acquire(struct sk_buff *p) __ksym;
+void bpf_skb_release(struct sk_buff *p) __ksym;
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_qdisc_set_skb_dequeue(struct sk_buff *p) __ksym;
+
+static __always_inline bool bpf_kptr_xchg_back(void *map_val, void *ptr)
+{
+	void *ret;
+
+	ret = bpf_kptr_xchg(map_val, ptr);
+	if (ret) { //unexpected
+		bpf_obj_drop(ret);
+		return false;
+	}
+	return true;
+}
+
+static __always_inline int hash64(u64 val, int bits)
+{
+	return val * 0x61C8864680B583EBull >> (64 - bits);
+}
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct skb_node *skb_a;
+	struct skb_node *skb_b;
+
+	skb_a = container_of(a, struct skb_node, node);
+	skb_b = container_of(b, struct skb_node, node);
+
+	return skb_a->tstamp < skb_b->tstamp;
+}
+
+static bool fn_time_next_packet_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct fq_flow_node *flow_a;
+	struct fq_flow_node *flow_b;
+
+	flow_a = container_of(a, struct fq_flow_node, rb_node);
+	flow_b = container_of(b, struct fq_flow_node, rb_node);
+
+	return flow_a->time_next_packet < flow_b->time_next_packet;
+}
+
+static __always_inline void
+fq_flows_add_head(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_front(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+}
+
+static __always_inline void
+fq_flows_add_tail(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+		  struct fq_flow_node *flow)
+{
+	bpf_spin_lock(lock);
+	bpf_list_push_back(head, &flow->list_node);
+	bpf_spin_unlock(lock);
+}
+
+static __always_inline bool
+fq_flows_is_empty(struct bpf_list_head *head, struct bpf_spin_lock *lock)
+{
+	struct bpf_list_node *node;
+
+	bpf_spin_lock(lock);
+	node = bpf_list_pop_front(head);
+	if (node) {
+		bpf_list_push_front(head, node);
+		bpf_spin_unlock(lock);
+		return false;
+	}
+	bpf_spin_unlock(lock);
+
+	return true;
+}
+
+static __always_inline void fq_flow_set_detached(struct fq_flow_node *flow)
+{
+	flow->age = bpf_jiffies64();
+	bpf_obj_drop(flow);
+}
+
+static __always_inline bool fq_flow_is_detached(struct fq_flow_node *flow)
+{
+	return flow->age != 0 && flow->age != THROTTLED;
+}
+
+static __always_inline bool fq_flow_is_throttled(struct fq_flow_node *flow)
+{
+	return flow->age != THROTTLED;
+}
+
+static __always_inline bool sk_listener(struct sock *sk)
+{
+	return (1 << sk->__sk_common.skc_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
+}
+
+static __always_inline int
+fq_classify(struct sk_buff *skb, u32 *hash, struct fq_stashed_flow **sflow,
+	    bool *connected, u32 *sk_hash)
+{
+	struct fq_flow_node *flow;
+	struct sock *sk = skb->sk;
+
+	*connected = false;
+
+	if ((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) {
+		*hash = PRIO_QUEUE;
+	} else {
+		if (!sk || sk_listener(sk)) {
+			*sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+			*sk_hash = (*sk_hash << 1 | 1);
+		} else if (sk->__sk_common.skc_state == TCP_CLOSE) {
+			*sk_hash = bpf_skb_get_hash(skb) & q_orphan_mask;
+			*sk_hash = (*sk_hash << 1 | 1);
+		} else {
+			*sk_hash = sk->__sk_common.skc_hash;
+			*connected = true;
+		}
+		*hash = hash64(*sk_hash, NUM_QUEUE_LOG);
+	}
+
+	*sflow = bpf_map_lookup_elem(&fq_stashed_flows, hash);
+	if (!*sflow)
+		return -1; //unexpected
+
+	if ((*sflow)->flow)
+		return 0;
+
+	flow = bpf_obj_new(typeof(*flow));
+	if (!flow)
+		return -1;
+
+	flow->hash = *hash;
+	flow->credit = q_initial_quantum;
+	flow->qlen = 0;
+	flow->age = 1UL;
+	flow->time_next_packet = 0;
+
+	bpf_kptr_xchg_back(&(*sflow)->flow, flow);
+
+	return 0;
+}
+
+static __always_inline bool fq_packet_beyond_horizon(struct sk_buff *skb)
+{
+	return (s64)skb->tstamp > (s64)(ktime_cache + q_horizon);
+}
+
+SEC("qdisc/enqueue")
+int enqueue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct iphdr *iph = (void *)(long)ctx->skb->data + sizeof(struct ethhdr);
+	u64 time_to_send, jiffies, delay_ns, *comp_ns, *rate;
+	struct fq_flow_node *flow = NULL, *flow_copy;
+	struct sk_buff *skb = ctx->skb;
+	u32 hash, plen, daddr, sk_hash;
+	struct fq_stashed_flow *sflow;
+	struct bpf_rb_node *node;
+	struct skb_node *skbn;
+	void *flow_queue;
+	bool connected;
+
+	if (q_random_drop & (bpf_get_prandom_u32() > ~0U * 0.90))
+		goto drop;
+
+	if (fq_qlen >= q_plimit)
+		goto drop;
+
+	skb = bpf_skb_acquire(ctx->skb);
+	if (!skb->tstamp) {
+		time_to_send = ktime_cache = bpf_ktime_get_ns();
+	} else {
+		if (fq_packet_beyond_horizon(skb)) {
+			ktime_cache = bpf_ktime_get_ns();
+			if (fq_packet_beyond_horizon(skb)) {
+				if (q_horizon_drop)
+					goto rel_skb_and_drop;
+
+				skb->tstamp = ktime_cache + q_horizon;
+			}
+		}
+		time_to_send = skb->tstamp;
+	}
+
+	if (fq_classify(skb, &hash, &sflow, &connected, &sk_hash) < 0)
+		goto rel_skb_and_drop;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		goto rel_skb_and_drop; //unexpected
+
+	if (hash != PRIO_QUEUE) {
+		if (connected && flow->socket_hash != sk_hash) {
+			flow->credit = q_initial_quantum;
+			flow->socket_hash = sk_hash;
+			if (fq_flow_is_throttled(flow)) {
+				/* mark the flow as undetached. The reference to the
+				 * throttled flow in fq_delayed will be removed later.
+				 */
+				flow_copy = bpf_refcount_acquire(flow);
+				flow_copy->age = 0;
+				fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow_copy);
+			}
+			flow->time_next_packet = 0ULL;
+		}
+
+		if (flow->qlen >= q_flow_plimit) {
+			bpf_kptr_xchg_back(&sflow->flow, flow);
+			goto rel_skb_and_drop;
+		}
+
+		if (fq_flow_is_detached(flow)) {
+			if (connected)
+				flow->socket_hash = sk_hash;
+
+			flow_copy = bpf_refcount_acquire(flow);
+
+			jiffies = bpf_jiffies64();
+			if ((s64)(jiffies - (flow_copy->age + q_flow_refill_delay)) > 0) {
+				if (flow_copy->credit < q_quantum)
+					flow_copy->credit = q_quantum;
+			}
+			flow_copy->age = 0;
+			fq_flows_add_tail(&fq_new_flows, &fq_new_flows_lock, flow_copy);
+		}
+	}
+
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn) {
+		bpf_kptr_xchg_back(&sflow->flow, flow);
+		goto rel_skb_and_drop;
+	}
+
+	skbn->tstamp = time_to_send;
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (skb)
+		bpf_skb_release(skb);
+
+	bpf_spin_lock(&flow->lock);
+	bpf_rbtree_add(&flow->queue, &skbn->node, skbn_tstamp_less);
+	bpf_spin_unlock(&flow->lock);
+
+	flow->qlen++;
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	fq_qlen++;
+	return SCH_BPF_QUEUED;
+
+rel_skb_and_drop:
+	bpf_skb_release(skb);
+drop:
+	if (q_compensate_tstamp) {
+		bpf_probe_read_kernel(&plen, sizeof(plen), (void *)(ctx->skb->cb));
+		bpf_probe_read_kernel(&daddr, sizeof(daddr), &iph->daddr);
+		rate = bpf_map_lookup_elem(&rate_map, &daddr);
+		comp_ns = bpf_map_lookup_elem(&comp_map, &daddr);
+		if (rate && comp_ns) {
+			delay_ns = (u64)plen * NS_PER_SEC / (*rate);
+			__sync_fetch_and_add(comp_ns, delay_ns);
+		}
+	}
+	return SCH_BPF_DROP;
+}
+
+static int fq_unset_throttled_flows(u32 index, bool *unset_all)
+{
+	struct bpf_rb_node *node = NULL;
+	struct fq_flow_node *flow;
+	u32 hash;
+
+	bpf_spin_lock(&fq_delayed_lock);
+
+	node = bpf_rbtree_first(&fq_delayed);
+	if (!node) {
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+	if (!*unset_all && flow->time_next_packet > dequeue_now) {
+		time_next_delayed_flow = flow->time_next_packet;
+		bpf_spin_unlock(&fq_delayed_lock);
+		return 1;
+	}
+
+	node = bpf_rbtree_remove(&fq_delayed, &flow->rb_node);
+
+	bpf_spin_unlock(&fq_delayed_lock);
+
+	if (!node)
+		return 1; //unexpected
+
+	flow = container_of(node, struct fq_flow_node, rb_node);
+
+	/* the flow was recycled during enqueue() */
+	if (flow->age != THROTTLED) {
+		bpf_obj_drop(flow);
+		return 0;
+	}
+
+	flow->age = 0;
+	fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+
+	return 0;
+}
+
+static __always_inline void fq_flow_set_throttled(struct fq_flow_node *flow)
+{
+	flow->age = THROTTLED;
+
+	if (time_next_delayed_flow > flow->time_next_packet)
+		time_next_delayed_flow = flow->time_next_packet;
+
+	bpf_spin_lock(&fq_delayed_lock);
+	bpf_rbtree_add(&fq_delayed, &flow->rb_node, fn_time_next_packet_less);
+	bpf_spin_unlock(&fq_delayed_lock);
+}
+
+static __always_inline void fq_check_throttled(void)
+{
+	bool unset_all = false;
+	unsigned long sample;
+
+	if (time_next_delayed_flow > dequeue_now)
+		return;
+
+	sample = (unsigned long)(dequeue_now - time_next_delayed_flow);
+	unthrottle_latency_ns -= unthrottle_latency_ns >> 3;
+	unthrottle_latency_ns += sample >> 3;
+
+	time_next_delayed_flow = ~0ULL;
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+}
+
+static int
+fq_dequeue_nonprio_flows(u32 index, struct dequeue_nonprio_ctx *ctx)
+{
+	u64 time_next_packet, time_to_send;
+	struct skb_node *skbn, *skbn_tbd;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+	struct bpf_list_head *head;
+	struct bpf_list_node *node;
+	struct bpf_spin_lock *lock;
+	struct fq_flow_node *flow;
+	u32 plen, key = 0;
+	struct fq_cb cb;
+	bool is_empty;
+
+	head = &fq_new_flows;
+	lock = &fq_new_flows_lock;
+	bpf_spin_lock(&fq_new_flows_lock);
+	node = bpf_list_pop_front(&fq_new_flows);
+	bpf_spin_unlock(&fq_new_flows_lock);
+	if (!node) {
+		head = &fq_old_flows;
+		lock = &fq_old_flows_lock;
+		bpf_spin_lock(&fq_old_flows_lock);
+		node = bpf_list_pop_front(&fq_old_flows);
+		bpf_spin_unlock(&fq_old_flows_lock);
+		if (!node) {
+			if (time_next_delayed_flow != ~0ULL)
+				ctx->expire = time_next_delayed_flow;
+			return 1;
+		}
+	}
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	if (flow->credit <= 0) {
+		flow->credit += q_quantum;
+		fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+		return 0;
+	}
+
+	bpf_spin_lock(&flow->lock);
+	rb_node = bpf_rbtree_first(&flow->queue);
+	if (!rb_node) {
+		bpf_spin_unlock(&flow->lock);
+		is_empty = fq_flows_is_empty(&fq_old_flows, &fq_old_flows_lock);
+		if (head == &fq_new_flows && !is_empty)
+			fq_flows_add_tail(&fq_old_flows, &fq_old_flows_lock, flow);
+		else
+			fq_flow_set_detached(flow);
+
+		return 0;
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	time_to_send = skbn->tstamp;
+
+	time_next_packet = (time_to_send > flow->time_next_packet) ?
+		time_to_send : flow->time_next_packet;
+	if (dequeue_now < time_next_packet) {
+		bpf_spin_unlock(&flow->lock);
+		flow->time_next_packet = time_next_packet;
+		fq_flow_set_throttled(flow);
+		return 0;
+	}
+
+	rb_node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!rb_node) {
+		fq_flows_add_tail(head, lock, flow);
+		return 0; //unexpected
+	}
+
+	skbn = container_of(rb_node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	if (!skb)
+		goto out;
+
+	bpf_probe_read_kernel(&cb, sizeof(cb), skb->cb);
+	plen = cb.plen;
+
+	flow->credit -= plen;
+	flow->qlen--;
+	fq_qlen--;
+
+	ctx->dequeued = true;
+	bpf_qdisc_set_skb_dequeue(skb);
+out:
+	bpf_obj_drop(skbn);
+	fq_flows_add_head(head, lock, flow);
+
+	return 1;
+}
+
+static __always_inline struct sk_buff *fq_dequeue_prio(void)
+{
+	struct fq_flow_node *flow = NULL;
+	struct fq_stashed_flow *sflow;
+	struct sk_buff *skb = NULL;
+	struct bpf_rb_node *node;
+	struct skb_node *skbn;
+	u32 hash = NUM_QUEUE;
+
+	sflow = bpf_map_lookup_elem(&fq_stashed_flows, &hash);
+	if (!sflow)
+		return NULL; //unexpected
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (!flow)
+		return NULL;
+
+	bpf_spin_lock(&flow->lock);
+	node = bpf_rbtree_first(&flow->queue);
+	if (!node) {
+		bpf_spin_unlock(&flow->lock);
+		goto xchg_flow_back;
+	}
+
+	skbn = container_of(node, struct skb_node, node);
+	node = bpf_rbtree_remove(&flow->queue, &skbn->node);
+	bpf_spin_unlock(&flow->lock);
+
+	if (!node)
+		goto xchg_flow_back;
+
+	skbn = container_of(node, struct skb_node, node);
+	skb = bpf_kptr_xchg(&skbn->skb, skb);
+	bpf_obj_drop(skbn);
+	fq_qlen--;
+
+xchg_flow_back:
+	bpf_kptr_xchg_back(&sflow->flow, flow);
+
+	return skb;
+}
+
+SEC("qdisc/dequeue")
+int dequeue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct dequeue_nonprio_ctx cb_ctx = {};
+	struct skb_node *skbn = NULL;
+	struct bpf_rb_node *rb_node;
+	struct sk_buff *skb = NULL;
+
+	skb = fq_dequeue_prio();
+	if (skb) {
+		bpf_qdisc_set_skb_dequeue(skb);
+		return SCH_BPF_DEQUEUED;
+	}
+
+	ktime_cache = dequeue_now = bpf_ktime_get_ns();
+	fq_check_throttled();
+	bpf_loop(q_plimit, fq_dequeue_nonprio_flows, &cb_ctx, 0);
+
+	if (cb_ctx.dequeued)
+		return SCH_BPF_DEQUEUED;
+
+	if (cb_ctx.expire) {
+		ctx->expire = cb_ctx.expire;
+		return SCH_BPF_THROTTLE;
+	}
+
+	return SCH_BPF_DROP;
+}
+
+static int
+fq_reset_flows(u32 index, void *ctx)
+{
+	struct bpf_list_head *head;
+	struct bpf_list_node *node;
+	struct bpf_spin_lock *lock;
+	struct fq_flow_node *flow;
+
+	head = &fq_new_flows;
+	lock = &fq_new_flows_lock;
+	bpf_spin_lock(&fq_new_flows_lock);
+	node = bpf_list_pop_front(&fq_new_flows);
+	bpf_spin_unlock(&fq_new_flows_lock);
+	if (!node) {
+		head = &fq_old_flows;
+		lock = &fq_old_flows_lock;
+		bpf_spin_lock(&fq_old_flows_lock);
+		node = bpf_list_pop_front(&fq_old_flows);
+		bpf_spin_unlock(&fq_old_flows_lock);
+		if (!node)
+			return 1;
+	}
+
+	flow = container_of(node, struct fq_flow_node, list_node);
+	bpf_obj_drop(flow);
+
+	return 0;
+}
+
+static int
+fq_reset_stashed_flows(u32 index, void *ctx)
+{
+	struct fq_flow_node *flow = NULL;
+	struct fq_stashed_flow *sflow;
+
+	sflow = bpf_map_lookup_elem(&fq_stashed_flows, &index);
+	if (!sflow)
+		return 0;
+
+	flow = bpf_kptr_xchg(&sflow->flow, flow);
+	if (flow)
+		bpf_obj_drop(flow);
+
+	return 0;
+}
+
+SEC("qdisc/reset")
+void reset_prog(struct bpf_qdisc_ctx *ctx)
+{
+	bool unset_all = true;
+	fq_qlen = 0;
+	bpf_loop(NUM_QUEUE + 1, fq_reset_stashed_flows, NULL, 0);
+	bpf_loop(NUM_QUEUE, fq_reset_flows, NULL, 0);
+	bpf_loop(NUM_QUEUE, fq_unset_throttled_flows, &unset_all, 0);
+	return;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tc_sch_fq.c b/samples/bpf/tc_sch_fq.c
new file mode 100644
index 000000000000..0e55d377ea33
--- /dev/null
+++ b/samples/bpf/tc_sch_fq.c
@@ -0,0 +1,321 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <signal.h>
+#include <time.h>
+#include <sys/stat.h>
+
+#include <libmnl/libmnl.h>
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <net/if.h>
+
+#include "tc_sch_fq.skel.h"
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format,
+			   va_list args)
+{
+	return vprintf(format, args);
+}
+
+#define TCA_BUF_MAX (64 * 1024)
+#define FILTER_NAMESZ 16
+
+bool cleanup;
+unsigned int ifindex;
+unsigned int handle = 0x8000000;
+unsigned int parent = TC_H_ROOT;
+struct mnl_socket *nl;
+
+static void usage(const char *cmd)
+{
+	printf("Attach an fq eBPF qdisc and optionally an EDT rate limiter.\n");
+	printf("Usage: %s [...]\n", cmd);
+	printf("	-d <device>	Device\n");
+	printf("	-h <handle>	Qdisc handle\n");
+	printf("	-p <parent>	Parent Qdisc handle\n");
+	printf("	-s		Share packet drop info with the clsact EDT rate limiter\n");
+	printf("	-c		Delete the qdisc before quit\n");
+	printf("	-v		Verbose\n");
+}
+
+static int get_tc_classid(__u32 *h, const char *str)
+{
+	unsigned long maj, min;
+	char *p;
+
+	maj = TC_H_ROOT;
+	if (strcmp(str, "root") == 0)
+		goto ok;
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str) {
+		maj = 0;
+		if (*p != ':')
+			return -1;
+	}
+	if (*p == ':') {
+		if (maj >= (1<<16))
+			return -1;
+		maj <<= 16;
+		str = p+1;
+		min = strtoul(str, &p, 16);
+		if (*p != 0)
+			return -1;
+		if (min >= (1<<16))
+			return -1;
+		maj |= min;
+	} else if (*p != 0)
+		return -1;
+
+ok:
+	*h = maj;
+	return 0;
+}
+
+static int get_qdisc_handle(__u32 *h, const char *str)
+{
+	__u32 maj;
+	char *p;
+
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str || maj >= (1 << 16))
+		return -1;
+	maj <<= 16;
+	if (*p != ':' && *p != 0)
+		return -1;
+ok:
+	*h = maj;
+	return 0;
+}
+
+static void sigdown(int signo)
+{
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST,
+		.n.nlmsg_type = RTM_DELQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	if (!cleanup)
+		exit(0);
+
+	req.n.nlmsg_seq = time(NULL);
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = TC_H_ROOT;
+	req.t.tcm_handle = handle;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0)
+		exit(1);
+
+	exit(0);
+}
+
+static int qdisc_add_tc_sch_fq(struct tc_sch_fq *skel)
+{
+	char qdisc_type[FILTER_NAMESZ] = "bpf";
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct rtattr *option_attr;
+	const char *qdisc_name;
+	char prog_name[256];
+	int ret;
+	unsigned int seq, portid;
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_EXCL | NLM_F_CREATE,
+		.n.nlmsg_type = RTM_NEWQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	seq = time(NULL);
+	portid = mnl_socket_get_portid(nl);
+
+	qdisc_name = bpf_object__name(skel->obj);
+
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = parent;
+	req.t.tcm_handle = handle;
+	mnl_attr_put_str(&req.n, TCA_KIND, qdisc_type);
+
+	// eBPF Qdisc specific attributes
+	option_attr = (struct rtattr *)mnl_nlmsg_get_payload_tail(&req.n);
+	mnl_attr_put(&req.n, TCA_OPTIONS, 0, NULL);
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.enqueue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_enqueue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.dequeue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_dequeue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_RESET_PROG_FD,
+			 bpf_program__fd(skel->progs.reset_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_reset", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_RESET_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	option_attr->rta_len = (void *)mnl_nlmsg_get_payload_tail(&req.n) -
+			       (void *)option_attr;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0) {
+		perror("mnl_socket_sendto");
+		return -1;
+	}
+
+	for (;;) {
+		ret = mnl_socket_recvfrom(nl, buf, sizeof(buf));
+		if (ret == -1) {
+			if (errno == ENOBUFS || errno == EINTR)
+				continue;
+
+			if (errno == EAGAIN) {
+				errno = 0;
+				ret = 0;
+				break;
+			}
+
+			perror("mnl_socket_recvfrom");
+			return -1;
+		}
+
+		ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL);
+		if (ret < 0) {
+			perror("mnl_cb_run");
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_level = 2);
+	bool verbose = false, share = false;
+	struct tc_sch_fq *skel = NULL;
+	struct stat stat_buf = {};
+	char d[IFNAMSIZ] = "lo";
+	int opt, ret = 1;
+	struct sigaction sa = {
+		.sa_handler = sigdown,
+	};
+
+	while ((opt = getopt(argc, argv, "d:h:p:csv")) != -1) {
+		switch (opt) {
+		/* General args */
+		case 'd':
+			strncpy(d, optarg, sizeof(d)-1);
+			break;
+		case 'h':
+			ret = get_qdisc_handle(&handle, optarg);
+			if (ret) {
+				printf("Invalid qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'p':
+			ret = get_tc_classid(&parent, optarg);
+			if (ret) {
+				printf("Invalid parent qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'c':
+			cleanup = true;
+			break;
+		case 's':
+			share = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	nl = mnl_socket_open(NETLINK_ROUTE);
+	if (!nl) {
+		perror("mnl_socket_open");
+		return 1;
+	}
+
+	ret = mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID);
+	if (ret < 0) {
+		perror("mnl_socket_bind");
+		ret = 1;
+		goto out;
+	}
+
+	ifindex = if_nametoindex(d);
+	if (errno == ENODEV) {
+		fprintf(stderr, "No such device: %s\n", d);
+		goto out;
+	}
+
+	if (sigaction(SIGINT, &sa, NULL) || sigaction(SIGTERM, &sa, NULL))
+		goto out;
+
+	if (verbose)
+		libbpf_set_print(libbpf_print_fn);
+
+	skel = tc_sch_fq__open_opts(&opts);
+	if (!skel) {
+		perror("Failed to open tc_sch_fq");
+		goto out;
+	}
+
+	if (share) {
+		if (stat("/sys/fs/bpf/tc", &stat_buf) == -1)
+			mkdir("/sys/fs/bpf/tc", 0700);
+
+		mkdir("/sys/fs/bpf/tc/globals", 0700);
+
+		bpf_map__set_pin_path(skel->maps.rate_map, "/sys/fs/bpf/tc/globals/rate_map");
+		bpf_map__set_pin_path(skel->maps.comp_map, "/sys/fs/bpf/tc/globals/comp_map");
+
+		skel->bss->q_compensate_tstamp = true;
+		skel->bss->q_random_drop = true;
+	}
+
+	ret = tc_sch_fq__load(skel);
+	if (ret) {
+		perror("Failed to load tc_sch_fq");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	ret = qdisc_add_tc_sch_fq(skel);
+	if (ret < 0) {
+		perror("Failed to create qdisc");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	for (;;)
+		pause();
+
+out_destroy:
+	tc_sch_fq__destroy(skel);
+out:
+	mnl_socket_close(nl);
+	return ret;
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (6 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Amery Hung
@ 2024-01-17 21:56 ` Amery Hung
  2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
  8 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-17 21:56 UTC (permalink / raw
  To: netdev; +Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

tc_sch_netem.bpf.c
A simple bpf network emulator (netem) qdisc that simulates packet loss and
delay. The qdisc shares the state machine of the Gilbert-Elliott model via
an eBPF map when it is added to multiple tx queues.

tc_sch_netem.c
A user space program that loads and attaches the eBPF-based netem qdisc. By
default it adds the bpf netem qdisc to the loopback device, but it can also
attach to another device and parent class with the '-d' and '-p' options.

To test mq + netem with shared state machine:
$ tc qdisc add dev ens5 root handle 1: mq
$ ./tc_sch_netem -d ens5 -p 1:1 -h 801 -s
$ ./tc_sch_netem -d ens5 -p 1:2 -h 802 -s
$ ./tc_sch_netem -d ens5 -p 1:3 -h 803 -s
$ ./tc_sch_netem -d ens5 -p 1:4 -h 804 -s
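
The shared state machine and the resulting loss and delay can then be
observed with ordinary tools, for example (illustrative only; ens5 is the
example device used above):
$ tc -s qdisc show dev ens5
$ ping -c 10 <a host reachable via ens5>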

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
---
 samples/bpf/Makefile           |   8 +-
 samples/bpf/tc_sch_netem.bpf.c | 256 ++++++++++++++++++++++++
 samples/bpf/tc_sch_netem.c     | 347 +++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/tc_sch_netem.bpf.c
 create mode 100644 samples/bpf/tc_sch_netem.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea516a00352d..880f15ae4bed 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -47,6 +47,7 @@ tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
 tprogs-y += tc_sch_fq
+tprogs-y += tc_sch_netem
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -100,6 +101,7 @@ hbm-objs := hbm.o $(CGROUP_HELPERS)
 
 xdp_router_ipv4-objs := xdp_router_ipv4_user.o $(XDP_SAMPLE)
 tc_sch_fq-objs := tc_sch_fq.o
+tc_sch_netem-objs := tc_sch_netem.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -152,6 +154,7 @@ always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
 always-y += tc_sch_fq.bpf.o
+always-y += tc_sch_netem.bpf.o
 
 TPROGS_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -199,6 +202,7 @@ TPROGLDLIBS_trace_output	+= -lrt
 TPROGLDLIBS_map_perf_test	+= -lrt
 TPROGLDLIBS_test_overhead	+= -lrt
 TPROGLDLIBS_tc_sch_fq		+= -lmnl
+TPROGLDLIBS_tc_sch_netem	+= -lmnl
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 # make M=samples/bpf LLC=~/git/llvm-project/llvm/build/bin/llc CLANG=~/git/llvm-project/llvm/build/bin/clang
@@ -311,6 +315,7 @@ $(obj)/$(TRACE_HELPERS) $(obj)/$(CGROUP_HELPERS) $(obj)/$(XDP_SAMPLE): | libbpf_
 
 $(obj)/xdp_router_ipv4_user.o: $(obj)/xdp_router_ipv4.skel.h
 $(obj)/tc_sch_fq.o: $(obj)/tc_sch_fq.skel.h
+$(obj)/tc_sch_netem.o: $(obj)/tc_sch_netem.skel.h
 
 $(obj)/tracex5.bpf.o: $(obj)/syscall_nrs.h
 $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
@@ -375,11 +380,12 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h tc_sch_fq.skel.h tc_sch_netem.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
 tc_sch_fq.skel.h-deps := tc_sch_fq.bpf.o
+tc_sch_netem.skel.h-deps := tc_sch_netem.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/samples/bpf/tc_sch_netem.bpf.c b/samples/bpf/tc_sch_netem.bpf.c
new file mode 100644
index 000000000000..b4db382f2c58
--- /dev/null
+++ b/samples/bpf/tc_sch_netem.bpf.c
@@ -0,0 +1,256 @@
+#include "vmlinux.h"
+#include "bpf_experimental.h"
+#include <bpf/bpf_helpers.h>
+
+#define NETEM_DIST_SCALE	8192
+
+#define NS_PER_SEC		1000000000
+
+int q_loss_model = CLG_GILB_ELL;
+unsigned int q_limit = 1000;
+signed long q_latency = 0;
+signed long q_jitter = 0;
+unsigned int q_loss = 1;
+unsigned int q_qlen = 0;
+
+struct crndstate q_loss_cor = {.last = 0, .rho = 0,};
+struct crndstate q_delay_cor = {.last = 0, .rho = 0,};
+
+struct skb_node {
+	u64 tstamp;
+	struct sk_buff __kptr *skb;
+	struct bpf_rb_node node;
+};
+
+struct clg_state {
+	u64 state;
+	u32 a1;
+	u32 a2;
+	u32 a3;
+	u32 a4;
+	u32 a5;
+};
+
+static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct skb_node *skb_a;
+	struct skb_node *skb_b;
+
+	skb_a = container_of(a, struct skb_node, node);
+	skb_b = container_of(b, struct skb_node, node);
+
+	return skb_a->tstamp < skb_b->tstamp;
+}
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, __u32);
+	__type(value, struct clg_state);
+	__uint(max_entries, 1);
+} g_clg_state SEC(".maps");
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+
+private(A) struct bpf_spin_lock t_root_lock;
+private(A) struct bpf_rb_root t_root __contains(skb_node, node);
+
+struct sk_buff *bpf_skb_acquire(struct sk_buff *p) __ksym;
+void bpf_skb_release(struct sk_buff *p) __ksym;
+u32 bpf_skb_get_hash(struct sk_buff *p) __ksym;
+void bpf_qdisc_set_skb_dequeue(struct sk_buff *p) __ksym;
+
+static __always_inline u32 get_crandom(struct crndstate *state)
+{
+	u64 value, rho;
+	unsigned long answer;
+
+	if (!state || state->rho == 0)	/* no correlation */
+		return bpf_get_prandom_u32();
+
+	value = bpf_get_prandom_u32();
+	rho = (u64)state->rho + 1;
+	answer = (value * ((1ull<<32) - rho) + state->last * rho) >> 32;
+	state->last = answer;
+	return answer;
+}
+
+static __always_inline s64 tabledist(s64 mu, s32 sigma, struct crndstate *state)
+{
+	s64 x;
+	long t;
+	u32 rnd;
+
+	if (sigma == 0)
+		return mu;
+
+	rnd = get_crandom(state);
+
+	/* default uniform distribution */
+	return ((rnd % (2 * (u32)sigma)) + mu) - sigma;
+}
+
+static __always_inline bool loss_gilb_ell(void)
+{
+	struct clg_state *clg;
+	u32 r1, r2, key = 0;
+	bool ret = false;
+
+ 	clg = bpf_map_lookup_elem(&g_clg_state, &key);
+	if (!clg)
+		return false;
+
+	r1 = bpf_get_prandom_u32();
+	r2 = bpf_get_prandom_u32();
+
+	switch (clg->state) {
+	case GOOD_STATE:
+		if (r1 < clg->a1)
+			__sync_val_compare_and_swap(&clg->state,
+						    GOOD_STATE, BAD_STATE);
+		if (r2 < clg->a4)
+			ret = true;
+		break;
+	case BAD_STATE:
+		if (r1 < clg->a2)
+			__sync_val_compare_and_swap(&clg->state,
+						    BAD_STATE, GOOD_STATE);
+		if (r2 > clg->a3)
+			ret = true;
+	}
+
+	return ret;
+}
+
+static __always_inline bool loss_event(void)
+{
+	switch (q_loss_model) {
+	case CLG_RANDOM:
+		return q_loss && q_loss >= get_crandom(&q_loss_cor);
+	case CLG_GILB_ELL:
+		return loss_gilb_ell();
+	}
+
+	return false;
+}
+
+static __always_inline void tfifo_enqueue(struct skb_node *skbn)
+{
+	bpf_spin_lock(&t_root_lock);
+	bpf_rbtree_add(&t_root, &skbn->node, skbn_tstamp_less);
+	bpf_spin_unlock(&t_root_lock);
+}
+
+SEC("qdisc/enqueue")
+int enqueue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct sk_buff *old, *skb = ctx->skb;
+	struct skb_node *skbn;
+	int count = 1;
+	s64 delay = 0;
+	u64 now;
+
+	if (loss_event())
+		--count;
+
+	if (count == 0)
+		return SCH_BPF_BYPASS;
+
+	q_qlen++;
+	if (q_qlen > q_limit)
+		return SCH_BPF_DROP;
+
+	skb = bpf_skb_acquire(ctx->skb);
+	skbn = bpf_obj_new(typeof(*skbn));
+	if (!skbn) {
+		bpf_skb_release(skb);
+		return SCH_BPF_DROP;
+	}
+
+	delay = tabledist(q_latency, q_jitter, &q_delay_cor);
+
+	now = bpf_ktime_get_ns();
+
+	skbn->tstamp = now + delay;
+	old = bpf_kptr_xchg(&skbn->skb, skb);
+	if (old)
+		bpf_skb_release(old);
+
+	tfifo_enqueue(skbn);
+	return SCH_BPF_QUEUED;
+}
+
+
+SEC("qdisc/dequeue")
+int dequeue_prog(struct bpf_qdisc_ctx *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct sk_buff *skb = NULL;
+	struct skb_node *skbn;
+	u64 now;
+
+	now = bpf_ktime_get_ns();
+
+	bpf_spin_lock(&t_root_lock);
+	node = bpf_rbtree_first(&t_root);
+	if (!node) {
+		bpf_spin_unlock(&t_root_lock);
+		return SCH_BPF_DROP;
+	}
+
+	skbn = container_of(node, struct skb_node, node);
+	if (skbn->tstamp <= now) {
+		node = bpf_rbtree_remove(&t_root, &skbn->node);
+		bpf_spin_unlock(&t_root_lock);
+
+		if (!node)
+			return SCH_BPF_DROP;
+
+		skbn = container_of(node, struct skb_node, node);
+		skb = bpf_kptr_xchg(&skbn->skb, skb);
+		if (!skb) {
+			bpf_obj_drop(skbn);
+			return SCH_BPF_DROP;
+		}
+
+		bpf_qdisc_set_skb_dequeue(skb);
+		bpf_obj_drop(skbn);
+
+		q_qlen--;
+		return SCH_BPF_DEQUEUED;
+	}
+
+	ctx->expire = skbn->tstamp;
+	bpf_spin_unlock(&t_root_lock);
+	return SCH_BPF_THROTTLE;
+}
+
+static int reset_queue(u32 index, void *ctx)
+{
+	struct bpf_rb_node *node = NULL;
+	struct skb_node *skbn;
+
+	bpf_spin_lock(&t_root_lock);
+	node = bpf_rbtree_first(&t_root);
+	if (!node) {
+		bpf_spin_unlock(&t_root_lock);
+		return 1;
+	}
+
+	skbn = container_of(node, struct skb_node, node);
+	node = bpf_rbtree_remove(&t_root, &skbn->node);
+	bpf_spin_unlock(&t_root_lock);
+
+	if (!node)
+		return 1;
+
+	skbn = container_of(node, struct skb_node, node);
+	bpf_obj_drop(skbn);
+	return 0;
+}
+
+SEC("qdisc/reset")
+void reset_prog(struct bpf_qdisc_ctx *ctx)
+{
+	bpf_loop(q_limit, reset_queue, NULL, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tc_sch_netem.c b/samples/bpf/tc_sch_netem.c
new file mode 100644
index 000000000000..918d626909d3
--- /dev/null
+++ b/samples/bpf/tc_sch_netem.c
@@ -0,0 +1,347 @@
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <signal.h>
+#include <time.h>
+#include <limits.h>
+#include <sys/stat.h>
+
+#include <libmnl/libmnl.h>
+#include <linux/bpf.h>
+#include <linux/pkt_sched.h>
+#include <linux/rtnetlink.h>
+#include <net/if.h>
+
+struct crndstate {
+	__u32 last;
+	__u32 rho;
+};
+
+struct clg_state {
+	__u64 state;
+	__u32 a1;
+	__u32 a2;
+	__u32 a3;
+	__u32 a4;
+	__u32 a5;
+};
+
+#include "tc_sch_netem.skel.h"
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format,
+			   va_list args)
+{
+	return vprintf(format, args);
+}
+
+#define TCA_BUF_MAX (64 * 1024)
+#define FILTER_NAMESZ 16
+
+bool cleanup;
+unsigned int ifindex;
+unsigned int handle = 0x8000000;
+unsigned int parent = TC_H_ROOT;
+struct mnl_socket *nl;
+
+static void usage(const char *cmd)
+{
+	printf("Attach a netem eBPF qdisc and optionally an EDT rate limiter.\n");
+	printf("Usage: %s [...]\n", cmd);
+	printf("	-d <device>	Device\n");
+	printf("	-h <handle>	Qdisc handle\n");
+	printf("	-p <parent>	Parent Qdisc handle\n");
+	printf("	-s		Share a global Gilbert-Elliott state machine\n");
+	printf("	-c		Delete the qdisc before quit\n");
+	printf("	-v		Verbose\n");
+}
+
+static int get_tc_classid(__u32 *h, const char *str)
+{
+	unsigned long maj, min;
+	char *p;
+
+	maj = TC_H_ROOT;
+	if (strcmp(str, "root") == 0)
+		goto ok;
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str) {
+		maj = 0;
+		if (*p != ':')
+			return -1;
+	}
+	if (*p == ':') {
+		if (maj >= (1<<16))
+			return -1;
+		maj <<= 16;
+		str = p+1;
+		min = strtoul(str, &p, 16);
+		if (*p != 0)
+			return -1;
+		if (min >= (1<<16))
+			return -1;
+		maj |= min;
+	} else if (*p != 0)
+		return -1;
+
+ok:
+	*h = maj;
+	return 0;
+}
+
+static int get_qdisc_handle(__u32 *h, const char *str)
+{
+	__u32 maj;
+	char *p;
+
+	maj = TC_H_UNSPEC;
+	if (strcmp(str, "none") == 0)
+		goto ok;
+	maj = strtoul(str, &p, 16);
+	if (p == str || maj >= (1 << 16))
+		return -1;
+	maj <<= 16;
+	if (*p != ':' && *p != 0)
+		return -1;
+ok:
+	*h = maj;
+	return 0;
+}
+
+static void sigdown(int signo)
+{
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST,
+		.n.nlmsg_type = RTM_DELQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	if (!cleanup)
+		exit(0);
+
+	req.n.nlmsg_seq = time(NULL);
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = parent;
+	req.t.tcm_handle = handle;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0)
+		exit(1);
+
+	exit(0);
+}
+
+static int qdisc_add_tc_sch_netem(struct tc_sch_netem *skel)
+{
+	char qdisc_type[FILTER_NAMESZ] = "bpf";
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct rtattr *option_attr;
+	const char *qdisc_name;
+	char prog_name[256];
+	int ret;
+	unsigned int seq, portid;
+	struct {
+		struct nlmsghdr n;
+		struct tcmsg t;
+		char buf[TCA_BUF_MAX];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_EXCL | NLM_F_CREATE,
+		.n.nlmsg_type = RTM_NEWQDISC,
+		.t.tcm_family = AF_UNSPEC,
+	};
+
+	seq = time(NULL);
+	portid = mnl_socket_get_portid(nl);
+
+	qdisc_name = bpf_object__name(skel->obj);
+
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = parent;
+	req.t.tcm_handle = handle;
+	mnl_attr_put_str(&req.n, TCA_KIND, qdisc_type);
+
+	// eBPF Qdisc specific attributes
+	option_attr = (struct rtattr *)mnl_nlmsg_get_payload_tail(&req.n);
+	mnl_attr_put(&req.n, TCA_OPTIONS, 0, NULL);
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.enqueue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_enqueue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_ENQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_FD,
+			 bpf_program__fd(skel->progs.dequeue_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_dequeue", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_DEQUEUE_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	mnl_attr_put_u32(&req.n, TCA_SCH_BPF_RESET_PROG_FD,
+			 bpf_program__fd(skel->progs.reset_prog));
+	snprintf(prog_name, sizeof(prog_name), "%s_reset", qdisc_name);
+	mnl_attr_put(&req.n, TCA_SCH_BPF_RESET_PROG_NAME, strlen(prog_name) + 1, prog_name);
+
+	option_attr->rta_len = (void *)mnl_nlmsg_get_payload_tail(&req.n) -
+			       (void *)option_attr;
+
+	if (mnl_socket_sendto(nl, &req.n, req.n.nlmsg_len) < 0) {
+		perror("mnl_socket_sendto");
+		return -1;
+	}
+
+	for (;;) {
+		ret = mnl_socket_recvfrom(nl, buf, sizeof(buf));
+		if (ret == -1) {
+			if (errno == ENOBUFS || errno == EINTR)
+				continue;
+
+			if (errno == EAGAIN) {
+				errno = 0;
+				ret = 0;
+				break;
+			}
+
+			perror("mnl_socket_recvfrom");
+			return -1;
+		}
+
+		ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL);
+		if (ret < 0) {
+			perror("mnl_cb_run");
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_level = 2);
+	bool verbose = false, share = false, state_init = false;
+	struct tc_sch_netem *skel = NULL;
+	struct clg_state state = {};
+	struct stat stat_buf = {};
+	int opt, ret = 1, key = 0;
+	char d[IFNAMSIZ] = "lo";
+	struct sigaction sa = {
+		.sa_handler = sigdown,
+	};
+
+	while ((opt = getopt(argc, argv, "d:h:p:csv")) != -1) {
+		switch (opt) {
+		/* General args */
+		case 'd':
+			strncpy(d, optarg, sizeof(d)-1);
+			break;
+		case 'h':
+			ret = get_qdisc_handle(&handle, optarg);
+			if (ret) {
+				printf("Invalid qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'p':
+			ret = get_tc_classid(&parent, optarg);
+			if (ret) {
+				printf("Invalid parent qdisc handle\n");
+				return 1;
+			}
+			break;
+		case 'c':
+			cleanup = true;
+			break;
+		case 's':
+			share = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	nl = mnl_socket_open(NETLINK_ROUTE);
+	if (!nl) {
+		perror("mnl_socket_open");
+		return 1;
+	}
+
+	ret = mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID);
+	if (ret < 0) {
+		perror("mnl_socket_bind");
+		ret = 1;
+		goto out;
+	}
+
+	ifindex = if_nametoindex(d);
+	if (errno == ENODEV) {
+		fprintf(stderr, "No such device: %s\n", d);
+		goto out;
+	}
+
+	if (sigaction(SIGINT, &sa, NULL) || sigaction(SIGTERM, &sa, NULL))
+		goto out;
+
+	if (verbose)
+		libbpf_set_print(libbpf_print_fn);
+
+	skel = tc_sch_netem__open_opts(&opts);
+	if (!skel) {
+		perror("Failed to open tc_sch_netem");
+		goto out;
+	}
+
+	if (share) {
+		if (stat("/sys/fs/bpf/tc", &stat_buf) == -1)
+			mkdir("/sys/fs/bpf/tc", 0700);
+
+		mkdir("/sys/fs/bpf/tc/globals", 0700);
+
+		bpf_map__set_pin_path(skel->maps.g_clg_state, "/sys/fs/bpf/tc/globals/g_clg_state");
+	}
+
+	ret = tc_sch_netem__load(skel);
+	if (ret) {
+		perror("Failed to load tc_sch_netem");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	if (!state_init) {
+		state.state = 1;
+		state.a1 = (double)UINT_MAX * 0.05;
+		state.a2 = (double)UINT_MAX * 0.95;
+		state.a3 = (double)UINT_MAX * 0.30;
+		state.a4 = (double)UINT_MAX * 0.001;
+
+		bpf_map__update_elem(skel->maps.g_clg_state, &key, sizeof(key), &state,
+				     sizeof(state), 0);
+
+		state_init = true;
+	}
+
+	ret = qdisc_add_tc_sch_netem(skel);
+	if (ret < 0) {
+		perror("Failed to create qdisc");
+		ret = 1;
+		goto out_destroy;
+	}
+
+	for (;;)
+		pause();
+
+out_destroy:
+	tc_sch_netem__destroy(skel);
+out:
+	mnl_socket_close(nl);
+	return ret;
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
@ 2024-01-20 11:41   ` kernel test robot
  2024-01-20 12:13   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2024-01-20 11:41 UTC (permalink / raw
  To: Amery Hung; +Cc: oe-kbuild-all

Hi Amery,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on v6.7]
[cannot apply to bpf-next/master bpf/master net/main net-next/main linus/master horms-ipvs/master next-20240119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Amery-Hung/net_sched-Introduce-eBPF-based-Qdisc/20240118-055917
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:    https://lore.kernel.org/r/232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung%40bytedance.com
patch subject: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
config: i386-randconfig-013-20240120 (https://download.01.org/0day-ci/archive/20240120/202401201946.WJWvnukr-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240120/202401201946.WJWvnukr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401201946.WJWvnukr-lkp@intel.com/

All warnings (new ones prefixed by >>):

   net/core/filter.c: In function 'tc_qdisc_is_valid_access':
   net/core/filter.c:8929:9: error: implicit declaration of function 'bpf_get_btf_vmlinux' [-Werror=implicit-function-declaration]
    8929 |   btf = bpf_get_btf_vmlinux();
         |         ^~~~~~~~~~~~~~~~~~~
>> net/core/filter.c:8929:7: warning: assignment to 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
    8929 |   btf = bpf_get_btf_vmlinux();
         |       ^
   cc1: some warnings being treated as errors


vim +8929 net/core/filter.c

  8909	
  8910	static bool tc_qdisc_is_valid_access(int off, int size,
  8911					     enum bpf_access_type type,
  8912					     const struct bpf_prog *prog,
  8913					     struct bpf_insn_access_aux *info)
  8914	{
  8915		struct btf *btf;
  8916	
  8917		if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
  8918			return false;
  8919	
  8920		switch (off) {
  8921		case offsetof(struct bpf_qdisc_ctx, skb):
  8922			if (type == BPF_WRITE ||
  8923			    size != sizeof_field(struct bpf_qdisc_ctx, skb))
  8924				return false;
  8925	
  8926			if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
  8927				return false;
  8928	
> 8929			btf = bpf_get_btf_vmlinux();
  8930			if (IS_ERR_OR_NULL(btf))
  8931				return false;
  8932	
  8933			info->btf = btf;
  8934			info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
  8935			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
  8936			return true;
  8937		case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
  8938			return size == sizeof_field(struct bpf_qdisc_ctx, classid);
  8939		case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
  8940			return size == sizeof_field(struct bpf_qdisc_ctx, expire);
  8941		case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
  8942			return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
  8943		default:
  8944			return false;
  8945		}
  8946	
  8947		return false;
  8948	}
  8949	
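
One possible direction for fixing this (a sketch only, not part of the
report; it assumes bpf_get_btf_vmlinux() is declared in include/linux/bpf.h
only when CONFIG_BPF_SYSCALL is enabled) is to provide a fallback stub for
configs without BPF syscall support, or to guard the qdisc context access
code accordingly:

#ifndef CONFIG_BPF_SYSCALL
/* hypothetical stub so !CONFIG_BPF_SYSCALL builds keep compiling */
static inline struct btf *bpf_get_btf_vmlinux(void)
{
	return ERR_PTR(-EOPNOTSUPP);
}
#endif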

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
  2024-01-20 11:41   ` kernel test robot
@ 2024-01-20 12:13   ` kernel test robot
  2024-01-20 13:47   ` kernel test robot
  2024-01-23 23:51   ` Martin KaFai Lau
  3 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2024-01-20 12:13 UTC (permalink / raw
  To: Amery Hung; +Cc: llvm, oe-kbuild-all

Hi Amery,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on mst-vhost/linux-next]
[also build test ERROR on v6.7]
[cannot apply to bpf-next/master bpf/master net/main net-next/main linus/master horms-ipvs/master next-20240119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Amery-Hung/net_sched-Introduce-eBPF-based-Qdisc/20240118-055917
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:    https://lore.kernel.org/r/232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung%40bytedance.com
patch subject: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
config: i386-buildonly-randconfig-002-20240120 (https://download.01.org/0day-ci/archive/20240120/202401201954.x8zfr4TI-lkp@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240120/202401201954.x8zfr4TI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401201954.x8zfr4TI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> net/core/filter.c:8929:9: error: call to undeclared function 'bpf_get_btf_vmlinux'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    8929 |                 btf = bpf_get_btf_vmlinux();
         |                       ^
>> net/core/filter.c:8929:7: error: incompatible integer to pointer conversion assigning to 'struct btf *' from 'int' [-Wint-conversion]
    8929 |                 btf = bpf_get_btf_vmlinux();
         |                     ^ ~~~~~~~~~~~~~~~~~~~~~
   2 errors generated.


vim +/bpf_get_btf_vmlinux +8929 net/core/filter.c

  8909	
  8910	static bool tc_qdisc_is_valid_access(int off, int size,
  8911					     enum bpf_access_type type,
  8912					     const struct bpf_prog *prog,
  8913					     struct bpf_insn_access_aux *info)
  8914	{
  8915		struct btf *btf;
  8916	
  8917		if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
  8918			return false;
  8919	
  8920		switch (off) {
  8921		case offsetof(struct bpf_qdisc_ctx, skb):
  8922			if (type == BPF_WRITE ||
  8923			    size != sizeof_field(struct bpf_qdisc_ctx, skb))
  8924				return false;
  8925	
  8926			if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
  8927				return false;
  8928	
> 8929			btf = bpf_get_btf_vmlinux();
  8930			if (IS_ERR_OR_NULL(btf))
  8931				return false;
  8932	
  8933			info->btf = btf;
  8934			info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
  8935			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
  8936			return true;
  8937		case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
  8938			return size == sizeof_field(struct bpf_qdisc_ctx, classid);
  8939		case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
  8940			return size == sizeof_field(struct bpf_qdisc_ctx, expire);
  8941		case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
  8942			return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
  8943		default:
  8944			return false;
  8945		}
  8946	
  8947		return false;
  8948	}
  8949	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
  2024-01-20 11:41   ` kernel test robot
  2024-01-20 12:13   ` kernel test robot
@ 2024-01-20 13:47   ` kernel test robot
  2024-01-23 23:51   ` Martin KaFai Lau
  3 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2024-01-20 13:47 UTC (permalink / raw
  To: Amery Hung; +Cc: oe-kbuild-all

Hi Amery,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on mst-vhost/linux-next]
[also build test ERROR on v6.7]
[cannot apply to bpf-next/master bpf/master net/main net-next/main linus/master horms-ipvs/master next-20240119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Amery-Hung/net_sched-Introduce-eBPF-based-Qdisc/20240118-055917
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:    https://lore.kernel.org/r/232881645a5c4c05a35df4ff1f08a19ef9a02662.1705432850.git.amery.hung%40bytedance.com
patch subject: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
config: i386-randconfig-013-20240120 (https://download.01.org/0day-ci/archive/20240120/202401202131.VhVYkEh1-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240120/202401202131.VhVYkEh1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202401202131.VhVYkEh1-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/core/filter.c: In function 'tc_qdisc_is_valid_access':
>> net/core/filter.c:8929:9: error: implicit declaration of function 'bpf_get_btf_vmlinux' [-Werror=implicit-function-declaration]
    8929 |   btf = bpf_get_btf_vmlinux();
         |         ^~~~~~~~~~~~~~~~~~~
   net/core/filter.c:8929:7: warning: assignment to 'struct btf *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
    8929 |   btf = bpf_get_btf_vmlinux();
         |       ^
   cc1: some warnings being treated as errors


vim +/bpf_get_btf_vmlinux +8929 net/core/filter.c

  8909	
  8910	static bool tc_qdisc_is_valid_access(int off, int size,
  8911					     enum bpf_access_type type,
  8912					     const struct bpf_prog *prog,
  8913					     struct bpf_insn_access_aux *info)
  8914	{
  8915		struct btf *btf;
  8916	
  8917		if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
  8918			return false;
  8919	
  8920		switch (off) {
  8921		case offsetof(struct bpf_qdisc_ctx, skb):
  8922			if (type == BPF_WRITE ||
  8923			    size != sizeof_field(struct bpf_qdisc_ctx, skb))
  8924				return false;
  8925	
  8926			if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
  8927				return false;
  8928	
> 8929			btf = bpf_get_btf_vmlinux();
  8930			if (IS_ERR_OR_NULL(btf))
  8931				return false;
  8932	
  8933			info->btf = btf;
  8934			info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
  8935			info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
  8936			return true;
  8937		case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
  8938			return size == sizeof_field(struct bpf_qdisc_ctx, classid);
  8939		case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
  8940			return size == sizeof_field(struct bpf_qdisc_ctx, expire);
  8941		case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
  8942			return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
  8943		default:
  8944			return false;
  8945		}
  8946	
  8947		return false;
  8948	}
  8949	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC
  2024-01-17 21:56 ` [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Amery Hung
@ 2024-01-23  0:17   ` Andrii Nakryiko
  2024-01-23 19:40     ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Andrii Nakryiko @ 2024-01-23  0:17 UTC (permalink / raw
  To: Amery Hung
  Cc: netdev, bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

On Wed, Jan 17, 2024 at 1:57 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> While eBPF qdisc uses NETLINK for attachment, expected_attach_type is
> required at load time to verify context access from different programs.
> This patch adds the section definition for this.
>
> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>  tools/lib/bpf/libbpf.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index e067be95da3c..0541f85b4ce6 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -8991,6 +8991,10 @@ static const struct bpf_sec_def section_defs[] = {
>         SEC_DEF("struct_ops.s+",        STRUCT_OPS, 0, SEC_SLEEPABLE),
>         SEC_DEF("sk_lookup",            SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE),
>         SEC_DEF("netfilter",            NETFILTER, BPF_NETFILTER, SEC_NONE),
> +       SEC_DEF("qdisc/enqueue",        QDISC, BPF_QDISC_ENQUEUE, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("qdisc/dequeue",        QDISC, BPF_QDISC_DEQUEUE, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("qdisc/reset",          QDISC, BPF_QDISC_RESET, SEC_ATTACHABLE_OPT),
> +       SEC_DEF("qdisc/init",           QDISC, BPF_QDISC_INIT, SEC_ATTACHABLE_OPT),

seems like SEC_ATTACHABLE (or just 0) is what you want.
expected_attach_type shouldn't be optional for any new program type

>  };
>
>  int libbpf_register_prog_handler(const char *sec,
> --
> 2.20.1
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC
  2024-01-23  0:17   ` Andrii Nakryiko
@ 2024-01-23 19:40     ` Amery Hung
  0 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-23 19:40 UTC (permalink / raw
  To: Andrii Nakryiko
  Cc: netdev, bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

On Mon, Jan 22, 2024 at 4:18 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 1:57 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > While eBPF qdisc uses NETLINK for attachment, expected_attach_type is
> > required at load time to verify context access from different programs.
> > This patch adds the section definition for this.
> >
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >  tools/lib/bpf/libbpf.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index e067be95da3c..0541f85b4ce6 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -8991,6 +8991,10 @@ static const struct bpf_sec_def section_defs[] = {
> >         SEC_DEF("struct_ops.s+",        STRUCT_OPS, 0, SEC_SLEEPABLE),
> >         SEC_DEF("sk_lookup",            SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE),
> >         SEC_DEF("netfilter",            NETFILTER, BPF_NETFILTER, SEC_NONE),
> > +       SEC_DEF("qdisc/enqueue",        QDISC, BPF_QDISC_ENQUEUE, SEC_ATTACHABLE_OPT),
> > +       SEC_DEF("qdisc/dequeue",        QDISC, BPF_QDISC_DEQUEUE, SEC_ATTACHABLE_OPT),
> > +       SEC_DEF("qdisc/reset",          QDISC, BPF_QDISC_RESET, SEC_ATTACHABLE_OPT),
> > +       SEC_DEF("qdisc/init",           QDISC, BPF_QDISC_INIT, SEC_ATTACHABLE_OPT),
>
> seems like SEC_ATTACHABLE (or just 0) is what you want.
> expected_attach_type shouldn't be optional for any new program type
>

Got it. Will change the flags to SEC_NONE.
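
For reference, the entries would then look roughly like this (a sketch of the
planned follow-up, not the patch as posted):

        SEC_DEF("qdisc/enqueue",        QDISC, BPF_QDISC_ENQUEUE, SEC_NONE),
        SEC_DEF("qdisc/dequeue",        QDISC, BPF_QDISC_DEQUEUE, SEC_NONE),
        SEC_DEF("qdisc/reset",          QDISC, BPF_QDISC_RESET, SEC_NONE),
        SEC_DEF("qdisc/init",           QDISC, BPF_QDISC_INIT, SEC_NONE),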

> >  };
> >
> >  int libbpf_register_prog_handler(const char *sec,
> > --
> > 2.20.1
> >
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
                   ` (7 preceding siblings ...)
  2024-01-17 21:56 ` [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc Amery Hung
@ 2024-01-23 21:13 ` Stanislav Fomichev
  2024-01-24 10:10   ` Daniel Borkmann
  2024-01-24 12:09   ` Jamal Hadi Salim
  8 siblings, 2 replies; 36+ messages in thread
From: Stanislav Fomichev @ 2024-01-23 21:13 UTC (permalink / raw
  To: Amery Hung
  Cc: netdev, bpf, yangpeihao, toke, jhs, jiri, xiyou.wangcong,
	yepeilin.cs

On 01/17, Amery Hung wrote:
> Hi, 
> 
> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
> RFC. The followings are some use cases of eBPF Qdisc:
> 
> 1. Allow customizing Qdiscs in an easier way. So that people don't
>    have to write a complete Qdisc kernel module just to experiment
>    some new queuing theory.
> 
> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>    is before enqueue, it is impossible to adjust those "tokens" after
>    packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>    be solved with a shared map between clsact and sch_bpf.
> 
> 3. Replace qevents, as now the user gains much more control over the
>    skb and queues.
> 
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>    chain and block to reuse the TC filters, but they are too complicated
>    to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>    TC filters on _any_ Qdisc (even on a different netdev) to do the
>    classification.
> 
> 5. Potentially pave a way for ingress to queue packets, although
>    current implementation is still only for egress.
> 
> I’ve combed through previous comments and appreciated the feedbacks.
> Some major changes in this RFC is the use of kptr to skb to maintain
> the validility of skb during its lifetime in the Qdisc, dropping rbtree
> maps, and the inclusion of two examples. 
> 
> Some questions for discussion:
> 
> 1. We now pass a trusted kptr of sk_buff to the program instead of
>    __sk_buff. This makes most helpers using __sk_buff incompatible
>    with eBPF qdisc. An alternative is to still use __sk_buff in the
>    context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
>    this can only be applied to enqueue program, since in dequeue program
>    skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
>    is no __sk_buff). Any suggestion for making skb kptr and helper
>    functions compatible?
> 
> 2. The current patchset uses netlink. Do we also want to use bpf_link
>    for attachment?

[..]

> 3. People have suggested struct_ops. We chose not to use struct_ops since
>    users might want to create multiple bpf qdiscs with different
>    implementations. Current struct_ops attachment model does not seem
>    to support replacing only functions of a specific instance of a module,
>    but I might be wrong.

I still feel like it deserves at least a try. Maybe we can find some potential
path where struct_ops can allow different implementations (Martin probably
has some ideas about that). I looked at the bpf qdisc itself and it doesn't
really have anything complicated (besides trying to play nicely with other
tc classes/actions, but I'm not sure how relevant that is).

With struct_ops you can also get your (2) addressed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
                     ` (2 preceding siblings ...)
  2024-01-20 13:47   ` kernel test robot
@ 2024-01-23 23:51   ` Martin KaFai Lau
  2024-01-24  5:22     ` Amery Hung
  3 siblings, 1 reply; 36+ messages in thread
From: Martin KaFai Lau @ 2024-01-23 23:51 UTC (permalink / raw
  To: Amery Hung
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev

On 1/17/24 1:56 PM, Amery Hung wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 0bb92414c036..df280bbb7c0d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -997,6 +997,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_SK_LOOKUP,
>   	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
>   	BPF_PROG_TYPE_NETFILTER,
> +	BPF_PROG_TYPE_QDISC,
>   };
>   
>   enum bpf_attach_type {
> @@ -1056,6 +1057,8 @@ enum bpf_attach_type {
>   	BPF_CGROUP_UNIX_GETSOCKNAME,
>   	BPF_NETKIT_PRIMARY,
>   	BPF_NETKIT_PEER,
> +	BPF_QDISC_ENQUEUE,
> +	BPF_QDISC_DEQUEUE,
>   	__MAX_BPF_ATTACH_TYPE
>   };
>   
> @@ -7357,4 +7360,22 @@ struct bpf_iter_num {
>   	__u64 __opaque[1];
>   } __attribute__((aligned(8)));
>   
> +struct bpf_qdisc_ctx {
> +	__bpf_md_ptr(struct sk_buff *, skb);
> +	__u32 classid;
> +	__u64 expire;
> +	__u64 delta_ns;
> +};
> +
> +enum {
> +	SCH_BPF_QUEUED,
> +	SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
> +	SCH_BPF_DROP,
> +	SCH_BPF_CN,
> +	SCH_BPF_THROTTLE,
> +	SCH_BPF_PASS,
> +	SCH_BPF_BYPASS,
> +	SCH_BPF_STOLEN,
> +};
> +
>   #endif /* _UAPI__LINUX_BPF_H__ */

[ ... ]

> +static bool tc_qdisc_is_valid_access(int off, int size,
> +				     enum bpf_access_type type,
> +				     const struct bpf_prog *prog,
> +				     struct bpf_insn_access_aux *info)
> +{
> +	struct btf *btf;
> +
> +	if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
> +		return false;
> +
> +	switch (off) {
> +	case offsetof(struct bpf_qdisc_ctx, skb):
> +		if (type == BPF_WRITE ||
> +		    size != sizeof_field(struct bpf_qdisc_ctx, skb))
> +			return false;
> +
> +		if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
> +			return false;
> +
> +		btf = bpf_get_btf_vmlinux();
> +		if (IS_ERR_OR_NULL(btf))
> +			return false;
> +
> +		info->btf = btf;
> +		info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
> +		info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> +		return true;
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, classid);
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, expire);
> +	case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
> +		return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
> +	default:
> +		return false;
> +	}
> +
> +	return false;
> +}
> +

[ ... ]

> +static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> +			   struct sk_buff **to_free)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	unsigned int len = qdisc_pkt_len(skb);
> +	struct bpf_qdisc_ctx ctx = {};
> +	int res = NET_XMIT_SUCCESS;
> +	struct sch_bpf_class *cl;
> +	struct bpf_prog *enqueue;
> +
> +	enqueue = rcu_dereference(q->enqueue_prog.prog);
> +	if (!enqueue)
> +		return NET_XMIT_DROP;
> +
> +	ctx.skb = skb;
> +	ctx.classid = sch->handle;
> +	res = bpf_prog_run(enqueue, &ctx);
> +	switch (res) {
> +	case SCH_BPF_THROTTLE:
> +		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> +		qdisc_qstats_overlimit(sch);
> +		fallthrough;
> +	case SCH_BPF_QUEUED:
> +		qdisc_qstats_backlog_inc(sch, skb);
> +		return NET_XMIT_SUCCESS;
> +	case SCH_BPF_BYPASS:
> +		qdisc_qstats_drop(sch);
> +		__qdisc_drop(skb, to_free);
> +		return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> +	case SCH_BPF_STOLEN:
> +		__qdisc_drop(skb, to_free);
> +		return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> +	case SCH_BPF_CN:
> +		return NET_XMIT_CN;
> +	case SCH_BPF_PASS:
> +		break;
> +	default:
> +		return qdisc_drop(skb, sch, to_free);
> +	}
> +
> +	cl = sch_bpf_find(sch, ctx.classid);
> +	if (!cl || !cl->qdisc)
> +		return qdisc_drop(skb, sch, to_free);
> +
> +	res = qdisc_enqueue(skb, cl->qdisc, to_free);
> +	if (res != NET_XMIT_SUCCESS) {
> +		if (net_xmit_drop_count(res)) {
> +			qdisc_qstats_drop(sch);
> +			cl->drops++;
> +		}
> +		return res;
> +	}
> +
> +	sch->qstats.backlog += len;
> +	sch->q.qlen++;
> +	return res;
> +}
> +
> +DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
> +
> +static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	struct bpf_qdisc_ctx ctx = {};
> +	struct sk_buff *skb = NULL;
> +	struct bpf_prog *dequeue;
> +	struct sch_bpf_class *cl;
> +	int res;
> +
> +	dequeue = rcu_dereference(q->dequeue_prog.prog);
> +	if (!dequeue)
> +		return NULL;
> +
> +	__this_cpu_write(bpf_skb_dequeue, NULL);
> +	ctx.classid = sch->handle;
> +	res = bpf_prog_run(dequeue, &ctx);
> +	switch (res) {
> +	case SCH_BPF_DEQUEUED:
> +		skb = __this_cpu_read(bpf_skb_dequeue);
> +		qdisc_bstats_update(sch, skb);
> +		qdisc_qstats_backlog_dec(sch, skb);
> +		break;
> +	case SCH_BPF_THROTTLE:
> +		qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> +		qdisc_qstats_overlimit(sch);
> +		cl = sch_bpf_find(sch, ctx.classid);
> +		if (cl)
> +			cl->overlimits++;
> +		return NULL;
> +	case SCH_BPF_PASS:
> +		cl = sch_bpf_find(sch, ctx.classid);
> +		if (!cl || !cl->qdisc)
> +			return NULL;
> +		skb = qdisc_dequeue_peeked(cl->qdisc);
> +		if (skb) {
> +			bstats_update(&cl->bstats, skb);
> +			qdisc_bstats_update(sch, skb);
> +			qdisc_qstats_backlog_dec(sch, skb);
> +			sch->q.qlen--;
> +		}
> +		break;
> +	}
> +
> +	return skb;
> +}

[ ... ]

> +static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
> +			struct netlink_ext_ack *extack)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	int err;
> +
> +	qdisc_watchdog_init(&q->watchdog, sch);
> +	if (opt) {
> +		err = sch_bpf_change(sch, opt, extack);
> +		if (err)
> +			return err;
> +	}
> +
> +	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
> +	if (err)
> +		return err;
> +
> +	return qdisc_class_hash_init(&q->clhash);
> +}
> +
> +static void sch_bpf_reset(struct Qdisc *sch)
> +{
> +	struct bpf_sched_data *q = qdisc_priv(sch);
> +	struct sch_bpf_class *cl;
> +	unsigned int i;
> +
> +	for (i = 0; i < q->clhash.hashsize; i++) {
> +		hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
> +			if (cl->qdisc)
> +				qdisc_reset(cl->qdisc);
> +		}
> +	}
> +
> +	qdisc_watchdog_cancel(&q->watchdog);
> +}
> +

[ ... ]

> +static const struct Qdisc_class_ops sch_bpf_class_ops = {
> +	.graft		=	sch_bpf_graft,
> +	.leaf		=	sch_bpf_leaf,
> +	.find		=	sch_bpf_search,
> +	.change		=	sch_bpf_change_class,
> +	.delete		=	sch_bpf_delete,
> +	.tcf_block	=	sch_bpf_tcf_block,
> +	.bind_tcf	=	sch_bpf_bind,
> +	.unbind_tcf	=	sch_bpf_unbind,
> +	.dump		=	sch_bpf_dump_class,
> +	.dump_stats	=	sch_bpf_dump_class_stats,
> +	.walk		=	sch_bpf_walk,
> +};
> +
> +static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
> +	.cl_ops		=	&sch_bpf_class_ops,
> +	.id		=	"bpf",
> +	.priv_size	=	sizeof(struct bpf_sched_data),
> +	.enqueue	=	sch_bpf_enqueue,
> +	.dequeue	=	sch_bpf_dequeue,

I looked at the high level of the patchset. The major ops that it wants to be 
programmable in bpf are ".enqueue" and ".dequeue" (plus ".init" and ".reset" in 
patch 4 and patch 5).

This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for 
".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the 
uapi. It is no longer an acceptable way to add a new bpf extension.

Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented 
in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops 
can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
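
For illustration, a rough sketch of what such a registration could look like,
modeled on net/ipv4/bpf_tcp_ca.c (all of the bpf_qdisc_* names here are
assumptions; nothing like this exists in the patchset):

	struct bpf_struct_ops bpf_Qdisc_ops = {
		.verifier_ops	= &bpf_qdisc_verifier_ops,
		.init		= bpf_qdisc_init,	/* resolve btf ids */
		.check_member	= bpf_qdisc_check_member,
		.init_member	= bpf_qdisc_init_member,
		.reg		= bpf_qdisc_reg,	/* register_qdisc() */
		.unreg		= bpf_qdisc_unreg,	/* unregister_qdisc() */
		.name		= "Qdisc_ops",
	};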

If other ops (like ".dump", ".dump_stats"...) do not have a good use case to be 
programmable in bpf, they can stay with the kernel implementation for now and only 
allow the userspace to load a bpf Qdisc_ops with .enqueue/dequeue/init/reset 
implemented.

You mentioned in the cover letter that:
"Current struct_ops attachment model does not seem to support replacing only 
functions of a specific instance of a module, but I might be wrong."

I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes, 
it can be done through the struct_ops's ".init_member". Take a look at 
bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for 
".dump" (for example) when loading the bpf Qdisc_ops.

> +	.peek		=	qdisc_peek_dequeued,
> +	.init		=	sch_bpf_init,
> +	.reset		=	sch_bpf_reset,
> +	.destroy	=	sch_bpf_destroy,
> +	.change		=	sch_bpf_change,
> +	.dump		=	sch_bpf_dump,
> +	.dump_stats	=	sch_bpf_dump_stats,
> +	.owner		=	THIS_MODULE,
> +};
> +
> +static int __init sch_bpf_mod_init(void)
> +{
> +	return register_qdisc(&sch_bpf_qdisc_ops);
> +}
> +
> +static void __exit sch_bpf_mod_exit(void)
> +{
> +	unregister_qdisc(&sch_bpf_qdisc_ops);
> +}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-23 23:51   ` Martin KaFai Lau
@ 2024-01-24  5:22     ` Amery Hung
  2024-01-26  2:22       ` Martin KaFai Lau
  0 siblings, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-24  5:22 UTC (permalink / raw
  To: Martin KaFai Lau
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev

On Tue, Jan 23, 2024 at 3:51 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/17/24 1:56 PM, Amery Hung wrote:
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 0bb92414c036..df280bbb7c0d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -997,6 +997,7 @@ enum bpf_prog_type {
> >       BPF_PROG_TYPE_SK_LOOKUP,
> >       BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> >       BPF_PROG_TYPE_NETFILTER,
> > +     BPF_PROG_TYPE_QDISC,
> >   };
> >
> >   enum bpf_attach_type {
> > @@ -1056,6 +1057,8 @@ enum bpf_attach_type {
> >       BPF_CGROUP_UNIX_GETSOCKNAME,
> >       BPF_NETKIT_PRIMARY,
> >       BPF_NETKIT_PEER,
> > +     BPF_QDISC_ENQUEUE,
> > +     BPF_QDISC_DEQUEUE,
> >       __MAX_BPF_ATTACH_TYPE
> >   };
> >
> > @@ -7357,4 +7360,22 @@ struct bpf_iter_num {
> >       __u64 __opaque[1];
> >   } __attribute__((aligned(8)));
> >
> > +struct bpf_qdisc_ctx {
> > +     __bpf_md_ptr(struct sk_buff *, skb);
> > +     __u32 classid;
> > +     __u64 expire;
> > +     __u64 delta_ns;
> > +};
> > +
> > +enum {
> > +     SCH_BPF_QUEUED,
> > +     SCH_BPF_DEQUEUED = SCH_BPF_QUEUED,
> > +     SCH_BPF_DROP,
> > +     SCH_BPF_CN,
> > +     SCH_BPF_THROTTLE,
> > +     SCH_BPF_PASS,
> > +     SCH_BPF_BYPASS,
> > +     SCH_BPF_STOLEN,
> > +};
> > +
> >   #endif /* _UAPI__LINUX_BPF_H__ */
>
> [ ... ]
>
> > +static bool tc_qdisc_is_valid_access(int off, int size,
> > +                                  enum bpf_access_type type,
> > +                                  const struct bpf_prog *prog,
> > +                                  struct bpf_insn_access_aux *info)
> > +{
> > +     struct btf *btf;
> > +
> > +     if (off < 0 || off >= sizeof(struct bpf_qdisc_ctx))
> > +             return false;
> > +
> > +     switch (off) {
> > +     case offsetof(struct bpf_qdisc_ctx, skb):
> > +             if (type == BPF_WRITE ||
> > +                 size != sizeof_field(struct bpf_qdisc_ctx, skb))
> > +                     return false;
> > +
> > +             if (prog->expected_attach_type != BPF_QDISC_ENQUEUE)
> > +                     return false;
> > +
> > +             btf = bpf_get_btf_vmlinux();
> > +             if (IS_ERR_OR_NULL(btf))
> > +                     return false;
> > +
> > +             info->btf = btf;
> > +             info->btf_id = tc_qdisc_ctx_access_btf_ids[0];
> > +             info->reg_type = PTR_TO_BTF_ID | PTR_TRUSTED;
> > +             return true;
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, classid):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, classid);
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, expire):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, expire);
> > +     case bpf_ctx_range(struct bpf_qdisc_ctx, delta_ns):
> > +             return size == sizeof_field(struct bpf_qdisc_ctx, delta_ns);
> > +     default:
> > +             return false;
> > +     }
> > +
> > +     return false;
> > +}
> > +
>
> [ ... ]
>
> > +static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
> > +                        struct sk_buff **to_free)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     unsigned int len = qdisc_pkt_len(skb);
> > +     struct bpf_qdisc_ctx ctx = {};
> > +     int res = NET_XMIT_SUCCESS;
> > +     struct sch_bpf_class *cl;
> > +     struct bpf_prog *enqueue;
> > +
> > +     enqueue = rcu_dereference(q->enqueue_prog.prog);
> > +     if (!enqueue)
> > +             return NET_XMIT_DROP;
> > +
> > +     ctx.skb = skb;
> > +     ctx.classid = sch->handle;
> > +     res = bpf_prog_run(enqueue, &ctx);
> > +     switch (res) {
> > +     case SCH_BPF_THROTTLE:
> > +             qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> > +             qdisc_qstats_overlimit(sch);
> > +             fallthrough;
> > +     case SCH_BPF_QUEUED:
> > +             qdisc_qstats_backlog_inc(sch, skb);
> > +             return NET_XMIT_SUCCESS;
> > +     case SCH_BPF_BYPASS:
> > +             qdisc_qstats_drop(sch);
> > +             __qdisc_drop(skb, to_free);
> > +             return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
> > +     case SCH_BPF_STOLEN:
> > +             __qdisc_drop(skb, to_free);
> > +             return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
> > +     case SCH_BPF_CN:
> > +             return NET_XMIT_CN;
> > +     case SCH_BPF_PASS:
> > +             break;
> > +     default:
> > +             return qdisc_drop(skb, sch, to_free);
> > +     }
> > +
> > +     cl = sch_bpf_find(sch, ctx.classid);
> > +     if (!cl || !cl->qdisc)
> > +             return qdisc_drop(skb, sch, to_free);
> > +
> > +     res = qdisc_enqueue(skb, cl->qdisc, to_free);
> > +     if (res != NET_XMIT_SUCCESS) {
> > +             if (net_xmit_drop_count(res)) {
> > +                     qdisc_qstats_drop(sch);
> > +                     cl->drops++;
> > +             }
> > +             return res;
> > +     }
> > +
> > +     sch->qstats.backlog += len;
> > +     sch->q.qlen++;
> > +     return res;
> > +}
> > +
> > +DEFINE_PER_CPU(struct sk_buff*, bpf_skb_dequeue);
> > +
> > +static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     struct bpf_qdisc_ctx ctx = {};
> > +     struct sk_buff *skb = NULL;
> > +     struct bpf_prog *dequeue;
> > +     struct sch_bpf_class *cl;
> > +     int res;
> > +
> > +     dequeue = rcu_dereference(q->dequeue_prog.prog);
> > +     if (!dequeue)
> > +             return NULL;
> > +
> > +     __this_cpu_write(bpf_skb_dequeue, NULL);
> > +     ctx.classid = sch->handle;
> > +     res = bpf_prog_run(dequeue, &ctx);
> > +     switch (res) {
> > +     case SCH_BPF_DEQUEUED:
> > +             skb = __this_cpu_read(bpf_skb_dequeue);
> > +             qdisc_bstats_update(sch, skb);
> > +             qdisc_qstats_backlog_dec(sch, skb);
> > +             break;
> > +     case SCH_BPF_THROTTLE:
> > +             qdisc_watchdog_schedule_range_ns(&q->watchdog, ctx.expire, ctx.delta_ns);
> > +             qdisc_qstats_overlimit(sch);
> > +             cl = sch_bpf_find(sch, ctx.classid);
> > +             if (cl)
> > +                     cl->overlimits++;
> > +             return NULL;
> > +     case SCH_BPF_PASS:
> > +             cl = sch_bpf_find(sch, ctx.classid);
> > +             if (!cl || !cl->qdisc)
> > +                     return NULL;
> > +             skb = qdisc_dequeue_peeked(cl->qdisc);
> > +             if (skb) {
> > +                     bstats_update(&cl->bstats, skb);
> > +                     qdisc_bstats_update(sch, skb);
> > +                     qdisc_qstats_backlog_dec(sch, skb);
> > +                     sch->q.qlen--;
> > +             }
> > +             break;
> > +     }
> > +
> > +     return skb;
> > +}
>
> [ ... ]
>
> > +static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt,
> > +                     struct netlink_ext_ack *extack)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     int err;
> > +
> > +     qdisc_watchdog_init(&q->watchdog, sch);
> > +     if (opt) {
> > +             err = sch_bpf_change(sch, opt, extack);
> > +             if (err)
> > +                     return err;
> > +     }
> > +
> > +     err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
> > +     if (err)
> > +             return err;
> > +
> > +     return qdisc_class_hash_init(&q->clhash);
> > +}
> > +
> > +static void sch_bpf_reset(struct Qdisc *sch)
> > +{
> > +     struct bpf_sched_data *q = qdisc_priv(sch);
> > +     struct sch_bpf_class *cl;
> > +     unsigned int i;
> > +
> > +     for (i = 0; i < q->clhash.hashsize; i++) {
> > +             hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) {
> > +                     if (cl->qdisc)
> > +                             qdisc_reset(cl->qdisc);
> > +             }
> > +     }
> > +
> > +     qdisc_watchdog_cancel(&q->watchdog);
> > +}
> > +
>
> [ ... ]
>
> > +static const struct Qdisc_class_ops sch_bpf_class_ops = {
> > +     .graft          =       sch_bpf_graft,
> > +     .leaf           =       sch_bpf_leaf,
> > +     .find           =       sch_bpf_search,
> > +     .change         =       sch_bpf_change_class,
> > +     .delete         =       sch_bpf_delete,
> > +     .tcf_block      =       sch_bpf_tcf_block,
> > +     .bind_tcf       =       sch_bpf_bind,
> > +     .unbind_tcf     =       sch_bpf_unbind,
> > +     .dump           =       sch_bpf_dump_class,
> > +     .dump_stats     =       sch_bpf_dump_class_stats,
> > +     .walk           =       sch_bpf_walk,
> > +};
> > +
> > +static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = {
> > +     .cl_ops         =       &sch_bpf_class_ops,
> > +     .id             =       "bpf",
> > +     .priv_size      =       sizeof(struct bpf_sched_data),
> > +     .enqueue        =       sch_bpf_enqueue,
> > +     .dequeue        =       sch_bpf_dequeue,
>
> I looked at the high level of the patchset. The major ops that it wants to be
> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
> patch 4 and patch 5).
>
> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
> uapi. It is no long an acceptable way to add new bpf extension.
>
> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>

Partially. If using struct_ops, I think we'll need another structure
like the following in the bpf qdisc, to be implemented with bpf via struct_ops:

struct bpf_qdisc_ops {
    int (*enqueue) (struct sk_buff *skb);
    void (*dequeue) (void);
    void (*init) (void);
    void (*reset) (void);
};

Then, Qdisc_ops will wrap around them to handle things that cannot be
implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
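
For example, a minimal sketch of such a wrapper for enqueue (the bpf_ops
field and how it is reached are assumptions, just to show the idea):

	static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				   struct sk_buff **to_free)
	{
		struct bpf_sched_data *q = qdisc_priv(sch);
		unsigned int len = qdisc_pkt_len(skb);

		/* delegate the queueing decision to the bpf-implemented op */
		if (q->bpf_ops->enqueue(skb) != NET_XMIT_SUCCESS)
			return qdisc_drop(skb, sch, to_free);

		/* bookkeeping that stays in the kernel wrapper */
		sch->qstats.backlog += len;
		sch->q.qlen++;
		return NET_XMIT_SUCCESS;
	}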

> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
> programmable in bpf, it can stay with the kernel implementation for now and only
> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
> implemented.
>
> You mentioned in the cover letter that:
> "Current struct_ops attachment model does not seem to support replacing only
> functions of a specific instance of a module, but I might be wrong."
>
> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
> it can be done through the struct_ops's ".init_member". Take a look at
> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
> ".dump" (for example) when loading the bpf Qdisc_ops.
>

I have no problem with partially replacing a struct, which, like you
mentioned, has been demonstrated by congestion control or sched_ext.
What I am not sure about is the ability to create multiple bpf qdiscs,
where each has a different ".enqueue", ".dequeue", and so on. I like the
struct_ops approach and would love to try it if struct_ops supports
this.

Thanks,
Amery



> > +     .peek           =       qdisc_peek_dequeued,
> > +     .init           =       sch_bpf_init,
> > +     .reset          =       sch_bpf_reset,
> > +     .destroy        =       sch_bpf_destroy,
> > +     .change         =       sch_bpf_change,
> > +     .dump           =       sch_bpf_dump,
> > +     .dump_stats     =       sch_bpf_dump_stats,
> > +     .owner          =       THIS_MODULE,
> > +};
> > +
> > +static int __init sch_bpf_mod_init(void)
> > +{
> > +     return register_qdisc(&sch_bpf_qdisc_ops);
> > +}
> > +
> > +static void __exit sch_bpf_mod_exit(void)
> > +{
> > +     unregister_qdisc(&sch_bpf_qdisc_ops);
> > +}
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
@ 2024-01-24 10:10   ` Daniel Borkmann
  2024-01-24 12:09   ` Jamal Hadi Salim
  1 sibling, 0 replies; 36+ messages in thread
From: Daniel Borkmann @ 2024-01-24 10:10 UTC (permalink / raw
  To: Stanislav Fomichev, Amery Hung
  Cc: netdev, bpf, yangpeihao, toke, jhs, jiri, xiyou.wangcong,
	yepeilin.cs

On 1/23/24 10:13 PM, Stanislav Fomichev wrote:
> On 01/17, Amery Hung wrote:
>> Hi,
>>
>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
>> RFC. The followings are some use cases of eBPF Qdisc:
>>
>> 1. Allow customizing Qdiscs in an easier way. So that people don't
>>     have to write a complete Qdisc kernel module just to experiment
>>     some new queuing theory.
>>
>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>>     is before enqueue, it is impossible to adjust those "tokens" after
>>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>>     be solved with a shared map between clsact and sch_bpf.
>>
>> 3. Replace qevents, as now the user gains much more control over the
>>     skb and queues.
>>
>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>>     chain and block to reuse the TC filters, but they are too complicated
>>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>>     classification.
>>
>> 5. Potentially pave a way for ingress to queue packets, although
>>     current implementation is still only for egress.
>>
>> I’ve combed through previous comments and appreciated the feedbacks.
>> Some major changes in this RFC is the use of kptr to skb to maintain
>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
>> maps, and the inclusion of two examples.
>>
>> Some questions for discussion:
>>
>> 1. We now pass a trusted kptr of sk_buff to the program instead of
>>     __sk_buff. This makes most helpers using __sk_buff incompatible
>>     with eBPF qdisc. An alternative is to still use __sk_buff in the
>>     context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
>>     this can only be applied to enqueue program, since in dequeue program
>>     skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
>>     is no __sk_buff). Any suggestion for making skb kptr and helper
>>     functions compatible?
>>
>> 2. The current patchset uses netlink. Do we also want to use bpf_link
>>     for attachment?
> 
> [..]
> 
>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>     users might want to create multiple bpf qdiscs with different
>>     implementations. Current struct_ops attachment model does not seem
>>     to support replacing only functions of a specific instance of a module,
>>     but I might be wrong.
> 
> I still feel like it deserves at leasta try. Maybe we can find some potential
> path where struct_ops can allow different implementations (Martin probably
> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> really have anything complicated (besides trying to play nicely with other
> tc classes/actions, but I'm not sure how relevant that is).

Plus it's also not used in the two sample implementations, given you can
implement this as part of the enqueue operation in bpf. It would make sense
to drop the kfunc from the set. One other note: the BPF samples have been
bitrotting for quite a while; please convert this into a proper BPF selftest
so that BPF CI can run it.

> With struct_ops you can also get your (2) addressed.

+1

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc
  2024-01-17 21:56 ` [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Amery Hung
@ 2024-01-24 10:29   ` Daniel Borkmann
  2024-01-26 19:49     ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2024-01-24 10:29 UTC (permalink / raw
  To: Amery Hung, netdev
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

On 1/17/24 10:56 PM, Amery Hung wrote:
> tc_sch_fq.bpf.c
> A simple bpf fair queueing (fq) qdisc that gives each flow an equal chance
> to transmit data. The qdisc respects the timestamp in a skb set by a
> clsact rate limiter. It can also inform the rate limiter about packet drops
> when enabled to adjust timestamps. The implementation does not prevent hash
> collision of flows nor does it recycle flows.
> 
> tc_sch_fq.c
> A user space program to load and attach the eBPF-based fq qdisc, which
> by default adds the bpf fq to the loopback device, but can also add it to another
> dev and class with the '-d' and '-p' options.
> 
> To test the bpf fq qdisc with the EDT rate limiter:
> $ tc qdisc add dev lo clsact
> $ tc filter add dev lo egress bpf obj tc_clsact_edt.bpf.o sec classifier
> $ ./tc_sch_fq -s

Would be nice if you also included a performance comparison (did you do
production tests with it?) side-by-side with native fq, and if you see
a delta, elaborate on what would be needed to address it.

> Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> ---
>   samples/bpf/Makefile            |   8 +-
>   samples/bpf/bpf_experimental.h  | 134 +++++++
>   samples/bpf/tc_clsact_edt.bpf.c | 103 +++++
>   samples/bpf/tc_sch_fq.bpf.c     | 666 ++++++++++++++++++++++++++++++++
>   samples/bpf/tc_sch_fq.c         | 321 +++++++++++++++
>   5 files changed, 1231 insertions(+), 1 deletion(-)
>   create mode 100644 samples/bpf/bpf_experimental.h
>   create mode 100644 samples/bpf/tc_clsact_edt.bpf.c
>   create mode 100644 samples/bpf/tc_sch_fq.bpf.c
>   create mode 100644 samples/bpf/tc_sch_fq.c

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
  2024-01-24 10:10   ` Daniel Borkmann
@ 2024-01-24 12:09   ` Jamal Hadi Salim
  2024-01-24 13:07     ` Daniel Borkmann
  1 sibling, 1 reply; 36+ messages in thread
From: Jamal Hadi Salim @ 2024-01-24 12:09 UTC (permalink / raw
  To: Stanislav Fomichev
  Cc: Amery Hung, netdev, bpf, yangpeihao, toke, jiri, xiyou.wangcong,
	yepeilin.cs

On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 01/17, Amery Hung wrote:
> > Hi,
> >
> > I am continuing the work of ebpf-based Qdisc based on Cong’s previous
> > RFC. The followings are some use cases of eBPF Qdisc:
> >
> > 1. Allow customizing Qdiscs in an easier way. So that people don't
> >    have to write a complete Qdisc kernel module just to experiment
> >    some new queuing theory.
> >
> > 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
> >    is before enqueue, it is impossible to adjust those "tokens" after
> >    packets get dropped in enqueue. With eBPF Qdisc, it is easy to
> >    be solved with a shared map between clsact and sch_bpf.
> >
> > 3. Replace qevents, as now the user gains much more control over the
> >    skb and queues.
> >
> > 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> >    chain and block to reuse the TC filters, but they are too complicated
> >    to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
> >    TC filters on _any_ Qdisc (even on a different netdev) to do the
> >    classification.
> >
> > 5. Potentially pave a way for ingress to queue packets, although
> >    current implementation is still only for egress.
> >
> > I’ve combed through previous comments and appreciated the feedbacks.
> > Some major changes in this RFC is the use of kptr to skb to maintain
> > the validility of skb during its lifetime in the Qdisc, dropping rbtree
> > maps, and the inclusion of two examples.
> >
> > Some questions for discussion:
> >
> > 1. We now pass a trusted kptr of sk_buff to the program instead of
> >    __sk_buff. This makes most helpers using __sk_buff incompatible
> >    with eBPF qdisc. An alternative is to still use __sk_buff in the
> >    context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
> >    this can only be applied to enqueue program, since in dequeue program
> >    skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
> >    is no __sk_buff). Any suggestion for making skb kptr and helper
> >    functions compatible?
> >
> > 2. The current patchset uses netlink. Do we also want to use bpf_link
> >    for attachment?
>
> [..]
>
> > 3. People have suggested struct_ops. We chose not to use struct_ops since
> >    users might want to create multiple bpf qdiscs with different
> >    implementations. Current struct_ops attachment model does not seem
> >    to support replacing only functions of a specific instance of a module,
> >    but I might be wrong.
>
> I still feel like it deserves at leasta try. Maybe we can find some potential
> path where struct_ops can allow different implementations (Martin probably
> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> really have anything complicated (besides trying to play nicely with other
> tc classes/actions, but I'm not sure how relevant that is).

Are you suggesting that it is a nuisance to integrate with the
existing infra? I would consider it to be a lot more than "trying to
play nicely". Besides, it's a kfunc and people will not be forced to
use it.

cheers,
jamal

> With struct_ops you can also get your (2) addressed.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24 12:09   ` Jamal Hadi Salim
@ 2024-01-24 13:07     ` Daniel Borkmann
  2024-01-24 14:11       ` Jamal Hadi Salim
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2024-01-24 13:07 UTC (permalink / raw
  To: Jamal Hadi Salim, Stanislav Fomichev
  Cc: Amery Hung, netdev, bpf, yangpeihao, toke, jiri, xiyou.wangcong,
	yepeilin.cs

On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>> On 01/17, Amery Hung wrote:
>>> Hi,
>>>
>>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
>>> RFC. The followings are some use cases of eBPF Qdisc:
>>>
>>> 1. Allow customizing Qdiscs in an easier way. So that people don't
>>>     have to write a complete Qdisc kernel module just to experiment
>>>     some new queuing theory.
>>>
>>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>>>     is before enqueue, it is impossible to adjust those "tokens" after
>>>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>>>     be solved with a shared map between clsact and sch_bpf.
>>>
>>> 3. Replace qevents, as now the user gains much more control over the
>>>     skb and queues.
>>>
>>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>>>     chain and block to reuse the TC filters, but they are too complicated
>>>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>>>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>>>     classification.
>>>
>>> 5. Potentially pave a way for ingress to queue packets, although
>>>     current implementation is still only for egress.
>>>
>>> I’ve combed through previous comments and appreciated the feedbacks.
>>> Some major changes in this RFC is the use of kptr to skb to maintain
>>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
>>> maps, and the inclusion of two examples.
>>>
>>> Some questions for discussion:
>>>
>>> 1. We now pass a trusted kptr of sk_buff to the program instead of
>>>     __sk_buff. This makes most helpers using __sk_buff incompatible
>>>     with eBPF qdisc. An alternative is to still use __sk_buff in the
>>>     context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
>>>     this can only be applied to enqueue program, since in dequeue program
>>>     skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
>>>     is no __sk_buff). Any suggestion for making skb kptr and helper
>>>     functions compatible?
>>>
>>> 2. The current patchset uses netlink. Do we also want to use bpf_link
>>>     for attachment?
>>
>> [..]
>>
>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>>     users might want to create multiple bpf qdiscs with different
>>>     implementations. Current struct_ops attachment model does not seem
>>>     to support replacing only functions of a specific instance of a module,
>>>     but I might be wrong.
>>
>> I still feel like it deserves at leasta try. Maybe we can find some potential
>> path where struct_ops can allow different implementations (Martin probably
>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
>> really have anything complicated (besides trying to play nicely with other
>> tc classes/actions, but I'm not sure how relevant that is).
> 
> Are you suggesting that it is a nuisance to integrate with the
> existing infra? I would consider it being a lot more than "trying to
> play nicely". Besides, it's a kfunc and people will not be forced to
> use it.

What's the use case? If you already go that route to implement your own
qdisc, why is there a need to take the performance hit and go all the
way into old style cls/act infra when it can be done in a more
straightforward way natively? For the vast majority of cases this will be
some very lightweight classification anyway (if not outsourced to tc egress
given the lock). If there is a concrete production need, it could be
added; otherwise, if there is no immediate use case that cannot be solved
another way, I would not add unnecessary kfuncs.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24 13:07     ` Daniel Borkmann
@ 2024-01-24 14:11       ` Jamal Hadi Salim
  2024-01-24 15:26         ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Jamal Hadi Salim @ 2024-01-24 14:11 UTC (permalink / raw
  To: Daniel Borkmann
  Cc: Stanislav Fomichev, Amery Hung, netdev, bpf, yangpeihao, toke,
	jiri, xiyou.wangcong, yepeilin.cs

On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> > On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >> On 01/17, Amery Hung wrote:
> >>> Hi,
> >>>
> >>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
> >>> RFC. The followings are some use cases of eBPF Qdisc:
> >>>
> >>> 1. Allow customizing Qdiscs in an easier way. So that people don't
> >>>     have to write a complete Qdisc kernel module just to experiment
> >>>     some new queuing theory.
> >>>
> >>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
> >>>     is before enqueue, it is impossible to adjust those "tokens" after
> >>>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
> >>>     be solved with a shared map between clsact and sch_bpf.
> >>>
> >>> 3. Replace qevents, as now the user gains much more control over the
> >>>     skb and queues.
> >>>
> >>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> >>>     chain and block to reuse the TC filters, but they are too complicated
> >>>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
> >>>     TC filters on _any_ Qdisc (even on a different netdev) to do the
> >>>     classification.
> >>>
> >>> 5. Potentially pave a way for ingress to queue packets, although
> >>>     current implementation is still only for egress.
> >>>
> >>> I’ve combed through previous comments and appreciated the feedbacks.
> >>> Some major changes in this RFC is the use of kptr to skb to maintain
> >>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
> >>> maps, and the inclusion of two examples.
> >>>
> >>> Some questions for discussion:
> >>>
> >>> 1. We now pass a trusted kptr of sk_buff to the program instead of
> >>>     __sk_buff. This makes most helpers using __sk_buff incompatible
> >>>     with eBPF qdisc. An alternative is to still use __sk_buff in the
> >>>     context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
> >>>     this can only be applied to enqueue program, since in dequeue program
> >>>     skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
> >>>     is no __sk_buff). Any suggestion for making skb kptr and helper
> >>>     functions compatible?
> >>>
> >>> 2. The current patchset uses netlink. Do we also want to use bpf_link
> >>>     for attachment?
> >>
> >> [..]
> >>
> >>> 3. People have suggested struct_ops. We chose not to use struct_ops since
> >>>     users might want to create multiple bpf qdiscs with different
> >>>     implementations. Current struct_ops attachment model does not seem
> >>>     to support replacing only functions of a specific instance of a module,
> >>>     but I might be wrong.
> >>
> >> I still feel like it deserves at leasta try. Maybe we can find some potential
> >> path where struct_ops can allow different implementations (Martin probably
> >> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> >> really have anything complicated (besides trying to play nicely with other
> >> tc classes/actions, but I'm not sure how relevant that is).
> >
> > Are you suggesting that it is a nuisance to integrate with the
> > existing infra? I would consider it being a lot more than "trying to
> > play nicely". Besides, it's a kfunc and people will not be forced to
> > use it.
>
> What's the use case?

What's the use case for enabling existing infra to work? Sure, let's
rewrite everything from scratch in ebpf. And then introduce new
tooling that only well-funded companies have the resources to build
and manage. Open source is about choices and sharing, and this is
about choices and sharing.

> If you already go that route to implement your own
> qdisc, why is there a need to take the performane hit and go all the
> way into old style cls/act infra when it can be done in a more straight
> forward way natively?

Who is forcing you to use the kfunc? This is about choice.
What is ebpf these days anyway? Is it a) a programming environment, or
b) the only way to do things? I see it as available infra, i.e. #a,
not as an answer looking for a question. IOW, as something we can
use to build the infra we need, using kfuncs when it makes sense. Not
everybody has infinite resources to keep hacking things into ebpf.

> For the vast majority of cases this will be some
> very lightweight classification anyway (if not outsourced to tc egress
> given the lock). If there is a concrete production need, it could be
> added, otherwise if there is no immediate use case which cannot be solved
> otherwise I would not add unnecessary kfuncs.

"Unnecessary" is really your view.

cheers,
jamal

> Cheers,
> Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24 14:11       ` Jamal Hadi Salim
@ 2024-01-24 15:26         ` Daniel Borkmann
  2024-01-24 21:26           ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2024-01-24 15:26 UTC (permalink / raw
  To: Jamal Hadi Salim
  Cc: Stanislav Fomichev, Amery Hung, netdev, bpf, yangpeihao, toke,
	jiri, xiyou.wangcong, yepeilin.cs

On 1/24/24 3:11 PM, Jamal Hadi Salim wrote:
> On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
>>> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>>>> On 01/17, Amery Hung wrote:
>>>>> Hi,
>>>>>
>>>>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
>>>>> RFC. The followings are some use cases of eBPF Qdisc:
>>>>>
>>>>> 1. Allow customizing Qdiscs in an easier way. So that people don't
>>>>>      have to write a complete Qdisc kernel module just to experiment
>>>>>      some new queuing theory.
>>>>>
>>>>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>>>>>      is before enqueue, it is impossible to adjust those "tokens" after
>>>>>      packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>>>>>      be solved with a shared map between clsact and sch_bpf.
>>>>>
>>>>> 3. Replace qevents, as now the user gains much more control over the
>>>>>      skb and queues.
>>>>>
>>>>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>>>>>      chain and block to reuse the TC filters, but they are too complicated
>>>>>      to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>>>>>      TC filters on _any_ Qdisc (even on a different netdev) to do the
>>>>>      classification.
>>>>>
>>>>> 5. Potentially pave a way for ingress to queue packets, although
>>>>>      current implementation is still only for egress.
>>>>>
>>>>> I’ve combed through previous comments and appreciated the feedbacks.
>>>>> Some major changes in this RFC is the use of kptr to skb to maintain
>>>>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
>>>>> maps, and the inclusion of two examples.
>>>>>
>>>>> Some questions for discussion:
>>>>>
>>>>> 1. We now pass a trusted kptr of sk_buff to the program instead of
>>>>>      __sk_buff. This makes most helpers using __sk_buff incompatible
>>>>>      with eBPF qdisc. An alternative is to still use __sk_buff in the
>>>>>      context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
>>>>>      this can only be applied to enqueue program, since in dequeue program
>>>>>      skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
>>>>>      is no __sk_buff). Any suggestion for making skb kptr and helper
>>>>>      functions compatible?
>>>>>
>>>>> 2. The current patchset uses netlink. Do we also want to use bpf_link
>>>>>      for attachment?
>>>>
>>>> [..]
>>>>
>>>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>>>>      users might want to create multiple bpf qdiscs with different
>>>>>      implementations. Current struct_ops attachment model does not seem
>>>>>      to support replacing only functions of a specific instance of a module,
>>>>>      but I might be wrong.
>>>>
>>>> I still feel like it deserves at leasta try. Maybe we can find some potential
>>>> path where struct_ops can allow different implementations (Martin probably
>>>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
>>>> really have anything complicated (besides trying to play nicely with other
>>>> tc classes/actions, but I'm not sure how relevant that is).
>>>
>>> Are you suggesting that it is a nuisance to integrate with the
>>> existing infra? I would consider it being a lot more than "trying to
>>> play nicely". Besides, it's a kfunc and people will not be forced to
>>> use it.
>>
>> What's the use case?
> 
> What's the use case for enabling existing infra to work? Sure, let's
> rewrite everything from scratch in ebpf. And then introduce new
> tooling which well funded companies are capable of owning the right
> resources to build and manage. Open source is about choices and
> sharing and this is about choices and sharing.
> 
>> If you already go that route to implement your own
>> qdisc, why is there a need to take the performane hit and go all the
>> way into old style cls/act infra when it can be done in a more straight
>> forward way natively?
> 
> Who is forcing you to use the kfunc? This is about choice.
> What is ebpf these days anyways? Is it a) a programming environment or
> b) is it the only way to do things? I see it as available infra i.e #a
> not as the answer looking for a question.  IOW, as something we can
> use to build the infra we need and use kfuncs when it makes sense. Not
> everybody has infinite resources to keep hacking things into ebpf.
> 
>> For the vast majority of cases this will be some
>> very lightweight classification anyway (if not outsourced to tc egress
>> given the lock). If there is a concrete production need, it could be
>> added, otherwise if there is no immediate use case which cannot be solved
>> otherwise I would not add unnecessary kfuncs.
> 
> "Unnecessary" is really your view.

Looks like we're talking past each other? If there is no plan to use it
in production (I assume Amery would be able to answer?), why add it right
now to the initial series, only to figure out later on (worst case in a
few years), when the time comes, that the kfunc does not fit the actual
need? You've probably seen the life cycle doc (Documentation/bpf/kfuncs.rst):
while changes can be made, they should still be mindful of potential
breakage the longer a kfunc is out in the wild, hence my question whether
it is planned to be used, given it wasn't in the samples.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24 15:26         ` Daniel Borkmann
@ 2024-01-24 21:26           ` Amery Hung
  2024-01-25 11:57             ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-24 21:26 UTC (permalink / raw
  To: Daniel Borkmann
  Cc: Jamal Hadi Salim, Stanislav Fomichev, netdev, bpf, yangpeihao,
	toke, jiri, xiyou.wangcong, yepeilin.cs

On Wed, Jan 24, 2024 at 7:27 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/24/24 3:11 PM, Jamal Hadi Salim wrote:
> > On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
> >>> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
> >>>> On 01/17, Amery Hung wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
> >>>>> RFC. The followings are some use cases of eBPF Qdisc:
> >>>>>
> >>>>> 1. Allow customizing Qdiscs in an easier way. So that people don't
> >>>>>      have to write a complete Qdisc kernel module just to experiment
> >>>>>      some new queuing theory.
> >>>>>
> >>>>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
> >>>>>      is before enqueue, it is impossible to adjust those "tokens" after
> >>>>>      packets get dropped in enqueue. With eBPF Qdisc, it is easy to
> >>>>>      be solved with a shared map between clsact and sch_bpf.
> >>>>>
> >>>>> 3. Replace qevents, as now the user gains much more control over the
> >>>>>      skb and queues.
> >>>>>
> >>>>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> >>>>>      chain and block to reuse the TC filters, but they are too complicated
> >>>>>      to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
> >>>>>      TC filters on _any_ Qdisc (even on a different netdev) to do the
> >>>>>      classification.
> >>>>>
> >>>>> 5. Potentially pave a way for ingress to queue packets, although
> >>>>>      current implementation is still only for egress.
> >>>>>
> >>>>> I’ve combed through previous comments and appreciated the feedbacks.
> >>>>> Some major changes in this RFC is the use of kptr to skb to maintain
> >>>>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
> >>>>> maps, and the inclusion of two examples.
> >>>>>
> >>>>> Some questions for discussion:
> >>>>>
> >>>>> 1. We now pass a trusted kptr of sk_buff to the program instead of
> >>>>>      __sk_buff. This makes most helpers using __sk_buff incompatible
> >>>>>      with eBPF qdisc. An alternative is to still use __sk_buff in the
> >>>>>      context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
> >>>>>      this can only be applied to enqueue program, since in dequeue program
> >>>>>      skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
> >>>>>      is no __sk_buff). Any suggestion for making skb kptr and helper
> >>>>>      functions compatible?
> >>>>>
> >>>>> 2. The current patchset uses netlink. Do we also want to use bpf_link
> >>>>>      for attachment?
> >>>>
> >>>> [..]
> >>>>
> >>>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
> >>>>>      users might want to create multiple bpf qdiscs with different
> >>>>>      implementations. Current struct_ops attachment model does not seem
> >>>>>      to support replacing only functions of a specific instance of a module,
> >>>>>      but I might be wrong.
> >>>>
> >>>> I still feel like it deserves at leasta try. Maybe we can find some potential
> >>>> path where struct_ops can allow different implementations (Martin probably
> >>>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
> >>>> really have anything complicated (besides trying to play nicely with other
> >>>> tc classes/actions, but I'm not sure how relevant that is).
> >>>
> >>> Are you suggesting that it is a nuisance to integrate with the
> >>> existing infra? I would consider it being a lot more than "trying to
> >>> play nicely". Besides, it's a kfunc and people will not be forced to
> >>> use it.
> >>
> >> What's the use case?
> >
> > What's the use case for enabling existing infra to work? Sure, let's
> > rewrite everything from scratch in ebpf. And then introduce new
> > tooling which well funded companies are capable of owning the right
> > resources to build and manage. Open source is about choices and
> > sharing and this is about choices and sharing.
> >
> >> If you already go that route to implement your own
> >> qdisc, why is there a need to take the performane hit and go all the
> >> way into old style cls/act infra when it can be done in a more straight
> >> forward way natively?
> >
> > Who is forcing you to use the kfunc? This is about choice.
> > What is ebpf these days anyways? Is it a) a programming environment or
> > b) is it the only way to do things? I see it as available infra i.e #a
> > not as the answer looking for a question.  IOW, as something we can
> > use to build the infra we need and use kfuncs when it makes sense. Not
> > everybody has infinite resources to keep hacking things into ebpf.
> >
> >> For the vast majority of cases this will be some
> >> very lightweight classification anyway (if not outsourced to tc egress
> >> given the lock). If there is a concrete production need, it could be
> >> added, otherwise if there is no immediate use case which cannot be solved
> >> otherwise I would not add unnecessary kfuncs.
> >
> > "Unnecessary" is really your view.
>
> Looks like we're talking past each other? If there is no plan to use it
> in production (I assume Amery would be able to answer?), why add it right
> now to the initial series, only to figure out later on (worst case in
> few years) when the time comes that the kfunc does not fit the actual
> need? You've probably seen the life cycle doc (Documentation/bpf/kfuncs.rst)
> and while changes can be made, they should still be mindful about potential
> breakages the longer it's out in the wild, hence my question if it's
> planning to be used given it wasn't in the samples.
>

We would like to reuse existing TC filters. Like Jamal says, changing
filter rules in production can be done easily with existing tooling.
Besides, when users are only interested in exploring scheduling
algorithms, not in classifying traffic, they don't need to replicate
the filters again in bpf. I can add bpf_skb_tc_classify() test cases in
the next series if that helps.

Thanks,
Amery

> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24 21:26           ` Amery Hung
@ 2024-01-25 11:57             ` Daniel Borkmann
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Borkmann @ 2024-01-25 11:57 UTC (permalink / raw
  To: Amery Hung
  Cc: Jamal Hadi Salim, Stanislav Fomichev, netdev, bpf, yangpeihao,
	toke, jiri, xiyou.wangcong, yepeilin.cs

On 1/24/24 10:26 PM, Amery Hung wrote:
> On Wed, Jan 24, 2024 at 7:27 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/24/24 3:11 PM, Jamal Hadi Salim wrote:
>>> On Wed, Jan 24, 2024 at 8:08 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 1/24/24 1:09 PM, Jamal Hadi Salim wrote:
>>>>> On Tue, Jan 23, 2024 at 4:13 PM Stanislav Fomichev <sdf@google.com> wrote:
>>>>>> On 01/17, Amery Hung wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am continuing the work of ebpf-based Qdisc based on Cong’s previous
>>>>>>> RFC. The followings are some use cases of eBPF Qdisc:
>>>>>>>
>>>>>>> 1. Allow customizing Qdiscs in an easier way. So that people don't
>>>>>>>       have to write a complete Qdisc kernel module just to experiment
>>>>>>>       some new queuing theory.
>>>>>>>
>>>>>>> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>>>>>>>       is before enqueue, it is impossible to adjust those "tokens" after
>>>>>>>       packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>>>>>>>       be solved with a shared map between clsact and sch_bpf.
>>>>>>>
>>>>>>> 3. Replace qevents, as now the user gains much more control over the
>>>>>>>       skb and queues.
>>>>>>>
>>>>>>> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>>>>>>>       chain and block to reuse the TC filters, but they are too complicated
>>>>>>>       to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>>>>>>>       TC filters on _any_ Qdisc (even on a different netdev) to do the
>>>>>>>       classification.
>>>>>>>
>>>>>>> 5. Potentially pave a way for ingress to queue packets, although
>>>>>>>       current implementation is still only for egress.
>>>>>>>
>>>>>>> I’ve combed through previous comments and appreciated the feedbacks.
>>>>>>> Some major changes in this RFC is the use of kptr to skb to maintain
>>>>>>> the validility of skb during its lifetime in the Qdisc, dropping rbtree
>>>>>>> maps, and the inclusion of two examples.
>>>>>>>
>>>>>>> Some questions for discussion:
>>>>>>>
>>>>>>> 1. We now pass a trusted kptr of sk_buff to the program instead of
>>>>>>>       __sk_buff. This makes most helpers using __sk_buff incompatible
>>>>>>>       with eBPF qdisc. An alternative is to still use __sk_buff in the
>>>>>>>       context and use bpf_cast_to_kern_ctx() to acquire the kptr. However,
>>>>>>>       this can only be applied to enqueue program, since in dequeue program
>>>>>>>       skbs do not come from ctx but kptrs exchanged out of maps (i.e., there
>>>>>>>       is no __sk_buff). Any suggestion for making skb kptr and helper
>>>>>>>       functions compatible?
>>>>>>>
>>>>>>> 2. The current patchset uses netlink. Do we also want to use bpf_link
>>>>>>>       for attachment?
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> 3. People have suggested struct_ops. We chose not to use struct_ops since
>>>>>>>       users might want to create multiple bpf qdiscs with different
>>>>>>>       implementations. Current struct_ops attachment model does not seem
>>>>>>>       to support replacing only functions of a specific instance of a module,
>>>>>>>       but I might be wrong.
>>>>>>
>>>>>> I still feel like it deserves at leasta try. Maybe we can find some potential
>>>>>> path where struct_ops can allow different implementations (Martin probably
>>>>>> has some ideas about that). I looked at the bpf qdisc itself and it doesn't
>>>>>> really have anything complicated (besides trying to play nicely with other
>>>>>> tc classes/actions, but I'm not sure how relevant that is).
>>>>>
>>>>> Are you suggesting that it is a nuisance to integrate with the
>>>>> existing infra? I would consider it being a lot more than "trying to
>>>>> play nicely". Besides, it's a kfunc and people will not be forced to
>>>>> use it.
>>>>
>>>> What's the use case?
>>>
>>> What's the use case for enabling existing infra to work? Sure, let's
>>> rewrite everything from scratch in ebpf. And then introduce new
>>> tooling which well funded companies are capable of owning the right
>>> resources to build and manage. Open source is about choices and
>>> sharing and this is about choices and sharing.
>>>
>>>> If you already go that route to implement your own
>>>> qdisc, why is there a need to take the performane hit and go all the
>>>> way into old style cls/act infra when it can be done in a more straight
>>>> forward way natively?
>>>
>>> Who is forcing you to use the kfunc? This is about choice.
>>> What is ebpf these days anyways? Is it a) a programming environment or
>>> b) is it the only way to do things? I see it as available infra i.e #a
>>> not as the answer looking for a question.  IOW, as something we can
>>> use to build the infra we need and use kfuncs when it makes sense. Not
>>> everybody has infinite resources to keep hacking things into ebpf.
>>>
>>>> For the vast majority of cases this will be some
>>>> very lightweight classification anyway (if not outsourced to tc egress
>>>> given the lock). If there is a concrete production need, it could be
>>>> added, otherwise if there is no immediate use case which cannot be solved
>>>> otherwise I would not add unnecessary kfuncs.
>>>
>>> "Unnecessary" is really your view.
>>
>> Looks like we're talking past each other? If there is no plan to use it
>> in production (I assume Amery would be able to answer?), why add it right
>> now to the initial series, only to figure out later on (worst case in
>> few years) when the time comes that the kfunc does not fit the actual
>> need? You've probably seen the life cycle doc (Documentation/bpf/kfuncs.rst)
>> and while changes can be made, they should still be mindful about potential
>> breakages the longer it's out in the wild, hence my question if it's
>> planning to be used given it wasn't in the samples.
> 
> We would like to reuse existing TC filters. Like Jamal says, changing
> filter rules in production can be done easily with existing tooling.
> Besides, when the user is only interested in exploring scheduling
> algorithms but not classifying traffics, they don't need to replicate
> the filter again in bpf. I can add bpf_skb_tc_classify() test cases in
> the next series if that helps.

In that case, please add a BPF selftest for exercising the kfunc, and also
expand the commit description with the above rationale.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-24  5:22     ` Amery Hung
@ 2024-01-26  2:22       ` Martin KaFai Lau
  2024-01-27  1:17         ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Martin KaFai Lau @ 2024-01-26  2:22 UTC (permalink / raw
  To: Amery Hung
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev

On 1/23/24 9:22 PM, Amery Hung wrote:
>> I looked at the high level of the patchset. The major ops that it wants to be
>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
>> patch 4 and patch 5).
>>
>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
>> uapi. It is no long an acceptable way to add new bpf extension.
>>
>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
>> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>>
> Partially. If using struct_ops, I think we'll need another structure
> like the following in bpf qdisc to be implemented with struct_ops bpf:
> 
> struct bpf_qdisc_ops {
>      int (*enqueue) (struct sk_buff *skb)
>      void (*dequeue) (void)
>      void (*init) (void)
>      void (*reset) (void)
> };
> 
> Then, Qdisc_ops will wrap around them to handle things that cannot be
> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).

We can see how those limitations (calling sch_tree_lock() and returning a ptr) 
can be addressed in bpf. This will also help other similar use cases.

Other than sch_tree_lock and returning a ptr from a bpf prog, what else do you
see that blocks directly implementing the enqueue/dequeue/init/reset in the
struct Qdisc_ops?

Have you thought about ".priv_size"? It is now fixed to sizeof(struct
bpf_sched_data). Wouldn't it be useful to allow the bpf prog to store its own
data there?

> 
>> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
>> programmable in bpf, it can stay with the kernel implementation for now and only
>> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
>> implemented.
>>
>> You mentioned in the cover letter that:
>> "Current struct_ops attachment model does not seem to support replacing only
>> functions of a specific instance of a module, but I might be wrong."
>>
>> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
>> it can be done through the struct_ops's ".init_member". Take a look at
>> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
>> ".dump" (for example) when loading the bpf Qdisc_ops.
>>
> I have no problem with partially replacing a struct, which like you
> mentioned has been demonstrated by congestion control or sched_ext.
> What I am not sure about is the ability to create multiple bpf qdiscs,
> where each has different ".enqueue", ".dequeue", and so on. I like the
> struct_ops approach and would love to try it if struct_ops support
> this.

The need to allow different ".enqueue/.dequeue/..." bpf
(BPF_PROG_TYPE_QDISC) programs to be loaded into different qdisc instances
exists because there is only one native kernel ".id == bpf" Qdisc_ops
implementation, which in turn is because of the limitation you mentioned above?

Am I understanding your reason correctly for why different bpf progs need to be
loaded for different qdisc instances?

If the ".enqueue/.dequeue/..." in the "struct Qdisc_ops" can be directly 
implemented in bpf prog itself, it can just load another bpf struct_ops which 
has a different ".enqueue/.dequeue/..." implementation:

#> bpftool struct_ops register bpf_simple_fq_v1.bpf.o
#> bpftool struct_ops register bpf_simple_fq_v2.bpf.o
#> bpftool struct_ops register bpf_simple_fq_xyz.bpf.o
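
For what it is worth, a rough sketch of what the object behind one of those
commands could contain, by analogy with how tcp_congestion_ops is registered
today. Everything below is an assumption about future struct_ops support for
Qdisc_ops (including the assumed bpf_qdisc_drop() kfunc and the ops that are
omitted), not something that works with the current series:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* assumed kfunc that the series would need to provide */
extern int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
			  struct sk_buff **to_free) __ksym;

SEC("struct_ops/enqueue")
int BPF_PROG(fq_v1_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct sk_buff **to_free)
{
	/* placeholder body: a real fq would queue the skb here */
	return bpf_qdisc_drop(skb, sch, to_free);
}

SEC(".struct_ops")
struct Qdisc_ops bpf_simple_fq_v1 = {
	.enqueue = (void *)fq_v1_enqueue,
	/* .dequeue/.init/.reset omitted from this sketch; ops left unset
	 * would fall back to the kernel defaults via .init_member */
	.id	 = "bpf_fq_v1",
};

char LICENSE[] SEC("license") = "GPL";

Each registered object would then be a separate qdisc implementation with its
own enqueue/dequeue.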

From reading the test bpf prog, I think the set is on a good footing. Instead
of working around the limitation by wrapping the bpf prog in a predefined
"struct Qdisc_ops sch_bpf_qdisc_ops", let's first understand what is missing in
bpf and see how we could address it.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc
  2024-01-24 10:29   ` Daniel Borkmann
@ 2024-01-26 19:49     ` Amery Hung
  0 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-26 19:49 UTC (permalink / raw
  To: Daniel Borkmann
  Cc: netdev, bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs

On Wed, Jan 24, 2024 at 2:29 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/17/24 10:56 PM, Amery Hung wrote:
> > tc_sch_fq.bpf.c
> > A simple bpf fair queueing (fq) qdisc that gives each flow a euqal chance
> > to transmit data. The qdisc respects the timestamp in a skb set by an
> > clsact rate limiter. It can also inform the rate limiter about packet drop
> > when enabled to adjust timestamps. The implementation does not prevent hash
> > collision of flows nor does it recycle flows.
> >
> > tc_sch_fq.c
> > A user space program to load and attach the eBPF-based fq qdisc, which
> > by default add the bpf fq to the loopback device, but can also add to other
> > dev and class with '-d' and '-p' options.
> >
> > To test the bpf fq qdisc with the EDT rate limiter:
> > $ tc qdisc add dev lo clsact
> > $ tc filter add dev lo egress bpf obj tc_clsact_edt.bpf.o sec classifier
> > $ ./tc_sch_fq -s
>
> Would be nice if you also include a performance comparison (did you do
> production tests with it?) with side-by-side to native fq and if you see
> a delta elaborate on what would be needed to address it.

I did a simple test by adding an fq to the loopback device and then
sending single-stream traffic via iperf. The bpf implementation of
fq achieves 90% of the native fq's throughput.

I think the overhead mainly comes from allocating bpf objects (struct
skb_node) to store skb kptrs. This part can be removed if bpf
list/rbtree recognizes skb->list/rbnode. On the kfunc implementation
side, I think we can do it by saving struct bpf_rb_node_kern into
skb->rb_node and skb->cb. I haven't looked into the verifier to see
what needs to be done.
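
To make the overhead concrete, here is a minimal sketch of the per-packet
pattern being discussed, assuming the rbtree/kptr helpers from
samples/bpf/bpf_experimental.h in this series and the in-progress support for
exchanging kptrs into allocated objects. The names (skb_node, fq_root,
bpf_skb_release, ...) are illustrative, not the exact ones from
tc_sch_fq.bpf.c:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include "bpf_experimental.h"	/* bpf_obj_new, bpf_rbtree_add, __contains */

/* hypothetical release kfunc for an skb kptr exchanged back out */
extern void bpf_skb_release(struct sk_buff *skb) __ksym;

#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))

#ifndef container_of
#define container_of(ptr, type, member)				\
	((type *)((void *)(ptr) - offsetof(type, member)))
#endif

struct skb_node {
	u64 tstamp;
	struct sk_buff __kptr * skb;	/* referenced skb stashed in the node */
	struct bpf_rb_node rbnode;
};

private(FQ) struct bpf_spin_lock fq_lock;
private(FQ) struct bpf_rb_root fq_root __contains(skb_node, rbnode);

static bool skbn_tstamp_less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
	struct skb_node *na = container_of(a, struct skb_node, rbnode);
	struct skb_node *nb = container_of(b, struct skb_node, rbnode);

	return na->tstamp < nb->tstamp;
}

/* called from the enqueue program: one bpf_obj_new() per packet just to
 * carry the skb kptr into the rbtree */
static int fq_queue_skb(struct sk_buff *skb, u64 tstamp)
{
	struct skb_node *skbn;
	struct sk_buff *old;

	skbn = bpf_obj_new(typeof(*skbn));
	if (!skbn)
		return 1;	/* caller drops the skb */

	skbn->tstamp = tstamp;
	old = bpf_kptr_xchg(&skbn->skb, skb);
	if (old)		/* cannot happen for a fresh node */
		bpf_skb_release(old);

	bpf_spin_lock(&fq_lock);
	bpf_rbtree_add(&fq_root, &skbn->rbnode, skbn_tstamp_less);
	bpf_spin_unlock(&fq_lock);
	return 0;
}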

I will move the test cases from samples to selftests and include more
testing in the next patchset.

>
> > Signed-off-by: Amery Hung <amery.hung@bytedance.com>
> > ---
> >   samples/bpf/Makefile            |   8 +-
> >   samples/bpf/bpf_experimental.h  | 134 +++++++
> >   samples/bpf/tc_clsact_edt.bpf.c | 103 +++++
> >   samples/bpf/tc_sch_fq.bpf.c     | 666 ++++++++++++++++++++++++++++++++
> >   samples/bpf/tc_sch_fq.c         | 321 +++++++++++++++
> >   5 files changed, 1231 insertions(+), 1 deletion(-)
> >   create mode 100644 samples/bpf/bpf_experimental.h
> >   create mode 100644 samples/bpf/tc_clsact_edt.bpf.c
> >   create mode 100644 samples/bpf/tc_sch_fq.bpf.c
> >   create mode 100644 samples/bpf/tc_sch_fq.c

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-26  2:22       ` Martin KaFai Lau
@ 2024-01-27  1:17         ` Amery Hung
  2024-01-30  6:39           ` Martin KaFai Lau
  0 siblings, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-27  1:17 UTC (permalink / raw
  To: Martin KaFai Lau
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev

On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/23/24 9:22 PM, Amery Hung wrote:
> >> I looked at the high level of the patchset. The major ops that it wants to be
> >> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
> >> patch 4 and patch 5).
> >>
> >> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
> >> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
> >> uapi. It is no long an acceptable way to add new bpf extension.
> >>
> >> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
> >> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
> >> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
> >>
> > Partially. If using struct_ops, I think we'll need another structure
> > like the following in bpf qdisc to be implemented with struct_ops bpf:
> >
> > struct bpf_qdisc_ops {
> >      int (*enqueue) (struct sk_buff *skb)
> >      void (*dequeue) (void)
> >      void (*init) (void)
> >      void (*reset) (void)
> > };
> >
> > Then, Qdisc_ops will wrap around them to handle things that cannot be
> > implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>
> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
> can be addressed in bpf. This will also help other similar use cases.
>

For kptr, I wonder if we can support the following semantics in bpf if
they make sense:
1. Passing a referenced kptr into a bpf program, which will also need
to be released, or exchanged into maps or allocated objects.
2. Returning a kptr from a program and treating it as releasing the reference.

> Other than sch_tree_lock and returning a ptr from a bpf prog. What else do you
> see that blocks directly implementing the enqueue/dequeue/init/reset in the
> struct Qdisc_ops?
>

Not much! We can deal with sch_tree_lock later since
enqueue/dequeue/init/reset are unlikely to use it.

> Have you thought above ".priv_size"? It is now fixed to sizeof(struct
> bpf_sched_data). It should be useful to allow the bpf prog to store its own data
> there?
>

Maybe we can let bpf qdiscs store statistics here and make it work
with netlink. I haven't explored much of how bpf qdiscs should record and
share statistics with user space.

> >
> >> If other ops (like ".dump", ".dump_stats"...) do not have good use case to be
> >> programmable in bpf, it can stay with the kernel implementation for now and only
> >> allows the userspace to load the a bpf Qdisc_ops with .equeue/dequeue/init/reset
> >> implemented.
> >>
> >> You mentioned in the cover letter that:
> >> "Current struct_ops attachment model does not seem to support replacing only
> >> functions of a specific instance of a module, but I might be wrong."
> >>
> >> I assumed you meant allow bpf to replace only "some" ops of the Qdisc_ops? Yes,
> >> it can be done through the struct_ops's ".init_member". Take a look at
> >> bpf_tcp_ca_init_member. The kernel can assign the kernel implementation for
> >> ".dump" (for example) when loading the bpf Qdisc_ops.
> >>
> > I have no problem with partially replacing a struct, which like you
> > mentioned has been demonstrated by congestion control or sched_ext.
> > What I am not sure about is the ability to create multiple bpf qdiscs,
> > where each has different ".enqueue", ".dequeue", and so on. I like the
> > struct_ops approach and would love to try it if struct_ops support
> > this.
>
> The need for allowing different ".enqueue/.dequeue/..." bpf
> (BPF_PROG_TYPE_QDISC) programs loaded into different qdisc instances is because
> there is only one ".id == bpf" Qdisc_ops native kernel implementation which is
> then because of the limitation you mentioned above?
>
> Am I understanding your reason correctly on why it requires to load different
> bpf prog for different qdisc instances?
>
> If the ".enqueue/.dequeue/..." in the "struct Qdisc_ops" can be directly
> implemented in bpf prog itself, it can just load another bpf struct_ops which
> has a different ".enqueue/.dequeue/..." implementation:
>
> #> bpftool struct_ops register bpf_simple_fq_v1.bpf.o
> #> bpftool struct_ops register bpf_simple_fq_v2.bpf.o
> #> bpftool struct_ops register bpf_simple_fq_xyz.bpf.o
>
>  From reading the test bpf prog, I think the set is on a good footing. Instead
> of working around the limitation by wrapping the bpf prog in a predefined
> "struct Qdisc_ops sch_bpf_qdisc_ops", lets first understand what is missing in
> bpf and see how we could address them.
>

Thank you so much for the clarification. I had the wrong impression
since I was thinking about using a structure in the bpf qdisc for
struct_ops. It makes sense to try making "struct Qdisc_ops" work with
struct_ops. I will send the next patch set with struct_ops.

Thanks,
Amery

>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-27  1:17         ` Amery Hung
@ 2024-01-30  6:39           ` Martin KaFai Lau
  2024-01-30 17:49             ` Kui-Feng Lee
  2024-01-31 16:23             ` Amery Hung
  0 siblings, 2 replies; 36+ messages in thread
From: Martin KaFai Lau @ 2024-01-30  6:39 UTC (permalink / raw
  To: Amery Hung
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee

On 1/26/24 5:17 PM, Amery Hung wrote:
> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 1/23/24 9:22 PM, Amery Hung wrote:
>>>> I looked at the high level of the patchset. The major ops that it wants to be
>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and ".reset" in
>>>> patch 4 and patch 5).
>>>>
>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach types (each for
>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new "bpf_qdisc_ctx" in the
>>>> uapi. It is no long an acceptable way to add new bpf extension.
>>>>
>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely implemented
>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops for Qdisc_ops
>>>> can be created. The bpf Qdisc_ops can be loaded through the existing struct_ops api.
>>>>
>>> Partially. If using struct_ops, I think we'll need another structure
>>> like the following in bpf qdisc to be implemented with struct_ops bpf:
>>>
>>> struct bpf_qdisc_ops {
>>>       int (*enqueue) (struct sk_buff *skb)
>>>       void (*dequeue) (void)
>>>       void (*init) (void)
>>>       void (*reset) (void)
>>> };
>>>
>>> Then, Qdisc_ops will wrap around them to handle things that cannot be
>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>>
>> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
>> can be addressed in bpf. This will also help other similar use cases.
>>
> 
> For kptr, I wonder if we can support the following semantics in bpf if
> they make sense:

I think they are useful but they are not fully supported now.

Some thoughts below.

> 1. Passing a referenced kptr into a bpf program, which will also need
> to be released, or exchanged into maps or allocated objects.

"enqueue" should be the one considering here:

struct Qdisc_ops {
	/* ... */
	int                     (*enqueue)(struct sk_buff *skb,
					   struct Qdisc *sch,
					   struct sk_buff **to_free);

};

The verifier only marks the skb as a trusted kptr but does not mark its 
reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:

	if (prog_args_trusted(prog))
		info->reg_type |= PTR_TRUSTED;

The verifier does not know the skb ownership is passed into the ".enqueue" ops 
and does not know the bpf prog needs to release it or store it in a map.

The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just 
an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at 
acquire_reference_state() which is the useful one here.

Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can 
always acquire_reference_state() for the "struct sk_buff *skb" argument.

Take a look at a recent RFC: 
https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
which tags the arguments of an ops (e.g. ".enqueue" here). That RFC patch 
tags an argument that could be NULL by appending "__nullable" to the argument 
name. The verifier will enforce that the bpf prog must check for NULL first.

A similar idea can be used here but with a different tag (for example, 
"__must_release", admittedly not a good name). While the RFC patch is 
in progress, for now we may hardcode the ".enqueue" ops in 
check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This 
part can be adjusted later once the RFC patch is in shape.
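
Purely to illustrate the direction (neither the suffix nor this stub exist
anywhere today), the kernel-side stub whose parameter name carries the tag
could look something like:

static int bpf_qdisc_enqueue_stub(struct sk_buff *skb__must_release,
				  struct Qdisc *sch,
				  struct sk_buff **to_free)
{
	return 0;
}

The verifier would then acquire_reference_state() for any struct_ops prog
argument whose stub parameter name ends with the tag, similar to how the
"__nullable" suffix is handled in that RFC.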


Then one more thing is to track when the struct_ops bpf prog is actually reading 
the value of the skb pointer. One thing is worth mentioning here, e.g. a 
struct_ops prog for enqueue:

SEC("struct_ops")
int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct sk_buff **to_free)
{
	return bpf_qdisc_drop(skb, sch, to_free);
}

Take a look at the BPF_PROG macro: the bpf prog gets a pointer to an array 
of __u64 as its only argument. The skb is actually in ctx[0], sch is in 
ctx[1], etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]), 
btf_ctx_access() marks the reg_type as PTR_TRUSTED. It also needs to somehow 
initialize reg->ref_obj_id with the id obtained earlier from 
acquire_reference_state() during check_struct_ops_btf_id().
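
For reference, a simplified view of what BPF_PROG generates for the enqueue
example above (the real macro in bpf_tracing.h does this through a
___bpf_ctx_cast helper, so this is only an approximation):

int bpf_dropall_enqueue(unsigned long long *ctx)
{
	struct sk_buff *skb = (void *)(long)ctx[0];
	struct Qdisc *sch = (void *)(long)ctx[1];
	struct sk_buff **to_free = (void *)(long)ctx[2];

	/* the user-written body above becomes ____bpf_dropall_enqueue() */
	return ____bpf_dropall_enqueue(ctx, skb, sch, to_free);
}

So the reads of ctx[0] are exactly the points where the verifier would have to
attach the ref_obj_id mentioned above.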


> 2. Returning a kptr from a program and treating it as releasing the reference.

e.g. for dequeue:

struct Qdisc_ops {
	/* ... */
	struct sk_buff *        (*dequeue)(struct Qdisc *);
};


Right now the verifier should complain in check_reference_leak() if the 
struct_ops bpf prog returns a referenced kptr.

Unlike an argument, the return type of a function does not have a name to tag. 
This is the first case of a struct_ops bpf prog returning a pointer. One idea is 
to assume it must be a trusted pointer (PTR_TRUSTED) and have the verifier 
check that it indeed carries the PTR_TRUSTED flag.

Maybe release_reference_state() can be called to assume the kernel will release 
it, as long as the returned pointer is PTR_TRUSTED and its type matches the 
return type of the ops. Take a look at check_return_code().
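
For completeness, the kind of program this is about would look roughly like the
following. It is only a sketch: the verifier rejects it today, and
bpf_fifo_dequeue_head() is a made-up stand-in for whatever pops an skb kptr out
of the bpf-side queue:

SEC("struct_ops/dequeue")
struct sk_buff *BPF_PROG(bpf_fifo_dequeue, struct Qdisc *sch)
{
	struct sk_buff *skb;

	skb = bpf_fifo_dequeue_head(sch);	/* referenced kptr or NULL */

	/* returning the referenced kptr would hand ownership back to the
	 * kernel, which is what release_reference_state() would model */
	return skb;
}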

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-30  6:39           ` Martin KaFai Lau
@ 2024-01-30 17:49             ` Kui-Feng Lee
  2024-01-31  1:01               ` Martin KaFai Lau
  2024-01-31 16:23             ` Amery Hung
  1 sibling, 1 reply; 36+ messages in thread
From: Kui-Feng Lee @ 2024-01-30 17:49 UTC (permalink / raw
  To: Martin KaFai Lau, Amery Hung
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee



On 1/29/24 22:39, Martin KaFai Lau wrote:
> On 1/26/24 5:17 PM, Amery Hung wrote:
>> On Thu, Jan 25, 2024 at 6:22 PM Martin KaFai Lau 
>> <martin.lau@linux.dev> wrote:
>>>
>>> On 1/23/24 9:22 PM, Amery Hung wrote:
>>>>> I looked at the high level of the patchset. The major ops that it 
>>>>> wants to be
>>>>> programmable in bpf is the ".enqueue" and ".dequeue" (+ ".init" and 
>>>>> ".reset" in
>>>>> patch 4 and patch 5).
>>>>>
>>>>> This patch adds a new prog type BPF_PROG_TYPE_QDISC, four attach 
>>>>> types (each for
>>>>> ".enqueue", ".dequeue", ".init", and ".reset"), and a new 
>>>>> "bpf_qdisc_ctx" in the
>>>>> uapi. It is no long an acceptable way to add new bpf extension.
>>>>>
>>>>> Can the ".enqueue", ".dequeue", ".init", and ".reset" be completely 
>>>>> implemented
>>>>> in bpf (with the help of new kfuncs if needed)? Then a struct_ops 
>>>>> for Qdisc_ops
>>>>> can be created. The bpf Qdisc_ops can be loaded through the 
>>>>> existing struct_ops api.
>>>>>
>>>> Partially. If using struct_ops, I think we'll need another structure
>>>> like the following in bpf qdisc to be implemented with struct_ops bpf:
>>>>
>>>> struct bpf_qdisc_ops {
>>>>       int (*enqueue) (struct sk_buff *skb)
>>>>       void (*dequeue) (void)
>>>>       void (*init) (void)
>>>>       void (*reset) (void)
>>>> };
>>>>
>>>> Then, Qdisc_ops will wrap around them to handle things that cannot be
>>>> implemented with bpf (e.g., sch_tree_lock, returning a skb ptr).
>>>
>>> We can see how those limitations (calling sch_tree_lock() and 
>>> returning a ptr)
>>> can be addressed in bpf. This will also help other similar use cases.
>>>
>>
>> For kptr, I wonder if we can support the following semantics in bpf if
>> they make sense:
> 
> I think they are useful but they are not fully supported now.
> 
> Some thoughts below.
> 
>> 1. Passing a referenced kptr into a bpf program, which will also need
>> to be released, or exchanged into maps or allocated objects.
> 
> "enqueue" should be the one considering here:
> 
> struct Qdisc_ops {
>      /* ... */
>      int                     (*enqueue)(struct sk_buff *skb,
>                         struct Qdisc *sch,
>                         struct sk_buff **to_free);
> 
> };
> 
> The verifier only marks the skb as a trusted kptr but does not mark its 
> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
> 
>      if (prog_args_trusted(prog))
>          info->reg_type |= PTR_TRUSTED;
> 
> The verifier does not know the skb ownership is passed into the 
> ".enqueue" ops and does not know the bpf prog needs to release it or 
> store it in a map.
> 
> The verifier tracks the reference state when a KF_ACQUIRE kfunc is 
> called (just an example, not saying we need to use KF_ACQUIRE kfunc). 
> Take a look at acquire_reference_state() which is the useful one here.
> 
> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier 
> can always acquire_reference_state() for the "struct sk_buff *skb" 
> argument.
> 
> Take a look at a recent RFC: 
> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC 
> patch is tagging the argument could be NULL by appending "__nullable" to 
> the argument name. The verifier will enforce that the bpf prog must 
> check for NULL first.
> 
> The similar idea can be used here but with a different tagging (for 
> example, "__must_release", admittedly not a good name). While the RFC 
> patch is in-progress, for now, may be hardcode for the ".enqueue" ops in 
> check_struct_ops_btf_id() and always acquire_reference_state() for the 
> skb. This part can be adjusted later once the RFC patch will be in shape.
> 
> 
> Then one more thing is to track when the struct_ops bpf prog is actually 
> reading the value of the skb pointer. One thing is worth to mention 
> here, e.g. a struct_ops prog for enqueue:
> 
> SEC("struct_ops")
> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>           struct sk_buff **to_free)
> {
>      return bpf_qdisc_drop(skb, sch, to_free);
> }
> 
> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to 
> an array of __u64 as the only argument. The skb is actually in ctx[0], 
> sch is in ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. 
> r1 = ctx[0]), btf_ctx_access() marks the reg_type to PTR_TRUSTED. It 
> needs to also initialize the reg->ref_obj_id by the id obtained earlier 
> from acquire_reference_state() during check_struct_ops_btf_id() somehow.
> 
> 
>> 2. Returning a kptr from a program and treating it as releasing the 
>> reference.
> 
> e.g. for dequeue:
> 
> struct Qdisc_ops {
>      /* ... */
>      struct sk_buff *        (*dequeue)(struct Qdisc *);
> };
> 
> 
> Right now the verifier should complain on check_reference_leak() if the 
> struct_ops bpf prog is returning a referenced kptr.
> 
> Unlike an argument, the return type of a function does not have a name 
> to tag. It is the first case that a struct_ops bpf_prog returning a 

We may tag the stub functions instead, right?
Is the purpose here to return a referenced pointer from a struct_ops
operator without the verifier complaining?

> pointer. One idea is to assume it must be a trusted pointer 
> (PTR_TRUSTED) and the verifier should check it is indeed with 
> PTR_TRUSTED flag.
> 
> May be release_reference_state() can be called to assume the kernel will 
> release it as long as the return pointer type is PTR_TRUSTED and the 
> type matches the return type of the ops. Take a look at 
> check_return_code().
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-30 17:49             ` Kui-Feng Lee
@ 2024-01-31  1:01               ` Martin KaFai Lau
  2024-01-31 16:49                 ` Kui-Feng Lee
  0 siblings, 1 reply; 36+ messages in thread
From: Martin KaFai Lau @ 2024-01-31  1:01 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: Amery Hung, bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee

On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
>>> 2. Returning a kptr from a program and treating it as releasing the reference.
>>
>> e.g. for dequeue:
>>
>> struct Qdisc_ops {
>>      /* ... */
>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
>> };
>>
>>
>> Right now the verifier should complain on check_reference_leak() if the 
>> struct_ops bpf prog is returning a referenced kptr.
>>
>> Unlike an argument, the return type of a function does not have a name to tag. 
>> It is the first case that a struct_ops bpf_prog returning a 
> 
> We may tag the stub functions instead, right?

What is the suggestion on how to tag the return type?

I was suggesting it doesn't need tagging: the pointer returned by a struct_ops 
should by default be required to be a trusted ptr. Both the pointer arguments 
and the return pointer of a struct_ops should be trusted ptrs.

> Is the purpose here to return a referenced pointer from a struct_ops
> operator without verifier complaining?

Yes, basically we need to teach the verifier that the kernel will do the reference release.

> 
>> pointer. One idea is to assume it must be a trusted pointer (PTR_TRUSTED) and 
>> the verifier should check it is indeed with PTR_TRUSTED flag.
>>
>> May be release_reference_state() can be called to assume the kernel will 
>> release it as long as the return pointer type is PTR_TRUSTED and the type 
>> matches the return type of the ops. Take a look at check_return_code(). 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-30  6:39           ` Martin KaFai Lau
  2024-01-30 17:49             ` Kui-Feng Lee
@ 2024-01-31 16:23             ` Amery Hung
  2024-02-02  1:47               ` Martin KaFai Lau
  1 sibling, 1 reply; 36+ messages in thread
From: Amery Hung @ 2024-01-31 16:23 UTC (permalink / raw
  To: Martin KaFai Lau
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee

On Mon, Jan 29, 2024 at 10:39 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >> We can see how those limitations (calling sch_tree_lock() and returning a ptr)
> >> can be addressed in bpf. This will also help other similar use cases.
> >>
> >
> > For kptr, I wonder if we can support the following semantics in bpf if
> > they make sense:
>
> I think they are useful but they are not fully supported now.
>
> Some thoughts below.
>
> > 1. Passing a referenced kptr into a bpf program, which will also need
> > to be released, or exchanged into maps or allocated objects.
>
> "enqueue" should be the one considering here:
>
> struct Qdisc_ops {
>         /* ... */
>         int                     (*enqueue)(struct sk_buff *skb,
>                                            struct Qdisc *sch,
>                                            struct sk_buff **to_free);
>
> };
>
> The verifier only marks the skb as a trusted kptr but does not mark its
> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
>
>         if (prog_args_trusted(prog))
>                 info->reg_type |= PTR_TRUSTED;
>
> The verifier does not know the skb ownership is passed into the ".enqueue" ops
> and does not know the bpf prog needs to release it or store it in a map.
>
> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
> acquire_reference_state() which is the useful one here.
>
> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
> always acquire_reference_state() for the "struct sk_buff *skb" argument.
>
> Take a look at a recent RFC:
> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
> is tagging the argument could be NULL by appending "__nullable" to the argument
> name. The verifier will enforce that the bpf prog must check for NULL first.
>
> The similar idea can be used here but with a different tagging (for example,
> "__must_release", admittedly not a good name). While the RFC patch is
> in-progress, for now, may be hardcode for the ".enqueue" ops in
> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
> part can be adjusted later once the RFC patch will be in shape.
>

Makes sense. One more thing to consider here is that .enqueue is
actually a reference-acquiring and reference-releasing function at the same
time. Assuming the ctx written to by a struct_ops program can be seen by
the kernel, another new tag for the "to_free" argument will still be
needed so that the verifier can recognize an skb being written to
"to_free".

>
> Then one more thing is to track when the struct_ops bpf prog is actually reading
> the value of the skb pointer. One thing is worth to mention here, e.g. a
> struct_ops prog for enqueue:
>
> SEC("struct_ops")
> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>              struct sk_buff **to_free)
> {
>         return bpf_qdisc_drop(skb, sch, to_free);
> }
>
> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
> during check_struct_ops_btf_id() somehow.
>
>
> > 2. Returning a kptr from a program and treating it as releasing the reference.
>
> e.g. for dequeue:
>
> struct Qdisc_ops {
>         /* ... */
>         struct sk_buff *        (*dequeue)(struct Qdisc *);
> };
>
>
> Right now the verifier should complain on check_reference_leak() if the
> struct_ops bpf prog is returning a referenced kptr.
>
> Unlike an argument, the return type of a function does not have a name to tag.
> It is the first case that a struct_ops bpf_prog returning a pointer. One idea is
> to assume it must be a trusted pointer (PTR_TRUSTED) and the verifier should
> check it is indeed with PTR_TRUSTED flag.
>
> May be release_reference_state() can be called to assume the kernel will release
> it as long as the return pointer type is PTR_TRUSTED and the type matches the
> return type of the ops. Take a look at check_return_code().

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-31  1:01               ` Martin KaFai Lau
@ 2024-01-31 16:49                 ` Kui-Feng Lee
  2024-01-31 16:59                   ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Kui-Feng Lee @ 2024-01-31 16:49 UTC (permalink / raw
  To: Martin KaFai Lau
  Cc: Amery Hung, bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee



On 1/30/24 17:01, Martin KaFai Lau wrote:
> On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
>>>> 2. Returning a kptr from a program and treating it as releasing the 
>>>> reference.
>>>
>>> e.g. for dequeue:
>>>
>>> struct Qdisc_ops {
>>>      /* ... */
>>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
>>> };
>>>
>>>
>>> Right now the verifier should complain on check_reference_leak() if 
>>> the struct_ops bpf prog is returning a referenced kptr.
>>>
>>> Unlike an argument, the return type of a function does not have a 
>>> name to tag. It is the first case that a struct_ops bpf_prog returning a 
>>
>> We may tag the stub functions instead, right?
> 
> What is the suggestion on how to tag the return type?
> 
> I was suggesting it doesn't need to tag and it should by default require 
> a trusted ptr for the pointer returned by struct_ops. The pointer 
> argument and the return pointer of a struct_ops should be a trusted ptr.


That makes sense to me. Should we also allow operators to return a null
pointer?

> 
>> Is the purpose here to return a referenced pointer from a struct_ops
>> operator without verifier complaining?
> 
> Yes, basically need to teach the verifier the kernel will do the 
> reference release.
> 
>>
>>> pointer. One idea is to assume it must be a trusted pointer 
>>> (PTR_TRUSTED) and the verifier should check it is indeed with 
>>> PTR_TRUSTED flag.
>>>
>>> May be release_reference_state() can be called to assume the kernel 
>>> will release it as long as the return pointer type is PTR_TRUSTED and 
>>> the type matches the return type of the ops. Take a look at 
>>> check_return_code(). 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-31 16:49                 ` Kui-Feng Lee
@ 2024-01-31 16:59                   ` Amery Hung
  0 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-01-31 16:59 UTC (permalink / raw
  To: Kui-Feng Lee
  Cc: Martin KaFai Lau, bpf, yangpeihao, toke, jhs, jiri, sdf,
	xiyou.wangcong, yepeilin.cs, netdev, Kui-Feng Lee

On Wed, Jan 31, 2024 at 8:49 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 1/30/24 17:01, Martin KaFai Lau wrote:
> > On 1/30/24 9:49 AM, Kui-Feng Lee wrote:
> >>>> 2. Returning a kptr from a program and treating it as releasing the
> >>>> reference.
> >>>
> >>> e.g. for dequeue:
> >>>
> >>> struct Qdisc_ops {
> >>>      /* ... */
> >>>      struct sk_buff *        (*dequeue)(struct Qdisc *);
> >>> };
> >>>
> >>>
> >>> Right now the verifier should complain on check_reference_leak() if
> >>> the struct_ops bpf prog is returning a referenced kptr.
> >>>
> >>> Unlike an argument, the return type of a function does not have a
> >>> name to tag. It is the first case that a struct_ops bpf_prog returning a
> >>
> >> We may tag the stub functions instead, right?
> >
> > What is the suggestion on how to tag the return type?
> >
> > I was suggesting it doesn't need to tag and it should by default require
> > a trusted ptr for the pointer returned by struct_ops. The pointer
> > argument and the return pointer of a struct_ops should be a trusted ptr.
>
>
> That make sense to me. Should we also allow operators to return a null
> pointer?
>

.dequeue in Qdisc_ops can return a null pointer when there is no skb
to be dequeued, so I think that should be allowed.

> >
> >> Is the purpose here to return a referenced pointer from a struct_ops
> >> operator without verifier complaining?
> >
> > Yes, basically need to teach the verifier the kernel will do the
> > reference release.
> >
> >>
> >>> pointer. One idea is to assume it must be a trusted pointer
> >>> (PTR_TRUSTED) and the verifier should check it is indeed with
> >>> PTR_TRUSTED flag.
> >>>
> >>> May be release_reference_state() can be called to assume the kernel
> >>> will release it as long as the return pointer type is PTR_TRUSTED and
> >>> the type matches the return type of the ops. Take a look at
> >>> check_return_code().
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-01-31 16:23             ` Amery Hung
@ 2024-02-02  1:47               ` Martin KaFai Lau
  2024-02-09 20:14                 ` Amery Hung
  0 siblings, 1 reply; 36+ messages in thread
From: Martin KaFai Lau @ 2024-02-02  1:47 UTC (permalink / raw
  To: Amery Hung
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee

On 1/31/24 8:23 AM, Amery Hung wrote:
>>> 1. Passing a referenced kptr into a bpf program, which will also need
>>> to be released, or exchanged into maps or allocated objects.
>> "enqueue" should be the one considering here:
>>
>> struct Qdisc_ops {
>>          /* ... */
>>          int                     (*enqueue)(struct sk_buff *skb,
>>                                             struct Qdisc *sch,
>>                                             struct sk_buff **to_free);
>>
>> };
>>
>> The verifier only marks the skb as a trusted kptr but does not mark its
>> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
>>
>>          if (prog_args_trusted(prog))
>>                  info->reg_type |= PTR_TRUSTED;
>>
>> The verifier does not know the skb ownership is passed into the ".enqueue" ops
>> and does not know the bpf prog needs to release it or store it in a map.
>>
>> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
>> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
>> acquire_reference_state() which is the useful one here.
>>
>> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
>> always acquire_reference_state() for the "struct sk_buff *skb" argument.
>>
>> Take a look at a recent RFC:
>> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
>> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
>> is tagging the argument could be NULL by appending "__nullable" to the argument
>> name. The verifier will enforce that the bpf prog must check for NULL first.
>>
>> The similar idea can be used here but with a different tagging (for example,
>> "__must_release", admittedly not a good name). While the RFC patch is
>> in-progress, for now, may be hardcode for the ".enqueue" ops in
>> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
>> part can be adjusted later once the RFC patch will be in shape.
>>
> Makes sense. One more thing to consider here is that .enqueue is
> actually a reference acquiring and releasing function at the same
> time. Assuming ctx written to by a struct_ops program can be seen by
> the kernel, another new tag for the "to_free" argument will still be
> needed so that the verifier can recognize when an skb is written to
> "to_free".

I don't think "to_free" needs special tagging. I was thinking the 
"bpf_qdisc_drop" kfunc could be a KF_RELEASE. Ideally, it should be like

__bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
	                       struct sk_buff **to_free)
{
	return qdisc_drop(skb, sch, to_free);
}
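
(Registering it as a release kfunc would then just be the usual BTF id set
plumbing; roughly this sketch:)

BTF_SET8_START(qdisc_kfunc_ids)
BTF_ID_FLAGS(func, bpf_qdisc_drop, KF_RELEASE)
BTF_SET8_END(qdisc_kfunc_ids)

static const struct btf_kfunc_id_set qdisc_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &qdisc_kfunc_ids,
};

/* and at init time, something like:
 * register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &qdisc_kfunc_set);
 */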

However, I don't think the verifier supports pointer-to-pointer arguments now,
meaning "struct sk_buff **to_free" does not work.

If the ptr indirection spinning in my head is sound, one possible solution to 
unblock the qdisc work is to introduce:

struct bpf_sk_buff_ptr {
	struct sk_buff *skb;
};

and the bpf_qdisc_drop kfunc:

__bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
                                struct bpf_sk_buff_ptr *to_free_list)

and the enqueue prog:

SEC("struct_ops/enqueue")
int BPF_PROG(test_enqueue, struct sk_buff *skb,
              struct Qdisc *sch,
              struct bpf_sk_buff_ptr *to_free_list)
{
	return bpf_qdisc_drop(skb, sch, to_free_list);
}

and the ".is_valid_access" needs to change the btf_type from "struct sk_buff **" 
to "struct bpf_sk_buff_ptr *" which is sort of similar to the bpf_tcp_ca.c that 
is changing the "struct sock *" type to the "struct tcp_sock *" type.

I have the compiler-tested idea here: 
https://git.kernel.org/pub/scm/linux/kernel/git/martin.lau/bpf-next.git/log/?h=qdisc-ideas


> 
>> Then one more thing is to track when the struct_ops bpf prog is actually reading
>> the value of the skb pointer. One thing worth mentioning here, e.g. a
>> struct_ops prog for enqueue:
>>
>> SEC("struct_ops")
>> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
>>               struct sk_buff **to_free)
>> {
>>          return bpf_qdisc_drop(skb, sch, to_free);
>> }
>>
>> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
>> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
>> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
>> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
>> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
>> during check_struct_ops_btf_id() somehow.


* Re: [RFC PATCH v7 1/8] net_sched: Introduce eBPF based Qdisc
  2024-02-02  1:47               ` Martin KaFai Lau
@ 2024-02-09 20:14                 ` Amery Hung
  0 siblings, 0 replies; 36+ messages in thread
From: Amery Hung @ 2024-02-09 20:14 UTC (permalink / raw
  To: Martin KaFai Lau
  Cc: bpf, yangpeihao, toke, jhs, jiri, sdf, xiyou.wangcong,
	yepeilin.cs, netdev, Kui-Feng Lee

On Thu, Feb 1, 2024 at 5:47 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/31/24 8:23 AM, Amery Hung wrote:
> >>> 1. Passing a referenced kptr into a bpf program, which will also need
> >>> to be released, or exchanged into maps or allocated objects.
> >> "enqueue" should be the one considering here:
> >>
> >> struct Qdisc_ops {
> >>          /* ... */
> >>          int                     (*enqueue)(struct sk_buff *skb,
> >>                                             struct Qdisc *sch,
> >>                                             struct sk_buff **to_free);
> >>
> >> };
> >>
> >> The verifier only marks the skb as a trusted kptr but does not mark its
> >> reg->ref_obj_id. Take a look at btf_ctx_access(). In particular:
> >>
> >>          if (prog_args_trusted(prog))
> >>                  info->reg_type |= PTR_TRUSTED;
> >>
> >> The verifier does not know the skb ownership is passed into the ".enqueue" ops
> >> and does not know the bpf prog needs to release it or store it in a map.
> >>
> >> The verifier tracks the reference state when a KF_ACQUIRE kfunc is called (just
> >> an example, not saying we need to use KF_ACQUIRE kfunc). Take a look at
> >> acquire_reference_state() which is the useful one here.
> >>
> >> Whenever the verifier is loading the ".enqueue" bpf_prog, the verifier can
> >> always acquire_reference_state() for the "struct sk_buff *skb" argument.
> >>
> >> Take a look at a recent RFC:
> >> https://lore.kernel.org/bpf/20240122212217.1391878-1-thinker.li@gmail.com/
> >> which is tagging the argument of an ops (e.g. ".enqueue" here). That RFC patch
> >> tags arguments that could be NULL by appending "__nullable" to the argument
> >> name. The verifier will enforce that the bpf prog must check for NULL first.
> >>
> >> A similar idea can be used here but with a different tag (for example,
> >> "__must_release", admittedly not a good name). While the RFC patch is
> >> in progress, for now we may hardcode the ".enqueue" ops in
> >> check_struct_ops_btf_id() and always acquire_reference_state() for the skb. This
> >> part can be adjusted later once the RFC patch is in shape.
> >>
> > Makes sense. One more thing to consider here is that .enqueue is
> > actually a reference acquiring and releasing function at the same
> > time. Assuming ctx written to by a struct_ops program can be seen by
> > the kernel, another new tag for the "to_free" argument will still be
> > needed so that the verifier can recognize when an skb is written to
> > "to_free".
>
> I don't think "to_free" needs special tagging. I was thinking the
> "bpf_qdisc_drop" kfunc could be a KF_RELEASE. Ideally, it should be like
>
> __bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
>                                struct sk_buff **to_free)
> {
>         return qdisc_drop(skb, sch, to_free);
> }
>
> However, I don't think the verifier supports pointer-to-pointer arguments now,
> meaning "struct sk_buff **to_free" does not work.
>
> If the ptr indirection spinning in my head is sound, one possible solution to
> unblock the qdisc work is to introduce:
>
> struct bpf_sk_buff_ptr {
>         struct sk_buff *skb;
> };
>
> and the bpf_qdisc_drop kfunc:
>
> __bpf_kfunc int bpf_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
>                                 struct bpf_sk_buff_ptr *to_free_list)
>
> and the enqueue prog:
>
> SEC("struct_ops/enqueue")
> int BPF_PROG(test_enqueue, struct sk_buff *skb,
>               struct Qdisc *sch,
>               struct bpf_sk_buff_ptr *to_free_list)
> {
>         return bpf_qdisc_drop(skb, sch, to_free_list);
> }
>
> and the ".is_valid_access" needs to change the btf_type from "struct sk_buff **"
> to "struct bpf_sk_buff_ptr *" which is sort of similar to the bpf_tcp_ca.c that
> is changing the "struct sock *" type to the "struct tcp_sock *" type.
>
> I have the compiler-tested idea here:
> https://git.kernel.org/pub/scm/linux/kernel/git/martin.lau/bpf-next.git/log/?h=qdisc-ideas
>
>
> >
> >> Then one more thing is to track when the struct_ops bpf prog is actually reading
> >> the value of the skb pointer. One thing worth mentioning here, e.g. a
> >> struct_ops prog for enqueue:
> >>
> >> SEC("struct_ops")
> >> int BPF_PROG(bpf_dropall_enqueue, struct sk_buff *skb, struct Qdisc *sch,
> >>               struct sk_buff **to_free)
> >> {
> >>          return bpf_qdisc_drop(skb, sch, to_free);
> >> }
> >>
> >> Take a look at the BPF_PROG macro, the bpf prog is getting a pointer to an array
> >> of __u64 as the only argument. The skb is actually in ctx[0], sch is in
> >> ctx[1]...etc. When ctx[0] is read to get the skb pointer (e.g. r1 = ctx[0]),
> >> btf_ctx_access() marks the reg_type to PTR_TRUSTED. It needs to also initialize
> >> the reg->ref_obj_id by the id obtained earlier from acquire_reference_state()
> >> during check_struct_ops_btf_id() somehow.
>

I appreciate the idea. The pointer indirection works without problems.
I now have a working fifo bpf qdisc using struct_ops. I will explore
how the other parts of a qdisc work with struct_ops.
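
In case it is useful, the struct_ops side of that fifo is declared roughly
like the sketch below (simplified; the exact set of Qdisc_ops members wired
up may differ), and the map is then attached from userspace with
bpf_map__attach_struct_ops():

SEC(".struct_ops")
struct Qdisc_ops fifo = {
	.enqueue = (void *)fifo_enqueue,
	.dequeue = (void *)fifo_dequeue,
	.id      = "bpf_fifo",
};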

Thanks,
Amery

end of thread

Thread overview: 36+ messages
2024-01-17 21:56 [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 1/8] " Amery Hung
2024-01-20 11:41   ` kernel test robot
2024-01-20 12:13   ` kernel test robot
2024-01-20 13:47   ` kernel test robot
2024-01-23 23:51   ` Martin KaFai Lau
2024-01-24  5:22     ` Amery Hung
2024-01-26  2:22       ` Martin KaFai Lau
2024-01-27  1:17         ` Amery Hung
2024-01-30  6:39           ` Martin KaFai Lau
2024-01-30 17:49             ` Kui-Feng Lee
2024-01-31  1:01               ` Martin KaFai Lau
2024-01-31 16:49                 ` Kui-Feng Lee
2024-01-31 16:59                   ` Amery Hung
2024-01-31 16:23             ` Amery Hung
2024-02-02  1:47               ` Martin KaFai Lau
2024-02-09 20:14                 ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 2/8] net_sched: Add kfuncs for working with skb Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 3/8] net_sched: Introduce kfunc bpf_skb_tc_classify() Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 4/8] net_sched: Add reset program Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 5/8] net_sched: Add init program Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 6/8] tools/libbpf: Add support for BPF_PROG_TYPE_QDISC Amery Hung
2024-01-23  0:17   ` Andrii Nakryiko
2024-01-23 19:40     ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 7/8] samples/bpf: Add an example of bpf fq qdisc Amery Hung
2024-01-24 10:29   ` Daniel Borkmann
2024-01-26 19:49     ` Amery Hung
2024-01-17 21:56 ` [RFC PATCH v7 8/8] samples/bpf: Add an example of bpf netem qdisc Amery Hung
2024-01-23 21:13 ` [RFC PATCH v7 0/8] net_sched: Introduce eBPF based Qdisc Stanislav Fomichev
2024-01-24 10:10   ` Daniel Borkmann
2024-01-24 12:09   ` Jamal Hadi Salim
2024-01-24 13:07     ` Daniel Borkmann
2024-01-24 14:11       ` Jamal Hadi Salim
2024-01-24 15:26         ` Daniel Borkmann
2024-01-24 21:26           ` Amery Hung
2024-01-25 11:57             ` Daniel Borkmann
