* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 11:47 Kelly Daly
@ 2006-04-26  7:33 ` David S. Miller
  2006-04-27  3:31   ` Kelly Daly
  2006-04-26  7:59 ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-26  7:33 UTC (permalink / raw)
  To: kelly; +Cc: netdev, rusty

From: Kelly Daly <kelly@au1.ibm.com>
Date: Wed, 26 Apr 2006 11:47:34 +0000

> Noting Dave's recent release of his implementation, we thought we'd
> better get this "out there" so we can do some early
> comparison/combining and come up with the best possible
> implementation.

Thanks for publishing your work.

I'm actually not that upset that I duplicated the work a little
bit, because trying to start implementing things forced me to
think in a more focused way about this stuff.

I'll look over your patches, thanks.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 11:47 Kelly Daly
  2006-04-26  7:33 ` David S. Miller
@ 2006-04-26  7:59 ` David S. Miller
  2006-05-04  7:28   ` Kelly Daly
  1 sibling, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-26  7:59 UTC (permalink / raw)
  To: kelly; +Cc: netdev, rusty


Ok, I have comments already just from glancing at the initial patch.

With the 32-bit descriptors in the channel, you do end up
with a fixed-size pool and a lot of hard-to-finesse sizing
and lookup problems to solve.

So what I wanted to do was finesse the entire issue by simply
side-stepping it initially: use a normal buffer with a tail
descriptor, so when you enqueue you hand over a tail descriptor pointer.

Yes, it's weirder to handle this in hardware, but it's not
impossible and using real pointers means two things:

1) You can design a simple netif_receive_skb() channel that works
   today, encapsulation of channel buffers into an SKB is like
   15 lines of code and no funny lookups.

2) People can start porting the input path of drivers right now and
   retain full functionality and test anything they want.  This is
   important for getting the drivers stable as fast as possible.
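The tail-descriptor idea can be illustrated with a small userspace sketch (names here are hypothetical, and the real thing would live in the kernel and be filled by the driver): packet data sits at the front of an ordinary buffer, a descriptor sits at its tail, and the channel carries a real pointer to that tail descriptor rather than an index into a pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout: packet data in front, descriptor at the tail.
 * The channel enqueues a pointer to the tail descriptor, so the
 * consumer needs no pool lookup to recover the buffer. */
struct tail_desc {
        void   *buf_start;      /* real pointer back to the buffer start */
        size_t  data_len;       /* bytes of packet data currently in it */
};

/* Allocate a buffer whose final bytes hold its own tail descriptor. */
static struct tail_desc *alloc_tail_buf(size_t buf_len)
{
        char *buf = malloc(buf_len);
        struct tail_desc *td;

        if (!buf)
                return NULL;
        td = (struct tail_desc *)(buf + buf_len - sizeof(*td));
        td->buf_start = buf;
        td->data_len = 0;
        return td;              /* this pointer is what goes on the channel */
}
```

Encapsulating such a buffer into an skb then amounts to pointing the skb's data at `buf_start`, with no descriptor-to-buffer lookup in between.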

It also means we can tackle the buffer pool issue of the 32-bit
descriptors later, if we actually want to do things that way; I
think we probably don't.

To be honest, I don't think using a 32-bit descriptor is so critical
even from a hardware implementation perspective.  Yes, on 64-bit
you're dealing with a 64-bit quantity, so the number of entries in the
channel is halved from what a 32-bit arch gets.

I say this for two reasons:

1) We have no idea whether it's critical to have "~512" entries
   in the channel, which is about what a u32 queue entry type
   affords you on x86 with a 4096-byte page size.

2) Furthermore, the ring is sized by page size, and most 64-bit
   platforms use an 8K base page size anyway, so the number of queue
   entries ends up being the same.  Yes, I know some 64-bit platforms
   use a 4K page size; please see #1 :-)

I really dislike the pools of buffers, partly because they are fixed
size (or dynamically sized and even more expensive to implement), but
more so because there is all of this absolutely stupid state management
you eat just to get at the real data.  That's pointless; we're trying
to make this as light as possible.  Just use real pointers and
describe the packet with a tail descriptor.

We can use a u64 or whatever in a hardware implementation.

Next, you can't even begin to work on the protocol channels before you
do one very important piece of work: integration of all of the ipv4
and ipv6 protocol hash tables into central code.  It's a total
prerequisite.  Then you modify things to use a generic
inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
protocol number as well as saddr/daddr/sport/dport and searches
a central table.
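A central table keyed that way would hash on the protocol number alongside the usual 4-tuple; a minimal userspace sketch of such a key (names and the toy hash are hypothetical, not the actual net/ code):

```c
#include <stdint.h>

/* Hypothetical key for one central INET lookup table: the protocol
 * number is part of the key, so TCP, UDP, etc. share one table. */
struct inet_lookup_key {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;         /* IPPROTO_TCP, IPPROTO_UDP, ... */
};

static uint32_t inet_key_hash(const struct inet_lookup_key *k, uint32_t mask)
{
        /* trivial mixing for illustration; a real table would do better */
        uint32_t h = k->saddr ^ k->daddr;

        h ^= ((uint32_t)k->sport << 16) | k->dport;
        h ^= k->proto;
        return h & mask;
}
```

A generic inet_lookup() would then compute this hash once and walk a single chain, instead of each protocol maintaining its own tables.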

So I think I'll continue working on my implementation; it's more
transitional, and that's how we have to do this kind of work.


* [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 11:47 Kelly Daly
  2006-04-26  7:33 ` David S. Miller
  2006-04-26  7:59 ` David S. Miller
  0 siblings, 2 replies; 79+ messages in thread
From: Kelly Daly @ 2006-04-26 11:47 UTC (permalink / raw)
  To: netdev; +Cc: rusty, davem

Hey guys...  I've been working with Rusty on a VJ Channel implementation.  
Noting Dave's recent release of his implementation, we thought we'd better 
get this "out there" so we can do some early comparison/combining and 
come up with the best possible implementation.

There are three patches in total:
1) vj_core.patch - core files for VJ to userspace
2) vj_udp.patch  - badly hacked up UDP receive implementation - basically just to test what logic may be like!
3) vj_ne2k.patch - modified NE2K and 8390 used for testing on QEMU

Notes:
* channels can have global or local buffers (local for userspace; could be used directly by an intelligent NIC)
* UDP receive breaks real UDP - it doesn't talk anything except VJ Channels anymore.  Needs integration with normal sources.
* Userspace test app (below) uses the VJ protocol family to mmap space for local buffers; if it receives buffers in kernel space, it sends a request for that buffer to be copied to a local buffer.
* Default channel converts to skb and feeds through normal receive path.

TODO:
* send not yet implemented
* integrate non vj
* LOTS of fixmes

Cheers,
Kelly



Test userspace app:
/*  Van Jacobson net channels implementation for Linux
    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/poll.h>
#include <netinet/in.h>
#include "linux-2.6.16/include/linux/types.h"
#include "linux-2.6.16/include/linux/vjchan.h"

//flowid
#define SADDR 0
#define DADDR 0
#define SPORT 0
#define DPORT 60000
#define IFINDEX 0

#define PF_VJCHAN 27

static struct vj_buffer *get_buffer(struct vj_channel_ring *ring, int desc_num)
{
        printf("desc_num %i\n", desc_num);
        /* the ring page is followed by one page per buffer */
        return (struct vj_buffer *)((char *)ring + (desc_num + 1) * getpagesize());
}
/* return the next buffer, but do not move on */
static struct vj_buffer *vj_peek_next_buffer(struct vj_channel_ring *ring)
{
        if (ring->c.head == ring->p.tail)
                return NULL;
        return get_buffer(ring, ring->q[ring->c.head]);
}

/* move on to next buffer */
static void vj_done_with_buffer(struct vj_channel_ring *ring)
{
        ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;

        printf("done_with_buffer\n\n");
}

int main(int argc, char *argv[])
{
        int sk, cls, bnd, pll;
        void * mmapped;
        struct vj_flowid flowid;
        struct vj_channel_ring *ring;
        struct vj_buffer *buf;
        struct pollfd pfd;

        printf("\nstart of vjchannel socket test app\n");
        sk = socket(PF_VJCHAN, SOCK_DGRAM, IPPROTO_UDP);
        if (sk == -1) {
                perror("Unable to open socket!");
                return -1;
        }
        printf("socket open with ret code %i\n\n", sk);

//create flowid!!!
        flowid.saddr = SADDR;
        flowid.daddr = DADDR;
        flowid.sport = SPORT;
        flowid.dport = htons(DPORT);
        flowid.ifindex = IFINDEX;
        flowid.proto = IPPROTO_UDP;

        printf("flowid created\n");

        bnd = bind(sk, (struct sockaddr *)&flowid, sizeof(struct vj_flowid));
        if (bnd == -1) {
                perror("Unable to bind socket!");
                return -1;
        }
        printf("socket bound with ret code %i\n\n", bnd);

        ring = mmap(0, (getpagesize() * (VJ_NET_CHANNEL_ENTRIES+1)), PROT_READ|PROT_WRITE, MAP_SHARED, sk, 0);
        if (ring == MAP_FAILED) {
                perror ("Unable to mmap socket!");
                return -1;
        }
        printf("socket mmapped to address %p\n\n", (void *)ring);
        
        pfd.fd = sk;
        pfd.events = POLLIN;

        for (;;) {
                pll = poll(&pfd, 1, -1);
                
                if (pll < 0) {
                        perror("polling failed!");
                        return -1;
                }

//consume
                buf = vj_peek_next_buffer(ring);
                if (!buf)       /* woken with nothing queued yet */
                        continue;

                printf("buf %p\n", buf);

//print data, not headers (28 = IP header + UDP header bytes)
                printf("   Buffer Length = %u\n", buf->data_len);
                printf("   Header Length = %u\n", buf->header_len);
                printf("   Buffer Data: '%.*s'\n", (int)buf->data_len - 28, buf->data + buf->header_len + 28);
                vj_done_with_buffer(ring);
        }

        cls = close(sk);
        if (cls != 0) {
                perror("Unable to close socket!");
                return -2;
        }
        printf("socket closed with ret code %i\n\n", cls);
        return 0;
}




-------------------------
Signed-off-by: Kelly Daly <kelly@au.ibm.com>

Basic infrastructure for Van Jacobson net channels: a lockless ring buffer for buffer transport.  Entries in the ring buffer are descriptors for global or local buffers; the ring and local buffers are mmapped into userspace.
Channels are registered with the core by flowid, and a thread services the default channel for any non-matching packets.  Drivers get (global) buffers from vj_get_buffer and dispatch them through vj_netif_rx.
As a userspace mmap cannot reach global buffers, poll() copies global buffers into local buffers if required.
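The lockless ring the patch implements can be reduced to a userspace sketch (single producer, single consumer, one slot kept empty to distinguish full from empty; the names are illustrative, not the patch's own):

```c
#include <stdint.h>

#define ENTRIES 254     /* mirrors VJ_NET_CHANNEL_ENTRIES */

/* Producer owns tail, consumer owns head; no lock is needed as long
 * as each index has exactly one writer. */
struct mini_ring {
        volatile uint16_t tail;         /* producer writes */
        volatile uint16_t head;         /* consumer writes */
        uint32_t q[ENTRIES];
};

static int ring_put(struct mini_ring *r, uint32_t desc)
{
        uint16_t nxt = (r->tail + 1) % ENTRIES;

        if (nxt == r->head)             /* full: one slot kept free */
                return -1;
        r->q[r->tail] = desc;
        /* a kernel version issues smp_wmb() before publishing tail */
        r->tail = nxt;
        return 0;
}

static int ring_get(struct mini_ring *r, uint32_t *desc)
{
        if (r->head == r->tail)         /* empty */
                return -1;
        *desc = r->q[r->head];
        r->head = (r->head + 1) % ENTRIES;
        return 0;
}
```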


diff -r 47031a1f466c linux-2.6.16/include/linux/socket.h
--- linux-2.6.16/include/linux/socket.h Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/socket.h Mon Apr 24 19:50:46 2006
@@ -186,6 +187,7 @@
 #define AF_PPPOX       24      /* PPPoX sockets                */
 #define AF_WANPIPE     25      /* Wanpipe API Sockets */
 #define AF_LLC         26      /* Linux LLC                    */
+#define AF_VJCHAN      27      /* VJ Channel */
 #define AF_TIPC                30      /* TIPC sockets                 */
 #define AF_BLUETOOTH   31      /* Bluetooth sockets            */
 #define AF_MAX         32      /* For now.. */
@@ -219,7 +221,8 @@
 #define PF_PPPOX       AF_PPPOX
 #define PF_WANPIPE     AF_WANPIPE
 #define PF_LLC         AF_LLC
+#define PF_VJCHAN      AF_VJCHAN
 #define PF_TIPC                AF_TIPC
 #define PF_BLUETOOTH   AF_BLUETOOTH
 #define PF_MAX         AF_MAX

diff -r 47031a1f466c linux-2.6.16/net/Kconfig
--- linux-2.6.16/net/Kconfig    Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Kconfig    Mon Apr 24 19:50:46 2006
@@ -65,6 +65,12 @@
 source "net/ipv6/Kconfig"
 
 endif # if INET
+
+config VJCHAN
+       bool "Van Jacobson Net Channel Support (EXPERIMENTAL)"
+       depends on EXPERIMENTAL
+       ---help---
+         This adds a userspace-accessible packet receive interface.  Say N.
 
 menuconfig NETFILTER
        bool "Network packet filtering (replaces ipchains)"
diff -r 47031a1f466c linux-2.6.16/net/Makefile
--- linux-2.6.16/net/Makefile   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Makefile   Mon Apr 24 19:50:46 2006
@@ -46,6 +46,7 @@
 obj-$(CONFIG_IP_SCTP)          += sctp/
 obj-$(CONFIG_IEEE80211)                += ieee80211/
 obj-$(CONFIG_TIPC)             += tipc/
+obj-$(CONFIG_VJCHAN)           += vjchan/
 
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)           += sysctl_net.o
diff -r 47031a1f466c linux-2.6.16/include/linux/vjchan.h
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/vjchan.h Mon Apr 24 19:50:46 2006
@@ -0,0 +1,79 @@
+#ifndef _LINUX_VJCHAN_H
+#define _LINUX_VJCHAN_H
+
+/* num entries in channel q: set so consumer is at offset 1024. */
+#define VJ_NET_CHANNEL_ENTRIES 254
+/* identifies non-local buffers (ie. need kernel to copy to a local) */
+#define VJ_HIGH_BIT 0x80000000
+
+struct vj_producer {
+       __u16 tail;                     /* next element to add */
+       __u8 wakecnt;                   /* do wakeup if != consumer wakecnt */
+       __u8 pad;
+       __u16 old_head;                 /* last cleared buffer posn +1 */
+       __u16 pad2;
+};
+
+struct vj_consumer {
+       __u16 head;                     /* next element to remove */
+       __u8 wakecnt;                   /* increment to request wakeup */
+};
+
+/* mmap returns one of these, followed by 254 pages with a buffer each */
+struct vj_channel_ring {
+       struct vj_producer p;           /* producer's header */
+       __u32 q[VJ_NET_CHANNEL_ENTRIES];
+       struct vj_consumer c;           /* consumer's header */
+};
+
+struct vj_buffer {
+       __u32 data_len;         /* length of actual data in buffer */
+       __u32 header_len;       /* offset eth + ip header (true for now) */
+       __u32 ifindex;          /* interface the packet came in on. */
+       char data[0];
+};
+
+/* Currently assumed IPv4 */
+struct vj_flowid
+{
+       __u32 saddr, daddr;
+       __u16 sport, dport;
+       __u32 ifindex;
+       __u16 proto;
+};
+
+#ifdef __KERNEL__
+struct net_device;
+struct sk_buff;
+
+struct vj_descriptor {
+       unsigned long address;          /* address of net_channel_buffer */
+       unsigned long buffer_len;       /* max length including header */
+};
+
+/* Everything about a vj_channel */
+struct vj_channel
+{
+       struct vj_channel_ring *ring;
+       wait_queue_head_t wq;
+       struct list_head list;
+       struct vj_flowid flowid;
+       int num_local_buffers;
+       struct vj_descriptor *descs;
+       unsigned long *used_descs;
+};
+
+void vj_inc_wakecnt(struct vj_channel *chan);
+struct vj_buffer *vj_get_buffer(int *desc_num);
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num, unsigned short proto);
+int vj_xmit(struct sk_buff *skb, struct net_device *dev);
+struct vj_channel *vj_alloc_chan(int num_buffers);
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid);
+void vj_unregister_chan(struct vj_channel *chan);
+void vj_free_chan(struct vj_channel *chan);
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan);
+void vj_done_with_buffer(struct vj_channel *chan);
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer);
+int vj_need_local_buffer(struct vj_channel *chan);
+#endif
+#endif /* _LINUX_VJCHAN_H */
diff -r 47031a1f466c linux-2.6.16/net/vjchan/Makefile
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/Makefile    Mon Apr 24 19:50:46 2006
@@ -0,0 +1,3 @@
+#obj-m += vjtest.o
+obj-y += vjnet.o
+obj-y += af_vjchan.o
diff -r 47031a1f466c linux-2.6.16/net/vjchan/af_vjchan.c
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/af_vjchan.c Mon Apr 24 19:50:46 2006
@@ -0,0 +1,198 @@
+/*  Van Jacobson net channels implementation for Linux
+    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+*/
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/socket.h>
+#include <linux/vjchan.h>
+#include <net/sock.h>
+
+struct vjchan_sock
+{
+       struct sock sk;
+       struct vj_channel *chan;
+       int vj_reg_flag;
+};
+
+static inline struct vjchan_sock *vj_sk(struct sock *sk)
+{
+       return (struct vjchan_sock *)sk;
+}
+
+static struct proto vjchan_proto = {
+       .name = "VJCHAN",
+       .owner = THIS_MODULE,
+       .obj_size = sizeof(struct vjchan_sock),
+};
+
+int vjchan_release(struct socket *sock)
+{
+       struct sock *sk = sock->sk;
+
+       sock_orphan(sk);
+       sock->sk = NULL;
+       sock_put(sk);
+       return 0;
+}
+
+int vjchan_bind(struct socket *sock, struct sockaddr *addr, int sockaddr_len)
+{
+       struct sock *sk = sock->sk;
+       struct vjchan_sock *vjsk;
+       struct vj_flowid *flowid = (struct vj_flowid *)addr;
+
+       /* FIXME: avoid clashing with normal sockets, replace zeroes. */
+       vjsk = vj_sk(sk);
+       vj_register_chan(vjsk->chan, flowid);
+       vjsk->vj_reg_flag = 1;
+
+       return 0;
+}
+
+int vjchan_getname(struct socket *sock, struct sockaddr *addr,
+                  int *sockaddr_len, int peer)
+{
+       /* FIXME: Implement */
+       return 0;
+}
+
+unsigned int vjchan_poll(struct file *file, struct socket *sock,
+                        struct poll_table_struct *wait)
+{
+       struct sock *sk = sock->sk;
+       struct vj_channel *chan = vj_sk(sk)->chan;
+
+       poll_wait(file, &chan->wq, wait);
+       vj_inc_wakecnt(chan);
+
+       if (vj_peek_next_buffer(chan) && vj_need_local_buffer(chan) == 0)
+               return POLLIN | POLLRDNORM;
+
+       return 0;
+}
+
+/* We map the ring first, then one page per buffer. */
+int vjchan_mmap(struct file *file, struct socket *sock,
+               struct vm_area_struct *vma)
+{
+       struct sock *sk = sock->sk;
+       struct vj_channel *chan = vj_sk(sk)->chan;
+       int i, vip;
+       unsigned long pos;
+
+       if (vma->vm_end - vma->vm_start !=
+           (1 + chan->num_local_buffers)*PAGE_SIZE)
+               return -EINVAL;
+
+       pos = vma->vm_start;
+       vip = vm_insert_page(vma, pos, virt_to_page(chan->ring));
+       pos += PAGE_SIZE;
+       for (i = 0; i < chan->num_local_buffers; i++) {
+               vip = vm_insert_page(vma, pos, virt_to_page(chan->descs[i].address));
+               pos += PAGE_SIZE;
+       }
+       return 0;
+}
+
+const struct proto_ops vjchan_ops = {
+       .family = PF_VJCHAN,
+       .owner = THIS_MODULE,
+       .release = vjchan_release,
+       .bind = vjchan_bind,
+       .socketpair = sock_no_socketpair,
+       .accept = sock_no_accept,
+       .getname = vjchan_getname,
+       .poll = vjchan_poll,
+       .ioctl = sock_no_ioctl,
+       .shutdown = sock_no_shutdown,
+       .setsockopt = sock_common_setsockopt,
+       .getsockopt = sock_common_getsockopt,
+       .sendmsg = sock_no_sendmsg,
+       .recvmsg = sock_no_recvmsg,
+       .mmap = vjchan_mmap,
+       .sendpage = sock_no_sendpage
+};
+
+static void vjchan_destruct(struct sock *sk)
+{
+       struct vjchan_sock *vjsk;
+
+       vjsk = vj_sk(sk);
+       if (vjsk->vj_reg_flag) {
+               vj_unregister_chan(vjsk->chan);
+               vjsk->vj_reg_flag = 0;
+       }
+       vj_free_chan(vjsk->chan);
+
+}
+
+static int vjchan_create(struct socket *sock, int protocol)
+{
+       struct sock *sk;
+       struct vjchan_sock *vjsk;
+       int err;
+
+       if (!capable(CAP_NET_RAW))
+               return -EPERM;
+       if (sock->type != SOCK_DGRAM
+           && sock->type != SOCK_RAW
+           && sock->type != SOCK_PACKET)
+               return -ESOCKTNOSUPPORT;
+
+       sock->state = SS_UNCONNECTED;
+
+       err = -ENOBUFS;
+       sk = sk_alloc(PF_VJCHAN, GFP_KERNEL, &vjchan_proto, 1);
+       if (sk == NULL)
+               goto out;
+
+       sock->ops = &vjchan_ops;
+
+       sock_init_data(sock, sk);
+       sk->sk_family = PF_VJCHAN;
+       sk->sk_destruct = vjchan_destruct;
+
+       vjsk = vj_sk(sk);
+       vjsk->chan = vj_alloc_chan(VJ_NET_CHANNEL_ENTRIES);
+       vjsk->vj_reg_flag = 0;
+       if (!vjsk->chan)
+               return -ENOMEM; /* FIXME: leaks sk; needs sock_put */
+       return 0;
+out:
+       return err;
+}
+
+static struct net_proto_family vjchan_family_ops = {
+       .family =       PF_VJCHAN,
+       .create =       vjchan_create,
+       .owner  =       THIS_MODULE,
+};
+
+static void __exit vjchan_exit(void)
+{
+       sock_unregister(PF_VJCHAN);
+}
+
+static int __init vjchan_init(void)
+{
+       return sock_register(&vjchan_family_ops);
+}
+
+module_init(vjchan_init);
+module_exit(vjchan_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_VJCHAN);
diff -r 47031a1f466c linux-2.6.16/net/vjchan/vjnet.c
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/vjnet.c     Mon Apr 24 19:50:46 2006
@@ -0,0 +1,550 @@
+/*  Van Jacobson net channels implementation for Linux
+    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+*/
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/etherdevice.h>
+#include <linux/spinlock.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/vjchan.h>
+
+#define BUFFER_DATA_LEN 2048
+#define NUM_GLOBAL_DESCRIPTORS 1024
+
+/* All our channels.  FIXME: Lockless funky hash structure please... */
+static LIST_HEAD(channels);
+static spinlock_t chan_lock = SPIN_LOCK_UNLOCKED;
+
+/* Default channel, also holds global buffers (userspace-mapped
+ * channels have local buffers, which they prefer to use). */
+static struct vj_channel *default_chan;
+
+/* need to increment for wake in udp.c wait_for_vj_buffer */
+void vj_inc_wakecnt(struct vj_channel *chan)
+{
+       chan->ring->c.wakecnt++;
+       pr_debug("*** incremented wakecnt - should allow wake up\n");
+}
+EXPORT_SYMBOL(vj_inc_wakecnt);
+
+static int is_empty(struct vj_channel_ring *ring)
+{
+       if (ring->c.head == ring->p.tail)
+               return 1;
+       return 0;
+}
+
+static struct vj_buffer *get_buffer(unsigned int desc_num,
+                                   struct vj_channel *chan)
+{
+       struct vj_buffer *buf;
+
+       if ((desc_num & VJ_HIGH_BIT) || (chan->num_local_buffers == 0)) {
+               desc_num &= ~VJ_HIGH_BIT;
+               BUG_ON(desc_num >= default_chan->num_local_buffers);
+               buf = (struct vj_buffer*)default_chan->descs[desc_num].address;
+       } else {
+               BUG_ON(desc_num >= chan->num_local_buffers);
+               buf = (struct vj_buffer *)chan->descs[desc_num].address;
+       }
+       
+       pr_debug("       received desc_num is %i\n", desc_num);
+       pr_debug("get_buffer %p (%s) %i: %p (len=%u ifind=%u hlen=%zu) %#02X %#02X %#02X %#02X %#02X %#02X %#02X %#02X\n",
+                current, current->comm, desc_num, buf, buf->data_len, buf->ifindex, buf->header_len + (sizeof(struct iphdr *) * 4),
+                buf->data[0], buf->data[1], buf->data[2], buf->data[3], buf->data[4], buf->data[5], buf->data[6], buf->data[7]);
+
+       return buf;
+}
+
+static void release_buffer(struct vj_channel *chan, unsigned int descnum)
+{
+       if (descnum & VJ_HIGH_BIT) {
+               BUG_ON(test_bit(descnum & ~VJ_HIGH_BIT,
+                               default_chan->used_descs) == 0);
+               clear_bit(descnum & ~VJ_HIGH_BIT, default_chan->used_descs);
+       } else {
+               BUG_ON(test_bit(descnum, chan->used_descs) == 0);
+               clear_bit(descnum, chan->used_descs);
+       }
+}
+
+/* Free all descriptors for the current channel between where we last
+ * freed to and where the consumer has not yet consumed. chan->c.head
+ * is not cleared because it may not have been consumed, therefore
+ * chan->p.old_head is not cleared.  If chan->p.old_head ==
+ * chan->c.head then nothing more has been consumed since we last
+ * freed the descriptors. 
+ *
+ * Because we're using local and global channels we need to select the
+ * bitmap according to the channel.  Local channels may be pointing to
+ * local or global buffers, so we need to select the bitmap according
+ * to the buffer type */
+
+/* Free descriptors consumer has consumed since last free */
+static void free_descs_for_channel(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+       int desc_num;
+
+       while (ring->p.old_head != ring->c.head) {
+               printk("ring->p.old_head %i, ring->c.head %i\n", ring->p.old_head, ring->c.head);
+               desc_num = ring->q[ring->p.old_head];
+
+               printk("desc_num %i\n", desc_num);
+
+               /* FIXME: Security concerns: make sure this descriptor
+                * really used by this vjchannel.  Userspace could
+                * have changed it. */
+               release_buffer(chan, desc_num);
+               ring->p.old_head = (ring->p.old_head + 1) % VJ_NET_CHANNEL_ENTRIES;
+               printk("ring->p.old_head %i, ring->c.head %i\n\n", ring->p.old_head, ring->c.head);
+       }
+}
+
+/* return -1 if no descriptor found and none can be freed */
+static int get_free_descriptor(struct vj_channel *chan)
+{
+       int free_desc, bitval;
+
+       BUG_ON(chan->num_local_buffers == 0);
+       do {
+               free_desc = find_first_zero_bit(chan->used_descs,
+                                               chan->num_local_buffers);
+               pr_debug("free_desc = %i\n", free_desc);
+               if (free_desc >= chan->num_local_buffers) {
+                       /* no descriptors, refresh bitmap and try again! */
+                       free_descs_for_channel(chan);
+                       free_desc = find_first_zero_bit(chan->used_descs,
+                                               chan->num_local_buffers);
+                       if (free_desc >= chan->num_local_buffers)
+                               /* still no descriptors */
+                               return -1;
+               }
+               bitval = test_and_set_bit(free_desc, chan->used_descs);
+               pr_debug("bitval = %i\n", bitval);
+       } while (bitval == 1);  //keep going until we get a FREE free bit!
+
+       /* We set high bit to indicate a global channel. */
+       if (chan == default_chan)
+               free_desc |= VJ_HIGH_BIT;
+       return free_desc;
+}
+
+/* This function puts a buffer into a local address space for a
+ * channel that is unable to use a kernel address space.  If address
+ * high bit is set then the buffer is in kernel space - get a free
+ * local buffer and copy it across.  Set local buf to used (done when
+ * finding free buffer), kernel buf to unused. */
+/* FIXME: Loop, do as many as possible at once. */
+int vj_need_local_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+       u32 new_desc, k_desc;
+
+       k_desc = ring->q[ring->c.head];
+
+       if (ring->q[ring->c.head] & VJ_HIGH_BIT) {
+               struct vj_buffer *buf, *kbuf;
+
+               kbuf = get_buffer(k_desc, chan);
+               new_desc = get_free_descriptor(chan);
+               if (new_desc == (u32)-1)
+                       return -ENOBUFS;
+               buf = get_buffer(new_desc, chan);       
+               memcpy (buf, kbuf, sizeof(struct vj_buffer)
+                       + kbuf->data_len + kbuf->header_len);
+/* clear the old descriptor and set q to new one */
+               k_desc &= ~VJ_HIGH_BIT;
+               clear_bit(k_desc, default_chan->used_descs);    
+               ring->q[ring->c.head] = new_desc;
+       }
+       return 0;
+}
+EXPORT_SYMBOL(vj_need_local_buffer);
+
+struct vj_buffer *vj_get_buffer(int *desc_num)
+{
+       *desc_num = get_free_descriptor(default_chan);
+
+       if (*desc_num == -1) {
+               printk("no free bits!\n");
+               return NULL;  
+       }
+
+       return get_buffer(*desc_num, default_chan);
+}
+EXPORT_SYMBOL(vj_get_buffer);
+
+static void enqueue_buffer(struct vj_channel *chan, struct vj_buffer *buffer, int desc_num)
+{
+       u16 tail, nxt;
+       int i;
+
+       pr_debug("*** in enqueue buffer\n");
+       pr_debug("   desc_num = %i\n", desc_num);
+       pr_debug("   Buffer Data Length = %u\n", buffer->data_len);
+       pr_debug("   Buffer Header Length = %u\n", buffer->header_len);
+       pr_debug("   Buffer Data:\n");
+       for (i = 0; i < buffer->data_len; i++) {
+               pr_debug("%i ", buffer->data[i]);
+               if (i % 20 == 0)
+                       pr_debug("\n");
+       }
+       pr_debug("\n");
+
+       tail = chan->ring->p.tail;
+       nxt = (tail + 1) % VJ_NET_CHANNEL_ENTRIES;
+               
+       pr_debug("nxt = %i and chan->c.head = %i\n", nxt, chan->ring->c.head);
+       if (nxt != chan->ring->c.head) {
+               chan->ring->q[tail] = desc_num;
+
+               smp_wmb();
+               chan->ring->p.tail=nxt;
+               pr_debug("chan->p.wakecnt = %i and chan->c.wakecnt = %i\n", chan->ring->p.wakecnt, chan->ring->c.wakecnt);
+               free_descs_for_channel(chan);
+               if (chan->ring->p.wakecnt != chan->ring->c.wakecnt) {
+                       ++chan->ring->p.wakecnt;
+                       /* consume whatever is available */
+                       pr_debug("WAKE UP, CONSUMER!!!\n\n");
+                       wake_up(&chan->wq);
+               }
+       } else //if can't add it to chan, may as well allow it to be reused
+               release_buffer(chan, desc_num);
+}
+
+/* FIXME: If we're going to do wildcards here, we need to do ordering between different partial matches... */
+static struct vj_channel *find_channel(u32 saddr, u32 daddr, u16 proto, u16 sport, u16 dport, u32 ifindex)
+{
+       struct vj_channel *i;
+
+       pr_debug("args saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", saddr, daddr, sport, dport, ifindex, proto);
+
+       list_for_each_entry(i, &channels, list) {
+               pr_debug("saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", i->flowid.saddr, i->flowid.daddr, i->flowid.sport, i->flowid.dport, i->flowid.ifindex, i->flowid.proto);
+       
+               if ((!i->flowid.saddr || i->flowid.saddr == saddr) &&
+                   (!i->flowid.daddr || i->flowid.daddr == daddr) &&
+                   (!i->flowid.proto || i->flowid.proto == proto) &&
+                   (!i->flowid.sport || i->flowid.sport == sport) &&
+                   (!i->flowid.dport || i->flowid.dport == dport) &&
+                   (!i->flowid.ifindex || i->flowid.ifindex == ifindex)) {
+                       pr_debug("Found channel %p\n", i);
+                       return i;
+               }
+       }
+       pr_debug("using default channel %p\n", default_chan);
+       return default_chan;
+}
+
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num, 
+                unsigned short proto)
+{
+       struct vj_channel *chan;
+       struct iphdr *ip;
+       int iphl, offset, real_data_len;
+       u16 *ports;
+       unsigned long flags;
+
+       offset = sizeof(struct iphdr) + sizeof(struct udphdr);
+       real_data_len = buffer->data_len - offset;
+
+
+       pr_debug("data_len = %lu, offset = %i, real data? = %i\n\n\n", buffer->data_len, offset, real_data_len);
+       /* this is always 18 when there's 18 or less characters in buffer->data */
+
+       pr_debug("rx) desc_num = %i\n\n", desc_num);
+
+       spin_lock_irqsave(&chan_lock, flags);
+       if (proto == __constant_htons(ETH_P_IP)) {
+
+               ip = (struct iphdr *)(buffer->data + buffer->header_len);
+               ports = (u16 *)(ip + 1);
+               iphl = ip->ihl * 4;
+               
+               if ((buffer->data_len < (iphl + 4)) || 
+                   (iphl != sizeof(struct iphdr))) {
+                       pr_debug("Bad data, default chan\n");
+                       pr_debug("buffer data_len = %li, header len = %li, ip->ihl = %i\n", buffer->data_len, buffer->header_len, ip->ihl);
+                       chan = default_chan;
+               } else {
+                       chan = find_channel(ip->saddr, ip->daddr, 
+                                           ip->protocol, ports[0], 
+                                           ports[1], buffer->ifindex);
+                       
+               }
+       } else
+               chan = default_chan;
+       enqueue_buffer(chan, buffer, desc_num);
+
+       spin_unlock_irqrestore(&chan_lock, flags);
+}
+EXPORT_SYMBOL(vj_netif_rx);
+
+/*
+ *     Determine the packet's protocol ID. The rule here is that we 
+ *     assume 802.3 if the type field is short enough to be a length.
+ *     This is normal practice and works for any 'now in use' protocol.
+ */
+ 
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer)
+{
+       struct ethhdr *eth;
+       unsigned char *rawp;
+
+       eth = (struct ethhdr *)buffer->data;
+       buffer->header_len = ETH_HLEN;
+
+       BUG_ON(buffer->header_len > buffer->data_len);  
+
+       buffer->data_len -= buffer->header_len;
+       if (ntohs(eth->h_proto) >= 1536)
+               return eth->h_proto;
+               
+       rawp = buffer->data;
+       
+       /*
+        *      This is a magic hack to spot IPX packets. Older Novell breaks
+        *      the protocol design and runs IPX over 802.3 without an 802.2 LLC
+        *      layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
+        *      won't work for fault tolerant netware but does for the rest.
+        */
+       if (*(unsigned short *)rawp == 0xFFFF)
+               return htons(ETH_P_802_3);
+               
+       /*
+        *      Real 802.2 LLC
+        */
+       return htons(ETH_P_802_2);
+}
+EXPORT_SYMBOL(eth_vj_type_trans);
+
+static void send_to_netif_rx(struct vj_buffer *buffer)
+{
+       struct sk_buff *skb;
+       struct net_device *dev;
+       int i;
+
+       dev = dev_get_by_index(buffer->ifindex);
+       if (!dev)
+               return;
+       skb = dev_alloc_skb(buffer->data_len + 2);
+       if (skb == NULL) {
+               dev_put(dev);
+               return;
+       }
+
+       skb_reserve(skb, 2);
+       skb->dev = dev;
+
+       skb_put(skb, buffer->data_len);
+       memcpy(skb->data, buffer->data, buffer->data_len);
+
+       pr_debug(" *** C buffer data_len = %lu and skb->len = %i\n", buffer->data_len, skb->len);
+       for (i = 0; i < 10; i++)
+               pr_debug("%i\n", skb->data[i]);
+
+       skb->protocol = eth_type_trans(skb, skb->dev);
+
+       netif_receive_skb(skb);
+}
+
+/* handles default_chan (buffers that nobody else wants) */
+static int default_thread(void *unused)
+{
+       int consumed = 0;
+       int woken = 0;
+       struct vj_buffer *buffer;
+       wait_queue_t wait;
+
+       /* When we get woken up, don't want to be removed from waitqueue! */
+       /* no more wait.task: struct task_struct *task is now void *private */
+       wait.private = current;
+       wait.func = default_wake_function;
+       INIT_LIST_HEAD(&wait.task_list);
+
+       add_wait_queue(&default_chan->wq, &wait);
+       set_current_state(TASK_INTERRUPTIBLE);
+       while (!kthread_should_stop()) {
+               /* FIXME: if we do this before prepare_to_wait, avoids wmb */
+               default_chan->ring->c.wakecnt++;
+               smp_wmb();
+
+               while (!is_empty(default_chan->ring)) {
+                       smp_read_barrier_depends();
+                       buffer = get_buffer(default_chan->ring->q[default_chan->ring->c.head], default_chan);
+                       pr_debug("calling send_to_netif_rx\n");
+                       send_to_netif_rx(buffer);
+                       smp_rmb();
+                       default_chan->ring->c.head = (default_chan->ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+                       consumed++;
+               }
+
+               schedule();
+               woken++;
+               set_current_state(TASK_INTERRUPTIBLE);
+       }
+       remove_wait_queue(&default_chan->wq, &wait);
+
+       __set_current_state(TASK_RUNNING);
+
+       pr_debug("consumer finished! consumed %i and woke %i\n", consumed, woken);
+       return 0;
+}
+
+/* return the next buffer, but do not move on */
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+
+       if (is_empty(ring))
+               return NULL;
+       return get_buffer(ring->q[ring->c.head], chan);
+}
+EXPORT_SYMBOL(vj_peek_next_buffer);
+
+/* move on to next buffer */
+void vj_done_with_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+
+       ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+
+       pr_debug("done_with_buffer\n\n");
+}
+EXPORT_SYMBOL(vj_done_with_buffer);
+
+struct vj_channel *vj_alloc_chan(int num_buffers)
+{
+       int i;
+       struct vj_channel *chan = kmalloc(sizeof(*chan), GFP_KERNEL);
+
+       if (!chan)
+               return NULL;
+
+       chan->ring = (void *)get_zeroed_page(GFP_KERNEL);
+       if (chan->ring == NULL)
+               goto free_chan;
+
+       init_waitqueue_head(&chan->wq);
+       chan->ring->p.tail = chan->ring->p.wakecnt = chan->ring->p.old_head = chan->ring->c.head = chan->ring->c.wakecnt = 0;
+
+       chan->num_local_buffers = num_buffers;
+       if (chan->num_local_buffers == 0)
+               return chan;
+
+       chan->used_descs = kzalloc(BITS_TO_LONGS(chan->num_local_buffers)
+                                  * sizeof(long), GFP_KERNEL);
+       if (chan->used_descs == NULL)
+               goto free_ring;
+       chan->descs = kmalloc(sizeof(*chan->descs)*num_buffers, GFP_KERNEL);
+       if (chan->descs == NULL)
+               goto free_used_descs;
+       for (i = 0; i < chan->num_local_buffers; i++) {
+               chan->descs[i].buffer_len = PAGE_SIZE;
+               chan->descs[i].address = get_zeroed_page(GFP_KERNEL);
+               if (chan->descs[i].address == 0)
+                       goto free_descs;
+       }
+
+       return chan;
+
+free_descs:
+       for (--i; i >= 0; i--)
+               free_page(chan->descs[i].address);
+       kfree(chan->descs);
+free_used_descs:
+       kfree(chan->used_descs);
+free_ring:
+       free_page((unsigned long)chan->ring);
+free_chan:
+       kfree(chan);
+       return NULL;
+}
+EXPORT_SYMBOL(vj_alloc_chan);
+
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid)
+{
+       pr_debug("%p %s: registering channel %p\n",
+              current, current->comm, chan);
+       chan->flowid = *flowid;
+       spin_lock_irq(&chan_lock);
+       list_add(&chan->list, &channels);
+       spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_register_chan);
+
+void vj_unregister_chan(struct vj_channel *chan)
+{
+       pr_debug("%p %s: unregistering channel %p\n",
+              current, current->comm, chan);
+       spin_lock_irq(&chan_lock);
+       list_del(&chan->list);
+       spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_unregister_chan);
+
+void vj_free_chan(struct vj_channel *chan)
+{
+       pr_debug("%p %s: freeing channel %p\n",
+              current, current->comm, chan);
+       /* FIXME: Mark any buffer still in channel as free! */
+       kfree(chan);
+}
+EXPORT_SYMBOL(vj_free_chan);
+
+
+
+/* not using at the mo - working on rx, not tx */
+int vj_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+       struct vj_buffer *buffer;
+       /* first element in dev priv data must be addr of net_channel */
+//     struct net_channel *chan = *(struct net_channel **) netdev_priv(dev) + 1;
+       int desc_num;
+
+       buffer = vj_get_buffer(&desc_num);
+       if (!buffer) {
+               /* no descriptor available; drop the packet */
+               kfree_skb(skb);
+               return 0;
+       }
+       buffer->data_len = skb->len;
+       memcpy(buffer->data, skb->data, buffer->data_len);
+//     enqueue_buffer(chan, buffer, desc_num);
+
+       kfree_skb(skb);
+       return 0;
+}
+EXPORT_SYMBOL(vj_xmit);
+
+static int __init init(void)
+{
+       default_chan = vj_alloc_chan(NUM_GLOBAL_DESCRIPTORS);
+       if (!default_chan)
+               return -ENOMEM;
+
+       kthread_run(default_thread, NULL, "kvj_net");
+       return 0;
+}
+
+module_init(init);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("VJ Channel Networking Module.");
+MODULE_AUTHOR("Kelly Daly <kelly@au1.ibm.com>");

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 16:57 Caitlin Bestler
  2006-04-26 19:23 ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-26 16:57 UTC (permalink / raw
  To: David S. Miller, kelly; +Cc: netdev, rusty

netdev-owner@vger.kernel.org wrote:
> Ok I have comments already just glancing at the initial patch.
> 
> With the 32-bit descriptors in the channel, you indeed end up
> with a fixed sized pool with a lot of hard-to-finesse sizing
> and lookup problems to solve.
> 
> So what I wanted to do was finesse the entire issue by simply
> side-stepping it initially.  Use a normal buffer with a tail
> descriptor, when you enqueue you give a tail descriptor pointer.
> 
> Yes, it's weirder to handle this in hardware, but it's not
> impossible and using real pointers means two things:
> 
> 1) You can design a simple netif_receive_skb() channel that works
>    today, encapsulation of channel buffers into an SKB is like
>    15 lines of code and no funny lookups.
> 
> 2) People can start porting the input path of drivers right now and
>    retain full functionality and test anything they want.  This is
>    important for getting the drivers stable as fast as possible.
> 
> And it also means we can tackle the buffer pool issue of the
> 32-bit descriptors later, if we actually want to do things
> that way, I think we probably don't.
> 
> To be honest, I don't think using a 32-bit descriptor is so
> critical even from a hardware implementation perspective.
> Yes, on 64-bit you're dealing with a 64-bit quantity, so the
> number of entries in the channel is halved from what a 32-bit arch
> uses. 
> 
> I say this for 2 reasons:
> 
> 1) We have no idea whether it's critical to have "~512" entries
>    in the channel which is about what a u32 queue entry type
>    affords you on x86 with 4096 byte page size.
> 
> 2) Furthermore, it is sized by page size, and most 64-bit platforms
>    use an 8K base page size anyways, so the number of queue entries
>    ends up being the same.  Yes, I know some 64-bit platforms use
>    a 4K page size, please see #1 :-)
> 
> I really dislike the pools of buffers, partly because they
> are fixed size (or dynamically sized and even more expensive
> to implement), but moreso because there is all of this
> absolutely stupid state management you eat just to get at the
> real data.  That's pointless, we're trying to make this as
> light as possible.  Just use real pointers and describe the
> packet with a tail descriptor.
> 
> We can use a u64 or whatever in a hardware implementation.
> 
> Next, you can't even begin to work on the protocol channels
> before you do one very important piece of work.  Integration
> of all of the ipv4 and ipv6 protocol hash tables into a
> central code, it's a total prerequisite.  Then you modify
> things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that
> takes a protocol number as well as saddr/daddr/sport/dport
> and searches from a central table.
> 
> So I think I'll continue working on my implementation, it's
> more transitional and that's how we have to do this kind of work. -

The major element I liked about Kelly's approach is that the ring
is clearly designed to allow a NIC to place packets directly into
a ring that is directly accessible by the user. Evolutionary steps
are good, but isn't direct placement into a user-accessible simple
ring buffer the ultimate justification of netchannels?
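
For concreteness, the single-producer/single-consumer discipline behind that kind of ring (the producer owns p.tail, the consumer owns c.head, and a page of u32 descriptor numbers sits between them) can be mocked up in plain C. This is an illustrative, single-threaded userspace sketch with invented names; the real ring lives in a shared page and needs the smp_wmb()/smp_rmb() pairs the patch inserts:

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8  /* power of two; the patch uses VJ_NET_CHANNEL_ENTRIES */

/* Single-producer/single-consumer ring in the style of the patch's
 * vj_channel_ring: 'tail' is advanced only by the producer, 'head'
 * only by the consumer, and q[] carries u32 descriptor numbers. */
struct mock_ring {
        uint32_t q[RING_ENTRIES];
        unsigned head;  /* consumer index (c.head in the patch) */
        unsigned tail;  /* producer index (p.tail in the patch) */
};

static int ring_empty(const struct mock_ring *r)
{
        return r->head == r->tail;
}

static int ring_enqueue(struct mock_ring *r, uint32_t desc)
{
        unsigned nxt = (r->tail + 1) % RING_ENTRIES;

        if (nxt == r->head)  /* one slot kept free to tell full from empty */
                return -1;
        r->q[r->tail] = desc;
        /* real code: smp_wmb() here so the consumer sees q[] before tail */
        r->tail = nxt;
        return 0;
}

static int ring_dequeue(struct mock_ring *r, uint32_t *desc)
{
        if (ring_empty(r))
                return -1;
        *desc = r->q[r->head];
        /* real code: read barrier before advancing head */
        r->head = (r->head + 1) % RING_ENTRIES;
        return 0;
}
```

A user-visible implementation would place struct mock_ring in a page mapped into both producer and consumer, and add the wake-count handshake the patch uses for sleeping consumers.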

But that doesn't mean that we have to have a very artificial definition
of the ring based on presumptions that hardware only understands
512<<n sized buffers.  Hardware today is typically just as smart as
the processors that IP networks were first designed on, if not more so.

Central integration will also need to work with packet filtering.
In particular, once a flow has been assigned to a netchannel ring, who
is responsible for doing the packet filtering?  Or is it enough to
check the packet filter when the net channel flow is created?





* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 16:57 Caitlin Bestler
@ 2006-04-26 19:23 ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-26 19:23 UTC (permalink / raw
  To: caitlinb; +Cc: kelly, netdev, rusty

From: "Caitlin Bestler" <caitlinb@broadcom.com>
Date: Wed, 26 Apr 2006 09:57:22 -0700

> The major element I liked about Kelly's approach is that the ring
> is clearly designed to allow a NIC to place packets directly into
> a ring that is directly accessible by the user. Evolutionary steps
> are good, but isn't direct placement into a user-accessible simple
> ring buffer the ultimate justification of netchannels?

It is a very good point and one I actually need to think about
some more.

I'll be up front and say that I don't think it's actually necessary to
do channels all the way to userspace, just channeling to the in-kernel
networking protocol is more than sufficient.  This will get us most of
the way without having to deal with any of the thorny issues of doing
protocols in userspace.  I could be wrong but this is my gut instinct
at this time.

> Central integration also will need to be integrated with packet
> filtering.  In particular, once a flow has been assigned to a
> netchannel ring, who is responsible for doing the packet filtering?
> Or is it enough to check the packet filter when the net channel flow
> is created?

Very good question and one that hasn't been discussed enough yet.

Eventually we should be able to do things such as allow netfilter
to register channels too.

Before we do that, we'll have to decide how we'll handle potential
conflicts between local sockets and firewall rules.

There is a school of opinion that would agree to a rule such as:
if a local socket exists, it can trump firewalling.

This would be a nice and simple way to deal with firewall rules that
potentially shadow local sockets.  You couldn't have created that
fully bound socket in the first place if the firewall rules didn't
allow it.  You'd need to insert rules subsequently that block the
connection's flow.  If we want to support that we either have to do
netfilter channels from the get-go, or simply disable socket
netchannels altogether if netfilter is enabled.

I personally think allowing sockets to trump firewall rules is an
acceptable relaxation of the rules in order to simplify the
implementation.


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 19:30 Caitlin Bestler
  2006-04-26 19:46 ` Jeff Garzik
  2006-04-27  3:40 ` Rusty Russell
  0 siblings, 2 replies; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-26 19:30 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, netdev, rusty

David S. Miller wrote:

> 
> I personally think allowing sockets to trump firewall rules
> is an acceptable relaxation of the rules in order to simplify
> the implementation.

I agree.  I have never seen a set of netfilter rules that
would block arbitrary packets *within* an established connection.

Technically you can create such rules, but every single set
of rules actually deployed that I have ever seen started with
a rule to pass all packets for established connections, and
then proceeded to control which connections could be initiated
or accepted.





* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 19:30 Caitlin Bestler
@ 2006-04-26 19:46 ` Jeff Garzik
  2006-04-26 22:40   ` David S. Miller
  2006-04-27  3:40 ` Rusty Russell
  1 sibling, 1 reply; 79+ messages in thread
From: Jeff Garzik @ 2006-04-26 19:46 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, kelly, netdev, rusty

Caitlin Bestler wrote:
> David S. Miller wrote:
> 
>> I personally think allowing sockets to trump firewall rules
>> is an acceptable relaxation of the rules in order to simplify
>> the implementation.
> 
> I agree.  I have never seen a set of netfilter rules that
> would block arbitrary packets *within* an established connection.
> 
> Technically you can create such rules, but every single set
> of rules actually deployed that I have ever seen started with
> a rule to pass all packets for established connections, and
> then proceeded to control which connections could be initiated
> or accepted.

Oh, there are plenty of examples of filtering within an established 
connection:  input rules.  I've seen "drop all packets from <these> IPs"
type rules frequently.  Victims of DoS use those kinds of rules to stop 
packets as early as possible.

	Jeff




* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 20:20 Caitlin Bestler
  2006-04-26 22:35 ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-26 20:20 UTC (permalink / raw
  To: Jeff Garzik; +Cc: David S. Miller, kelly, netdev, rusty

Jeff Garzik wrote:
> Caitlin Bestler wrote:
>> David S. Miller wrote:
>> 
>>> I personally think allowing sockets to trump firewall rules is an
>>> acceptable relaxation of the rules in order to simplify the
>>> implementation.
>> 
>> I agree.  I have never seen a set of netfilter rules that would block
>> arbitrary packets *within* an established connection.
>> 
>> Technically you can create such rules, but every single set of rules
>> actually deployed that I have ever seen started with a rule to pass
>> all packets for established connections, and then proceeded to
>> control which connections could be initiated or accepted.
> 
> Oh, there are plenty of examples of filtering within an established
> connection:  input rules.  I've seen "drop all packets from <these>
> IPs" type rules frequently.  Victims of DoS use those kinds of
> rules to stop packets as early as possible.
> 
> 	Jeff

If you are dropping all packets from IP X, then how was the connection
established? Obviously we are only dealing with connections that
were established before the rule to drop all packets from IP X
was created.

That calls for an ability to revoke the assignment of any flow to
a vj_netchannel when a new rule is created that would filter any
packet that would be classified by the flow.

Basically the rule is that a delegation to a vj_netchannel is
only allowed for flows where *all* packets assigned to that flow
(input or output) would receive a 'pass' from netfilter.

That makes sense.  What I don't see a need for is examining *each*
delegated packet against the entire set of existing rules. Basically,
a flow should either be rule-compliant or not. If it is not, then
the delegation of the flow should be abandoned. If that requires
re-importing TCP state, then perhaps the TCP connection needs to
be aborted.
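
That revocation rule can be sketched concretely. The following is a hypothetical userspace mock-up (invented structures; the zero-means-wildcard matching is borrowed from the patch's find_channel()): on rule insertion, any delegated flow the new rule could touch is pulled back onto the slow path, instead of re-checking every delegated packet.

```c
#include <assert.h>
#include <stdint.h>

/* A delegated flow and a filter rule, both as 5-tuples.
 * In a rule, a zero field is a wildcard matching anything. */
struct flow {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;
        int      delegated;  /* 1 while assigned to a vj_netchannel */
};

struct rule {
        uint32_t saddr, daddr;  /* 0 = match any */
        uint16_t sport, dport;
        uint8_t  proto;
};

static int rule_covers_flow(const struct rule *r, const struct flow *f)
{
        return (!r->saddr || r->saddr == f->saddr) &&
               (!r->daddr || r->daddr == f->daddr) &&
               (!r->sport || r->sport == f->sport) &&
               (!r->dport || r->dport == f->dport) &&
               (!r->proto || r->proto == f->proto);
}

/* Called when a new rule is inserted: revoke the delegation of every
 * flow whose packets the rule could classify. */
static void revoke_matching(struct flow *flows, int n, const struct rule *r)
{
        for (int i = 0; i < n; i++)
                if (flows[i].delegated && rule_covers_flow(r, &flows[i]))
                        flows[i].delegated = 0;
}
```

The cost is paid once per rule change rather than once per packet, which is the point of the "flow is either rule-compliant or not" argument.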

In any event, if netfilter is selectively rejecting packets in the
middle of a connection, then the connection is going to fail anyway.






* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 20:20 Caitlin Bestler
@ 2006-04-26 22:35 ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-26 22:35 UTC (permalink / raw
  To: caitlinb; +Cc: jeff, kelly, netdev, rusty

From: "Caitlin Bestler" <caitlinb@broadcom.com>
Date: Wed, 26 Apr 2006 13:20:50 -0700

> If you are dropping all packets from IP X, then how was the connection
> established? Obviously we are only dealing with connections that
> were established before the rule to drop all packets from IP X
> was created.

The problem is listening TCP connections that you don't
want anyone in the world to be able to connect to.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 19:46 ` Jeff Garzik
@ 2006-04-26 22:40   ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-26 22:40 UTC (permalink / raw
  To: jeff; +Cc: caitlinb, kelly, netdev, rusty

From: Jeff Garzik <jeff@garzik.org>
Date: Wed, 26 Apr 2006 15:46:58 -0400

> Oh, there are plenty of examples of filtering within an established 
> connection:  input rules.  I've seen "drop all packets from <these> IPs"
> type rules frequently.  Victims of DoS use those kinds of rules to stop 
> packets as early as possible.

Yes, good point, but this applies to listening connections.

We'll need to figure out a way to deal with this.

It occurs to me that for established connections, netfilter can simply
remove all matching entries from the netchannel lookup tables.

But that still leaves the thorny listening socket issue.  This may
by itself make netfilter netchannel support important and that brings
up a lot of issues about classifier algorithms.

All of this I wanted to avoid as we start this work :-)

We can think about how to approach these other problems and start
with something simple meanwhile.  That seems to me to be the best
approach moving forward.

It's important to start really simple else we'll just keep getting
bogged down in complexity and details and never implement anything.


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 22:53 Caitlin Bestler
  2006-04-26 22:59 ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-26 22:53 UTC (permalink / raw
  To: David S. Miller, jeff; +Cc: kelly, netdev, rusty

David S. Miller wrote:
> From: Jeff Garzik <jeff@garzik.org>
> Date: Wed, 26 Apr 2006 15:46:58 -0400
> 
>> Oh, there are plenty of examples of filtering within an established
>> connection:  input rules.  I've seen "drop all packets from <these>
>> IPs" type rules frequently.  Victims of DoS use those kinds of rules
>> to stop packets as early as possible.
> 
> Yes, good point, but this applies to listening connections.
> 
> We'll need to figure out a way to deal with this.
> 
> It occurs to me that for established connections, netfilter
> can simply remove all matching entries from the netchannel lookup
> tables. 
> 
> But that still leaves the thorny listening socket issue.
> This may by itself make netfilter netchannel support
> important and that brings up a lot of issues about classifier
> algorithms. 
> 
> All of this I wanted to avoid as we start this work :-)
> 
> We can think about how to approach these other problems and
> start with something simple meanwhile.  That seems to me to
> be the best approach moving forward.
> 
> It's important to start really simple else we'll just keep
> getting bogged down in complexity and details and never
> implement anything.

How does this sound?

The netchannel qualifiers should only deal with TCP packets
for established connections. Listens would continue to be 
dealt with by the existing stack logic, vj_channelizing
only occurring when the connection was accepted.

The vj_netchannel qualifiers would conceptually take place
before the netfilter rules (to avoid making deployment
of netchannels dependent on netfilter) but their creation
would have to be approved by netfilter (if netfilter was
active). Netfilter could also revoke vj_channel qualifiers.

The rule "if a vj_netchannel rule exists then it must be ok
with netfilter" is actually very easy to implement.
During early development you simply tell the testers "hey,
don't set up any netchannels that netfilter would reject"
and defer implementing enforcement until after the netchannels
code actually works. After all, if it isn't actually successfully
transmitting or receiving packets yet, it can't really be acting
contrary to netfilter policy.





* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 22:53 Caitlin Bestler
@ 2006-04-26 22:59 ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-26 22:59 UTC (permalink / raw
  To: caitlinb; +Cc: jeff, kelly, netdev, rusty

From: "Caitlin Bestler" <caitlinb@broadcom.com>
Date: Wed, 26 Apr 2006 15:53:44 -0700

> The netchannel qualifiers should only deal with TCP packets
> for established connections. Listens would continue to be 
> dealt with by the existing stack logic, vj_channelizing
> only occurring when the connection was accepted.

I consider netchannel support for listening TCP sockets
to be absolutely essential.


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-27  1:02 Caitlin Bestler
  2006-04-27  6:08 ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-27  1:02 UTC (permalink / raw
  To: David S. Miller; +Cc: jeff, kelly, netdev, rusty

netdev-owner@vger.kernel.org wrote:
> From: "Caitlin Bestler" <caitlinb@broadcom.com>
> Date: Wed, 26 Apr 2006 15:53:44 -0700
> 
>> The netchannel qualifiers should only deal with TCP packets for
>> established connections. Listens would continue to be dealt with by
>> the existing stack logic, vj_channelizing only occurring when the
>> connection was accepted.
> 
> I consider netchannel support for listening TCP sockets to be
> absolutely essential. -

Meaning that inbound SYNs would be placed in a net channel
for processing by a Consumer at the other end of the ring?

If so the rules filtering SYNs would have to be applied either
before it went into the ring, or when the consumer end takes
them out. The latter makes more sense to me, because the rules
about what remote hosts can initiate a connection request to
a given TCP port can be fairly complex for a variety of
legitimate reasons.

Would it be reasonable to state that a net channel carrying
SYNs should not be set up when the consumer is a user mode
process?





* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26  7:33 ` David S. Miller
@ 2006-04-27  3:31   ` Kelly Daly
  2006-04-27  6:25     ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Kelly Daly @ 2006-04-27  3:31 UTC (permalink / raw
  To: David S. Miller; +Cc: rusty, netdev

Hi Dave,

Thanks for your response.  =)

On Wednesday 26 April 2006 17:59, you wrote:
> Ok I have comments already just glancing at the initial patch.
>
> With the 32-bit descriptors in the channel, you indeed end up
> with a fixed sized pool with a lot of hard-to-finesse sizing
> and lookup problems to solve.

It should be quite trivial to resize this pool using RCU.
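
For illustration, the resize itself reduces to copy-and-publish; only the publish and the grace period need RCU. A userspace sketch under invented names, with the kernel-only RCU steps left as comments:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: resizing a descriptor pool under RCU means
 * building the larger pool off to the side, then publishing it with a
 * single pointer store.  Readers still holding the old pointer keep
 * using the old pool until a grace period passes. */
struct desc_pool {
        int nr;
        unsigned long *addrs;  /* one buffer address per descriptor */
};

static struct desc_pool *pool_resize(const struct desc_pool *old, int new_nr)
{
        struct desc_pool *newp = malloc(sizeof(*newp));
        int copy = old->nr < new_nr ? old->nr : new_nr;

        if (!newp)
                return NULL;
        newp->addrs = calloc(new_nr, sizeof(unsigned long));
        if (!newp->addrs) {
                free(newp);
                return NULL;
        }
        newp->nr = new_nr;
        memcpy(newp->addrs, old->addrs, copy * sizeof(unsigned long));
        /* kernel: rcu_assign_pointer(chan->pool, newp);
         *         synchronize_rcu();
         *         then free the old pool */
        return newp;
}
```
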

>
> So what I wanted to do was finesse the entire issue by simply
> side-stepping it initially.  Use a normal buffer with a tail
> descriptor, when you enqueue you give a tail descriptor pointer.


The tail pointers are an excellent idea - and they certainly fix a lot of 
compatibility issues that we side-stepped (we were going for the "make it 
work" approach rather than the "make it right" - figured we could get to that 
bit later  =P  ).
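
As a rough reading of what the tail-descriptor scheme might look like (a userspace mock-up with invented field names, not from any posted patch): the packet data and its descriptor share one allocation, with the descriptor at the tail, so enqueueing a single real pointer to the tail descriptor hands the consumer both the metadata and the data, with no pool lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Metadata placed at the end of the packet buffer.  One pointer to
 * this structure is enough to recover everything about the packet. */
struct tail_desc {
        void   *data;      /* start of packet data (head of allocation) */
        size_t  data_len;  /* bytes of packet data */
        int     ifindex;
};

/* Allocate a buffer with room for 'len' data bytes plus the tail
 * descriptor; return the descriptor pointer, which is what would be
 * enqueued into a channel. */
static struct tail_desc *buf_alloc(size_t len, int ifindex)
{
        unsigned char *p = malloc(len + sizeof(struct tail_desc));
        struct tail_desc *td;

        if (!p)
                return NULL;
        td = (struct tail_desc *)(p + len);
        td->data = p;
        td->data_len = len;
        td->ifindex = ifindex;
        return td;
}

static void buf_free(struct tail_desc *td)
{
        free(td->data);  /* descriptor lives inside the same allocation */
}
```

Wrapping such a buffer into an skb then needs no descriptor-number lookup at all, which is what makes the 15-line netif_receive_skb() channel plausible.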

> I really dislike the pools of buffers, partly because they are fixed
> size (or dynamically sized and even more expensive to implement), but
> moreso because there is all of this absolutely stupid state management
> you eat just to get at the real data.  That's pointless, we're trying
> to make this as light as possible.  Just use real pointers and
> describe the packet with a tail descriptor.

We approached this from the understanding that an intelligent NIC will be able 
to deliver packet data directly to userspace, which is a major win.  Zero copies to 
userspace would be sweet.  I think we can still achieve this using your 
scheme without *too* much pain.

> Next, you can't even begin to work on the protocol channels before you
> do one very important piece of work.  Integration of all of the ipv4
> and ipv6 protocol hash tables into a central code, it's a total
> prerequisite.  Then you modify things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> protocol number as well as saddr/daddr/sport/dport and searches
> from a central table.

Understood.  And agreed.  Once again was side-stepped just to try to get a 
"working model".  Will look into this immediately.

> So I think I'll continue working on my implementation, it's more
> transitional and that's how we have to do this kind of work.


Thanks again for your comments  =) (and thanks to everyone else who took the 
time to respond to this)

Kelly


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26 19:30 Caitlin Bestler
  2006-04-26 19:46 ` Jeff Garzik
@ 2006-04-27  3:40 ` Rusty Russell
  2006-04-27  4:58   ` James Morris
  2006-04-27  6:17   ` David S. Miller
  1 sibling, 2 replies; 79+ messages in thread
From: Rusty Russell @ 2006-04-27  3:40 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, kelly, netdev

On Wed, 2006-04-26 at 12:30 -0700, Caitlin Bestler wrote:
> David S. Miller wrote:
> 
> > 
> > I personally think allowing sockets to trump firewall rules
> > is an acceptable relaxation of the rules in order to simplify
> > the implementation.
> 
> I agree.  I have never seen a set of netfilter rules that
> would block arbitrary packets *within* an established connection.

Intelligent or no, this does happen.  More importantly, people rely on
packet counters.  Basically I don't think we can "relax" our firewall
implementation and retain trust 8(

I started thinking about this back in January.  We could force
everything through the "slow" path when something is registered with
netfilter (similarly raw sockets, bonding, divert).  Or, we could delay
LOCAL_IN hook processing until we get to socket receive.

Delaying netfilter hook processing won't work for intelligent NICs that
write straight to mmapped buffers, but we could make that require CAP_NET_RAW.

We *used* to have an nf_cache mechanism to determine exactly when the
netfilter hooks cared about a packet, but it was never used and was hard
to reconcile with connection-tracking timeouts...

Cheers,
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol



* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  3:40 ` Rusty Russell
@ 2006-04-27  4:58   ` James Morris
  2006-04-27  6:16     ` David S. Miller
  2006-04-27  6:17   ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: James Morris @ 2006-04-27  4:58 UTC (permalink / raw
  To: Rusty Russell; +Cc: Caitlin Bestler, David S. Miller, kelly, netdev

On Thu, 27 Apr 2006, Rusty Russell wrote:

> netfilter (similarly raw sockets, bonding, divert).  Or, we could delay
> LOCAL_IN hook processing until we get to socket receive.

This is an idea proposed for skfilter [1], too, allowing packets to be 
filtered by local endpoint.


[1] http://people.redhat.com/jmorris/selinux/skfilter/



-- 
James Morris
<jmorris@namei.org>


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  1:02 Caitlin Bestler
@ 2006-04-27  6:08 ` David S. Miller
  2006-04-27  6:17   ` Andi Kleen
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-27  6:08 UTC (permalink / raw
  To: caitlinb; +Cc: jeff, kelly, netdev, rusty

From: "Caitlin Bestler" <caitlinb@broadcom.com>
Date: Wed, 26 Apr 2006 18:02:43 -0700

> Would it be reasonable to state that a net channel carrying
> SYNs should not be set up when the consumer is a user mode
> process?

I'm currently assuming that the protocol processing is still done in
the kernel on behalf of the user context, so the issues you raise
really aren't relevant.

We really shouldn't be jumping the gun so far into the implementation
as others seem to be doing.  Let's do it simple first and see if
putting things all the way to userspace even is necessary.

No work is going to get done if we keep carrying on like this
over details we really do not need to consider right away.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  4:58   ` James Morris
@ 2006-04-27  6:16     ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-27  6:16 UTC (permalink / raw
  To: jmorris; +Cc: rusty, caitlinb, kelly, netdev

From: James Morris <jmorris@namei.org>
Date: Thu, 27 Apr 2006 00:58:41 -0400 (EDT)

> On Thu, 27 Apr 2006, Rusty Russell wrote:
> 
> > netfilter (similarly raw sockets, bonding, divert).  Or, we could delay
> > LOCAL_IN hook processing until we get to socket receive.
> 
> This is an idea proposed for skfilter [1], too, allowing packets to be 
> filtered by local endpoint.
> 
> [1] http://people.redhat.com/jmorris/selinux/skfilter/

Moving forward this really is an important problem that we'll need to
solve, and we'll need to solve it such that netfilter can be fully
enabled in tandem with net channels doing their thing.

It's simple, if we don't make them work together, then as a
consequence the real life sites that would benefit the most from net
channels will not see the benefit from them because they will use
netfilter and they will have firewall rules enabled.  Our work is
largely wasteful if that's what happens.

But let's move forward on the bits we can implement now, believing
optimistically that we will find a way to deal with this issue
properly. :-)




* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  3:40 ` Rusty Russell
  2006-04-27  4:58   ` James Morris
@ 2006-04-27  6:17   ` David S. Miller
  1 sibling, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-27  6:17 UTC (permalink / raw
  To: rusty; +Cc: caitlinb, kelly, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 27 Apr 2006 13:40:26 +1000

> We *used* to have an nf_cache mechanism to determine exactly when the
> netfilter hooks cared about a packet, but it was never used and was hard
> to reconcile with connection-tracking timeouts...

Let's not consider bringing that thing back :-)


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:08 ` David S. Miller
@ 2006-04-27  6:17   ` Andi Kleen
  2006-04-27  6:27     ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2006-04-27  6:17 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, jeff, kelly, netdev, rusty

On Thursday 27 April 2006 08:08, David S. Miller wrote:

> I'm currently assuming that the protocol processing is still done in
> the kernel on behalf of the user context, so the issues you raise
> really aren't relevant.
> 
> We really shouldn't be jumping the gun so far into the implementation
> as others seem to be doing.  Let's do it simple first and see if
> putting things all the way to userspace even is necessary.

I still have my doubts about doing that securely anyways.
 
> No work is going to get done if we keep carrying on like this
> over details we really do not need to consider right away.

One thing I would like to see is some generic code for the channels.
It might be interesting to try if that data structure could be used
in other parts of the kernel that pass objects around (like VM or block
layer) 
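
A generic channel along those lines could be the classic single-producer/
single-consumer ring from Van Jacobson's LCA2006 slides.  Here is a
minimal userspace sketch under that assumption (all names are
hypothetical; real kernel code would add memory barriers and proper
cache-line alignment):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHAN_SLOTS 256                    /* must be a power of two */

/* One producer, one consumer: the producer writes only 'tail', the
 * consumer writes only 'head', so each index stays in its owner's
 * cache line and no lock is needed. */
struct channel {
    uint32_t head;                        /* next slot to dequeue */
    char     pad[60];                     /* keep head and tail apart */
    uint32_t tail;                        /* next slot to enqueue */
    void    *slot[CHAN_SLOTS];
};

static int chan_enqueue(struct channel *c, void *item)
{
    if (c->tail - c->head == CHAN_SLOTS)
        return -1;                        /* ring full */
    c->slot[c->tail & (CHAN_SLOTS - 1)] = item;
    c->tail++;                            /* barrier in real code */
    return 0;
}

static void *chan_dequeue(struct channel *c)
{
    if (c->head == c->tail)
        return NULL;                      /* ring empty */
    void *item = c->slot[c->head & (CHAN_SLOTS - 1)];
    c->head++;
    return item;
}
```

Because the slots hold opaque `void *` items, the same ring could in
principle carry skb-less packet buffers, bios, or VM pages.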

-Andi


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  3:31   ` Kelly Daly
@ 2006-04-27  6:25     ` David S. Miller
  2006-04-27 11:51       ` Evgeniy Polyakov
  2006-05-04  2:59       ` Kelly Daly
  0 siblings, 2 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-27  6:25 UTC (permalink / raw
  To: kelly; +Cc: rusty, netdev

From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 27 Apr 2006 13:31:37 +1000

> It should be quite trivial to resize this pool using RCU.

Yes, a lot of this stuff can use RCU, in particular the channel
demux is a prime candidate.

There are some non-trivial issues wrt. synchronizing the net
channel lookup state with socket state changes (socket moves to
close or whatever).  This reminds me that we had some nice TCP
hash table RCU patches that Ben LaHaise posted at one point and
that slipped through the cracks.  That took care of all the event
ordering issues, it seemed at the time, and is something we need
to get back on track with.
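
The resize-with-RCU idea can be sketched in userspace with an atomic
pointer standing in for rcu_dereference()/rcu_assign_pointer(); real
kernel code would also wait for a grace period before freeing the old
table.  The names below are illustrative, not from Kelly's patch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Read-mostly demux table: readers follow one pointer with no locks,
 * the updater publishes a fully built replacement table. */
struct demux_table {
    size_t nslots;                     /* power of two */
    void  *slot[];                     /* per-bucket channel pointers */
};

static _Atomic(struct demux_table *) active_table;

static void *demux_lookup(size_t hash)
{
    /* rcu_read_lock() + rcu_dereference() in real kernel code */
    struct demux_table *t = atomic_load(&active_table);
    return t ? t->slot[hash & (t->nslots - 1)] : NULL;
}

/* Publish a new (resized) table; returns the old one, which may be
 * freed only after a grace period (synchronize_rcu() in the kernel). */
static struct demux_table *demux_resize(size_t new_nslots)
{
    struct demux_table *t =
        calloc(1, sizeof(*t) + new_nslots * sizeof(void *));
    t->nslots = new_nslots;
    /* ... rehash entries from the old table into t here ... */
    return atomic_exchange(&active_table, t);
}
```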

> The tail pointers are an excellent idea - and they certainly fix a
> lot of compatibility issues that we side-stepped (we were going for
> the "make it work" approach rather than the "make it right" -
> figured we could get to that bit later =P ).

Start simple, we can keep mucking with the interfaces over and over
again as we move from simple netif_receive_skb() channels out to the
more complex socket demux style channel.

This is a big and long project, there are no style points for trying
to go all the way in the first pass :-)

> We approached this from the understanding that an intelligent NIC
> will be able to transition directly to userspace, which is a major
> win.  0 copies to userspace would be sweet.  I think we can still
> achieve this using your scheme without *too* much pain.

Understood.  What's your basic idea?  Just make the buffers in the
pool large enough to fit the SKB encapsulation at the end?

Note that this will change a lot of the assumptions currently in
your buffer handling code about buffer reuse and such.

So the idea in your scheme is to give the buffer pools to the NIC
in a per-channel way via a simple descriptor table?  And the u32's
are arbitrary keys that index into this descriptor table, right?

I would suggest just sticking to the simple global input queue.
Solve the easy problems and the buffering model first.  Then we
can port drivers and people can bang on the basic infrastructure.
Take my SKB encapsulator in my vj-2.6 tree once you've transformed
your buffer pools to accommodate.

I'll actually sit back and let you do that, I'm actually coming around
more to your scheme in some regards :-)  I'll sit and think about some
of the heavier issues we'll hit in the next phase and once you have
a cut at the current phase I'll work on a tg3 driver port.

Thanks!


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:17   ` Andi Kleen
@ 2006-04-27  6:27     ` David S. Miller
  2006-04-27  6:41       ` Andi Kleen
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-27  6:27 UTC (permalink / raw
  To: ak; +Cc: caitlinb, jeff, kelly, netdev, rusty

From: Andi Kleen <ak@suse.de>
Date: Thu, 27 Apr 2006 08:17:35 +0200

> On Thursday 27 April 2006 08:08, David S. Miller wrote:
> 
> > I'm currently assuming that the protocol processing is still done in
> > the kernel on behalf of the user context, so the issues you raise
> > really aren't relevant.
> > 
> > We really shouldn't be jumping the gun so far into the implementation
> > as others seem to be doing.  Let's do it simple first and see if
> > putting things all the way to userspace even is necessary.
> 
> I still have my doubts about doing that securely anyways.

The NIC has a descriptor of buffers, the NIC can thus DMA right
into this buffer which only contains packet data and nothing
else outside of those packets.

The software implementation, of course, will not be able to do
this and will need to copy.

> One thing I would like to see is some generic code for the channels.
> It might be interesting to try if that data structure could be used
> in other parts of the kernel that pass objects around (like VM or block
> layer) 

Seconded.  This should be easy once we have the basic global input
queue channel working.

I even put it in include/linux/netchannel.h in my vj-2.6 tree sort
of to hint at this. :)



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:27     ` David S. Miller
@ 2006-04-27  6:41       ` Andi Kleen
  2006-04-27  7:52         ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Andi Kleen @ 2006-04-27  6:41 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, jeff, kelly, netdev, rusty

On Thursday 27 April 2006 08:27, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Thu, 27 Apr 2006 08:17:35 +0200
> 
> > On Thursday 27 April 2006 08:08, David S. Miller wrote:
> > 
> > > I'm currently assuming that the protocol processing is still done in
> > > the kernel on behalf of the user context, so the issues you raise
> > > really aren't relevant.
> > > 
> > > We really shouldn't be jumping the gun so far into the implementation
> > > as others seem to be doing.  Let's do it simple first and see if
> > > putting things all the way to userspace even is necessary.
> > 
> > I still have my doubts about doing that securely anyways.
> 
> The NIC has a descriptor of buffers, the NIC can thus DMA right
> into this buffer which only contains packet data and nothing
> else outside of those packets.

Yes but all clients will see all the data from all sockets don't they?
[Unless you have a RDMA nic that can scale to hundred thousands of connections, 
but let's assume standard hardware for now]

-Andi


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:41       ` Andi Kleen
@ 2006-04-27  7:52         ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-27  7:52 UTC (permalink / raw
  To: ak; +Cc: caitlinb, jeff, kelly, netdev, rusty

From: Andi Kleen <ak@suse.de>
Date: Thu, 27 Apr 2006 08:41:51 +0200

> Yes but all clients will see all the data from all sockets don't
> they?  [Unless you have a RDMA nic that can scale to hundred
> thousands of connections, but let's assume standard hardware for
> now]

Each netchannel, which goes to a specific socket, has a ring
buffer of packets the NIC can use.  Those packets are mmap()'d
into userspace so we can control the layout, the page boundaries,
etc. and the NIC will only DMA packets matching that channel ID
into that userland area.
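
The layout being described might look roughly like this: one mmap()'d
region per channel holding a ring of buffer indices plus the fixed-size
packet buffers themselves, so the "descriptor" the NIC is given is just
an index into memory both sides already share.  This is a guess at the
shape, not Kelly's actual structures:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_BUFS 512
#define BUF_SIZE 2048

/* One fixed-size receive buffer: the NIC DMAs the frame in and fills
 * the small header; userspace then finds its data without a copy. */
struct pkt_buf {
    uint16_t len;                  /* bytes of frame in buf[]      */
    uint16_t data_off;             /* offset of the frame in buf[] */
    uint8_t  buf[BUF_SIZE - 4];
};

/* Whole per-channel region, mapped into both kernel and userspace. */
struct chan_region {
    uint32_t head, tail;           /* ring of filled buffer indices */
    uint32_t ring[NUM_BUFS];
    struct pkt_buf bufs[NUM_BUFS];
};

/* Userspace: turn a buffer index taken from the ring into a pointer
 * to the packet bytes. */
static const uint8_t *chan_data(const struct chan_region *r,
                                uint32_t idx, uint16_t *len)
{
    const struct pkt_buf *b = &r->bufs[idx % NUM_BUFS];
    *len = b->len;
    return b->buf + b->data_off;
}
```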

Have a look at the code Kelly posted.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:25     ` David S. Miller
@ 2006-04-27 11:51       ` Evgeniy Polyakov
  2006-04-27 20:09         ` David S. Miller
  2006-05-04  2:59       ` Kelly Daly
  1 sibling, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-27 11:51 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, rusty, netdev

On Wed, Apr 26, 2006 at 11:25:01PM -0700, David S. Miller (davem@davemloft.net) wrote:
> > We approached this from the understanding that an intelligent NIC
> > will be able to transition directly to userspace, which is a major
> > win.  0 copies to userspace would be sweet.  I think we can still
> > achieve this using your scheme without *too* much pain.
> 
> Understood.  What's your basic idea?  Just make the buffers in the
> pool large enough to fit the SKB encapsulation at the end?

There are some caveats here found while developing zero-copy sniffer
[1]. Project's goal was to remap skbs into userspace in real-time.
While absolute numbers (posted to netdev@) were really high, it is only
applicable to read-only application. As was shown in IOAT thread,
data must be warmed in caches, so reading from mapped area will be as
fast as memcpy() (read+write), and copy_to_user() actually almost equal
to memcpy() (benchmarks were posted to netdev@). And we must add
remapping overhead.

If we want to dma data from nic into premapped userspace area, this will
strike with message sizes/misalignment/slow read and so on, so
preallocation has even more problems.

This change also requires significant changes in application, at least
until recv/send are changed, which is not the best thing to do.

So I think that the mapping itself can be done as some additional socket
option or something not turned on by default.


I do think that significant win in VJ's tests belongs not to remapping
and cache-oriented changes, but to move all protocol processing into
process' context.

I fully agree with Dave that it must be implemented step-by-step, and
the most significant, IMHO, is moving protocol processing into socket's
"place". This will force to netfilter changes, but I do think that for
the proof-of-concept code we can turn it off.

I will start to work in this direction next week after aio_sendfile() is
completed.

So, we will have three attempts to write incompatible stacks - and that is good :)
No one needs an excuse to rewrite something, as I read in Rusty's blog...

Thanks.

[1]. http://tservice.net.ru/~s0mbre/old/?section=projects&item=af_tlb

-- 
	Evgeniy Polyakov


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27 11:51       ` Evgeniy Polyakov
@ 2006-04-27 20:09         ` David S. Miller
  2006-04-28  6:05           ` Evgeniy Polyakov
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-27 20:09 UTC (permalink / raw
  To: johnpol; +Cc: kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 27 Apr 2006 15:51:26 +0400

> There are some caveats here found while developing zero-copy sniffer
> [1]. Project's goal was to remap skbs into userspace in real-time.
> While absolute numbers (posted to netdev@) were really high, it is only
> applicable to read-only application. As was shown in IOAT thread,
> data must be warmed in caches, so reading from mapped area will be as
> fast as memcpy() (read+write), and copy_to_user() actually almost equal
> to memcpy() (benchmarks were posted to netdev@). And we must add
> remapping overhead.

Yes, all of these issues are related quite strongly.  Thanks for
making the connection explicit.

But, the mapping overhead is zero for this net channel stuff, at
least as it is implemented and designed by Kelly.  The ring buffer is
set up ahead of time in the user's address space, and a ring of
buffers in that area is given to the networking card.

We remember the translations here, so no get_user_pages() on each
transfer and garbage like that.  And yes this all harks back to the
issues that are discussed in Chapter 5 of Networking Algorithmics.
But the core thing to understand is that by defining a new API and
setting up the buffer pool ahead of time, we avoid all of the
get_user_pages() overhead while retaining full kernel/user protection.

Evgeniy, the difference between this and your work is that you did not
have an intelligent piece of hardware that could be told to recognize
flows, and only put packets for a specific flow into that flow's
buffer pool.

> If we want to dma data from nic into premapped userspace area, this will
> strike with message sizes/misalignment/slow read and so on, so
> preallocation has even more problems.

I do not really think this is an issue, we put the full packet into
user space and teach it where the offset is to the actual data.
We'll do the same things we do today to try and get the data area
aligned.  User can do whatever is logical and relevant on his end
to deal with strange cases.

In fact we can specify that card has to take some care to get data
area of packet aligned on say an 8 byte boundary or something like
that.  When we don't have hardware assist, we are going to be doing
copies.

> This change also requires significant changes in application, at least
> until recv/send are changed, which is not the best thing to do.

This is exactly the point, we can only do a good job and receive zero
copy if we can change the interfaces, and that's exactly what we're
doing here.

> I do think that significant win in VJ's tests belongs not to remapping
> and cache-oriented changes, but to move all protocol processing into
> process' context.

I partly disagree.  The biggest win is eliminating all of the control
overhead (all of "softint RX + protocol demux + IP route lookup +
socket lookup" is turned into single flow demux), and the SMP safe
data structure which makes it realistic enough to always move the bulk
of the packet work to the socket's home cpu.
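
The "single flow demux" step amounts to hashing the 4-tuple once, at
the driver, and indexing straight into a channel table.  A toy version
(both the hash and the table shape are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

struct flow_key {
    uint32_t saddr, daddr;             /* IPv4 addresses */
    uint16_t sport, dport;             /* TCP/UDP ports  */
};

/* Mix the 4-tuple into one word; the finisher is invertible, so any
 * change to the xor-folded input changes the result. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = k->saddr ^ k->daddr ^
                 ((uint32_t)k->sport << 16 | k->dport);
    h ^= h >> 16;
    h *= 0x45d9f3bu;                   /* odd constant: bijective mod 2^32 */
    h ^= h >> 16;
    return h;
}

#define DEMUX_SLOTS 1024               /* power of two */

/* One lookup replaces softint RX + protocol demux + route lookup +
 * socket lookup (collision handling elided). */
static void *flow_demux(void *table[DEMUX_SLOTS], const struct flow_key *k)
{
    return table[flow_hash(k) & (DEMUX_SLOTS - 1)];
}
```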

I do not think userspace protocol implementation buys enough to
justify it.  We have to do the protection switch in and out of kernel
space anyways, so why not still do the protected protocol processing
work in the kernel?  It is still being done on the user's behalf,
contributes to his time slice, and avoids all of the terrible issues
of userspace protocol implementations.

So in my mind, the optimal situation from both a protection and a
performance perspective is net channels to kernel socket protocol
processing, with buffers DMA'd directly into userspace if hardware
assist is present.

> I fully agree with Dave that it must be implemented step-by-step, and
> the most significant, IMHO, is moving protocol processing into socket's
> "place". This will force to netfilter changes, but I do think that for
> the proof-of-concept code we can turn it off.

And I also want to note that even if the whole idea explodes and
cannot be made to work, there are good arguments for transitioning
to SKB'less drivers for their own sake.  So work will really not
be lost.

Let's have 100 different implementations of net channels! :-)


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-27 21:12 Caitlin Bestler
  2006-04-28  6:10 ` Evgeniy Polyakov
  2006-04-28  8:24 ` Rusty Russell
  0 siblings, 2 replies; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-27 21:12 UTC (permalink / raw
  To: David S. Miller, johnpol; +Cc: kelly, rusty, netdev

netdev-owner@vger.kernel.org wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 27 Apr 2006 15:51:26 +0400
> 
>> There are some caveats here found while developing zero-copy sniffer
>> [1]. Project's goal was to remap skbs into userspace in real-time.
>> While absolute numbers (posted to netdev@) were really high, it is
>> only applicable to read-only application. As was shown in IOAT
>> thread, data must be warmed in caches, so reading from mapped area
>> will be as fast as memcpy() (read+write), and copy_to_user()
>> actually almost equal to memcpy() (benchmarks were posted to
>> netdev@). And we must add remapping overhead.
> 
> Yes, all of these issues are related quite strongly.  Thanks
> for making the connection explicit.
> 
> But, the mapping overhead is zero for this net channel stuff,
> at least as it is implemented and designed by Kelly.  Ring
> buffer is setup ahead of time into the user's address space,
> and a ring of buffers into that area are given to the networking card.
> 
> We remember the translations here, so no get_user_pages() on
> each transfer and garbage like that.  And yes this all harks
> back to the issues that are discussed in Chapter 5 of
> Networking Algorithmics.
> But the core thing to understand is that by defining a new
> API and setting up the buffer pool ahead of time, we avoid all of the
> get_user_pages() overhead while retaining full kernel/user protection.
> 
> Evgeniy, the difference between this and your work is that
> you did not have an intelligent piece of hardware that could
> be told to recognize flows, and only put packets for a
> specific flow into that flow's buffer pool.
> 
>> If we want to dma data from nic into premapped userspace area, this
>> will strike with message sizes/misalignment/slow read and so on, so
>> preallocation has even more problems.
> 
> I do not really think this is an issue, we put the full
> packet into user space and teach it where the offset is to
> the actual data.
> We'll do the same things we do today to try and get the data
> area aligned.  User can do whatever is logical and relevant
> on his end to deal with strange cases.
> 
> In fact we can specify that card has to take some care to get
> data area of packet aligned on say an 8 byte boundary or
> something like that.  When we don't have hardware assist, we
> are going to be doing copies.
> 
>> This change also requires significant changes in application, at
>> least until recv/send are changed, which is not the best thing to do.
> 
> This is exactly the point, we can only do a good job and
> receive zero copy if we can change the interfaces, and that's
> exactly what we're doing here.
> 
>> I do think that significant win in VJ's tests belongs not to
>> remapping and cache-oriented changes, but to move all protocol
>> processing into process' context.
> 
> I partly disagree.  The biggest win is eliminating all of the
> control overhead (all of "softint RX + protocol demux + IP
> route lookup + socket lookup" is turned into single flow
> demux), and the SMP safe data structure which makes it
> realistic enough to always move the bulk of the packet work
> to the socket's home cpu.
> 
> I do not think userspace protocol implementation buys enough
> to justify it.  We have to do the protection switch in and
> out of kernel space anyways, so why not still do the
> protected protocol processing work in the kernel?  It is
> still being done on the user's behalf, contributes to his
> time slice, and avoids all of the terrible issues of
> userspace protocol implementations.
> 
> So in my mind, the optimal situation from both a protection
> preservation and also a performance perspective is net
> channels to kernel socket protocol processing, buffers DMA'd
> directly into userspace if hardware assist is present.
> 

Having a ring that is already flow-qualified is indeed the
most important saving, and worth pursuing even before we reach
consensus on how to safely enable user-mode L4 processing.
The latter *can* be a big advantage when the L4 processing
can be done from a user-mode call by an already
scheduled process.  But the benefit is not there for a process
that needs to be woken up each time it receives a short request.

So the real issue is when there is an intelligent device that
uses hardware packet classification to place the packet in
the correct ring. We don't want to bypass packet filtering,
but it would be terribly wasteful to reclassify the packet.
Intelligent NICs will have packet classification capabilities
to support RDMA and iSCSI. Those capabilities should be available
to benefit SOCK_STREAM and SOCK_DGRAM users as well without it
being a choice of either turning all stack control over to
the NIC or ignoring all NIC capabilities beyond pretending
to be a dumb Ethernet NIC.

For example, counting packets within an approved connection
is a valid goal that the final solution should support. But
would a simple count be sufficient, or do we truly need the
full flexibility currently found in netfilter?

Obviously all of this does not need to be resolved in full
detail, but there should be some sense of the direction so
that data structures can be designed properly. My assumption
is that each input ring has a matching output ring, and that
the output ring cannot be used to send packets that would
not be matched by the reverse rule for the paired input ring.
So the information that supports enforcing that rule needs
to be stored somewhere other than the ring itself.



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27 20:09         ` David S. Miller
@ 2006-04-28  6:05           ` Evgeniy Polyakov
  0 siblings, 0 replies; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28  6:05 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, rusty, netdev

On Thu, Apr 27, 2006 at 01:09:18PM -0700, David S. Miller (davem@davemloft.net) wrote:
> Evgeniy, the difference between this and your work is that you did not
> have an intelligent piece of hardware that could be told to recognize
> flows, and only put packets for a specific flow into that flow's
> buffer pool.

The most "intelligent" NICs used in the receiving zero-copy project [1]
were MMIO-copy cards like the Realtek 8139 :)

A special algorithm was researched for receiving zero-copy [1] to allow
placing non-page-aligned TCP frames into pages, but there was another
problem when a page was committed, since no byte-level commit is allowed
in the VFS.

In this case we do not have that problem, but instead we must force
userspace to be very smart when dealing with mapped buffers, instead of
a simple recv().  And for sending it must be even smarter, since data
must be properly aligned.  And what about crappy hardware which can DMA
only into a limited memory area, or a NIC that cannot do SG?  Or do we
need remapping for a NIC that cannot do checksum calculation?

> > If we want to dma data from nic into premapped userspace area, this will
> > strike with message sizes/misalignment/slow read and so on, so
> > preallocation has even more problems.
> 
> I do not really think this is an issue, we put the full packet into
> user space and teach it where the offset is to the actual data.
> We'll do the same things we do today to try and get the data area
> aligned.  User can do whatever is logical and relevant on his end
> to deal with strange cases.
> 
> In fact we can specify that card has to take some care to get data
> area of packet aligned on say an 8 byte boundary or something like
> that.  When we don't have hardware assist, we are going to be doing
> copies.

Userspace must be very smart, and as we saw with various Java tests, it
cannot manage even that now.
And what if pages are shared and several threads are trying to write
into the same remapped area?  Will we use COW and be blamed like the
Mach and FreeBSD developers? :)

> > I do think that significant win in VJ's tests belongs not to remapping
> > and cache-oriented changes, but to move all protocol processing into
> > process' context.
> 
> I partly disagree.  The biggest win is eliminating all of the control
> overhead (all of "softint RX + protocol demux + IP route lookup +
> socket lookup" is turned into single flow demux), and the SMP safe
> data structure which makes it realistic enough to always move the bulk
> of the packet work to the socket's home cpu.
> 
> I do not think userspace protocol implementation buys enough to
> justify it.  We have to do the protection switch in and out of kernel
> space anyways, so why not still do the protected protocol processing
> work in the kernel?  It is still being done on the user's behalf,
> contributes to his time slice, and avoids all of the terrible issues
> of userspace protocol implementations.

After the hard IRQ a softirq is scheduled, then later userspace is
scheduled: at least two context switches just to move a packet, and the
"slow" userspace code is interrupted by both hard and soft IRQs again...
I ran some tests on ppc32 embedded boards which showed that rescheduling
latency sometimes reaches millisecond delays (with about 4 running
processes on a 200MHz CPU); although we do not have real-time
requirements here, it is not a good sign...

> And I also want to note that even if the whole idea explodes and
> cannot be made to work, there are good arguments for transitioning
> to SKB'less drivers for their own sake.  So work will really not
> be lost.
> 
> Let's have 100 different implementations of net channels! :-)

:)

-- 
	Evgeniy Polyakov


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27 21:12 Caitlin Bestler
@ 2006-04-28  6:10 ` Evgeniy Polyakov
  2006-04-28  7:20   ` David S. Miller
  2006-04-28  8:24 ` Rusty Russell
  1 sibling, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28  6:10 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, kelly, rusty, netdev

On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler (caitlinb@broadcom.com) wrote:
> So the real issue is when there is an intelligent device that
> uses hardware packet classification to place the packet in
> the correct ring. We don't want to bypass packet filtering,
> but it would be terribly wasteful to reclassify the packet.
> Intelligent NICs will have packet classification capabilities
> to support RDMA and iSCSI. Those capabilities should be available
> to benefit SOCK_STREAM and SOCK_DGRAM users as well without it
> being a choice of either turning all stack control over to
> the NIC or ignoring all NIC capabilities beyond pretending
> to be a dumb Ethernet NIC.

Btw, how is it supposed to work without header-split capable hardware?

-- 
	Evgeniy Polyakov


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28  6:10 ` Evgeniy Polyakov
@ 2006-04-28  7:20   ` David S. Miller
  2006-04-28  7:32     ` Evgeniy Polyakov
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-28  7:20 UTC (permalink / raw
  To: johnpol; +Cc: caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 10:10:54 +0400

> On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler (caitlinb@broadcom.com) wrote:
> > So the real issue is when there is an intelligent device that
> > uses hardware packet classification to place the packet in
> > the correct ring. We don't want to bypass packet filtering,
> > but it would be terribly wasteful to reclassify the packet.
> > Intelligent NICs will have packet classification capabilities
> > to support RDMA and iSCSI. Those capabilities should be available
> > to benefit SOCK_STREAM and SOCK_DGRAM users as well without it
> > being a choice of either turning all stack control over to
> > the NIC or ignorign all NIC capabilities beyound pretending
> > to be a dumb Ethernet NIC.
> 
> Btw, how is it supposed to work without header-split capable hardware?

I do not see header splitting as a requirement, let the raw
headers sit in the user queue and provide an offset to the data.
All of this page alignment stuff is unnecessary complexity.

I know you think applications are too dumb to be expected to handle
these kinds of things, but how many apps do you expect to convert over
to these new interfaces?

The ones that matter will, and great care will be taken by the
programmer who does this.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28  7:20   ` David S. Miller
@ 2006-04-28  7:32     ` Evgeniy Polyakov
  2006-04-28 18:20       ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28  7:32 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 12:20:27AM -0700, David S. Miller (davem@davemloft.net) wrote:
> > Btw, how is it supposed to work without header split capable hardware?
> 
> I do not see header splitting as a requirement, let the raw
> headers sit in the user queue and provide an offset to the data.
> All of this page alignment stuff is unnecessary complexity.
> 
> I know you think applications are too dumb to be expected to handle
> these kinds of things, but how many apps do you expect to convert over
> to these new interfaces?
> 
> The ones that matter will, and great care will be made by the
> programmer who does this.

Ugh, so it will not be a ring buffer of a data _flow_, but a ring buffer
of (header+data) packets.

Definitely, a userspace application must be very smart to deal with 
ip/tcp/option headers...

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27 21:12 Caitlin Bestler
  2006-04-28  6:10 ` Evgeniy Polyakov
@ 2006-04-28  8:24 ` Rusty Russell
  2006-04-28 19:21   ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: Rusty Russell @ 2006-04-28  8:24 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, johnpol, kelly, netdev

On Thu, 2006-04-27 at 14:12 -0700, Caitlin Bestler wrote:
> So the real issue is when there is an intelligent device that
> uses hardware packet classification to place the packet in
> the correct ring. We don't want to bypass packet filtering,
> but it would be terribly wasteful to reclassify the packet.
> Intelligent NICs will have packet classification capabilities
> to support RDMA and iSCSI. Those capabilities should be available
> to benefit SOCK_STREAM and SOCK_DGRAM users as well without it
> being a choice of either turning all stack control over to
> the NIC or ignoring all NIC capabilities beyond pretending
> to be a dumb Ethernet NIC.
> 
> For example, counting packets within an approved connection
> is a valid goal that the final solution should support. But
> would a simple count be sufficient, or do we truly need the
> full flexibility currently found in netfilter?

Note that the problem space AFAICT includes strange advanced routing
setups, ingress qos and possibly others, not just netfilter.  But
perhaps the same solutions apply, so I'll concentrate on nf.

If we start with a "disable direct netchannels when netfilter hooks
registered", we would inevitably refine it to "disable some netchannels
when netfilter hooks registered".  The worst case for this is filtering
based on connection tracking, with its constantly changing effects as
things time out.  Hard problem.

Is it time to re-examine the Grand Unified Lookup which Dave mentions
every few years? 8)

> My assumption
> is that each input ring has a matching output ring, and that
> the output ring cannot be used to send packets that would
> not be matched by the reverse rule for the paired input ring.
> So the information that supports enforcing that rule needs
> to be stored somewhere other than the ring itself.

Ah, this is a different problem.  Our idea was to have a syscall which
would check & sanitize the buffers for output.  To do this, you need the
ability to chain buffers (a simple next entry in the header, for us).

Sanitization would copy the header into a global buffer (ie. not one
reachable by userspace), check the flowid, and chain on the rest of the
user buffer.  After it had sanitized the buffers, it would activate the
NIC, which would only send out buffers which started with a kernel
buffer.

Of course, the first step (CAP_NET_RAW-only) wouldn't need this.  And,
if the "sanitize_and_send" syscall were PF_VJCHAN's write(), then the
contents of the write() could actually be the header: userspace would
never deal with chained buffers.

Finally, it's not clear how one should sanely mix this with sendfile
etc.  Maybe you don't, and only use this for RDMA, etc.
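
As a rough userspace sketch of that sanitization step (all names here are
hypothetical, not taken from the patches):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of "sanitize_and_send": the header is copied into
 * a kernel-owned buffer that userspace cannot reach, checked, and the
 * user payload chained behind it.  The NIC would then only transmit
 * chains whose first buffer is kernel-owned. */

struct chan_buf {
	struct chan_buf *next;	/* simple chaining, as described above */
	int kernel_owned;	/* 1 if not reachable from userspace */
	size_t len;
	char data[64];
};

/* Returns the head of the sanitized chain, or NULL if the (stand-in)
 * flowid check rejects the header. */
static struct chan_buf *sanitize(struct chan_buf *kbuf,
				 const char *user_hdr, size_t hdr_len,
				 struct chan_buf *user_payload,
				 int flowid_ok)
{
	if (!flowid_ok || hdr_len > sizeof(kbuf->data))
		return NULL;
	memcpy(kbuf->data, user_hdr, hdr_len);	/* copy, don't share */
	kbuf->len = hdr_len;
	kbuf->kernel_owned = 1;
	kbuf->next = user_payload;	/* chain the rest of the user buffer */
	return kbuf;
}
```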

Cheers!
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 15:59 Caitlin Bestler
  2006-04-28 16:12 ` Evgeniy Polyakov
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-28 15:59 UTC (permalink / raw
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:
> On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler
> (caitlinb@broadcom.com) wrote:
>> So the real issue is when there is an intelligent device that uses
>> hardware packet classification to place the packet in the correct
>> ring. We don't want to bypass packet filtering, but it would be
>> terribly wasteful to reclassify the packet.
>> Intelligent NICs will have packet classification capabilities to
>> support RDMA and iSCSI. Those capabilities should be available to
>> benefit SOCK_STREAM and SOCK_DGRAM users as well without it being a
>> choice of either turning all stack control over to the NIC or
>> ignoring all NIC capabilities beyond pretending to be a dumb
>> Ethernet NIC. 
> 
> Btw, how is it supposed to work without header split capable
> hardware?

Hardware that can classify packets is obviously capable of doing
header data separation, but that does not mean that it has to do so.

If the host wants header/data separation, its real value is that when
packets arrive in order, fewer distinct copies are required to
move the data to the user buffer (because separated data can
be placed back-to-back in a data-only ring). But that's an optimization;
it's not needed to make the idea worth doing, or even necessarily
needed in the first implementation.
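
To illustrate the back-to-back placement (a sketch with made-up names,
not code from the patches):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: with header/data separation, in-order TCP
 * payloads can be appended back-to-back into a data-only ring, so a
 * single recv() can return the payload of several successive segments.
 * Without separation, each segment's headers would interleave with the
 * data and force a copy per segment. */

#define DATA_RING_SIZE 4096

struct data_ring {
	char buf[DATA_RING_SIZE];
	size_t used;		/* bytes of contiguous payload so far */
};

/* NIC (or driver) side: append one segment's payload, no headers. */
static int ring_append_payload(struct data_ring *r,
			       const char *payload, size_t len)
{
	if (r->used + len > DATA_RING_SIZE)
		return -1;	/* ring full: would drop or defer */
	memcpy(r->buf + r->used, payload, len);
	r->used += len;
	return 0;
}
```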


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 15:59 Caitlin Bestler
@ 2006-04-28 16:12 ` Evgeniy Polyakov
  2006-04-28 19:09   ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 16:12 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler (caitlinb@broadcom.com) wrote:
> > Btw, how is it supposed to work without header split capable
> > hardware?
> 
> Hardware that can classify packets is obviously capable of doing
> header data separation, but that does not mean that it has to do so.
> 
> If the host wants header data separation its real value is that when
> packets arrive in order that fewer distinct copies are required to
> move the data to the user buffer (because separated data can
> be placed back-to-back in a data-only ring). But that's an optimization,
> it's not needed to make the idea worth doing, or even necessarily
> in the first implementation.

If there is a data flow, not a flow of packets or a flow of data with
holes, it could be possible to modify recv() to just return the right
pointer, so in theory userspace modifications would be minimal.
With copy in place it does not differ at all from the current design
using copy_to_user(), since memcpy() is only slightly faster
than copy*user().

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 17:02 Caitlin Bestler
  2006-04-28 17:18 ` Stephen Hemminger
  2006-04-28 17:25 ` Evgeniy Polyakov
  0 siblings, 2 replies; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-28 17:02 UTC (permalink / raw
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:
> On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler
> (caitlinb@broadcom.com) wrote:
>>> Btw, how is it supposed to work without header split capable
>>> hardware?
>> 
>> Hardware that can classify packets is obviously capable of doing
>> header data separation, but that does not mean that it has to do so.
>> 
>> If the host wants header data separation its real value is that when
>> packets arrive in order that fewer distinct copies are required to
>> move the data to the user buffer (because separated data can be
>> placed back-to-back in a data-only ring). But that's an
>> optimization, it's not needed to make the idea worth doing, or even
>> necessarily in the first implementation.
> 
> If there is dataflow, not flow of packets or flow of data
> with holes, it could be possible to modify recv() to just
> return the right pointer, so in theory userspace
> modifications would be minimal.
> With copy in place it completely does not differ from current
> design with copy_to_user() being used since memcpy() is just
> slightly faster than copy*user().

If the app is really ready to use a modified interface we might as well
just give them a QP/CQ interface. But I suppose "receive by pointer"
interfaces don't really stretch the sockets interface all that badly.
The key is that you have to decide how the buffer is released:
is it the next call? Or a separate call? Does releasing buffer
N+2 release buffers N and N+1? What you want to avoid 
is having to keep a scoreboard of which buffers have been
released.
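
One scoreboard-free possibility, sketched with hypothetical names: make
release cumulative, driven by a single head index, so releasing buffer
N+2 implicitly frees N and N+1:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: cumulative buffer release for a receive ring.
 * Instead of tracking per-buffer "released" bits, the consumer advances
 * one head index; releasing buffer n also releases every earlier
 * outstanding buffer, so no scoreboard is needed. */

struct rx_ring {
	uint32_t head;	/* first buffer still owned by the consumer */
	uint32_t tail;	/* next slot the producer (NIC) will fill */
};

/* Number of filled buffers the consumer has not yet released. */
static uint32_t ring_outstanding(const struct rx_ring *r)
{
	return r->tail - r->head;	/* free-running indices */
}

/* Release buffer 'n' and, implicitly, everything before it. */
static void ring_release_up_to(struct rx_ring *r, uint32_t n)
{
	/* only honor n within [head, tail) */
	if ((int32_t)(n - r->head) >= 0 && (int32_t)(r->tail - n) > 0)
		r->head = n + 1;
}
```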

But in context, header/data separation would allow in order
packets to have the data be placed back to back, which 
could allow a single recv to report the payload of multiple
successive TCP segments. So the benefit of header/data
separation remains the same, and I still say it's an optimization
that should not be made a requirement. The benefits of vj_channels
exist even without them. When the packet classifier runs on the
host, header/data separation would not be free. I want to enable
hardware offloads, not make the kernel bend over backwards
to emulate how hardware would work. I'm just hoping that we
can agree to let hardware do its work without being forced to
work the same way the kernel does (i.e., running down a long
list of arbitrary packet filter rules on a per packet basis).



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:02 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Caitlin Bestler
@ 2006-04-28 17:18 ` Stephen Hemminger
  2006-04-28 17:29   ` Evgeniy Polyakov
  2006-04-28 19:10   ` David S. Miller
  2006-04-28 17:25 ` Evgeniy Polyakov
  1 sibling, 2 replies; 79+ messages in thread
From: Stephen Hemminger @ 2006-04-28 17:18 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: Evgeniy Polyakov, David S. Miller, kelly, rusty, netdev

On Fri, 28 Apr 2006 10:02:10 -0700
"Caitlin Bestler" <caitlinb@broadcom.com> wrote:

> Evgeniy Polyakov wrote:
> > On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler
> > (caitlinb@broadcom.com) wrote:
> >>> Btw, how is it supposed to work without header split capable
> >>> hardware?
> >> 
> >> Hardware that can classify packets is obviously capable of doing
> >> header data separation, but that does not mean that it has to do so.
> >> 
> >> If the host wants header data separation its real value is that when
> >> packets arrive in order that fewer distinct copies are required to
> >> move the data to the user buffer (because separated data can be
> >> placed back-to-back in a data-only ring). But that's an
> >> optimization, it's not needed to make the idea worth doing, or even
> >> necessarily in the first implementation.
> > 
> > If there is dataflow, not flow of packets or flow of data
> > with holes, it could be possible to modify recv() to just
> > return the right pointer, so in theory userspace
> > modifications would be minimal.
> > With copy in place it completely does not differ from current
> > design with copy_to_user() being used since memcpy() is just
> > slightly faster than copy*user().
> 
> If the app is really ready to use a modified interface we might as well
> just give them a QP/CQ interface. But I suppose "receive by pointer"
> interfaces don't really stretch the sockets interface all that badly.
> The key is that you have to decide how the buffer is released,
> is it the next call? Or a separate call? Does releasing buffer
> N+2 release buffers N and N+1? What you want to avoid 
> is having to keep a scoreboard of which buffers have been
> released.
>

Please just use the existing AIO interface.  We don't need another
interface. The number of interfaces increases the exposed bug
surface geometrically, which means that each new interface requires
testing and fixing bugs in every possible usage.



> But in context, header/data separation would allow in order
> packets to have the data be placed back to back, which 
> could allow a single recv to report the payload of multiple
> successive TCP segments. So the benefit of header/data
> separation remains the same, and I still say it's an optimization
> that should not be made a requirement. The benefits of vj_channels
> exist even without them. When the packet classifier runs on the
> host, header/data separation would not be free. I want to enable
> hardware offloads, not make the kernel bend over backwards
> to emulate how hardware would work. I'm just hoping that we
> can agree to let hardware do its work without being forced to
> work the same way the kernel does (i.e., running down a long
> list of arbitrary packet filter rules on a per packet basis).

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:02 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Caitlin Bestler
  2006-04-28 17:18 ` Stephen Hemminger
@ 2006-04-28 17:25 ` Evgeniy Polyakov
  2006-04-28 19:14   ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 17:25 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: David S. Miller, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 10:02:10AM -0700, Caitlin Bestler (caitlinb@broadcom.com) wrote:
> If the app is really ready to use a modified interface we might as well
> just give them a QP/CQ interface. But I suppose "receive by pointer"
> interfaces don't really stretch the sockets interface all that badly.
> The key is that you have to decide how the buffer is released,
> is it the next call? Or a separate call? Does releasing buffer
> N+2 release buffers N and N+1? What you want to avoid 
> is having to keep a scoreboard of which buffers have been
> released.
> 
> But in context, header/data separation would allow in order
> packets to have the data be placed back to back, which 
> could allow a single recv to report the payload of multiple
> successive TCP segments. So the benefit of header/data
> separation remains the same, and I still say it's an optimization
> that should not be made a requirement. The benefits of vj_channels
> exist even without them. When the packet classifier runs on the
> host, header/data separation would not be free. I want to enable
> hardware offloads, not make the kernel bend over backwards
> to emulate how hardware would work. I'm just hoping that we
> can agree to let hardware do its work without being forced to
> work the same way the kernel does (i.e., running down a long
> list of arbitrary packet filter rules on a per packet basis).

I see your point, and respectfully disagree.
The more complex a userspace interface we create, the fewer users it
will have. It is completely inconvenient to read 100 bytes and receive
only 80, since 20 were eaten by the header. And what if we need only 20,
but the packet contains 100? Introduce a per-packet head pointer?
For benchmarking purposes it works perfectly:
read the whole packet, one can even touch that data to emulate real
work. But for the real world it becomes practically unusable.

But what we are talking about right now is a research project, not a
production system, so we can create any interface we like, since the main
goal, IMHO, is searching for the bottlenecks in the current stack and
ways of removing them, even by introducing a new complex interface.
I would definitely like to see how your approach works for some
kind of real workload, and whether it allows one to
create faster and generally better systems.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:18 ` Stephen Hemminger
@ 2006-04-28 17:29   ` Evgeniy Polyakov
  2006-04-28 17:41     ` Stephen Hemminger
  2006-04-28 19:10   ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 17:29 UTC (permalink / raw
  To: Stephen Hemminger; +Cc: Caitlin Bestler, David S. Miller, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 10:18:33AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> Please just use existing AIO interface.  We don't need another
> interface. The number of interfaces increases the exposed bug
> surface geometrically.  Which means for each new interface, it
> means testing and fixing bugs in every possible usage.

Networking AIO? Like [1] :)
That would be really good.

1. http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:29   ` Evgeniy Polyakov
@ 2006-04-28 17:41     ` Stephen Hemminger
  2006-04-28 17:55       ` Evgeniy Polyakov
  0 siblings, 1 reply; 79+ messages in thread
From: Stephen Hemminger @ 2006-04-28 17:41 UTC (permalink / raw
  To: Evgeniy Polyakov; +Cc: Caitlin Bestler, David S. Miller, kelly, rusty, netdev

On Fri, 28 Apr 2006 21:29:32 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Fri, Apr 28, 2006 at 10:18:33AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> > Please just use existing AIO interface.  We don't need another
> > interface. The number of interfaces increases the exposed bug
> > surface geometrically.  Which means for each new interface, it
> > means testing and fixing bugs in every possible usage.
> 
> Networking AIO? Like [1] :)
> That would be really good.
> 
> 1. http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio
> 

The existing infrastructure is there in the syscall layer, it just
isn't really AIO for sockets. That naio project has two problems: first,
it requires driver changes, and he is doing it on the stupidest
of hardware; optimizing an 8139too is foolish. Second, introducing
kevents seems unnecessary, and they haven't been accepted in mainline.

The existing Linux AIO model seems sufficient:
	http://lse.sourceforge.net/io/aio.html

There is work to put true POSIX AIO on top of this.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 17:55 Caitlin Bestler
  2006-04-28 22:17 ` Rusty Russell
  0 siblings, 1 reply; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-28 17:55 UTC (permalink / raw
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:

> 
> I see your point, and respectfully disagree.
> The more complex userspace interface we create the less users
> it will have. It is completely inconvenient to read 100 bytes
> and receive only 80, since 20 were eaten by header. And what
> if we need only 20, but packet contains 100, introduce per packet
> head pointer? For purpose of benchmarking it works perfectly - read
> the whole packet, one can even touch that data to emulate real
> work, but for the real world it becomes practically unusable.
> 

In a straight-forward user-mode library using existing interfaces, the
message would be interleaved with the headers in the inbound ring.
While the inbound ring is part of user memory, it is not what the
user would process from; that would be the buffer they supplied 
in a call to read() or recvmsg(), and that buffer need make
no allowance for interleaved headers.

Enabling zero-copy when a buffer is pre-posted is possible, but
modestly complex. Research on MPI and SDP has generally
shown that unless the pinning overhead is eliminated somehow,
the buffers have to be quite large before zero-copy reception
becomes a benefit. vj_netchannels represent a strategy of minimizing
registration/pinning costs even if it means paying for an extra copy.
Because the extra copy is closely tied to the activation of the
data-sink consumer, its cost is greatly reduced: it places the data
in the cache immediately before the application will in fact use the
received data.

Also keep in mind that once the issues are resolved to allow the
netchannel rings to be directly visible to a user-mode client,
enhanced/specialized interfaces can easily be added in user-mode
libraries. So focusing on supporting existing conventional interfaces
is probably the best approach for the initial efforts.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:41     ` Stephen Hemminger
@ 2006-04-28 17:55       ` Evgeniy Polyakov
  2006-04-28 19:16         ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 17:55 UTC (permalink / raw
  To: Stephen Hemminger; +Cc: Caitlin Bestler, David S. Miller, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 10:41:18AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> On Fri, 28 Apr 2006 21:29:32 +0400
> Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> 
> > On Fri, Apr 28, 2006 at 10:18:33AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> > > Please just use existing AIO interface.  We don't need another
> > > interface. The number of interfaces increases the exposed bug
> > > surface geometrically.  Which means for each new interface, it
> > > means testing and fixing bugs in every possible usage.
> > 
> > Networking AIO? Like [1] :)
> > That would be really good.
> > 
> > 1. http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio
> > 
> 
> The existing infrastructure is there in the syscall layer, it just
> isn't really AIO for sockets. That naio project has two problems, first
> they require driver changes, and he is doing it on the stupidest
> of hardware, optimizing a 8139too is foolish.

No, it does not. You are confusing it with receiving zero-copy support,
which allows DMA-ing data directly into the VFS cache [1].
NAIO works with any kind of hardware; it was tested with e1000 and
showed a noticeable win in both CPU usage and network performance.

> Second, introducing
> kevents, seems unnecessary and hasn't been accepted in the mainline.

kevent was never sent to lkml@, although it showed an over-40% win over
epoll for a test web server. Sending it to lkml@ is just jumping into ...
not into the technical world, so I posted it here first, though without
much attention.

> The existing linux AIO model seems sufficient:
> 	http://lse.sourceforge.net/io/aio.html
> 
> There is work to put true Posix AIO on top of this.

There are a lot of discussions about combining AIO with epoll into
something similar to kevent, which allows monitoring level- and
edge-triggered events and creating a proper state machine for AIO
completions. kevent [2] does exactly that. AIO does not work as a state
machine; its repeated-check design is more like postponing work
from one context to a special thread.

1. receiving zero-copy support 
http://tservice.net.ru/~s0mbre/old/?section=projects&item=recv_zero_copy

2. kevent system
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28  7:32     ` Evgeniy Polyakov
@ 2006-04-28 18:20       ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-28 18:20 UTC (permalink / raw
  To: johnpol; +Cc: caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 11:32:16 +0400

> Definitely, userspace application must be very smart to deal with 
> ip/tcp/option headers...

That is why we will put an "offset+len" in the ring so they need not
parse the packet headers.
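
A minimal consumer-side sketch (the descriptor layout and names are
illustrative only, not from any posted patch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: each ring slot describes one received
 * frame, headers and all, plus an offset/length pair filled in by the
 * kernel so userspace finds the payload without parsing IP/TCP
 * headers itself. */

struct chan_desc {
	uint32_t buf_off;	/* offset of the frame in the mmap()'d area */
	uint16_t data_off;	/* payload offset within the frame */
	uint16_t data_len;	/* payload length */
};

/* Consumer side: locate the payload of one descriptor. */
static const char *chan_payload(const char *ring_base,
				const struct chan_desc *d)
{
	return ring_base + d->buf_off + d->data_off;
}
```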

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 16:12 ` Evgeniy Polyakov
@ 2006-04-28 19:09   ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-28 19:09 UTC (permalink / raw
  To: johnpol; +Cc: caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 20:12:21 +0400

> If there is dataflow, not flow of packets or flow of data with holes,
> it could be possible to modify recv() to just return the right pointer,
> so in theory userspace modifications would be minimal.
> With copy in place it completely does not differ from current design
> with copy_to_user() being used since memcpy() is just slightly faster
> than copy*user().

I very much feel that avoiding userland API changes is
a complete mistake.

We need new interfaces to do this right.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:18 ` Stephen Hemminger
  2006-04-28 17:29   ` Evgeniy Polyakov
@ 2006-04-28 19:10   ` David S. Miller
  2006-04-28 20:46     ` Brent Cook
  1 sibling, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-28 19:10 UTC (permalink / raw
  To: shemminger; +Cc: caitlinb, johnpol, kelly, rusty, netdev

From: Stephen Hemminger <shemminger@osdl.org>
Date: Fri, 28 Apr 2006 10:18:33 -0700

> Please just use existing AIO interface.

I totally disagree, the existing AIO interface is garbage.

We need new APIs to do this right, to get the ring buffer
and the zero-copy'ness correct.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:25 ` Evgeniy Polyakov
@ 2006-04-28 19:14   ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-28 19:14 UTC (permalink / raw
  To: johnpol; +Cc: caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 21:25:36 +0400

> The more complex userspace interface we create the less users it will
> have. It is completely inconvenient to read 100 bytes and receive only
> 80, since 20 were eaten by header.

These bytes are charged to the socket anyway, and allowing the
headers to be there is the only clean way to finesse the whole
zero-copy problem.

User can manage his data any way he likes.  He can decide to take
advantage of the zero-copy layout we've provided, or he can copy
to put things into a format he is more happy with at the cost
of the copy.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:55       ` Evgeniy Polyakov
@ 2006-04-28 19:16         ` David S. Miller
  2006-04-28 19:49           ` Stephen Hemminger
  2006-04-28 19:52           ` [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Evgeniy Polyakov
  0 siblings, 2 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-28 19:16 UTC (permalink / raw
  To: johnpol; +Cc: shemminger, caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 21:55:39 +0400

> On Fri, Apr 28, 2006 at 10:41:18AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> > Second, introducing
> > kevents, seems unnecessary and hasn't been accepted in the mainline.
> 
> kevent was never sent to lkml@ although it showed over 40% win over epoll for
> test web server. Sending it to lkml@ is just jumping into ... not into
> technical world, so I posted it first here, but without much attention
> though.

Frankly I found kevents to be a very strong idea.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28  8:24 ` Rusty Russell
@ 2006-04-28 19:21   ` David S. Miller
  2006-04-28 22:04     ` Rusty Russell
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-28 19:21 UTC (permalink / raw
  To: rusty; +Cc: caitlinb, johnpol, kelly, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Fri, 28 Apr 2006 18:24:08 +1000

> Note that the problem space AFAICT includes strange advanced routing
> setups, ingress qos and possibly others, not just netfilter.  But
> perhaps the same solutions apply, so I'll concentrate on nf.

Yes, this hasn't been mentioned explicitly yet.

The big problem is that we don't want the classifier to become
overly complex.

One scheme I'm thinking about right now is an ordered lookup
that looks like:

1) Check for established sockets, they trump everything else.

2) Check for classifier rules, ie. netfilter and packet scheduler
   stuff

3) Check for listening sockets

4) default channel

#2 is still an unsolved problem; we don't want this big, complex
classifier to be required in the hardware implementations.
However, using just IP addresses and ports does not map well to
what netfilter and co. want.
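
In rough C, the ordered lookup boils down to something like this (all
names invented for illustration, with the real flow and socket tables
reduced to booleans):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: the four-step ordered lookup above.  Established
 * sockets trump everything, then classifier rules (netfilter / packet
 * scheduler), then listening sockets, then the default channel. */

enum chan_target {
	CHAN_ESTABLISHED,	/* 1) established socket's channel */
	CHAN_CLASSIFIER,	/* 2) netfilter / packet scheduler */
	CHAN_LISTENER,		/* 3) listening socket's channel */
	CHAN_DEFAULT		/* 4) default channel */
};

static enum chan_target classify(bool established, bool has_rules,
				 bool listening)
{
	if (established)
		return CHAN_ESTABLISHED;
	if (has_rules)
		return CHAN_CLASSIFIER;
	if (listening)
		return CHAN_LISTENER;
	return CHAN_DEFAULT;
}
```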

> Ah, this is a different problem.  Our idea was to have a syscall which
> would check & sanitize the buffers for output.  To do this, you need the
> ability to chain buffers (a simple next entry in the header, for us).
> 
> Sanitization would copy the header into a global buffer (ie. not one
> reachable by userspace), check the flowid, and chain on the rest of the
> user buffer.  After it had sanitized the buffers, it would activate the
> NIC, which would only send out buffers which started with a kernel
> buffer.
> 
> Of course, the first step (CAP_NET_RAW-only) wouldn't need this.  And,
> if the "sanitize_and_send" syscall were PF_VJCHAN's write(), then the
> contents of the write() could actually be the header: userspace would
> never deal with chained buffers.

I am not sure any of this is anything more than overhead.

If we just pop the buffers directly into the user mmap()'d ring
buffer, headers and all, and give an offset+length pair so the
user knows where the data starts and how much data is there, it
should all just work out.  Where to put the offset+length is
just a detail.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:16         ` David S. Miller
@ 2006-04-28 19:49           ` Stephen Hemminger
  2006-04-28 19:59             ` Evgeniy Polyakov
  2006-04-28 19:52           ` [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Evgeniy Polyakov
  1 sibling, 1 reply; 79+ messages in thread
From: Stephen Hemminger @ 2006-04-28 19:49 UTC (permalink / raw
  To: David S. Miller; +Cc: johnpol, caitlinb, kelly, rusty, netdev

On Fri, 28 Apr 2006 12:16:36 -0700 (PDT)
"David S. Miller" <davem@davemloft.net> wrote:

> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Fri, 28 Apr 2006 21:55:39 +0400
> 
> > On Fri, Apr 28, 2006 at 10:41:18AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> > > Second, introducing
> > > kevents seems unnecessary and hasn't been accepted in the mainline.
> > 
> > kevent was never sent to lkml@, although it showed an over-40% win over
> > epoll for a test web server.  Sending it to lkml@ means jumping into ...
> > rather than into the technical world, so I posted it here first, but it
> > got little attention.
> 
> Frankly I found kevents to be a very strong idea.

But there is this huge semantic overload of kevent, poll, epoll, aio,
regular sendmsg/recv, posix aio, etc.  

Perhaps a clean break with the socket interface is needed. Otherwise, there
are nasty complications with applications that mix old socket calls and the new
interface on the same connection.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:16         ` David S. Miller
  2006-04-28 19:49           ` Stephen Hemminger
@ 2006-04-28 19:52           ` Evgeniy Polyakov
  1 sibling, 0 replies; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 19:52 UTC (permalink / raw
  To: David S. Miller; +Cc: shemminger, caitlinb, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 12:16:36PM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Fri, 28 Apr 2006 21:55:39 +0400
> 
> > On Fri, Apr 28, 2006 at 10:41:18AM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> > > Second, introducing
> > > kevents seems unnecessary and hasn't been accepted in the mainline.
> > 
> > kevent was never sent to lkml@, although it showed an over-40% win over
> > epoll for a test web server.  Sending it to lkml@ means jumping into ...
> > rather than into the technical world, so I posted it here first, but it
> > got little attention.
> 
> Frankly I found kevents to be a very strong idea.

Glad to hear this.
I should probably resend the patches to netdev@ and (marring my karma) send
them to lkml@ too...?

-- 
	Evgeniy Polyakov


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:49           ` Stephen Hemminger
@ 2006-04-28 19:59             ` Evgeniy Polyakov
  2006-04-28 22:00               ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 19:59 UTC (permalink / raw
  To: Stephen Hemminger; +Cc: David S. Miller, caitlinb, kelly, rusty, netdev

On Fri, Apr 28, 2006 at 12:49:15PM -0700, Stephen Hemminger (shemminger@osdl.org) wrote:
> But there is this huge semantic overload of kevent, poll, epoll, aio,
> regular sendmsg/recv, posix aio, etc.  
> 
> Perhaps a clean break with the socket interface is needed. Otherwise, there
> are nasty complications with applications that mix old socket calls and new interface
> on the same connection.

kevent can be used as poll without any changes to the socket code.
There are two types of network-related kevents - socket events
(recv/send/accept) and network aio, which can be turned off completely
in config.
The following events are supported by kevent:
o usual poll/select notifications
o inode notifications (create/remove)
o timer notifications
o socket notifications (send/recv/accept)
o network aio system
o fs aio (project closed; aio_sendfile() is being developed instead)

Any of the above can be turned off by a config option.

-- 
	Evgeniy Polyakov


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:10   ` David S. Miller
@ 2006-04-28 20:46     ` Brent Cook
  0 siblings, 0 replies; 79+ messages in thread
From: Brent Cook @ 2006-04-28 20:46 UTC (permalink / raw
  To: David S. Miller; +Cc: shemminger, caitlinb, johnpol, kelly, rusty, netdev

On Friday 28 April 2006 14:10, David S. Miller wrote:
> From: Stephen Hemminger <shemminger@osdl.org>
> Date: Fri, 28 Apr 2006 10:18:33 -0700
>
> > Please just use existing AIO interface.
>
> I totally disagree, the existing AIO interface is garbage.
>
> We need new APIs to do this right, to get the ring buffer
> and the zero-copy'ness correct.
> -

Heh, like PF_RING? Just mmap a socket and read out some structures?


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:59             ` Evgeniy Polyakov
@ 2006-04-28 22:00               ` David S. Miller
  2006-04-29 13:54                 ` Evgeniy Polyakov
       [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
  0 siblings, 2 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-28 22:00 UTC (permalink / raw
  To: johnpol; +Cc: shemminger, caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 28 Apr 2006 23:59:30 +0400

> kevent can be used as poll without any changes to the socket code.
> There are two types of network related kevents - socket events
> (recv/send/accept) and network aio, which can be turned completely off
> in config.
> There are following events which are supported by kevent:
> o usual poll/select notifications
> o inode notifications (create/remove)
> o timer notifications
> o socket notifications (send/recv/accept)
> o network aio system
> o fs aio (project closed, aio_sendfile() is being developed instead)
> 
> Any of the above can be turned off by config option.

Feel free to post the current version of your kevent patch
here so we can discuss something concrete.

Maybe you have even some toy example user applications that
use kevent that people can look at too?  That might help
in understanding how it's supposed to be used.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 19:21   ` David S. Miller
@ 2006-04-28 22:04     ` Rusty Russell
  2006-04-28 22:38       ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Rusty Russell @ 2006-04-28 22:04 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, johnpol, kelly, netdev

On Fri, 2006-04-28 at 12:21 -0700, David S. Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Fri, 28 Apr 2006 18:24:08 +1000
> 
> > Note that the problem space AFAICT includes strange advanced routing
> > setups, ingress qos and possibly others, not just netfilter.  But
> > perhaps the same solutions apply, so I'll concentrate on nf.
> 
> Yes, this hasn't been mentioned explicitly yet.
> 
> The big problem is that we don't want the classifier to become
> overly complex.
> 
> One scheme I'm thinking about right now is an ordered lookup
> that looks like:
> 
> 1) Check for established sockets, they trump everything else.
> 
> 2) Check for classifier rules, ie. netfilter and packet scheduler
>    stuff
> 
> 3) Check for listening sockets
> 
> 4) default channel
> 
> #2 is still an unsolved problem, we don't want this big complex
> classifier to be required in the hardware implementations.
> However, using just IP addresses and ports does not map well to
> what netfilter and co. want.

You're still thinking you can bypass classifiers for established
sockets, but I really don't think you can.  I think the simplest
solution is to effectively remove from (or flag) the established &
listening hashes anything which could be affected by classifiers, so
those packets get sent through the default channel.

This can graduate from "all or nothing" to some more fine-grained scheme
over time.  I have some early thoughts on how we could really do this
with filtering by connection-tracking state; serious work, but feasible.
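The flag-and-fall-back idea could look roughly like this. All names here are hypothetical, nothing below is from the posted patches; it just shows a per-socket "a classifier could touch this" bit forcing the flow back onto the default channel:

```c
#include <assert.h>
#include <stddef.h>

struct netchannel { int id; };

/* Hypothetical flag: set on any hash entry a classifier rule could affect. */
#define SKF_CLASSIFIED 0x1u

struct flow_sock {
    unsigned int flags;
    struct netchannel *channel; /* per-socket fast channel, if any */
};

static struct netchannel default_channel = { 0 };

/* Anything a classifier could affect is forced through the default
 * channel, where the full netfilter/qdisc path still runs. */
static struct netchannel *channel_for(const struct flow_sock *sk)
{
    if (!sk || (sk->flags & SKF_CLASSIFIED) || !sk->channel)
        return &default_channel;
    return sk->channel;
}
```

Graduating from "all or nothing" would then just mean setting the flag on fewer entries.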

> > Ah, this is a different problem.  Our idea was to have a syscall which
> > would check & sanitize the buffers for output.  To do this, you need the
> > ability to chain buffers (a simple next entry in the header, for us).
> > 
> > Sanitization would copy the header into a global buffer (ie. not one
> > reachable by userspace), check the flowid, and chain on the rest of the
> > user buffer.  After it had sanitized the buffers, it would activate the
> > NIC, which would only send out buffers which started with a kernel
> > buffer.
> > 
> > Of course, the first step (CAP_NET_RAW-only) wouldn't need this.  And,
> > if the "sanitize_and_send" syscall were PF_VJCHAN's write(), then the
> > contents of the write() could actually be the header: userspace would
> > never deal with chained buffers.
> 
> I am not sure any of this is anything more than overhead.
> 
> If we just pop the buffers directly into the user mmap()'d ring
> buffer, headers and all, and give an offset+length pair so the
> user knows where the data starts and how much data is there, it
> should all just work out.  Where to put the offset+length is
> just a detail.

Agreed, but I was talking about userspace *send*, in reply to Caitlin
bringing it up.  A little off-topic, but I mentioned our thoughts simply
to show that it's possible to do unpriv'ed output...

(Kelly is taking a couple of well-earned days off ATM).

Cheers!
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol



* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 17:55 Caitlin Bestler
@ 2006-04-28 22:17 ` Rusty Russell
  2006-04-28 22:40   ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Rusty Russell @ 2006-04-28 22:17 UTC (permalink / raw
  To: Caitlin Bestler; +Cc: Evgeniy Polyakov, David S. Miller, kelly, netdev

On Fri, 2006-04-28 at 10:55 -0700, Caitlin Bestler wrote:
> vj_netchannels represent a strategy of minimizing
> registration/pinning costs even if it means paying for an extra copy.
> Because the extra copy is closely tied to the activation of the data
> sink consumer the cost of that extra copy is greatly reduced because
> it places the data in the cache immediately before the application
> will in fact use the received data.

Just to be clear here: I agree with Dave that without classifying
hardware, there's no point (and much pain) in going all the way to
userspace with the channel (ie. mmap).  If you're going to copy anyway,
might as well do it in the socket's read() call: then the user can then
aim the copy exactly where they want, too.  We'll need that TCP code in
the kernel for the foreseeable future anyway 8)

However, in future, if intelligent cards exist, having an API which lets
them do zero-copy and not overly penalize less intelligent cards makes
sense.

Side note: my Xen I/O patches allow the implementation of exactly this
kind of virtual hardware (no coincidence 8), so intelligent cards might
not be as far away as you think...

> Also keep in mind that once the issues are resolved to allow the
> netchannel rings to be directly visible to a user-mode client that
> enhanced/specialized interfaces can easily be added in user-mode
> libraries. So focusing on supporting existing conventional interfaces
> is probably the best approach for the initial efforts.

Absolutely.

Cheers!
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 22:04     ` Rusty Russell
@ 2006-04-28 22:38       ` David S. Miller
  2006-04-29  0:10         ` Rusty Russell
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-28 22:38 UTC (permalink / raw
  To: rusty; +Cc: caitlinb, johnpol, kelly, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Sat, 29 Apr 2006 08:04:04 +1000

> You're still thinking you can bypass classifiers for established
> sockets, but I really don't think you can.  I think the simplest
> solution is to effectively remove from (or flag) the established &
> listening hashes anything which could be affected by classifiers, so
> those packets get sent through the default channel.

OK, when rules are installed, the socket channel mappings are
flushed.  This is your idea right?


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 22:17 ` Rusty Russell
@ 2006-04-28 22:40   ` David S. Miller
  2006-04-29  0:22     ` Rusty Russell
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-04-28 22:40 UTC (permalink / raw
  To: rusty; +Cc: caitlinb, johnpol, kelly, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Sat, 29 Apr 2006 08:17:01 +1000

> On Fri, 2006-04-28 at 10:55 -0700, Caitlin Bestler wrote:
> > vj_netchannels represent a strategy of minimizing
> > registration/pinning costs even if it means paying for an extra copy.
> > Because the extra copy is closely tied to the activation of the data
> > sink consumer the cost of that extra copy is greatly reduced because
> > it places the data in the cache immediately before the application
> > will in fact use the received data.
> 
> Just to be clear here: I agree with Dave that without classifying
> hardware, there's no point (and much pain) in going all the way to
> userspace with the channel (ie. mmap).  If you're going to copy anyway,
> might as well do it in the socket's read() call: then the user can then
> aim the copy exactly where they want, too.  We'll need that TCP code in
> the kernel for the foreseeable future anyway 8)
> 
> However, in future, if intelligent cards exist, having an API which lets
> them do zero-copy and not overly penalize less intelligent cards makes
> sense.

I do not think intelligent cards imply protocol in user space.
You can still get the zero copy and move the work to the remote
CPU without all the complexity associated with putting the
protocol in userspace.  It buys nothing but complexity.

> Side note: my Xen I/O patches allow the implementation of exactly this
> kind of virtual hardware (no coincidence 8), so intelligent cards might
> not be as far away as you think...

Such hardware can be prototyped in QEMU as well.


* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 23:45 Caitlin Bestler
  0 siblings, 0 replies; 79+ messages in thread
From: Caitlin Bestler @ 2006-04-28 23:45 UTC (permalink / raw
  To: David S. Miller, rusty; +Cc: johnpol, kelly, netdev

David S. Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Sat, 29 Apr 2006 08:04:04 +1000
> 
>> You're still thinking you can bypass classifiers for established
>> sockets, but I really don't think you can.  I think the simplest
>> solution is to effectively remove from (or flag) the established &
>> listening hashes anything which could be affected by classifiers, so
>> those packets get sent through the default channel.
> 
> OK, when rules are installed, the socket channel mappings are
> flushed.  This is your idea right?

You mean when new rules are installed that would conflict with
an existing mapping, right?

Bumping every connection out of vj-channel mode whenever any new
rule was installed would be very counter-productive.

Ultimately, you only want a direct-to-user vj-channel when all
packets assigned to it would be passed by netchannels, and maybe
increment a single packet counter. Checking a single QoS rate
limiter may be possible too, but if there are more complex
rules then the channel has to be kept in kernel because it
wouldn't make sense to trust user-mode code to apply the
netchannel rules reliably.



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 22:38       ` David S. Miller
@ 2006-04-29  0:10         ` Rusty Russell
  0 siblings, 0 replies; 79+ messages in thread
From: Rusty Russell @ 2006-04-29  0:10 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, johnpol, kelly, netdev

On Fri, 2006-04-28 at 15:38 -0700, David S. Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Sat, 29 Apr 2006 08:04:04 +1000
> 
> > You're still thinking you can bypass classifiers for established
> > sockets, but I really don't think you can.  I think the simplest
> > solution is to effectively remove from (or flag) the established &
> > listening hashes anything which could be affected by classifiers, so
> > those packets get sent through the default channel.
> 
> OK, when rules are installed, the socket channel mappings are
> flushed.  This is your idea right?

Yeah.  First off, all flushed.  Later on, we get selective.

Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 22:40   ` David S. Miller
@ 2006-04-29  0:22     ` Rusty Russell
  2006-04-29  6:46       ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Rusty Russell @ 2006-04-29  0:22 UTC (permalink / raw
  To: David S. Miller; +Cc: caitlinb, johnpol, kelly, netdev

On Fri, 2006-04-28 at 15:40 -0700, David S. Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Sat, 29 Apr 2006 08:17:01 +1000
> > However, in future, if intelligent cards exist, having an API which lets
> > them do zero-copy and not overly penalize less intelligent cards makes
> > sense.
> 
> I do not think intelligent cards imply protocol in user space.
> You can still get the zero copy and move the work to the remote
> CPU without all the complexity associated with putting the
> protocol in userspace.  It buys nothing but complexity.

You're thinking the card would place the packet in the mmap'ed buffer,
but the protocol handling would still be done (on that user-accessible
buffer) in kernelspace?

I hadn't considered that.  Are the userspace-kernel interactions here
a lesser problem than telling userspace "you want direct access to
the packets?  Great, *you* handle the whole thing"?

I am thinking the big payoff for this would be MPI et al (RDMA), so we
might be best leaving it alone.

> > Side note: my Xen I/O patches allow the implementation of exactly this
> > kind of virtual hardware (no coincidence 8), so intelligent cards might
> > not be as far away as you think...
> 
> Such hardware can be prototyped in QEMU as well.

Absolutely (and writing QEMU devices is easier than writing a Linux
device driver, which says something sad).  

But the Xen virtual intelligent NIC would be a "real" NIC, not (just) a
prototype.

Cheers!
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-29  0:22     ` Rusty Russell
@ 2006-04-29  6:46       ` David S. Miller
  0 siblings, 0 replies; 79+ messages in thread
From: David S. Miller @ 2006-04-29  6:46 UTC (permalink / raw
  To: rusty; +Cc: caitlinb, johnpol, kelly, netdev

From: Rusty Russell <rusty@rustcorp.com.au>
Date: Sat, 29 Apr 2006 10:22:40 +1000

> You're thinking the card would place the packet in the mmap'ed buffer,
> but the protocol handling would still be done (on that user-accessible
> buffer) in kernelspace?

Exactly.

> I hadn't considered that.  Are the userspace-kernel interactions here
> a lesser problem than telling userspace "you want direct access to
> the packets?  Great, *you* handle the whole thing"?

I'm very much wary of putting a second TCP stack in userspace,
for the same reasons most folks are wary of TOE.

And frankly we should only go towards that kind of duplication if it
shows a real performance gain.

Nevertheless I do highly encourage folks to experiment with that
as much as possible, I could be dead wrong on my hunch that it
won't help enough to justify allowing it.


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-28 22:00               ` David S. Miller
@ 2006-04-29 13:54                 ` Evgeniy Polyakov
       [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
  1 sibling, 0 replies; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-04-29 13:54 UTC (permalink / raw
  To: David S. Miller; +Cc: shemminger, caitlinb, kelly, rusty, netdev

[-- Attachment #1: Type: text/plain, Size: 2358 bytes --]

On Fri, Apr 28, 2006 at 03:00:56PM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Fri, 28 Apr 2006 23:59:30 +0400
> 
> > kevent can be used as poll without any changes to the socket code.
> > There are two types of network related kevents - socket events
> > (recv/send/accept) and network aio, which can be turned completely off
> > in config.
> > There are following events which are supported by kevent:
> > o usual poll/select notifications
> > o inode notifications (create/remove)
> > o timer notifications
> > o socket notifications (send/recv/accept)
> > o network aio system
> > o fs aio (project closed, aio_sendfile() is being developed instead)
> > 
> > Any of the above can be turned off by config option.
> 
> Feel free to post the current version of your kevent patch
> here so we can discuss something concrete.
> 
> Maybe you have even some toy example user applications that
> use kevent that people can look at too?  That might help
> in understanding how it's supposed to be used.

There are several at project's homepage [1] and in archive [2]:
evserver_epoll.c - epoll-based web server (pure epoll)
evserver_kevent.c - kevent-based web server (socket notifications)
evserver_poll.c - web server which uses kevent-based poll (poll/select
notifications)
evtest.c - can wait for any type of events. It was used to test
timer notifications.
naio_recv.c/naio_send.c - network AIO sending and receiving benchmarks
(sync/async)
aio_sendfile.c - aio sendfile benchmark (sendfile/aio_sendfile).  The kernel
implementation is not 100% ready: pages are asynchronously propagated
into the VFS cache, but are not sent yet.

There are also links to benchmark results, a comparison with FreeBSD's kqueue,
and some conclusions on the kevent homepage [1].
The Network AIO [3] homepage also contains additional NAIO benchmarks with
some graphs.

1. kevent project home page.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. kevent archive
http://tservice.net.ru/~s0mbre/archive/kevent/

3. Network AIO
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

The current development kevent patchset (against 2.6.15-rc7, though it can be
applied against later trees too) is attached gzipped; sorry if you get this
twice.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

-- 
	Evgeniy Polyakov

[-- Attachment #2: kevent_full.diff.gz --]
[-- Type: application/x-gunzip, Size: 23625 bytes --]


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
       [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
@ 2006-05-01 21:32                   ` David S. Miller
  2006-05-02  7:08                     ` Evgeniy Polyakov
  2006-05-02  8:10                     ` [1/1] Kevent subsystem Evgeniy Polyakov
  0 siblings, 2 replies; 79+ messages in thread
From: David S. Miller @ 2006-05-01 21:32 UTC (permalink / raw
  To: johnpol; +Cc: shemminger, caitlinb, kelly, rusty, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Sat, 29 Apr 2006 16:44:51 +0400

> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

I understand how in some ways this is work in progress,
but direct calls into ext3 from the kevent code?  I'd
like stuff like that cleaned up before reviewing :-)


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-01 21:32                   ` David S. Miller
@ 2006-05-02  7:08                     ` Evgeniy Polyakov
  2006-05-02  8:10                     ` [1/1] Kevent subsystem Evgeniy Polyakov
  1 sibling, 0 replies; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-05-02  7:08 UTC (permalink / raw
  To: David S. Miller; +Cc: shemminger, caitlinb, kelly, rusty, netdev

On Mon, May 01, 2006 at 02:32:46PM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Sat, 29 Apr 2006 16:44:51 +0400
> 
> > Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> 
> I understand how in some ways this is work in progress,
> but direct calls into ext3 from the kevent code?  I'd
> like stuff like that cleaned up before reviewing :-)

Well, this only requires a per-address-space ->get_block() callback,
which is what ext3_get_block() is.

I will update and resend patchset today.

Thank you.

-- 
	Evgeniy Polyakov


* [1/1] Kevent subsystem.
  2006-05-01 21:32                   ` David S. Miller
  2006-05-02  7:08                     ` Evgeniy Polyakov
@ 2006-05-02  8:10                     ` Evgeniy Polyakov
  1 sibling, 0 replies; 79+ messages in thread
From: Evgeniy Polyakov @ 2006-05-02  8:10 UTC (permalink / raw
  To: David S. Miller; +Cc: shemminger, caitlinb, kelly, rusty, netdev

[-- Attachment #1: Type: text/plain, Size: 1007 bytes --]

The kevent subsystem incorporates several AIO/kqueue design notes and
ideas.  Kevent can be used for both edge- and level-triggered notifications.

It supports:
o socket notifications (accept, receiving and sending)
o network AIO (aio_send(), aio_recv() and aio_sendfile()) [3]
o inode notifications (create/remove)
o generic poll()/select() notifications
o timer notifications

More info, design notes, and benchmarks (a web server based on epoll, kevent,
and kevent_poll; naio_send() vs. send() and naio_recv() vs. recv() with
different numbers of sending/receiving users) can be found on the project's
homepage [1].

The userspace interface is described in detail in an LWN article [2].

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. LWN article about kevent.
http://lwn.net/Articles/172844/

3. Network AIO (aio_send(), aio_recv(), aio_sendfile()).
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

-- 
	Evgeniy Polyakov

[-- Attachment #2: kevent_full.diff.3.gz --]
[-- Type: application/x-gunzip, Size: 24072 bytes --]


* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-27  6:25     ` David S. Miller
  2006-04-27 11:51       ` Evgeniy Polyakov
@ 2006-05-04  2:59       ` Kelly Daly
  2006-05-04 23:22         ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: Kelly Daly @ 2006-05-04  2:59 UTC (permalink / raw
  Cc: David S. Miller, rusty, netdev

On Thursday 27 April 2006 16:25, you wrote:
> So the idea in your scheme is to give the buffer pools to the NIC
> in a per-channel way via a simple descriptor table?  And the u32's
> are arbitrary keys that index into this descriptor table, right?
>

yeah - it _was_...  Although, having had a play at coding it into your 
implementation, we've come up with the following:

Using the descriptor table adds excess complexity for kernel buffers, and is 
really only useful for userspace.  So instead of using descriptor tables for 
everything, we've come up with a dynamic descriptor table scheme where they 
are used only for userspace.
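One way such a userspace-only descriptor table might look - purely our sketch of the idea, with invented names: the u32 keys carried in ring descriptors index a small table of user buffers, allocated as userspace posts them and freed when the channel is done:

```c
#include <assert.h>
#include <stddef.h>

#define DESC_SLOTS 64

/* One user buffer the channel may fill (the real thing would pin pages). */
struct buf_desc {
    void        *uaddr; /* userspace address */
    unsigned int len;
    int          in_use;
};

static struct buf_desc desc_table[DESC_SLOTS];

/* Hand a user buffer to the channel; the returned key is the u32 a
 * ring descriptor would carry.  Returns -1 when the table is full. */
static int desc_alloc(void *uaddr, unsigned int len)
{
    for (int i = 0; i < DESC_SLOTS; i++) {
        if (!desc_table[i].in_use) {
            desc_table[i] = (struct buf_desc){ uaddr, len, 1 };
            return i;
        }
    }
    return -1;
}

static struct buf_desc *desc_lookup(int key)
{
    if (key < 0 || key >= DESC_SLOTS || !desc_table[key].in_use)
        return NULL;
    return &desc_table[key];
}

static void desc_free(int key)
{
    struct buf_desc *d = desc_lookup(key);
    if (d)
        d->in_use = 0;
}
```

Kernel-internal buffers would bypass this table entirely, which is the simplification described above.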

The move to skb-ising the buffers has made it more difficult to keep track of 
buffer lifetimes.  Previously we were leaving the buffers in the ring until 
completely finished with them.  The producer could reuse the buffer once the 
consumer head had moved on.  With the graft to skb we can no longer do this 
unless the packets are processed serially (which is ok for socket channels, 
but not realistic for the default).

We DID write an infrastructure to resolve this issue, although it is more 
complex than the dynamic descriptor scheme for userspace.  And we want to 
keep this simple - right?

Cheers,
K



* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-04-26  7:59 ` David S. Miller
@ 2006-05-04  7:28   ` Kelly Daly
  2006-05-04 23:11     ` David S. Miller
  0 siblings, 1 reply; 79+ messages in thread
From: Kelly Daly @ 2006-05-04  7:28 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, netdev, rusty

On Wednesday 26 April 2006 17:59, David S. Miller wrote:
> Next, you can't even begin to work on the protocol channels before you
> do one very important piece of work.  Integration of all of the ipv4
> and ipv6 protocol hash tables into a central code, it's a total
> prerequisite.  Then you modify things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> protocol number as well as saddr/daddr/sport/dport and searches
> from a central table.

Back here again  ;)

Is this on the right track (see patch below)?

K



_____________________


diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h	2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h	2006-05-04 14:28:59.000000000 +1000
@@ -418,4 +418,6 @@ static inline struct sock *inet_lookup(s
 
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 			     struct sock *sk);
+
+extern struct inet_hashinfo *inet_hashes[256];
 #endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h	2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h	2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
 	unsigned short		sk_type;
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
+	struct netchannel	*sk_channel;
 	wait_queue_head_t	*sk_sleep;
 	struct dst_entry	*sk_dst_cache;
 	struct xfrm_policy	*sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c	2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c	2006-05-04 16:58:49.000000000 +1000
@@ -116,6 +116,7 @@
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <net/inet_hashtables.h>
 
 /*
  *	The list of packet types we will receive (as opposed to discard)
@@ -190,6 +191,8 @@ static inline struct hlist_head *dev_ind
 	return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
 }
 
+static struct netchannel default_netchannel;
+
 /*
  *	Our notifier list
  */
@@ -1907,6 +1910,37 @@ struct netchannel_buftrailer *__netchann
 }
 EXPORT_SYMBOL_GPL(__netchannel_dequeue);
 
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+	struct sock *sk = NULL;
+	unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+	void *data = (void *)np - dlen;
+
+	switch (np->netchan_buf_proto) {
+	case __constant_htons(ETH_P_IP): {
+		struct iphdr *ip = data;
+		int iphl = ip->ihl * 4;
+
+		if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+			u16 *ports = (u16 *)(ip + 1);
+
+			if (inet_hashes[ip->protocol]) {
+				sk = inet_lookup(inet_hashes[ip->protocol], 
+						 ip->saddr, ports[0],
+						 ip->daddr, ports[1],
+						 np->netchan_buf_dev->ifindex);
+			}
+			break;
+		}
+	}
+	}
+	if (sk && sk->sk_channel)
+		return sk->sk_channel;
+	return &default_netchannel;
+}
+
 static gifconf_func_t * gifconf_list [NPROTO];
 
 /**
@@ -3421,6 +3455,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+
+	/* FIXME: This should be attached to thread/threads. */
+	netchannel_init(&default_netchannel, NULL, NULL);
 	rc = 0;
 out:
 	return rc;
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c	2006-05-04 14:28:59.000000000 +1000
@@ -337,3 +337,5 @@ out:
 }
 
 EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+struct inet_hashinfo *inet_hashes[256];
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c	2006-05-04 14:28:59.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
 	       tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
 
 	tcp_register_congestion_control(&tcp_reno);
+	inet_hashes[IPPROTO_TCP] = &tcp_hashinfo;
 }
 
 EXPORT_SYMBOL(tcp_close);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-04  7:28   ` Kelly Daly
@ 2006-05-04 23:11     ` David S. Miller
  2006-05-05  2:48       ` Kelly Daly
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-05-04 23:11 UTC (permalink / raw
  To: kelly; +Cc: netdev, rusty

From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 4 May 2006 17:28:27 +1000

> On Wednesday 26 April 2006 17:59, David S. Miller wrote:
> > Next, you can't even begin to work on the protocol channels before you
> > do one very important piece of work.  Integration of all of the ipv4
> > and ipv6 protocol hash tables into a central code, it's a total
> > prerequisite.  Then you modify things to use a generic
> > inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> > protocol number as well as saddr/daddr/sport/dport and searches
> > from a central table.
> 
> Back here again  ;)
> 
> Is this on the right track (see patch below)?

It is on the right track.

I very much fear abuse of the inet_hashes[] array.  So I'd rather
hide it behind a programmatic interface, something like:

extern struct sock *inet_lookup_proto(u16 protocol, u32 saddr, u16 sport,
				      u32 daddr, u16 dport, int ifindex);

and export that from inet_hashtables.c

Then you have registry and unregistry functions in inet_hashtables.c
that setup the static inet_hashes[] array.  So TCP would go:

	inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);

instead of the direct assignment to inet_hashes[] it makes right
now in your patch.

Thanks!

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-04  2:59       ` Kelly Daly
@ 2006-05-04 23:22         ` David S. Miller
  2006-05-05  1:31           ` Rusty Russell
  0 siblings, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-05-04 23:22 UTC (permalink / raw
  To: kelly; +Cc: rusty, netdev

From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 4 May 2006 12:59:23 +1000

> We DID write an infrastructure to resolve this issue, although it is more 
> complex than the dynamic descriptor scheme for userspace.  And we want to 
> keep this simple - right?

Yes.

I wonder if it is possible to manage the buffer pool just like a SLAB
cache to deal with the variable lifetimes.  The system has a natural
"working set" size of networking buffers at a given point in time and
even the default net channel can grow to accommodate that with some
kind of limit.

This is kind of what I was alluding to in the past, in that we now
have global limits on system TCP socket memory when really what we
want to do is have a set of global generic system packet memory
limits.

These two things can tie in together.

Note that this means we need a callback in the SKB to free the memory
up.  For direct net channels to a socket, you don't need any callbacks
of course because as you mentioned you know the buffer lifetimes.

People want such a callback anyways in order to experiment with SKB
recycling in drivers.

Note that some kind of "shrink" callback would need to be implemented.
It would only be needed for the default channel.  We need to seriously
avoid needing something like this over the socket net channels because
that is serious complexity.

Finally... if we go the global packet memory route, we will need hard
and soft limits.  There is a danger in such a scheme of not being able
to get critical control packets out (ACKs, etc.).  Also, there are all
kinds of classification and drop algorithms (see RED) which could be
used to handle overload situations gracefully.

So, are you still sure you want to do away with the descriptors for
the default channel?  Is the scheme I have outlined above doable or
is there some critical barrier or some complexity issue which makes
it undesirable?

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-04 23:22         ` David S. Miller
@ 2006-05-05  1:31           ` Rusty Russell
  0 siblings, 0 replies; 79+ messages in thread
From: Rusty Russell @ 2006-05-05  1:31 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, netdev

On Thu, 2006-05-04 at 16:22 -0700, David S. Miller wrote:
> From: Kelly Daly <kelly@au1.ibm.com>
> Date: Thu, 4 May 2006 12:59:23 +1000
> 
> > We DID write an infrastructure to resolve this issue, although it is more 
> > complex than the dynamic descriptor scheme for userspace.  And we want to 
> > keep this simple - right?
> 
> Yes.
> 
> I wonder if it is possible to manage the buffer pool just like a SLAB
> cache to deal with the variable lifetimes.  The system has a natural
> "working set" size of networking buffers at a given point in time and
> even the default net channel can grow to accommodate that with some
> kind of limit.
> 
> This is kind of what I was alluding to in the past, in that we now
> have global limits on system TCP socket memory when really what we
> want to do is have a set of global generic system packet memory
> limits.
> 
> These two things can tie in together.

Hi Dave,

	We kept a simple "used" bitmap, but to avoid the consumer touching it,
also put a "I am masquerading as an SKB" bit in the trailer, like so:

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .16405-linux-2.6.17-rc3-git7/include/linux/skbuff.h .16405-linux-2.6.17-rc3-git7.updated/include/linux/skbuff.h
--- .16405-linux-2.6.17-rc3-git7/include/linux/skbuff.h	2006-05-03 22:07:14.000000000 +1000
+++ .16405-linux-2.6.17-rc3-git7.updated/include/linux/skbuff.h	2006-05-03 22:07:15.000000000 +1000
@@ -133,7 +133,8 @@ struct skb_frag_struct {
  */
 struct skb_shared_info {
 	atomic_t	dataref;
-	unsigned short	nr_frags;
+	unsigned short	nr_frags : 15;
+	unsigned int	chan_as_skb : 1;
 	unsigned short	tso_size;
 	unsigned short	tso_segs;
 	unsigned short  ufo_size;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .16405-linux-2.6.17-rc3-git7/net/core/skbuff.c .16405-linux-2.6.17-rc3-git7.updated/net/core/skbuff.c
--- .16405-linux-2.6.17-rc3-git7/net/core/skbuff.c	2006-05-03 22:07:14.000000000 +1000
+++ .16405-linux-2.6.17-rc3-git7.updated/net/core/skbuff.c	2006-05-03 22:07:15.000000000 +1000
@@ -289,6 +289,7 @@ struct sk_buff *skb_netchan_graft(struct
 	skb_shinfo(skb)->ufo_size = 0;
 	skb_shinfo(skb)->ip6_frag_id = 0;
 	skb_shinfo(skb)->frag_list = NULL;
+	skb_shinfo(skb)->chan_as_skb = 1;
 
 	return skb;
 }
@@ -328,7 +329,10 @@ void skb_release_data(struct sk_buff *sk
 		if (skb_shinfo(skb)->frag_list)
 			skb_drop_fraglist(skb);
 
-		kfree(skb->head);
+		if (skb_shinfo(skb)->chan_as_skb)
+			skb_shinfo(skb)->chan_as_skb = 0;
+		else
+			kfree(skb->head);
 	}
 }
 
Buffer allocation would be: find_first_bit, check that it's not actually
inside an skb, or otherwise find_next_bit.  Assuming most buffers do not
go down the default channel, this is efficient.

Problems:
1) it's still not cache-friendly with producers on multiple CPUs.  We
could divide up the bitmap into per-cpu regions to try first to improve
cache behaviour.

2) In addition, we had every buffer one page large.  This isn't
sufficient for jumbo frames, and wasteful for ethernet.  So if we
statically assign descriptors -> buffers, we need to have multiple
sizes.

3) OTOH, if descriptor table is dynamic, we have cache issues again as
multiple people are writing to it, and it's not clear what we really
gain over direct pointers.

4) Grow/shrink can be done, but needs stop_machine, or maybe tricky RCU.

5) The killer for me: we can't use our scheme straight-to-userspace
anyway, since we can't trust the (user-writable) ringbuffer in deciding
what buffers to release.  Since we need to store this somewhere, we need
a test in netchannel_enqueue.  At which point, we might as well
translate to "descriptors" at that point, anyway (since descriptors are
only really needed for userspace).  Something like:

	tail = np->netchan_tail;
	if (tail == np->netchan_head)
		return -ENOMEM;

+	/* Write to userspace?  They can't deref ptr anyway. */
+	if (np->shadow_ring && !netchan_local_buf(bp)) {
+		np->shadow_ring[tail] = bp;
+		bp = (void *)-1;
+	}
	np->netchan_queue[tail++] = bp;
	if (tail == NET_CHANNEL_ENTRIES)

(We don't have local buffers yet, but I'm assuming we'll use v. low
pointers for them).  Userspace goes "desc number is in range, we can
access directly" or "desc number isn't, call into kernel to copy them
for us".

> So, are you still sure you want to do away with the descriptors for
> the default channel?  Is the scheme I have outlined above doable or
> is there some critical barrier or some complexity issue which makes
> it undesirable?

I think it's simpler to build global alloc limiters on what we have.
The slab already has the nice lifetime and cache-friendly properties we
want, so we just have to write the limiting code.  There's enough work
there to keep us busy 8)

Cheers,
Rusty.
-- 
 ccontrol: http://ozlabs.org/~rusty/ccontrol


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-04 23:11     ` David S. Miller
@ 2006-05-05  2:48       ` Kelly Daly
  2006-05-16  1:02         ` Kelly Daly
  0 siblings, 1 reply; 79+ messages in thread
From: Kelly Daly @ 2006-05-05  2:48 UTC (permalink / raw
  To: David S. Miller; +Cc: rusty, netdev

On Friday 05 May 2006 09:11, David S. Miller wrote:
> I very much fear abuse of the inet_hashes[] array.  So I'd rather
> hide it behind a programmatic interface, something like:

done!  I will continue with implementation of default netchannel for now.


> Thanks!
anytime  =)

Cheers,
K


______________________

diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h	2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h	2006-05-05 12:05:33.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
 
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 			     struct sock *sk);
+extern void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
 #endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h	2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h	2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
 	unsigned short		sk_type;
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
+	struct netchannel	*sk_channel;
 	wait_queue_head_t	*sk_sleep;
 	struct dst_entry	*sk_dst_cache;
 	struct xfrm_policy	*sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c	2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c	2006-05-05 10:39:22.000000000 +1000
@@ -116,6 +116,7 @@
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <net/inet_hashtables.h>
 
 /*
  *	The list of packet types we will receive (as opposed to discard)
@@ -190,6 +191,8 @@ static inline struct hlist_head *dev_ind
 	return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
 }
 
+static struct netchannel default_netchannel;
+
 /*
  *	Our notifier list
  */
@@ -1907,6 +1910,34 @@ struct netchannel_buftrailer *__netchann
 }
 EXPORT_SYMBOL_GPL(__netchannel_dequeue);
 
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+	struct sock *sk = NULL;
+	unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+	void *data = (void *)np - dlen;
+
+	switch (np->netchan_buf_proto) {
+	case __constant_htons(ETH_P_IP): {
+		struct iphdr *ip = data;
+		int iphl = ip->ihl * 4;
+
+		if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+			u16 *ports = (u16 *)(ip + 1);
+			sk = inet_lookup_proto(ip->protocol,
+					       ip->saddr, ports[0],
+					       ip->daddr, ports[1],
+					       np->netchan_buf_dev->ifindex);
+			break;
+		}
+	}
+	}
+	if (sk && sk->sk_channel)
+		return sk->sk_channel;
+	return &default_netchannel;
+}
+
 static gifconf_func_t * gifconf_list [NPROTO];
 
 /**
@@ -3421,6 +3452,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+
+	/* FIXME: This should be attached to thread/threads. */
+	netchannel_init(&default_netchannel, NULL, NULL);
 	rc = 0;
 out:
 	return rc;
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c	2006-05-05 12:05:33.000000000 +1000
@@ -337,3 +337,25 @@ out:
 }
 
 EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+static struct inet_hashinfo *inet_hashes[256];
+
+void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo)
+{
+	BUG_ON(inet_hashes[proto]);
+	inet_hashes[proto] = hashinfo;
+}
+EXPORT_SYMBOL(inet_hash_register);
+
+struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex)
+{
+	struct sock *sk = NULL;
+	if (inet_hashes[protocol]) {
+		sk = inet_lookup(inet_hashes[protocol], 
+				 saddr, sport,
+				 daddr, dport,
+				 ifindex);
+	}
+	return sk;
+}
+EXPORT_SYMBOL(inet_lookup_proto);
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c	2006-05-05 11:29:18.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
 	       tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
 
 	tcp_register_congestion_control(&tcp_reno);
+	inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);
 }
 
 EXPORT_SYMBOL(tcp_close);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-05  2:48       ` Kelly Daly
@ 2006-05-16  1:02         ` Kelly Daly
  2006-05-16  1:05           ` David S. Miller
  2006-05-16  5:16           ` David S. Miller
  0 siblings, 2 replies; 79+ messages in thread
From: Kelly Daly @ 2006-05-16  1:02 UTC (permalink / raw
  To: David S. Miller; +Cc: netdev, rusty

On Friday 05 May 2006 12:48, Kelly Daly wrote:
> done!  I will continue with implementation of default netchannel for now.

___________________


diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h	2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h	2006-05-05 12:45:44.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
 
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 			     struct sock *sk);
+extern void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
 #endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h	2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h	2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
 	unsigned short		sk_type;
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
+	struct netchannel	*sk_channel;
 	wait_queue_head_t	*sk_sleep;
 	struct dst_entry	*sk_dst_cache;
 	struct xfrm_policy	*sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c	2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c	2006-05-15 12:21:41.000000000 +1000
@@ -113,9 +113,11 @@
 #include <linux/delay.h>
 #include <linux/wireless.h>
 #include <linux/netchannel.h>
+#include <linux/kthread.h>
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <net/inet_hashtables.h>
 
 /*
  *	The list of packet types we will receive (as opposed to discard)
@@ -190,6 +192,10 @@ static inline struct hlist_head *dev_ind
 	return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
 }
 
+/*      default netchannel shtuff */
+static struct netchannel default_netchannel;
+static wait_queue_head_t default_netchannel_wq;
+
 /*
  *	Our notifier list
  */
@@ -1854,6 +1860,35 @@ softnet_break:
 	goto out;
 }
 
+static void default_netchannel_wake(struct netchannel *np)
+{
+	wake_up(&default_netchannel_wq);
+}
+
+/* handles default chan buffers that nobody else wants */
+static int default_netchannel_thread(void *unused)
+{
+	wait_queue_t wait;
+	struct netchannel_buftrailer *bp;
+	struct sk_buff *skbp;
+
+	wait.private = current;
+	wait.func = default_wake_function;
+	INIT_LIST_HEAD(&wait.task_list);
+
+	add_wait_queue(&default_netchannel_wq, &wait);
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	while (!kthread_should_stop()) {
+		bp = __netchannel_dequeue(&default_netchannel);
+		skbp = skb_netchan_graft(bp, GFP_ATOMIC);
+		netif_receive_skb(skbp);
+	}
+	remove_wait_queue(&default_netchannel_wq, &wait);
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
 void netchannel_init(struct netchannel *np,
 		     void (*callb)(struct netchannel *), void *callb_data)
 {
@@ -1907,6 +1942,34 @@ struct netchannel_buftrailer *__netchann
 }
 EXPORT_SYMBOL_GPL(__netchannel_dequeue);
 
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+	struct sock *sk = NULL;
+	unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+	void *data = (void *)np - dlen;
+
+	switch (np->netchan_buf_proto) {
+	case __constant_htons(ETH_P_IP): {
+		struct iphdr *ip = data;
+		int iphl = ip->ihl * 4;
+
+		if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+			u16 *ports = (u16 *)(ip + 1);
+			sk = inet_lookup_proto(ip->protocol,
+					       ip->saddr, ports[0],
+					       ip->daddr, ports[1],
+					       np->netchan_buf_dev->ifindex);
+			break;
+		}
+	}
+	}
+	if (sk && sk->sk_channel)
+		return sk->sk_channel;
+	return &default_netchannel;
+}
+
 static gifconf_func_t * gifconf_list [NPROTO];
 
 /**
@@ -3375,6 +3438,7 @@ static int dev_cpu_callback(struct notif
 static int __init net_dev_init(void)
 {
 	int i, rc = -ENOMEM;
+	struct task_struct *netchan_thread;
 
 	BUG_ON(!dev_boot_phase);
 
@@ -3421,7 +3485,12 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
-	rc = 0;
+
+	netchannel_init(&default_netchannel, default_netchannel_wake, NULL);
+	netchan_thread = kthread_run(default_netchannel_thread, NULL, "kvj_def");
+
+	if (!IS_ERR(netchan_thread)) 	/* kthread_run returned thread */
+		rc = 0;
 out:
 	return rc;
 }
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c	2006-05-05 12:45:44.000000000 +1000
@@ -337,3 +337,25 @@ out:
 }
 
 EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+static struct inet_hashinfo *inet_hashes[256];
+
+void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo)
+{
+	BUG_ON(inet_hashes[proto]);
+	inet_hashes[proto] = hashinfo;
+}
+EXPORT_SYMBOL(inet_hash_register);
+
+struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex)
+{
+	struct sock *sk = NULL;
+	if (inet_hashes[protocol]) {
+		sk = inet_lookup(inet_hashes[protocol], 
+				 saddr, sport,
+				 daddr, dport,
+				 ifindex);
+	}
+	return sk;
+}
+EXPORT_SYMBOL(inet_lookup_proto);
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c	2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c	2006-05-05 11:29:18.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
 	       tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
 
 	tcp_register_congestion_control(&tcp_reno);
+	inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);
 }
 
 EXPORT_SYMBOL(tcp_close);

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-16  1:02         ` Kelly Daly
@ 2006-05-16  1:05           ` David S. Miller
  2006-05-16  1:15             ` Kelly Daly
  2006-05-16  5:16           ` David S. Miller
  1 sibling, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-05-16  1:05 UTC (permalink / raw
  To: kelly; +Cc: netdev, rusty

From: Kelly Daly <kelly@au1.ibm.com>
Date: Tue, 16 May 2006 11:02:29 +1000

> On Friday 05 May 2006 12:48, Kelly Daly wrote:
> > done!  I will continue with implementation of default netchannel for now.

Some context?  It's been a week since we were discussing this,
so I'd like to know what we're looking at here in this patch :)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-16  1:05           ` David S. Miller
@ 2006-05-16  1:15             ` Kelly Daly
  0 siblings, 0 replies; 79+ messages in thread
From: Kelly Daly @ 2006-05-16  1:15 UTC (permalink / raw
  To: David S. Miller; +Cc: kelly, netdev, rusty

On Tuesday 16 May 2006 11:05, David S. Miller wrote:
> From: Kelly Daly <kelly@au1.ibm.com>
> Date: Tue, 16 May 2006 11:02:29 +1000
>
> > On Friday 05 May 2006 12:48, Kelly Daly wrote:
> > > done!  I will continue with implementation of default netchannel for
> > > now.
>
> Some context?  It's been a week since we were discussing this,
> so I'd like to know what we're looking at here in this patch :)

the implementation of the default netchannel  =)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-16  1:02         ` Kelly Daly
  2006-05-16  1:05           ` David S. Miller
@ 2006-05-16  5:16           ` David S. Miller
  2006-06-22  2:05             ` Kelly Daly
  1 sibling, 1 reply; 79+ messages in thread
From: David S. Miller @ 2006-05-16  5:16 UTC (permalink / raw
  To: kelly; +Cc: netdev, rusty

From: Kelly Daly <kelly@au1.ibm.com>
Date: Tue, 16 May 2006 11:02:29 +1000

> +/* handles default chan buffers that nobody else wants */
> +static int default_netchannel_thread(void *unused)
> +{
> +	wait_queue_t wait;
> +	struct netchannel_buftrailer *bp;
> +	struct sk_buff *skbp;
> +
> +	wait.private = current;
> +	wait.func = default_wake_function;
> +	INIT_LIST_HEAD(&wait.task_list);
> +
> +	add_wait_queue(&default_netchannel_wq, &wait);
> +	set_current_state(TASK_UNINTERRUPTIBLE);
> +	while (!kthread_should_stop()) {
> +		bp = __netchannel_dequeue(&default_netchannel);
> +		skbp = skb_netchan_graft(bp, GFP_ATOMIC);
> +		netif_receive_skb(skbp);
> +	}
> +	remove_wait_queue(&default_netchannel_wq, &wait);
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +

When does this thread ever go to sleep?  Seems like it will loop
forever and not block when the default_netchannel queue is empty.
:-)

> +	unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;

Probably deserves a "netchan_buf_len(bp)" inline in linux/netchannel.h

> diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
> --- davem_orig/net/ipv4/inet_hashtables.c	2006-04-27 00:08:33.000000000 +1000
> +++ kelly/net/ipv4/inet_hashtables.c	2006-05-05 12:45:44.000000000 +1000

The hash table bits look good, just as they did last time :-)
So I'll put this part into my vj-2.6 tree now, thanks.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-05-16  5:16           ` David S. Miller
@ 2006-06-22  2:05             ` Kelly Daly
  2006-06-22  3:58               ` James Morris
  2006-07-08  0:05               ` David Miller
  0 siblings, 2 replies; 79+ messages in thread
From: Kelly Daly @ 2006-06-22  2:05 UTC (permalink / raw
  To: David S. Miller; +Cc: netdev, rusty

> The hash table bits look good, just as they did last time :-)
> So I'll put this part into my vj-2.6 tree now, thanks.
Rockin' - thanks...

Sorry for the massive delay - here's the next attempt.

-------

diff -urp davem/include/linux/netchannel.h kelly_new/include/linux/netchannel.h
--- davem/include/linux/netchannel.h	2006-06-16 15:14:15.000000000 +1000
+++ kelly_new/include/linux/netchannel.h	2006-06-22 11:47:04.000000000 +1000
@@ -19,6 +19,7 @@ struct netchannel {
 	void 			(*netchan_callb)(struct netchannel *);
 	void			*netchan_callb_data;
 	unsigned long		netchan_head;
+	wait_queue_head_t	wq;
 };
 
 extern void netchannel_init(struct netchannel *,
@@ -56,6 +57,11 @@ static inline unsigned char *netchan_buf
 	return netchan_buf_base(bp) + bp->netchan_buf_offset;
 }
 
+static inline int netchan_data_len(const struct netchannel_buftrailer *bp)
+{
+	return bp->netchan_buf_len - bp->netchan_buf_offset;
+}
+
 extern int netchannel_enqueue(struct netchannel *, struct netchannel_buftrailer *);
 extern struct netchannel_buftrailer *__netchannel_dequeue(struct netchannel *);
 static inline struct netchannel_buftrailer *netchannel_dequeue(struct netchannel *np)
@@ -65,6 +71,7 @@ static inline struct netchannel_buftrail
 	return __netchannel_dequeue(np);
 }
 
+extern struct netchannel *find_netchannel(const struct netchannel_buftrailer *bp);
 extern struct sk_buff *skb_netchan_graft(struct netchannel_buftrailer *, gfp_t);
 
 #endif /* _LINUX_NETCHANNEL_H */
diff -urp davem/include/net/inet_hashtables.h kelly_new/include/net/inet_hashtables.h
--- davem/include/net/inet_hashtables.h	2006-06-16 14:34:20.000000000 +1000
+++ kelly_new/include/net/inet_hashtables.h	2006-06-19 10:42:45.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
 
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 			     struct sock *sk);
+extern void  inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
 #endif /* _INET_HASHTABLES_H */
diff -urp davem/include/net/sock.h kelly_new/include/net/sock.h
--- davem/include/net/sock.h	2006-06-16 15:14:16.000000000 +1000
+++ kelly_new/include/net/sock.h	2006-06-19 10:42:45.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
 	unsigned short		sk_type;
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
+	struct netchannel	*sk_channel;
 	wait_queue_head_t	*sk_sleep;
 	struct dst_entry	*sk_dst_cache;
 	struct xfrm_policy	*sk_policy[2];
diff -urp davem/net/core/dev.c kelly_new/net/core/dev.c
--- davem/net/core/dev.c	2006-06-16 15:14:16.000000000 +1000
+++ kelly_new/net/core/dev.c	2006-06-22 11:45:55.000000000 +1000
@@ -113,9 +113,12 @@
 #include <linux/delay.h>
 #include <linux/wireless.h>
 #include <linux/netchannel.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
 #include <net/iw_handler.h>
 #include <asm/current.h>
 #include <linux/audit.h>
+#include <net/inet_hashtables.h>
 
 /*
  *	The list of packet types we will receive (as opposed to discard)
@@ -190,6 +193,8 @@ static inline struct hlist_head *dev_ind
 	return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
 }
 
+static struct netchannel default_netchannel;
+
 /*
  *	Our notifier list
  */
@@ -1854,11 +1859,18 @@ softnet_break:
 	goto out;
 }
 
+void netchannel_wake(struct netchannel *np)
+{
+	wake_up(&np->wq);
+}
+
 void netchannel_init(struct netchannel *np,
 		     void (*callb)(struct netchannel *), void *callb_data)
 {
 	memset(np, 0, sizeof(*np));
 
+	init_waitqueue_head(&np->wq);
+
 	np->netchan_callb	= callb;
 	np->netchan_callb_data	= callb_data;
 }
@@ -1912,6 +1924,76 @@ struct netchannel_buftrailer *__netchann
 }
 EXPORT_SYMBOL_GPL(__netchannel_dequeue);
 
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *bp)
+{
+	struct sock *sk = NULL;
+	int datalen = netchan_data_len(bp);
+
+	switch (bp->netchan_buf_proto) {
+	case __constant_htons(ETH_P_IP): {
+		struct iphdr *ip = (void *)bp - datalen;
+		int iphl = ip->ihl * 4;
+
+		/* FIXME: Do sanity checks, parse packet. */
+
+		if (datalen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+			u16 *ports = (u16 *)(ip + 1);
+			sk = inet_lookup_proto(ip->protocol, 
+					 ip->saddr, ports[0],
+					 ip->daddr, ports[1],
+					 bp->netchan_buf_dev->ifindex);
+		}
+		break;
+	}
+	}
+
+	if (sk && sk->sk_channel)
+		return sk->sk_channel;
+	return &default_netchannel;
+}
+EXPORT_SYMBOL_GPL(find_netchannel);
+
+static int sock_add_netchannel(struct sock *sk)
+{
+	struct netchannel *np;
+
+	np = kmalloc(sizeof(struct netchannel), GFP_KERNEL);
+	if (!np)
+		return -ENOMEM;
+	netchannel_init(np, netchannel_wake, (void *)np);
+	sk->sk_channel = np;
+
+	return 0;
+}
+
+/* deal with packets coming to default thread */
+static int netchannel_default_thread(void *unused)
+{
+	struct netchannel *np = &default_netchannel;
+	struct netchannel_buftrailer *nbp;
+	struct sk_buff *skbp;
+	DECLARE_WAITQUEUE(wait, current);
+
+	add_wait_queue(&np->wq, &wait);
+	set_current_state(TASK_UNINTERRUPTIBLE);
+
+	while (!kthread_should_stop()) {
+		while (np->netchan_tail != np->netchan_head) {
+			nbp = netchannel_dequeue(np);
+			skbp = skb_netchan_graft(nbp, GFP_KERNEL);
+			netif_receive_skb(skbp);
+		}
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	
+	remove_wait_queue(&np->wq, &wait);
+	__set_current_state(TASK_RUNNING);
+
+	return 0;
+}
+
 static gifconf_func_t * gifconf_list [NPROTO];
 
 /**
@@ -3426,6 +3508,10 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+
+	netchannel_init(&default_netchannel, netchannel_wake, (void *)&default_netchannel);
+	kthread_run(netchannel_default_thread, NULL, "nc_def");
+
 	rc = 0;
 out:
 	return rc;

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-06-22  2:05             ` Kelly Daly
@ 2006-06-22  3:58               ` James Morris
  2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
  2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
  2006-07-08  0:05               ` David Miller
  1 sibling, 2 replies; 79+ messages in thread
From: James Morris @ 2006-06-22  3:58 UTC (permalink / raw
  To: Kelly Daly; +Cc: David S. Miller, netdev, rusty

On Thu, 22 Jun 2006, Kelly Daly wrote:

> +	switch (bp->netchan_buf_proto) {
> +	case __constant_htons(ETH_P_IP): {

__constant_htons and friends should not be used in runtime code, only for 
data being initialized at compile time.



- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-06-22  3:58               ` James Morris
@ 2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
  2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
  1 sibling, 0 replies; 79+ messages in thread
From: Arnaldo Carvalho de Melo @ 2006-06-22  4:31 UTC (permalink / raw
  To: James Morris; +Cc: Kelly Daly, David S. Miller, netdev, rusty

On 6/22/06, James Morris <jmorris@namei.org> wrote:
> On Thu, 22 Jun 2006, Kelly Daly wrote:
>
> > +     switch (bp->netchan_buf_proto) {
> > +     case __constant_htons(ETH_P_IP): {
>
> __constant_htons and friends should not be used in runtime code, only for
> data being initialized at compile time.

... because they generate the same code, so using plain htons() just
keeps the source code less cluttered ... :-)

- Arnaldo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-06-22  3:58               ` James Morris
  2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
@ 2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
  1 sibling, 0 replies; 79+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-22  4:36 UTC (permalink / raw
  To: jmorris; +Cc: kelly, davem, netdev, rusty, yoshfuji

In article <Pine.LNX.4.64.0606212354420.15426@d.namei> (at Wed, 21 Jun 2006 23:58:56 -0400 (EDT)), James Morris <jmorris@namei.org> says:

> On Thu, 22 Jun 2006, Kelly Daly wrote:
> 
> > +	switch (bp->netchan_buf_proto) {
> > +	case __constant_htons(ETH_P_IP): {
> 
> __constant_htons and friends should not be used in runtime code, only for 
> data being initialized at compile time.

I disagree.  For "case," use __constant_{hton,ntoh}{s,l}(), please.

--yoshfuji

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
  2006-06-22  2:05             ` Kelly Daly
  2006-06-22  3:58               ` James Morris
@ 2006-07-08  0:05               ` David Miller
  1 sibling, 0 replies; 79+ messages in thread
From: David Miller @ 2006-07-08  0:05 UTC (permalink / raw
  To: kelly; +Cc: netdev, rusty

From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 22 Jun 2006 12:05:35 +1000

> > The hash table bits look good, just as they did last time :-)
> > So I'll put this part into my vj-2.6 tree now, thanks.
> Rockin' - thanks...
> 
> Sorry for the massive delay - here's the next attempt.

My review delay was just as bad if not worse :-)

> +static int sock_add_netchannel(struct sock *sk)
> +{
> +	struct netchannel *np;
> +
> +	np = kmalloc(sizeof(struct netchannel), GFP_KERNEL);
> +	if (!np)
> +		return -ENOMEM;
> +	netchannel_init(np, netchannel_wake, (void *)np);
> +	sk->sk_channel = np;
> +
> +	return 0;
> +}

This function is unreferenced entirely?  It's marked static,
so don't bother including it unless it is being used.

Fix this, give me a good changelog and a Signed-off-by: line,
and I'll stick this into the vj-2.6 tree.

Thanks!

^ permalink raw reply	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2006-07-08  0:04 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-04-28 17:02 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Caitlin Bestler
2006-04-28 17:18 ` Stephen Hemminger
2006-04-28 17:29   ` Evgeniy Polyakov
2006-04-28 17:41     ` Stephen Hemminger
2006-04-28 17:55       ` Evgeniy Polyakov
2006-04-28 19:16         ` David S. Miller
2006-04-28 19:49           ` Stephen Hemminger
2006-04-28 19:59             ` Evgeniy Polyakov
2006-04-28 22:00               ` David S. Miller
2006-04-29 13:54                 ` Evgeniy Polyakov
     [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
2006-05-01 21:32                   ` David S. Miller
2006-05-02  7:08                     ` Evgeniy Polyakov
2006-05-02  8:10                     ` [1/1] Kevent subsystem Evgeniy Polyakov
2006-04-28 19:52           ` [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Evgeniy Polyakov
2006-04-28 19:10   ` David S. Miller
2006-04-28 20:46     ` Brent Cook
2006-04-28 17:25 ` Evgeniy Polyakov
2006-04-28 19:14   ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2006-04-28 23:45 Caitlin Bestler
2006-04-28 17:55 Caitlin Bestler
2006-04-28 22:17 ` Rusty Russell
2006-04-28 22:40   ` David S. Miller
2006-04-29  0:22     ` Rusty Russell
2006-04-29  6:46       ` David S. Miller
2006-04-28 15:59 Caitlin Bestler
2006-04-28 16:12 ` Evgeniy Polyakov
2006-04-28 19:09   ` David S. Miller
2006-04-27 21:12 Caitlin Bestler
2006-04-28  6:10 ` Evgeniy Polyakov
2006-04-28  7:20   ` David S. Miller
2006-04-28  7:32     ` Evgeniy Polyakov
2006-04-28 18:20       ` David S. Miller
2006-04-28  8:24 ` Rusty Russell
2006-04-28 19:21   ` David S. Miller
2006-04-28 22:04     ` Rusty Russell
2006-04-28 22:38       ` David S. Miller
2006-04-29  0:10         ` Rusty Russell
2006-04-27  1:02 Caitlin Bestler
2006-04-27  6:08 ` David S. Miller
2006-04-27  6:17   ` Andi Kleen
2006-04-27  6:27     ` David S. Miller
2006-04-27  6:41       ` Andi Kleen
2006-04-27  7:52         ` David S. Miller
2006-04-26 22:53 Caitlin Bestler
2006-04-26 22:59 ` David S. Miller
2006-04-26 20:20 Caitlin Bestler
2006-04-26 22:35 ` David S. Miller
2006-04-26 19:30 Caitlin Bestler
2006-04-26 19:46 ` Jeff Garzik
2006-04-26 22:40   ` David S. Miller
2006-04-27  3:40 ` Rusty Russell
2006-04-27  4:58   ` James Morris
2006-04-27  6:16     ` David S. Miller
2006-04-27  6:17   ` David S. Miller
2006-04-26 16:57 Caitlin Bestler
2006-04-26 19:23 ` David S. Miller
2006-04-26 11:47 Kelly Daly
2006-04-26  7:33 ` David S. Miller
2006-04-27  3:31   ` Kelly Daly
2006-04-27  6:25     ` David S. Miller
2006-04-27 11:51       ` Evgeniy Polyakov
2006-04-27 20:09         ` David S. Miller
2006-04-28  6:05           ` Evgeniy Polyakov
2006-05-04  2:59       ` Kelly Daly
2006-05-04 23:22         ` David S. Miller
2006-05-05  1:31           ` Rusty Russell
2006-04-26  7:59 ` David S. Miller
2006-05-04  7:28   ` Kelly Daly
2006-05-04 23:11     ` David S. Miller
2006-05-05  2:48       ` Kelly Daly
2006-05-16  1:02         ` Kelly Daly
2006-05-16  1:05           ` David S. Miller
2006-05-16  1:15             ` Kelly Daly
2006-05-16  5:16           ` David S. Miller
2006-06-22  2:05             ` Kelly Daly
2006-06-22  3:58               ` James Morris
2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-08  0:05               ` David Miller
