2018-08-13 03:25:00

by Jason Wang

Subject: [RFC PATCH net-next V2 0/6] XDP rx handler

Hi:

This series tries to implement XDP support for the rx handler. This
would be useful for doing native XDP on stacked devices like macvlan,
bridge or even bond.

The idea is simple: let a stacked device register an XDP rx handler.
When the driver's XDP program returns XDP_PASS, the driver calls a new
helper, xdp_do_pass(), which tries to pass the XDP buff to the XDP rx
handler directly. The XDP rx handler may then decide how to proceed:
it can consume the buff, ask the driver to drop the packet, or ask the
driver to fall back to the normal skb path.

A sample XDP rx handler was implemented for macvlan, and virtio-net
(mergeable buffer case) was converted to call xdp_do_pass() as an
example. For ease of comparison, generic XDP support for the rx
handler was also implemented.

Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
shows about 83% improvement.

Please review.

Thanks

Jason Wang (6):
net: core: factor out generic XDP check and process routine
net: core: generic XDP support for stacked device
net: core: introduce XDP rx handler
macvlan: count the number of vlan in source mode
macvlan: basic XDP support
virtio-net: support XDP rx handler

drivers/net/macvlan.c | 189 +++++++++++++++++++++++++++++++++++++++++++--
drivers/net/virtio_net.c | 11 +++
include/linux/filter.h | 1 +
include/linux/if_macvlan.h | 1 +
include/linux/netdevice.h | 12 +++
net/core/dev.c | 69 +++++++++++++----
net/core/filter.c | 28 +++++++
7 files changed, 293 insertions(+), 18 deletions(-)

--
2.7.4



2018-08-13 03:19:00

by Jason Wang

Subject: [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine

Signed-off-by: Jason Wang <[email protected]>
---
net/core/dev.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f68122f..605c66e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4392,13 +4392,9 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
}
EXPORT_SYMBOL_GPL(do_xdp_generic);

-static int netif_rx_internal(struct sk_buff *skb)
+static int netif_do_generic_xdp(struct sk_buff *skb)
{
- int ret;
-
- net_timestamp_check(netdev_tstamp_prequeue, skb);
-
- trace_netif_rx(skb);
+ int ret = XDP_PASS;

if (static_branch_unlikely(&generic_xdp_needed_key)) {
int ret;
@@ -4408,15 +4404,28 @@ static int netif_rx_internal(struct sk_buff *skb)
ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
rcu_read_unlock();
preempt_enable();
-
- /* Consider XDP consuming the packet a success from
- * the netdev point of view we do not want to count
- * this as an error.
- */
- if (ret != XDP_PASS)
- return NET_RX_SUCCESS;
}

+ return ret;
+}
+
+static int netif_rx_internal(struct sk_buff *skb)
+{
+ int ret;
+
+ net_timestamp_check(netdev_tstamp_prequeue, skb);
+
+ trace_netif_rx(skb);
+
+ ret = netif_do_generic_xdp(skb);
+
+ /* Consider XDP consuming the packet a success from
+ * the netdev point of view we do not want to count
+ * this as an error.
+ */
+ if (ret != XDP_PASS)
+ return NET_RX_SUCCESS;
+
#ifdef CONFIG_RPS
if (static_key_false(&rps_needed)) {
struct rps_dev_flow voidflow, *rflow = &voidflow;
--
2.7.4


2018-08-13 03:19:00

by Jason Wang

Subject: [RFC PATCH net-next V2 2/6] net: core: generic XDP support for stacked device

A stacked device usually changes skb->dev to its own device and returns
RX_HANDLER_ANOTHER during rx handler processing. But we don't call the
generic XDP routine at that point, which means generic XDP can't work
for stacked devices.

Fix this by calling netif_do_generic_xdp() when the rx handler returns
RX_HANDLER_ANOTHER. This allows us to do generic XDP on stacked
devices, e.g. macvlan.

Signed-off-by: Jason Wang <[email protected]>
---
net/core/dev.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 605c66e..a77ce08 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4822,6 +4822,11 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
ret = NET_RX_SUCCESS;
goto out;
case RX_HANDLER_ANOTHER:
+ ret = netif_do_generic_xdp(skb);
+ if (ret != XDP_PASS) {
+ ret = NET_RX_SUCCESS;
+ goto out;
+ }
goto another_round;
case RX_HANDLER_EXACT:
deliver_exact = true;
--
2.7.4


2018-08-13 03:19:06

by Jason Wang

Subject: [RFC PATCH net-next V2 3/6] net: core: introduce XDP rx handler

This patch introduces an XDP rx handler. It will be used by stacked
devices that depend on the rx handler to get a fast packet processing
path based on XDP.

The idea is simple: when the XDP program returns XDP_PASS, instead of
building an skb immediately, the driver calls xdp_do_pass() to check
whether there's an XDP rx handler; if there is, it passes the XDP
buffer to the XDP rx handler first.

An XDP rx handler has two main tasks: the first is to check whether
the setup or the packet can be processed through the XDP buff
directly; the second is to run the XDP program. An XDP rx handler can
return several different results, defined by enum
rx_xdp_handler_result (typedef rx_xdp_handler_result_t):

RX_XDP_HANDLER_CONSUMED: the XDP buff was consumed.
RX_XDP_HANDLER_DROP: the XDP rx handler asks to drop the packet.
RX_XDP_HANDLER_FALLBACK: the XDP rx handler can not process the packet
(e.g. it would need cloning), and we need to fall back to the normal
skb path to deal with the packet.
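
To illustrate the contract, here is a minimal sketch of an XDP rx
handler for an imaginary "mystack" device stacked on top of @dev (the
lower device the handler is registered on). It is not part of this
patch; the mystack_* names are made up, the rest is the API added
below:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

static rx_xdp_handler_result_t mystack_handle_xdp(struct net_device *dev,
						   struct xdp_buff *xdp)
{
	const struct ethhdr *eth = xdp->data;
	struct net_device *upper;

	/* Truncated frame or multicast (which would need cloning):
	 * let the driver build an skb and take the normal skb path.
	 */
	if (xdp->data + ETH_HLEN > xdp->data_end ||
	    is_multicast_ether_addr(eth->h_dest))
		return RX_XDP_HANDLER_FALLBACK;

	/* Hypothetical lookup of the upper device owning this MAC. */
	upper = mystack_lookup_dest(dev, eth->h_dest);
	if (!upper)
		return RX_XDP_HANDLER_FALLBACK;

	/* The destination exists but is down: ask the driver to drop. */
	if (unlikely(!(upper->flags & IFF_UP)))
		return RX_XDP_HANDLER_DROP;

	/* Run the upper device's XDP program, build an skb and
	 * netif_rx() it, etc. (hypothetical), then tell the driver
	 * the buff is gone.
	 */
	mystack_receive(upper, xdp);
	return RX_XDP_HANDLER_CONSUMED;
}

The handler is registered on the lower device with
netdev_rx_xdp_handler_register(lowerdev, mystack_handle_xdp) under
rtnl_lock(), mirroring netdev_rx_handler_register().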

Consider the following configuration: a level 0 (L0) device which has
an rx handler for a level 1 (L1) device, which in turn has an rx
handler for a level 2 (L2) device.

L2 device
|
L1 device
|
L0 device

With the help of the XDP rx handler, we can attach an XDP program at
each layer, or even run a native XDP handler for L2 without an XDP
prog attached to the L1 device:

(XDP prog for L2 device)
|
L2 XDP rx handler for L1
|
(XDP prog for L1 device)
|
L1 XDP rx handler for L0
|
XDP prog for L0 device

It works like this: when the XDP program of the L0 device returns
XDP_PASS, we first try to check for and pass the XDP buff to the XDP
rx handler registered on L0, if there is one. The L1 XDP rx handler is
then called and runs the XDP program of L1. When the L1 XDP program
returns XDP_PASS, or there is no XDP program attached to L1, we call
xdp_do_pass() to pass the buff to the XDP rx handler registered on L1.
The XDP buff is thus passed to the L2 XDP rx handler, which runs the
L2 XDP program, if any. If there is no L2 XDP program, or the program
returns XDP_PASS, the handler usually builds an skb and calls
netif_rx() for a local receive. If any of the XDP rx handlers returns
RX_XDP_HANDLER_FALLBACK, control returns to the L0 device, which
builds an skb and goes through the normal rx handler path for skbs.

Signed-off-by: Jason Wang <[email protected]>
---
include/linux/filter.h | 1 +
include/linux/netdevice.h | 12 ++++++++++++
net/core/dev.c | 29 +++++++++++++++++++++++++++++
net/core/filter.c | 28 ++++++++++++++++++++++++++++
4 files changed, 70 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..7cc8e69 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -791,6 +791,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
int xdp_do_redirect(struct net_device *dev,
struct xdp_buff *xdp,
struct bpf_prog *prog);
+rx_handler_result_t xdp_do_pass(struct xdp_buff *xdp);
void xdp_do_flush_map(void);

void bpf_warn_invalid_xdp_action(u32 act);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 282e2e9..21f0a9e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -421,6 +421,14 @@ enum rx_handler_result {
typedef enum rx_handler_result rx_handler_result_t;
typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb);

+enum rx_xdp_handler_result {
+ RX_XDP_HANDLER_CONSUMED,
+ RX_XDP_HANDLER_DROP,
+ RX_XDP_HANDLER_FALLBACK,
+};
+typedef enum rx_xdp_handler_result rx_xdp_handler_result_t;
+typedef rx_xdp_handler_result_t rx_xdp_handler_func_t(struct net_device *dev,
+ struct xdp_buff *xdp);
void __napi_schedule(struct napi_struct *n);
void __napi_schedule_irqoff(struct napi_struct *n);

@@ -1898,6 +1906,7 @@ struct net_device {
struct bpf_prog __rcu *xdp_prog;
unsigned long gro_flush_timeout;
rx_handler_func_t __rcu *rx_handler;
+ rx_xdp_handler_func_t __rcu *rx_xdp_handler;
void __rcu *rx_handler_data;

#ifdef CONFIG_NET_CLS_ACT
@@ -3530,7 +3539,10 @@ bool netdev_is_rx_handler_busy(struct net_device *dev);
int netdev_rx_handler_register(struct net_device *dev,
rx_handler_func_t *rx_handler,
void *rx_handler_data);
+int netdev_rx_xdp_handler_register(struct net_device *dev,
+ rx_xdp_handler_func_t *rx_xdp_handler);
void netdev_rx_handler_unregister(struct net_device *dev);
+void netdev_rx_xdp_handler_unregister(struct net_device *dev);

bool dev_valid_name(const char *name);
int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
diff --git a/net/core/dev.c b/net/core/dev.c
index a77ce08..b4e8949 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4638,6 +4638,12 @@ bool netdev_is_rx_handler_busy(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_is_rx_handler_busy);

+static bool netdev_is_rx_xdp_handler_busy(struct net_device *dev)
+{
+ ASSERT_RTNL();
+ return dev && rtnl_dereference(dev->rx_xdp_handler);
+}
+
/**
* netdev_rx_handler_register - register receive handler
* @dev: device to register a handler for
@@ -4670,6 +4676,22 @@ int netdev_rx_handler_register(struct net_device *dev,
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_register);

+int netdev_rx_xdp_handler_register(struct net_device *dev,
+ rx_xdp_handler_func_t *rx_xdp_handler)
+{
+ if (netdev_is_rx_xdp_handler_busy(dev))
+ return -EBUSY;
+
+ if (dev->priv_flags & IFF_NO_RX_HANDLER)
+ return -EINVAL;
+
+ rcu_assign_pointer(dev->rx_xdp_handler, rx_xdp_handler);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_xdp_handler_register);
+
+
/**
* netdev_rx_handler_unregister - unregister receive handler
* @dev: device to unregister a handler from
@@ -4692,6 +4714,13 @@ void netdev_rx_handler_unregister(struct net_device *dev)
}
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);

+void netdev_rx_xdp_handler_unregister(struct net_device *dev)
+{
+ ASSERT_RTNL();
+ RCU_INIT_POINTER(dev->rx_xdp_handler, NULL);
+}
+EXPORT_SYMBOL_GPL(netdev_rx_xdp_handler_unregister);
+
/*
* Limit the use of PFMEMALLOC reserves to those protocols that implement
* the special handling of PFMEMALLOC skbs.
diff --git a/net/core/filter.c b/net/core/filter.c
index 587bbfb..9ea3797 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3312,6 +3312,34 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
}
EXPORT_SYMBOL_GPL(xdp_do_redirect);

+rx_handler_result_t xdp_do_pass(struct xdp_buff *xdp)
+{
+ rx_xdp_handler_result_t ret;
+ rx_xdp_handler_func_t *rx_xdp_handler;
+ struct net_device *dev = xdp->rxq->dev;
+
+ ret = RX_XDP_HANDLER_FALLBACK;
+ rx_xdp_handler = rcu_dereference(dev->rx_xdp_handler);
+
+ if (rx_xdp_handler) {
+ ret = rx_xdp_handler(dev, xdp);
+ switch (ret) {
+ case RX_XDP_HANDLER_CONSUMED:
+ /* Fall through */
+ case RX_XDP_HANDLER_DROP:
+ /* Fall through */
+ case RX_XDP_HANDLER_FALLBACK:
+ break;
+ default:
+ BUG();
+ break;
+ }
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(xdp_do_pass);
+
static int xdp_do_generic_redirect_map(struct net_device *dev,
struct sk_buff *skb,
struct xdp_buff *xdp,
--
2.7.4


2018-08-13 03:19:09

by Jason Wang

Subject: [RFC PATCH net-next V2 4/6] macvlan: count the number of vlan in source mode

This patch counts the number of vlans in source mode. This will be
used for implementing the XDP rx handler for macvlan.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvlan.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index cfda146..b7c814d 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -53,6 +53,7 @@ struct macvlan_port {
struct hlist_head vlan_source_hash[MACVLAN_HASH_SIZE];
DECLARE_BITMAP(mc_filter, MACVLAN_MC_FILTER_SZ);
unsigned char perm_addr[ETH_ALEN];
+ unsigned long source_count;
};

struct macvlan_source_entry {
@@ -1433,6 +1434,9 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
if (err)
goto unregister_netdev;

+ if (vlan->mode == MACVLAN_MODE_SOURCE)
+ port->source_count++;
+
list_add_tail_rcu(&vlan->list, &port->vlans);
netif_stacked_transfer_operstate(lowerdev, dev);
linkwatch_fire_event(dev);
@@ -1477,6 +1481,7 @@ static int macvlan_changelink(struct net_device *dev,
struct netlink_ext_ack *extack)
{
struct macvlan_dev *vlan = netdev_priv(dev);
+ struct macvlan_port *port = vlan->port;
enum macvlan_mode mode;
bool set_mode = false;
enum macvlan_macaddr_mode macmode;
@@ -1491,8 +1496,10 @@ static int macvlan_changelink(struct net_device *dev,
(vlan->mode == MACVLAN_MODE_PASSTHRU))
return -EINVAL;
if (vlan->mode == MACVLAN_MODE_SOURCE &&
- vlan->mode != mode)
+ vlan->mode != mode) {
macvlan_flush_sources(vlan->port, vlan);
+ port->source_count--;
+ }
}

if (data && data[IFLA_MACVLAN_FLAGS]) {
@@ -1510,8 +1517,13 @@ static int macvlan_changelink(struct net_device *dev,
}
vlan->flags = flags;
}
- if (set_mode)
+ if (set_mode) {
+ if (mode == MACVLAN_MODE_SOURCE &&
+ vlan->mode != MACVLAN_MODE_SOURCE) {
+ port->source_count++;
+ }
vlan->mode = mode;
+ }
if (data && data[IFLA_MACVLAN_MACADDR_MODE]) {
if (vlan->mode != MACVLAN_MODE_SOURCE)
return -EINVAL;
--
2.7.4


2018-08-13 03:19:17

by Jason Wang

Subject: [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler

This patch adds support for the XDP rx handler to virtio-net. This is
straightforward: just call xdp_do_pass() and behave according to its
return value.

The test was done by using XDP_DROP (xdp1) for macvlan on top of
virtio-net. PPS of skb mode was ~1.2Mpps while PPS of native XDP mode
was ~2.2Mpps. About an 83% improvement was measured.

Note: for the RFC, only the mergeable buffer case was implemented.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/virtio_net.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62311dd..1e22ad9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -777,6 +777,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
rcu_read_lock();
xdp_prog = rcu_dereference(rq->xdp_prog);
if (xdp_prog) {
+ rx_xdp_handler_result_t ret;
struct xdp_frame *xdpf;
struct page *xdp_page;
struct xdp_buff xdp;
@@ -825,6 +826,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,

switch (act) {
case XDP_PASS:
+ ret = xdp_do_pass(&xdp);
+ if (ret == RX_XDP_HANDLER_DROP)
+ goto drop;
+ if (ret != RX_XDP_HANDLER_FALLBACK) {
+ if (unlikely(xdp_page != page))
+ put_page(page);
+ rcu_read_unlock();
+ goto xdp_xmit;
+ }
/* recalculate offset to account for any header
* adjustments. Note other cases do not build an
* skb and avoid using offset
@@ -881,6 +891,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
case XDP_ABORTED:
trace_xdp_exception(vi->dev, xdp_prog, act);
/* fall through */
+drop:
case XDP_DROP:
if (unlikely(xdp_page != page))
__free_pages(xdp_page, 0);
--
2.7.4


2018-08-13 03:42:10

by Jason Wang

Subject: [RFC PATCH net-next V2 5/6] macvlan: basic XDP support

This patch implements basic XDP support for macvlan. The
implementation is split into two parts:

1) XDP rx handler on the underlying device:

We register an XDP rx handler (macvlan_handle_xdp) on the underlying
device. In this handler, the following cases go to the slow path
(RX_XDP_HANDLER_FALLBACK):

- The packet is a multicast packet.
- A vlan is in source mode.
- The destination mac address does not match any vlan.

If none of the above cases applies, we can go the XDP path directly:
we switch to the destination vlan device and process the XDP buff
there (see 2 below).

2) XDP processing on the macvlan device:

If we find a destination vlan, we try to run its XDP prog. If the XDP
prog returns XDP_PASS, we call xdp_do_pass() to pass the buff to an
upper layer XDP rx handler. This is needed for e.g. macvtap to work.
If RX_XDP_HANDLER_FALLBACK is returned, we build an skb and call
netif_rx() to finish the receive. Otherwise we just return the result
to the lower device. For XDP_TX, we build an skb and use the generic
XDP transmission routine for simplicity. This could be optimized on
top.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvlan.c | 173 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/if_macvlan.h | 1 +
2 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index b7c814d..42b747c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -34,6 +34,7 @@
#include <net/rtnetlink.h>
#include <net/xfrm.h>
#include <linux/netpoll.h>
+#include <linux/bpf.h>

#define MACVLAN_HASH_BITS 8
#define MACVLAN_HASH_SIZE (1<<MACVLAN_HASH_BITS)
@@ -436,6 +437,122 @@ static void macvlan_forward_source(struct sk_buff *skb,
}
}

+struct sk_buff *macvlan_xdp_build_skb(struct net_device *dev,
+ struct xdp_buff *xdp)
+{
+ int len;
+ int buflen = xdp->data_end - xdp->data_hard_start;
+ int headroom = xdp->data - xdp->data_hard_start;
+ struct sk_buff *skb;
+
+ len = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + headroom +
+ SKB_DATA_ALIGN(buflen);
+
+ skb = build_skb(xdp->data_hard_start, len);
+ if (!skb)
+ return NULL;
+
+ skb_reserve(skb, headroom);
+ __skb_put(skb, xdp->data_end - xdp->data);
+
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->dev = dev;
+
+ return skb;
+}
+
+static rx_xdp_handler_result_t macvlan_receive_xdp(struct net_device *dev,
+ struct xdp_buff *xdp)
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+ struct bpf_prog *xdp_prog;
+ struct sk_buff *skb;
+ u32 act = XDP_PASS;
+ rx_xdp_handler_result_t ret;
+ int err;
+
+ rcu_read_lock();
+ xdp_prog = rcu_dereference(vlan->xdp_prog);
+
+ if (xdp_prog)
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
+
+ switch (act) {
+ case XDP_PASS:
+ ret = xdp_do_pass(xdp);
+ if (ret != RX_XDP_HANDLER_FALLBACK) {
+ rcu_read_unlock();
+ return ret;
+ }
+ skb = macvlan_xdp_build_skb(dev, xdp);
+ if (!skb) {
+ act = XDP_DROP;
+ break;
+ }
+ rcu_read_unlock();
+ macvlan_count_rx(vlan, skb->len, true, false);
+ netif_rx(skb);
+ goto out;
+ case XDP_TX:
+ skb = macvlan_xdp_build_skb(dev, xdp);
+ if (!skb) {
+ act = XDP_DROP;
+ break;
+ }
+ generic_xdp_tx(skb, xdp_prog);
+ break;
+ case XDP_REDIRECT:
+ err = xdp_do_redirect(dev, xdp, xdp_prog);
+ xdp_do_flush_map();
+ if (err)
+ act = XDP_DROP;
+ break;
+ case XDP_DROP:
+ break;
+ default:
+ bpf_warn_invalid_xdp_action(act);
+ break;
+ }
+
+ rcu_read_unlock();
+out:
+ if (act == XDP_DROP)
+ return RX_XDP_HANDLER_DROP;
+
+ return RX_XDP_HANDLER_CONSUMED;
+}
+
+/* called under rcu_read_lock() from XDP handler */
+static rx_xdp_handler_result_t macvlan_handle_xdp(struct net_device *dev,
+ struct xdp_buff *xdp)
+{
+ const struct ethhdr *eth = (const struct ethhdr *)xdp->data;
+ struct macvlan_port *port;
+ struct macvlan_dev *vlan;
+
+ if (is_multicast_ether_addr(eth->h_dest))
+ return RX_XDP_HANDLER_FALLBACK;
+
+ port = macvlan_port_get_rcu(dev);
+ if (port->source_count)
+ return RX_XDP_HANDLER_FALLBACK;
+
+ if (macvlan_passthru(port))
+ vlan = list_first_or_null_rcu(&port->vlans,
+ struct macvlan_dev, list);
+ else
+ vlan = macvlan_hash_lookup(port, eth->h_dest);
+
+ if (!vlan)
+ return RX_XDP_HANDLER_FALLBACK;
+
+ dev = vlan->dev;
+ if (unlikely(!(dev->flags & IFF_UP)))
+ return RX_XDP_HANDLER_DROP;
+
+ return macvlan_receive_xdp(dev, xdp);
+}
+
/* called under rcu_read_lock() from netif_receive_skb */
static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
{
@@ -1089,6 +1206,44 @@ static int macvlan_dev_get_iflink(const struct net_device *dev)
return vlan->lowerdev->ifindex;
}

+static int macvlan_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+ struct netlink_ext_ack *extack)
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+ struct bpf_prog *old_prog = rtnl_dereference(vlan->xdp_prog);
+
+ rcu_assign_pointer(vlan->xdp_prog, prog);
+
+ if (old_prog)
+ bpf_prog_put(old_prog);
+
+ return 0;
+}
+
+static u32 macvlan_xdp_query(struct net_device *dev)
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+ const struct bpf_prog *xdp_prog = rtnl_dereference(vlan->xdp_prog);
+
+ if (xdp_prog)
+ return xdp_prog->aux->id;
+
+ return 0;
+}
+
+static int macvlan_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ switch (xdp->command) {
+ case XDP_SETUP_PROG:
+ return macvlan_xdp_set(dev, xdp->prog, xdp->extack);
+ case XDP_QUERY_PROG:
+ xdp->prog_id = macvlan_xdp_query(dev);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
static const struct ethtool_ops macvlan_ethtool_ops = {
.get_link = ethtool_op_get_link,
.get_link_ksettings = macvlan_ethtool_get_link_ksettings,
@@ -1121,6 +1276,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
#endif
.ndo_get_iflink = macvlan_dev_get_iflink,
.ndo_features_check = passthru_features_check,
+ .ndo_bpf = macvlan_xdp,
};

void macvlan_common_setup(struct net_device *dev)
@@ -1173,10 +1329,20 @@ static int macvlan_port_create(struct net_device *dev)
INIT_WORK(&port->bc_work, macvlan_process_broadcast);

err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
- if (err)
+ if (err) {
kfree(port);
- else
- dev->priv_flags |= IFF_MACVLAN_PORT;
+ goto out;
+ }
+
+ err = netdev_rx_xdp_handler_register(dev, macvlan_handle_xdp);
+ if (err) {
+ netdev_rx_handler_unregister(dev);
+ kfree(port);
+ goto out;
+ }
+
+ dev->priv_flags |= IFF_MACVLAN_PORT;
+out:
return err;
}

@@ -1187,6 +1353,7 @@ static void macvlan_port_destroy(struct net_device *dev)

dev->priv_flags &= ~IFF_MACVLAN_PORT;
netdev_rx_handler_unregister(dev);
+ netdev_rx_xdp_handler_unregister(dev);

/* After this point, no packet can schedule bc_work anymore,
* but we need to cancel it and purge left skbs if any.
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 2e55e4c..7c7059b 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
#ifdef CONFIG_NET_POLL_CONTROLLER
struct netpoll *netpoll;
#endif
+ struct bpf_prog __rcu *xdp_prog;
};

static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
--
2.7.4


2018-08-14 00:34:01

by Alexei Starovoitov

Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
> Hi:
>
> This series tries to implement XDP support for rx hanlder. This would
> be useful for doing native XDP on stacked device like macvlan, bridge
> or even bond.
>
> The idea is simple, let stacked device register a XDP rx handler. And
> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
> which will try to pass XDP buff to XDP rx handler directly. XDP rx
> handler may then decide how to proceed, it could consume the buff, ask
> driver to drop the packet or ask the driver to fallback to normal skb
> path.
>
> A sample XDP rx handler was implemented for macvlan. And virtio-net
> (mergeable buffer case) was converted to call xdp_do_pass() as an
> example. For ease comparision, generic XDP support for rx handler was
> also implemented.
>
> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
> shows about 83% improvement.

I'm missing the motivation for this.
It seems the performance of such a solution is ~1M packets per second.
What would be a real life use case for such a feature?

Another concern is that XDP users expect to get line rate performance
and native XDP delivers it. 'generic XDP' is a fallback only
mechanism to operate on NICs that don't have native XDP yet.
Toshiaki's veth XDP work fits XDP philosophy and allows
high speed networking to be done inside containers after veth.
It's trying to get to line rate inside container.
This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
and the users will get confused with forever slow modes of XDP.

Please explain the problem you're trying to solve.
"look, here I can do XDP on top of macvlan" is not an explanation of the problem.


2018-08-14 08:18:00

by Jason Wang

Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-14 08:32, Alexei Starovoitov wrote:
> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to implement XDP support for rx hanlder. This would
>> be useful for doing native XDP on stacked device like macvlan, bridge
>> or even bond.
>>
>> The idea is simple, let stacked device register a XDP rx handler. And
>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>> handler may then decide how to proceed, it could consume the buff, ask
>> driver to drop the packet or ask the driver to fallback to normal skb
>> path.
>>
>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>> example. For ease comparision, generic XDP support for rx handler was
>> also implemented.
>>
>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>> shows about 83% improvement.
> I'm missing the motiviation for this.
> It seems performance of such solution is ~1M packet per second.

Note that this was measured with virtio-net, which is kind of slow.

> What would be a real life use case for such feature ?

I had another run on top of 10G mlx4 and macvlan:

XDP_DROP on mlx4: 14.0Mpps
XDP_DROP on macvlan: 10.05Mpps

Perf shows macvlan_hash_lookup() and the indirect call to
macvlan_handle_xdp() are the reasons for the drop in numbers. I think
the numbers are acceptable, and we could try more optimizations on top.

So the real life use case here is to have a fast XDP path for
rx-handler-based devices:

- For containers, we can run XDP for macvlan (~70% of wire speed).
This allows a container-specific policy.
- For VMs, we can implement a macvtap XDP rx handler on top. This
allows us to forward packets to the VM without building an skb in the
macvtap setup.
- The idea could be used by other rx-handler-based devices like
bridge; we may get an XDP fast forwarding path for bridge.

>
> Another concern is that XDP users expect to get line rate performance
> and native XDP delivers it. 'generic XDP' is a fallback only
> mechanism to operate on NICs that don't have native XDP yet.

So I can replace generic XDP TX routine with a native one for macvlan.

> Toshiaki's veth XDP work fits XDP philosophy and allows
> high speed networking to be done inside containers after veth.
> It's trying to get to line rate inside container.

This is one of the goals of this series as well. I agree the veth XDP
work looks pretty fine, but I believe it only works for a specific
setup, since it depends on XDP_REDIRECT, which is supported by few
drivers (and there's no VF driver support). And in order to make it
work for an end user, the XDP program still needs logic like a
hash (map) lookup to determine the destination veth.

> This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
> and the users will get confused with forever slow modes of XDP.
>
> Please explain the problem you're trying to solve.
> "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
>

Thanks

2018-08-14 09:23:56

by Jesper Dangaard Brouer

Subject: Re: [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler

On Mon, 13 Aug 2018 11:17:30 +0800
Jason Wang <[email protected]> wrote:

> This patch tries to add the support of XDP rx handler to
> virtio-net. This is straight-forward, just call xdp_do_pass() and
> behave depends on its return value.
>
> Test was done by using XDP_DROP (xdp1) for macvlan on top of
> virtio-net. PPS of SKB mode was ~1.2Mpps while PPS of native XDP mode
> was ~2.2Mpps. About 83% improvement was measured.

I'm not convinced...

Why are you not using XDP_REDIRECT, which is already implemented in
receive_mergeable (which you modify below)?

The macvlan driver just needs to implement ndo_xdp_xmit(), and then you
can redirect (with an XDP prog, from the physical driver into the
guest). It should be much faster...


> Notes: for RFC, only mergeable buffer case was implemented.
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/net/virtio_net.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 62311dd..1e22ad9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -777,6 +777,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> rcu_read_lock();
> xdp_prog = rcu_dereference(rq->xdp_prog);
> if (xdp_prog) {
> + rx_xdp_handler_result_t ret;
> struct xdp_frame *xdpf;
> struct page *xdp_page;
> struct xdp_buff xdp;
> @@ -825,6 +826,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>
> switch (act) {
> case XDP_PASS:
> + ret = xdp_do_pass(&xdp);
> + if (ret == RX_XDP_HANDLER_DROP)
> + goto drop;
> + if (ret != RX_XDP_HANDLER_FALLBACK) {
> + if (unlikely(xdp_page != page))
> + put_page(page);
> + rcu_read_unlock();
> + goto xdp_xmit;
> + }
> /* recalculate offset to account for any header
> * adjustments. Note other cases do not build an
> * skb and avoid using offset
> @@ -881,6 +891,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> case XDP_ABORTED:
> trace_xdp_exception(vi->dev, xdp_prog, act);
> /* fall through */
> +drop:
> case XDP_DROP:
> if (unlikely(xdp_page != page))
> __free_pages(xdp_page, 0);



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2018-08-14 10:29:17

by Jesper Dangaard Brouer

Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On Tue, 14 Aug 2018 15:59:01 +0800
Jason Wang <[email protected]> wrote:

> On 2018-08-14 08:32, Alexei Starovoitov wrote:
> > On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> This series tries to implement XDP support for rx hanlder. This would
> >> be useful for doing native XDP on stacked device like macvlan, bridge
> >> or even bond.
> >>
> >> The idea is simple, let stacked device register a XDP rx handler. And
> >> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
> >> which will try to pass XDP buff to XDP rx handler directly. XDP rx
> >> handler may then decide how to proceed, it could consume the buff, ask
> >> driver to drop the packet or ask the driver to fallback to normal skb
> >> path.
> >>
> >> A sample XDP rx handler was implemented for macvlan. And virtio-net
> >> (mergeable buffer case) was converted to call xdp_do_pass() as an
> >> example. For ease comparision, generic XDP support for rx handler was
> >> also implemented.
> >>
> >> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
> >> shows about 83% improvement.
> > I'm missing the motiviation for this.
> > It seems performance of such solution is ~1M packet per second.
>
> Notice it was measured by virtio-net which is kind of slow.
>
> > What would be a real life use case for such feature ?
>
> I had another run on top of 10G mlx4 and macvlan:
>
> XDP_DROP on mlx4: 14.0Mpps
> XDP_DROP on macvlan: 10.05Mpps
>
> Perf shows macvlan_hash_lookup() and indirect call to
> macvlan_handle_xdp() are the reasons for the number drop. I think the
> numbers are acceptable. And we could try more optimizations on top.
>
> So here's real life use case is trying to have an fast XDP path for rx
> handler based device:
>
> - For containers, we can run XDP for macvlan (~70% of wire speed). This
> allows a container specific policy.
> - For VM, we can implement macvtap XDP rx handler on top. This allow us
> to forward packet to VM without building skb in the setup of macvtap.
> - The idea could be used by other rx handler based device like bridge,
> we may have a XDP fast forwarding path for bridge.
>
> >
> > Another concern is that XDP users expect to get line rate performance
> > and native XDP delivers it. 'generic XDP' is a fallback only
> > mechanism to operate on NICs that don't have native XDP yet.
>
> So I can replace generic XDP TX routine with a native one for macvlan.

If you simply implement ndo_xdp_xmit() for macvlan, and instead use
XDP_REDIRECT, then we are basically done.


> > Toshiaki's veth XDP work fits XDP philosophy and allows
> > high speed networking to be done inside containers after veth.
> > It's trying to get to line rate inside container.
>
> This is one of the goal of this series as well. I agree veth XDP work
> looks pretty fine, but it only work for a specific setup I believe since
> it depends on XDP_REDIRECT which is supported by few drivers (and
> there's no VF driver support).

The XDP_REDIRECT (RX-side) is trivial to add to drivers. It is a bad
argument that only a few drivers implement this. Especially since all
drivers also need to be extended with your proposed xdp_do_pass() call.

(rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
have to allocate HW TX-queue resources. If we disconnect RX and TX
side of redirect, then we can implement RX-side in an afternoon.


> And in order to make it work for a end
> user, the XDP program still need logic like hash(map) lookup to
> determine the destination veth.

That _is_ the general idea behind XDP and eBPF: we need to add logic
that determines the destination. The kernel provides the basic
mechanisms for moving/redirecting packets fast, and someone else
builds an orchestration tool like Cilium that adds the needed logic.

Did you notice that we (Ahern) added bpf_fib_lookup, a FIB route
lookup accessible from XDP?

For macvlan, I imagine that we could add a BPF helper that allows you
to lookup/call macvlan_hash_lookup().
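
Purely as a sketch of the idea (nothing like this exists today, and
macvlan_hash_lookup()/macvlan_port_get_rcu() are static to
drivers/net/macvlan.c, so they would have to be exposed somehow), the
kernel side of such a helper could look roughly like:

/* Hypothetical helper: return the ifindex of the macvlan owning the
 * destination MAC of the frame in @xdp, or 0 if there is none.
 */
BPF_CALL_1(bpf_xdp_macvlan_lookup, struct xdp_buff *, xdp)
{
	const struct ethhdr *eth = xdp->data;
	struct macvlan_port *port;
	struct macvlan_dev *vlan;

	if (xdp->data + ETH_HLEN > xdp->data_end)
		return 0;
	if (!netif_is_macvlan_port(xdp->rxq->dev))
		return 0;

	port = macvlan_port_get_rcu(xdp->rxq->dev);
	vlan = macvlan_hash_lookup(port, eth->h_dest);

	return vlan ? vlan->dev->ifindex : 0;
}

static const struct bpf_func_proto bpf_xdp_macvlan_lookup_proto = {
	.func		= bpf_xdp_macvlan_lookup,
	.gpl_only	= true,
	.ret_type	= RET_INTEGER,
	.arg1_type	= ARG_PTR_TO_CTX,
};

The XDP program on the lower device would then act on the returned
ifindex, e.g. look it up in a devmap and redirect.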


> > This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
> > and the users will get confused with forever slow modes of XDP.
> >
> > Please explain the problem you're trying to solve.
> > "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
> >


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2018-08-14 13:21:46

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-14 18:17, Jesper Dangaard Brouer wrote:
> On Tue, 14 Aug 2018 15:59:01 +0800
> Jason Wang <[email protected]> wrote:
>
>> On 2018-08-14 08:32, Alexei Starovoitov wrote:
>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to implement XDP support for rx hanlder. This would
>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>> or even bond.
>>>>
>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>> path.
>>>>
>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>> example. For ease comparision, generic XDP support for rx handler was
>>>> also implemented.
>>>>
>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>> shows about 83% improvement.
>>> I'm missing the motiviation for this.
>>> It seems performance of such solution is ~1M packet per second.
>> Notice it was measured by virtio-net which is kind of slow.
>>
>>> What would be a real life use case for such feature ?
>> I had another run on top of 10G mlx4 and macvlan:
>>
>> XDP_DROP on mlx4: 14.0Mpps
>> XDP_DROP on macvlan: 10.05Mpps
>>
>> Perf shows macvlan_hash_lookup() and indirect call to
>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>> numbers are acceptable. And we could try more optimizations on top.
>>
>> So here's real life use case is trying to have an fast XDP path for rx
>> handler based device:
>>
>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>> allows a container specific policy.
>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>> to forward packet to VM without building skb in the setup of macvtap.
>> - The idea could be used by other rx handler based device like bridge,
>> we may have a XDP fast forwarding path for bridge.
>>
>>> Another concern is that XDP users expect to get line rate performance
>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>> mechanism to operate on NICs that don't have native XDP yet.
>> So I can replace generic XDP TX routine with a native one for macvlan.
> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
> XDP_REDIRECT, then we are basically done.

As I replied in another thread, this is probably not true. Its
ndo_xdp_xmit() just needs to call the underlying device's
ndo_xdp_xmit(), except for the case of bridge mode.
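
Something like this rough sketch (in drivers/net/macvlan.c) is what I
have in mind, assuming the current bulk ndo_xdp_xmit() signature in
net-next; bridge mode and error handling are left out:

#include <linux/if_macvlan.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

/* Sketch only: forward XDP frames of a non-bridge-mode macvlan
 * straight to the lower device's own ndo_xdp_xmit().
 */
static int macvlan_xdp_xmit(struct net_device *dev, int n,
			    struct xdp_frame **frames, u32 flags)
{
	struct macvlan_dev *vlan = netdev_priv(dev);
	struct net_device *lowerdev = vlan->lowerdev;

	if (!lowerdev->netdev_ops->ndo_xdp_xmit)
		return -EOPNOTSUPP;

	return lowerdev->netdev_ops->ndo_xdp_xmit(lowerdev, n, frames, flags);
}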

>
>
>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>> high speed networking to be done inside containers after veth.
>>> It's trying to get to line rate inside container.
>> This is one of the goal of this series as well. I agree veth XDP work
>> looks pretty fine, but it only work for a specific setup I believe since
>> it depends on XDP_REDIRECT which is supported by few drivers (and
>> there's no VF driver support).
> The XDP_REDIRECT (RX-side) is trivial to add to drivers. It is a bad
> argument that only a few drivers implement this. Especially since all
> drivers also need to be extended with your proposed xdp_do_pass() call.
>
> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
> have to allocate HW TX-queue resources. If we disconnect RX and TX
> side of redirect, then we can implement RX-side in an afternoon.

That's exactly the point: ndo_xdp_xmit() may require per-CPU TX
queues, which breaks the assumptions of some drivers. And since we
don't disconnect RX and TX, it looks to me like a partial
implementation is even worse: consider that a user could redirect from
mlx4 to ixgbe but not from ixgbe to mlx4.

>
>
>> And in order to make it work for a end
>> user, the XDP program still need logic like hash(map) lookup to
>> determine the destination veth.
> That _is_ the general idea behind XDP and eBPF, that we need to add logic
> that determine the destination. The kernel provides the basic
> mechanisms for moving/redirecting packets fast, and someone else
> builds an orchestration tool like Cilium, that adds the needed logic.

Yes, so my reply was about the performance concern. I meant that,
either way, the hash lookup keeps it from hitting wire speed.

>
> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
> accessible from XDP.

Yes.

>
> For macvlan, I imagine that we could add a BPF helper that allows you
> to lookup/call macvlan_hash_lookup().

That's true, but we still need a method to feed macvlan with the XDP
buff. I'm not sure whether this could be treated as another kind of
redirection, but ndo_xdp_xmit() certainly could not be used for this
case. Compared to redirection, the XDP rx handler has its own
advantages:

1) It uses the existing APIs and userspace tools to set up the network
topology instead of inventing new tools and a specific API. This means
a user can just set up macvlan (macvtap, bridge or other) as usual and
simply attach XDP programs to both macvlan and its underlying device.
2) It eases the processing of complex logic: XDP can not do cloning or
reference counting. We can leave those cases to the normal networking
stack, which deals with such packets seamlessly. I believe this is one
of the advantages of XDP; it lets us focus on the fast path and
greatly simplifies the code.

Like ndo_xdp_xmit(), the XDP rx handler is used to feed an rx handler
with an XDP buff. It's just another basic mechanism; policy is still
decided by the XDP program itself.

Thanks

>
>
>>> This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
>>> and the users will get confused with forever slow modes of XDP.
>>>
>>> Please explain the problem you're trying to solve.
>>> "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
>>>
>


2018-08-14 14:05:02

by Jason Wang

Subject: Re: [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler



On 2018-08-14 17:22, Jesper Dangaard Brouer wrote:
> On Mon, 13 Aug 2018 11:17:30 +0800
> Jason Wang<[email protected]> wrote:
>
>> This patch tries to add the support of XDP rx handler to
>> virtio-net. This is straight-forward, just call xdp_do_pass() and
>> behave depends on its return value.
>>
>> Test was done by using XDP_DROP (xdp1) for macvlan on top of
>> virtio-net. PPS of SKB mode was ~1.2Mpps while PPS of native XDP mode
>> was ~2.2Mpps. About 83% improvement was measured.
> I'm not convinced...
>
> Why are you not using XDP_REDIRECT, which is already implemented in
> receive_mergeable (which you modify below).
>
> The macvlan driver just need to implement ndo_xdp_xmit(), and then you
> can redirect (with XDP prog from physical driver into the guest). It
> should be much faster...
>
>

Macvlan is different from macvtap. For host RX, macvtap delivers the
packet to a pointer ring which can be accessed through a socket, but
macvlan delivers the packet to the normal networking stack. As an
example of the XDP rx handler, this series just tries to make native
XDP work for macvlan; the macvtap path will still go through skbs (but
it's not hard to add it on top).

Consider the case of fast forwarding between host and guest. For TAP,
XDP_REDIRECT works perfectly since, from the host's point of view,
host RX is guest TX and guest RX is host TX. But for macvtap, which is
based on macvlan, transmitting a packet to macvtap/macvlan means
transmitting it to the underlying device, which is either a physical
NIC or another macvlan device. That's why we can't use XDP_REDIRECT
with ndo_xdp_xmit() here.

Thanks


2018-08-14 14:09:20

by David Ahern

Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On 8/14/18 7:20 AM, Jason Wang wrote:
>
>
> On 2018-08-14 18:17, Jesper Dangaard Brouer wrote:
>> On Tue, 14 Aug 2018 15:59:01 +0800
>> Jason Wang <[email protected]> wrote:
>>
>>> On 2018-08-14 08:32, Alexei Starovoitov wrote:
>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>> Hi:
>>>>>
>>>>> This series tries to implement XDP support for rx hanlder. This would
>>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>>> or even bond.
>>>>>
>>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>>> path.
>>>>>
>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>> example. For ease comparision, generic XDP support for rx handler was
>>>>> also implemented.
>>>>>
>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>>> shows about 83% improvement.
>>>> I'm missing the motiviation for this.
>>>> It seems performance of such solution is ~1M packet per second.
>>> Notice it was measured by virtio-net which is kind of slow.
>>>
>>>> What would be a real life use case for such feature ?
>>> I had another run on top of 10G mlx4 and macvlan:
>>>
>>> XDP_DROP on mlx4: 14.0Mpps
>>> XDP_DROP on macvlan: 10.05Mpps
>>>
>>> Perf shows macvlan_hash_lookup() and indirect call to
>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>> numbers are acceptable. And we could try more optimizations on top.
>>>
>>> So here's real life use case is trying to have an fast XDP path for rx
>>> handler based device:
>>>
>>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>>> allows a container specific policy.
>>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>>> to forward packet to VM without building skb in the setup of macvtap.
>>> - The idea could be used by other rx handler based device like bridge,
>>> we may have a XDP fast forwarding path for bridge.
>>>
>>>> Another concern is that XDP users expect to get line rate performance
>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>> mechanism to operate on NICs that don't have native XDP yet.
>>> So I can replace generic XDP TX routine with a native one for macvlan.
>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>> XDP_REDIRECT, then we are basically done.
>
> As I replied in another thread this probably not true. Its
> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
> except for the case of bridge mode.
>
>>
>>
>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>> high speed networking to be done inside containers after veth.
>>>> It's trying to get to line rate inside container.
>>> This is one of the goal of this series as well. I agree veth XDP work
>>> looks pretty fine, but it only work for a specific setup I believe since
>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>> there's no VF driver support).
>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>> argument that only a few drivers implement this.  Especially since all
>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>
>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>> side of redirect, then we can implement RX-side in an afternoon.
>
> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
> which breaks assumptions of some drivers. And since we don't disconnect
> RX and TX, it looks to me the partial implementation is even worse?
> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>
>>
>>
>>> And in order to make it work for a end
>>> user, the XDP program still need logic like hash(map) lookup to
>>> determine the destination veth.
>> That _is_ the general idea behind XDP and eBPF, that we need to add logic
>> that determine the destination.  The kernel provides the basic
>> mechanisms for moving/redirecting packets fast, and someone else
>> builds an orchestration tool like Cilium, that adds the needed logic.
>
> Yes, so my reply is for the concern about performance. I meant anyway
> the hash lookup will make it not hit the wire speed.
>
>>
>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>> accessible from XDP.
>
> Yes.
>
>>
>> For macvlan, I imagine that we could add a BPF helper that allows you
>> to lookup/call macvlan_hash_lookup().
>
> That's true but we still need a method to feed macvlan with XDP buff.
> I'm not sure if this could be treated as another kind of redirection,
> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
> redirection, XDP rx handler has its own advantages:
>
> 1) Use the exist API and userspace to setup the network topology instead
> of inventing new tools and its own specific API. This means user can
> just setup macvlan (macvtap, bridge or other) as usual and simply attach
> XDP programs to both macvlan and its under layer device.
> 2) Ease the processing of complex logic, XDP can not do cloning or
> reference counting. We can differ those cases and let normal networking
> stack to deal with such packets seamlessly. I believe this is one of the
> advantage of XDP. This makes us to focus on the fast path and greatly
> simplify the codes.
>
> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
> buff. It's just another basic mechanism. Policy is still done by XDP
> program itself.
>

I have been looking into handling stacked devices via lookup helper
functions. The idea is that a program only needs to be installed on the
root netdev (i.e., the one representing the physical port), and it can
use helpers to create an efficient pipeline to decide what to do with
the packet in the presence of stacked devices.

For example, anyone doing pure L3 could do:

{port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...

--> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT

port is the netdev associated with the ingress_ifindex in the xdp_md
context, vlan is the vlan in the packet or the assigned PVID if
relevant. From there l2dev could be a bond or bridge device for example,
and l3dev is the one with a network address (vlan netdev, bond netdev, etc).

I have L3 forwarding working for vlan devices and bonds. I had not
considered macvlans specifically yet, but it should be straightforward
to add.
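
For the [ l3 forward lookup ] / [ header rewrite ] / XDP_REDIRECT end
of that pipeline, a rough, untested sketch (IPv4 only, no vlan or
stacked-device handling yet, along the lines of
samples/bpf/xdp_fwd_kern.c, assuming the usual bpf_helpers.h stubs from
samples/selftests) would be:

#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include "bpf_helpers.h"

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("xdp_l3_fwd")
int xdp_l3_fwd_prog(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct bpf_fib_lookup fib_params = {};
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if (data + sizeof(*eth) + sizeof(*iph) > data_end)
		return XDP_PASS;
	if (eth->h_proto != __constant_htons(ETH_P_IP))
		return XDP_PASS;

	iph = data + sizeof(*eth);

	/* [ l3 forward lookup ]: ask the FIB for the egress device and
	 * the neighbour addresses based on the packet headers.
	 */
	fib_params.family	= AF_INET;
	fib_params.tos		= iph->tos;
	fib_params.l4_protocol	= iph->protocol;
	fib_params.tot_len	= __constant_ntohs(iph->tot_len);
	fib_params.ipv4_src	= iph->saddr;
	fib_params.ipv4_dst	= iph->daddr;
	fib_params.ifindex	= ctx->ingress_ifindex;

	if (bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), 0) !=
	    BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;

	/* [ header rewrite ], then redirect to the egress device. */
	__builtin_memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
	return bpf_redirect(fib_params.ifindex, 0);
}

char _license[] SEC("license") = "GPL";

The interesting part for this discussion is what goes in front of the
FIB lookup: the [ find l2dev ] / [ find l3dev ] steps would be new
lookup helpers.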


2018-08-15 00:31:04

by Jason Wang

Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-14 22:03, David Ahern wrote:
> On 8/14/18 7:20 AM, Jason Wang wrote:
>>
>> On 2018-08-14 18:17, Jesper Dangaard Brouer wrote:
>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>> Jason Wang <[email protected]> wrote:
>>>
>>>> On 2018-08-14 08:32, Alexei Starovoitov wrote:
>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> This series tries to implement XDP support for rx hanlder. This would
>>>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>>>> or even bond.
>>>>>>
>>>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>>>> path.
>>>>>>
>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>> example. For ease comparision, generic XDP support for rx handler was
>>>>>> also implemented.
>>>>>>
>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>>>> shows about 83% improvement.
>>>>> I'm missing the motiviation for this.
>>>>> It seems performance of such solution is ~1M packet per second.
>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>
>>>>> What would be a real life use case for such feature ?
>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>
>>>> XDP_DROP on mlx4: 14.0Mpps
>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>
>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>
>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>> handler based device:
>>>>
>>>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>>>> allows a container specific policy.
>>>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>> - The idea could be used by other rx handler based device like bridge,
>>>> we may have a XDP fast forwarding path for bridge.
>>>>
>>>>> Another concern is that XDP users expect to get line rate performance
>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>> XDP_REDIRECT, then we are basically done.
>> As I replied in another thread this probably not true. Its
>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>> except for the case of bridge mode.
>>
>>>
>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>> high speed networking to be done inside containers after veth.
>>>>> It's trying to get to line rate inside container.
>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>> looks pretty fine, but it only work for a specific setup I believe since
>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>> there's no VF driver support).
>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>> argument that only a few drivers implement this.  Especially since all
>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>
>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>> side of redirect, then we can implement RX-side in an afternoon.
>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>> which breaks assumptions of some drivers. And since we don't disconnect
>> RX and TX, it looks to me the partial implementation is even worse?
>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>
>>>
>>>> And in order to make it work for a end
>>>> user, the XDP program still need logic like hash(map) lookup to
>>>> determine the destination veth.
>>> That _is_ the general idea behind XDP and eBPF, that we need to add logic
>>> that determine the destination.  The kernel provides the basic
>>> mechanisms for moving/redirecting packets fast, and someone else
>>> builds an orchestration tool like Cilium, that adds the needed logic.
>> Yes, so my reply is for the concern about performance. I meant anyway
>> the hash lookup will make it not hit the wire speed.
>>
>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>> accessible from XDP.
>> Yes.
>>
>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>> to lookup/call macvlan_hash_lookup().
>> That's true but we still need a method to feed macvlan with XDP buff.
>> I'm not sure if this could be treated as another kind of redirection,
>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>> redirection, XDP rx handler has its own advantages:
>>
>> 1) Use the exist API and userspace to setup the network topology instead
>> of inventing new tools and its own specific API. This means user can
>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>> XDP programs to both macvlan and its under layer device.
>> 2) Ease the processing of complex logic, XDP can not do cloning or
>> reference counting. We can differ those cases and let normal networking
>> stack to deal with such packets seamlessly. I believe this is one of the
>> advantage of XDP. This makes us to focus on the fast path and greatly
>> simplify the codes.
>>
>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>> buff. It's just another basic mechanism. Policy is still done by XDP
>> program itself.
>>
> I have been looking into handling stacked devices via lookup helper
> functions. The idea is that a program only needs to be installed on the
> root netdev (ie., the one representing the physical port), and it can
> use helpers to create an efficient pipeline to decide what to do with
> the packet in the presence of stacked devices.
>
> For example, anyone doing pure L3 could do:
>
> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>
> --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>
> port is the netdev associated with the ingress_ifindex in the xdp_md
> context, vlan is the vlan in the packet or the assigned PVID if
> relevant. From there l2dev could be a bond or bridge device for example,
> and l3dev is the one with a network address (vlan netdev, bond netdev, etc).

That looks less flexible, since the topology is hard-coded in the XDP
program itself, and it requires all the logic to be implemented in the
program on the root netdev.

>
> I have L3 forwarding working for vlan devices and bonds. I had not
> considered macvlans specifically yet, but it should be straightforward
> to add.
>

Yes, and all of this could be done through the XDP rx handler as well, and it
can do even more with rather simple logic:

1 macvlan has its own namespace, and may want its own bpf logic.
2 Reuse the existing topology information for dealing with more complex
setups like macvlan on top of bond and team. The bpf program does not need
to care about the topology. If you look at the code, there is not even a
need to attach XDP on each stacked device: a call to xdp_do_pass() can pass
the XDP buff to the upper device even if no XDP program is attached at the
current layer (see the rough driver-side sketch after this list).
3 Deliver the XDP buff to userspace through macvtap.
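
Here is a minimal driver-side sketch of that flow. Only xdp_do_pass() comes
from this series; its exact signature, the verdict names and the two helper
functions below are assumptions made for illustration, not the API from the
patches:

static void rx_one_xdp_buff(struct net_device *dev, struct xdp_buff *xdp)
{
        /* We get here after this device's own XDP program (if any)
         * returned XDP_PASS.  Before building an skb, offer the buffer
         * to the XDP rx handler registered by a stacked device
         * (macvlan, macvtap, bridge, bond, ...).
         */
        switch (xdp_do_pass(xdp)) {
        case XDP_RX_HANDLER_CONSUMED:
                /* The upper device took ownership of the buffer. */
                return;
        case XDP_RX_HANDLER_DROP:
                /* The upper device asked us to drop the packet. */
                driver_recycle_xdp_buff(dev, xdp);      /* assumed helper */
                return;
        case XDP_RX_HANDLER_FALLBACK:
        default:
                /* No handler, or it punted: normal skb path. */
                driver_build_skb_and_receive(dev, xdp); /* assumed helper */
        }
}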

Thanks

2018-08-15 05:36:57

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On Wed, Aug 15, 2018 at 08:29:45AM +0800, Jason Wang wrote:
>
> Looks less flexible since the topology is hard coded in the XDP program
> itself and this requires all logic to be implemented in the program on the
> root netdev.
>
> >
> > I have L3 forwarding working for vlan devices and bonds. I had not
> > considered macvlans specifically yet, but it should be straightforward
> > to add.
> >
>
> Yes, and all these could be done through XDP rx handler as well, and it can
> do even more with rather simple logic:
>
> 1 macvlan has its own namespace, and want its own bpf logic.
> 2 Ruse the exist topology information for dealing with more complex setup
> like macvlan on top of bond and team. There's no need to bpf program to care
> about topology. If you look at the code, there's even no need to attach XDP
> on each stacked device. The calling of xdp_do_pass() can try to pass XDP
> buff to upper device even if there's no XDP program attached to current
> layer.
> 3 Deliver XDP buff to userspace through macvtap.

I think I'm getting what you're trying to achieve.
You actually don't want any bpf programs in there at all.
You want macvlan builtin logic to act on raw packet frames.
It would have been less confusing if you said so from the beginning.
I think there is little value in such work, since something still
needs to process these raw frames eventually. If it's XDP with BPF progs
then they can maintain the speed, but in that case there is no need
for macvlan. The first layer can be normal xdp+bpf+xdp_redirect just fine.
In the case where there is no xdp+bpf in the final processing, the frames are
converted to skb and performance is lost, so in such cases there is no
need for builtin macvlan acting on raw xdp frames either. Just keep
the existing macvlan acting on skbs.


2018-08-15 07:06:43

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-15 13:35, Alexei Starovoitov wrote:
> On Wed, Aug 15, 2018 at 08:29:45AM +0800, Jason Wang wrote:
>> Looks less flexible since the topology is hard coded in the XDP program
>> itself and this requires all logic to be implemented in the program on the
>> root netdev.
>>
>>> I have L3 forwarding working for vlan devices and bonds. I had not
>>> considered macvlans specifically yet, but it should be straightforward
>>> to add.
>>>
>> Yes, and all these could be done through XDP rx handler as well, and it can
>> do even more with rather simple logic:
>>
>> 1 macvlan has its own namespace, and want its own bpf logic.
>> 2 Ruse the exist topology information for dealing with more complex setup
>> like macvlan on top of bond and team. There's no need to bpf program to care
>> about topology. If you look at the code, there's even no need to attach XDP
>> on each stacked device. The calling of xdp_do_pass() can try to pass XDP
>> buff to upper device even if there's no XDP program attached to current
>> layer.
>> 3 Deliver XDP buff to userspace through macvtap.
> I think I'm getting what you're trying to achieve.
> You actually don't want any bpf programs in there at all.
> You want macvlan builtin logic to act on raw packet frames.

The built-in logic is just used to find the destination macvlan device.
It could also be done through another bpf program. Instead of inventing
lots of generic infrastructure in the kernel with its own specific userspace
API, the built-in logic has its own advantages:

- supports hundreds or even thousands of macvlans
- uses existing tools to configure the network
- immunity to topology changes

> It would have been less confusing if you said so from the beginning.

The name "XDP rx handler" is probably not good. Something like "stacked
deivce XDP" might be better.

> I think there is little value in such work, since something still
> needs to process this raw frames eventually. If it's XDP with BPF progs
> than they can maintain the speed, but in such case there is no need
> for macvlan. The first layer can be normal xdp+bpf+xdp_redirect just fine.

I'm a little bit confused. We allow a per-veth XDP program, so I believe a
per-macvlan XDP program makes sense as well? This allows great
flexibility and there's no need to care about the topology in the bpf
program. The configuration is also greatly simplified. The only difference
is that we can use xdp_redirect for veth since it is a paired device: we can
transmit XDP frames to one veth and do XDP on its peer. This does not work
for the case of macvlan, which is based on an rx handler.

Actually, for the case of veth, if we implement an XDP rx handler for
the bridge it can work seamlessly with veth, like:

eth0 (XDP_PASS) -> [bridge XDP rx handler and ndo_xdp_xmit()] -> veth ---
veth (XDP).

Besides the usage for containers, we can implement a macvtap RX handler
which allows fast packet forwarding to userspace.

> In case where there is no xdp+bpf in final processing, the frames are
> converted to skb and performance is lost, so in such cases there is no
> need for builtin macvlan acting on raw xdp frames either. Just keep
> existing macvlan acting on skbs.
>

Yes, this is how veth works as well.

Actually, the idea is not limited to macvlan but applies to all devices that
are based on an rx handler. Consider the case of bonding: this allows setting
a very simple XDP program on the slaves and keeping a single XDP program with
the main logic on the bond instead of duplicating it in all slaves.

Thanks

2018-08-15 17:18:26

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On 8/14/18 6:29 PM, Jason Wang wrote:
>
>
> On 2018年08月14日 22:03, David Ahern wrote:
>> On 8/14/18 7:20 AM, Jason Wang wrote:
>>>
>>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>>> Jason Wang <[email protected]> wrote:
>>>>
>>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>>> Hi:
>>>>>>>
>>>>>>> This series tries to implement XDP support for rx hanlder. This
>>>>>>> would
>>>>>>> be useful for doing native XDP on stacked device like macvlan,
>>>>>>> bridge
>>>>>>> or even bond.
>>>>>>>
>>>>>>> The idea is simple, let stacked device register a XDP rx handler.
>>>>>>> And
>>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>>> handler may then decide how to proceed, it could consume the
>>>>>>> buff, ask
>>>>>>> driver to drop the packet or ask the driver to fallback to normal
>>>>>>> skb
>>>>>>> path.
>>>>>>>
>>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>>> example. For ease comparision, generic XDP support for rx handler
>>>>>>> was
>>>>>>> also implemented.
>>>>>>>
>>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan
>>>>>>> (XDP_DROP)
>>>>>>> shows about 83% improvement.
>>>>>> I'm missing the motiviation for this.
>>>>>> It seems performance of such solution is ~1M packet per second.
>>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>>
>>>>>> What would be a real life use case for such feature ?
>>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>>
>>>>> XDP_DROP on mlx4: 14.0Mpps
>>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>>
>>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>>
>>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>>> handler based device:
>>>>>
>>>>> - For containers, we can run XDP for macvlan (~70% of wire speed).
>>>>> This
>>>>> allows a container specific policy.
>>>>> - For VM, we can implement macvtap XDP rx handler on top. This
>>>>> allow us
>>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>>> - The idea could be used by other rx handler based device like bridge,
>>>>> we may have a XDP fast forwarding path for bridge.
>>>>>
>>>>>> Another concern is that XDP users expect to get line rate performance
>>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>>> XDP_REDIRECT, then we are basically done.
>>> As I replied in another thread this probably not true. Its
>>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>>> except for the case of bridge mode.
>>>
>>>>
>>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>>> high speed networking to be done inside containers after veth.
>>>>>> It's trying to get to line rate inside container.
>>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>>> looks pretty fine, but it only work for a specific setup I believe
>>>>> since
>>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>>> there's no VF driver support).
>>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>>> argument that only a few drivers implement this.  Especially since all
>>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>>
>>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>>> side of redirect, then we can implement RX-side in an afternoon.
>>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>>> which breaks assumptions of some drivers. And since we don't disconnect
>>> RX and TX, it looks to me the partial implementation is even worse?
>>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>>
>>>>
>>>>> And in order to make it work for a end
>>>>> user, the XDP program still need logic like hash(map) lookup to
>>>>> determine the destination veth.
>>>> That _is_ the general idea behind XDP and eBPF, that we need to add
>>>> logic
>>>> that determine the destination.  The kernel provides the basic
>>>> mechanisms for moving/redirecting packets fast, and someone else
>>>> builds an orchestration tool like Cilium, that adds the needed logic.
>>> Yes, so my reply is for the concern about performance. I meant anyway
>>> the hash lookup will make it not hit the wire speed.
>>>
>>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>>> accessible from XDP.
>>> Yes.
>>>
>>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>>> to lookup/call macvlan_hash_lookup().
>>> That's true but we still need a method to feed macvlan with XDP buff.
>>> I'm not sure if this could be treated as another kind of redirection,
>>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>>> redirection, XDP rx handler has its own advantages:
>>>
>>> 1) Use the exist API and userspace to setup the network topology instead
>>> of inventing new tools and its own specific API. This means user can
>>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>>> XDP programs to both macvlan and its under layer device.
>>> 2) Ease the processing of complex logic, XDP can not do cloning or
>>> reference counting. We can differ those cases and let normal networking
>>> stack to deal with such packets seamlessly. I believe this is one of the
>>> advantage of XDP. This makes us to focus on the fast path and greatly
>>> simplify the codes.
>>>
>>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>>> buff. It's just another basic mechanism. Policy is still done by XDP
>>> program itself.
>>>
>> I have been looking into handling stacked devices via lookup helper
>> functions. The idea is that a program only needs to be installed on the
>> root netdev (ie., the one representing the physical port), and it can
>> use helpers to create an efficient pipeline to decide what to do with
>> the packet in the presence of stacked devices.
>>
>> For example, anyone doing pure L3 could do:
>>
>> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>>
>>    --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>>
>> port is the netdev associated with the ingress_ifindex in the xdp_md
>> context, vlan is the vlan in the packet or the assigned PVID if
>> relevant. From there l2dev could be a bond or bridge device for example,
>> and l3dev is the one with a network address (vlan netdev, bond netdev,
>> etc).
>
> Looks less flexible since the topology is hard coded in the XDP program
> itself and this requires all logic to be implemented in the program on
> the root netdev.

Nothing about the topology is hard coded. The idea is to mimic a
hardware pipeline while acknowledging that a port device can have
arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc.

>
>>
>> I have L3 forwarding working for vlan devices and bonds. I had not
>> considered macvlans specifically yet, but it should be straightforward
>> to add.
>>
>
> Yes, and all these could be done through XDP rx handler as well, and it
> can do even more with rather simple logic:

From a forwarding perspective I suspect the rx handler approach is going
to have much more overhead (i.e., higher latency per packet and hence
lower throughput) as the layers determine which one to use (e.g., is the
FIB lookup done on the port device, the vlan device, or the macvlan device
on the vlan device).

>
> 1 macvlan has its own namespace, and want its own bpf logic.
> 2 Ruse the exist topology information for dealing with more complex
> setup like macvlan on top of bond and team. There's no need to bpf
> program to care about topology. If you look at the code, there's even no
> need to attach XDP on each stacked device. The calling of xdp_do_pass()
> can try to pass XDP buff to upper device even if there's no XDP program
> attached to current layer.
> 3 Deliver XDP buff to userspace through macvtap.
>
> Thanks


2018-08-16 07:09:15

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-16 01:17, David Ahern wrote:
> On 8/14/18 6:29 PM, Jason Wang wrote:
>>
>> On 2018年08月14日 22:03, David Ahern wrote:
>>> On 8/14/18 7:20 AM, Jason Wang wrote:
>>>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>>>> Jason Wang <[email protected]> wrote:
>>>>>
>>>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to implement XDP support for rx hanlder. This
>>>>>>>> would
>>>>>>>> be useful for doing native XDP on stacked device like macvlan,
>>>>>>>> bridge
>>>>>>>> or even bond.
>>>>>>>>
>>>>>>>> The idea is simple, let stacked device register a XDP rx handler.
>>>>>>>> And
>>>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>>>> handler may then decide how to proceed, it could consume the
>>>>>>>> buff, ask
>>>>>>>> driver to drop the packet or ask the driver to fallback to normal
>>>>>>>> skb
>>>>>>>> path.
>>>>>>>>
>>>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>>>> example. For ease comparision, generic XDP support for rx handler
>>>>>>>> was
>>>>>>>> also implemented.
>>>>>>>>
>>>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan
>>>>>>>> (XDP_DROP)
>>>>>>>> shows about 83% improvement.
>>>>>>> I'm missing the motiviation for this.
>>>>>>> It seems performance of such solution is ~1M packet per second.
>>>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>>>
>>>>>>> What would be a real life use case for such feature ?
>>>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>>>
>>>>>> XDP_DROP on mlx4: 14.0Mpps
>>>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>>>
>>>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>>>
>>>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>>>> handler based device:
>>>>>>
>>>>>> - For containers, we can run XDP for macvlan (~70% of wire speed).
>>>>>> This
>>>>>> allows a container specific policy.
>>>>>> - For VM, we can implement macvtap XDP rx handler on top. This
>>>>>> allow us
>>>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>>>> - The idea could be used by other rx handler based device like bridge,
>>>>>> we may have a XDP fast forwarding path for bridge.
>>>>>>
>>>>>>> Another concern is that XDP users expect to get line rate performance
>>>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>>>> XDP_REDIRECT, then we are basically done.
>>>> As I replied in another thread this probably not true. Its
>>>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>>>> except for the case of bridge mode.
>>>>
>>>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>>>> high speed networking to be done inside containers after veth.
>>>>>>> It's trying to get to line rate inside container.
>>>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>>>> looks pretty fine, but it only work for a specific setup I believe
>>>>>> since
>>>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>>>> there's no VF driver support).
>>>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>>>> argument that only a few drivers implement this.  Especially since all
>>>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>>>
>>>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>>>> side of redirect, then we can implement RX-side in an afternoon.
>>>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>>>> which breaks assumptions of some drivers. And since we don't disconnect
>>>> RX and TX, it looks to me the partial implementation is even worse?
>>>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>>>
>>>>>> And in order to make it work for a end
>>>>>> user, the XDP program still need logic like hash(map) lookup to
>>>>>> determine the destination veth.
>>>>> That _is_ the general idea behind XDP and eBPF, that we need to add
>>>>> logic
>>>>> that determine the destination.  The kernel provides the basic
>>>>> mechanisms for moving/redirecting packets fast, and someone else
>>>>> builds an orchestration tool like Cilium, that adds the needed logic.
>>>> Yes, so my reply is for the concern about performance. I meant anyway
>>>> the hash lookup will make it not hit the wire speed.
>>>>
>>>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>>>> accessible from XDP.
>>>> Yes.
>>>>
>>>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>>>> to lookup/call macvlan_hash_lookup().
>>>> That's true but we still need a method to feed macvlan with XDP buff.
>>>> I'm not sure if this could be treated as another kind of redirection,
>>>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>>>> redirection, XDP rx handler has its own advantages:
>>>>
>>>> 1) Use the exist API and userspace to setup the network topology instead
>>>> of inventing new tools and its own specific API. This means user can
>>>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>>>> XDP programs to both macvlan and its under layer device.
>>>> 2) Ease the processing of complex logic, XDP can not do cloning or
>>>> reference counting. We can differ those cases and let normal networking
>>>> stack to deal with such packets seamlessly. I believe this is one of the
>>>> advantage of XDP. This makes us to focus on the fast path and greatly
>>>> simplify the codes.
>>>>
>>>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>>>> buff. It's just another basic mechanism. Policy is still done by XDP
>>>> program itself.
>>>>
>>> I have been looking into handling stacked devices via lookup helper
>>> functions. The idea is that a program only needs to be installed on the
>>> root netdev (ie., the one representing the physical port), and it can
>>> use helpers to create an efficient pipeline to decide what to do with
>>> the packet in the presence of stacked devices.
>>>
>>> For example, anyone doing pure L3 could do:
>>>
>>> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>>>
>>>    --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>>>
>>> port is the netdev associated with the ingress_ifindex in the xdp_md
>>> context, vlan is the vlan in the packet or the assigned PVID if
>>> relevant. From there l2dev could be a bond or bridge device for example,
>>> and l3dev is the one with a network address (vlan netdev, bond netdev,
>>> etc).
>> Looks less flexible since the topology is hard coded in the XDP program
>> itself and this requires all logic to be implemented in the program on
>> the root netdev.
> Nothing about the topology is hard coded. The idea is to mimic a
> hardware pipeline and acknowledging that a port device can have an
> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc

I may be missing something, but BPF forbids loops. Without a loop, how can
we make sure all stacked devices are enumerated correctly without knowing
the topology in advance?

>
>>> I have L3 forwarding working for vlan devices and bonds. I had not
>>> considered macvlans specifically yet, but it should be straightforward
>>> to add.
>>>
>> Yes, and all these could be done through XDP rx handler as well, and it
>> can do even more with rather simple logic:
> From a forwarding perspective I suspect the rx handler approach is going
> to have much more overhead (ie., higher latency per packet and hence
> lower throughput) as the layers determine which one to use (e.g., is the
> FIB lookup done on the port device, vlan device, or macvlan device on
> the vlan device).

Well, if we want stacked devices to behave correctly, this is probably the
only way. E.g., in the above figure, to make "find l2dev" work correctly,
we still need device-specific logic, which would be very similar to what
the XDP rx handler does.

Thanks

>
>> 1 macvlan has its own namespace, and want its own bpf logic.
>> 2 Ruse the exist topology information for dealing with more complex
>> setup like macvlan on top of bond and team. There's no need to bpf
>> program to care about topology. If you look at the code, there's even no
>> need to attach XDP on each stacked device. The calling of xdp_do_pass()
>> can try to pass XDP buff to upper device even if there's no XDP program
>> attached to current layer.
>> 3 Deliver XDP buff to userspace through macvtap.
>>
>> Thanks


2018-08-16 07:30:53

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On Wed, Aug 15, 2018 at 03:04:35PM +0800, Jason Wang wrote:
>
> > > 3 Deliver XDP buff to userspace through macvtap.
> > I think I'm getting what you're trying to achieve.
> > You actually don't want any bpf programs in there at all.
> > You want macvlan builtin logic to act on raw packet frames.
>
> The built-in logic is just used to find the destination macvlan device. It
> could be done by through another bpf program. Instead of inventing lots of
> generic infrastructure on kernel with specific userspace API, built-in logic
> has its own advantages:
>
> - support hundreds or even thousands of macvlans

are you saying an xdp bpf program cannot handle thousands of macvlans?

> - using exist tools to configure network
> - immunity to topology changes

what do you mean specifically?

>
> Besides the usage for containers, we can implement macvtap RX handler which
> allows a fast packet forwarding to userspace.

and try to reinvent af_xdp? the motivation for the patchset still escapes me.

> Actually, the idea is not limited to macvlan but for all device that is
> based on rx handler. Consider the case of bonding, this allows to set a very
> simple XDP program on slaves and keep a single main logic XDP program on the
> bond instead of duplicating it in all slaves.

I think such a mixed environment of hardcoded in-kernel things like bond
mixed together with xdp programs will be difficult to manage and debug.
How is an admin supposed to debug it? Say something in the chain of
nic -> native xdp -> bond with your xdp rx -> veth -> xdp prog -> consumer
is dropping a packet. If all forwarding decisions are done by bpf progs,
the progs will have a packet tracing facility (like cilium does) to
show the packet flow end-to-end. It works brilliantly, like traceroute
within a host. But when you have things like macvlan, bond, or bridge in
the middle that can also act on the packet, the admin will have a hard time.

Essentially what you're proposing is to make all kernel builtin packet
steering/forwarding facilities understand raw xdp frames. That's a lot of
code, and at the end of the chain you'd need a fast xdp frame consumer,
otherwise the perf benefits are lost. If that consumer is an xdp bpf program,
why bother with xdp-fied macvlan or bond? If that consumer is the tcp stack,
then forwarding via an xdp-fied bond is no faster than via the skb-based bond.


2018-08-16 07:54:24

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
> > Nothing about the topology is hard coded. The idea is to mimic a
> > hardware pipeline and acknowledging that a port device can have an
> > arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
>
> I may miss something but BPF forbids loop. Without a loop how can we make
> sure all stacked devices is enumerated correctly without knowing the
> topology in advance?

not following. why do you need a loop to implement macvlan as an xdp prog?
if a loop is needed, such an algorithm is not going to scale whether
it's implemented as a bpf program or as in-kernel C code.


2018-08-16 08:19:50

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-16 12:05, Alexei Starovoitov wrote:
> On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
>>> Nothing about the topology is hard coded. The idea is to mimic a
>>> hardware pipeline and acknowledging that a port device can have an
>>> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
>> I may miss something but BPF forbids loop. Without a loop how can we make
>> sure all stacked devices is enumerated correctly without knowing the
>> topology in advance?
> not following. why do you need a loop to implement macvlan as an xdp prog?
> if loop is needed, such algorithm is not going to scale whether
> it's implemented as bpf program or as in-kernel c code.

David said the port can have arbitrary layers stacked on it. So if we
try to enumerate them before making forwarding decisions purely in a BPF
program, it looks to me like a loop is needed here.

Thanks

2018-08-16 08:19:50

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-16 10:49, Alexei Starovoitov wrote:
> On Wed, Aug 15, 2018 at 03:04:35PM +0800, Jason Wang wrote:
>>>> 3 Deliver XDP buff to userspace through macvtap.
>>> I think I'm getting what you're trying to achieve.
>>> You actually don't want any bpf programs in there at all.
>>> You want macvlan builtin logic to act on raw packet frames.
>> The built-in logic is just used to find the destination macvlan device. It
>> could be done by through another bpf program. Instead of inventing lots of
>> generic infrastructure on kernel with specific userspace API, built-in logic
>> has its own advantages:
>>
>> - support hundreds or even thousands of macvlans
> are you saying xdp bpf program cannot handle thousands macvlans?

Correct me if I'm wrong. It works well when the macvlans require
similar logic. But let's consider the case where each macvlan wants its
own specific logic. Is it possible to have thousands of different
policies and actions in a single BPF program? With the XDP rx handler,
there's no need for the root device to care about them; each macvlan only
cares about itself. This is similar to how a qdisc can be
attached to each stacked device.

>
>> - using exist tools to configure network
>> - immunity to topology changes
> what do you mean specifically?

Take the above example: if some macvlans are deleted or created, we need
to notify and update the policies in the root device. This requires a
userspace control program to monitor those changes and notify the BPF
program through maps (a sketch of such a map-driven root program is below).
Unless the BPF program is designed for some specific configurations and
setups, this is not an easy task.
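
For illustration, a minimal sketch of that single-root-program model
(libbpf-style BPF C; the map, struct and program names are made up, not
part of this series): per-macvlan policy lives in a hash map keyed by the
destination MAC, and userspace has to keep the map in sync as macvlans are
added or removed.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct mac_key {
        unsigned char addr[ETH_ALEN];
};

struct policy {
        __u32 action;      /* XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT */
        __u32 redirect_if; /* target ifindex when action is XDP_REDIRECT */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, struct mac_key);
        __type(value, struct policy);
} macvlan_policy SEC(".maps");

SEC("xdp")
int root_dev_prog(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct mac_key key;
        struct policy *p;

        if ((void *)(eth + 1) > data_end)
                return XDP_DROP;

        __builtin_memcpy(key.addr, eth->h_dest, ETH_ALEN);
        p = bpf_map_lookup_elem(&macvlan_policy, &key);
        if (!p)
                return XDP_PASS;        /* unknown MAC: fall back to the stack */

        if (p->action == XDP_REDIRECT)
                return bpf_redirect(p->redirect_if, 0);
        return p->action;
}

char _license[] SEC("license") = "GPL";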

>
>> Besides the usage for containers, we can implement macvtap RX handler which
>> allows a fast packet forwarding to userspace.
> and try to reinvent af_xdp? the motivation for the patchset still escapes me.

Nope, macvtap is used for forwarding packets to a VM. This just tries to
deliver the XDP buff to the VM instead of an skb. A similar idea was used
by TUN/TAP, which showed impressive improvements.

>
>> Actually, the idea is not limited to macvlan but for all device that is
>> based on rx handler. Consider the case of bonding, this allows to set a very
>> simple XDP program on slaves and keep a single main logic XDP program on the
>> bond instead of duplicating it in all slaves.
> I think such mixed environment of hardcoded in-kernel things like bond
> mixed together with xdp programs will be difficult to manage and debug.
> How admin suppose to debug it?

Well, we already have the in-kernel XDP_TX routine. It should not be
harder than that.

> Say something in the chain of
> nic -> native xdp -> bond with your xdp rx -> veth -> xdp prog -> consumer
> is dropping a packet. If all forwarding decisions are done by bpf progs
> the progs will have packet tracing facility (like cilium does) to
> show packet flow end-to-end. It works briliantly like traceroute within a host.

Does this work well for a veth pair as well? If yes, it should work for the
rx handler too, unless it has some hard-coded logic like "ok, the packet
goes to veth, I'm sure it will be delivered to its peer". The idea of this
series is not to forbid forwarding decisions made by bpf progs; if
the code does this by accident, we can introduce a flag to disable/enable
the XDP rx handler.

And I believe redirection is only part of XDP usage; we may still want
things like XDP_TX.

> But when you have things like macvlan, bond, bridge in the middle
> that can also act on packet, the admin will have a hard time.

I admit it may require admin help, but it gives us more flexibility.

>
> Essentially what you're proposing is to make all kernel builtin packet
> steering/forwarding facilities to understand raw xdp frames.

Probably not; at least this series just focuses on the rx handler. Fewer
than 10 devices use it.

> That's a lot of code
> and at the end of the chain you'd need fast xdp frame consumer otherwise
> perf benefits are lost.

The performance gain is lost, but it is still the same as the skb path. And
besides redirection, we do have other consumers like XDP_TX.

> If that consumer is xdp bpf program
> why bother with xdp-fied macvlan or bond?

For macvlan, we may want different policies for different
devices. For bond, we don't want to duplicate the XDP logic in each slave,
and only the bond knows which slave can be used for XDP_TX.

> If that consumer is tcp stack
> than forwarding via xdp-fied bond is no faster than via skb-based bond.
>

Yes.

Thanks

2018-08-17 21:16:31

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

On 8/15/18 9:34 PM, Jason Wang wrote:
> I may miss something but BPF forbids loop. Without a loop how can we
> make sure all stacked devices is enumerated correctly without knowing
> the topology in advance?

netdev_for_each_upper_dev_rcu

BPF helpers allow programs to do lookups in kernel tables, in this case
the ability to find an upper device that would receive the packet.
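
For the macvlan case, a rough kernel-side sketch of what such a lookup could
do (the function name is hypothetical, not an existing API): walk the upper
devices of the ingress port under RCU and pick the one whose address matches
the packet's destination MAC.

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

static struct net_device *
xdp_upper_dev_by_dmac(struct net_device *dev, const unsigned char *dmac)
{
        struct net_device *upper;
        struct list_head *iter;

        /* caller must hold rcu_read_lock() */
        netdev_for_each_upper_dev_rcu(dev, upper, iter) {
                if (ether_addr_equal(upper->dev_addr, dmac))
                        return upper;
        }
        return NULL;    /* no stacked device claims this MAC */
}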

2018-08-20 06:36:49

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-08-18 05:15, David Ahern wrote:
> On 8/15/18 9:34 PM, Jason Wang wrote:
>> I may miss something but BPF forbids loop. Without a loop how can we
>> make sure all stacked devices is enumerated correctly without knowing
>> the topology in advance?
> netdev_for_each_upper_dev_rcu
>
> BPF helpers allow programs to do lookups in kernel tables, in this case
> the ability to find an upper device that would receive the packet.

So if I understand correctly, you mean using
netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think we
may still need device-specific logic. E.g., for macvlan,
netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top of a
lower device, but what we need is the one macvlan that matches the
dst mac address, which is similar to what the XDP rx handler does. And it
would become more complicated if we have multiple layers of devices.

So let's consider a simple case where we have 5 macvlan devices:

macvlan0: does some packet filtering before passing packets to the TCP/IP stack
macvlan1: modifies packets and redirects them to another interface
macvlan2: modifies packets and transmits them back through XDP_TX
macvlan3: delivers packets to AF_XDP
macvtap0: delivers raw XDP buffs to a VM

So, with the XDP rx handler, all we need to do is attach a different
XDP program to each macvlan device (e.g., macvlan3's program could be as
simple as the sketch below). Your idea is to do everything in the root
device's XDP program. This looks complicated and inflexible since it needs
to care about a lot of things, e.g. adding/removing actions/policies, and
the XDP program needs to call a BPF helper that uses
netdev_for_each_upper_dev_rcu() to work correctly with stacked devices.
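
As an illustration, macvlan3's own program might look like this under the
proposal (names are made up; libbpf-style BPF C; it assumes frames reaching
the macvlan can be redirected into an XSKMAP in the usual way):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int macvlan3_to_af_xdp(struct xdp_md *ctx)
{
        /* Hand every frame on this macvlan to the AF_XDP socket bound
         * to the receiving queue; returns XDP_ABORTED if no socket is
         * attached to that queue.
         */
        return bpf_redirect_map(&xsks, ctx->rx_queue_index, 0);
}

char _license[] SEC("license") = "GPL";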

Thanks

2018-09-05 17:21:54

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler

[ sorry for the delay; focused on the nexthop RFC ]

On 8/20/18 12:34 AM, Jason Wang wrote:
>
>
> On 2018年08月18日 05:15, David Ahern wrote:
>> On 8/15/18 9:34 PM, Jason Wang wrote:
>>> I may miss something but BPF forbids loop. Without a loop how can we
>>> make sure all stacked devices is enumerated correctly without knowing
>>> the topology in advance?
>> netdev_for_each_upper_dev_rcu
>>
>> BPF helpers allow programs to do lookups in kernel tables, in this case
>> the ability to find an upper device that would receive the packet.
>
> So if I understand correctly, you mean using
> netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think we
> may still need device specific logic. E.g for macvlan,
> netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top a
> lower device. But what we need is one of the macvlan that matches the
> dst mac address which is similar to what XDP rx handler did. And it
> would become more complicated if we have multiple layers of device.

My device lookup helper takes the base port index (starting device),
vlan protocol, vlan tag and dest mac. So, yes, the mac address is used
to uniquely identify the stacked device.

>
> So let's consider a simple case, consider we have 5 macvlan devices:
>
> macvlan0: doing some packet filtering before passing packets to TCP/IP
> stack
> macvlan1: modify packets and redirect to another interface
> macvlan2: modify packets and transmit packet back through XDP_TX
> macvlan3: deliver packets to AF_XDP
> macvtap0: deliver packets raw XDP to VM
>
> So, with XDP rx handler, what we need to just to attach five different
> XDP programs to each macvlan device. Your idea is to do all things in
> the root device XDP program. This looks complicated and not flexible
> since it needs to care a lot of things, e.g adding/removing
> actions/policies. And XDP program needs to call BPF helper that use
> netdev_for_each_upper_dev_rcu() to work correctly with stacked device.
>

Stacking on top of a nic port can involve all kinds of combinations of
vlans, bonds, bridges, vlans on bonds and bridges, macvlans, etc. I
suspect trying to install a program for layer 3 forwarding on each one
and iteratively running the programs would kill the performance gained
from forwarding with xdp.


2018-09-06 05:14:13

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH net-next V2 0/6] XDP rx handler



On 2018-09-06 01:20, David Ahern wrote:
> [ sorry for the delay; focused on the nexthop RFC ]

No problem. Your comments are appreciated.

> On 8/20/18 12:34 AM, Jason Wang wrote:
>>
>> On 2018年08月18日 05:15, David Ahern wrote:
>>> On 8/15/18 9:34 PM, Jason Wang wrote:
>>>> I may miss something but BPF forbids loop. Without a loop how can we
>>>> make sure all stacked devices is enumerated correctly without knowing
>>>> the topology in advance?
>>> netdev_for_each_upper_dev_rcu
>>>
>>> BPF helpers allow programs to do lookups in kernel tables, in this case
>>> the ability to find an upper device that would receive the packet.
>> So if I understand correctly, you mean using
>> netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think we
>> may still need device specific logic. E.g for macvlan,
>> netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top a
>> lower device. But what we need is one of the macvlan that matches the
>> dst mac address which is similar to what XDP rx handler did. And it
>> would become more complicated if we have multiple layers of device.
> My device lookup helper takes the base port index (starting device),
> vlan protocol, vlan tag and dest mac. So, yes, the mac address is used
> to uniquely identify the stacked device.

Ok.

>
>> So let's consider a simple case, consider we have 5 macvlan devices:
>>
>> macvlan0: doing some packet filtering before passing packets to TCP/IP
>> stack
>> macvlan1: modify packets and redirect to another interface
>> macvlan2: modify packets and transmit packet back through XDP_TX
>> macvlan3: deliver packets to AF_XDP
>> macvtap0: deliver packets raw XDP to VM
>>
>> So, with XDP rx handler, what we need to just to attach five different
>> XDP programs to each macvlan device. Your idea is to do all things in
>> the root device XDP program. This looks complicated and not flexible
>> since it needs to care a lot of things, e.g adding/removing
>> actions/policies. And XDP program needs to call BPF helper that use
>> netdev_for_each_upper_dev_rcu() to work correctly with stacked device.
>>
> Stacking on top of a nic port can have all kinds of combinations of
> vlans, bonds, bridges, vlans on bonds and bridges, macvlans, etc. I
> suspect trying to install a program for layer 3 forwarding on each one
> and iteratively running the programs would kill the performance gained
> from forwarding with xdp.

Yes, the performance may drop, but it's still much faster than the generic
XDP path.

One reason for the drop is the device-specific logic, like mac address
matching, which is also needed in the case of a single XDP program on
the root device. For macvlan, if we allow attaching XDP to the macvlan, we
can offload the mac address lookup to hardware through L2 forwarding
offload; I believe that would remove the performance drop. The only overhead
introduced by the XDP rx handler itself is probably the indirect calls, and
we can try to amortize them by introducing some kind of batching on top.
As for the issue of iterating over multiple XDP programs: with this RFC,
if we have N stacked devices there's no need to attach an XDP program at
each layer; the only thing needed is the XDP_PASS action in the root
device, and then you can attach an XDP program to any one or several of the
stacked devices on top.

So the RFC is not intended to replace any existing solution; it just
provides some flexibility for having native XDP on stacked devices (which
are based on rx handlers) while benefiting from existing tools for
configuration. If a user wants to do everything in the root device, that
should work well without any issues.

Thanks