2009-11-17 22:41:30

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 0/3] macvlan: add vepa and bridge mode

This is based on an earlier patch from Eric Biederman adding
forwarding between macvlans. I extended his approach to
allow the administrator to choose the mode for each macvlan,
and to implement a functional VEPA between macvlan.

Still missing from this is support for communication between
the lower device that the macvlans are based on. This would
be extremely useful but as others have found out before me
requires significant changes not only to macvlan but also
to the common transmit path.

I've seen one panic during testing this that I still need
to track down, but it generally does what is advertised.
I've tested VEPA operation with the hairpin support
added to the bridge driver by Anna Fischer.

My current plan is to submit this for inclusion in 2.6.33
when people are happy with it and I tracked down any
remaining bugs, and possibly to do the communication with
the lower device one release later.

Arnd <><

---

Arnd Bergmann (3):
macvlan: implement VEPA and private mode
macvlan: export macvlan mode through netlink
iplink: add macvlan options for bridge mode

Eric Biederman (1):
macvlan: Reflect macvlan packets meant for other macvlan devices

linux/drivers/net/macvlan.c | 170 +++++++++++++++++++++++++++++++++-----
linux/include/linux/if_link.h | 15 +++
2 files changed, 161 insertions(+), 24 deletions(-)

iproute2/include/linux/if_link.h | 15 +++
iproute2/ip/Makefile | 3 +-
iproute2/ip/iplink_macvlan.c | 93 ++++++++++++++++++
3 files changed, 110 insertions(+), 1 deletions(-)
create mode 100644 ip/iplink_macvlan.c


2009-11-17 22:41:29

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

From: Eric Biederman <[email protected]>

Switch ports do not send packets back out the same port they came
in on. This causes problems when using a macvlan device inside
of a network namespace as it becomes impossible to talk to
other macvlan devices.

Signed-off-by: Eric Biederman <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 91 ++++++++++++++++++++++++++++++++++++-------------
1 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 3aabfd9..406b8b5 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -29,6 +29,7 @@
#include <linux/if_link.h>
#include <linux/if_macvlan.h>
#include <net/rtnetlink.h>
+#include <net/xfrm.h>

#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)

@@ -102,7 +103,8 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
}

static void macvlan_broadcast(struct sk_buff *skb,
- const struct macvlan_port *port)
+ const struct macvlan_port *port,
+ struct net_device *src)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
@@ -118,6 +120,9 @@ static void macvlan_broadcast(struct sk_buff *skb,
hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
dev = vlan->dev;

+ if (dev == src)
+ continue;
+
nskb = skb_clone(skb, GFP_ATOMIC);
if (nskb == NULL) {
dev->stats.rx_errors++;
@@ -140,20 +145,45 @@ static void macvlan_broadcast(struct sk_buff *skb,
}
}

+static int macvlan_unicast(struct sk_buff *skb, const struct macvlan_dev *dest)
+{
+ struct net_device *dev = dest->dev;
+
+ if (unlikely(!dev->flags & IFF_UP)) {
+ kfree_skb(skb);
+ return NET_XMIT_DROP;
+ }
+
+ skb = skb_share_check(skb, GFP_ATOMIC);
+ if (!skb) {
+ dev->stats.rx_errors++;
+ dev->stats.rx_dropped++;
+ return NET_XMIT_DROP;
+ }
+
+ dev->stats.rx_bytes += skb->len + ETH_HLEN;
+ dev->stats.rx_packets++;
+
+ skb->dev = dev;
+ skb->pkt_type = PACKET_HOST;
+ netif_rx(skb);
+ return NET_XMIT_SUCCESS;
+}
+
+
/* called under rcu_read_lock() from netif_receive_skb */
static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_port *port;
const struct macvlan_dev *vlan;
- struct net_device *dev;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
return skb;

if (is_multicast_ether_addr(eth->h_dest)) {
- macvlan_broadcast(skb, port);
+ macvlan_broadcast(skb, port, NULL);
return skb;
}

@@ -161,27 +191,43 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
if (vlan == NULL)
return skb;

- dev = vlan->dev;
- if (unlikely(!(dev->flags & IFF_UP))) {
- kfree_skb(skb);
- return NULL;
- }
+ macvlan_unicast(skb, vlan);
+ return NULL;
+}

- skb = skb_share_check(skb, GFP_ATOMIC);
- if (skb == NULL) {
- dev->stats.rx_errors++;
- dev->stats.rx_dropped++;
- return NULL;
- }
+static int macvlan_xmit_world(struct sk_buff *skb, struct net_device *dev)
+{
+ const struct macvlan_dev *vlan = netdev_priv(dev);
+ __skb_push(skb, skb->data - skb_mac_header(skb));
+ skb->dev = vlan->lowerdev;
+ return dev_queue_xmit(skb);
+}

- dev->stats.rx_bytes += skb->len + ETH_HLEN;
- dev->stats.rx_packets++;
+static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ const struct macvlan_dev *vlan = netdev_priv(dev);
+ const struct macvlan_port *port = vlan->port;
+ const struct macvlan_dev *dest;
+ const struct ethhdr *eth;

- skb->dev = dev;
- skb->pkt_type = PACKET_HOST;
+ skb->protocol = eth_type_trans(skb, dev);
+ eth = eth_hdr(skb);

- netif_rx(skb);
- return NULL;
+ skb_dst_drop(skb);
+ skb->mark = 0;
+ secpath_reset(skb);
+ nf_reset(skb);
+
+ if (is_multicast_ether_addr(eth->h_dest)) {
+ macvlan_broadcast(skb, port, dev);
+ return macvlan_xmit_world(skb, dev);
+ }
+
+ dest = macvlan_hash_lookup(port, eth->h_dest);
+ if (dest)
+ return macvlan_unicast(skb, dest);
+
+ return macvlan_xmit_world(skb, dev);
}

static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
@@ -189,13 +235,10 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
{
int i = skb_get_queue_mapping(skb);
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- const struct macvlan_dev *vlan = netdev_priv(dev);
unsigned int len = skb->len;
int ret;

- skb->dev = vlan->lowerdev;
- ret = dev_queue_xmit(skb);
-
+ ret = macvlan_queue_xmit(skb, dev);
if (likely(ret == NET_XMIT_SUCCESS)) {
txq->tx_packets++;
txq->tx_bytes += len;
--
1.6.3.3

2009-11-17 22:41:15

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 2/3] macvlan: implement VEPA and private mode

This allows each macvlan slave device to be in one
of three modes, depending on the use case:

MACVLAN_MODE_PRIVATE:
The device never communicates with any other device
on the same upper_dev. This even includes frames
coming back from a reflective relay, where supported
by the adjacent bridge.

MACVLAN_MODE_VEPA:
The new Virtual Ethernet Port Aggregator (VEPA) mode,
we assume that the adjacent bridge returns all frames
where both source and destination are local to the
macvlan port, i.e. the bridge is set up as a reflective
relay.
Broadcast frames coming in from the upper_dev get
flooded to all macvlan interfaces in VEPA mode.
We never deliver any frames locally.

MACVLAN_MODE_BRIDGE:
We provide the behavior of a simple bridge between
different macvlan interfaces on the same port. Frames
from one interface to another one get delivered directly
and are not sent out externally. Broadcast frames get
flooded to all other bridge ports and to the external
interface, but when they come back from a reflective
relay, we don't deliver them again.
Since we know all the MAC addresses, the macvlan bridge
mode does not require learning or STP like the bridge
module does.

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 46 +++++++++++++++++++++++++++++++++++++---------
1 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 406b8b5..fa8b568 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -33,6 +33,12 @@

#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)

+enum macvlan_type {
+ MACVLAN_PRIVATE = 1,
+ MACVLAN_VEPA = 2,
+ MACVLAN_BRIDGE = 4,
+};
+
struct macvlan_port {
struct net_device *dev;
struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
@@ -45,6 +51,7 @@ struct macvlan_dev {
struct hlist_node hlist;
struct macvlan_port *port;
struct net_device *lowerdev;
+ enum macvlan_mode mode;
};


@@ -104,7 +111,8 @@ static int macvlan_addr_busy(const struct macvlan_port *port,

static void macvlan_broadcast(struct sk_buff *skb,
const struct macvlan_port *port,
- struct net_device *src)
+ struct net_device *src,
+ enum macvlan_mode mode)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
@@ -123,6 +131,9 @@ static void macvlan_broadcast(struct sk_buff *skb,
if (dev == src)
continue;

+ if (!(vlan->mode & mode))
+ continue;
+
nskb = skb_clone(skb, GFP_ATOMIC);
if (nskb == NULL) {
dev->stats.rx_errors++;
@@ -177,13 +188,27 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_port *port;
const struct macvlan_dev *vlan;
+ const struct macvlan_dev *src;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
return skb;

if (is_multicast_ether_addr(eth->h_dest)) {
- macvlan_broadcast(skb, port, NULL);
+ src = macvlan_hash_lookup(port, eth->h_source);
+ if (!src)
+ /* frame comes from an external address */
+ macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_VEPA
+ | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ else if (src->mode == MACVLAN_MODE_VEPA)
+ /* flood to everyone except source */
+ macvlan_broadcast(skb, port, src->dev,
+ MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ else if (src->mode == MACVLAN_MODE_BRIDGE)
+ /* flood only to VEPA ports, bridge ports
+ already saw the frame */
+ macvlan_broadcast(skb, port, src->dev,
+ MACVLAN_MODE_VEPA);
return skb;
}

@@ -218,14 +243,17 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
secpath_reset(skb);
nf_reset(skb);

- if (is_multicast_ether_addr(eth->h_dest)) {
- macvlan_broadcast(skb, port, dev);
- return macvlan_xmit_world(skb, dev);
- }
+ if (vlan->mode == MACVLAN_MODE_BRIDGE) {
+ /* send to other bridge ports directly */
+ if (is_multicast_ether_addr(eth->h_dest)) {
+ macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
+ return macvlan_xmit_world(skb, dev);
+ }

- dest = macvlan_hash_lookup(port, eth->h_dest);
- if (dest)
- return macvlan_unicast(skb, dest);
+ dest = macvlan_hash_lookup(port, eth->h_dest);
+ if (dest && dest->mode == MACVLAN_MODE_BRIDGE)
+ return macvlan_unicast(skb, dest);
+ }

return macvlan_xmit_world(skb, dev);
}
--
1.6.3.3

2009-11-17 22:40:39

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 3/3] macvlan: export macvlan mode through netlink

In order to support all three modes of macvlan at
runtime, extend the existing netlink protocol
to allow choosing the mode per macvlan slave
interface.

This depends on a matching patch to iproute2
in order to become accessible in user land.

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 67 +++++++++++++++++++++++++++++++++++++++++-----
include/linux/if_link.h | 15 ++++++++++
2 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fa8b568..731017e 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -33,12 +33,6 @@

#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)

-enum macvlan_type {
- MACVLAN_PRIVATE = 1,
- MACVLAN_VEPA = 2,
- MACVLAN_BRIDGE = 4,
-};
-
struct macvlan_port {
struct net_device *dev;
struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
@@ -51,7 +45,7 @@ struct macvlan_dev {
struct hlist_node hlist;
struct macvlan_port *port;
struct net_device *lowerdev;
- enum macvlan_mode mode;
+ enum ifla_macvlan_mode mode;
};


@@ -112,7 +106,7 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
static void macvlan_broadcast(struct sk_buff *skb,
const struct macvlan_port *port,
struct net_device *src,
- enum macvlan_mode mode)
+ enum ifla_macvlan_mode mode)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
@@ -553,6 +547,18 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
return -EADDRNOTAVAIL;
}
+
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+ switch (mode) {
+ case MACVLAN_MODE_PRIVATE:
+ case MACVLAN_MODE_VEPA:
+ case MACVLAN_MODE_BRIDGE:
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
return 0;
}

@@ -617,6 +623,13 @@ static int macvlan_newlink(struct net_device *dev,
vlan->dev = dev;
vlan->port = port;

+ vlan->mode = MACVLAN_MODE_VEPA;
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+
+ vlan->mode = mode;
+ }
+
err = register_netdevice(dev);
if (err < 0)
return err;
@@ -638,6 +651,39 @@ static void macvlan_dellink(struct net_device *dev)
macvlan_port_destroy(port->dev);
}

+static int macvlan_changelink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+ vlan->mode = mode;
+ }
+
+ return 0;
+}
+
+static size_t macvlan_get_size(const struct net_device *dev)
+{
+ return nla_total_size(4);
+}
+
+static int macvlan_fill_info(struct sk_buff *skb,
+ const struct net_device *dev)
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+
+ NLA_PUT_U32(skb, IFLA_MACVLAN_MODE, vlan->mode);
+ return 0;
+
+nla_put_failure:
+ return -EMSGSIZE;
+}
+
+static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
+ [IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
+};
+
static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.kind = "macvlan",
.priv_size = sizeof(struct macvlan_dev),
@@ -646,6 +692,11 @@ static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.validate = macvlan_validate,
.newlink = macvlan_newlink,
.dellink = macvlan_dellink,
+ .maxtype = IFLA_MACVLAN_MAX,
+ .policy = macvlan_policy,
+ .changelink = macvlan_changelink,
+ .get_size = macvlan_get_size,
+ .fill_info = macvlan_fill_info,
};

static int macvlan_device_event(struct notifier_block *unused,
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 176c518..ef70ebc 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -190,4 +190,19 @@ struct ifla_vlan_qos_mapping
__u32 to;
};

+/* MACVLAN section */
+enum {
+ IFLA_MACVLAN_UNSPEC,
+ IFLA_MACVLAN_MODE,
+ __IFLA_MACVLAN_MAX,
+};
+
+#define IFLA_MACVLAN_MAX (__IFLA_MACVLAN_MAX - 1)
+
+enum ifla_macvlan_mode {
+ MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
+ MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
+ MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
+};
+
#endif /* _LINUX_IF_LINK_H */
--
1.6.3.3

2009-11-17 22:40:27

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH] iplink: add macvlan options for bridge mode

Macvlan can now optionally support forwarding between its
ports, if they are in "bridge" mode. This adds support
for this option to "ip link add", "ip link set" and "ip
-d link show".

The default mode in the kernel is now "vepa" mode, meaning
"virtual ethernet port aggregator". This mode is used
together with the "hairpin" mode of an ethernet bridge
that the parent of the macvlan device is connected to.
All frames still get sent out to the external interface,
but the adjacent bridge is able to send them back on
the same wire in hairpin mode, so the macvlan ports
are able to see each other, which the bridge can be
configured to monitor and control traffic between
all macvlan instances. Multicast traffic coming in
from the external interface is checked for the source
MAC address and only delivered to ports that have not
yet seen it.

In bridge mode, macvlan will send all multicast traffic
to other interfaces that are also in bridge mode but
not to those in vepa mode, which get them on the way
back from the hairpin.

The third supported mode is "private", which prevents
communication between macvlans even if the adjacent
bridge is in hairpin mode. This behavior is closer to
the original implementation of macvlan but stricly
maintains isolation.

Signed-off-by: Arnd Bergmann <[email protected]>
---
include/linux/if_link.h | 15 ++++++++
ip/Makefile | 3 +-
ip/iplink_macvlan.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 110 insertions(+), 1 deletions(-)
create mode 100644 ip/iplink_macvlan.c

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index b0b9e8a..425c489 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -188,4 +188,19 @@ struct ifla_vlan_qos_mapping
__u32 to;
};

+/* MACVLAN section */
+enum {
+ IFLA_MACVLAN_UNSPEC,
+ IFLA_MACVLAN_MODE,
+ __IFLA_MACVLAN_MAX,
+};
+
+enum ifla_macvlan_mode {
+ MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
+ MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
+ MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
+};
+
+#define IFLA_MACVLAN_MAX (__IFLA_MACVLAN_MAX - 1)
+
#endif /* _LINUX_IF_LINK_H */
diff --git a/ip/Makefile b/ip/Makefile
index 51914e8..46a9836 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -2,7 +2,8 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o \
ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
- iplink_vlan.o link_veth.o link_gre.o iplink_can.o
+ iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
+ iplink_macvlan.o

RTMONOBJ=rtmon.o

diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
new file mode 100644
index 0000000..307f559
--- /dev/null
+++ b/ip/iplink_macvlan.c
@@ -0,0 +1,93 @@
+/*
+ * iplink_vlan.c VLAN device support
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: Patrick McHardy <[email protected]>
+ * Arnd Bergmann <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void explain(void)
+{
+ fprintf(stderr,
+ "Usage: ... macvlan mode { private | vepa | bridge }\n"
+ );
+}
+
+static int mode_arg(void)
+{
+ fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
+ "\"vepa\" or \"bridge\"\n");
+ return -1;
+}
+
+static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
+ struct nlmsghdr *n)
+{
+ while (argc > 0) {
+ if (matches(*argv, "mode") == 0) {
+ __u32 mode = 0;
+ NEXT_ARG();
+
+ if (strcmp(*argv, "private") == 0)
+ mode = MACVLAN_MODE_PRIVATE;
+ else if (strcmp(*argv, "vepa") == 0)
+ mode = MACVLAN_MODE_VEPA;
+ else if (strcmp(*argv, "bridge") == 0)
+ mode = MACVLAN_MODE_BRIDGE;
+ else
+ return mode_arg();
+
+ addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+ } else if (matches(*argv, "help") == 0) {
+ explain();
+ return -1;
+ } else {
+ fprintf(stderr, "macvlan: what is \"%s\"?\n", *argv);
+ explain();
+ return -1;
+ }
+ argc--, argv++;
+ }
+
+ return 0;
+}
+
+static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+ __u32 mode;
+
+ if (!tb)
+ return;
+
+ if (!tb[IFLA_MACVLAN_MODE] ||
+ RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
+ return;
+
+ mode = *(__u32 *)RTA_DATA(tb[IFLA_VLAN_ID]);
+ fprintf(f, " mode %s ",
+ mode == MACVLAN_MODE_PRIVATE ? "private"
+ : mode == MACVLAN_MODE_VEPA ? "vepa"
+ : mode == MACVLAN_MODE_BRIDGE ? "bridge"
+ : "unknown");
+}
+
+struct link_util macvlan_link_util = {
+ .id = "macvlan",
+ .maxattr = IFLA_MACVLAN_MAX,
+ .parse_opt = macvlan_parse_opt,
+ .print_opt = macvlan_print_opt,
+};
--
1.6.3.3

2009-11-17 22:56:48

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 0/3] macvlan: add vepa and bridge mode

Sorry, I used the wrong address for the virtualization mailing list
at first. Please correct this to <[email protected]>
when replying to the other mails.

For people only subscribed to virtualization, you can find the actual
patches at
http://patchwork.kernel.org/patch/60810/
http://patchwork.kernel.org/patch/60811/
http://patchwork.kernel.org/patch/60813/
http://patchwork.kernel.org/patch/60814/

Arnd <><

On Tuesday 17 November 2009, Arnd Bergmann wrote:
> This is based on an earlier patch from Eric Biederman adding
> forwarding between macvlans. I extended his approach to
> allow the administrator to choose the mode for each macvlan,
> and to implement a functional VEPA between macvlan.
>
> Still missing from this is support for communication between
> the lower device that the macvlans are based on. This would
> be extremely useful but as others have found out before me
> requires significant changes not only to macvlan but also
> to the common transmit path.
>
> I've seen one panic during testing this that I still need
> to track down, but it generally does what is advertised.
> I've tested VEPA operation with the hairpin support
> added to the bridge driver by Anna Fischer.
>
> My current plan is to submit this for inclusion in 2.6.33
> when people are happy with it and I tracked down any
> remaining bugs, and possibly to do the communication with
> the lower device one release later.
>
> Arnd <><
>
> ---
>
> Arnd Bergmann (3):
> macvlan: implement VEPA and private mode
> macvlan: export macvlan mode through netlink
> iplink: add macvlan options for bridge mode
>
> Eric Biederman (1):
> macvlan: Reflect macvlan packets meant for other macvlan devices
>
> linux/drivers/net/macvlan.c | 170 +++++++++++++++++++++++++++++++++-----
> linux/include/linux/if_link.h | 15 +++
> 2 files changed, 161 insertions(+), 24 deletions(-)
>
> iproute2/include/linux/if_link.h | 15 +++
> iproute2/ip/Makefile | 3 +-
> iproute2/ip/iplink_macvlan.c | 93 ++++++++++++++++++
> 3 files changed, 110 insertions(+), 1 deletions(-)
> create mode 100644 ip/iplink_macvlan.c
>

2009-11-18 06:31:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

Arnd Bergmann a ?crit :
> From: Eric Biederman <[email protected]>
>
> Switch ports do not send packets back out the same port they came
> in on. This causes problems when using a macvlan device inside
> of a network namespace as it becomes impossible to talk to
> other macvlan devices.

This patch is very welcome. I review it and found one oddity.

>
> Signed-off-by: Eric Biederman <[email protected]>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> +static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + const struct macvlan_dev *vlan = netdev_priv(dev);
> + const struct macvlan_port *port = vlan->port;
> + const struct macvlan_dev *dest;
> + const struct ethhdr *eth;
>
> - skb->dev = dev;
> - skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, dev);
> + eth = eth_hdr(skb);
>
> - netif_rx(skb);
> - return NULL;
> + skb_dst_drop(skb);

Why do you drop dst here ?

It seems strange, since this driver specifically masks out IFF_XMIT_DST_RELEASE
in its macvlan_setup() :

dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;

If we really want to drop dst, it could be done by caller, if IFF_XMIT_DST_RELEASE
was not masked in macvlan_setup().


> + skb->mark = 0;
> + secpath_reset(skb);
> + nf_reset(skb);
> +
> + if (is_multicast_ether_addr(eth->h_dest)) {
> + macvlan_broadcast(skb, port, dev);
> + return macvlan_xmit_world(skb, dev);
> + }
> +
> + dest = macvlan_hash_lookup(port, eth->h_dest);
> + if (dest)
> + return macvlan_unicast(skb, dest);
> +
> + return macvlan_xmit_world(skb, dev);
> }


# find net drivers/net|xargs grep -n IFF_XMIT_DST_RELEASE
net/8021q/vlan_dev.c:837: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
net/atm/clip.c:561: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
net/core/dev.c:1778: if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
net/core/dev.c:5287: dev->priv_flags = IFF_XMIT_DST_RELEASE;
net/ipv4/ip_gre.c:1236: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
net/ipv4/ipip.c:717: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
net/ipv6/sit.c:1104: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/appletalk/ipddp.c:76: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/bonding/bond_main.c:4534: bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/eql.c:197: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/ifb.c:162: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/loopback.c:174: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/macvlan.c:421: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/ppp_generic.c:1057: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
drivers/net/wan/hdlc_fr.c:1057: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;

2009-11-18 06:42:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 2/3] macvlan: implement VEPA and private mode

Arnd Bergmann a ?crit :
> This allows each macvlan slave device to be in one
> of three modes, depending on the use case:
>
> MACVLAN_MODE_PRIVATE:
> The device never communicates with any other device
> on the same upper_dev. This even includes frames
> coming back from a reflective relay, where supported
> by the adjacent bridge.
>
> MACVLAN_MODE_VEPA:
> The new Virtual Ethernet Port Aggregator (VEPA) mode,
> we assume that the adjacent bridge returns all frames
> where both source and destination are local to the
> macvlan port, i.e. the bridge is set up as a reflective
> relay.
> Broadcast frames coming in from the upper_dev get
> flooded to all macvlan interfaces in VEPA mode.
> We never deliver any frames locally.
>
> MACVLAN_MODE_BRIDGE:
> We provide the behavior of a simple bridge between
> different macvlan interfaces on the same port. Frames
> from one interface to another one get delivered directly
> and are not sent out externally. Broadcast frames get
> flooded to all other bridge ports and to the external
> interface, but when they come back from a reflective
> relay, we don't deliver them again.
> Since we know all the MAC addresses, the macvlan bridge
> mode does not require learning or STP like the bridge
> module does.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---


> if (is_multicast_ether_addr(eth->h_dest)) {
> - macvlan_broadcast(skb, port, NULL);
> + src = macvlan_hash_lookup(port, eth->h_source);
> + if (!src)
> + /* frame comes from an external address */


> + macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_VEPA
> + | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);

typo here, you probably meant MACVLAN_PRIVATE | MACVLAN_VEPA | MACVLAN_BRIDGE

> + else if (src->mode == MACVLAN_MODE_VEPA)
> + /* flood to everyone except source */
> + macvlan_broadcast(skb, port, src->dev,
> + MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
> + else if (src->mode == MACVLAN_MODE_BRIDGE)
> + /* flood only to VEPA ports, bridge ports
> + already saw the frame */
> + macvlan_broadcast(skb, port, src->dev,
> + MACVLAN_MODE_VEPA);
> return skb;
> }
>

2009-11-18 06:49:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 3/3] macvlan: export macvlan mode through netlink

Arnd Bergmann a ?crit :
> In order to support all three modes of macvlan at
> runtime, extend the existing netlink protocol
> to allow choosing the mode per macvlan slave
> interface.
>
> This depends on a matching patch to iproute2
> in order to become accessible in user land.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> drivers/net/macvlan.c | 67 +++++++++++++++++++++++++++++++++++++++++-----
> include/linux/if_link.h | 15 ++++++++++
> 2 files changed, 74 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index fa8b568..731017e 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -33,12 +33,6 @@
>
> #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
>
> -enum macvlan_type {
> - MACVLAN_PRIVATE = 1,
> - MACVLAN_VEPA = 2,
> - MACVLAN_BRIDGE = 4,
> -};

I realize you defined MACVLAN_PRIVATE in patch 2, but used MACVLAN_MODE_PRIVATE,
so patch 2 is not compilable and breaks bisection ?


> +
> +enum ifla_macvlan_mode {
> + MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
> + MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
> + MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
> +};

2009-11-18 11:18:07

by Mark Smith

[permalink] [raw]
Subject: Re: [PATCH 0/3] macvlan: add vepa and bridge mode

On Tue, 17 Nov 2009 22:39:07 +0000
Arnd Bergmann <[email protected]> wrote:

> This is based on an earlier patch from Eric Biederman adding
> forwarding between macvlans. I extended his approach to
> allow the administrator to choose the mode for each macvlan,
> and to implement a functional VEPA between macvlan.
>
> Still missing from this is support for communication between
> the lower device that the macvlans are based on. This would
> be extremely useful but as others have found out before me
> requires significant changes not only to macvlan but also
> to the common transmit path.

If this means that the "children" macvlans can't communicate with their
"parent" interface as though they were all attached to the same virtual
ethernet segment, I think that is a reasonable limitation. On other
networking equipment I've used, the moment "sub-interfaces"
are created, their parent interface can't be used for any
communications, only for setting link related parameters e.g. for
ethernet interfaces, speed and duplex etc.

>
> I've seen one panic during testing this that I still need
> to track down, but it generally does what is advertised.
> I've tested VEPA operation with the hairpin support
> added to the bridge driver by Anna Fischer.
>
> My current plan is to submit this for inclusion in 2.6.33
> when people are happy with it and I tracked down any
> remaining bugs, and possibly to do the communication with
> the lower device one release later.
>
> Arnd <><
>
> ---
>
> Arnd Bergmann (3):
> macvlan: implement VEPA and private mode
> macvlan: export macvlan mode through netlink
> iplink: add macvlan options for bridge mode
>
> Eric Biederman (1):
> macvlan: Reflect macvlan packets meant for other macvlan devices
>
> linux/drivers/net/macvlan.c | 170 +++++++++++++++++++++++++++++++++-----
> linux/include/linux/if_link.h | 15 +++
> 2 files changed, 161 insertions(+), 24 deletions(-)
>
> iproute2/include/linux/if_link.h | 15 +++
> iproute2/ip/Makefile | 3 +-
> iproute2/ip/iplink_macvlan.c | 93 ++++++++++++++++++
> 3 files changed, 110 insertions(+), 1 deletions(-)
> create mode 100644 ip/iplink_macvlan.c
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-11-18 09:48:49

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

On Wednesday 18 November 2009, Eric Dumazet wrote:
> > - skb->dev = dev;
> > - skb->pkt_type = PACKET_HOST;
> > + skb->protocol = eth_type_trans(skb, dev);
> > + eth = eth_hdr(skb);
> >
> > - netif_rx(skb);
> > - return NULL;
> > + skb_dst_drop(skb);
>
> Why do you drop dst here ?
>
> It seems strange, since this driver specifically masks out IFF_XMIT_DST_RELEASE
> in its macvlan_setup() :
>
> dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
>
> If we really want to drop dst, it could be done by caller, if IFF_XMIT_DST_RELEASE
> was not masked in macvlan_setup().
>

That must be my fault, it is the only change I did to Eric B's patch when
forward-porting to 2.6.32. The original patch did

skb->protocol = eth_type_trans(skb, dev);
eth = eth_hdr(skb);
dst_release(skb->dst);
skb->dst = NULL;
skb->mark = 0;

and I tried to convert that in the same way that other drivers did, but I
have to admit that I did not understand the mechanics of IFF_XMIT_DST_RELEASE.

Arnd <><

2009-11-18 09:49:31

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/3] macvlan: implement VEPA and private mode

On Wednesday 18 November 2009, Eric Dumazet wrote:
> > + macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_VEPA
> > + | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
>
> typo here, you probably meant MACVLAN_PRIVATE | MACVLAN_VEPA | MACVLAN_BRIDGE

Well spotted. One last minute-change gone wrong.

Thanks,

Arnd <><

2009-11-18 09:59:52

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 3/3] macvlan: export macvlan mode through netlink

On Wednesday 18 November 2009, Eric Dumazet wrote:
> > --- a/drivers/net/macvlan.c
> > +++ b/drivers/net/macvlan.c
> > @@ -33,12 +33,6 @@
> >
> > #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
> >
> > -enum macvlan_type {
> > - MACVLAN_PRIVATE = 1,
> > - MACVLAN_VEPA = 2,
> > - MACVLAN_BRIDGE = 4,
> > -};
>
> I realize you defined MACVLAN_PRIVATE in patch 2, but used MACVLAN_MODE_PRIVATE,
> so patch 2 is not compilable and breaks bisection ?
>

Hmm, right. I'll fix that up as well.

Thanks,

Arnd <><

2009-11-18 10:00:54

by Roel Kluin

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

On Tue, Nov 17, 2009 at 11:39 PM, Arnd Bergmann <[email protected]> wrote:
> From: Eric Biederman <[email protected]>
>
> Switch ports do not send packets back out the same port they came
> in on. ?This causes problems when using a macvlan device inside
> of a network namespace as it becomes impossible to talk to
> other macvlan devices.
>
> Signed-off-by: Eric Biederman <[email protected]>
> Signed-off-by: Arnd Bergmann <[email protected]>

I found a problem:

> @@ -140,20 +145,45 @@ static void macvlan_broadcast(struct sk_buff *skb,
> ? ? ? ?}
> ?}
>
> +static int macvlan_unicast(struct sk_buff *skb, const struct macvlan_dev *dest)
> +{
> + ? ? ? struct net_device *dev = dest->dev;
> +
> + ? ? ? if (unlikely(!dev->flags & IFF_UP)) {

parentheses are missing:

if (unlikely(!(dev->flags & IFF_UP))) {

> + ? ? ? ? ? ? ? kfree_skb(skb);
> + ? ? ? ? ? ? ? return NET_XMIT_DROP;
> + ? ? ? }
> +
> + ? ? ? skb = skb_share_check(skb, GFP_ATOMIC);
> + ? ? ? if (!skb) {
> + ? ? ? ? ? ? ? dev->stats.rx_errors++;
> + ? ? ? ? ? ? ? dev->stats.rx_dropped++;
> + ? ? ? ? ? ? ? return NET_XMIT_DROP;
> + ? ? ? }
> +
> + ? ? ? dev->stats.rx_bytes += skb->len + ETH_HLEN;
> + ? ? ? dev->stats.rx_packets++;
> +
> + ? ? ? skb->dev = dev;
> + ? ? ? skb->pkt_type = PACKET_HOST;
> + ? ? ? netif_rx(skb);
> + ? ? ? return NET_XMIT_SUCCESS;
> +}

2009-11-18 14:38:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

Arnd Bergmann <[email protected]> writes:

> On Wednesday 18 November 2009, Eric Dumazet wrote:
>> > - skb->dev = dev;
>> > - skb->pkt_type = PACKET_HOST;
>> > + skb->protocol = eth_type_trans(skb, dev);
>> > + eth = eth_hdr(skb);
>> >
>> > - netif_rx(skb);
>> > - return NULL;
>> > + skb_dst_drop(skb);
>>
>> Why do you drop dst here ?
>>
>> It seems strange, since this driver specifically masks out IFF_XMIT_DST_RELEASE
>> in its macvlan_setup() :
>>
>> dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
>>
>> If we really want to drop dst, it could be done by caller, if IFF_XMIT_DST_RELEASE
>> was not masked in macvlan_setup().
>>
>
> That must be my fault, it is the only change I did to Eric B's patch when
> forward-porting to 2.6.32. The original patch did
>
> skb->protocol = eth_type_trans(skb, dev);
> eth = eth_hdr(skb);
> dst_release(skb->dst);
> skb->dst = NULL;
> skb->mark = 0;
>
> and I tried to convert that in the same way that other drivers did, but I
> have to admit that I did not understand the mechanics of IFF_XMIT_DST_RELEASE.

Please copy and ideally share code with the veth driver for recycling a skb.
There are bunch of little things you have to do to get it right. As I recally
I was missing a few details in my original patch.

Eric

2009-11-18 14:44:34

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

On Wednesday 18 November 2009, Eric W. Biederman wrote:
> Please copy and ideally share code with the veth driver for recycling a skb.
> There are bunch of little things you have to do to get it right. As I recally
> I was missing a few details in my original patch.

Ok, thanks for the hint!

Arnd <><

2009-11-18 23:33:57

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

On Wednesday 18 November 2009 14:37:50 Eric W. Biederman wrote:
> Arnd Bergmann <[email protected]> writes:
> > On Wednesday 18 November 2009, Eric Dumazet wrote:
> >>
> >> Why do you drop dst here ?
> >>
> >> It seems strange, since this driver specifically masks out IFF_XMIT_DST_RELEASE
> >> in its macvlan_setup() :
> >>
> >> dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;

It seems that we should never drop dst then. We either forward the frame to
netif_rx or to dev_queue_xmit, and from how I read it now, we want to keep
the dst in both cases.

> Please copy and ideally share code with the veth driver for recycling a skb.
> There are bunch of little things you have to do to get it right. As I recally
> I was missing a few details in my original patch.

Are you thinking of something like the patch below? I haven't had the chance
to test this, but one thing it does is to handle the internal forwarding
differently from the receive path.

Arnd <><

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 731017e..73f8cb1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -114,6 +114,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
struct net_device *dev;
struct sk_buff *nskb;
unsigned int i;
+ int err;

if (skb->protocol == htons(ETH_P_PAUSE))
return;
@@ -135,47 +136,28 @@ static void macvlan_broadcast(struct sk_buff *skb,
continue;
}

- dev->stats.rx_bytes += skb->len + ETH_HLEN;
- dev->stats.rx_packets++;
- dev->stats.multicast++;
-
- nskb->dev = dev;
- if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast))
- nskb->pkt_type = PACKET_BROADCAST;
- else
- nskb->pkt_type = PACKET_MULTICAST;
-
- netif_rx(nskb);
+ if (mode == MACVLAN_MODE_BRIDGE) {
+ err = (dev_forward_skb(dev, nskb) < 0);
+ } else {
+ nskb->dev = dev;
+ if (!compare_ether_addr_64bits(eth->h_dest,
+ dev->broadcast))
+ nskb->pkt_type = PACKET_BROADCAST;
+ else
+ nskb->pkt_type = PACKET_MULTICAST;
+ err = (netif_rx(nskb) != NET_RX_SUCCESS);
+ }
+ if (err) {
+ dev->stats.rx_dropped++;
+ } else {
+ dev->stats.rx_bytes += skb->len + ETH_HLEN;
+ dev->stats.rx_packets++;
+ dev->stats.multicast++;
+ }
}
}
}

-static int macvlan_unicast(struct sk_buff *skb, const struct macvlan_dev *dest)
-{
- struct net_device *dev = dest->dev;
-
- if (unlikely(!dev->flags & IFF_UP)) {
- kfree_skb(skb);
- return NET_XMIT_DROP;
- }
-
- skb = skb_share_check(skb, GFP_ATOMIC);
- if (!skb) {
- dev->stats.rx_errors++;
- dev->stats.rx_dropped++;
- return NET_XMIT_DROP;
- }
-
- dev->stats.rx_bytes += skb->len + ETH_HLEN;
- dev->stats.rx_packets++;
-
- skb->dev = dev;
- skb->pkt_type = PACKET_HOST;
- netif_rx(skb);
- return NET_XMIT_SUCCESS;
-}
-
-
/* called under rcu_read_lock() from netif_receive_skb */
static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
{
@@ -183,6 +165,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct macvlan_port *port;
const struct macvlan_dev *vlan;
const struct macvlan_dev *src;
+ struct net_device *dev;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
@@ -192,7 +175,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
src = macvlan_hash_lookup(port, eth->h_source);
if (!src)
/* frame comes from an external address */
- macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_VEPA
+ macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE
| MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
else if (src->mode == MACVLAN_MODE_VEPA)
/* flood to everyone except source */
@@ -210,7 +193,26 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
if (vlan == NULL)
return skb;

- macvlan_unicast(skb, vlan);
+ dev = vlan->dev;
+ if (unlikely(!(dev->flags & IFF_UP))) {
+ kfree_skb(skb);
+ return NULL;
+ }
+
+ skb = skb_share_check(skb, GFP_ATOMIC);
+ if (!skb) {
+ dev->stats.rx_errors++;
+ dev->stats.rx_dropped++;
+ return NULL;
+ }
+
+ dev->stats.rx_bytes += skb->len + ETH_HLEN;
+ dev->stats.rx_packets++;
+
+ skb->dev = dev;
+ skb->pkt_type = PACKET_HOST;
+
+ netif_rx(skb);
return NULL;
}

@@ -227,17 +229,12 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
- const struct ethhdr *eth;

skb->protocol = eth_type_trans(skb, dev);
- eth = eth_hdr(skb);
-
- skb_dst_drop(skb);
- skb->mark = 0;
- secpath_reset(skb);
- nf_reset(skb);

if (vlan->mode == MACVLAN_MODE_BRIDGE) {
+ const struct ethhdr *eth = eth_hdr(skb);
+
/* send to other bridge ports directly */
if (is_multicast_ether_addr(eth->h_dest)) {
macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
@@ -245,8 +242,17 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
}

dest = macvlan_hash_lookup(port, eth->h_dest);
- if (dest && dest->mode == MACVLAN_MODE_BRIDGE)
- return macvlan_unicast(skb, dest);
+ if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
+ int length = dev_forward_skb(dest->dev, skb);
+ if (length < 0) {
+ dev->stats.rx_dropped++;
+ } else {
+ dev->stats.rx_bytes += length;
+ dev->stats.rx_packets++;
+ }
+
+ return NET_XMIT_SUCCESS;
+ }
}

return macvlan_xmit_world(skb, dev);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index ade5b34..5bb7fb9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
struct veth_net_stats *stats, *rcv_stats;
int length, cpu;

- skb_orphan(skb);
-
priv = netdev_priv(dev);
rcv = priv->peer;
rcv_priv = netdev_priv(rcv);
@@ -165,23 +163,13 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
stats = per_cpu_ptr(priv->stats, cpu);
rcv_stats = per_cpu_ptr(rcv_priv->stats, cpu);

- if (!(rcv->flags & IFF_UP))
- goto tx_drop;
-
- if (skb->len > (rcv->mtu + MTU_PAD))
- goto rx_drop;
-
- skb->tstamp.tv64 = 0;
- skb->pkt_type = PACKET_HOST;
- skb->protocol = eth_type_trans(skb, rcv);
if (dev->features & NETIF_F_NO_CSUM)
skb->ip_summed = rcv_priv->ip_summed;
+
+ length = dev_forward_skb(rcv, skb);

- skb->mark = 0;
- secpath_reset(skb);
- nf_reset(skb);
-
- length = skb->len;
+ if (length < 0)
+ goto drop;

stats->tx_bytes += length;
stats->tx_packets++;
@@ -189,17 +177,14 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
rcv_stats->rx_bytes += length;
rcv_stats->rx_packets++;

- netif_rx(skb);
return NETDEV_TX_OK;

-tx_drop:
+drop:
kfree_skb(skb);
- stats->tx_dropped++;
- return NETDEV_TX_OK;
-
-rx_drop:
- kfree_skb(skb);
- rcv_stats->rx_dropped++;
+ if (length == -ENETDOWN)
+ stats->tx_dropped++;
+ else
+ rcv_stats->rx_dropped++;
return NETDEV_TX_OK;
}

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 812a5f3..90225d6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1502,6 +1502,7 @@ extern int dev_set_mac_address(struct net_device *,
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev,
struct netdev_queue *txq);
+extern int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);

extern int netdev_budget;

diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..ca89b81 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -104,6 +104,7 @@
#include <net/dst.h>
#include <net/pkt_sched.h>
#include <net/checksum.h>
+#include <net/xfrm.h>
#include <linux/highmem.h>
#include <linux/init.h>
#include <linux/kmod.h>
@@ -1352,6 +1353,42 @@ static inline void net_timestamp(struct sk_buff *skb)
skb->tstamp.tv64 = 0;
}

+/**
+ * dev_forward_skb - loopback an skb to another netif
+ *
+ * @dev: destination network device
+ * @skb: buffer to forward
+ *
+ * dev_forward_skb can be used for injecting an skb from the
+ * start_xmit function of one device into the receive queue
+ * of another device.
+ */
+int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
+{
+ int sent;
+
+ skb_orphan(skb);
+
+ if (!(dev->flags & IFF_UP))
+ return -ENETDOWN;
+
+ if (skb->len > (dev->mtu + dev->hard_header_len))
+ return -EMSGSIZE;
+
+ skb->tstamp.tv64 = 0;
+ skb->pkt_type = PACKET_HOST;
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->mark = 0;
+ secpath_reset(skb);
+ nf_reset(skb);
+ sent = skb->len;
+ if (netif_rx(skb) == NET_RX_DROP)
+ return -EBUSY;
+
+ return sent;
+}
+EXPORT_SYMBOL_GPL(dev_forward_skb);
+
/*
* Support routine. Sends outgoing frames to any network
* taps currently in use.

2009-11-18 23:55:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

Arnd Bergmann <[email protected]> writes:

> On Wednesday 18 November 2009 14:37:50 Eric W. Biederman wrote:
>> Arnd Bergmann <[email protected]> writes:
>> > On Wednesday 18 November 2009, Eric Dumazet wrote:
>> >>
>> >> Why do you drop dst here ?
>> >>
>> >> It seems strange, since this driver specifically masks out IFF_XMIT_DST_RELEASE
>> >> in its macvlan_setup() :
>> >>
>> >> dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
>
> It seems that we should never drop dst then. We either forward the frame to
> netif_rx or to dev_queue_xmit, and from how I read it now, we want to keep
> the dst in both cases.

When we loop back on our selves we certainly need to have dst clear because
we don't know how to cache routes through multiple network namespaces. It
might be handy if we had those but that is a problem for another day.

>> Please copy and ideally share code with the veth driver for recycling a skb.
>> There are bunch of little things you have to do to get it right. As I recally
>> I was missing a few details in my original patch.
>
> Are you thinking of something like the patch below? I haven't had the chance
> to test this, but one thing it does is to handle the internal forwarding
> differently from the receive path.

Yes. That is a bit more sharing than I had anticipated, but it looks like
it works.

Eric

2009-11-19 11:47:01

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

On Thursday 19 November 2009, Eric W. Biederman wrote:
> > It seems that we should never drop dst then. We either forward the frame to
> > netif_rx or to dev_queue_xmit, and from how I read it now, we want to keep
> > the dst in both cases.
>
> When we loop back on our selves we certainly need to have dst clear because
> we don't know how to cache routes through multiple network namespaces.

Ah, right. So should I add the explicit dst_drop to the new dev_forward_skb()
then? The veth driver doesn't need it, but it also looks like it won't hurt.

Arnd <><

2009-11-19 14:38:33

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 3/3] macvlan: export macvlan mode through netlink

Arnd Bergmann wrote:
> On Wednesday 18 November 2009, Eric Dumazet wrote:
>>> --- a/drivers/net/macvlan.c
>>> +++ b/drivers/net/macvlan.c
>>> @@ -33,12 +33,6 @@
>>>
>>> #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
>>>
>>> -enum macvlan_type {
>>> - MACVLAN_PRIVATE = 1,
>>> - MACVLAN_VEPA = 2,
>>> - MACVLAN_BRIDGE = 4,
>>> -};
>> I realize you defined MACVLAN_PRIVATE in patch 2, but used MACVLAN_MODE_PRIVATE,
>> so patch 2 is not compilable and breaks bisection ?
>>
>
> Hmm, right. I'll fix that up as well.

> +enum ifla_macvlan_mode {
> + MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
> + MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
> + MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
> +};

Please also keep the enum named something without ifla_ since
the mode values themselves are not netlink attributes. Just
macvlan_mode seems fine.

2009-11-19 14:47:08

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 3/3] macvlan: export macvlan mode through netlink

On Thursday 19 November 2009, Patrick McHardy wrote:
> > +enum ifla_macvlan_mode {
> > + MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
> > + MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
> > + MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
> > +};
>
> Please also keep the enum named something without ifla_ since
> the mode values themselves are not netlink attributes. Just
> macvlan_mode seems fine.

Ok, good. I was unsure on the naming and also changed between MACVLAN_*
and MACVLAN_MODE_* a few times.

Thanks for the clarification,

Arnd <><

2009-11-19 14:47:52

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/3] macvlan: Reflect macvlan packets meant for other macvlan devices

Arnd Bergmann wrote:
> On Thursday 19 November 2009, Eric W. Biederman wrote:
>>> It seems that we should never drop dst then. We either forward the frame to
>>> netif_rx or to dev_queue_xmit, and from how I read it now, we want to keep
>>> the dst in both cases.
>> When we loop back on our selves we certainly need to have dst clear because
>> we don't know how to cache routes through multiple network namespaces.
>
> Ah, right. So should I add the explicit dst_drop to the new dev_forward_skb()
> then? The veth driver doesn't need it, but it also looks like it won't hurt.

Yes, I think that should be fine.

2009-12-18 13:45:37

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] iplink: add macvlan options for bridge mode

Ping!

Stephen, I submitted this twice but never heard back from you.
The changes to macvlan have been merged in 2.6.33-rc1, so it
would be good to have this included as well.

Arnd

On Tuesday 17 November 2009, Arnd Bergmann wrote:
> Macvlan can now optionally support forwarding between its
> ports, if they are in "bridge" mode. This adds support
> for this option to "ip link add", "ip link set" and "ip
> -d link show".
>
> The default mode in the kernel is now "vepa" mode, meaning
> "virtual ethernet port aggregator". This mode is used
> together with the "hairpin" mode of an ethernet bridge
> that the parent of the macvlan device is connected to.
> All frames still get sent out to the external interface,
> but the adjacent bridge is able to send them back on
> the same wire in hairpin mode, so the macvlan ports
> are able to see each other, which the bridge can be
> configured to monitor and control traffic between
> all macvlan instances. Multicast traffic coming in
> from the external interface is checked for the source
> MAC address and only delivered to ports that have not
> yet seen it.
>
> In bridge mode, macvlan will send all multicast traffic
> to other interfaces that are also in bridge mode but
> not to those in vepa mode, which get them on the way
> back from the hairpin.
>
> The third supported mode is "private", which prevents
> communication between macvlans even if the adjacent
> bridge is in hairpin mode. This behavior is closer to
> the original implementation of macvlan but stricly
> maintains isolation.
>
> Signed-off-by: Arnd Bergmann <[email protected]>

2009-12-18 17:26:13

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH] iplink: add macvlan options for bridge mode

On Fri, 18 Dec 2009 14:45:07 +0100
Arnd Bergmann <[email protected]> wrote:

> Ping!
>
> Stephen, I submitted this twice but never heard back from you.
> The changes to macvlan have been merged in 2.6.33-rc1, so it
> would be good to have this included as well.
>
> Arnd
>

iproute gets updated (after patch is in upstream). I haven't started
on the 2.6.33 version yet.

2009-12-18 17:40:22

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH] iplink: add macvlan options for bridge mode

On Friday 18 December 2009, Stephen Hemminger wrote:
> On Fri, 18 Dec 2009 14:45:07 +0100
> Arnd Bergmann <[email protected]> wrote:
>
> > Ping!
> >
> > Stephen, I submitted this twice but never heard back from you.
> > The changes to macvlan have been merged in 2.6.33-rc1, so it
> > would be good to have this included as well.
> >
> > Arnd
> >
>
> iproute gets updated (after patch is in upstream). I haven't started
> on the 2.6.33 version yet.

Ok, thanks for the info. Tell me when/if I should resubmit.
I also have another (smaller) patch that also accepts the same
options for the macvtap driver, which is not upstream.
Should I also send that when the kernel code hits net-next,
2.6.34-rc1 or 2.6.34?

Arnd