2009-11-24 00:56:56

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCHv2 0/4] macvlan: add vepa and bridge mode

Second version, all feedback so far addressed, thanks for the
help and interest!

The patch to iproute2 has not changed, so I'm not including
it this time. Patch 4/4 (the netlink interface) is basically
unchanged as well but included for completeness.

The other changes have moved forward a bit, to the point where
I find them a lot cleaner and am more confident in the code
being ready for inclusion. The implementation hardly resembles
Erics original patch now, so I've dropped his signed-off-by.

Please take a look and ack if you are happy so we can get it
into 2.6.33.

---

Version 1 description:
This is based on an earlier patch from Eric Biederman adding
forwarding between macvlans. I extended his approach to
allow the administrator to choose the mode for each macvlan,
and to implement a functional VEPA between macvlan.

Still missing from this is support for communication between
the lower device that the macvlans are based on. This would
be extremely useful but as others have found out before me
requires significant changes not only to macvlan but also
to the common transmit path.

I've tested VEPA operation with the hairpin support
added to the bridge driver by Anna Fischer.

Arnd <><

Arnd Bergmann (4):
veth: move loopback logic to common location
macvlan: cleanup rx statistics
macvlan: implement bridge, VEPA and private mode
macvlan: export macvlan mode through netlink

drivers/net/macvlan.c | 183 ++++++++++++++++++++++++++++++++++++---------
drivers/net/veth.c | 17 +----
include/linux/if_link.h | 15 ++++
include/linux/netdevice.h | 2 +
net/core/dev.c | 36 +++++++++
5 files changed, 204 insertions(+), 49 deletions(-)


2009-11-24 00:56:24

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 1/4] veth: move loopback logic to common location

The veth driver contains code to forward an skb
from the start_xmit function of one network
device into the receive path of another device.

Moving that code into a common location lets us
reuse the code for direct forwarding of data
between macvlan ports, and possibly in other
drivers.

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/veth.c | 17 +++--------------
include/linux/netdevice.h | 2 ++
net/core/dev.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 2d657f2..6c4b5a2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
struct veth_net_stats *stats, *rcv_stats;
int length, cpu;

- skb_orphan(skb);
-
priv = netdev_priv(dev);
rcv = priv->peer;
rcv_priv = netdev_priv(rcv);
@@ -168,20 +166,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
if (!(rcv->flags & IFF_UP))
goto tx_drop;

- if (skb->len > (rcv->mtu + MTU_PAD))
- goto rx_drop;
-
- skb->tstamp.tv64 = 0;
- skb->pkt_type = PACKET_HOST;
- skb->protocol = eth_type_trans(skb, rcv);
if (dev->features & NETIF_F_NO_CSUM)
skb->ip_summed = rcv_priv->ip_summed;

- skb->mark = 0;
- secpath_reset(skb);
- nf_reset(skb);
-
- length = skb->len;
+ length = skb->len + ETH_HLEN;
+ if (dev_forward_skb(rcv, skb) != NET_RX_SUCCESS)
+ goto rx_drop;

stats->tx_bytes += length;
stats->tx_packets++;
@@ -189,7 +179,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
rcv_stats->rx_bytes += length;
rcv_stats->rx_packets++;

- netif_rx(skb);
return NETDEV_TX_OK;

tx_drop:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 97873e3..9428793 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1562,6 +1562,8 @@ extern int dev_set_mac_address(struct net_device *,
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev,
struct netdev_queue *txq);
+extern int dev_forward_skb(struct net_device *dev,
+ struct sk_buff *skb);

extern int netdev_budget;

diff --git a/net/core/dev.c b/net/core/dev.c
index ccefa24..ba18e82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -105,6 +105,7 @@
#include <net/dst.h>
#include <net/pkt_sched.h>
#include <net/checksum.h>
+#include <net/xfrm.h>
#include <linux/highmem.h>
#include <linux/init.h>
#include <linux/kmod.h>
@@ -1419,6 +1420,41 @@ static inline void net_timestamp(struct sk_buff *skb)
skb->tstamp.tv64 = 0;
}

+/**
+ * dev_forward_skb - loopback an skb to another netif
+ *
+ * @dev: destination network device
+ * @skb: buffer to forward
+ *
+ * return values:
+ * NET_RX_SUCCESS (no congestion)
+ * NET_RX_DROP (packet was dropped)
+ *
+ * dev_forward_skb can be used for injecting an skb from the
+ * start_xmit function of one device into the receive queue
+ * of another device.
+ */
+int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
+{
+ skb_orphan(skb);
+
+ if (!(dev->flags & IFF_UP))
+ return NET_RX_DROP;
+
+ if (skb->len > (dev->mtu + dev->hard_header_len))
+ return NET_RX_DROP;
+
+ skb_dst_drop(skb);
+ skb->tstamp.tv64 = 0;
+ skb->pkt_type = PACKET_HOST;
+ skb->protocol = eth_type_trans(skb, dev);
+ skb->mark = 0;
+ secpath_reset(skb);
+ nf_reset(skb);
+ return netif_rx(skb);
+}
+EXPORT_SYMBOL_GPL(dev_forward_skb);
+
/*
* Support routine. Sends outgoing frames to any network
* taps currently in use.
--
1.6.3.3

2009-11-24 00:57:05

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 2/4] macvlan: cleanup rx statistics

We have very similar code for rx statistics in
two places in the macvlan driver, with a third
one being added in the next patch.

Consolidate them into one function to improve
overall readability of the driver.

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 63 +++++++++++++++++++++++++-----------------------
1 files changed, 33 insertions(+), 30 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index ae2b5c7..a0dea23 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -116,42 +116,53 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
return 0;
}

+static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
+ int success, int multicast)
+{
+ struct macvlan_rx_stats *rx_stats;
+
+ rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+ rx_stats->rx_packets += success != 0;
+ rx_stats->rx_bytes += success ? length : 0;
+ rx_stats->multicast += success && multicast;
+ rx_stats->rx_errors += !success;
+}
+
+static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+ const struct ethhdr *eth)
+{
+ if (!skb)
+ return NET_RX_DROP;
+
+ skb->dev = dev;
+ if (!compare_ether_addr_64bits(eth->h_dest,
+ dev->broadcast))
+ skb->pkt_type = PACKET_BROADCAST;
+ else
+ skb->pkt_type = PACKET_MULTICAST;
+
+ return netif_rx(skb);
+}
+
static void macvlan_broadcast(struct sk_buff *skb,
const struct macvlan_port *port)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
struct hlist_node *n;
- struct net_device *dev;
struct sk_buff *nskb;
unsigned int i;
- struct macvlan_rx_stats *rx_stats;
+ int err;

if (skb->protocol == htons(ETH_P_PAUSE))
return;

for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
- dev = vlan->dev;
- rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-
nskb = skb_clone(skb, GFP_ATOMIC);
- if (nskb == NULL) {
- rx_stats->rx_errors++;
- continue;
- }
-
- rx_stats->rx_bytes += skb->len + ETH_HLEN;
- rx_stats->rx_packets++;
- rx_stats->multicast++;
-
- nskb->dev = dev;
- if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast))
- nskb->pkt_type = PACKET_BROADCAST;
- else
- nskb->pkt_type = PACKET_MULTICAST;
-
- netif_rx(nskb);
+ err = macvlan_broadcast_one(nskb, vlan->dev, eth);
+ macvlan_count_rx(vlan, skb->len + ETH_HLEN,
+ likely(err == NET_RX_SUCCESS), 1);
}
}
}
@@ -163,7 +174,6 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct macvlan_port *port;
const struct macvlan_dev *vlan;
struct net_device *dev;
- struct macvlan_rx_stats *rx_stats;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
@@ -183,15 +193,8 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
kfree_skb(skb);
return NULL;
}
- rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
skb = skb_share_check(skb, GFP_ATOMIC);
- if (skb == NULL) {
- rx_stats->rx_errors++;
- return NULL;
- }
-
- rx_stats->rx_bytes += skb->len + ETH_HLEN;
- rx_stats->rx_packets++;
+ macvlan_count_rx(vlan, skb->len + ETH_HLEN, likely(skb != NULL), 0);

skb->dev = dev;
skb->pkt_type = PACKET_HOST;
--
1.6.3.3

2009-11-24 00:56:26

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 3/4] macvlan: implement bridge, VEPA and private mode

This allows each macvlan slave device to be in one
of three modes, depending on the use case:

MACVLAN_PRIVATE:
The device never communicates with any other device
on the same upper_dev. This even includes frames
coming back from a reflective relay, where supported
by the adjacent bridge.

MACVLAN_VEPA:
The new Virtual Ethernet Port Aggregator (VEPA) mode,
we assume that the adjacent bridge returns all frames
where both source and destination are local to the
macvlan port, i.e. the bridge is set up as a reflective
relay.
Broadcast frames coming in from the upper_dev get
flooded to all macvlan interfaces in VEPA mode.
We never deliver any frames locally.

MACVLAN_BRIDGE:
We provide the behavior of a simple bridge between
different macvlan interfaces on the same port. Frames
from one interface to another one get delivered directly
and are not sent out externally. Broadcast frames get
flooded to all other bridge ports and to the external
interface, but when they come back from a reflective
relay, we don't deliver them again.
Since we know all the MAC addresses, the macvlan bridge
mode does not require learning or STP like the bridge
module does.

Based on an earlier patch "macvlan: Reflect macvlan packets
meant for other macvlan devices" by Eric Biederman.

Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Eric Biederman <[email protected]>
---
drivers/net/macvlan.c | 75 +++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index a0dea23..b840b3a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -29,9 +29,16 @@
#include <linux/if_link.h>
#include <linux/if_macvlan.h>
#include <net/rtnetlink.h>
+#include <net/xfrm.h>

#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)

+enum macvlan_mode {
+ MACVLAN_MODE_PRIVATE = 1,
+ MACVLAN_MODE_VEPA = 2,
+ MACVLAN_MODE_BRIDGE = 4,
+};
+
struct macvlan_port {
struct net_device *dev;
struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
@@ -59,6 +66,7 @@ struct macvlan_dev {
struct macvlan_port *port;
struct net_device *lowerdev;
struct macvlan_rx_stats *rx_stats;
+ enum macvlan_mode mode;
};


@@ -129,11 +137,14 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
}

static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
- const struct ethhdr *eth)
+ const struct ethhdr *eth, int local)
{
if (!skb)
return NET_RX_DROP;

+ if (local)
+ return dev_forward_skb(dev, skb);
+
skb->dev = dev;
if (!compare_ether_addr_64bits(eth->h_dest,
dev->broadcast))
@@ -145,7 +156,9 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
}

static void macvlan_broadcast(struct sk_buff *skb,
- const struct macvlan_port *port)
+ const struct macvlan_port *port,
+ struct net_device *src,
+ enum macvlan_mode mode)
{
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_dev *vlan;
@@ -159,8 +172,12 @@ static void macvlan_broadcast(struct sk_buff *skb,

for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
+ if ((vlan->dev == src) || !(vlan->mode & mode))
+ continue;
+
nskb = skb_clone(skb, GFP_ATOMIC);
- err = macvlan_broadcast_one(nskb, vlan->dev, eth);
+ err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+ mode == MACVLAN_MODE_BRIDGE);
macvlan_count_rx(vlan, skb->len + ETH_HLEN,
likely(err == NET_RX_SUCCESS), 1);
}
@@ -173,6 +190,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct ethhdr *eth = eth_hdr(skb);
const struct macvlan_port *port;
const struct macvlan_dev *vlan;
+ const struct macvlan_dev *src;
struct net_device *dev;

port = rcu_dereference(skb->dev->macvlan_port);
@@ -180,7 +198,20 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
return skb;

if (is_multicast_ether_addr(eth->h_dest)) {
- macvlan_broadcast(skb, port);
+ src = macvlan_hash_lookup(port, eth->h_source);
+ if (!src)
+ /* frame comes from an external address */
+ macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE
+ | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ else if (src->mode == MACVLAN_MODE_VEPA)
+ /* flood to everyone except source */
+ macvlan_broadcast(skb, port, src->dev,
+ MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ else if (src->mode == MACVLAN_MODE_BRIDGE)
+ /* flood only to VEPA ports, bridge ports
+ already saw the frame */
+ macvlan_broadcast(skb, port, src->dev,
+ MACVLAN_MODE_VEPA);
return skb;
}

@@ -203,18 +234,46 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
return NULL;
}

+static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ const struct macvlan_dev *vlan = netdev_priv(dev);
+ const struct macvlan_port *port = vlan->port;
+ const struct macvlan_dev *dest;
+
+ if (vlan->mode == MACVLAN_MODE_BRIDGE) {
+ const struct ethhdr *eth = (void *)skb->data;
+
+ /* send to other bridge ports directly */
+ if (is_multicast_ether_addr(eth->h_dest)) {
+ macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
+ goto xmit_world;
+ }
+
+ dest = macvlan_hash_lookup(port, eth->h_dest);
+ if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
+ int length = skb->len + ETH_HLEN;
+ int ret = dev_forward_skb(dest->dev, skb);
+ macvlan_count_rx(dest, length,
+ likely(ret == NET_RX_SUCCESS), 0);
+
+ return NET_XMIT_SUCCESS;
+ }
+ }
+
+xmit_world:
+ skb->dev = vlan->lowerdev;
+ return dev_queue_xmit(skb);
+}
+
static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
struct net_device *dev)
{
int i = skb_get_queue_mapping(skb);
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- const struct macvlan_dev *vlan = netdev_priv(dev);
unsigned int len = skb->len;
int ret;

- skb->dev = vlan->lowerdev;
- ret = dev_queue_xmit(skb);
-
+ ret = macvlan_queue_xmit(skb, dev);
if (likely(ret == NET_XMIT_SUCCESS)) {
txq->tx_packets++;
txq->tx_bytes += len;
--
1.6.3.3

2009-11-24 00:57:36

by Arnd Bergmann

[permalink] [raw]
Subject: [PATCH 4/4] macvlan: export macvlan mode through netlink

In order to support all three modes of macvlan at
runtime, extend the existing netlink protocol
to allow choosing the mode per macvlan slave
interface.

This depends on a matching patch to iproute2
in order to become accessible in user land.

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 63 ++++++++++++++++++++++++++++++++++++++++++----
include/linux/if_link.h | 15 +++++++++++
2 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index b840b3a..3db96b9 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -33,12 +33,6 @@

#define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)

-enum macvlan_mode {
- MACVLAN_MODE_PRIVATE = 1,
- MACVLAN_MODE_VEPA = 2,
- MACVLAN_MODE_BRIDGE = 4,
-};
-
struct macvlan_port {
struct net_device *dev;
struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
@@ -600,6 +594,18 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
return -EADDRNOTAVAIL;
}
+
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+ switch (mode) {
+ case MACVLAN_MODE_PRIVATE:
+ case MACVLAN_MODE_VEPA:
+ case MACVLAN_MODE_BRIDGE:
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
return 0;
}

@@ -664,6 +670,13 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
vlan->dev = dev;
vlan->port = port;

+ vlan->mode = MACVLAN_MODE_VEPA;
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+
+ vlan->mode = mode;
+ }
+
err = register_netdevice(dev);
if (err < 0)
return err;
@@ -685,6 +698,39 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
macvlan_port_destroy(port->dev);
}

+static int macvlan_changelink(struct net_device *dev,
+ struct nlattr *tb[], struct nlattr *data[])
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+ if (data && data[IFLA_MACVLAN_MODE]) {
+ u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
+ vlan->mode = mode;
+ }
+
+ return 0;
+}
+
+static size_t macvlan_get_size(const struct net_device *dev)
+{
+ return nla_total_size(4);
+}
+
+static int macvlan_fill_info(struct sk_buff *skb,
+ const struct net_device *dev)
+{
+ struct macvlan_dev *vlan = netdev_priv(dev);
+
+ NLA_PUT_U32(skb, IFLA_MACVLAN_MODE, vlan->mode);
+ return 0;
+
+nla_put_failure:
+ return -EMSGSIZE;
+}
+
+static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
+ [IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
+};
+
static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.kind = "macvlan",
.priv_size = sizeof(struct macvlan_dev),
@@ -693,6 +739,11 @@ static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
.validate = macvlan_validate,
.newlink = macvlan_newlink,
.dellink = macvlan_dellink,
+ .maxtype = IFLA_MACVLAN_MAX,
+ .policy = macvlan_policy,
+ .changelink = macvlan_changelink,
+ .get_size = macvlan_get_size,
+ .fill_info = macvlan_fill_info,
};

static int macvlan_device_event(struct notifier_block *unused,
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 1d3b242..6674791 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -181,4 +181,19 @@ struct ifla_vlan_qos_mapping {
__u32 to;
};

+/* MACVLAN section */
+enum {
+ IFLA_MACVLAN_UNSPEC,
+ IFLA_MACVLAN_MODE,
+ __IFLA_MACVLAN_MAX,
+};
+
+#define IFLA_MACVLAN_MAX (__IFLA_MACVLAN_MAX - 1)
+
+enum macvlan_mode {
+ MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
+ MACVLAN_MODE_VEPA = 2, /* talk to other ports through ext bridge */
+ MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */
+};
+
#endif /* _LINUX_IF_LINK_H */
--
1.6.3.3

2009-11-24 08:16:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 2/4] macvlan: cleanup rx statistics

Arnd Bergmann a ?crit :
> We have very similar code for rx statistics in
> two places in the macvlan driver, with a third
> one being added in the next patch.
>
> Consolidate them into one function to improve
> overall readability of the driver.
>
> Signed-off-by: Arnd Bergmann <[email protected]>
> ---
> drivers/net/macvlan.c | 63 +++++++++++++++++++++++++-----------------------
> 1 files changed, 33 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index ae2b5c7..a0dea23 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -116,42 +116,53 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
> return 0;
> }
>
> +static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
> + int success, int multicast)

success and multicast should be declared as bool

> +{
> + struct macvlan_rx_stats *rx_stats;
> +
> + rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> + rx_stats->rx_packets += success != 0;
> + rx_stats->rx_bytes += success ? length : 0;
> + rx_stats->multicast += success && multicast;
> + rx_stats->rx_errors += !success;
> +}
> +

I find following more readable, it probably generates more branches,
but avoid dirtying rx_errors if it is in another cache line.

if (likely(success)) {
rx_stats->rx_packets++;
rx_stats->rx_bytes += length;
if (multicast)
rx_stats->multicast++;
} else {
rx_stats->rx_errors++;
}


> - rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> skb = skb_share_check(skb, GFP_ATOMIC);
> - if (skb == NULL) {
> - rx_stats->rx_errors++;
> - return NULL;
> - }
> -
> - rx_stats->rx_bytes += skb->len + ETH_HLEN;
> - rx_stats->rx_packets++;
> + macvlan_count_rx(vlan, skb->len + ETH_HLEN, likely(skb != NULL), 0);

its not _likely_ that skb != NULL, its a fact :)

-> macvlan_count_rx(vlan, skb->len + ETH_HLEN, true, false);

2009-11-24 08:45:30

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/4] macvlan: cleanup rx statistics

On Tuesday 24 November 2009 08:15:53 Eric Dumazet wrote:
> Arnd Bergmann a ?crit :
> >
> > +static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
> > + int success, int multicast)
>
> success and multicast should be declared as bool

ok

> > +{
> > + struct macvlan_rx_stats *rx_stats;
> > +
> > + rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> > + rx_stats->rx_packets += success != 0;
> > + rx_stats->rx_bytes += success ? length : 0;
> > + rx_stats->multicast += success && multicast;
> > + rx_stats->rx_errors += !success;
> > +}
> > +
>
> I find following more readable, it probably generates more branches,
> but avoid dirtying rx_errors if it is in another cache line.
>
> if (likely(success)) {
> rx_stats->rx_packets++;
> rx_stats->rx_bytes += length;
> if (multicast)
> rx_stats->multicast++;
> } else {
> rx_stats->rx_errors++;
> }

Given that the structure only has four members and alloc_percpu requests
cache aligned data, it is rather likely to be in the same cache line.

I'll have a look at what gcc generates on x86-64 for both versions
and use the version you suggested unless it looks significantly more
expensive.

Since we're into micro-optimization territory, do you think it should
be marked inline or not?

> > - rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> > skb = skb_share_check(skb, GFP_ATOMIC);
> > - if (skb == NULL) {
> > - rx_stats->rx_errors++;
> > - return NULL;
> > - }
> > -
> > - rx_stats->rx_bytes += skb->len + ETH_HLEN;
> > - rx_stats->rx_packets++;
> > + macvlan_count_rx(vlan, skb->len + ETH_HLEN, likely(skb != NULL), 0);
>
> its not _likely_ that skb != NULL, its a fact :)
>
> -> macvlan_count_rx(vlan, skb->len + ETH_HLEN, true, false);

I don't understand. Note how I removed the check for NULL above and
the skb pointer may be the result of a failing skb_clone().

Looking at this again, I actually introduced a bug by calling netif_rx
on a possibly NULL skb, I'll fix that.

Thanks!

Arnd <><

2009-11-24 09:29:21

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 2/4] macvlan: cleanup rx statistics

On Tuesday 24 November 2009 08:45:14 Arnd Bergmann wrote:
> On Tuesday 24 November 2009 08:15:53 Eric Dumazet wrote:
> > Arnd Bergmann a ?crit :
> > I find following more readable, it probably generates more branches,
> > but avoid dirtying rx_errors if it is in another cache line.
> >
> > if (likely(success)) {
> > rx_stats->rx_packets++;
> > rx_stats->rx_bytes += length;
> > if (multicast)
> > rx_stats->multicast++;
> > } else {
> > rx_stats->rx_errors++;
> > }
>
> Given that the structure only has four members and alloc_percpu requests
> cache aligned data, it is rather likely to be in the same cache line.
>
> I'll have a look at what gcc generates on x86-64 for both versions
> and use the version you suggested unless it looks significantly more
> expensive.

Ok, that's what I got for trying to be clever. My version did not avoid
any branches, just created more code. I'll fold this update into my
patches then:
---
macvlan: cleanups

Use bool instead of int for flags, don't misoptimize rx counters,
avoid accessing a NULL skb.

Signed-off-by: Arnd Bergmann <[email protected]>

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 3db96b9..ff5f0b0 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -119,19 +119,23 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
}

static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
- int success, int multicast)
+ bool success, bool multicast)
{
struct macvlan_rx_stats *rx_stats;

rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
- rx_stats->rx_packets += success != 0;
- rx_stats->rx_bytes += success ? length : 0;
- rx_stats->multicast += success && multicast;
- rx_stats->rx_errors += !success;
+ if (likely(success)) {
+ rx_stats->rx_packets++;;
+ rx_stats->rx_bytes += length;
+ if (multicast)
+ rx_stats->multicast++;
+ } else {
+ rx_stats->rx_errors++;
+ }
}

static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
- const struct ethhdr *eth, int local)
+ const struct ethhdr *eth, bool local)
{
if (!skb)
return NET_RX_DROP;
@@ -173,7 +177,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
err = macvlan_broadcast_one(nskb, vlan->dev, eth,
mode == MACVLAN_MODE_BRIDGE);
macvlan_count_rx(vlan, skb->len + ETH_HLEN,
- likely(err == NET_RX_SUCCESS), 1);
+ err == NET_RX_SUCCESS, 1);
}
}
}
@@ -186,6 +190,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct macvlan_dev *vlan;
const struct macvlan_dev *src;
struct net_device *dev;
+ int len;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
@@ -218,8 +223,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
kfree_skb(skb);
return NULL;
}
+ len = skb->len + ETH_HLEN;
skb = skb_share_check(skb, GFP_ATOMIC);
- macvlan_count_rx(vlan, skb->len + ETH_HLEN, likely(skb != NULL), 0);
+ macvlan_count_rx(vlan, len, skb != NULL, 0);
+ if (!skb)
+ return NULL;

skb->dev = dev;
skb->pkt_type = PACKET_HOST;
@@ -248,7 +256,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
int length = skb->len + ETH_HLEN;
int ret = dev_forward_skb(dest->dev, skb);
macvlan_count_rx(dest, length,
- likely(ret == NET_RX_SUCCESS), 0);
+ ret == NET_RX_SUCCESS, 0);

return NET_XMIT_SUCCESS;
}

2009-11-24 09:51:18

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Arnd Bergmann wrote:
> +int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> +{
> + skb_orphan(skb);
> +
> + if (!(dev->flags & IFF_UP))
> + return NET_RX_DROP;
> +
> + if (skb->len > (dev->mtu + dev->hard_header_len))
> + return NET_RX_DROP;
> +
> + skb_dst_drop(skb);
> + skb->tstamp.tv64 = 0;
> + skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, dev);
> + skb->mark = 0;

skb->mark clearing should stay private to veth since its usually
supposed to stay intact. The only exception is packets crossing
namespaces, where they should appear like a freshly received skbs.

> + secpath_reset(skb);
> + nf_reset(skb);
> + return netif_rx(skb);
> +}
> +EXPORT_SYMBOL_GPL(dev_forward_skb);

2009-11-24 10:03:50

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Tuesday 24 November 2009 09:51:19 Patrick McHardy wrote:
> > + skb_dst_drop(skb);
> > + skb->tstamp.tv64 = 0;
> > + skb->pkt_type = PACKET_HOST;
> > + skb->protocol = eth_type_trans(skb, dev);
> > + skb->mark = 0;
>
> skb->mark clearing should stay private to veth since its usually
> supposed to stay intact. The only exception is packets crossing
> namespaces, where they should appear like a freshly received skbs.

But isn't that what we want in macvlan as well when we're
forwarding from one downstream interface to another?

I did all my testing with macvlan interfaces in separate namespaces
communicating with each other, so I'd assume that we should always
clear skb->mark and skb->dst in this function. Maybe I should make
the documentation clearer?

---
net: clarify documentation of dev_forward_skb

Signed-off-by: Arnd Bergmann <[email protected]>

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1433,6 +1433,10 @@ static inline void net_timestamp(struct sk_buff *skb)
* dev_forward_skb can be used for injecting an skb from the
* start_xmit function of one device into the receive queue
* of another device.
+ *
+ * The receiving device may be in another namespace, so
+ * we have to clear all information in the skb that could
+ * impact namespace isolation.
*/
int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{

2009-11-24 10:17:09

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Arnd Bergmann wrote:
> On Tuesday 24 November 2009 09:51:19 Patrick McHardy wrote:
>>> + skb_dst_drop(skb);
>>> + skb->tstamp.tv64 = 0;
>>> + skb->pkt_type = PACKET_HOST;
>>> + skb->protocol = eth_type_trans(skb, dev);
>>> + skb->mark = 0;
>> skb->mark clearing should stay private to veth since its usually
>> supposed to stay intact. The only exception is packets crossing
>> namespaces, where they should appear like a freshly received skbs.
>
> But isn't that what we want in macvlan as well when we're
> forwarding from one downstream interface to another?

In the TX direction you can use the mark for TC classification
on the underlying device.

> I did all my testing with macvlan interfaces in separate namespaces
> communicating with each other, so I'd assume that we should always
> clear skb->mark and skb->dst in this function.

Good point, in that case we probably should clear it as well. But
in the non-namespace case the TC classification currently works and
this is consistent with any other virtual device driver, so it
should continue to work.

2009-11-24 10:34:58

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Tuesday 24 November 2009 10:17:11 Patrick McHardy wrote:
> Arnd Bergmann wrote:
> > On Tuesday 24 November 2009 09:51:19 Patrick McHardy wrote:
> >>> + skb_dst_drop(skb);
> >>> + skb->tstamp.tv64 = 0;
> >>> + skb->pkt_type = PACKET_HOST;
> >>> + skb->protocol = eth_type_trans(skb, dev);
> >>> + skb->mark = 0;
> >> skb->mark clearing should stay private to veth since its usually
> >> supposed to stay intact. The only exception is packets crossing
> >> namespaces, where they should appear like a freshly received skbs.
> >
> > But isn't that what we want in macvlan as well when we're
> > forwarding from one downstream interface to another?
>
> In the TX direction you can use the mark for TC classification
> on the underlying device.

I don't use dev_forward_skb for the case where the data is sent
to the underlying device, so the TC classification should stay
intact.

> > I did all my testing with macvlan interfaces in separate namespaces
> > communicating with each other, so I'd assume that we should always
> > clear skb->mark and skb->dst in this function.
>
> Good point, in that case we probably should clear it as well. But
> in the non-namespace case the TC classification currently works and
> this is consistent with any other virtual device driver, so it
> should continue to work.

Do you think we should be able to use TC to direct traffic between
macvlans on the same underlying device in bridge mode? It does sound
useful, but I'm not sure how to implement that or if you'd expect
it to work with the current code. If we support that, it should probably
also work with namespaces, by consuming the mark in the macvlan
and veth drivers.

Arnd <><

2009-11-24 10:40:22

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Arnd Bergmann wrote:
> On Tuesday 24 November 2009 10:17:11 Patrick McHardy wrote:
>> Arnd Bergmann wrote:
>>> On Tuesday 24 November 2009 09:51:19 Patrick McHardy wrote:
>>>>> + skb_dst_drop(skb);
>>>>> + skb->tstamp.tv64 = 0;
>>>>> + skb->pkt_type = PACKET_HOST;
>>>>> + skb->protocol = eth_type_trans(skb, dev);
>>>>> + skb->mark = 0;
>>>> skb->mark clearing should stay private to veth since its usually
>>>> supposed to stay intact. The only exception is packets crossing
>>>> namespaces, where they should appear like a freshly received skbs.
>>> But isn't that what we want in macvlan as well when we're
>>> forwarding from one downstream interface to another?
>> In the TX direction you can use the mark for TC classification
>> on the underlying device.
>
> I don't use dev_forward_skb for the case where the data is sent
> to the underlying device, so the TC classification should stay
> intact.

Right, I see. This looks fine.

>>> I did all my testing with macvlan interfaces in separate namespaces
>>> communicating with each other, so I'd assume that we should always
>>> clear skb->mark and skb->dst in this function.
>> Good point, in that case we probably should clear it as well. But
>> in the non-namespace case the TC classification currently works and
>> this is consistent with any other virtual device driver, so it
>> should continue to work.
>
> Do you think we should be able to use TC to direct traffic between
> macvlans on the same underlying device in bridge mode? It does sound
> useful, but I'm not sure how to implement that or if you'd expect
> it to work with the current code. If we support that, it should probably
> also work with namespaces, by consuming the mark in the macvlan
> and veth drivers.

I don't think its necessary, we bypass outgoing queuing anyways.
But if you'd want to add it, just keeping the skb->mark clearing
in veth should work from what I can tell.

2009-11-24 10:41:35

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 2/4] macvlan: cleanup rx statistics

Arnd Bergmann wrote:
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index ae2b5c7..a0dea23 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -116,42 +116,53 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
> return 0;
> }
>
> +static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,

Please use unsigned int for length values.

Regarding Eric's comments, I also think it would be more readable to use
if (success) {
...
} else {
...
}

> + int success, int multicast)
> +{
> + struct macvlan_rx_stats *rx_stats;
> +
> + rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> + rx_stats->rx_packets += success != 0;
> + rx_stats->rx_bytes += success ? length : 0;
> + rx_stats->multicast += success && multicast;
> + rx_stats->rx_errors += !success;
> +}
> +
> +static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
> + const struct ethhdr *eth)
> +{
> + if (!skb)
> + return NET_RX_DROP;
> +
> + skb->dev = dev;
> + if (!compare_ether_addr_64bits(eth->h_dest,
> + dev->broadcast))

This would fit on one line without reducing readability.

> + skb->pkt_type = PACKET_BROADCAST;
> + else
> + skb->pkt_type = PACKET_MULTICAST;
> +
> + return netif_rx(skb);
> +}

2009-11-24 10:42:37

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 3/4] macvlan: implement bridge, VEPA and private mode

Arnd Bergmann wrote:
> This allows each macvlan slave device to be in one
> of three modes, depending on the use case:
>
> MACVLAN_PRIVATE:
> The device never communicates with any other device
> on the same upper_dev. This even includes frames
> coming back from a reflective relay, where supported
> by the adjacent bridge.
>
> MACVLAN_VEPA:
> The new Virtual Ethernet Port Aggregator (VEPA) mode,
> we assume that the adjacent bridge returns all frames
> where both source and destination are local to the
> macvlan port, i.e. the bridge is set up as a reflective
> relay.
> Broadcast frames coming in from the upper_dev get
> flooded to all macvlan interfaces in VEPA mode.
> We never deliver any frames locally.
>
> MACVLAN_BRIDGE:
> We provide the behavior of a simple bridge between
> different macvlan interfaces on the same port. Frames
> from one interface to another one get delivered directly
> and are not sent out externally. Broadcast frames get
> flooded to all other bridge ports and to the external
> interface, but when they come back from a reflective
> relay, we don't deliver them again.
> Since we know all the MAC addresses, the macvlan bridge
> mode does not require learning or STP like the bridge
> module does.

This looks pretty nice. Some stylistic nitpicking below :)

> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index a0dea23..b840b3a 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -29,9 +29,16 @@
> #include <linux/if_link.h>
> #include <linux/if_macvlan.h>
> #include <net/rtnetlink.h>
> +#include <net/xfrm.h>

Do we really need this?

> @@ -129,11 +137,14 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
> }
>
> static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
> - const struct ethhdr *eth)
> + const struct ethhdr *eth, int local)

bool local?

> {
> if (!skb)
> return NET_RX_DROP;
>
> + if (local)
> + return dev_forward_skb(dev, skb);
> +
> skb->dev = dev;
> if (!compare_ether_addr_64bits(eth->h_dest,
> dev->broadcast))
> @@ -145,7 +156,9 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
> }
>
> static void macvlan_broadcast(struct sk_buff *skb,
> - const struct macvlan_port *port)
> + const struct macvlan_port *port,
> + struct net_device *src,
> + enum macvlan_mode mode)
> {
> const struct ethhdr *eth = eth_hdr(skb);
> const struct macvlan_dev *vlan;
> @@ -159,8 +172,12 @@ static void macvlan_broadcast(struct sk_buff *skb,
>
> for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
> hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
> + if ((vlan->dev == src) || !(vlan->mode & mode))

Please remove those unnecessary parentheses around the
device comparison.

> @@ -173,6 +190,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> const struct ethhdr *eth = eth_hdr(skb);
> const struct macvlan_port *port;
> const struct macvlan_dev *vlan;
> + const struct macvlan_dev *src;
> struct net_device *dev;
>
> port = rcu_dereference(skb->dev->macvlan_port);
> @@ -180,7 +198,20 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> return skb;
>
> if (is_multicast_ether_addr(eth->h_dest)) {
> - macvlan_broadcast(skb, port);
> + src = macvlan_hash_lookup(port, eth->h_source);
> + if (!src)
> + /* frame comes from an external address */
> + macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE
> + | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);

The '|' should go on the first line.

> + else if (src->mode == MACVLAN_MODE_VEPA)
> + /* flood to everyone except source */
> + macvlan_broadcast(skb, port, src->dev,
> + MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
> + else if (src->mode == MACVLAN_MODE_BRIDGE)
> + /* flood only to VEPA ports, bridge ports
> + already saw the frame */

Multi-line comments should begin every line with '*'.

> + macvlan_broadcast(skb, port, src->dev,
> + MACVLAN_MODE_VEPA);

Please align the mode values (in all cases above) to the arguments
on the line above.

> return skb;
> }
>
> @@ -203,18 +234,46 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> return NULL;
> }
>
> +static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + const struct macvlan_dev *vlan = netdev_priv(dev);
> + const struct macvlan_port *port = vlan->port;
> + const struct macvlan_dev *dest;
> +
> + if (vlan->mode == MACVLAN_MODE_BRIDGE) {
> + const struct ethhdr *eth = (void *)skb->data;

eth_hdr()?

> +
> + /* send to other bridge ports directly */
> + if (is_multicast_ether_addr(eth->h_dest)) {
> + macvlan_broadcast(skb, port, dev, MACVLAN_MODE_BRIDGE);
> + goto xmit_world;
> + }
> +
> + dest = macvlan_hash_lookup(port, eth->h_dest);
> + if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
> + int length = skb->len + ETH_HLEN;

unsigned int for length values please.

> + int ret = dev_forward_skb(dest->dev, skb);
> + macvlan_count_rx(dest, length,
> + likely(ret == NET_RX_SUCCESS), 0);
> +
> + return NET_XMIT_SUCCESS;
> + }
> + }
> +
> +xmit_world:
> + skb->dev = vlan->lowerdev;
> + return dev_queue_xmit(skb);
> +}

2009-11-24 10:53:32

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 4/4] macvlan: export macvlan mode through netlink

Arnd Bergmann wrote:
> @@ -600,6 +594,18 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
> if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> return -EADDRNOTAVAIL;
> }
> +
> + if (data && data[IFLA_MACVLAN_MODE]) {
> + u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> + switch (mode) {
> + case MACVLAN_MODE_PRIVATE:
> + case MACVLAN_MODE_VEPA:
> + case MACVLAN_MODE_BRIDGE:
> + break;
> + default:
> + return -EINVAL;

EINVAL is quite unspecific. In this case I think EOPNOTSUPP would
be fine and provide more information.

> + }
> + }
> return 0;
> }
> @@ -664,6 +670,13 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
> vlan->dev = dev;
> vlan->port = port;
>
> + vlan->mode = MACVLAN_MODE_VEPA;
> + if (data && data[IFLA_MACVLAN_MODE]) {
> + u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> +
> + vlan->mode = mode;

This looks a bit strange, like cut-and-paste without reformatting :)
I'd suggest to simply use "vlan->mode = nla_get_u32(...)".

> + }
> +
> err = register_netdevice(dev);
> if (err < 0)
> return err;
> @@ -685,6 +698,39 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
> macvlan_port_destroy(port->dev);
> }
>
> +static int macvlan_changelink(struct net_device *dev,
> + struct nlattr *tb[], struct nlattr *data[])
> +{
> + struct macvlan_dev *vlan = netdev_priv(dev);
> + if (data && data[IFLA_MACVLAN_MODE]) {
> + u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> + vlan->mode = mode;

Same here.

> + }
> +
> + return 0;
> +}

2009-11-24 12:46:32

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 3/4] macvlan: implement bridge, VEPA and private mode

On Tuesday 24 November 2009, Patrick McHardy wrote:
> Arnd Bergmann wrote:

> > @@ -29,9 +29,16 @@
> > #include <linux/if_link.h>
> > #include <linux/if_macvlan.h>
> > #include <net/rtnetlink.h>
> > +#include <net/xfrm.h>
>
> Do we really need this?

Not any more. I originally did secpath_reset here, but that is now moved
to dev_forward_skb() in net/core/dev.c so I'll remove the include here.

> > @@ -129,11 +137,14 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
> > }
> >
> > static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
> > - const struct ethhdr *eth)
> > + const struct ethhdr *eth, int local)
>
> bool local?

Already changed as a reply to Eric's comments

> > @@ -159,8 +172,12 @@ static void macvlan_broadcast(struct sk_buff *skb,
> >
> > for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
> > hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
> > + if ((vlan->dev == src) || !(vlan->mode & mode))
>
> Please remove those unnecessary parentheses around the
> device comparison.

Ok. I usually prefer to have some extra parentheses in places that not
everyone finds obvious but these are pretty clear.

>
> The '|' should go on the first line.

>
> Multi-line comments should begin every line with '*'.

>
> Please align the mode values (in all cases above) to the arguments
> on the line above.

ok

> > +static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > + const struct macvlan_dev *vlan = netdev_priv(dev);
> > + const struct macvlan_port *port = vlan->port;
> > + const struct macvlan_dev *dest;
> > +
> > + if (vlan->mode == MACVLAN_MODE_BRIDGE) {
> > + const struct ethhdr *eth = (void *)skb->data;
>
> eth_hdr()?

This is done before calling eth_type_trans and skb->mac_header
appears to be uninitialized then, which gave me an instant kernel
panic here.

> > + dest = macvlan_hash_lookup(port, eth->h_dest);
> > + if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
> > + int length = skb->len + ETH_HLEN;
>
> unsigned int for length values please.

ok

Thanks for the detailed review,

Arnd <><

2009-11-24 12:58:13

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 4/4] macvlan: export macvlan mode through netlink

On Tuesday 24 November 2009, Patrick McHardy wrote:
> Arnd Bergmann wrote:
> > @@ -600,6 +594,18 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
> > if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
> > return -EADDRNOTAVAIL;
> > }
> > +
> > + if (data && data[IFLA_MACVLAN_MODE]) {
> > + u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> > + switch (mode) {
> > + case MACVLAN_MODE_PRIVATE:
> > + case MACVLAN_MODE_VEPA:
> > + case MACVLAN_MODE_BRIDGE:
> > + break;
> > + default:
> > + return -EINVAL;
>
> EINVAL is quite unspecific. In this case I think EOPNOTSUPP would
> be fine and provide more information.

ok

> > + }
> > + }
> > return 0;
> > }
> > @@ -664,6 +670,13 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
> > vlan->dev = dev;
> > vlan->port = port;
> >
> > + vlan->mode = MACVLAN_MODE_VEPA;
> > + if (data && data[IFLA_MACVLAN_MODE]) {
> > + u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> > +
> > + vlan->mode = mode;
>
> This looks a bit strange, like cut-and-paste without reformatting :)
> I'd suggest to simply use "vlan->mode = nla_get_u32(...)".

Yep.

Thanks for the review. Combined changes below.

Arnd <><
---

drivers/net/macvlan.c | 47 +++++++++++++++++++++++------------------------
1 files changed, 23 insertions(+), 24 deletions(-)

--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -118,15 +118,16 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
return 0;
}

-static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
- bool success, bool multicast)
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+ unsigned int len, bool success,
+ bool multicast)
{
struct macvlan_rx_stats *rx_stats;

rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
if (likely(success)) {
rx_stats->rx_packets++;;
- rx_stats->rx_bytes += length;
+ rx_stats->rx_bytes += len;
if (multicast)
rx_stats->multicast++;
} else {
@@ -170,7 +171,7 @@ static void macvlan_broadcast(struct sk_buff *skb,

for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
- if ((vlan->dev == src) || !(vlan->mode & mode))
+ if (vlan->dev == src || !(vlan->mode & mode))
continue;

nskb = skb_clone(skb, GFP_ATOMIC);
@@ -190,7 +191,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
const struct macvlan_dev *vlan;
const struct macvlan_dev *src;
struct net_device *dev;
- int len;
+ unsigned int len;

port = rcu_dereference(skb->dev->macvlan_port);
if (port == NULL)
@@ -200,17 +201,22 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
src = macvlan_hash_lookup(port, eth->h_source);
if (!src)
/* frame comes from an external address */
- macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE
- | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ macvlan_broadcast(skb, port, NULL,
+ MACVLAN_MODE_PRIVATE |
+ MACVLAN_MODE_VEPA |
+ MACVLAN_MODE_BRIDGE);
else if (src->mode == MACVLAN_MODE_VEPA)
/* flood to everyone except source */
macvlan_broadcast(skb, port, src->dev,
- MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
+ MACVLAN_MODE_VEPA |
+ MACVLAN_MODE_BRIDGE);
else if (src->mode == MACVLAN_MODE_BRIDGE)
- /* flood only to VEPA ports, bridge ports
- already saw the frame */
+ /*
+ * flood only to VEPA ports, bridge ports
+ * already saw the frame on the way out.
+ */
macvlan_broadcast(skb, port, src->dev,
- MACVLAN_MODE_VEPA);
+ MACVLAN_MODE_VEPA);
return skb;
}

@@ -253,7 +259,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)

dest = macvlan_hash_lookup(port, eth->h_dest);
if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
- int length = skb->len + ETH_HLEN;
+ unsigned int length = skb->len + ETH_HLEN;
int ret = dev_forward_skb(dest->dev, skb);
macvlan_count_rx(dest, length,
ret == NET_RX_SUCCESS, 0);
@@ -604,8 +610,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
}

if (data && data[IFLA_MACVLAN_MODE]) {
- u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
- switch (mode) {
+ switch (nla_get_u32(data[IFLA_MACVLAN_MODE])) {
case MACVLAN_MODE_PRIVATE:
case MACVLAN_MODE_VEPA:
case MACVLAN_MODE_BRIDGE:
@@ -679,11 +684,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
vlan->port = port;

vlan->mode = MACVLAN_MODE_VEPA;
- if (data && data[IFLA_MACVLAN_MODE]) {
- u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
-
- vlan->mode = mode;
- }
+ if (data && data[IFLA_MACVLAN_MODE])
+ vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);

err = register_netdevice(dev);
if (err < 0)
@@ -710,11 +712,8 @@ static int macvlan_changelink(struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
{
struct macvlan_dev *vlan = netdev_priv(dev);
- if (data && data[IFLA_MACVLAN_MODE]) {
- u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
- vlan->mode = mode;
- }
-
+ if (data && data[IFLA_MACVLAN_MODE])
+ vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
return 0;
}

2009-11-24 13:15:14

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Tuesday 24 November 2009, Patrick McHardy wrote:
> I don't think its necessary, we bypass outgoing queuing anyways.
> But if you'd want to add it, just keeping the skb->mark clearing
> in veth should work from what I can tell.

Ok, I won't bother with it for now then.

Arnd <><

2009-11-24 13:47:44

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 4/4] macvlan: export macvlan mode through netlink

Arnd Bergmann wrote:
> Thanks for the review. Combined changes below.

Looks good, thanks.

>
> Arnd <><
> ---
>
> drivers/net/macvlan.c | 47 +++++++++++++++++++++++------------------------
> 1 files changed, 23 insertions(+), 24 deletions(-)
>
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -118,15 +118,16 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
> return 0;
> }
>
> -static inline void macvlan_count_rx(const struct macvlan_dev *vlan, int length,
> - bool success, bool multicast)
> +static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
> + unsigned int len, bool success,
> + bool multicast)
> {
> struct macvlan_rx_stats *rx_stats;
>
> rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
> if (likely(success)) {
> rx_stats->rx_packets++;;
> - rx_stats->rx_bytes += length;
> + rx_stats->rx_bytes += len;
> if (multicast)
> rx_stats->multicast++;
> } else {
> @@ -170,7 +171,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
>
> for (i = 0; i < MACVLAN_HASH_SIZE; i++) {
> hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) {
> - if ((vlan->dev == src) || !(vlan->mode & mode))
> + if (vlan->dev == src || !(vlan->mode & mode))
> continue;
>
> nskb = skb_clone(skb, GFP_ATOMIC);
> @@ -190,7 +191,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> const struct macvlan_dev *vlan;
> const struct macvlan_dev *src;
> struct net_device *dev;
> - int len;
> + unsigned int len;
>
> port = rcu_dereference(skb->dev->macvlan_port);
> if (port == NULL)
> @@ -200,17 +201,22 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> src = macvlan_hash_lookup(port, eth->h_source);
> if (!src)
> /* frame comes from an external address */
> - macvlan_broadcast(skb, port, NULL, MACVLAN_MODE_PRIVATE
> - | MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
> + macvlan_broadcast(skb, port, NULL,
> + MACVLAN_MODE_PRIVATE |
> + MACVLAN_MODE_VEPA |
> + MACVLAN_MODE_BRIDGE);
> else if (src->mode == MACVLAN_MODE_VEPA)
> /* flood to everyone except source */
> macvlan_broadcast(skb, port, src->dev,
> - MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE);
> + MACVLAN_MODE_VEPA |
> + MACVLAN_MODE_BRIDGE);
> else if (src->mode == MACVLAN_MODE_BRIDGE)
> - /* flood only to VEPA ports, bridge ports
> - already saw the frame */
> + /*
> + * flood only to VEPA ports, bridge ports
> + * already saw the frame on the way out.
> + */
> macvlan_broadcast(skb, port, src->dev,
> - MACVLAN_MODE_VEPA);
> + MACVLAN_MODE_VEPA);
> return skb;
> }
>
> @@ -253,7 +259,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
>
> dest = macvlan_hash_lookup(port, eth->h_dest);
> if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
> - int length = skb->len + ETH_HLEN;
> + unsigned int length = skb->len + ETH_HLEN;
> int ret = dev_forward_skb(dest->dev, skb);
> macvlan_count_rx(dest, length,
> ret == NET_RX_SUCCESS, 0);
> @@ -604,8 +610,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
> }
>
> if (data && data[IFLA_MACVLAN_MODE]) {
> - u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> - switch (mode) {
> + switch (nla_get_u32(data[IFLA_MACVLAN_MODE])) {
> case MACVLAN_MODE_PRIVATE:
> case MACVLAN_MODE_VEPA:
> case MACVLAN_MODE_BRIDGE:
> @@ -679,11 +684,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
> vlan->port = port;
>
> vlan->mode = MACVLAN_MODE_VEPA;
> - if (data && data[IFLA_MACVLAN_MODE]) {
> - u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> -
> - vlan->mode = mode;
> - }
> + if (data && data[IFLA_MACVLAN_MODE])
> + vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
>
> err = register_netdevice(dev);
> if (err < 0)
> @@ -710,11 +712,8 @@ static int macvlan_changelink(struct net_device *dev,
> struct nlattr *tb[], struct nlattr *data[])
> {
> struct macvlan_dev *vlan = netdev_priv(dev);
> - if (data && data[IFLA_MACVLAN_MODE]) {
> - u32 mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> - vlan->mode = mode;
> - }
> -
> + if (data && data[IFLA_MACVLAN_MODE])
> + vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
> return 0;
> }
>
>

2009-11-24 16:42:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Patrick McHardy <[email protected]> writes:

>>>> I did all my testing with macvlan interfaces in separate namespaces
>>>> communicating with each other, so I'd assume that we should always
>>>> clear skb->mark and skb->dst in this function.
>>> Good point, in that case we probably should clear it as well. But
>>> in the non-namespace case the TC classification currently works and
>>> this is consistent with any other virtual device driver, so it
>>> should continue to work.
>>
>> Do you think we should be able to use TC to direct traffic between
>> macvlans on the same underlying device in bridge mode? It does sound
>> useful, but I'm not sure how to implement that or if you'd expect
>> it to work with the current code. If we support that, it should probably
>> also work with namespaces, by consuming the mark in the macvlan
>> and veth drivers.
>
> I don't think its necessary, we bypass outgoing queuing anyways.
> But if you'd want to add it, just keeping the skb->mark clearing
> in veth should work from what I can tell.

veth doesn't have an outgoing queue. The reason we clear skb->mark
in veth is because when reentering the networking stack the packet
needs to be reclassified. At the point of loopback we are talking
a packet that has at least logically gone out of the machine on a
wire and come back into the machine on another physical interface.

So it seems to me we should have consistent handling for macvlans,
veth, for the cases where we are looping packets back around. In
practice I expect all of those cases are going to be cross namespace
as otherwise we would have intercepted the packet before going
out a physical interface.

Eric



2009-11-24 16:56:18

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Eric W. Biederman wrote:
> Patrick McHardy <[email protected]> writes:
>
>>>>> I did all my testing with macvlan interfaces in separate namespaces
>>>>> communicating with each other, so I'd assume that we should always
>>>>> clear skb->mark and skb->dst in this function.
>>>> Good point, in that case we probably should clear it as well. But
>>>> in the non-namespace case the TC classification currently works and
>>>> this is consistent with any other virtual device driver, so it
>>>> should continue to work.
>>> Do you think we should be able to use TC to direct traffic between
>>> macvlans on the same underlying device in bridge mode? It does sound
>>> useful, but I'm not sure how to implement that or if you'd expect
>>> it to work with the current code. If we support that, it should probably
>>> also work with namespaces, by consuming the mark in the macvlan
>>> and veth drivers.
>> I don't think its necessary, we bypass outgoing queuing anyways.
>> But if you'd want to add it, just keeping the skb->mark clearing
>> in veth should work from what I can tell.
>
> veth doesn't have an outgoing queue. The reason we clear skb->mark
> in veth is because when reentering the networking stack the packet
> needs to be reclassified. At the point of loopback we are talking
> a packet that has at least logically gone out of the machine on a
> wire and come back into the machine on another physical interface.
>
> So it seems to me we should have consistent handling for macvlans,
> veth, for the cases where we are looping packets back around. In
> practice I expect all of those cases are going to be cross namespace
> as otherwise we would have intercepted the packet before going
> out a physical interface.

Agreed on the looping case, that's what we're doing now.

In the layered case (macvlan -> eth0) its common behaviour to
keep the mark however. But in case of different namespaces,
I think macvlan should also clear the mark on the dev_queue_xmit()
path since this is just a shortcut to looping the packets
through veth. In fact probably both of them should also clear
skb->priority so other namespaces don't accidentally misclassify
packets.

2009-11-24 18:10:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Patrick McHardy <[email protected]> writes:

> Eric W. Biederman wrote:
>> Patrick McHardy <[email protected]> writes:
>>
>>>>>> I did all my testing with macvlan interfaces in separate namespaces
>>>>>> communicating with each other, so I'd assume that we should always
>>>>>> clear skb->mark and skb->dst in this function.
>>>>> Good point, in that case we probably should clear it as well. But
>>>>> in the non-namespace case the TC classification currently works and
>>>>> this is consistent with any other virtual device driver, so it
>>>>> should continue to work.
>>>> Do you think we should be able to use TC to direct traffic between
>>>> macvlans on the same underlying device in bridge mode? It does sound
>>>> useful, but I'm not sure how to implement that or if you'd expect
>>>> it to work with the current code. If we support that, it should probably
>>>> also work with namespaces, by consuming the mark in the macvlan
>>>> and veth drivers.
>>> I don't think its necessary, we bypass outgoing queuing anyways.
>>> But if you'd want to add it, just keeping the skb->mark clearing
>>> in veth should work from what I can tell.
>>
>> veth doesn't have an outgoing queue. The reason we clear skb->mark
>> in veth is because when reentering the networking stack the packet
>> needs to be reclassified. At the point of loopback we are talking
>> a packet that has at least logically gone out of the machine on a
>> wire and come back into the machine on another physical interface.
>>
>> So it seems to me we should have consistent handling for macvlans,
>> veth, for the cases where we are looping packets back around. In
>> practice I expect all of those cases are going to be cross namespace
>> as otherwise we would have intercepted the packet before going
>> out a physical interface.
>
> Agreed on the looping case, that's what we're doing now.
>
> In the layered case (macvlan -> eth0) its common behaviour to
> keep the mark however. But in case of different namespaces,
> I think macvlan should also clear the mark on the dev_queue_xmit()
> path since this is just a shortcut to looping the packets
> through veth. In fact probably both of them should also clear
> skb->priority so other namespaces don't accidentally misclassify
> packets.

That is why I pushed for what is becoming dev_forward_skb. So that
we have one place where we can make all of those tweaks. It seems
like in every review we find another field that should be cleared/handled
specially.

I don't quite follow what you intend with dev_queue_xmit when the macvlan
is in one namespace and the real physical device is in another. Are
you mentioning that the packet classifier runs in the namespace where
the primary device lives with packets from a different namespace?

Eric

2009-11-24 18:30:24

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Tuesday 24 November 2009, Eric W. Biederman wrote:
> I don't quite follow what you intend with dev_queue_xmit when the macvlan
> is in one namespace and the real physical device is in another. Are
> you mentioning that the packet classifier runs in the namespace where
> the primary device lives with packets from a different namespace?

I treat internal and external delivery very differently, the three
cases are:

1. skb from real device to macvlan (macvlan_handle_frame): basically
unchanged from before, except avoiding duplicate broadcasts. All
skbs end up in netif_rx(vlan->dev) without clearing any data.
We catch the frame in netif_receive_skb before it interacts with the
namespace of the real device.

2. skb to external device (macvlan_start_xmit): if the destination
is external, we just end up in dev_queue_xmit, with skb->dev set to
the external device but no other changes. The data is already on the
way out at this stage, so the namespace should not matter any more.

3. internal delivery: an skb from one macvlan to another gets always
sent through dev_forward_skb, which is supposed to clear anything
that must not leave the namespace.

Does this make sense?

Arnd <><

2009-11-24 18:38:25

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Eric W. Biederman wrote:
> Patrick McHardy <[email protected]> writes:
>
>> In the layered case (macvlan -> eth0) its common behaviour to
>> keep the mark however. But in case of different namespaces,
>> I think macvlan should also clear the mark on the dev_queue_xmit()
>> path since this is just a shortcut to looping the packets
>> through veth. In fact probably both of them should also clear
>> skb->priority so other namespaces don't accidentally misclassify
>> packets.
>
> That is why I pushed for what is becoming dev_forward_skb. So that
> we have one place where we can make all of those tweaks. It seems
> like in every review we find another field that should be cleared/handled
> specially.
>
> I don't quite follow what you intend with dev_queue_xmit when the macvlan
> is in one namespace and the real physical device is in another. Are
> you mentioning that the packet classifier runs in the namespace where
> the primary device lives with packets from a different namespace?

Exactly. And I think we should make sure that the namespace of
the macvlan device can't (deliberately or accidentally) cause
misclassification.

2009-11-26 15:22:56

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Tuesday 24 November 2009, Patrick McHardy wrote:
> Eric W. Biederman wrote:
> > I don't quite follow what you intend with dev_queue_xmit when the macvlan
> > is in one namespace and the real physical device is in another. Are
> > you mentioning that the packet classifier runs in the namespace where
> > the primary device lives with packets from a different namespace?
>
> Exactly. And I think we should make sure that the namespace of
> the macvlan device can't (deliberately or accidentally) cause
> misclassification.

This is independent of my series and a preexisting problem, right?

Which fields do you think need to be reset to maintain namespace
isolation for the outbound path in macvlan?
---
net: maintain namespace isolation between vlan and real device

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <[email protected]>

---
drivers/net/macvlan.c | 2 +-
include/linux/netdevice.h | 9 +++++++++
net/8021q/vlan_dev.c | 2 +-
net/core/dev.c | 26 ++++++++++++++++++++++----
4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 322112c..edcebf1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
}

xmit_world:
- skb->dev = vlan->lowerdev;
+ skb_set_dev(skb, vlan->lowerdev);
return dev_queue_xmit(skb);
}

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9428793..fdf4a1a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1009,6 +1009,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
return 0;
}

+#ifdef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+ skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
static inline bool netdev_uses_trailer_tags(struct net_device *dev)
{
#ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index de0dc6b..51fcfff 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -323,7 +323,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
}


- skb->dev = vlan_dev_info(dev)->real_dev;
+ skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
len = skb->len;
ret = dev_queue_xmit(skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index f8baa15..7179b58 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
if (skb->len > (dev->mtu + dev->hard_header_len))
return NET_RX_DROP;

- skb_dst_drop(skb);
+ skb_set_dev(skb, dev);
skb->tstamp.tv64 = 0;
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, dev);
- skb->mark = 0;
- secpath_reset(skb);
- nf_reset(skb);
return netif_rx(skb);
}
EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,27 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
return false;
}

+/**
+ * skb_dev_set -- assign a buffer to a new device
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+ if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+ secpath_reset(skb);
+ skb_dst_drop(skb);
+ nf_reset(skb);
+ skb->mark = 0;
+ }
+ skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+
/*
* Invalidate hardware checksum when packet is to be mangled, and
* complete checksum manually on outgoing path.

2009-11-26 15:33:36

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Arnd Bergmann wrote:
> On Tuesday 24 November 2009, Patrick McHardy wrote:
>> Eric W. Biederman wrote:
>>> I don't quite follow what you intend with dev_queue_xmit when the macvlan
>>> is in one namespace and the real physical device is in another. Are
>>> you mentioning that the packet classifier runs in the namespace where
>>> the primary device lives with packets from a different namespace?
>> Exactly. And I think we should make sure that the namespace of
>> the macvlan device can't (deliberately or accidentally) cause
>> misclassification.
>
> This is independent of my series and a preexisting problem, right?

Correct.

> Which fields do you think need to be reset to maintain namespace
> isolation for the outbound path in macvlan?

In addition to those already handled, I'd say

- priority: affects qdisc classification, may refer to classes of the
old namespace
- ipvs_property: might cause packets to incorrectly skip netfilter hooks
- nf_trace: might trigger packet tracing
- nf_bridge: contains references to network devices in the old NS,
also indicates packet was bridged
- iif: index is only valid in the originating namespace
- tc_index: classification result, should only be set in the namespace
of the classifier
- tc_verd: RTTL etc. should begin at zero again
- probably secmark.

2009-11-26 16:39:02

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Patrick McHardy <[email protected]> writes:

> Arnd Bergmann wrote:
>> On Tuesday 24 November 2009, Patrick McHardy wrote:
>>> Eric W. Biederman wrote:
>>>> I don't quite follow what you intend with dev_queue_xmit when the macvlan
>>>> is in one namespace and the real physical device is in another. Are
>>>> you mentioning that the packet classifier runs in the namespace where
>>>> the primary device lives with packets from a different namespace?
>>> Exactly. And I think we should make sure that the namespace of
>>> the macvlan device can't (deliberately or accidentally) cause
>>> misclassification.
>>
>> This is independent of my series and a preexisting problem, right?
>
> Correct.
>
>> Which fields do you think need to be reset to maintain namespace
>> isolation for the outbound path in macvlan?
>
> In addition to those already handled, I'd say
>
> - priority: affects qdisc classification, may refer to classes of the
> old namespace
> - ipvs_property: might cause packets to incorrectly skip netfilter hooks
> - nf_trace: might trigger packet tracing
> - nf_bridge: contains references to network devices in the old NS,
> also indicates packet was bridged
> - iif: index is only valid in the originating namespace
> - tc_index: classification result, should only be set in the namespace
> of the classifier
> - tc_verd: RTTL etc. should begin at zero again
> - probably secmark.

Wow. I thought we were trying to reduce skbuff, where did all of those
fields come from? Regarless that sounds like a good list to get stomped.

Eric

2009-11-26 17:45:53

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

On Thursday 26 November 2009, Patrick McHardy wrote:
> In addition to those already handled, I'd say
>
> - priority: affects qdisc classification, may refer to classes of the
> old namespace
> - ipvs_property: might cause packets to incorrectly skip netfilter hooks
> - nf_trace: might trigger packet tracing
> - nf_bridge: contains references to network devices in the old NS,
> also indicates packet was bridged
> - iif: index is only valid in the originating namespace
> - probably secmark.

ok

> - tc_index: classification result, should only be set in the namespace
> of the classifier
> - tc_verd: RTTL etc. should begin at zero again

Wouldn't that defeat the purpose of RTTL? If you create a loop
across two devices in different namespaces, it may no longer get
detected. Or is that a different problem again?

Arnd <><

---

net: maintain namespace isolation between vlan and real device

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Still needs testing, don't apply

Signed-off-by: Arnd Bergmann <[email protected]>
---
drivers/net/macvlan.c | 2 +-
include/linux/netdevice.h | 9 +++++++++
net/8021q/vlan_dev.c | 2 +-
net/core/dev.c | 37 +++++++++++++++++++++++++++++++++----
4 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 322112c..edcebf1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
}

xmit_world:
- skb->dev = vlan->lowerdev;
+ skb_set_dev(skb, vlan->lowerdev);
return dev_queue_xmit(skb);
}

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9428793..fdf4a1a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1009,6 +1009,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
return 0;
}

+#ifdef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+ skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
static inline bool netdev_uses_trailer_tags(struct net_device *dev)
{
#ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index de0dc6b..51fcfff 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -323,7 +323,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
}


- skb->dev = vlan_dev_info(dev)->real_dev;
+ skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
len = skb->len;
ret = dev_queue_xmit(skb);

diff --git a/net/core/dev.c b/net/core/dev.c
index f8baa15..220d4e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
if (skb->len > (dev->mtu + dev->hard_header_len))
return NET_RX_DROP;

- skb_dst_drop(skb);
+ skb_set_dev(skb, dev);
skb->tstamp.tv64 = 0;
skb->pkt_type = PACKET_HOST;
skb->protocol = eth_type_trans(skb, dev);
- skb->mark = 0;
- secpath_reset(skb);
- nf_reset(skb);
return netif_rx(skb);
}
EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,39 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
return false;
}

+/**
+ * skb_dev_set -- assign a buffer to a new device
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+ if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+ secpath_reset(skb);
+ skb_dst_drop(skb);
+ nf_reset(skb);
+ skb_init_secmark(skb);
+ skb->mark = 0;
+ skb->priority = 0;
+ skb->nf_trace = 0;
+ skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+ skb->tc_index = 0;
+#ifdef CONFIG_NET_CLS_ACT
+ skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0);
+ skb->tc_verd = SET_TC_RTTL(skb->tc_verd, 0);
+#endif
+#endif
+ }
+ skb->dev = dev;
+ skb->skb_iif = skb->dev->ifindex;
+}
+EXPORT_SYMBOL(skb_set_dev);
+
/*
* Invalidate hardware checksum when packet is to be mangled, and
* complete checksum manually on outgoing path.

2009-11-26 21:14:22

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH 1/4] veth: move loopback logic to common location

Arnd Bergmann wrote:
> On Thursday 26 November 2009, Patrick McHardy wrote:
>> In addition to those already handled, I'd say
>>
>> - priority: affects qdisc classification, may refer to classes of the
>> old namespace
>> - ipvs_property: might cause packets to incorrectly skip netfilter hooks
>> - nf_trace: might trigger packet tracing
>> - nf_bridge: contains references to network devices in the old NS,
>> also indicates packet was bridged
>> - iif: index is only valid in the originating namespace
>> - probably secmark.
>
> ok
>
>> - tc_index: classification result, should only be set in the namespace
>> of the classifier
>> - tc_verd: RTTL etc. should begin at zero again
>
> Wouldn't that defeat the purpose of RTTL? If you create a loop
> across two devices in different namespaces, it may no longer get
> detected. Or is that a different problem again?

Mhh good point, that would indeed be possible. OTOH using ingress
filtering in one namespace currently might cause the packet to get
dropped in a different namespace because the ttl runs out. For now
I'd suggest to go the safe route and keep the TTL intact until we
can come up with something better.

> +void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
> +{
> + if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
> + secpath_reset(skb);
> + skb_dst_drop(skb);
> + nf_reset(skb);
> + skb_init_secmark(skb);
> + skb->mark = 0;
> + skb->priority = 0;
> + skb->nf_trace = 0;
> + skb->ipvs_property = 0;
> +#ifdef CONFIG_NET_SCHED
> + skb->tc_index = 0;
> +#ifdef CONFIG_NET_CLS_ACT
> + skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0);
> + skb->tc_verd = SET_TC_RTTL(skb->tc_verd, 0);
> +#endif
> +#endif

This makes we wonder which ones we actually should keep. Most of the
others get reinitialized anyways, so maybe its better to simply clear
the entire area up until ->tail like f.i. skb_recycle_check().

> + }
> + skb->dev = dev;
> + skb->skb_iif = skb->dev->ifindex;

This doesn't seem necessary, if the packet goes through
netif_receive_skb, it will be set anyways.

> +}
> +EXPORT_SYMBOL(skb_set_dev);