2015-07-30 18:13:22

by Joe Stringer

Subject: [PATCH net-next 0/9] OVS conntrack support

The goal of this series is to allow OVS to send packets through the Linux
kernel connection tracker, and subsequently match on fields populated by
conntrack.

This version includes new handling of IPv4 and IPv6 fragments, support for
conntrack labels, and tracking connections via helpers. The kernel module tests
distributed with the corresponding OVS userspace check a variety of scenarios
implementing one-way firewalls, two-way firewalls, with and without IP
fragments, VLANs and VXLAN tunnels, and in conjunction with FTP helpers, for
both IPv4 and IPv6 traffic.

This functionality is enabled through the CONFIG_OPENVSWITCH_CONNTRACK option.

The branch below has been updated with the corresponding userspace pieces:
https://github.com/justinpettit/ovs conntrack

Joe Stringer (8):
openvswitch: Scrub packet in ovs_vport_receive()
openvswitch: Serialize acts with original netlink len
openvswitch: Move MASKED* macros to datapath.h
ipv6: Export nf_ct_frag6_gather()
openvswitch: Add conntrack action
netfilter: Always export nf_connlabels_replace()
openvswitch: Allow matching on conntrack label
openvswitch: Allow attaching helpers to ct action

Justin Pettit (1):
openvswitch: Allow matching on conntrack mark

include/uapi/linux/openvswitch.h | 49 +++
net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
net/netfilter/nf_conntrack_labels.c | 2 -
net/openvswitch/Kconfig | 12 +
net/openvswitch/Makefile | 1 +
net/openvswitch/actions.c | 224 ++++++++--
net/openvswitch/conntrack.c | 758 ++++++++++++++++++++++++++++++++
net/openvswitch/conntrack.h | 128 ++++++
net/openvswitch/datapath.c | 70 ++-
net/openvswitch/datapath.h | 12 +
net/openvswitch/flow.c | 5 +
net/openvswitch/flow.h | 9 +
net/openvswitch/flow_netlink.c | 103 ++++-
net/openvswitch/flow_netlink.h | 4 +-
net/openvswitch/vport.c | 4 +
15 files changed, 1317 insertions(+), 65 deletions(-)
create mode 100644 net/openvswitch/conntrack.c
create mode 100644 net/openvswitch/conntrack.h

--
2.1.4


2015-07-30 18:13:20

by Joe Stringer

Subject: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

Signed-off-by: Joe Stringer <[email protected]>
---
net/openvswitch/vport.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index d14f594..baa018f 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -475,6 +475,9 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
struct sw_flow_key key;
int error;

+ if (!skb->sk || (sock_net(skb->sk) != read_pnet(&vport->dp->net)))
+ skb_scrub_packet(skb, true);
+
stats = this_cpu_ptr(vport->percpu_stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
--
2.1.4

2015-07-30 18:17:36

by Joe Stringer

Subject: [PATCH net-next 2/9] openvswitch: Serialize acts with original netlink len

Previously, we used the kernel-internal netlink actions length to
calculate the size of messages to serialize back to userspace.
However, the sw_flow_actions may not be formatted exactly the same as the
actions on the wire, so store the original actions length when
de-serializing and re-use the original length when serializing.
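
For illustration only, the sizing arithmetic can be sketched in userspace C.
The sketch_* names are hypothetical; the macros mirror the kernel's
NLA_ALIGN()/nla_total_size() rules (4-byte attribute header, 4-byte
alignment), and the struct is a stand-in for struct sw_flow_actions, not
the real definition.

```c
#include <stddef.h>

/* Userspace mirror of the kernel's netlink attribute sizing helpers. */
#define SKETCH_NLA_ALIGNTO 4
#define SKETCH_NLA_ALIGN(len) \
	(((len) + SKETCH_NLA_ALIGNTO - 1) & ~(size_t)(SKETCH_NLA_ALIGNTO - 1))
#define SKETCH_NLA_HDRLEN SKETCH_NLA_ALIGN(4) /* sizeof(struct nlattr) */

/* Bytes needed for one attribute carrying 'payload' bytes, padding included. */
static size_t sketch_nla_total_size(size_t payload)
{
	return SKETCH_NLA_ALIGN(SKETCH_NLA_HDRLEN + payload);
}

/* Hypothetical stand-in for struct sw_flow_actions. */
struct sketch_flow_actions {
	size_t orig_len;          /* length of the actions as received */
	unsigned int actions_len; /* length of the kernel-internal copy */
};

/* Reply-buffer contribution of OVS_FLOW_ATTR_ACTIONS after this patch. */
static size_t sketch_actions_msg_size(const struct sketch_flow_actions *acts)
{
	return sketch_nla_total_size(acts->orig_len);
}
```

With hypothetical lengths orig_len = 20 and actions_len = 36, the reply is
sized for 24 bytes of actions rather than 40, which matches what userspace
originally sent.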

Signed-off-by: Joe Stringer <[email protected]>
---
net/openvswitch/datapath.c | 2 +-
net/openvswitch/flow.h | 1 +
net/openvswitch/flow_netlink.c | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index ffe984f..d5b5473 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -713,7 +713,7 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts,

/* OVS_FLOW_ATTR_ACTIONS */
if (should_fill_actions(ufid_flags))
- len += nla_total_size(acts->actions_len);
+ len += nla_total_size(acts->orig_len);

return len
+ nla_total_size(sizeof(struct ovs_flow_stats)) /* OVS_FLOW_ATTR_STATS */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index b62cdb3..082a87b 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -144,6 +144,7 @@ struct sw_flow_id {

struct sw_flow_actions {
struct rcu_head rcu;
+ size_t orig_len; /* From flow_cmd_new netlink actions size */
u32 actions_len;
struct nlattr actions[];
};
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index a6eb77a..d536fb7 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1545,6 +1545,7 @@ static struct sw_flow_actions *nla_alloc_flow_actions(int size, bool log)
return ERR_PTR(-ENOMEM);

sfa->actions_len = 0;
+ sfa->orig_len = size;
return sfa;
}

--
2.1.4

2015-07-30 18:17:14

by Joe Stringer

Subject: [PATCH net-next 3/9] openvswitch: Move MASKED* macros to datapath.h

This will allow the ovs-conntrack code to reuse these macros.
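
As a rough usage sketch, the renamed macros behave as below. The macro
bodies are copied from this patch; sketch_set_field() is a hypothetical
helper written only to show the "KEY must not have any bits set outside of
the MASK" precondition, and is not part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Same definitions the patch moves into datapath.h. */
#define OVS_MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
#define OVS_SET_MASKED(OLD, KEY, MASK) ((OLD) = OVS_MASKED(OLD, KEY, MASK))

/* Hypothetical helper: rewrite only the bits of *field selected by mask. */
static uint32_t sketch_set_field(uint32_t *field, uint32_t key, uint32_t mask)
{
	/* The macros require that 'key' has no bits set outside 'mask'. */
	assert((key & ~mask) == 0);
	OVS_SET_MASKED(*field, key, mask);
	return *field;
}
```

Bits outside the mask keep their old value, so setting key 0x00001100 under
mask 0x0000ff00 on 0xaabbccdd yields 0xaabb11dd.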

Signed-off-by: Joe Stringer <[email protected]>
---
net/openvswitch/actions.c | 52 ++++++++++++++++++++++------------------------
net/openvswitch/datapath.h | 4 ++++
2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index cf04c2f..e50678d 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -185,10 +185,6 @@ static int pop_mpls(struct sk_buff *skb, struct sw_flow_key *key,
return 0;
}

-/* 'KEY' must not have any bits set outside of the 'MASK' */
-#define MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
-#define SET_MASKED(OLD, KEY, MASK) ((OLD) = MASKED(OLD, KEY, MASK))
-
static int set_mpls(struct sk_buff *skb, struct sw_flow_key *flow_key,
const __be32 *mpls_lse, const __be32 *mask)
{
@@ -201,7 +197,7 @@ static int set_mpls(struct sk_buff *skb, struct sw_flow_key *flow_key,
return err;

stack = (__be32 *)skb_mpls_header(skb);
- lse = MASKED(*stack, *mpls_lse, *mask);
+ lse = OVS_MASKED(*stack, *mpls_lse, *mask);
if (skb->ip_summed == CHECKSUM_COMPLETE) {
__be32 diff[] = { ~(*stack), lse };

@@ -244,9 +240,9 @@ static void ether_addr_copy_masked(u8 *dst_, const u8 *src_, const u8 *mask_)
const u16 *src = (const u16 *)src_;
const u16 *mask = (const u16 *)mask_;

- SET_MASKED(dst[0], src[0], mask[0]);
- SET_MASKED(dst[1], src[1], mask[1]);
- SET_MASKED(dst[2], src[2], mask[2]);
+ OVS_SET_MASKED(dst[0], src[0], mask[0]);
+ OVS_SET_MASKED(dst[1], src[1], mask[1]);
+ OVS_SET_MASKED(dst[2], src[2], mask[2]);
}

static int set_eth_addr(struct sk_buff *skb, struct sw_flow_key *flow_key,
@@ -330,10 +326,10 @@ static void update_ipv6_checksum(struct sk_buff *skb, u8 l4_proto,
static void mask_ipv6_addr(const __be32 old[4], const __be32 addr[4],
const __be32 mask[4], __be32 masked[4])
{
- masked[0] = MASKED(old[0], addr[0], mask[0]);
- masked[1] = MASKED(old[1], addr[1], mask[1]);
- masked[2] = MASKED(old[2], addr[2], mask[2]);
- masked[3] = MASKED(old[3], addr[3], mask[3]);
+ masked[0] = OVS_MASKED(old[0], addr[0], mask[0]);
+ masked[1] = OVS_MASKED(old[1], addr[1], mask[1]);
+ masked[2] = OVS_MASKED(old[2], addr[2], mask[2]);
+ masked[3] = OVS_MASKED(old[3], addr[3], mask[3]);
}

static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
@@ -350,15 +346,15 @@ static void set_ipv6_addr(struct sk_buff *skb, u8 l4_proto,
static void set_ipv6_fl(struct ipv6hdr *nh, u32 fl, u32 mask)
{
/* Bits 21-24 are always unmasked, so this retains their values. */
- SET_MASKED(nh->flow_lbl[0], (u8)(fl >> 16), (u8)(mask >> 16));
- SET_MASKED(nh->flow_lbl[1], (u8)(fl >> 8), (u8)(mask >> 8));
- SET_MASKED(nh->flow_lbl[2], (u8)fl, (u8)mask);
+ OVS_SET_MASKED(nh->flow_lbl[0], (u8)(fl >> 16), (u8)(mask >> 16));
+ OVS_SET_MASKED(nh->flow_lbl[1], (u8)(fl >> 8), (u8)(mask >> 8));
+ OVS_SET_MASKED(nh->flow_lbl[2], (u8)fl, (u8)mask);
}

static void set_ip_ttl(struct sk_buff *skb, struct iphdr *nh, u8 new_ttl,
u8 mask)
{
- new_ttl = MASKED(nh->ttl, new_ttl, mask);
+ new_ttl = OVS_MASKED(nh->ttl, new_ttl, mask);

csum_replace2(&nh->check, htons(nh->ttl << 8), htons(new_ttl << 8));
nh->ttl = new_ttl;
@@ -384,7 +380,7 @@ static int set_ipv4(struct sk_buff *skb, struct sw_flow_key *flow_key,
* makes sense to check if the value actually changed.
*/
if (mask->ipv4_src) {
- new_addr = MASKED(nh->saddr, key->ipv4_src, mask->ipv4_src);
+ new_addr = OVS_MASKED(nh->saddr, key->ipv4_src, mask->ipv4_src);

if (unlikely(new_addr != nh->saddr)) {
set_ip_addr(skb, nh, &nh->saddr, new_addr);
@@ -392,7 +388,7 @@ static int set_ipv4(struct sk_buff *skb, struct sw_flow_key *flow_key,
}
}
if (mask->ipv4_dst) {
- new_addr = MASKED(nh->daddr, key->ipv4_dst, mask->ipv4_dst);
+ new_addr = OVS_MASKED(nh->daddr, key->ipv4_dst, mask->ipv4_dst);

if (unlikely(new_addr != nh->daddr)) {
set_ip_addr(skb, nh, &nh->daddr, new_addr);
@@ -480,7 +476,8 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
*(__be32 *)nh & htonl(IPV6_FLOWINFO_FLOWLABEL);
}
if (mask->ipv6_hlimit) {
- SET_MASKED(nh->hop_limit, key->ipv6_hlimit, mask->ipv6_hlimit);
+ OVS_SET_MASKED(nh->hop_limit, key->ipv6_hlimit,
+ mask->ipv6_hlimit);
flow_key->ip.ttl = nh->hop_limit;
}
return 0;
@@ -509,8 +506,8 @@ static int set_udp(struct sk_buff *skb, struct sw_flow_key *flow_key,

uh = udp_hdr(skb);
/* Either of the masks is non-zero, so do not bother checking them. */
- src = MASKED(uh->source, key->udp_src, mask->udp_src);
- dst = MASKED(uh->dest, key->udp_dst, mask->udp_dst);
+ src = OVS_MASKED(uh->source, key->udp_src, mask->udp_src);
+ dst = OVS_MASKED(uh->dest, key->udp_dst, mask->udp_dst);

if (uh->check && skb->ip_summed != CHECKSUM_PARTIAL) {
if (likely(src != uh->source)) {
@@ -550,12 +547,12 @@ static int set_tcp(struct sk_buff *skb, struct sw_flow_key *flow_key,
return err;

th = tcp_hdr(skb);
- src = MASKED(th->source, key->tcp_src, mask->tcp_src);
+ src = OVS_MASKED(th->source, key->tcp_src, mask->tcp_src);
if (likely(src != th->source)) {
set_tp_port(skb, &th->source, src, &th->check);
flow_key->tp.src = src;
}
- dst = MASKED(th->dest, key->tcp_dst, mask->tcp_dst);
+ dst = OVS_MASKED(th->dest, key->tcp_dst, mask->tcp_dst);
if (likely(dst != th->dest)) {
set_tp_port(skb, &th->dest, dst, &th->check);
flow_key->tp.dst = dst;
@@ -582,8 +579,8 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
old_csum = sh->checksum;
old_correct_csum = sctp_compute_cksum(skb, sctphoff);

- sh->source = MASKED(sh->source, key->sctp_src, mask->sctp_src);
- sh->dest = MASKED(sh->dest, key->sctp_dst, mask->sctp_dst);
+ sh->source = OVS_MASKED(sh->source, key->sctp_src, mask->sctp_src);
+ sh->dest = OVS_MASKED(sh->dest, key->sctp_dst, mask->sctp_dst);

new_csum = sctp_compute_cksum(skb, sctphoff);

@@ -759,12 +756,13 @@ static int execute_masked_set_action(struct sk_buff *skb,

switch (nla_type(a)) {
case OVS_KEY_ATTR_PRIORITY:
- SET_MASKED(skb->priority, nla_get_u32(a), *get_mask(a, u32 *));
+ OVS_SET_MASKED(skb->priority, nla_get_u32(a),
+ *get_mask(a, u32 *));
flow_key->phy.priority = skb->priority;
break;

case OVS_KEY_ATTR_SKB_MARK:
- SET_MASKED(skb->mark, nla_get_u32(a), *get_mask(a, u32 *));
+ OVS_SET_MASKED(skb->mark, nla_get_u32(a), *get_mask(a, u32 *));
flow_key->phy.skb_mark = skb->mark;
break;

diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 6b28c5c..487a85f 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -200,6 +200,10 @@ void ovs_dp_notify_wq(struct work_struct *work);
int action_fifos_init(void);
void action_fifos_exit(void);

+/* 'KEY' must not have any bits set outside of the 'MASK' */
+#define OVS_MASKED(OLD, KEY, MASK) ((KEY) | ((OLD) & ~(MASK)))
+#define OVS_SET_MASKED(OLD, KEY, MASK) ((OLD) = OVS_MASKED(OLD, KEY, MASK))
+
#define OVS_NLERR(logging_allowed, fmt, ...) \
do { \
if (logging_allowed && net_ratelimit()) \
--
2.1.4

2015-07-30 18:16:01

by Joe Stringer

Subject: [PATCH net-next 4/9] ipv6: Export nf_ct_frag6_gather()

Signed-off-by: Joe Stringer <[email protected]>
---
net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 6f187c8..ce3d5d8 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -633,6 +633,7 @@ ret_orig:
kfree_skb(clone);
return skb;
}
+EXPORT_SYMBOL_GPL(nf_ct_frag6_gather);

void nf_ct_frag6_consume_orig(struct sk_buff *skb)
{
--
2.1.4

2015-07-30 18:13:28

by Joe Stringer

Subject: [PATCH net-next 5/9] openvswitch: Add conntrack action

Expose the kernel connection tracker via OVS. Userspace components can
make use of the "ct()" action, followed by "recirculate", to populate
the conntracking state in the OVS flow key, and subsequently match on
that state.

Example ODP flows allowing traffic from 1->2, only replies from 2->1:
in_port=1,tcp,action=ct(commit,zone=1),2
in_port=2,ct_state=-trk,tcp,action=ct(zone=1),recirc(1)
recirc_id=1,in_port=2,ct_state=+trk+est-new,tcp,action=1
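
One way to read the ct_state matches above is as a value/mask pair over the
OVS_CS_F_* bits. The flag values below are the ones this patch adds to the
uapi header; sketch_ct_state_match() is a hypothetical helper, and the
"+flag sets value and mask, -flag sets mask only" encoding is an assumption
about how userspace expresses the match, not something this patch defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in this patch's include/uapi/linux/openvswitch.h. */
#define OVS_CS_F_NEW         0x01
#define OVS_CS_F_ESTABLISHED 0x02
#define OVS_CS_F_RELATED     0x04
#define OVS_CS_F_INVALID     0x20
#define OVS_CS_F_REPLY_DIR   0x40
#define OVS_CS_F_TRACKED     0x80

/*
 * Hypothetical helper: compare only the bits named in the match.
 * "+trk+est-new" becomes value = TRACKED | ESTABLISHED and
 * mask = TRACKED | ESTABLISHED | NEW; "-trk" becomes value = 0 and
 * mask = TRACKED. Bits outside the mask are ignored.
 */
static bool sketch_ct_state_match(uint8_t state, uint8_t value, uint8_t mask)
{
	return (state & mask) == value;
}
```

Under that reading, an untracked packet (state 0) matches "-trk" and is sent
through ct(zone=1); after recirculation an established reply carries
TRACKED | ESTABLISHED | REPLY_DIR and matches "+trk+est-new".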

IP fragments are handled by transparently assembling them as part of the
ct action. The maximum received unit (MRU) size is tracked so that
refragmentation can occur during output.
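
The output-side decision can be summarised with the simplified model below.
It follows the order of checks in do_output() in this patch, but the
sketch_* names are hypothetical and the model leaves out the cases where the
vport has no net_device or the packet is neither IPv4 nor IPv6.

```c
#include <stddef.h>

#define SKETCH_ETH_HLEN 14 /* Ethernet header length, as ETH_HLEN */

enum sketch_verdict {
	SKETCH_SEND,     /* transmit unchanged */
	SKETCH_FRAGMENT, /* refragment to the recorded MRU, then transmit */
	SKETCH_DROP,     /* the MRU cannot be honoured on this port */
};

/* Decide what to do with a packet of 'skb_len' bytes (L2 header included). */
static enum sketch_verdict
sketch_decide_output(size_t skb_len, unsigned int mru, unsigned int mtu)
{
	/* No MRU recorded, or the packet already fits within it. */
	if (!mru || skb_len <= (size_t)mru + SKETCH_ETH_HLEN)
		return SKETCH_SEND;

	/* Received fragments were larger than the egress device allows. */
	if (mru > mtu)
		return SKETCH_DROP;

	return SKETCH_FRAGMENT;
}
```

For example, a reassembled 3014-byte frame with MRU 1000 leaving a port with
MTU 1500 is refragmented, while the same frame with MRU 1600 is dropped.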

IP frag handling contributed by Andy Zhou.

Signed-off-by: Joe Stringer <[email protected]>
Signed-off-by: Justin Pettit <[email protected]>
Signed-off-by: Andy Zhou <[email protected]>
---
This can be tested with the corresponding userspace component here:
https://www.github.com/justinpettit/openvswitch conntrack
---
include/uapi/linux/openvswitch.h | 41 ++++
net/openvswitch/Kconfig | 11 +
net/openvswitch/Makefile | 1 +
net/openvswitch/actions.c | 162 ++++++++++++-
net/openvswitch/conntrack.c | 480 +++++++++++++++++++++++++++++++++++++++
net/openvswitch/conntrack.h | 82 +++++++
net/openvswitch/datapath.c | 62 +++--
net/openvswitch/datapath.h | 6 +
net/openvswitch/flow.c | 3 +
net/openvswitch/flow.h | 6 +
net/openvswitch/flow_netlink.c | 73 ++++--
net/openvswitch/flow_netlink.h | 4 +-
net/openvswitch/vport.c | 1 +
13 files changed, 897 insertions(+), 35 deletions(-)
create mode 100644 net/openvswitch/conntrack.c
create mode 100644 net/openvswitch/conntrack.h

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index d6b8854..1dae30a 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -164,6 +164,9 @@ enum ovs_packet_cmd {
* %OVS_USERSPACE_ATTR_EGRESS_TUN_PORT attribute, which is sent only if the
* output port is actually a tunnel port. Contains the output tunnel key
* extracted from the packet as nested %OVS_TUNNEL_KEY_ATTR_* attributes.
+ * @OVS_PACKET_ATTR_MRU: Present for an %OVS_PACKET_CMD_ACTION and
+ * %OVS_PACKET_ATTR_USERSPACE action. Specifies the maximum received fragment
+ * size.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_PACKET_* commands.
@@ -180,6 +183,7 @@ enum ovs_packet_attr {
OVS_PACKET_ATTR_UNUSED2,
OVS_PACKET_ATTR_PROBE, /* Packet operation is a feature probe,
error logging should be suppressed. */
+ OVS_PACKET_ATTR_MRU, /* Maximum received IP fragment size. */
__OVS_PACKET_ATTR_MAX
};

@@ -319,6 +323,8 @@ enum ovs_key_attr {
OVS_KEY_ATTR_MPLS, /* array of struct ovs_key_mpls.
* The implementation may restrict
* the accepted length of the array. */
+ OVS_KEY_ATTR_CT_STATE, /* u8 bitmask of OVS_CS_F_* */
+ OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */

#ifdef __KERNEL__
OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */
@@ -431,6 +437,15 @@ struct ovs_key_nd {
__u8 nd_tll[ETH_ALEN];
};

+/* OVS_KEY_ATTR_CT_STATE flags */
+#define OVS_CS_F_NEW 0x01 /* Beginning of a new connection. */
+#define OVS_CS_F_ESTABLISHED 0x02 /* Part of an existing connection. */
+#define OVS_CS_F_RELATED 0x04 /* Related to an established
+ * connection. */
+#define OVS_CS_F_INVALID 0x20 /* Could not track connection. */
+#define OVS_CS_F_REPLY_DIR 0x40 /* Flow is in the reply direction. */
+#define OVS_CS_F_TRACKED 0x80 /* Conntrack has occurred. */
+
/**
* enum ovs_flow_attr - attributes for %OVS_FLOW_* commands.
* @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow
@@ -595,6 +610,29 @@ struct ovs_action_hash {
};

/**
+ * enum ovs_ct_attr - Attributes for %OVS_ACTION_ATTR_CT action.
+ * @OVS_CT_ATTR_FLAGS: u32 connection tracking flags.
+ * @OVS_CT_ATTR_ZONE: u16 connection tracking zone.
+ * @OVS_CT_ATTR_HELPER: variable length string defining conntrack ALG.
+ */
+enum ovs_ct_attr {
+ OVS_CT_ATTR_UNSPEC,
+ OVS_CT_ATTR_FLAGS, /* u32 bitmask of OVS_CT_F_*. */
+ OVS_CT_ATTR_ZONE, /* u16 zone id. */
+ __OVS_CT_ATTR_MAX
+};
+
+#define OVS_CT_ATTR_MAX (__OVS_CT_ATTR_MAX - 1)
+
+/*
+ * OVS_CT_ATTR_FLAGS flags - bitmask of %OVS_CT_F_*
+ * @OVS_CT_F_COMMIT: Commits the flow to the conntrack hashtable in the
+ * specified zone. Future packets for the current connection will be
+ * considered as 'established' or 'related'.
+ */
+#define OVS_CT_F_COMMIT 0x01
+
+/**
* enum ovs_action_attr - Action types.
*
* @OVS_ACTION_ATTR_OUTPUT: Output packet to port.
@@ -623,6 +661,8 @@ struct ovs_action_hash {
* indicate the new packet contents. This could potentially still be
* %ETH_P_MPLS if the resulting MPLS label stack is not empty. If there
* is no MPLS label stack, as determined by ethertype, no action is taken.
+ * @OVS_ACTION_ATTR_CT: Track the connection. Populate the conntrack-related
+ * entries in the flow key.
*
* Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all
* fields within a header are modifiable, e.g. the IPv4 protocol and fragment
@@ -648,6 +688,7 @@ enum ovs_action_attr {
* data immediately followed by a mask.
* The data must be zero for the unmasked
* bits. */
+ OVS_ACTION_ATTR_CT, /* One nested OVS_CT_ATTR_* . */

__OVS_ACTION_ATTR_MAX, /* Nothing past this will be accepted
* from userspace. */
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 6ed1d2d..92bb3d3 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -32,6 +32,17 @@ config OPENVSWITCH

If unsure, say N.

+config OPENVSWITCH_CONNTRACK
+ bool "Open vSwitch conntrack action support"
+ depends on OPENVSWITCH
+ depends on NF_CONNTRACK
+ default OPENVSWITCH
+ ---help---
+ If you say Y here, then Open vSwitch module will be able to pass
+ packets through conntrack.
+
+ Say N to exclude this support and reduce the binary size.
+
config OPENVSWITCH_GRE
tristate "Open vSwitch GRE tunneling support"
depends on OPENVSWITCH
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 38e0e14..5bb8abe 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -15,5 +15,6 @@ openvswitch-y := \
vport-internal_dev.o \
vport-netdev.o

+openvswitch-$(CONFIG_OPENVSWITCH_CONNTRACK) += conntrack.o
obj-$(CONFIG_OPENVSWITCH_GENEVE)+= vport-geneve.o
obj-$(CONFIG_OPENVSWITCH_GRE) += vport-gre.o
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index e50678d..4a62ed4 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -22,6 +22,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/openvswitch.h>
+#include <linux/netfilter_ipv6.h>
#include <linux/sctp.h>
#include <linux/tcp.h>
#include <linux/udp.h>
@@ -29,6 +30,7 @@
#include <linux/if_arp.h>
#include <linux/if_vlan.h>

+#include <net/dst.h>
#include <net/ip.h>
#include <net/ipv6.h>
#include <net/checksum.h>
@@ -38,6 +40,7 @@

#include "datapath.h"
#include "flow.h"
+#include "conntrack.h"
#include "vport.h"

static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
@@ -52,6 +55,16 @@ struct deferred_action {
struct sw_flow_key pkt_key;
};

+struct ovs_frag_data {
+ struct dst_entry *dst;
+ struct vport *vport;
+ struct sw_flow_key *key;
+ struct ovs_skb_cb cb;
+ __be16 vlan_proto;
+};
+
+static DEFINE_PER_CPU(struct ovs_frag_data, ovs_frag_data_storage);
+
#define DEFERRED_ACTION_FIFO_SIZE 10
struct action_fifo {
int head;
@@ -594,14 +607,136 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
return 0;
}

-static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
+/* Given an IP frame, reconstruct its MAC header. */
+static void ovs_setup_l2_header(struct sk_buff *skb,
+ const struct ovs_frag_data *data)
+{
+ struct sw_flow_key *key = data->key;
+
+ skb_push(skb, ETH_HLEN);
+ skb_reset_mac_header(skb);
+
+ ether_addr_copy(eth_hdr(skb)->h_source, key->eth.src);
+ ether_addr_copy(eth_hdr(skb)->h_dest, key->eth.dst);
+ eth_hdr(skb)->h_proto = key->eth.type;
+
+ if ((data->key->eth.tci & htons(VLAN_TAG_PRESENT)) &&
+ !skb_vlan_tag_present(skb))
+ __vlan_hwaccel_put_tag(skb, data->vlan_proto,
+ ntohs(key->eth.tci));
+}
+
+static void prepare_frag(struct vport *vport, struct sw_flow_key *key,
+ struct sk_buff *skb)
+{
+ unsigned int hlen = ETH_HLEN;
+ struct ovs_frag_data *data;
+
+ data = this_cpu_ptr(&ovs_frag_data_storage);
+ data->dst = skb_dst(skb);
+ data->vport = vport;
+ data->key = key;
+ data->cb = *OVS_CB(skb);
+
+ if (key->eth.tci & htons(VLAN_TAG_PRESENT)) {
+ if (skb_vlan_tag_present(skb)) {
+ data->vlan_proto = skb->vlan_proto;
+ } else {
+ data->vlan_proto = vlan_eth_hdr(skb)->h_vlan_proto;
+ hlen += VLAN_HLEN;
+ }
+ }
+
+ memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+ skb_pull(skb, hlen);
+}
+
+static int ovs_vport_output(struct sock *sock, struct sk_buff *skb)
+{
+ struct ovs_frag_data *data = this_cpu_ptr(&ovs_frag_data_storage);
+ struct vport *vport = data->vport;
+
+ skb_dst_drop(skb);
+ skb_dst_set(skb, dst_clone(data->dst));
+ *OVS_CB(skb) = data->cb;
+
+ ovs_setup_l2_header(skb, data);
+ ovs_vport_send(vport, skb);
+
+ return 0;
+}
+
+unsigned int
+ovs_dst_get_mtu(const struct dst_entry *dst)
+{
+ return dst->dev->mtu;
+}
+
+static struct dst_ops ovs_dst_ops = {
+ .family = AF_UNSPEC,
+ .mtu = ovs_dst_get_mtu,
+};
+
+static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
+ struct sw_flow_key *key)
{
struct vport *vport = ovs_vport_rcu(dp, out_port);

- if (likely(vport))
- ovs_vport_send(vport, skb);
- else
+ if (likely(vport)) {
+ unsigned int mru = OVS_CB(skb)->mru;
+ struct dst_entry *orig_dst = dst_clone(skb_dst(skb));
+
+ if (!mru || (skb->len <= mru + ETH_HLEN)) {
+ ovs_vport_send(vport, skb);
+ } else if (!vport->dev) {
+ WARN_ONCE(1, "Cannot fragment packets to vport %s\n",
+ vport->ops->get_name(vport));
+ kfree_skb(skb);
+ } else if (mru > vport->dev->mtu) {
+ kfree_skb(skb);
+ } else if (key->eth.type == htons(ETH_P_IP)) {
+ struct dst_entry ovs_dst;
+
+ prepare_frag(vport, key, skb);
+ dst_init(&ovs_dst, &ovs_dst_ops, vport->dev,
+ 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
+
+ skb_dst_drop(skb);
+ skb_dst_set_noref(skb, &ovs_dst);
+ IPCB(skb)->frag_max_size = mru;
+
+ ip_do_fragment(skb->sk, skb, ovs_vport_output);
+ dev_put(ovs_dst.dev);
+ } else if (key->eth.type == htons(ETH_P_IPV6)) {
+ const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
+ struct rt6_info ovs_rt;
+
+ if (!v6ops) {
+ kfree_skb(skb);
+ goto exit;
+ }
+
+ prepare_frag(vport, key, skb);
+ memset(&ovs_rt, 0, sizeof(ovs_rt));
+ dst_init(&ovs_rt.dst, &ovs_dst_ops, vport->dev,
+ 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
+
+ skb_dst_drop(skb);
+ skb_dst_set_noref(skb, &ovs_rt.dst);
+ IP6CB(skb)->frag_max_size = mru;
+
+ v6ops->fragment(skb->sk, skb, ovs_vport_output);
+ dev_put(ovs_rt.dst.dev);
+ } else {
+ WARN_ONCE(1, "Failed fragment to %s: MRU=%d, MTU=%d.",
+ ovs_vport_name(vport), mru, vport->dev->mtu);
+ kfree_skb(skb);
+ }
+exit:
+ dst_release(orig_dst);
+ } else {
kfree_skb(skb);
+ }
}

static int output_userspace(struct datapath *dp, struct sk_buff *skb,
@@ -615,6 +750,10 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,

memset(&upcall, 0, sizeof(upcall));
upcall.cmd = OVS_PACKET_CMD_ACTION;
+ upcall.userdata = NULL;
+ upcall.portid = 0;
+ upcall.egress_tun_info = NULL;
+ upcall.mru = OVS_CB(skb)->mru;

for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
a = nla_next(a, &rem)) {
@@ -874,7 +1013,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC);

if (out_skb)
- do_output(dp, out_skb, prev_port);
+ do_output(dp, out_skb, prev_port, key);

prev_port = -1;
}
@@ -931,16 +1070,25 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
case OVS_ACTION_ATTR_SAMPLE:
err = sample(dp, skb, key, a, attr, len);
break;
+
+ case OVS_ACTION_ATTR_CT:
+ err = ovs_ct_execute(skb, key, nla_data(a));
+ break;
}

if (unlikely(err)) {
- kfree_skb(skb);
+ /* Hide stolen fragments from user space. */
+ if (err == -EINPROGRESS)
+ err = 0;
+ else
+ kfree_skb(skb);
+
return err;
}
}

if (prev_port != -1)
- do_output(dp, skb, prev_port);
+ do_output(dp, skb, prev_port, key);
else
consume_skb(skb);

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
new file mode 100644
index 0000000..284b89e
--- /dev/null
+++ b/net/openvswitch/conntrack.c
@@ -0,0 +1,480 @@
+/*
+ * Copyright (c) 2015 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/openvswitch.h>
+#include <net/ip.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
+
+#include "datapath.h"
+#include "conntrack.h"
+#include "flow.h"
+#include "flow_netlink.h"
+
+struct ovs_ct_len_tbl {
+ size_t maxlen;
+ size_t minlen;
+};
+
+struct ovs_conntrack_info {
+ struct nf_conn *ct;
+ u32 flags;
+ u16 zone;
+ u16 family;
+};
+
+static u16 key_to_nfproto(const struct sw_flow_key *key)
+{
+ switch (ntohs(key->eth.type)) {
+ case ETH_P_IP:
+ return NFPROTO_IPV4;
+ case ETH_P_IPV6:
+ return NFPROTO_IPV6;
+ default:
+ return NFPROTO_UNSPEC;
+ }
+}
+
+static struct net *ovs_get_net(const struct sk_buff *skb)
+{
+ struct vport *vport;
+
+ vport = OVS_CB(skb)->input_vport;
+ if (!vport) {
+ WARN_ONCE(1, "Can't obtain netns from vport");
+ return ERR_PTR(-EINVAL);
+ }
+
+ return read_pnet(&vport->dp->net);
+}
+
+/* Map SKB connection state into the values used by flow definition. */
+static u8 __ovs_ct_get_state(enum ip_conntrack_info ctinfo)
+{
+ u8 cstate = OVS_CS_F_TRACKED;
+
+ switch (ctinfo) {
+ case IP_CT_ESTABLISHED_REPLY:
+ case IP_CT_RELATED_REPLY:
+ case IP_CT_NEW_REPLY:
+ cstate |= OVS_CS_F_REPLY_DIR;
+ break;
+ default:
+ break;
+ }
+
+ switch (ctinfo) {
+ case IP_CT_ESTABLISHED:
+ case IP_CT_ESTABLISHED_REPLY:
+ cstate |= OVS_CS_F_ESTABLISHED;
+ break;
+ case IP_CT_RELATED:
+ case IP_CT_RELATED_REPLY:
+ cstate |= OVS_CS_F_RELATED;
+ break;
+ case IP_CT_NEW:
+ case IP_CT_NEW_REPLY:
+ cstate |= OVS_CS_F_NEW;
+ break;
+ default:
+ break;
+ }
+
+ return cstate;
+}
+
+u8 ovs_ct_get_state(const struct sk_buff *skb)
+{
+ enum ip_conntrack_info ctinfo;
+
+ if (!nf_ct_get(skb, &ctinfo))
+ return 0;
+ return __ovs_ct_get_state(ctinfo);
+}
+
+u16 ovs_ct_get_zone(const struct sk_buff *skb)
+{
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+
+ return ct ? nf_ct_zone(ct) : NF_CT_DEFAULT_ZONE;
+}
+
+static bool __ovs_ct_state_valid(u8 state)
+{
+ return (state && !(state & OVS_CS_F_INVALID));
+}
+
+bool ovs_ct_state_valid(const struct sw_flow_key *key)
+{
+ return __ovs_ct_state_valid(key->ct.state);
+}
+
+static int handle_fragments(struct net *net, struct sw_flow_key *key,
+ u16 zone, struct sk_buff *skb)
+{
+ struct ovs_skb_cb ovs_cb = *OVS_CB(skb);
+
+ if (key->eth.type == htons(ETH_P_IP)) {
+ enum ip_defrag_users user = IP_DEFRAG_CONNTRACK_IN + zone;
+ int err;
+
+ memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+ err = ip_defrag(skb, user);
+ if (err)
+ return err;
+
+ ovs_cb.mru = IPCB(skb)->frag_max_size;
+ } else if (key->eth.type == htons(ETH_P_IPV6)) {
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
+ enum ip6_defrag_users user = IP6_DEFRAG_CONNTRACK_IN + zone;
+ struct sk_buff *reasm;
+
+ memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
+ reasm = nf_ct_frag6_gather(skb, user);
+ if (reasm == NULL)
+ return -EINPROGRESS;
+
+ if (skb == reasm)
+ return -EINVAL;
+
+ key->ip.proto = ipv6_hdr(reasm)->nexthdr;
+ skb_morph(skb, reasm);
+ consume_skb(reasm);
+ ovs_cb.mru = IP6CB(skb)->frag_max_size;
+#else
+ return -EPFNOSUPPORT;
+#endif
+ } else {
+ return -EPFNOSUPPORT;
+ }
+
+ key->ip.frag = OVS_FRAG_TYPE_NONE;
+ skb_clear_hash(skb);
+ skb->ignore_df = 1;
+ *OVS_CB(skb) = ovs_cb;
+
+ return 0;
+}
+
+static struct nf_conntrack_expect *
+ovs_ct_expect_find(struct net *net, u16 zone, u16 proto,
+ const struct sk_buff *skb)
+{
+ struct nf_conntrack_tuple tuple;
+
+ if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb), proto, &tuple))
+ return NULL;
+ return __nf_ct_expect_find(net, zone, &tuple);
+}
+
+/* Determine whether skb->nfct is equal to the result of conntrack lookup. */
+static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb,
+ const struct ovs_conntrack_info *info)
+{
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (!ct)
+ return false;
+ if (!net_eq(net, read_pnet(&ct->ct_net))) {
+ WARN(true, "skb->nfct associated with different namespace\n");
+ return false;
+ }
+ if (info->zone != nf_ct_zone(ct))
+ return false;
+
+ return true;
+}
+
+static void __ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
+ u8 state, u16 zone)
+{
+ key->ct.state = state;
+ key->ct.zone = zone;
+}
+
+static void ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
+ u16 zone)
+{
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+ u8 state;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (ct) {
+ state = __ovs_ct_get_state(ctinfo);
+ zone = nf_ct_zone(ct);
+ if (ct->master)
+ state |= OVS_CS_F_RELATED;
+ } else {
+ state = OVS_CS_F_TRACKED | OVS_CS_F_INVALID;
+ }
+
+ __ovs_ct_update_key(skb, key, state, zone);
+}
+
+static int __ovs_ct_lookup(struct net *net, const struct sw_flow_key *key,
+ const struct ovs_conntrack_info *info,
+ struct sk_buff *skb)
+{
+ /* If we are recirculating packets to match on conntrack fields and
+ * committing with a separate conntrack action, then we don't need to
+ * actually run the packet through conntrack twice unless it's for a
+ * different zone. */
+ if (!skb_nfct_cached(net, skb, info)) {
+ struct nf_conn *tmpl = info->ct;
+
+ /* Associate skb with specified zone. */
+ if (tmpl) {
+ if (skb->nfct)
+ nf_conntrack_put(skb->nfct);
+ nf_conntrack_get(&tmpl->ct_general);
+ skb->nfct = &tmpl->ct_general;
+ skb->nfctinfo = IP_CT_NEW;
+ }
+
+ if (nf_conntrack_in(net, info->family, NF_INET_PRE_ROUTING,
+ skb) != NF_ACCEPT)
+ return -ENOENT;
+ }
+
+ return 0;
+}
+
+/* Lookup connection and read fields into key. */
+static int ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
+ const struct ovs_conntrack_info *info,
+ struct sk_buff *skb)
+{
+ struct nf_conntrack_expect *exp;
+
+ exp = ovs_ct_expect_find(net, info->zone, info->family, skb);
+ if (exp) {
+ u8 state;
+
+ state = OVS_CS_F_TRACKED | OVS_CS_F_NEW | OVS_CS_F_RELATED;
+ __ovs_ct_update_key(skb, key, state, info->zone);
+ } else {
+ int err;
+
+ err = __ovs_ct_lookup(net, key, info, skb);
+ if (err)
+ return err;
+
+ ovs_ct_update_key(skb, key, info->zone);
+ }
+
+ return 0;
+}
+
+/* Lookup connection and confirm if unconfirmed. */
+static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
+ const struct ovs_conntrack_info *info,
+ struct sk_buff *skb)
+{
+ u8 state;
+ int err;
+
+ state = key->ct.state;
+ if (key->ct.zone == info->zone &&
+ ((state & OVS_CS_F_TRACKED) && !(state & OVS_CS_F_NEW))) {
+ /* Previous lookup has shown that this connection is already
+ * tracked and committed. Skip committing. */
+ return 0;
+ }
+
+ err = __ovs_ct_lookup(net, key, info, skb);
+ if (err)
+ return err;
+ if (nf_conntrack_confirm(skb) != NF_ACCEPT)
+ return -EINVAL;
+
+ ovs_ct_update_key(skb, key, info->zone);
+
+ return 0;
+}
+
+int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
+ const struct ovs_conntrack_info *info)
+{
+ struct net *net;
+ int nh_ofs;
+ int err;
+
+ net = ovs_get_net(skb);
+ if (IS_ERR(net))
+ return PTR_ERR(net);
+
+ /* The conntrack module expects to be working at L3. */
+ nh_ofs = skb_network_offset(skb);
+ skb_pull(skb, nh_ofs);
+
+ if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
+ err = handle_fragments(net, key, info->zone, skb);
+ if (err)
+ return err;
+ }
+
+ if (info->flags & OVS_CT_F_COMMIT)
+ err = ovs_ct_commit(net, key, info, skb);
+ else
+ err = ovs_ct_lookup(net, key, info, skb);
+
+ skb_push(skb, nh_ofs);
+ return err;
+}
+
+static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
+ [OVS_CT_ATTR_FLAGS] = { .minlen = sizeof(u32),
+ .maxlen = sizeof(u32) },
+ [OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16),
+ .maxlen = sizeof(u16) },
+};
+
+static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
+ bool log)
+{
+ struct nlattr *a;
+ int rem;
+
+ nla_for_each_nested(a, attr, rem) {
+ int type = nla_type(a);
+ int maxlen;
+ int minlen;
+
+ if (type > OVS_CT_ATTR_MAX) {
+ OVS_NLERR(log,
+ "Unknown conntrack attr (type=%d, max=%d)",
+ type, OVS_CT_ATTR_MAX);
+ return -EINVAL;
+ }
+
+ /* Index the length table only after range-checking 'type'. */
+ maxlen = ovs_ct_attr_lens[type].maxlen;
+ minlen = ovs_ct_attr_lens[type].minlen;
+ if (nla_len(a) < minlen || nla_len(a) > maxlen) {
+ OVS_NLERR(log,
+ "Conntrack attr type has unexpected length (type=%d, length=%d, expected=%d)",
+ type, nla_len(a), maxlen);
+ return -EINVAL;
+ }
+
+ switch (type) {
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+ case OVS_CT_ATTR_ZONE:
+ info->zone = nla_get_u16(a);
+ break;
+#endif
+ case OVS_CT_ATTR_FLAGS:
+ info->flags = nla_get_u32(a);
+ break;
+ default:
+ OVS_NLERR(log, "Unknown conntrack attr (%d)",
+ type);
+ return -EINVAL;
+ }
+ }
+
+ if (rem > 0) {
+ OVS_NLERR(log, "Conntrack attr has %d unknown bytes", rem);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+bool ovs_ct_verify(enum ovs_key_attr attr)
+{
+ if (attr == OVS_KEY_ATTR_CT_STATE)
+ return true;
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+ if (attr == OVS_KEY_ATTR_CT_ZONE)
+ return true;
+#endif
+
+ return false;
+}
+
+int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
+ const struct sw_flow_key *key,
+ struct sw_flow_actions **sfa, bool log)
+{
+ struct ovs_conntrack_info ct_info;
+ struct nf_conntrack_tuple t;
+ u16 family;
+ int err;
+
+ family = key_to_nfproto(key);
+ if (family == NFPROTO_UNSPEC) {
+ OVS_NLERR(log, "ct family unspecified");
+ return -EINVAL;
+ }
+
+ memset(&ct_info, 0, sizeof(ct_info));
+ ct_info.family = family;
+
+ err = parse_ct(attr, &ct_info, log);
+ if (err)
+ return err;
+
+ /* Set up template for tracking connections in specific zones. */
+ memset(&t, 0, sizeof(t));
+ ct_info.ct = nf_conntrack_alloc(net, ct_info.zone, &t, &t,
+ GFP_KERNEL);
+ if (IS_ERR(ct_info.ct)) {
+ OVS_NLERR(log, "Failed to allocate conntrack template");
+ return PTR_ERR(ct_info.ct);
+ }
+
+ err = ovs_nla_add_action(sfa, OVS_ACTION_ATTR_CT, &ct_info,
+ sizeof(ct_info), log);
+ if (err)
+ goto err_free_ct;
+
+ nf_conntrack_tmpl_insert(net, ct_info.ct);
+ return 0;
+err_free_ct:
+ nf_conntrack_free(ct_info.ct);
+ return err;
+}
+
+int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
+ struct sk_buff *skb)
+{
+ struct nlattr *start;
+
+ start = nla_nest_start(skb, OVS_ACTION_ATTR_CT);
+ if (!start)
+ return -EMSGSIZE;
+
+ if (nla_put_u32(skb, OVS_CT_ATTR_FLAGS, ct_info->flags))
+ return -EMSGSIZE;
+#ifdef CONFIG_NF_CONNTRACK_ZONES
+ if (nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone))
+ return -EMSGSIZE;
+#endif
+
+ nla_nest_end(skb, start);
+
+ return 0;
+}
+
+void ovs_ct_free_action(const struct nlattr *a)
+{
+ struct ovs_conntrack_info *ct_info = nla_data(a);
+
+ if (ct_info->ct)
+ nf_ct_put(ct_info->ct);
+}
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
new file mode 100644
index 0000000..7a01751
--- /dev/null
+++ b/net/openvswitch/conntrack.h
@@ -0,0 +1,82 @@
+/*
+ * Copyright (c) 2015 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef OVS_CONNTRACK_H
+#define OVS_CONNTRACK_H 1
+
+struct ovs_net;
+struct sw_flow_key;
+struct sw_flow_actions;
+struct ovs_conntrack_info;
+struct ovs_key_ct_label;
+enum ovs_key_attr;
+
+#if defined(CONFIG_OPENVSWITCH_CONNTRACK)
+bool ovs_ct_verify(enum ovs_key_attr attr);
+int ovs_ct_copy_action(struct net *, const struct nlattr *,
+ const struct sw_flow_key *, struct sw_flow_actions **,
+ bool log);
+int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
+
+int ovs_ct_execute(struct sk_buff *, struct sw_flow_key *,
+ const struct ovs_conntrack_info *);
+
+u8 ovs_ct_get_state(const struct sk_buff *skb);
+u16 ovs_ct_get_zone(const struct sk_buff *skb);
+bool ovs_ct_state_valid(const struct sw_flow_key *key);
+void ovs_ct_free_action(const struct nlattr *a);
+#else
+#include <linux/errno.h>
+
+static inline bool ovs_ct_verify(int attr)
+{
+ return false;
+}
+
+static inline int ovs_ct_copy_action(struct net *net, const struct nlattr *nla,
+ const struct sw_flow_key *key,
+ struct sw_flow_actions **acts, bool log)
+{
+ return -ENOTSUPP;
+}
+
+static inline int ovs_ct_action_to_attr(const struct ovs_conntrack_info *info,
+ struct sk_buff *skb)
+{
+ return -ENOTSUPP;
+}
+
+static inline int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
+ const struct ovs_conntrack_info *info)
+{
+ return -ENOTSUPP;
+}
+
+static inline u8 ovs_ct_get_state(const struct sk_buff *skb)
+{
+ return 0;
+}
+
+static inline u16 ovs_ct_get_zone(const struct sk_buff *skb)
+{
+ return 0;
+}
+
+static inline bool ovs_ct_state_valid(const struct sw_flow_key *key)
+{
+ return false;
+}
+
+static inline void ovs_ct_free_action(const struct nlattr *a) { }
+#endif
+#endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index d5b5473..23717a3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -275,6 +275,8 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
memset(&upcall, 0, sizeof(upcall));
upcall.cmd = OVS_PACKET_CMD_MISS;
upcall.portid = ovs_vport_find_upcall_portid(p, skb);
+ upcall.egress_tun_info = NULL;
+ upcall.mru = OVS_CB(skb)->mru;
error = ovs_dp_upcall(dp, skb, key, &upcall);
if (unlikely(error))
kfree_skb(skb);
@@ -400,9 +402,23 @@ static size_t upcall_msg_size(const struct dp_upcall_info *upcall_info,
if (upcall_info->actions_len)
size += nla_total_size(upcall_info->actions_len);

+ /* OVS_PACKET_ATTR_MRU */
+ if (upcall_info->mru)
+ size += nla_total_size(sizeof(u16));
+
return size;
}

+static void pad_packet(struct datapath *dp, struct sk_buff *skb)
+{
+ if (!(dp->user_features & OVS_DP_F_UNALIGNED)) {
+ size_t plen = NLA_ALIGN(skb->len) - skb->len;
+
+ if (plen > 0)
+ memset(skb_put(skb, plen), 0, plen);
+ }
+}
+
static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
const struct sw_flow_key *key,
const struct dp_upcall_info *upcall_info)
@@ -490,6 +506,16 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
nla_nest_end(user_skb, nla);
else
nla_nest_cancel(user_skb, nla);
+ }
+
+ /* Add OVS_PACKET_ATTR_MRU */
+ if (upcall_info->mru) {
+ if (nla_put_u16(user_skb, OVS_PACKET_ATTR_MRU,
+ upcall_info->mru)) {
+ err = -ENOBUFS;
+ goto out;
+ }
+ pad_packet(dp, user_skb);
}

/* Only reserve room for attribute header, packet data is added
@@ -505,12 +531,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
goto out;

/* Pad OVS_PACKET_ATTR_PACKET if linear copy was performed */
- if (!(dp->user_features & OVS_DP_F_UNALIGNED)) {
- size_t plen = NLA_ALIGN(user_skb->len) - user_skb->len;
-
- if (plen > 0)
- memset(skb_put(user_skb, plen), 0, plen);
- }
+ pad_packet(dp, user_skb);

((struct nlmsghdr *) user_skb->data)->nlmsg_len = user_skb->len;

@@ -532,12 +553,14 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
struct sk_buff *packet;
struct sw_flow *flow;
struct sw_flow_actions *sf_acts;
+ struct net *net = sock_net(skb->sk);
struct datapath *dp;
struct ethhdr *eth;
struct vport *input_vport;
int len;
int err;
bool log = !a[OVS_PACKET_ATTR_PROBE];
+ unsigned int mru;

err = -EINVAL;
if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] ||
@@ -564,6 +587,14 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
else
packet->protocol = htons(ETH_P_802_2);

+ /* Set packet's mru */
+ mru = 0;
+ if (a[OVS_PACKET_ATTR_MRU]) {
+ mru = nla_get_u16(a[OVS_PACKET_ATTR_MRU]);
+ packet->ignore_df = 1;
+ }
+ OVS_CB(packet)->mru = mru;
+
/* Build an sw_flow for sending this packet. */
flow = ovs_flow_alloc();
err = PTR_ERR(flow);
@@ -575,7 +606,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
if (err)
goto err_flow_free;

- err = ovs_nla_copy_actions(a[OVS_PACKET_ATTR_ACTIONS],
+ err = ovs_nla_copy_actions(net, a[OVS_PACKET_ATTR_ACTIONS],
&flow->key, &acts, log);
if (err)
goto err_flow_free;
@@ -598,6 +629,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
if (!input_vport)
goto err_unlock;

+ packet->dev = input_vport->dev;
OVS_CB(packet)->input_vport = input_vport;
sf_acts = rcu_dereference(flow->sf_acts);

@@ -624,6 +656,7 @@ static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
[OVS_PACKET_ATTR_KEY] = { .type = NLA_NESTED },
[OVS_PACKET_ATTR_ACTIONS] = { .type = NLA_NESTED },
[OVS_PACKET_ATTR_PROBE] = { .type = NLA_FLAG },
+ [OVS_PACKET_ATTR_MRU] = { .type = NLA_U16 },
};

static const struct genl_ops dp_packet_genl_ops[] = {
@@ -880,6 +913,7 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,

static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
{
+ struct net *net = sock_net(skb->sk);
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = info->userhdr;
struct sw_flow *flow = NULL, *new_flow;
@@ -929,8 +963,8 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
goto err_kfree_flow;

/* Validate actions. */
- error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS], &new_flow->key,
- &acts, log);
+ error = ovs_nla_copy_actions(net, a[OVS_FLOW_ATTR_ACTIONS],
+ &new_flow->key, &acts, log);
if (error) {
OVS_NLERR(log, "Flow actions may not be safe on all matching packets.");
goto err_kfree_flow;
@@ -1038,7 +1072,8 @@ error:
}

/* Factor out action copy to avoid "Wframe-larger-than=1024" warning. */
-static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
+static struct sw_flow_actions *get_flow_actions(struct net *net,
+ const struct nlattr *a,
const struct sw_flow_key *key,
const struct sw_flow_mask *mask,
bool log)
@@ -1048,7 +1083,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
int error;

ovs_flow_mask_key(&masked_key, key, mask);
- error = ovs_nla_copy_actions(a, &masked_key, &acts, log);
+ error = ovs_nla_copy_actions(net, a, &masked_key, &acts, log);
if (error) {
OVS_NLERR(log,
"Actions may not be safe on all matching packets");
@@ -1060,6 +1095,7 @@ static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,

static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
{
+ struct net *net = sock_net(skb->sk);
struct nlattr **a = info->attrs;
struct ovs_header *ovs_header = info->userhdr;
struct sw_flow_key key;
@@ -1091,8 +1127,8 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)

/* Validate actions. */
if (a[OVS_FLOW_ATTR_ACTIONS]) {
- acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, &mask,
- log);
+ acts = get_flow_actions(net, a[OVS_FLOW_ATTR_ACTIONS], &key,
+ &mask, log);
if (IS_ERR(acts)) {
error = PTR_ERR(acts);
goto error;
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 487a85f..fc808a2 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -27,6 +27,7 @@
#include <linux/u64_stats_sync.h>
#include <net/ip_tunnels.h>

+#include "conntrack.h"
#include "flow.h"
#include "flow_table.h"
#include "vport.h"
@@ -97,10 +98,13 @@ struct datapath {
* NULL if the packet is not being tunneled.
* @input_vport: The original vport packet came in on. This value is cached
* when a packet is received by OVS.
+ * @mru: The maximum received fragment size; 0 if the packet is not
+ * fragmented.
*/
struct ovs_skb_cb {
struct ip_tunnel_info *egress_tun_info;
struct vport *input_vport;
+ unsigned int mru;
};
#define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)

@@ -113,6 +117,7 @@ struct ovs_skb_cb {
* then no packet is sent and the packet is accounted in the datapath's @n_lost
* counter.
* @egress_tun_info: If nonnull, becomes %OVS_PACKET_ATTR_EGRESS_TUN_KEY.
+ * @mru: If nonzero, the maximum received IP fragment size.
*/
struct dp_upcall_info {
const struct ip_tunnel_info *egress_tun_info;
@@ -121,6 +126,7 @@ struct dp_upcall_info {
int actions_len;
u32 portid;
u8 cmd;
+ unsigned int mru;
};

/**
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 8db22ef..131b807 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -49,6 +49,7 @@
#include "datapath.h"
#include "flow.h"
#include "flow_netlink.h"
+#include "conntrack.h"

u64 ovs_flow_used_time(unsigned long flow_jiffies)
{
@@ -707,6 +708,8 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
key->phy.priority = skb->priority;
key->phy.in_port = OVS_CB(skb)->input_vport->port_no;
key->phy.skb_mark = skb->mark;
+ key->ct.state = ovs_ct_get_state(skb);
+ key->ct.zone = ovs_ct_get_zone(skb);
key->ovs_flow_hash = 0;
key->recirc_id = 0;

diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 082a87b..312c7d7 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -111,6 +111,12 @@ struct sw_flow_key {
} nd;
} ipv6;
};
+ struct {
+ /* Connection tracking fields. */
+ u16 zone;
+ u8 state;
+ } ct;
+
} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */

struct sw_flow_key_range {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d536fb7..4eeaa5a 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -50,6 +50,7 @@
#include <net/vxlan.h>

#include "flow_netlink.h"
+#include "conntrack.h"

struct ovs_len_tbl {
int len;
@@ -281,7 +282,7 @@ size_t ovs_key_attr_size(void)
/* Whenever adding new OVS_KEY_ FIELDS, we should consider
* updating this function.
*/
- BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 22);
+ BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 24);

return nla_total_size(4) /* OVS_KEY_ATTR_PRIORITY */
+ nla_total_size(0) /* OVS_KEY_ATTR_TUNNEL */
@@ -290,6 +291,8 @@ size_t ovs_key_attr_size(void)
+ nla_total_size(4) /* OVS_KEY_ATTR_SKB_MARK */
+ nla_total_size(4) /* OVS_KEY_ATTR_DP_HASH */
+ nla_total_size(4) /* OVS_KEY_ATTR_RECIRC_ID */
+ + nla_total_size(1) /* OVS_KEY_ATTR_CT_STATE */
+ + nla_total_size(2) /* OVS_KEY_ATTR_CT_ZONE */
+ nla_total_size(12) /* OVS_KEY_ATTR_ETHERNET */
+ nla_total_size(2) /* OVS_KEY_ATTR_ETHERTYPE */
+ nla_total_size(4) /* OVS_KEY_ATTR_VLAN */
@@ -339,6 +342,8 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_TUNNEL] = { .len = OVS_ATTR_NESTED,
.next = ovs_tunnel_key_lens, },
[OVS_KEY_ATTR_MPLS] = { .len = sizeof(struct ovs_key_mpls) },
+ [OVS_KEY_ATTR_CT_STATE] = { .len = sizeof(u8) },
+ [OVS_KEY_ATTR_CT_ZONE] = { .len = sizeof(u16) },
};

static bool is_all_zero(const u8 *fp, size_t size)
@@ -768,6 +773,21 @@ static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs,
return -EINVAL;
*attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL);
}
+
+ if (*attrs & (1 << OVS_KEY_ATTR_CT_STATE) &&
+ ovs_ct_verify(OVS_KEY_ATTR_CT_STATE)) {
+ uint8_t ct_state = nla_get_u8(a[OVS_KEY_ATTR_CT_STATE]);
+
+ SW_FLOW_KEY_PUT(match, ct.state, ct_state, is_mask);
+ *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_STATE);
+ }
+ if (*attrs & (1 << OVS_KEY_ATTR_CT_ZONE) &&
+ ovs_ct_verify(OVS_KEY_ATTR_CT_ZONE)) {
+ uint16_t ct_zone = nla_get_u16(a[OVS_KEY_ATTR_CT_ZONE]);
+
+ SW_FLOW_KEY_PUT(match, ct.zone, ct_zone, is_mask);
+ *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ZONE);
+ }
return 0;
}

@@ -1266,6 +1286,7 @@ int ovs_nla_get_flow_metadata(const struct nlattr *attr,
memset(&match, 0, sizeof(match));
match.key = key;

+ memset(&key->ct, 0, sizeof(key->ct));
key->phy.in_port = DP_MAX_PORTS;

return metadata_from_nlattrs(&match, &attrs, a, false, log);
@@ -1314,6 +1335,12 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
if (nla_put_u32(skb, OVS_KEY_ATTR_SKB_MARK, output->phy.skb_mark))
goto nla_put_failure;

+ if (nla_put_u8(skb, OVS_KEY_ATTR_CT_STATE, output->ct.state))
+ goto nla_put_failure;
+
+ if (nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone))
+ goto nla_put_failure;
+
nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
if (!nla)
goto nla_put_failure;
@@ -1575,6 +1602,9 @@ void ovs_nla_free_flow_actions(struct sw_flow_actions *sf_acts)
case OVS_ACTION_ATTR_SET:
ovs_nla_free_set_action(a);
break;
+ case OVS_ACTION_ATTR_CT:
+ ovs_ct_free_action(a);
+ break;
}
}

@@ -1647,8 +1677,8 @@ static struct nlattr *__add_action(struct sw_flow_actions **sfa,
return a;
}

-static int add_action(struct sw_flow_actions **sfa, int attrtype,
- void *data, int len, bool log)
+int ovs_nla_add_action(struct sw_flow_actions **sfa, int attrtype, void *data,
+ int len, bool log)
{
struct nlattr *a;

@@ -1663,7 +1693,7 @@ static inline int add_nested_action_start(struct sw_flow_actions **sfa,
int used = (*sfa)->actions_len;
int err;

- err = add_action(sfa, attrtype, NULL, 0, log);
+ err = ovs_nla_add_action(sfa, attrtype, NULL, 0, log);
if (err)
return err;

@@ -1679,12 +1709,12 @@ static inline void add_nested_action_end(struct sw_flow_actions *sfa,
a->nla_len = sfa->actions_len - st_offset;
}

-static int __ovs_nla_copy_actions(const struct nlattr *attr,
+static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
const struct sw_flow_key *key,
int depth, struct sw_flow_actions **sfa,
__be16 eth_type, __be16 vlan_tci, bool log);

-static int validate_and_copy_sample(const struct nlattr *attr,
+static int validate_and_copy_sample(struct net *net, const struct nlattr *attr,
const struct sw_flow_key *key, int depth,
struct sw_flow_actions **sfa,
__be16 eth_type, __be16 vlan_tci, bool log)
@@ -1716,15 +1746,15 @@ static int validate_and_copy_sample(const struct nlattr *attr,
start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SAMPLE, log);
if (start < 0)
return start;
- err = add_action(sfa, OVS_SAMPLE_ATTR_PROBABILITY,
- nla_data(probability), sizeof(u32), log);
+ err = ovs_nla_add_action(sfa, OVS_SAMPLE_ATTR_PROBABILITY,
+ nla_data(probability), sizeof(u32), log);
if (err)
return err;
st_acts = add_nested_action_start(sfa, OVS_SAMPLE_ATTR_ACTIONS, log);
if (st_acts < 0)
return st_acts;

- err = __ovs_nla_copy_actions(actions, key, depth + 1, sfa,
+ err = __ovs_nla_copy_actions(net, actions, key, depth + 1, sfa,
eth_type, vlan_tci, log);
if (err)
return err;
@@ -2058,7 +2088,7 @@ static int copy_action(const struct nlattr *from,
return 0;
}

-static int __ovs_nla_copy_actions(const struct nlattr *attr,
+static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
const struct sw_flow_key *key,
int depth, struct sw_flow_actions **sfa,
__be16 eth_type, __be16 vlan_tci, bool log)
@@ -2082,7 +2112,8 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
[OVS_ACTION_ATTR_SET] = (u32)-1,
[OVS_ACTION_ATTR_SET_MASKED] = (u32)-1,
[OVS_ACTION_ATTR_SAMPLE] = (u32)-1,
- [OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash)
+ [OVS_ACTION_ATTR_HASH] = sizeof(struct ovs_action_hash),
+ [OVS_ACTION_ATTR_CT] = (u32)-1,
};
const struct ovs_action_push_vlan *vlan;
int type = nla_type(a);
@@ -2189,13 +2220,20 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
break;

case OVS_ACTION_ATTR_SAMPLE:
- err = validate_and_copy_sample(a, key, depth, sfa,
+ err = validate_and_copy_sample(net, a, key, depth, sfa,
eth_type, vlan_tci, log);
if (err)
return err;
skip_copy = true;
break;

+ case OVS_ACTION_ATTR_CT:
+ err = ovs_ct_copy_action(net, a, key, sfa, log);
+ if (err)
+ return err;
+ skip_copy = true;
+ break;
+
default:
OVS_NLERR(log, "Unknown Action type %d", type);
return -EINVAL;
@@ -2214,7 +2252,7 @@ static int __ovs_nla_copy_actions(const struct nlattr *attr,
}

/* 'key' must be the masked key. */
-int ovs_nla_copy_actions(const struct nlattr *attr,
+int ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
const struct sw_flow_key *key,
struct sw_flow_actions **sfa, bool log)
{
@@ -2224,7 +2262,7 @@ int ovs_nla_copy_actions(const struct nlattr *attr,
if (IS_ERR(*sfa))
return PTR_ERR(*sfa);

- err = __ovs_nla_copy_actions(attr, key, 0, sfa, key->eth.type,
+ err = __ovs_nla_copy_actions(net, attr, key, 0, sfa, key->eth.type,
key->eth.tci, log);
if (err)
ovs_nla_free_flow_actions(*sfa);
@@ -2349,6 +2387,13 @@ int ovs_nla_put_actions(const struct nlattr *attr, int len, struct sk_buff *skb)
if (err)
return err;
break;
+
+ case OVS_ACTION_ATTR_CT:
+ err = ovs_ct_action_to_attr(nla_data(a), skb);
+ if (err)
+ return err;
+ break;
+
default:
if (nla_put(skb, type, nla_len(a), nla_data(a)))
return -EMSGSIZE;
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index acd0744..c0b484b 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -62,9 +62,11 @@ int ovs_nla_get_identifier(struct sw_flow_id *sfid, const struct nlattr *ufid,
const struct sw_flow_key *key, bool log);
u32 ovs_nla_get_ufid_flags(const struct nlattr *attr);

-int ovs_nla_copy_actions(const struct nlattr *attr,
+int ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
const struct sw_flow_key *key,
struct sw_flow_actions **sfa, bool log);
+int ovs_nla_add_action(struct sw_flow_actions **sfa, int attrtype,
+ void *data, int len, bool log);
int ovs_nla_put_actions(const struct nlattr *attr,
int len, struct sk_buff *skb);

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index baa018f..8a63df6 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -487,6 +487,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,

OVS_CB(skb)->input_vport = vport;
OVS_CB(skb)->egress_tun_info = NULL;
+ OVS_CB(skb)->mru = 0;
/* Extract flow from 'skb' into 'key'. */
error = ovs_flow_key_extract(tun_info, skb, &key);
if (unlikely(error)) {
--
2.1.4
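
For readers following the netlink validation in parse_ct() above, the table-driven length check can be sketched as a standalone userspace program. This is an illustration only, not code from the series: every demo_* name is hypothetical, and the sketch range-checks the attribute type before it indexes the length table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute numbering, loosely modelled on OVS_CT_ATTR_*. */
enum demo_ct_attr {
	DEMO_CT_ATTR_UNSPEC,
	DEMO_CT_ATTR_FLAGS,	/* u32 */
	DEMO_CT_ATTR_ZONE,	/* u16 */
	DEMO_CT_ATTR_COUNT
};
#define DEMO_CT_ATTR_MAX (DEMO_CT_ATTR_COUNT - 1)

struct demo_len_tbl {
	size_t minlen;
	size_t maxlen;
};

/* One entry per attribute type; unlisted types stay zero-initialized. */
static const struct demo_len_tbl demo_attr_lens[DEMO_CT_ATTR_MAX + 1] = {
	[DEMO_CT_ATTR_FLAGS] = { .minlen = sizeof(uint32_t),
				 .maxlen = sizeof(uint32_t) },
	[DEMO_CT_ATTR_ZONE]  = { .minlen = sizeof(uint16_t),
				 .maxlen = sizeof(uint16_t) },
};

/* Returns 0 if (type, len) is acceptable, -EINVAL otherwise. */
static int demo_check_attr(int type, size_t len)
{
	const size_t nr = sizeof(demo_attr_lens) / sizeof(demo_attr_lens[0]);

	/* Reject out-of-range types before touching the table. */
	if (type <= DEMO_CT_ATTR_UNSPEC || type > DEMO_CT_ATTR_MAX)
		return -EINVAL;
	assert((size_t)type < nr);

	if (len < demo_attr_lens[type].minlen ||
	    len > demo_attr_lens[type].maxlen)
		return -EINVAL;

	return 0;
}
```

Under these assumptions, demo_check_attr(DEMO_CT_ATTR_ZONE, 2) returns 0, while a 4-byte zone or a type beyond DEMO_CT_ATTR_MAX yields -EINVAL.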

2015-07-30 18:14:34

by Joe Stringer

Subject: [PATCH net-next 6/9] openvswitch: Allow matching on conntrack mark

From: Justin Pettit <[email protected]>

Allow matching and setting the conntrack mark field. As with conntrack
state and zone, the mark is populated by executing the ct() action. Unlike
state and zone, ct_mark is also writable: the set_field() action may be
used to modify the mark, which takes effect on the most recent conntrack
entry.

E.g.: actions:ct(zone=0),ct(zone=1),set_field(1->ct_mark)

This will perform conntrack lookup in zone 0, then lookup in zone 1,
then modify the mark for the entry in zone 1. The mark for the entry in
zone 0 is unchanged. The conntrack entry itself must be committed using
the "commit" flag in the conntrack action flags for this change to persist.
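
As a rough illustration of the masked write described above, the following userspace sketch updates only the bits selected by the mask and leaves the rest of the mark alone. It is not the kernel code: struct demo_conn and demo_set_mark() are hypothetical names. ovs_ct_set_mark() in this patch computes ct_mark | (ct->mark & ~mask), which presumably relies on the value arriving pre-masked; the sketch applies the mask explicitly instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a conntrack entry; only the mark is modelled. */
struct demo_conn {
	uint32_t mark;
};

/*
 * Update only the bits of ct->mark selected by 'mask', preserving the
 * others. Returns true if the stored mark changed, which is the point at
 * which a real implementation would emit a change event.
 */
static bool demo_set_mark(struct demo_conn *ct, uint32_t value, uint32_t mask)
{
	uint32_t new_mark;

	assert(ct != NULL);

	new_mark = (value & mask) | (ct->mark & ~mask);
	if (new_mark == ct->mark)
		return false;

	ct->mark = new_mark;
	return true;
}
```

With a mark of 0xAABBCCDD, writing value 0x11 under mask 0xFF leaves 0xAABBCC11; repeating the same write changes nothing and reports false.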

Signed-off-by: Justin Pettit <[email protected]>
Signed-off-by: Joe Stringer <[email protected]>
---
include/uapi/linux/openvswitch.h | 1 +
net/openvswitch/actions.c | 6 ++++++
net/openvswitch/conntrack.c | 40 ++++++++++++++++++++++++++++++++++++++++
net/openvswitch/conntrack.h | 14 ++++++++++++++
net/openvswitch/flow.c | 1 +
net/openvswitch/flow.h | 1 +
net/openvswitch/flow_netlink.c | 15 ++++++++++++++-
7 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 1dae30a..207788c 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -325,6 +325,7 @@ enum ovs_key_attr {
* the accepted length of the array. */
OVS_KEY_ATTR_CT_STATE, /* u8 bitmask of OVS_CS_F_* */
OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */
+ OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */

#ifdef __KERNEL__
OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 4a62ed4..77b01f5 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -944,6 +944,12 @@ static int execute_masked_set_action(struct sk_buff *skb,
err = set_mpls(skb, flow_key, nla_data(a), get_mask(a,
__be32 *));
break;
+
+ case OVS_KEY_ATTR_CT_MARK:
+ err = ovs_ct_set_mark(skb, flow_key, nla_get_u32(a),
+ *get_mask(a, u32 *));
+ break;
+
}

return err;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 284b89e..6dc68dc 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -114,6 +114,15 @@ u16 ovs_ct_get_zone(const struct sk_buff *skb)
return ct ? nf_ct_zone(ct) : NF_CT_DEFAULT_ZONE;
}

+u32 ovs_ct_get_mark(const struct sk_buff *skb)
+{
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ return ct ? ct->mark : 0;
+}
+
static bool __ovs_ct_state_valid(u8 state)
{
return (state && !(state & OVS_CS_F_INVALID));
@@ -207,6 +216,7 @@ static void __ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
{
key->ct.state = state;
key->ct.zone = zone;
+ key->ct.mark = ovs_ct_get_mark(skb);
}

static void ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
@@ -340,6 +350,32 @@ int ovs_ct_execute(struct sk_buff *skb, struct sw_flow_key *key,
return err;
}

+int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
+ u32 ct_mark, u32 mask)
+{
+#ifdef CONFIG_NF_CONNTRACK_MARK
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
+ u32 new_mark;
+
+ /* This must happen directly after lookup/commit. */
+ ct = nf_ct_get(skb, &ctinfo);
+ if (!ct)
+ return -EINVAL;
+
+ new_mark = ct_mark | (ct->mark & ~(mask));
+ if (ct->mark != new_mark) {
+ ct->mark = new_mark;
+ nf_conntrack_event_cache(IPCT_MARK, ct);
+ key->ct.mark = new_mark;
+ }
+
+ return 0;
+#else
+ return -ENOTSUPP;
+#endif
+}
+
static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
[OVS_CT_ATTR_FLAGS] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
@@ -403,6 +439,10 @@ bool ovs_ct_verify(enum ovs_key_attr attr)
if (attr == OVS_KEY_ATTR_CT_ZONE)
return true;
#endif
+#ifdef CONFIG_NF_CONNTRACK_MARK
+ if (attr == OVS_KEY_ATTR_CT_MARK)
+ return true;
+#endif

return false;
}
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 7a01751..03a1ec5 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -31,6 +31,9 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *, struct sk_buff *);
int ovs_ct_execute(struct sk_buff *, struct sw_flow_key *,
const struct ovs_conntrack_info *);

+int ovs_ct_set_mark(struct sk_buff *, struct sw_flow_key *, u32 ct_mark,
+ u32 mask);
+u32 ovs_ct_get_mark(const struct sk_buff *skb);
u8 ovs_ct_get_state(const struct sk_buff *skb);
u16 ovs_ct_get_zone(const struct sk_buff *skb);
bool ovs_ct_state_valid(const struct sw_flow_key *key);
@@ -72,11 +75,22 @@ static inline u16 ovs_ct_get_zone(const struct sk_buff *skb)
return 0;
}

+static inline u32 ovs_ct_get_mark(const struct sk_buff *skb)
+{
+ return 0;
+}
+
static inline bool ovs_ct_state_valid(const struct sw_flow_key *key)
{
return false;
}

+static inline int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
+ u32 ct_mark, u32 mask)
+{
+ return -ENOTSUPP;
+}
+
static inline void ovs_ct_free_action(const struct nlattr *a) { }
#endif
#endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 131b807..05ce284 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -710,6 +710,7 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
key->phy.skb_mark = skb->mark;
key->ct.state = ovs_ct_get_state(skb);
key->ct.zone = ovs_ct_get_zone(skb);
+ key->ct.mark = ovs_ct_get_mark(skb);
key->ovs_flow_hash = 0;
key->recirc_id = 0;

diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 312c7d7..e05e697 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -114,6 +114,7 @@ struct sw_flow_key {
struct {
/* Connection tracking fields. */
u16 zone;
+ u32 mark;
u8 state;
} ct;

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 4eeaa5a..90e80a6 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -282,7 +282,7 @@ size_t ovs_key_attr_size(void)
/* Whenever adding new OVS_KEY_ FIELDS, we should consider
* updating this function.
*/
- BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 24);
+ BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 25);

return nla_total_size(4) /* OVS_KEY_ATTR_PRIORITY */
+ nla_total_size(0) /* OVS_KEY_ATTR_TUNNEL */
@@ -293,6 +293,7 @@ size_t ovs_key_attr_size(void)
+ nla_total_size(4) /* OVS_KEY_ATTR_RECIRC_ID */
+ nla_total_size(1) /* OVS_KEY_ATTR_CT_STATE */
+ nla_total_size(2) /* OVS_KEY_ATTR_CT_ZONE */
+ + nla_total_size(4) /* OVS_KEY_ATTR_CT_MARK */
+ nla_total_size(12) /* OVS_KEY_ATTR_ETHERNET */
+ nla_total_size(2) /* OVS_KEY_ATTR_ETHERTYPE */
+ nla_total_size(4) /* OVS_KEY_ATTR_VLAN */
@@ -344,6 +345,7 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_MPLS] = { .len = sizeof(struct ovs_key_mpls) },
[OVS_KEY_ATTR_CT_STATE] = { .len = sizeof(u8) },
[OVS_KEY_ATTR_CT_ZONE] = { .len = sizeof(u16) },
+ [OVS_KEY_ATTR_CT_MARK] = { .len = sizeof(u32) },
};

static bool is_all_zero(const u8 *fp, size_t size)
@@ -788,6 +790,13 @@ static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs,
SW_FLOW_KEY_PUT(match, ct.zone, ct_zone, is_mask);
*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_ZONE);
}
+ if (*attrs & (1 << OVS_KEY_ATTR_CT_MARK) &&
+ ovs_ct_verify(OVS_KEY_ATTR_CT_MARK)) {
+ uint32_t mark = nla_get_u32(a[OVS_KEY_ATTR_CT_MARK]);
+
+ SW_FLOW_KEY_PUT(match, ct.mark, mark, is_mask);
+ *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_MARK);
+ }
return 0;
}

@@ -1341,6 +1350,9 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
if (nla_put_u16(skb, OVS_KEY_ATTR_CT_ZONE, output->ct.zone))
goto nla_put_failure;

+ if (nla_put_u32(skb, OVS_KEY_ATTR_CT_MARK, output->ct.mark))
+ goto nla_put_failure;
+
nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
if (!nla)
goto nla_put_failure;
@@ -1923,6 +1935,7 @@ static int validate_set(const struct nlattr *a,

case OVS_KEY_ATTR_PRIORITY:
case OVS_KEY_ATTR_SKB_MARK:
+ case OVS_KEY_ATTR_CT_MARK:
case OVS_KEY_ATTR_ETHERNET:
break;

--
2.1.4

2015-07-30 18:14:33

by Joe Stringer

Subject: [PATCH net-next 7/9] netfilter: Always export nf_connlabels_replace()

The following patches will reuse this code from OVS.

Signed-off-by: Joe Stringer <[email protected]>
---
net/netfilter/nf_conntrack_labels.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_labels.c b/net/netfilter/nf_conntrack_labels.c
index bb53f12..daa7c13 100644
--- a/net/netfilter/nf_conntrack_labels.c
+++ b/net/netfilter/nf_conntrack_labels.c
@@ -48,7 +48,6 @@ int nf_connlabel_set(struct nf_conn *ct, u16 bit)
}
EXPORT_SYMBOL_GPL(nf_connlabel_set);

-#if IS_ENABLED(CONFIG_NF_CT_NETLINK)
static void replace_u32(u32 *address, u32 mask, u32 new)
{
u32 old, tmp;
@@ -89,7 +88,6 @@ int nf_connlabels_replace(struct nf_conn *ct,
return 0;
}
EXPORT_SYMBOL_GPL(nf_connlabels_replace);
-#endif

static struct nf_ct_ext_type labels_extend __read_mostly = {
.len = sizeof(struct nf_conn_labels),
--
2.1.4

2015-07-30 18:14:03

by Joe Stringer

Subject: [PATCH net-next 8/9] openvswitch: Allow matching on conntrack label

Allow matching and setting the conntrack label field. As with ct_mark,
this is populated by executing the ct() action, and is a writable field.
The set_field() action may be used to modify the label, which will take
effect on the most recent conntrack entry.

E.g.: actions:ct(zone=1),set_field(1->ct_label)

This will perform conntrack lookup in zone 1, then modify the label for
that entry. The conntrack entry itself must be committed using the
"commit" flag in the conntrack action flags for this change to persist.

Signed-off-by: Joe Stringer <[email protected]>
---
include/uapi/linux/openvswitch.h | 6 ++
net/openvswitch/actions.c | 4 ++
net/openvswitch/conntrack.c | 133 +++++++++++++++++++++++++++++++++++++++
net/openvswitch/conntrack.h | 32 ++++++++++
net/openvswitch/datapath.c | 6 ++
net/openvswitch/datapath.h | 2 +
net/openvswitch/flow.c | 1 +
net/openvswitch/flow.h | 1 +
net/openvswitch/flow_netlink.c | 18 +++++-
9 files changed, 202 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 207788c..f360dc9 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -326,6 +326,7 @@ enum ovs_key_attr {
OVS_KEY_ATTR_CT_STATE, /* u8 bitmask of OVS_CS_F_* */
OVS_KEY_ATTR_CT_ZONE, /* u16 connection tracking zone. */
OVS_KEY_ATTR_CT_MARK, /* u32 connection tracking mark */
+ OVS_KEY_ATTR_CT_LABEL, /* 16-octet connection tracking label */

#ifdef __KERNEL__
OVS_KEY_ATTR_TUNNEL_INFO, /* struct ip_tunnel_info */
@@ -438,6 +439,11 @@ struct ovs_key_nd {
__u8 nd_tll[ETH_ALEN];
};

+#define OVS_CT_LABEL_LEN 16
+struct ovs_key_ct_label {
+ __u8 ct_label[OVS_CT_LABEL_LEN];
+};
+
/* OVS_KEY_ATTR_CT_STATE flags */
#define OVS_CS_F_NEW 0x01 /* Beginning of a new connection. */
#define OVS_CS_F_ESTABLISHED 0x02 /* Part of an existing connection. */
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 77b01f5..0d5a72a 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -950,6 +950,10 @@ static int execute_masked_set_action(struct sk_buff *skb,
*get_mask(a, u32 *));
break;

+ case OVS_KEY_ATTR_CT_LABEL:
+ err = ovs_ct_set_label(skb, flow_key, nla_data(a),
+ get_mask(a, struct ovs_key_ct_label *));
+ break;
}

return err;
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 6dc68dc..5acc59a 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -15,6 +15,7 @@
#include <linux/openvswitch.h>
#include <net/ip.h>
#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_labels.h>
#include <net/netfilter/nf_conntrack_zones.h>
#include <net/netfilter/ipv6/nf_defrag_ipv6.h>

@@ -123,6 +124,30 @@ u32 ovs_ct_get_mark(const struct sk_buff *skb)
return ct ? ct->mark : 0;
}

+void ovs_ct_get_label(const struct sk_buff *skb,
+ struct ovs_key_ct_label *label)
+{
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn_labels *cl = NULL;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (ct)
+ cl = nf_ct_labels_find(ct);
+
+ if (cl) {
+ size_t len = cl->words * sizeof(long);
+
+ if (len > OVS_CT_LABEL_LEN)
+ len = OVS_CT_LABEL_LEN;
+ else if (len < OVS_CT_LABEL_LEN)
+ memset(label, 0, OVS_CT_LABEL_LEN);
+ memcpy(label, cl->bits, len);
+ } else {
+ memset(label, 0, OVS_CT_LABEL_LEN);
+ }
+}
+
static bool __ovs_ct_state_valid(u8 state)
{
return (state && !(state & OVS_CS_F_INVALID));
@@ -217,6 +242,7 @@ static void __ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
key->ct.state = state;
key->ct.zone = zone;
key->ct.mark = ovs_ct_get_mark(skb);
+ ovs_ct_get_label(skb, &key->ct.label);
}

static void ovs_ct_update_key(struct sk_buff *skb, struct sw_flow_key *key,
@@ -376,6 +402,41 @@ int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
#endif
}

+int ovs_ct_set_label(struct sk_buff *skb, struct sw_flow_key *key,
+ const struct ovs_key_ct_label *label,
+ const struct ovs_key_ct_label *mask)
+{
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn_labels *cl;
+ struct nf_conn *ct;
+ int err;
+
+ /* This must happen directly after lookup/commit. */
+ ct = nf_ct_get(skb, &ctinfo);
+ if (!ct)
+ return -EINVAL;
+
+ cl = nf_ct_labels_find(ct);
+ if (!cl) {
+ nf_ct_labels_ext_add(ct);
+ cl = nf_ct_labels_find(ct);
+ }
+ if (!cl || cl->words * sizeof(long) < OVS_CT_LABEL_LEN)
+ return -ENOSPC;
+
+ err = nf_connlabels_replace(ct, (u32 *)label, (u32 *)mask,
+ OVS_CT_LABEL_LEN / sizeof(u32));
+ if (err)
+ return err;
+
+ ovs_ct_get_label(skb, &key->ct.label);
+ return 0;
+#else
+ return -ENOTSUPP;
+#endif
+}
+
static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
[OVS_CT_ATTR_FLAGS] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
@@ -443,6 +504,10 @@ bool ovs_ct_verify(enum ovs_key_attr attr)
if (attr & OVS_KEY_ATTR_CT_MARK)
return true;
#endif
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+ if (attr & OVS_KEY_ATTR_CT_LABEL)
+ return true;
+#endif

return false;
}
@@ -518,3 +583,71 @@ void ovs_ct_free_action(const struct nlattr *a)
if (ct_info->ct)
nf_ct_put(ct_info->ct);
}
+
+/* Load connlabel and ensure it supports 128-bit labels */
+static struct xt_match *load_connlabel(struct net *net)
+{
+#ifdef CONFIG_NF_CONNTRACK_LABELS
+ struct xt_match *match;
+ struct xt_mtchk_param mtpar;
+ struct xt_connlabel_mtinfo info;
+ int err = -EINVAL;
+
+ match = xt_request_find_match(NFPROTO_UNSPEC, "connlabel", 0);
+ if (IS_ERR(match)) {
+ match = NULL;
+ goto exit;
+ }
+
+ info.bit = sizeof(struct ovs_key_ct_label) * 8 - 1;
+ info.options = 0;
+
+ mtpar.net = net;
+ mtpar.table = match->table;
+ mtpar.entryinfo = NULL;
+ mtpar.match = match;
+ mtpar.matchinfo = &info;
+ mtpar.hook_mask = BIT(NF_INET_PRE_ROUTING);
+ mtpar.family = NFPROTO_IPV4;
+
+ err = xt_check_match(&mtpar, XT_ALIGN(match->matchsize), match->proto,
+ 0);
+ if (err)
+ goto exit;
+
+ return match;
+
+exit:
+ OVS_NLERR(true, "Failed to set connlabel length");
+ if (match)
+ module_put(match->me);
+#endif
+ return NULL;
+}
+
+void ovs_ct_init(struct net *net, struct ovs_ct_perdp_data *data)
+{
+ data->xt_v4 = !nf_ct_l3proto_try_module_get(PF_INET);
+ data->xt_v6 = !nf_ct_l3proto_try_module_get(PF_INET6);
+ data->xt_label = load_connlabel(net);
+}
+
+void ovs_ct_exit(struct net *net, struct ovs_ct_perdp_data *data)
+{
+ if (data->xt_v4)
+ nf_ct_l3proto_module_put(PF_INET);
+ if (data->xt_v6)
+ nf_ct_l3proto_module_put(PF_INET6);
+ if (data->xt_label) {
+ const struct xt_match *match = data->xt_label;
+ struct xt_mtdtor_param mtd;
+
+ mtd.net = net;
+ mtd.match = match;
+ mtd.matchinfo = NULL;
+ mtd.family = NFPROTO_IPV4;
+
+ module_put(match->me);
+ mtd.match->destroy(&mtd);
+ }
+}
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 03a1ec5..e85375e 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -14,6 +14,7 @@
#ifndef OVS_CONNTRACK_H
#define OVS_CONNTRACK_H 1

+struct xt_match;
struct ovs_net;
struct sw_flow_key;
struct sw_flow_actions;
@@ -21,7 +22,15 @@ struct ovs_conntrack_info;
struct ovs_key_ct_label;
enum ovs_key_attr;

+struct ovs_ct_perdp_data {
+ bool xt_v4;
+ bool xt_v6;
+ struct xt_match *xt_label;
+};
+
#if defined(CONFIG_OPENVSWITCH_CONNTRACK)
+void ovs_ct_init(struct net *, struct ovs_ct_perdp_data *data);
+void ovs_ct_exit(struct net *, struct ovs_ct_perdp_data *data);
bool ovs_ct_verify(enum ovs_key_attr attr);
int ovs_ct_copy_action(struct net *, const struct nlattr *,
const struct sw_flow_key *, struct sw_flow_actions **,
@@ -34,6 +43,11 @@ int ovs_ct_execute(struct sk_buff *, struct sw_flow_key *,
int ovs_ct_set_mark(struct sk_buff *, struct sw_flow_key *, u32 ct_mark,
u32 mask);
u32 ovs_ct_get_mark(const struct sk_buff *skb);
+void ovs_ct_get_label(const struct sk_buff *skb,
+ struct ovs_key_ct_label *label);
+int ovs_ct_set_label(struct sk_buff *, struct sw_flow_key *,
+ const struct ovs_key_ct_label *label,
+ const struct ovs_key_ct_label *mask);
u8 ovs_ct_get_state(const struct sk_buff *skb);
u16 ovs_ct_get_zone(const struct sk_buff *skb);
bool ovs_ct_state_valid(const struct sw_flow_key *key);
@@ -41,6 +55,14 @@ void ovs_ct_free_action(const struct nlattr *a);
#else
#include <linux/errno.h>

+static inline void ovs_ct_init(struct net *net, struct ovs_ct_perdp_data *data)
+{
+}
+
+static inline void ovs_ct_exit(struct net *net, struct ovs_ct_perdp_data *data)
+{
+}
+
static inline bool ovs_ct_verify(int attr)
{
return false;
@@ -91,6 +113,16 @@ static inline int ovs_ct_set_mark(struct sk_buff *skb, struct sw_flow_key *key,
return -ENOTSUPP;
}

+static inline void ovs_ct_get_label(const struct sk_buff *skb,
+ struct ovs_key_ct_label *label) { }
+static inline int ovs_ct_set_label(struct sk_buff *skb,
+ struct sw_flow_key *key,
+ const struct ovs_key_ct_label *label,
+ const struct ovs_key_ct_label *mask)
+{
+ return -ENOTSUPP;
+}
+
static inline void ovs_ct_free_action(const struct nlattr *a) { }
#endif
#endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 23717a3..1d1d675 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1583,6 +1583,9 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)

ovs_dp_change(dp, a);

+ /* Set up conntrack dependencies. */
+ ovs_ct_init(read_pnet(&dp->net), &dp->ct);
+
/* So far only local changes have been made, now need the lock. */
ovs_lock();

@@ -1619,6 +1622,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
err_destroy_ports_array:
ovs_unlock();
kfree(dp->ports);
+ ovs_ct_exit(read_pnet(&dp->net), &dp->ct);
err_destroy_percpu:
free_percpu(dp->stats_percpu);
err_destroy_table:
@@ -1652,6 +1656,8 @@ static void __dp_destroy(struct datapath *dp)
*/
ovs_dp_detach_port(ovs_vport_ovsl(dp, OVSP_LOCAL));

+ ovs_ct_exit(read_pnet(&dp->net), &dp->ct);
+
/* RCU destroy the flow table */
call_rcu(&dp->rcu, destroy_dp_rcu);
}
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index fc808a2..fd8d146 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -90,6 +90,8 @@ struct datapath {
possible_net_t net;

u32 user_features;
+
+ struct ovs_ct_perdp_data ct;
};

/**
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 05ce284..301eb41 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -711,6 +711,7 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
key->ct.state = ovs_ct_get_state(skb);
key->ct.zone = ovs_ct_get_zone(skb);
key->ct.mark = ovs_ct_get_mark(skb);
+ ovs_ct_get_label(skb, &key->ct.label);
key->ovs_flow_hash = 0;
key->recirc_id = 0;

diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index e05e697..c57994b 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -116,6 +116,7 @@ struct sw_flow_key {
u16 zone;
u32 mark;
u8 state;
+ struct ovs_key_ct_label label;
} ct;

} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 90e80a6..69ab7af 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -282,7 +282,7 @@ size_t ovs_key_attr_size(void)
/* Whenever adding new OVS_KEY_ FIELDS, we should consider
* updating this function.
*/
- BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 25);
+ BUILD_BUG_ON(OVS_KEY_ATTR_TUNNEL_INFO != 26);

return nla_total_size(4) /* OVS_KEY_ATTR_PRIORITY */
+ nla_total_size(0) /* OVS_KEY_ATTR_TUNNEL */
@@ -294,6 +294,7 @@ size_t ovs_key_attr_size(void)
+ nla_total_size(1) /* OVS_KEY_ATTR_CT_STATE */
+ nla_total_size(2) /* OVS_KEY_ATTR_CT_ZONE */
+ nla_total_size(4) /* OVS_KEY_ATTR_CT_MARK */
+ + nla_total_size(16) /* OVS_KEY_ATTR_CT_LABEL */
+ nla_total_size(12) /* OVS_KEY_ATTR_ETHERNET */
+ nla_total_size(2) /* OVS_KEY_ATTR_ETHERTYPE */
+ nla_total_size(4) /* OVS_KEY_ATTR_VLAN */
@@ -346,6 +347,7 @@ static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
[OVS_KEY_ATTR_CT_STATE] = { .len = sizeof(u8) },
[OVS_KEY_ATTR_CT_ZONE] = { .len = sizeof(u16) },
[OVS_KEY_ATTR_CT_MARK] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_CT_LABEL] = { .len = sizeof(struct ovs_key_ct_label) },
};

static bool is_all_zero(const u8 *fp, size_t size)
@@ -797,6 +799,15 @@ static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs,
SW_FLOW_KEY_PUT(match, ct.mark, mark, is_mask);
*attrs &= ~(1ULL << OVS_KEY_ATTR_CT_MARK);
}
+ if (*attrs & (1 << OVS_KEY_ATTR_CT_LABEL) &&
+ ovs_ct_verify(OVS_KEY_ATTR_CT_LABEL)) {
+ const struct ovs_key_ct_label *cl;
+
+ cl = nla_data(a[OVS_KEY_ATTR_CT_LABEL]);
+ SW_FLOW_KEY_MEMCPY(match, ct.label, cl->ct_label,
+ sizeof(*cl), is_mask);
+ *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_LABEL);
+ }
return 0;
}

@@ -1353,6 +1364,10 @@ static int __ovs_nla_put_key(const struct sw_flow_key *swkey,
if (nla_put_u32(skb, OVS_KEY_ATTR_CT_MARK, output->ct.mark))
goto nla_put_failure;

+ if (nla_put(skb, OVS_KEY_ATTR_CT_LABEL,
+ sizeof(output->ct.label), &output->ct.label))
+ goto nla_put_failure;
+
nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key));
if (!nla)
goto nla_put_failure;
@@ -1936,6 +1951,7 @@ static int validate_set(const struct nlattr *a,
case OVS_KEY_ATTR_PRIORITY:
case OVS_KEY_ATTR_SKB_MARK:
case OVS_KEY_ATTR_CT_MARK:
+ case OVS_KEY_ATTR_CT_LABEL:
case OVS_KEY_ATTR_ETHERNET:
break;

--
2.1.4
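
The masked label update described in the commit message above can be modelled in userspace. This is only a sketch under stated assumptions: `struct ct_label` and `ct_label_masked_set()` are hypothetical stand-ins for `struct ovs_key_ct_label` and the kernel path, and the update rule assumed here is the usual masked-set one (bits selected by the mask come from the new value, all other bits keep their old contents). It does not reproduce `nf_connlabels_replace()` itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CT_LABEL_LEN 16 /* mirrors OVS_CT_LABEL_LEN from the patch */

/* Hypothetical userspace stand-in for struct ovs_key_ct_label. */
struct ct_label {
	uint8_t bytes[CT_LABEL_LEN];
};

/* Masked update (assumed semantics): for every bit set in 'mask', take the
 * bit from 'value'; for every clear bit, keep the bit already in 'cur'.
 */
static void ct_label_masked_set(struct ct_label *cur,
				const struct ct_label *value,
				const struct ct_label *mask)
{
	size_t i;

	assert(cur && value && mask);
	for (i = 0; i < CT_LABEL_LEN; i++)
		cur->bytes[i] = (uint8_t)((cur->bytes[i] & ~mask->bytes[i]) |
					  (value->bytes[i] & mask->bytes[i]));
}
```

With a mask of all ones this degenerates to a plain overwrite of the 16-byte label; with a mask of all zeros the label is left untouched.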

2015-07-30 18:13:34

by Joe Stringer

Subject: [PATCH net-next 9/9] openvswitch: Allow attaching helpers to ct action

Add support for using conntrack helpers to assist protocol detection.
The new OVS_CT_ATTR_HELPER attribute of the ct action specifies a helper
to be used for this connection.

Example ODP flows allowing FTP connections from ports 1->2:
in_port=1,tcp,action=ct(helper=ftp,commit),2
in_port=2,tcp,ct_state=-trk,action=ct(),recirc(1)
recirc_id=1,in_port=2,tcp,ct_state=+trk-new+est,action=1
recirc_id=1,in_port=2,tcp,ct_state=+trk+rel,action=1

Signed-off-by: Joe Stringer <[email protected]>
---
include/uapi/linux/openvswitch.h | 1 +
net/openvswitch/Kconfig | 1 +
net/openvswitch/conntrack.c | 109 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index f360dc9..e816170 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -626,6 +626,7 @@ enum ovs_ct_attr {
OVS_CT_ATTR_UNSPEC,
OVS_CT_ATTR_FLAGS, /* u8 bitmask of OVS_CT_F_*. */
OVS_CT_ATTR_ZONE, /* u16 zone id. */
+ OVS_CT_ATTR_HELPER,
__OVS_CT_ATTR_MAX
};

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 92bb3d3..c25b221 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -36,6 +36,7 @@ config OPENVSWITCH_CONNTRACK
bool "Open vSwitch conntrack action support"
depends on OPENVSWITCH
depends on NF_CONNTRACK
+ depends on NETFILTER_XTABLES
default OPENVSWITCH
---help---
If you say Y here, then Open vSwitch module will be able to pass
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 5acc59a..e7a5ca7 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -15,6 +15,7 @@
#include <linux/openvswitch.h>
#include <net/ip.h>
#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_helper.h>
#include <net/netfilter/nf_conntrack_labels.h>
#include <net/netfilter/nf_conntrack_zones.h>
#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
@@ -30,6 +31,7 @@ struct ovs_ct_len_tbl {
};

struct ovs_conntrack_info {
+ struct nf_conntrack_helper *helper;
struct nf_conn *ct;
u32 flags;
u16 zone;
@@ -158,6 +160,51 @@ bool ovs_ct_state_valid(const struct sw_flow_key *key)
return __ovs_ct_state_valid(key->ct.state);
}

+/* 'skb' should already be pulled to nh_ofs. */
+static int ovs_ct_helper(struct sk_buff *skb, u16 proto)
+{
+ const struct nf_conntrack_helper *helper;
+ const struct nf_conn_help *help;
+ enum ip_conntrack_info ctinfo;
+ unsigned int protoff;
+ struct nf_conn *ct;
+
+ ct = nf_ct_get(skb, &ctinfo);
+ if (!ct || ctinfo == IP_CT_RELATED_REPLY)
+ return NF_ACCEPT;
+
+ help = nfct_help(ct);
+ if (!help)
+ return NF_ACCEPT;
+
+ helper = rcu_dereference(help->helper);
+ if (!helper)
+ return NF_ACCEPT;
+
+ switch (proto) {
+ case NFPROTO_IPV4:
+ protoff = ip_hdrlen(skb);
+ break;
+ case NFPROTO_IPV6: {
+ u8 nexthdr = ipv6_hdr(skb)->nexthdr;
+ __be16 frag_off;
+
+ protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr),
+ &nexthdr, &frag_off);
+ if (protoff < 0 || (frag_off & htons(~0x7)) != 0) {
+ pr_debug("proto header not found\n");
+ return NF_ACCEPT;
+ }
+ break;
+ }
+ default:
+ WARN_ONCE(1, "helper invoked on non-IP family!");
+ return NF_DROP;
+ }
+
+ return helper->help(skb, protoff, ct, ctinfo);
+}
+
static int handle_fragments(struct net *net, struct sw_flow_key *key,
u16 zone, struct sk_buff *skb)
{
@@ -232,6 +279,13 @@ static bool skb_nfct_cached(const struct net *net, const struct sk_buff *skb,
}
if (info->zone != nf_ct_zone(ct))
return false;
+ if (info->helper) {
+ struct nf_conn_help *help;
+
+ help = nf_ct_ext_find(ct, NF_CT_EXT_HELPER);
+ if (help && help->helper != info->helper)
+ return false;
+ }

return true;
}
@@ -288,6 +342,11 @@ static int __ovs_ct_lookup(struct net *net, const struct sw_flow_key *key,
if (nf_conntrack_in(net, info->family, NF_INET_PRE_ROUTING,
skb) != NF_ACCEPT)
return -ENOENT;
+
+ if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
+ WARN_ONCE(1, "helper rejected packet");
+ return -EINVAL;
+ }
}

return 0;
@@ -437,15 +496,41 @@ int ovs_ct_set_label(struct sk_buff *skb, struct sw_flow_key *key,
#endif
}

+static int ovs_ct_add_helper(struct ovs_conntrack_info *info, const char *name,
+ const struct sw_flow_key *key, bool log)
+{
+ struct nf_conntrack_helper *helper;
+ struct nf_conn_help *help;
+
+ helper = nf_conntrack_helper_try_module_get(name, info->family,
+ key->ip.proto);
+ if (!helper) {
+ OVS_NLERR(log, "Unknown helper \"%s\"", name);
+ return -ENOENT;
+ }
+
+ help = nf_ct_helper_ext_add(info->ct, helper, GFP_KERNEL);
+ if (!help) {
+ module_put(helper->me);
+ return -ENOMEM;
+ }
+
+ help->helper = helper;
+ info->helper = helper;
+ return 0;
+}
+
static const struct ovs_ct_len_tbl ovs_ct_attr_lens[OVS_CT_ATTR_MAX + 1] = {
[OVS_CT_ATTR_FLAGS] = { .minlen = sizeof(u32),
.maxlen = sizeof(u32) },
[OVS_CT_ATTR_ZONE] = { .minlen = sizeof(u16),
.maxlen = sizeof(u16) },
+ [OVS_CT_ATTR_HELPER] = { .minlen = 1,
+ .maxlen = NF_CT_HELPER_NAME_LEN }
};

static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
- bool log)
+ const char **helper, bool log)
{
struct nlattr *a;
int rem;
@@ -477,6 +562,13 @@ static int parse_ct(const struct nlattr *attr, struct ovs_conntrack_info *info,
case OVS_CT_ATTR_FLAGS:
info->flags = nla_get_u32(a);
break;
+ case OVS_CT_ATTR_HELPER:
+ *helper = nla_data(a);
+ if (!memchr(*helper, '\0', nla_len(a))) {
+ OVS_NLERR(log, "Invalid conntrack helper");
+ return -EINVAL;
+ }
+ break;
default:
OVS_NLERR(log, "Unknown conntrack attr (%d)",
type);
@@ -518,6 +610,7 @@ int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
{
struct ovs_conntrack_info ct_info;
struct nf_conntrack_tuple t;
+ const char *helper = NULL;
u16 family;
int err;

@@ -530,7 +623,7 @@ int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
memset(&ct_info, 0, sizeof(ct_info));
ct_info.family = family;

- err = parse_ct(attr, &ct_info, log);
+ err = parse_ct(attr, &ct_info, &helper, log);
if (err)
return err;

@@ -542,6 +635,11 @@ int ovs_ct_copy_action(struct net *net, const struct nlattr *attr,
OVS_NLERR(log, "Failed to allocate conntrack template");
return PTR_ERR(ct_info.ct);
}
+ if (helper) {
+ err = ovs_ct_add_helper(&ct_info, helper, key, log);
+ if (err)
+ goto err_free_ct;
+ }

err = ovs_nla_add_action(sfa, OVS_ACTION_ATTR_CT, &ct_info,
sizeof(ct_info), log);
@@ -570,6 +668,11 @@ int ovs_ct_action_to_attr(const struct ovs_conntrack_info *ct_info,
if (nla_put_u16(skb, OVS_CT_ATTR_ZONE, ct_info->zone))
return -EMSGSIZE;
#endif
+ if (ct_info->helper) {
+ if (nla_put_string(skb, OVS_CT_ATTR_HELPER,
+ ct_info->helper->name))
+ return -EMSGSIZE;
+ }

nla_nest_end(skb, start);

@@ -580,6 +683,8 @@ void ovs_ct_free_action(const struct nlattr *a)
{
struct ovs_conntrack_info *ct_info = nla_data(a);

+ if (ct_info->helper)
+ module_put(ct_info->helper->me);
if (ct_info->ct)
nf_ct_put(ct_info->ct);
}
--
2.1.4
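
The validation that parse_ct() applies to OVS_CT_ATTR_HELPER (a length window plus a NUL terminator somewhere inside the payload) can be sketched in userspace as follows. `helper_name_valid()` and `HELPER_NAME_LEN` are hypothetical names; the value 16 is assumed to match NF_CT_HELPER_NAME_LEN and should be checked against the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HELPER_NAME_LEN 16 /* assumption: same value as NF_CT_HELPER_NAME_LEN */

/* Accept a helper-name payload only if its length is within
 * [1, HELPER_NAME_LEN] and it contains a terminating NUL, so that it can be
 * treated as a C string afterwards without reading past the attribute.
 */
static bool helper_name_valid(const char *payload, size_t len)
{
	assert(payload != NULL || len == 0);
	if (len < 1 || len > HELPER_NAME_LEN)
		return false;
	return memchr(payload, '\0', len) != NULL;
}
```

For example, the four bytes "ftp\0" pass, while the three bytes "ftp" without a terminator are rejected.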

2015-07-30 18:40:15

by Thomas Graf

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 07/30/15 at 11:12am, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <[email protected]>

Can you write a few lines on why this is needed? I have flows which
use the mark to communicate with netfilter through internal ports.

2015-07-30 19:35:38

by Thomas Graf

Subject: Re: [PATCH net-next 2/9] openvswitch: Serialize acts with original netlink len

On 07/30/15 at 11:12am, Joe Stringer wrote:
> Previously, we used the kernel-internal netlink actions length to
> calculate the size of messages to serialize back to userspace.
> However, the sw_flow_actions may not be formatted exactly the same as the
> actions on the wire, so store the original actions length when
> de-serializing and re-use the original length when serializing.
>
> Signed-off-by: Joe Stringer <[email protected]>

Acked-by: Thomas Graf <[email protected]>

2015-07-30 19:36:40

by Thomas Graf

Subject: Re: [PATCH net-next 3/9] openvswitch: Move MASKED* macros to datapath.h

On 07/30/15 at 11:12am, Joe Stringer wrote:
> This will allow the ovs-conntrack code to reuse these macros.
>
> Signed-off-by: Joe Stringer <[email protected]>

Acked-by: Thomas Graf <[email protected]>

2015-07-30 19:37:03

by Thomas Graf

Subject: Re: [PATCH net-next 4/9] ipv6: Export nf_ct_frag6_gather()

On 07/30/15 at 11:12am, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <[email protected]>

Acked-by: Thomas Graf <[email protected]>

2015-07-30 23:17:11

by Joe Stringer

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 30 July 2015 at 11:40, Thomas Graf <[email protected]> wrote:
> On 07/30/15 at 11:12am, Joe Stringer wrote:
>> Signed-off-by: Joe Stringer <[email protected]>
>
> Can you write a few lines on why this is needed? I have flows which
> use the mark to communicate with netfilter through internal ports.

The problem I was seeing is when packets come from a different
namespace on the localhost, they still have conntrack data associated.
This doesn't make sense, so the intention is to perform nf_reset().
However, it seems like we should actually be doing a bit more - at
least the skb_dst_drop() and perhaps some of the other stuff in
skb_scrub_packet().

Do you want to retain the mark when transitioning between namespaces?

Perhaps something like the below incremental would be sufficient:

diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 8a63df6..82844e6 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -475,7 +475,9 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
struct sw_flow_key key;
int error;

- if (!skb->sk || (sock_net(skb->sk) != read_pnet(&vport->dp->net)))
+ if (!skb->sk)
+ skb_scrub_packet(skb, false);
+ else if (sock_net(skb->sk) != read_pnet(&vport->dp->net))
skb_scrub_packet(skb, true);

stats = this_cpu_ptr(vport->percpu_stats);
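
To make the difference between the two calls in the incremental diff concrete, here is a small userspace model. It assumes that skb_scrub_packet() always drops the dst and the conntrack association, and clears the mark only when its second argument (xnet) is true; that should be verified against net/core/skbuff.c. `struct mock_skb` and both functions are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields discussed in this thread. */
struct mock_skb {
	uint32_t mark;
	bool has_conntrack;
	bool has_dst;
};

/* Assumed behaviour of skb_scrub_packet(skb, xnet). */
static void mock_scrub_packet(struct mock_skb *skb, bool xnet)
{
	assert(skb);
	skb->has_conntrack = false;	/* models nf_reset() */
	skb->has_dst = false;		/* models skb_dst_drop() */
	if (!xnet)
		return;
	skb->mark = 0;			/* only cleared when crossing netns */
}

/* Mirrors the branch structure of the incremental diff above. */
static void mock_vport_receive_scrub(struct mock_skb *skb, bool has_sock,
				     bool same_netns)
{
	if (!has_sock)
		mock_scrub_packet(skb, false);
	else if (!same_netns)
		mock_scrub_packet(skb, true);
}
```

Under these assumptions a packet without a socket keeps its mark but loses conntrack state and dst, and a packet whose socket belongs to another namespace loses the mark as well.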

2015-07-31 03:43:25

by Pravin Shelar

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On Thu, Jul 30, 2015 at 11:12 AM, Joe Stringer <[email protected]> wrote:
> Signed-off-by: Joe Stringer <[email protected]>
> ---
> net/openvswitch/vport.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index d14f594..baa018f 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -475,6 +475,9 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
> struct sw_flow_key key;
> int error;
>
> + if (!skb->sk || (sock_net(skb->sk) != read_pnet(&vport->dp->net)))
> + skb_scrub_packet(skb, true);
> +
skb scrub drops the dst, which causes a use-after-free bug for flow-based
tunnel devices.


> stats = this_cpu_ptr(vport->percpu_stats);
> u64_stats_update_begin(&stats->syncp);
> stats->rx_packets++;
> --
> 2.1.4
>

2015-07-31 07:38:14

by Thomas Graf

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 07/30/15 at 04:16pm, Joe Stringer wrote:
> On 30 July 2015 at 11:40, Thomas Graf <[email protected]> wrote:
> > On 07/30/15 at 11:12am, Joe Stringer wrote:
> >> Signed-off-by: Joe Stringer <[email protected]>
> >
> > Can you write a few lines on why this is needed? I have flows which
> > use the mark to communicate with netfilter through internal ports.
>
> The problem I was seeing is when packets come from a different
> namespace on the localhost, they still have conntrack data associated.
> This doesn't make sense, so the intention is to perform nf_reset().
> However, it seems like we should actually be doing a bit more - at
> least the skb_dst_drop() and perhaps some of the other stuff in
> skb_scrub_packet().
>
> Do you want to retain the mark when transitioning between namespaces?

Since we have retained it so far I think we should keep on doing
that. I'm pretty sure there are users of it out there besides me.
As you know, it's common to have tap devices in between OVS and the
guest in OpenStack and install netfilter rules there.

As for whether we should scrub it in between namespaces: probably yes,
but it's definitely tremendously useful to be able to transfer some
metadata (mark and dst metadata) between namespaces. The default
behaviour should probably be to scrub it with a flag to keep it. If
that flag is not set and nsid of port != bridge then we scrub the mark
and other metadata.

2015-07-31 13:20:10

by Florian Westphal

Subject: Re: [PATCH net-next 8/9] openvswitch: Allow matching on conntrack label

Joe Stringer <[email protected]> wrote:
> Allow matching and setting the conntrack label field. As with ct_mark,
> this is populated by executing the ct() action, and is a writable field.
> The set_field() action may be used to modify the label, which will take
> effect on the most recent conntrack entry.
>
> E.g.: actions:ct(zone=1),set_field(1->ct_label)
>
> This will perform conntrack lookup in zone 1, then modify the label for
> that entry. The conntrack entry itself must be committed using the
> "commit" flag in the conntrack action flags for this change to persist.
>
> Signed-off-by: Joe Stringer <[email protected]>

> +/* Load connlabel and ensure it supports 128-bit labels */
> +static struct xt_match *load_connlabel(struct net *net)
> +{
> +#ifdef CONFIG_NF_CONNTRACK_LABELS
> + struct xt_match *match;
> + struct xt_mtchk_param mtpar;
> + struct xt_connlabel_mtinfo info;
> + int err = -EINVAL;
> +
> + match = xt_request_find_match(NFPROTO_UNSPEC, "connlabel", 0);
> + if (IS_ERR(match)) {
> + match = NULL;
> + goto exit;
> + }
> +
> + info.bit = sizeof(struct ovs_key_ct_label) * 8 - 1;
> + info.options = 0;
> +
> + mtpar.net = net;
> + mtpar.table = match->table;
> + mtpar.entryinfo = NULL;
> + mtpar.match = match;
> + mtpar.matchinfo = &info;
> + mtpar.hook_mask = BIT(NF_INET_PRE_ROUTING);
> + mtpar.family = NFPROTO_IPV4;
> +
> + err = xt_check_match(&mtpar, XT_ALIGN(match->matchsize), match->proto,
> + 0);

Yummy :-)

Rather than adding a dependency on xtables I think a better option would
be to move the

par->net->ct.labels_used++;
words = BITS_TO_LONGS(info->bit+1);
if (words > par->net->ct.label_words)
par->net->ct.label_words = words;

parts from the checkentry/destroy hooks of xt_connlabel into
nf_conntrack_labels.c so that you don't need this mtpar stunt above
anymore (and I'd like to add ctlabel set support for nft at one point
so I'd also need to move that out of xt_label).

You can move that out of this series and submit that to nf-devel as
separate patch if you want.
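
A userspace sketch of what such a shared helper pair might look like is below. The names `mock_connlabels_get()` and `mock_connlabels_put()` and the `struct mock_netns_ct` type are hypothetical; the "get" side mirrors the snippet quoted above, and the "put" side assumes the destroy hook simply drops the user count and resets the word count once nobody uses labels any more.

```c
#include <assert.h>
#include <limits.h>

#define MOCK_BITS_PER_LONG (sizeof(long) * CHAR_BIT)
#define MOCK_BITS_TO_LONGS(n) \
	(((n) + MOCK_BITS_PER_LONG - 1) / MOCK_BITS_PER_LONG)

/* Hypothetical stand-in for the per-netns label bookkeeping (net->ct). */
struct mock_netns_ct {
	unsigned int labels_used;
	unsigned int label_words;
};

/* Register one label user that needs bits 0..highest_bit. */
static void mock_connlabels_get(struct mock_netns_ct *ct,
				unsigned int highest_bit)
{
	unsigned int words = (unsigned int)MOCK_BITS_TO_LONGS(highest_bit + 1);

	ct->labels_used++;
	if (words > ct->label_words)
		ct->label_words = words;
}

/* Drop one label user; assumed to reset the size when the last one goes. */
static void mock_connlabels_put(struct mock_netns_ct *ct)
{
	assert(ct->labels_used > 0);
	ct->labels_used--;
	if (ct->labels_used == 0)
		ct->label_words = 0;
}
```

With helpers of this shape, OVS would request bit 127 at datapath creation and release it at teardown, without constructing an xt_mtchk_param by hand.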

> + ovs_ct_verify(OVS_KEY_ATTR_CT_LABEL)) {
> + const struct ovs_key_ct_label *cl;
> +
> + cl = nla_data(a[OVS_KEY_ATTR_CT_LABEL]);
> + SW_FLOW_KEY_MEMCPY(match, ct.label, cl->ct_label,
> + sizeof(*cl), is_mask);
> + *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_LABEL);
> + }

So you're using labels as arbitrary 128 bit identifier, right?

Nothing wrong with that, just asking.

2015-07-31 14:35:02

by Hannes Frederic Sowa

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
> Signed-off-by: Joe Stringer <[email protected]>
> ---
> net/openvswitch/vport.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index d14f594..baa018f 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -475,6 +475,9 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
> struct sw_flow_key key;
> int error;
>
> + if (!skb->sk || (sock_net(skb->sk) != read_pnet(&vport->dp->net)))
> + skb_scrub_packet(skb, true);
> +
> stats = this_cpu_ptr(vport->percpu_stats);
> u64_stats_update_begin(&stats->syncp);
> stats->rx_packets++;

In general, this shouldn't be necessary as packets should already be
scrubbed before they arrive here.

Could you maybe add a WARN_ON and check how those skbs with conntrack
data traverse the stack? I also didn't understand why this is made
dependent on the socket.

Thanks,
Hannes

2015-07-31 14:52:48

by Hannes Frederic Sowa

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

Hi,

On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index e50678d..4a62ed4 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -22,6 +22,7 @@
> #include <linux/in.h>
> #include <linux/ip.h>
> #include <linux/openvswitch.h>
> +#include <linux/netfilter_ipv6.h>
> #include <linux/sctp.h>
> #include <linux/tcp.h>
> #include <linux/udp.h>
> @@ -29,6 +30,7 @@
> #include <linux/if_arp.h>
> #include <linux/if_vlan.h>
>
> +#include <net/dst.h>
> #include <net/ip.h>
> #include <net/ipv6.h>
> #include <net/checksum.h>
> @@ -38,6 +40,7 @@
>
> #include "datapath.h"
> #include "flow.h"
> +#include "conntrack.h"
> #include "vport.h"
>
> static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> @@ -52,6 +55,16 @@ struct deferred_action {
> struct sw_flow_key pkt_key;
> };
>
> +struct ovs_frag_data {
> + struct dst_entry *dst;

As this is a temporary storage area for skb data, we could simply use an
unsigned long here, and then we wouldn't need to force a reference on the
dst_entry in ovs_vport_output.

> + struct vport *vport;
> + struct sw_flow_key *key;
> + struct ovs_skb_cb cb;
> + __be16 vlan_proto;
> +};
> +
> +static DEFINE_PER_CPU(struct ovs_frag_data, ovs_frag_data_storage);
> +
> #define DEFERRED_ACTION_FIFO_SIZE 10
> struct action_fifo {
> int head;
> @@ -594,14 +607,136 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
> return 0;
> }
>
> -static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
> +/* Given an IP frame, reconstruct its MAC header. */
> +static void ovs_setup_l2_header(struct sk_buff *skb,
> + const struct ovs_frag_data *data)
> +{
> + struct sw_flow_key *key = data->key;
> +
> + skb_push(skb, ETH_HLEN);
> + skb_reset_mac_header(skb);
> +
> + ether_addr_copy(eth_hdr(skb)->h_source, key->eth.src);
> + ether_addr_copy(eth_hdr(skb)->h_dest, key->eth.dst);
> + eth_hdr(skb)->h_proto = key->eth.type;
> +
> + if ((data->key->eth.tci & htons(VLAN_TAG_PRESENT)) &&
> + !skb_vlan_tag_present(skb))
> + __vlan_hwaccel_put_tag(skb, data->vlan_proto,
> + ntohs(key->eth.tci));
> +}
> +
> +static void prepare_frag(struct vport *vport, struct sw_flow_key *key,
> + struct sk_buff *skb)
> +{
> + unsigned int hlen = ETH_HLEN;
> + struct ovs_frag_data *data;
> +
> + data = this_cpu_ptr(&ovs_frag_data_storage);
> + data->dst = skb_dst(skb);


If data->dst is unsigned long, we could simply use an assignment:

data->dst = skb->_skb_refdst;

At this point we never leave the rcu_read_lock section, so we are safe;
maybe we can add a comment to that effect.
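
The idea of stashing the raw dst word can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: the struct
layouts and the frag_save_dst()/frag_restore_dst() helpers are hypothetical
stand-ins, and the restore step is modelled on how the kernel's
__skb_dst_copy() is understood to behave (it only takes a reference when the
saved word does not carry the NOREF bit). It also assumes a pointer fits in
an unsigned long, as it does in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; names mirror the kernel but the types are simplified. */
#define SKB_DST_NOREF   1UL
#define SKB_DST_PTRMASK (~SKB_DST_NOREF)

struct dst_entry { int refcnt; };
struct sk_buff   { unsigned long _skb_refdst; };

/* Hypothetical stand-in for the per-CPU scratch area; only dst is modelled. */
struct frag_data { unsigned long dst; };

static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
{
	/* The low bit is a flag, so mask it off to recover the pointer. */
	return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK);
}

static inline bool skb_dst_is_noref(const struct sk_buff *skb)
{
	return (skb->_skb_refdst & SKB_DST_NOREF) && skb_dst(skb);
}

/* Once per frame: pointer and NOREF bit are saved together, no refcount change. */
static inline void frag_save_dst(struct frag_data *data,
				 const struct sk_buff *skb)
{
	data->dst = skb->_skb_refdst;
}

/* Once per fragment: restore the word; take a reference only if it is counted. */
static inline void frag_restore_dst(struct sk_buff *frag,
				    const struct frag_data *data)
{
	frag->_skb_refdst = data->dst;
	if (!(data->dst & SKB_DST_NOREF) && skb_dst(frag))
		skb_dst(frag)->refcnt++;
}
```

With a noref dst, as is the case while staying inside the RCU read-side
section, saving and restoring leaves the reference count untouched.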

> + data->vport = vport;
> + data->key = key;
> + data->cb = *OVS_CB(skb);
> +
> + if (key->eth.tci & htons(VLAN_TAG_PRESENT)) {
> + if (skb_vlan_tag_present(skb)) {
> + data->vlan_proto = skb->vlan_proto;
> + } else {
> + data->vlan_proto = vlan_eth_hdr(skb)->h_vlan_proto;
> + hlen += VLAN_HLEN;
> + }
> + }
> +
> + memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> + skb_pull(skb, hlen);
> +}
> +
> +static int ovs_vport_output(struct sock *sock, struct sk_buff *skb)
> +{
> + struct ovs_frag_data *data = this_cpu_ptr(&ovs_frag_data_storage);
> + struct vport *vport = data->vport;
> +
> + skb_dst_drop(skb);
> + skb_dst_set(skb, dst_clone(data->dst));

We don't need to refcount the dst here, then.

> + *OVS_CB(skb) = data->cb;
> +
> + ovs_setup_l2_header(skb, data);
> + ovs_vport_send(vport, skb);
> +
> + return 0;
> +}
> +
> +unsigned int
> +ovs_dst_get_mtu(const struct dst_entry *dst)
> +{
> + return dst->dev->mtu;
> +}
> +
> +static struct dst_ops ovs_dst_ops = {
> + .family = AF_UNSPEC,
> + .mtu = ovs_dst_get_mtu,
> +};
> +
> +static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
> + struct sw_flow_key *key)
> {
> struct vport *vport = ovs_vport_rcu(dp, out_port);
>
> - if (likely(vport))
> - ovs_vport_send(vport, skb);
> - else
> + if (likely(vport)) {
> + unsigned int mru = OVS_CB(skb)->mru;
> + struct dst_entry *orig_dst = dst_clone(skb_dst(skb));
> +
> + if (!mru || (skb->len <= mru + ETH_HLEN)) {
> + ovs_vport_send(vport, skb);
> + } else if (!vport->dev) {
> + WARN_ONCE(1, "Cannot fragment packets to vport %s\n",
> + vport->ops->get_name(vport));
> + kfree_skb(skb);
> + } else if (mru > vport->dev->mtu) {
> + kfree_skb(skb);
> + } else if (key->eth.type == htons(ETH_P_IP)) {
> + struct dst_entry ovs_dst;
> +
> + prepare_frag(vport, key, skb);
> + dst_init(&ovs_dst, &ovs_dst_ops, vport->dev,
> + 1, DST_OBSOLETE_NONE, DST_NOCOUNT);

I don't think we should take a ref on the netdev here.

dst_init(&ovs_dst, &ovs_dst_ops, NULL,
1, DST_OBSOLETE_NONE, DST_NOCOUNT);
ovs_dst.dev = vport->dev;

> +
> + skb_dst_drop(skb);
> + skb_dst_set_noref(skb, &ovs_dst);
> + IPCB(skb)->frag_max_size = mru;
> +
> + ip_do_fragment(skb->sk, skb, ovs_vport_output);
> + dev_put(ovs_dst.dev);

Can be removed then.

It seems a little strange to leave the skb->dst attached to the skb but
drop the reference from the netdevice here. Maybe a comment would make
sense, otherwise it smells fishy.

> + } else if (key->eth.type == htons(ETH_P_IPV6)) {
> + const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
> + struct rt6_info ovs_rt;
> +
> + if (!v6ops) {
> + kfree_skb(skb);
> + goto exit;
> + }
> +
> + prepare_frag(vport, key, skb);
> + memset(&ovs_rt, 0, sizeof(ovs_rt));
> + dst_init(&ovs_rt.dst, &ovs_dst_ops, vport->dev,
> + 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
> +
> + skb_dst_drop(skb);
> + skb_dst_set_noref(skb, &ovs_rt.dst);
> + IP6CB(skb)->frag_max_size = mru;
> +
> + v6ops->fragment(skb->sk, skb, ovs_vport_output);
> + dev_put(ovs_rt.dst.dev);

Same thought applies here.

> + } else {
> + WARN_ONCE(1, "Failed fragment to %s: MRU=%d, MTU=%d.",
> + ovs_vport_name(vport), mru, vport->dev->mtu);
> + kfree_skb(skb);
> + }
> +exit:
> + dst_release(orig_dst);
> + } else {
> kfree_skb(skb);
> + }
> }
>
> static int output_userspace(struct datapath *dp, struct sk_buff *skb,
> @@ -615,6 +750,10 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
>
> memset(&upcall, 0, sizeof(upcall));
> upcall.cmd = OVS_PACKET_CMD_ACTION;
> + upcall.userdata = NULL;
> + upcall.portid = 0;
> + upcall.egress_tun_info = NULL;
> + upcall.mru = OVS_CB(skb)->mru;
>
> for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
> a = nla_next(a, &rem)) {
> @@ -874,7 +1013,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC);
>
> if (out_skb)
> - do_output(dp, out_skb, prev_port);
> + do_output(dp, out_skb, prev_port, key);
>
> prev_port = -1;
> }
> @@ -931,16 +1070,25 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> case OVS_ACTION_ATTR_SAMPLE:
> err = sample(dp, skb, key, a, attr, len);
> break;
> +
> + case OVS_ACTION_ATTR_CT:
> + err = ovs_ct_execute(skb, key, nla_data(a));
> + break;
> }
>
> if (unlikely(err)) {
> - kfree_skb(skb);
> + /* Hide stolen fragments from user space. */
> + if (err == -EINPROGRESS)
> + err = 0;
> + else
> + kfree_skb(skb);
> +
> return err;
> }
> }
>
> if (prev_port != -1)
> - do_output(dp, skb, prev_port);
> + do_output(dp, skb, prev_port, key);
> else
> consume_skb(skb);
>


Bye,
Hannes

2015-07-31 15:26:51

by Hannes Frederic Sowa

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
> +static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
> + struct sw_flow_key *key)
> {
> struct vport *vport = ovs_vport_rcu(dp, out_port);
>
> - if (likely(vport))
> - ovs_vport_send(vport, skb);
> - else
> + if (likely(vport)) {
> + unsigned int mru = OVS_CB(skb)->mru;
> + struct dst_entry *orig_dst = dst_clone(skb_dst(skb));

I think you forgot to remove this?

Bye,
Hannes

2015-07-31 17:51:33

by Joe Stringer

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 31 July 2015 at 07:34, Hannes Frederic Sowa <[email protected]> wrote:
> On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
>> Signed-off-by: Joe Stringer <[email protected]>
>> ---
>> net/openvswitch/vport.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
>> index d14f594..baa018f 100644
>> --- a/net/openvswitch/vport.c
>> +++ b/net/openvswitch/vport.c
>> @@ -475,6 +475,9 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
>> struct sw_flow_key key;
>> int error;
>>
>> + if (!skb->sk || (sock_net(skb->sk) != read_pnet(&vport->dp->net)))
>> + skb_scrub_packet(skb, true);
>> +
>> stats = this_cpu_ptr(vport->percpu_stats);
>> u64_stats_update_begin(&stats->syncp);
>> stats->rx_packets++;
>
> In general, this shouldn't be necessary as the packet should already be
> scrubbed before they arrive here.
>
> Could you maybe add a WARN_ON and check how those skbs with conntrack
> data traverse the stack? I also didn't understand why make it dependent
> on the socket.

OK, sure. One case I could think of is with an OVS internal port in
another namespace, directly attached to the bridge. I'll have a play
around with WARN_ON and see if I can come up with something more
trimmed down.

2015-07-31 18:35:42

by Joe Stringer

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

Thanks for the review,

On 31 July 2015 at 07:52, Hannes Frederic Sowa <[email protected]> wrote:
> On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
>> +static void prepare_frag(struct vport *vport, struct sw_flow_key *key,
>> + struct sk_buff *skb)
>> +{
>> + unsigned int hlen = ETH_HLEN;
>> + struct ovs_frag_data *data;
>> +
>> + data = this_cpu_ptr(&ovs_frag_data_storage);
>> + data->dst = skb_dst(skb);
>
>
> If data->dst is unsigned long, we could simply use an assignment:
>
> data->dst = skb->_skb_refdst;
>
> At this point we never leave rcu_read_lock section, so we are safe,
> maybe we can add a comment for that.

OK, it also may be helpful to highlight that prepare_frag() is done
once for an assembled frame, then ovs_vport_output() performs the
inverse for each fragment. I'll do this.

...
>> + } else if (key->eth.type == htons(ETH_P_IP)) {
>> + struct dst_entry ovs_dst;
>> +
>> + prepare_frag(vport, key, skb);
>> + dst_init(&ovs_dst, &ovs_dst_ops, vport->dev,
>> + 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
>
> I don't think we should take a ref on the netdev here.
>
> dst_init(&ovs_dst, &ovs_dst_ops, NULL,
> 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
> ovs_dst.dev = vport->dev;

Some of this was me being overly cautious: take a ref on the dev for
as long as the fragment dst exists; take a ref on the original (e.g.
tunnel_metadata) dst for the length of handling the output of this
frame.

>> +
>> + skb_dst_drop(skb);
>> + skb_dst_set_noref(skb, &ovs_dst);
>> + IPCB(skb)->frag_max_size = mru;
>> +
>> + ip_do_fragment(skb->sk, skb, ovs_vport_output);
>> + dev_put(ovs_dst.dev);
>
> Can be removed then.
>
> It seems a little strange to leave the skb->dst attached to the skb but
> drop the reference from the netdevice here. Maybe a comment would make
> sense, otherwise it smells fishy.

For each fragment, ovs_vport_output() will revert the changes made
here - restoring the original dst. Either way, if we're not taking a
ref on the netdev then this should be fine.

2015-07-31 20:14:35

by Joe Stringer

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

On 31 July 2015 at 08:26, Hannes Frederic Sowa <[email protected]> wrote:
> On Thu, 2015-07-30 at 11:12 -0700, Joe Stringer wrote:
>> +static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>> + struct sw_flow_key *key)
>> {
>> struct vport *vport = ovs_vport_rcu(dp, out_port);
>>
>> - if (likely(vport))
>> - ovs_vport_send(vport, skb);
>> - else
>> + if (likely(vport)) {
>> + unsigned int mru = OVS_CB(skb)->mru;
>> + struct dst_entry *orig_dst = dst_clone(skb_dst(skb));
>
> I think you forgot to remove this?

You're right that it's incorrect, however we do still need to ensure
that the original skb's reference to the orig_dst is released. I'll
tidy this up for v2.

2015-07-31 23:09:59

by Joe Stringer

Subject: Re: [PATCH net-next 8/9] openvswitch: Allow matching on conntrack label

On 31 July 2015 at 06:20, Florian Westphal <[email protected]> wrote:
> Joe Stringer <[email protected]> wrote:
>> +/* Load connlabel and ensure it supports 128-bit labels */
>> +static struct xt_match *load_connlabel(struct net *net)
>> +{
>> +#ifdef CONFIG_NF_CONNTRACK_LABELS
>> + struct xt_match *match;
>> + struct xt_mtchk_param mtpar;
>> + struct xt_connlabel_mtinfo info;
>> + int err = -EINVAL;
>> +
>> + match = xt_request_find_match(NFPROTO_UNSPEC, "connlabel", 0);
>> + if (IS_ERR(match)) {
>> + match = NULL;
>> + goto exit;
>> + }
>> +
>> + info.bit = sizeof(struct ovs_key_ct_label) * 8 - 1;
>> + info.options = 0;
>> +
>> + mtpar.net = net;
>> + mtpar.table = match->table;
>> + mtpar.entryinfo = NULL;
>> + mtpar.match = match;
>> + mtpar.matchinfo = &info;
>> + mtpar.hook_mask = BIT(NF_INET_PRE_ROUTING);
>> + mtpar.family = NFPROTO_IPV4;
>> +
>> + err = xt_check_match(&mtpar, XT_ALIGN(match->matchsize), match->proto,
>> + 0);
>
> Yummy :-)

You're very graceful :-)

> Rather than adding a dependency on xtables I think a better option would
> be to move the
>
> par->net->ct.labels_used++;
> words = BITS_TO_LONGS(info->bit+1);
> if (words > par->net->ct.label_words)
> par->net->ct.label_words = words;
>
> parts from the checkentry/destroy hooks of xt_connlabel into
> nf_conntrack_labels.c so that you don't need this mtpar stunt above
> anymore (and I'd like to add ctlabel set support for nft at one point
> so I'd also need to move that out of xt_label).
>
> You can move that out of this series and submit that to nf-devel as
> separate patch if you want.

Thanks for the suggestion, I'll send a patch and adjust this code in
v2 accordingly.

>> + ovs_ct_verify(OVS_KEY_ATTR_CT_LABEL)) {
>> + const struct ovs_key_ct_label *cl;
>> +
>> + cl = nla_data(a[OVS_KEY_ATTR_CT_LABEL]);
>> + SW_FLOW_KEY_MEMCPY(match, ct.label, cl->ct_label,
>> + sizeof(*cl), is_mask);
>> + *attrs &= ~(1ULL << OVS_KEY_ATTR_CT_LABEL);
>> + }
>
> So you're using labels as arbitrary 128 bit identifier, right?
>
> Nothing wrong with that, just asking.

Right, it's exposed as an arbitrarily maskable/settable field of 128
bits in length, as that's the maximum today. So it's effectively up to
userspace to use it as a bunch of 1-bit flags or N-bit fields within
the range of the 128 bits.
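
To make the "arbitrarily maskable/settable" semantics concrete, here is a
minimal userspace sketch. It is not the code from this series: struct
ct_label and both helpers are hypothetical names, and the only thing taken
from the discussion is the 128-bit width. A masked set changes only the bits
selected by the mask, and a masked match ignores everything outside it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CT_LABEL_LEN 16 /* 128 bits, the maximum mentioned above */

/* Hypothetical container for one label value or one mask. */
struct ct_label {
	uint8_t bytes[CT_LABEL_LEN];
};

/* Overwrite only the bits selected by mask; all other bits of cur survive. */
static inline void ct_label_set_masked(struct ct_label *cur,
				       const struct ct_label *value,
				       const struct ct_label *mask)
{
	for (size_t i = 0; i < CT_LABEL_LEN; i++)
		cur->bytes[i] = (uint8_t)((cur->bytes[i] & ~mask->bytes[i]) |
					  (value->bytes[i] & mask->bytes[i]));
}

/* Return 1 if label equals key on every bit selected by mask, else 0. */
static inline int ct_label_matches(const struct ct_label *label,
				   const struct ct_label *key,
				   const struct ct_label *mask)
{
	for (size_t i = 0; i < CT_LABEL_LEN; i++)
		if ((label->bytes[i] ^ key->bytes[i]) & mask->bytes[i])
			return 0;
	return 1;
}
```

Userspace could then treat, say, the low two bits as flags and the last byte
as an 8-bit field simply by choosing the masks accordingly.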

2015-08-01 02:08:10

by Pravin Shelar

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

On Thu, Jul 30, 2015 at 11:12 AM, Joe Stringer <[email protected]> wrote:
> Expose the kernel connection tracker via OVS. Userspace components can
> make use of the "ct()" action, followed by "recirculate", to populate
> the conntracking state in the OVS flow key, and subsequently match on
> that state.
>
> Example ODP flows allowing traffic from 1->2, only replies from 2->1:
> in_port=1,tcp,action=ct(commit,zone=1),2
> in_port=2,ct_state=-trk,tcp,action=ct(zone=1),recirc(1)
> recirc_id=1,in_port=2,ct_state=+trk+est-new,tcp,action=1
>
> IP fragments are handled by transparently assembling them as part of the
> ct action. The maximum received unit (MRU) size is tracked so that
> refragmentation can occur during output.
>
> IP frag handling contributed by Andy Zhou.
>
> Signed-off-by: Joe Stringer <[email protected]>
> Signed-off-by: Justin Pettit <[email protected]>
> Signed-off-by: Andy Zhou <[email protected]>
> ---
> This can be tested with the corresponding userspace component here:
> https://www.github.com/justinpettit/openvswitch conntrack
> ---
> include/uapi/linux/openvswitch.h | 41 ++++
> net/openvswitch/Kconfig | 11 +
> net/openvswitch/Makefile | 1 +
> net/openvswitch/actions.c | 162 ++++++++++++-
> net/openvswitch/conntrack.c | 480 +++++++++++++++++++++++++++++++++++++++
> net/openvswitch/conntrack.h | 82 +++++++
> net/openvswitch/datapath.c | 62 +++--
> net/openvswitch/datapath.h | 6 +
> net/openvswitch/flow.c | 3 +
> net/openvswitch/flow.h | 6 +
> net/openvswitch/flow_netlink.c | 73 ++++--
> net/openvswitch/flow_netlink.h | 4 +-
> net/openvswitch/vport.c | 1 +
> 13 files changed, 897 insertions(+), 35 deletions(-)
> create mode 100644 net/openvswitch/conntrack.c
> create mode 100644 net/openvswitch/conntrack.h
>
...

> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index e50678d..4a62ed4 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -22,6 +22,7 @@
> #include <linux/in.h>
> #include <linux/ip.h>

..
> static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> @@ -52,6 +55,16 @@ struct deferred_action {
> struct sw_flow_key pkt_key;
> };
>
> +struct ovs_frag_data {
> + struct dst_entry *dst;
> + struct vport *vport;
> + struct sw_flow_key *key;
> + struct ovs_skb_cb cb;
> + __be16 vlan_proto;
> +};
> +
> +static DEFINE_PER_CPU(struct ovs_frag_data, ovs_frag_data_storage);
> +
> #define DEFERRED_ACTION_FIFO_SIZE 10
> struct action_fifo {
> int head;
> @@ -594,14 +607,136 @@ static int set_sctp(struct sk_buff *skb, struct sw_flow_key *flow_key,
> return 0;
> }
>
> -static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port)
> +/* Given an IP frame, reconstruct its MAC header. */
> +static void ovs_setup_l2_header(struct sk_buff *skb,
> + const struct ovs_frag_data *data)
> +{
> + struct sw_flow_key *key = data->key;
> +
> + skb_push(skb, ETH_HLEN);
> + skb_reset_mac_header(skb);
> +
> + ether_addr_copy(eth_hdr(skb)->h_source, key->eth.src);
> + ether_addr_copy(eth_hdr(skb)->h_dest, key->eth.dst);
> + eth_hdr(skb)->h_proto = key->eth.type;
> +
> + if ((data->key->eth.tci & htons(VLAN_TAG_PRESENT)) &&
> + !skb_vlan_tag_present(skb))
> + __vlan_hwaccel_put_tag(skb, data->vlan_proto,
> + ntohs(key->eth.tci));
> +}
> +
> +static void prepare_frag(struct vport *vport, struct sw_flow_key *key,
> + struct sk_buff *skb)
> +{
> + unsigned int hlen = ETH_HLEN;
> + struct ovs_frag_data *data;
> +
> + data = this_cpu_ptr(&ovs_frag_data_storage);
> + data->dst = skb_dst(skb);
> + data->vport = vport;
> + data->key = key;
> + data->cb = *OVS_CB(skb);
> +
> + if (key->eth.tci & htons(VLAN_TAG_PRESENT)) {
> + if (skb_vlan_tag_present(skb)) {
> + data->vlan_proto = skb->vlan_proto;
> + } else {
> + data->vlan_proto = vlan_eth_hdr(skb)->h_vlan_proto;
> + hlen += VLAN_HLEN;
> + }
> + }
Not all actions keep the flow key up to date, so you can access stale values here.

> +
> + memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
> + skb_pull(skb, hlen);
> +}
> +
> +static int ovs_vport_output(struct sock *sock, struct sk_buff *skb)
> +{
> + struct ovs_frag_data *data = this_cpu_ptr(&ovs_frag_data_storage);
> + struct vport *vport = data->vport;
> +
> + skb_dst_drop(skb);
> + skb_dst_set(skb, dst_clone(data->dst));
> + *OVS_CB(skb) = data->cb;
> +
> + ovs_setup_l2_header(skb, data);
> + ovs_vport_send(vport, skb);
> +
> + return 0;
> +}
> +
...
> +static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
> + struct sw_flow_key *key)
> {
> struct vport *vport = ovs_vport_rcu(dp, out_port);
>
> - if (likely(vport))
> - ovs_vport_send(vport, skb);
> - else
> + if (likely(vport)) {
> + unsigned int mru = OVS_CB(skb)->mru;
> + struct dst_entry *orig_dst = dst_clone(skb_dst(skb));
> +
> + if (!mru || (skb->len <= mru + ETH_HLEN)) {
This should be marked as likely() case.
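
For illustration, the branch order under discussion can be modelled in
userspace as below. This is a sketch, not the patch: choose_output_path()
and the enum are hypothetical, the !vport->dev and ethertype checks are left
out, and likely() is defined here via __builtin_expect, which assumes GCC or
Clang.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch-prediction hint in the kernel's style; assumes GCC or Clang. */
#define likely(x) __builtin_expect(!!(x), 1)

#define ETH_HLEN 14

/* Hypothetical result type for this sketch. */
enum out_path { OUT_DIRECT, OUT_FRAGMENT, OUT_DROP };

/* No MRU recorded, or the frame already fits: nothing to refragment. */
static inline bool fits_without_frag(unsigned int mru, unsigned int skb_len)
{
	return !mru || skb_len <= mru + ETH_HLEN;
}

/* Mirrors the order of the checks in the quoted do_output() hunk. */
static inline enum out_path choose_output_path(unsigned int mru,
					       unsigned int skb_len,
					       unsigned int mtu)
{
	if (likely(fits_without_frag(mru, skb_len)))
		return OUT_DIRECT;   /* common case: send as-is */
	if (mru > mtu)
		return OUT_DROP;     /* fragments would still exceed the MTU */
	return OUT_FRAGMENT;         /* refragment down to the recorded MRU */
}
```

The hint does not change behaviour; it only tells the compiler which branch
to lay out as the fall-through path.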

> + ovs_vport_send(vport, skb);
> + } else if (!vport->dev) {
> + WARN_ONCE(1, "Cannot fragment packets to vport %s\n",
> + vport->ops->get_name(vport));
> + kfree_skb(skb);
> + } else if (mru > vport->dev->mtu) {
> + kfree_skb(skb);
> + } else if (key->eth.type == htons(ETH_P_IP)) {
> + struct dst_entry ovs_dst;
> +
> + prepare_frag(vport, key, skb);
> + dst_init(&ovs_dst, &ovs_dst_ops, vport->dev,
> + 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
> +
> + skb_dst_drop(skb);
> + skb_dst_set_noref(skb, &ovs_dst);
> + IPCB(skb)->frag_max_size = mru;
> +
> + ip_do_fragment(skb->sk, skb, ovs_vport_output);
> + dev_put(ovs_dst.dev);
> + } else if (key->eth.type == htons(ETH_P_IPV6)) {
> + const struct nf_ipv6_ops *v6ops = nf_get_ipv6_ops();
> + struct rt6_info ovs_rt;
> +
> + if (!v6ops) {
> + kfree_skb(skb);
> + goto exit;
> + }
> +
> + prepare_frag(vport, key, skb);
> + memset(&ovs_rt, 0, sizeof(ovs_rt));
> + dst_init(&ovs_rt.dst, &ovs_dst_ops, vport->dev,
> + 1, DST_OBSOLETE_NONE, DST_NOCOUNT);
> +
> + skb_dst_drop(skb);
> + skb_dst_set_noref(skb, &ovs_rt.dst);
> + IP6CB(skb)->frag_max_size = mru;
> +
> + v6ops->fragment(skb->sk, skb, ovs_vport_output);
> + dev_put(ovs_rt.dst.dev);
> + } else {
> + WARN_ONCE(1, "Failed fragment to %s: MRU=%d, MTU=%d.",
> + ovs_vport_name(vport), mru, vport->dev->mtu);
It would be helpful if the msg also mentions key->eth.type.

> + kfree_skb(skb);
> + }
> +exit:
> + dst_release(orig_dst);
> + } else {
> kfree_skb(skb);
> + }
> }
>
> static int output_userspace(struct datapath *dp, struct sk_buff *skb,
> @@ -615,6 +750,10 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
>
> memset(&upcall, 0, sizeof(upcall));
> upcall.cmd = OVS_PACKET_CMD_ACTION;
> + upcall.userdata = NULL;
> + upcall.portid = 0;
> + upcall.egress_tun_info = NULL;
> + upcall.mru = OVS_CB(skb)->mru;
>
> for (a = nla_data(attr), rem = nla_len(attr); rem > 0;
> a = nla_next(a, &rem)) {
> @@ -874,7 +1013,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> struct sk_buff *out_skb = skb_clone(skb, GFP_ATOMIC);
>
> if (out_skb)
> - do_output(dp, out_skb, prev_port);
> + do_output(dp, out_skb, prev_port, key);
>
> prev_port = -1;
> }
> @@ -931,16 +1070,25 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> case OVS_ACTION_ATTR_SAMPLE:
> err = sample(dp, skb, key, a, attr, len);
> break;
> +
> + case OVS_ACTION_ATTR_CT:
> + err = ovs_ct_execute(skb, key, nla_data(a));
> + break;
> }
>
> if (unlikely(err)) {
> - kfree_skb(skb);
> + /* Hide stolen fragments from user space. */
> + if (err == -EINPROGRESS)
> + err = 0;
This does not look safe for errors returned from all cases. Can you
check for this specifically in the CT action case?

> + else
> + kfree_skb(skb);
> +
> return err;
> }
> }
>
> if (prev_port != -1)
> - do_output(dp, skb, prev_port);
> + do_output(dp, skb, prev_port, key);
> else
> consume_skb(skb);
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> new file mode 100644
> index 0000000..284b89e
> --- /dev/null
> +++ b/net/openvswitch/conntrack.c
> @@ -0,0 +1,480 @@

...
> +
> +static struct net *ovs_get_net(const struct sk_buff *skb)
> +{
> + struct vport *vport;
> +
> + vport = OVS_CB(skb)->input_vport;
> + if (!vport) {
I do not think this is possible; OVS always initializes input_vport.

> + WARN_ONCE(1, "Can't obtain netns from vport");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + return read_pnet(&vport->dp->net);
> +}
> +
...

> +
> +static inline void ovs_ct_free_action(const struct nlattr *a) { }
> +#endif
> +#endif /* ovs_conntrack.h */
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index d5b5473..23717a3 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -275,6 +275,8 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> memset(&upcall, 0, sizeof(upcall));
> upcall.cmd = OVS_PACKET_CMD_MISS;
> upcall.portid = ovs_vport_find_upcall_portid(p, skb);
> + upcall.egress_tun_info = NULL;
There is no need to set egress_tun_info to NULL.

> + upcall.mru = OVS_CB(skb)->mru;
> error = ovs_dp_upcall(dp, skb, key, &upcall);
> if (unlikely(error))
> kfree_skb(skb);
> @@ -400,9 +402,23 @@ static size_t upcall_msg_size(const struct dp_upcall_info *upcall_info,

2015-08-01 19:17:47

by Thomas Graf

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 07/31/15 at 10:51am, Joe Stringer wrote:
> On 31 July 2015 at 07:34, Hannes Frederic Sowa <[email protected]> wrote:
> > In general, this shouldn't be necessary as the packet should already be
> > scrubbed before they arrive here.
> >
> > Could you maybe add a WARN_ON and check how those skbs with conntrack
> > data traverse the stack? I also didn't understand why make it dependent
> > on the socket.
>
> OK, sure. One case I could think of is with an OVS internal port in
> another namespace, directly attached to the bridge. I'll have a play
> around with WARN_ON and see if I can come up with something more
> trimmed down.

The OVS internal port will definitely pass through an unscrubbed
packet across namespaces. I think the proper thing to do would be
to scrub but conditionally keep metadata.

2015-08-03 22:58:50

by Joe Stringer

Subject: Re: [PATCH net-next 5/9] openvswitch: Add conntrack action

On 31 July 2015 at 19:08, Pravin Shelar <[email protected]> wrote:
> On Thu, Jul 30, 2015 at 11:12 AM, Joe Stringer <[email protected]> wrote:
>> +static void prepare_frag(struct vport *vport, struct sw_flow_key *key,
>> + struct sk_buff *skb)
>> +{
>> + unsigned int hlen = ETH_HLEN;
>> + struct ovs_frag_data *data;
>> +
>> + data = this_cpu_ptr(&ovs_frag_data_storage);
>> + data->dst = skb_dst(skb);
>> + data->vport = vport;
>> + data->key = key;
>> + data->cb = *OVS_CB(skb);
>> +
>> + if (key->eth.tci & htons(VLAN_TAG_PRESENT)) {
>> + if (skb_vlan_tag_present(skb)) {
>> + data->vlan_proto = skb->vlan_proto;
>> + } else {
>> + data->vlan_proto = vlan_eth_hdr(skb)->h_vlan_proto;
>> + hlen += VLAN_HLEN;
>> + }
>> + }
> Not all actions keep flow key uptodate, so here you can access stale values.

Hmm, okay. Perhaps the right thing to handle all of these cases is to
just make a copy of everything up to the network offset, and restore
that after fragmentation.
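
A rough userspace sketch of that idea follows, assuming a flat byte buffer
in place of an skb. The l2_copy struct, the MAX_L2_LEN cap and both helpers
are hypothetical; the point is only that the bytes in front of the network
header are copied verbatim once per frame and prepended to every fragment,
so nothing depends on the flow key being current.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_L2_LEN 32 /* hypothetical cap: Ethernet plus a few VLAN tags */

/* Hypothetical scratch area holding the bytes in front of the L3 header. */
struct l2_copy {
	unsigned char data[MAX_L2_LEN];
	size_t len;
};

/* Once per reassembled frame: save [0, network_offset) and return the
 * start of the L3 data, or NULL if the L2 header is too large to save. */
static inline const unsigned char *l2_save(struct l2_copy *save,
					   const unsigned char *frame,
					   size_t network_offset)
{
	if (network_offset > MAX_L2_LEN)
		return NULL;
	memcpy(save->data, frame, network_offset);
	save->len = network_offset;
	return frame + network_offset;
}

/* Once per fragment: write saved L2 bytes followed by the fragment into
 * out. Returns the total length, or 0 if out is too small. */
static inline size_t l2_restore(const struct l2_copy *save,
				const unsigned char *frag, size_t frag_len,
				unsigned char *out, size_t out_cap)
{
	if (save->len + frag_len > out_cap)
		return 0;
	memcpy(out, save->data, save->len);
	memcpy(out + save->len, frag, frag_len);
	return save->len + frag_len;
}
```

Because the copy is opaque, VLAN tags that earlier actions pushed or popped
are preserved exactly as they were on the frame handed to the fragmenter.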

>> if (unlikely(err)) {
>> - kfree_skb(skb);
>> + /* Hide stolen fragments from user space. */
>> + if (err == -EINPROGRESS)
>> + err = 0;
> This does not look safe for error returned from all cases, Can you
> check this case specifically for the CT action case.

I'll place it inside the CT action case.
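
As a sketch of what that could look like, here is a small userspace model
of the action loop. Everything in it is hypothetical (the action struct
carries a canned result instead of calling a real handler), and it is not
the v2 code; it only shows -EINPROGRESS being swallowed for the CT action
alone, while any other action returning it is treated as an ordinary error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical action kinds for this model. */
enum action_type { ACT_SET, ACT_CT };

struct action {
	enum action_type type;
	int result; /* stand-in for the handler's return value */
};

struct packet {
	bool freed;  /* models kfree_skb() having been called */
	bool stolen; /* models a fragment queued for reassembly */
};

static inline int execute_actions(struct packet *pkt,
				  const struct action *acts, int n)
{
	for (int i = 0; i < n; i++) {
		int err = acts[i].result;

		if (acts[i].type == ACT_CT && err == -EINPROGRESS) {
			/* Hide stolen fragments from the caller; the
			 * packet is owned elsewhere, so do not free it. */
			pkt->stolen = true;
			return 0;
		}
		if (err) {
			pkt->freed = true;
			return err;
		}
	}
	return 0;
}
```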

Thanks for the review, will roll the other fixes into the next version.

2015-08-05 04:41:17

by Joe Stringer

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On 1 August 2015 at 12:17, Thomas Graf <[email protected]> wrote:
> On 07/31/15 at 10:51am, Joe Stringer wrote:
>> On 31 July 2015 at 07:34, Hannes Frederic Sowa <[email protected]> wrote:
>> > In general, this shouldn't be necessary as the packet should already be
>> > scrubbed before they arrive here.
>> >
>> > Could you maybe add a WARN_ON and check how those skbs with conntrack
>> > data traverse the stack? I also didn't understand why make it dependent
>> > on the socket.
>>
>> OK, sure. One case I could think of is with an OVS internal port in
>> another namespace, directly attached to the bridge. I'll have a play
>> around with WARN_ON and see if I can come up with something more
>> trimmed down.
>
> The OVS internal port will definitely pass through an unscrubbed
> packet across namespaces. I think the proper thing to do would be
> to scrub but conditionally keep metadata.

It's only "unscrubbed" when receiving from local stack at the moment.
Some pieces are cleared when handing towards the local stack, and
there's no configuration for that behaviour. Presumably internal port
transmit and receive should mirror each other?

I don't have a specific use case either way. The remaining code for
this series handles this case correctly, it's just a matter of what
behaviour we're looking for. We could implement the flag as you say, I
presume that userspace would need to specify this during vport
creation and the default should work similar to the existing behaviour
(ie, keep metadata). One thing that's not entirely clear to me is
exactly which metadata should be represented by this flag and whether
the single flag is expressive enough.

2015-08-07 22:07:57

by Jesse Gross

Subject: Re: [PATCH net-next 1/9] openvswitch: Scrub packet in ovs_vport_receive()

On Tue, Aug 4, 2015 at 9:40 PM, Joe Stringer <[email protected]> wrote:
> On 1 August 2015 at 12:17, Thomas Graf <[email protected]> wrote:
>> On 07/31/15 at 10:51am, Joe Stringer wrote:
>>> On 31 July 2015 at 07:34, Hannes Frederic Sowa <[email protected]> wrote:
>>> > In general, this shouldn't be necessary as the packet should already be
>>> > scrubbed before they arrive here.
>>> >
>>> > Could you maybe add a WARN_ON and check how those skbs with conntrack
>>> > data traverse the stack? I also didn't understand why make it dependent
>>> > on the socket.
>>>
>>> OK, sure. One case I could think of is with an OVS internal port in
>>> another namespace, directly attached to the bridge. I'll have a play
>>> around with WARN_ON and see if I can come up with something more
>>> trimmed down.
>>
>> The OVS internal port will definitely pass through an unscrubbed
>> packet across namespaces. I think the proper thing to do would be
>> to scrub but conditionally keep metadata.
>
> It's only "unscrubbed" when receiving from local stack at the moment.
> Some pieces are cleared when handing towards the local stack, and
> there's no configuration for that behaviour. Presumably internal port
> transmit and receive should mirror each other?
>
> I don't have a specific use case either way. The remaining code for
> this series handles this case correctly, it's just a matter of what
> behaviour we're looking for. We could implement the flag as you say, I
> presume that userspace would need to specify this during vport
> creation and the default should work similar to the existing behaviour
> (ie, keep metadata). One thing that's not entirely clear to me is
> exactly which metadata should be represented by this flag and whether
> the single flag is expressive enough.

I would prefer not to have a flag as it seems unnecessarily
complicated (doubly so if we try to have multiple flags to express
different combinations). The use case for moving internal ports to
different namespaces is pretty narrow, so it seems like we can just
pick a set of metadata to keep and go with that. Mark seems the
primary one to me.

I also think that it would be better to use skb->dev to determine the
original namespace rather than the socket.
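
A minimal userspace sketch of that proposal, with loudly hypothetical
types: the namespace is taken from the packet's device rather than from a
socket, conntrack state is dropped when the packet crosses into the
datapath's namespace, and the mark is kept. None of the names below are
kernel API, and which fields to keep is exactly the open question in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct net { int id; };
struct net_device { const struct net *net; };

struct pkt_meta {
	const struct net_device *dev; /* device the packet arrived on */
	unsigned int mark;
	const void *conntrack;        /* stand-in for attached ct state */
};

/* Scrub when the packet's device lives in a different namespace than the
 * datapath. Returns true if the packet was scrubbed. */
static inline bool scrub_if_crossing(struct pkt_meta *pkt,
				     const struct net *dp_net)
{
	if (pkt->dev && pkt->dev->net == dp_net)
		return false;

	pkt->conntrack = NULL;
	/* pkt->mark is deliberately left alone in this sketch. */
	return true;
}
```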