From: David Ahern <[email protected]>
Nikita Leshenko reported that neighbor entries in one namespace can
evict neighbor entries in another. The problem is that the neighbor
tables hold entries from all namespaces without separate accounting,
and the thresholds that decide when to scan for entries to evict are
global.
Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
namespace and making the accounting and threshold limits per namespace.
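
The effect of per-namespace accounting can be illustrated with a small
userspace model. This is only a sketch: the model_* names are hypothetical
and not kernel API, and the forced gc pass that the kernel runs before
refusing a new entry is reduced here to a plain limit check.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct neigh_table. */
struct model_neigh_table {
	int entries;	/* entries currently accounted to this table */
	int gc_thresh3;	/* hard limit on entries in this table */
};

struct model_net {
	struct model_neigh_table arp_tbl;	/* one table per namespace */
};

/* Account one new entry against the namespace's own table only. */
static bool model_neigh_add(struct model_net *net)
{
	struct model_neigh_table *tbl = &net->arp_tbl;

	if (tbl->entries >= tbl->gc_thresh3)
		return false;	/* this namespace is full; others are unaffected */
	tbl->entries++;
	return true;
}
```

With one table per namespace, filling the table of one model_net has no
effect on whether another model_net can still add entries.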
David Ahern (17):
net/ipv4: rename ipv4_neigh_lookup to ipv4_dst_neigh_lookup
net/neigh: export neigh_find_table
net/ipv4: wrappers for arp table references
net/ipv4: Remove open coded use of arp table
net/ipv6: wrappers for neighbor table references
net/ipv6: Remove open coded use of neighbor table
drivers/net: remove open coding of neighbor tables
net: Remove nd_tbl from ipv6 stub
net: Remove arp_tbl and nd_tbl from headers
net: Add key_len to neighbor constructor
net: Change neigh_table_init and neigh_table_clear signature
net/neigh: Change neigh_xmit to take an address family
net/neighbor: Convert internal functions away from neigh_tables
net/ipv4: Convert arp table to per namespace
net/ipv6: Convert neighbor table to per-namespace
net/decnet: Move neighbor table to per-namespace
net/neighbor: Remove neigh_tables and NEIGH enum
drivers/infiniband/ulp/ipoib/ipoib_main.c | 14 +-
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 35 ++---
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 11 +-
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 27 ++--
.../net/ethernet/mellanox/mlxsw/spectrum_span.c | 8 +-
.../ethernet/netronome/nfp/flower/tunnel_conf.c | 2 +-
drivers/net/ethernet/rocker/rocker_main.c | 4 +-
drivers/net/ethernet/rocker/rocker_ofdpa.c | 2 +-
drivers/net/vrf.c | 4 +-
drivers/net/vxlan.c | 10 +-
include/net/addrconf.h | 1 -
include/net/arp.h | 25 +++-
include/net/ndisc.h | 75 +++++++++-
include/net/neighbour.h | 17 +--
include/net/net_namespace.h | 3 +
include/net/netns/ipv4.h | 1 +
include/net/netns/ipv6.h | 1 +
net/atm/clip.c | 14 +-
net/bridge/br_arp_nd_proxy.c | 4 +-
net/core/filter.c | 3 +-
net/core/neighbour.c | 115 +++++++++-----
net/decnet/dn_neigh.c | 8 +-
net/ieee802154/6lowpan/tx.c | 2 +-
net/ipv4/arp.c | 130 +++++++++-------
net/ipv4/devinet.c | 8 +-
net/ipv4/fib_semantics.c | 2 +-
net/ipv4/ip_output.c | 2 +-
net/ipv4/route.c | 12 +-
net/ipv6/addrconf.c | 16 +-
net/ipv6/af_inet6.c | 1 -
net/ipv6/ip6_output.c | 4 +-
net/ipv6/ndisc.c | 165 +++++++++++----------
net/ipv6/route.c | 12 +-
net/mpls/af_mpls.c | 33 ++---
net/mpls/mpls_iptunnel.c | 6 +-
net/netfilter/nf_flow_table_ip.c | 4 +-
net/netfilter/nft_fwd_netdev.c | 6 +-
37 files changed, 467 insertions(+), 320 deletions(-)
--
2.11.0
From: David Ahern <[email protected]>
Convert IPv6 neighbor table to per-namespace.
This patch is a transition patch for the core neighbor code, so update
the init_net reference as needed for AF_INET6. With the per-namespace
table, allow gc parameters to be changed per namespace.
Signed-off-by: David Ahern <[email protected]>
---
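A rough userspace model of what the per-namespace nd_tbl pointer means for
gc tuning is sketched here. The model_* names are hypothetical. The NULL
check stands in for a namespace that has no table registered, as with the
CONFIG_IPV6=n branch of ipv6_neigh_table().

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct neigh_table and struct net. */
struct model_neigh_table {
	int gc_thresh1;
	int gc_thresh2;
	int gc_thresh3;
};

struct model_net {
	struct model_neigh_table *nd_tbl;	/* NULL if no table registered */
};

/* Mirrors the shape of ipv6_neigh_table(): return this namespace's table. */
static struct model_neigh_table *model_ipv6_neigh_table(struct model_net *net)
{
	return net->nd_tbl;
}

/* Change gc_thresh3 for one namespace; returns 0, or -1 without a table. */
static int model_set_gc_thresh3(struct model_net *net, int val)
{
	struct model_neigh_table *tbl = model_ipv6_neigh_table(net);

	if (!tbl)
		return -1;
	tbl->gc_thresh3 = val;
	return 0;
}
```

Because each namespace owns its table, a threshold change made in one
namespace is not visible in any other.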
include/net/ndisc.h | 6 ++-
include/net/netns/ipv6.h | 1 +
net/core/neighbour.c | 16 +++++--
net/ipv6/ndisc.c | 120 +++++++++++++++++++++++------------------------
4 files changed, 76 insertions(+), 67 deletions(-)
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index 6fc58a61acdd..ce8ccc45cb4e 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -374,7 +374,11 @@ static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, _
static inline struct neigh_table *ipv6_neigh_table(struct net *net)
{
- return neigh_find_table(net, AF_INET6);
+#if IS_ENABLED(CONFIG_IPV6)
+ return net->ipv6.nd_tbl;
+#else
+ return NULL;
+#endif
}
static inline struct neighbour *ipv6_neigh_create(struct net_device *dev,
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 762ac9931b62..62fd0ce9ab0b 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -66,6 +66,7 @@ struct netns_ipv6 {
struct rt6_statistics *rt6_stats;
struct timer_list ip6_fib_timer;
struct hlist_head *fib_table_hash;
+ struct neigh_table *nd_tbl;
struct fib6_table *fib6_main_tbl;
struct list_head fib6_walkers;
struct dst_ops ip6_dst_ops;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 95b9269e3f35..35c41c4876e5 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1488,7 +1488,7 @@ static inline struct neigh_parms *lookup_neigh_parms(struct neigh_table *tbl,
struct net *def_net = &init_net;
struct neigh_parms *p;
- if (tbl->family == AF_INET)
+ if (tbl->family != AF_DECnet)
def_net = neigh_parms_net(p);
list_for_each_entry(p, &tbl->parms_list, list) {
@@ -1617,9 +1617,11 @@ void neigh_table_init(struct net *net, struct neigh_table *tbl)
case AF_INET:
net->ipv4.arp_tbl = tbl;
break;
+#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
- neigh_tables[NEIGH_ND_TABLE] = tbl;
+ net->ipv6.nd_tbl = tbl;
break;
+#endif
case AF_DECnet:
neigh_tables[NEIGH_DN_TABLE] = tbl;
break;
@@ -1635,9 +1637,11 @@ int neigh_table_clear(struct net *net, struct neigh_table *tbl)
case AF_INET:
net->ipv4.arp_tbl = NULL;
break;
+#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
- neigh_tables[NEIGH_ND_TABLE] = NULL;
+ net->ipv6.nd_tbl = NULL;
break;
+#endif
case AF_DECnet:
neigh_tables[NEIGH_DN_TABLE] = NULL;
break;
@@ -1675,9 +1679,11 @@ struct neigh_table *neigh_find_table(struct net *net, u8 family)
case AF_INET:
tbl = net->ipv4.arp_tbl;
break;
+#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
- tbl = neigh_tables[NEIGH_ND_TABLE];
+ tbl = net->ipv6.nd_tbl;
break;
+#endif
case AF_DECnet:
tbl = neigh_tables[NEIGH_DN_TABLE];
break;
@@ -2177,7 +2183,7 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh,
}
err = -ENOENT;
- if (tbl->family != AF_INET) {
+ if (tbl->family == AF_DECnet) {
if ((tb[NDTA_THRESH1] || tb[NDTA_THRESH2] ||
tb[NDTA_THRESH3] || tb[NDTA_GC_INTERVAL]) &&
!net_eq(net, &init_net))
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 6105530fe865..ae78984c4c94 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -107,39 +107,18 @@ static const struct neigh_ops ndisc_direct_ops = {
.connected_output = neigh_direct_output,
};
-struct neigh_table nd_tbl = {
- .family = AF_INET6,
- .key_len = sizeof(struct in6_addr),
- .protocol = cpu_to_be16(ETH_P_IPV6),
- .hash = ndisc_hash,
- .key_eq = ndisc_key_eq,
- .constructor = ndisc_constructor,
- .pconstructor = pndisc_constructor,
- .pdestructor = pndisc_destructor,
- .proxy_redo = pndisc_redo,
- .id = "ndisc_cache",
- .parms = {
- .tbl = &nd_tbl,
- .reachable_time = ND_REACHABLE_TIME,
- .data = {
- [NEIGH_VAR_MCAST_PROBES] = 3,
- [NEIGH_VAR_UCAST_PROBES] = 3,
- [NEIGH_VAR_RETRANS_TIME] = ND_RETRANS_TIMER,
- [NEIGH_VAR_BASE_REACHABLE_TIME] = ND_REACHABLE_TIME,
- [NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ,
- [NEIGH_VAR_GC_STALETIME] = 60 * HZ,
- [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX,
- [NEIGH_VAR_PROXY_QLEN] = 64,
- [NEIGH_VAR_ANYCAST_DELAY] = 1 * HZ,
- [NEIGH_VAR_PROXY_DELAY] = (8 * HZ) / 10,
- },
- },
- .gc_interval = 30 * HZ,
- .gc_thresh1 = 128,
- .gc_thresh2 = 512,
- .gc_thresh3 = 1024,
+static int parms_data[NEIGH_VAR_DATA_MAX] = {
+ [NEIGH_VAR_MCAST_PROBES] = 3,
+ [NEIGH_VAR_UCAST_PROBES] = 3,
+ [NEIGH_VAR_RETRANS_TIME] = ND_RETRANS_TIMER,
+ [NEIGH_VAR_BASE_REACHABLE_TIME] = ND_REACHABLE_TIME,
+ [NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ,
+ [NEIGH_VAR_GC_STALETIME] = 60 * HZ,
+ [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX,
+ [NEIGH_VAR_PROXY_QLEN] = 64,
+ [NEIGH_VAR_ANYCAST_DELAY] = 1 * HZ,
+ [NEIGH_VAR_PROXY_DELAY] = (8 * HZ) / 10,
};
-EXPORT_SYMBOL_GPL(nd_tbl);
void __ndisc_fill_addr_option(struct sk_buff *skb, int type, void *data,
int data_len, int pad)
@@ -1865,16 +1844,22 @@ int ndisc_ifinfo_sysctl_change(struct ctl_table *ctl, int write, void __user *bu
static int __net_init ndisc_net_init(struct net *net)
{
+ struct neigh_table *nd_tbl;
struct ipv6_pinfo *np;
struct sock *sk;
int err;
+ nd_tbl = kzalloc(sizeof(*nd_tbl), GFP_KERNEL);
+ if (!nd_tbl)
+ return -ENOMEM;
+
err = inet_ctl_sock_create(&sk, PF_INET6,
SOCK_RAW, IPPROTO_ICMPV6, net);
if (err < 0) {
ND_PRINTK(0, err,
"NDISC: Failed to initialize the control socket (err %d)\n",
err);
+ kfree(nd_tbl);
return err;
}
@@ -1885,12 +1870,52 @@ static int __net_init ndisc_net_init(struct net *net)
/* Do not loopback ndisc messages */
np->mc_loop = 0;
- return 0;
+ rwlock_init(&nd_tbl->lock);
+ nd_tbl->family = AF_INET6;
+ nd_tbl->key_len = sizeof(struct in6_addr);
+ nd_tbl->protocol = cpu_to_be16(ETH_P_IPV6);
+ nd_tbl->hash = ndisc_hash;
+ nd_tbl->key_eq = ndisc_key_eq;
+ nd_tbl->constructor = ndisc_constructor;
+ nd_tbl->pconstructor = pndisc_constructor;
+ nd_tbl->pdestructor = pndisc_destructor;
+ nd_tbl->proxy_redo = pndisc_redo;
+ nd_tbl->id = "ndisc_cache";
+ nd_tbl->gc_interval = 30 * HZ;
+ nd_tbl->gc_thresh1 = 128;
+ nd_tbl->gc_thresh2 = 512;
+ nd_tbl->gc_thresh3 = 1024;
+
+ nd_tbl->parms.tbl = nd_tbl;
+ nd_tbl->parms.reachable_time = ND_REACHABLE_TIME;
+ memcpy(nd_tbl->parms.data, parms_data, sizeof(parms_data));
+
+ neigh_table_init(net, nd_tbl);
+
+ err = 0;
+#ifdef CONFIG_SYSCTL
+ err = neigh_sysctl_register(NULL, &nd_tbl->parms,
+ ndisc_ifinfo_sysctl_change);
+ if (err) {
+ inet_ctl_sock_destroy(net->ipv6.ndisc_sk);
+ kfree(nd_tbl);
+ }
+#endif
+ return err;
}
static void __net_exit ndisc_net_exit(struct net *net)
{
+ struct neigh_table *nd_tbl = net->ipv6.nd_tbl;
+
inet_ctl_sock_destroy(net->ipv6.ndisc_sk);
+
+#ifdef CONFIG_SYSCTL
+ neigh_sysctl_unregister(&nd_tbl->parms);
+#endif
+ net->ipv6.nd_tbl = NULL;
+ neigh_table_clear(net, nd_tbl);
+ kfree(nd_tbl);
}
static struct pernet_operations ndisc_net_ops = {
@@ -1900,30 +1925,7 @@ static struct pernet_operations ndisc_net_ops = {
int __init ndisc_init(void)
{
- int err;
-
- err = register_pernet_subsys(&ndisc_net_ops);
- if (err)
- return err;
- /*
- * Initialize the neighbour table
- */
- neigh_table_init(&init_net, &nd_tbl);
-
-#ifdef CONFIG_SYSCTL
- err = neigh_sysctl_register(NULL, &nd_tbl.parms,
- ndisc_ifinfo_sysctl_change);
- if (err)
- goto out_unregister_pernet;
-out:
-#endif
- return err;
-
-#ifdef CONFIG_SYSCTL
-out_unregister_pernet:
- unregister_pernet_subsys(&ndisc_net_ops);
- goto out;
-#endif
+ return register_pernet_subsys(&ndisc_net_ops);
}
int __init ndisc_late_init(void)
@@ -1938,9 +1940,5 @@ void ndisc_late_cleanup(void)
void ndisc_cleanup(void)
{
-#ifdef CONFIG_SYSCTL
- neigh_sysctl_unregister(&nd_tbl.parms);
-#endif
- neigh_table_clear(&init_net, &nd_tbl);
unregister_pernet_subsys(&ndisc_net_ops);
}
--
2.11.0
From: David Ahern <[email protected]>
Convert decnet neighbor table to per-namespace. Since there are no
other per-namespace parameters for decnet, just add the reference to
struct net directly.
Signed-off-by: David Ahern <[email protected]>
---
include/net/net_namespace.h | 3 +++
net/core/neighbour.c | 26 ++++++++++----------------
2 files changed, 13 insertions(+), 16 deletions(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index a71264d75d7f..77799ed7212e 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -108,6 +108,9 @@ struct net {
#if IS_ENABLED(CONFIG_IPV6)
struct netns_ipv6 ipv6;
#endif
+#if IS_ENABLED(CONFIG_DECNET)
+ struct neigh_table *dn_tbl;
+#endif
#if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
struct netns_ieee802154_lowpan ieee802154_lowpan;
#endif
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 35c41c4876e5..1bb7fa14cb62 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1485,15 +1485,11 @@ EXPORT_SYMBOL(pneigh_enqueue);
static inline struct neigh_parms *lookup_neigh_parms(struct neigh_table *tbl,
struct net *net, int ifindex)
{
- struct net *def_net = &init_net;
struct neigh_parms *p;
- if (tbl->family != AF_DECnet)
- def_net = neigh_parms_net(p);
-
list_for_each_entry(p, &tbl->parms_list, list) {
if ((p->dev && p->dev->ifindex == ifindex && net_eq(neigh_parms_net(p), net)) ||
- (!p->dev && !ifindex && net_eq(net, def_net)))
+ (!p->dev && !ifindex))
return p;
}
@@ -1622,9 +1618,11 @@ void neigh_table_init(struct net *net, struct neigh_table *tbl)
net->ipv6.nd_tbl = tbl;
break;
#endif
+#if IS_ENABLED(CONFIG_DECNET)
case AF_DECnet:
- neigh_tables[NEIGH_DN_TABLE] = tbl;
+ net->dn_tbl = tbl;
break;
+#endif
}
}
EXPORT_SYMBOL(neigh_table_init);
@@ -1642,9 +1640,11 @@ int neigh_table_clear(struct net *net, struct neigh_table *tbl)
net->ipv6.nd_tbl = NULL;
break;
#endif
+#if IS_ENABLED(CONFIG_DECNET)
case AF_DECnet:
- neigh_tables[NEIGH_DN_TABLE] = NULL;
+ net->dn_tbl = NULL;
break;
+#endif
}
/* It is not clean... Fix it to unload IPv6 module safely */
@@ -1684,9 +1684,11 @@ struct neigh_table *neigh_find_table(struct net *net, u8 family)
tbl = net->ipv6.nd_tbl;
break;
#endif
+#if IS_ENABLED(CONFIG_DECNET)
case AF_DECnet:
- tbl = neigh_tables[NEIGH_DN_TABLE];
+ tbl = net->dn_tbl;
break;
+#endif
}
return tbl;
@@ -2182,14 +2184,6 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh,
}
}
- err = -ENOENT;
- if (tbl->family == AF_DECnet) {
- if ((tb[NDTA_THRESH1] || tb[NDTA_THRESH2] ||
- tb[NDTA_THRESH3] || tb[NDTA_GC_INTERVAL]) &&
- !net_eq(net, &init_net))
- goto errout_tbl_lock;
- }
-
if (tb[NDTA_THRESH1])
tbl->gc_thresh1 = nla_get_u32(tb[NDTA_THRESH1]);
--
2.11.0
From: David Ahern <[email protected]>
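Remove the neigh_tables array and the NEIGH_*_TABLE enum. Both are no
longer used now that every table is reached through struct net.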
Signed-off-by: David Ahern <[email protected]>
---
include/net/neighbour.h | 8 --------
net/core/neighbour.c | 2 --
2 files changed, 10 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e968db1b7742..72002a24e72e 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -222,14 +222,6 @@ struct neigh_table {
struct pneigh_entry **phash_buckets;
};
-enum {
- NEIGH_ARP_TABLE = 0,
- NEIGH_ND_TABLE = 1,
- NEIGH_DN_TABLE = 2,
- NEIGH_NR_TABLES,
- NEIGH_LINK_TABLE = NEIGH_NR_TABLES /* Pseudo table for neigh_xmit */
-};
-
struct neigh_table *neigh_find_table(struct net *net, u8 family);
static inline int neigh_parms_family(struct neigh_parms *p)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 1bb7fa14cb62..7c3d4cb811d3 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1559,8 +1559,6 @@ static void neigh_parms_destroy(struct neigh_parms *parms)
static struct lock_class_key neigh_table_proxy_queue_class;
-static struct neigh_table *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
-
void neigh_table_init(struct net *net, struct neigh_table *tbl)
{
unsigned long now = jiffies;
--
2.11.0
From: David Ahern <[email protected]>
The NEIGH_*_TABLE enum is really related to *how* the tables are
stored (in an array), a detail that does not need to leave the
neighbor code. Change the neigh_xmit API to take a proper address
family.
Signed-off-by: David Ahern <[email protected]>
---
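The dispatch that neigh_xmit performs after this change can be modeled in
userspace as follows. This is a sketch only: the model_* names are
hypothetical, tables are represented by their id strings, and the actual
neighbor lookup and output steps are replaced by recording which path was
taken.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>	/* AF_UNSPEC, AF_INET, AF_INET6 */

enum model_path { MODEL_PATH_NEIGH, MODEL_PATH_LINK };

/* Hypothetical stand-in for struct net; a table is just its id string. */
struct model_net {
	const char *arp_tbl;
	const char *nd_tbl;
};

static const char *model_find_table(const struct model_net *net,
				    unsigned char family)
{
	switch (family) {
	case AF_INET:
		return net->arp_tbl;
	case AF_INET6:
		return net->nd_tbl;
	}
	return NULL;
}

/* A real address family selects a table; AF_UNSPEC selects the link path. */
static int model_neigh_xmit(const struct model_net *net, unsigned char family,
			    enum model_path *path)
{
	if (family != AF_UNSPEC) {
		if (!model_find_table(net, family))
			return -EAFNOSUPPORT;
		*path = MODEL_PATH_NEIGH;
		return 0;
	}
	*path = MODEL_PATH_LINK;
	return 0;
}
```

Callers such as mpls then pass AF_INET, AF_INET6 or AF_UNSPEC and never
need to know how the tables are stored.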
include/net/neighbour.h | 2 +-
net/core/neighbour.c | 10 +++++-----
net/mpls/af_mpls.c | 33 ++++++++++++---------------------
net/mpls/mpls_iptunnel.c | 6 ++----
net/netfilter/nf_flow_table_ip.c | 4 ++--
net/netfilter/nft_fwd_netdev.c | 6 +++---
6 files changed, 25 insertions(+), 36 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index b70afea05f86..e968db1b7742 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -365,7 +365,7 @@ void neigh_for_each(struct neigh_table *tbl,
void (*cb)(struct neighbour *, void *), void *cookie);
void __neigh_for_each_release(struct neigh_table *tbl,
int (*cb)(struct neighbour *));
-int neigh_xmit(int fam, struct net_device *, const void *, struct sk_buff *);
+int neigh_xmit(u8 fam, struct net_device *, const void *, struct sk_buff *);
void pneigh_for_each(struct neigh_table *tbl,
void (*cb)(struct pneigh_entry *));
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 8bdaeb080ce4..b60087d7c0bc 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -2548,15 +2548,16 @@ void __neigh_for_each_release(struct neigh_table *tbl,
}
EXPORT_SYMBOL(__neigh_for_each_release);
-int neigh_xmit(int index, struct net_device *dev,
+int neigh_xmit(u8 family, struct net_device *dev,
const void *addr, struct sk_buff *skb)
{
int err = -EAFNOSUPPORT;
- if (likely(index < NEIGH_NR_TABLES)) {
+
+ if (likely(family != AF_UNSPEC)) {
struct neigh_table *tbl;
struct neighbour *neigh;
- tbl = neigh_tables[index];
+ tbl = neigh_find_table(dev_net(dev), family);
if (!tbl)
goto out;
rcu_read_lock_bh();
@@ -2570,8 +2571,7 @@ int neigh_xmit(int index, struct net_device *dev,
}
err = neigh->output(neigh, skb);
rcu_read_unlock_bh();
- }
- else if (index == NEIGH_LINK_TABLE) {
+ } else {
err = dev_hard_header(skb, dev, ntohs(skb->protocol),
addr, NULL, skb->len);
if (err < 0)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d618b1..a701dc055de2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -34,7 +34,7 @@
*/
#define MAX_MP_SELECT_LABELS 4
-#define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1)
+#define MPLS_NEIGH_TABLE_UNSPEC AF_UNSPEC
static int zero = 0;
static int one = 1;
@@ -453,7 +453,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
/* If via wasn't specified then send out using device address */
if (nh->nh_via_table == MPLS_NEIGH_TABLE_UNSPEC)
- err = neigh_xmit(NEIGH_LINK_TABLE, out_dev,
+ err = neigh_xmit(AF_UNSPEC, out_dev,
out_dev->dev_addr, skb);
else
err = neigh_xmit(nh->nh_via_table, out_dev,
@@ -651,14 +651,12 @@ static struct net_device *find_outdev(struct net *net,
if (!oif) {
switch (nh->nh_via_table) {
- case NEIGH_ARP_TABLE:
+ case AF_INET:
dev = inet_fib_lookup_dev(net, mpls_nh_via(rt, nh));
break;
- case NEIGH_ND_TABLE:
+ case AF_INET6:
dev = inet6_fib_lookup_dev(net, mpls_nh_via(rt, nh));
break;
- case NEIGH_LINK_TABLE:
- break;
}
} else {
dev = dev_get_by_index(net, oif);
@@ -694,7 +692,7 @@ static int mpls_nh_assign_dev(struct net *net, struct mpls_route *rt,
if (!mpls_dev_get(dev))
goto errout;
- if ((nh->nh_via_table == NEIGH_LINK_TABLE) &&
+ if ((nh->nh_via_table == MPLS_NEIGH_TABLE_UNSPEC) &&
(dev->addr_len != nh->nh_via_alen))
goto errout;
@@ -739,15 +737,15 @@ static int nla_get_via(const struct nlattr *nla, u8 *via_alen, u8 *via_table,
/* Validate the address family */
switch (via->rtvia_family) {
case AF_PACKET:
- *via_table = NEIGH_LINK_TABLE;
+ *via_table = MPLS_NEIGH_TABLE_UNSPEC;
break;
case AF_INET:
- *via_table = NEIGH_ARP_TABLE;
+ *via_table = AF_INET;
if (alen != 4)
goto errout;
break;
case AF_INET6:
- *via_table = NEIGH_ND_TABLE;
+ *via_table = AF_INET6;
if (alen != 16)
goto errout;
break;
@@ -1596,23 +1594,16 @@ static struct notifier_block mpls_dev_notifier = {
.notifier_call = mpls_dev_notify,
};
-static int nla_put_via(struct sk_buff *skb,
- u8 table, const void *addr, int alen)
+static int nla_put_via(struct sk_buff *skb, u8 family,
+ const void *addr, int alen)
{
- static const int table_to_family[NEIGH_NR_TABLES + 1] = {
- AF_INET, AF_INET6, AF_DECnet, AF_PACKET,
- };
struct nlattr *nla;
struct rtvia *via;
- int family = AF_UNSPEC;
nla = nla_reserve(skb, RTA_VIA, alen + 2);
if (!nla)
return -EMSGSIZE;
- if (table <= NEIGH_NR_TABLES)
- family = table_to_family[table];
-
via = nla_data(nla);
via->rtvia_family = family;
memcpy(via->rtvia_addr, addr, alen);
@@ -2295,7 +2286,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
rt0->rt_protocol = RTPROT_KERNEL;
rt0->rt_payload_type = MPT_IPV4;
rt0->rt_ttl_propagate = MPLS_TTL_PROP_DEFAULT;
- rt0->rt_nh->nh_via_table = NEIGH_LINK_TABLE;
+ rt0->rt_nh->nh_via_table = MPLS_NEIGH_TABLE_UNSPEC;
rt0->rt_nh->nh_via_alen = lo->addr_len;
memcpy(__mpls_nh_via(rt0, rt0->rt_nh), lo->dev_addr,
lo->addr_len);
@@ -2309,7 +2300,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
rt2->rt_protocol = RTPROT_KERNEL;
rt2->rt_payload_type = MPT_IPV6;
rt2->rt_ttl_propagate = MPLS_TTL_PROP_DEFAULT;
- rt2->rt_nh->nh_via_table = NEIGH_LINK_TABLE;
+ rt2->rt_nh->nh_via_table = MPLS_NEIGH_TABLE_UNSPEC;
rt2->rt_nh->nh_via_alen = lo->addr_len;
memcpy(__mpls_nh_via(rt2, rt2->rt_nh), lo->dev_addr,
lo->addr_len);
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index 6e558a419f60..6dc8370c290d 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -138,11 +138,9 @@ static int mpls_xmit(struct sk_buff *skb)
mpls_stats_inc_outucastpkts(out_dev, skb);
if (rt)
- err = neigh_xmit(NEIGH_ARP_TABLE, out_dev, &rt->rt_gateway,
- skb);
+ err = neigh_xmit(AF_INET, out_dev, &rt->rt_gateway, skb);
else if (rt6)
- err = neigh_xmit(NEIGH_ND_TABLE, out_dev, &rt6->rt6i_gateway,
- skb);
+ err = neigh_xmit(AF_INET6, out_dev, &rt6->rt6i_gateway, skb);
if (err)
net_dbg_ratelimited("%s: packet transmission failed: %d\n",
__func__, err);
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index 15ed91309992..e56fcea9c7ba 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -265,7 +265,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
skb->dev = outdev;
nexthop = rt_nexthop(rt, flow->tuplehash[!dir].tuple.src_v4.s_addr);
skb_dst_set_noref(skb, &rt->dst);
- neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);
+ neigh_xmit(AF_INET, outdev, &nexthop, skb);
return NF_STOLEN;
}
@@ -482,7 +482,7 @@ nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
skb->dev = outdev;
nexthop = rt6_nexthop(rt, &flow->tuplehash[!dir].tuple.src_v6);
skb_dst_set_noref(skb, &rt->dst);
- neigh_xmit(NEIGH_ND_TABLE, outdev, nexthop, skb);
+ neigh_xmit(AF_INET6, outdev, nexthop, skb);
return NF_STOLEN;
}
diff --git a/net/netfilter/nft_fwd_netdev.c b/net/netfilter/nft_fwd_netdev.c
index 8abb9891cdf2..c361b0636a8c 100644
--- a/net/netfilter/nft_fwd_netdev.c
+++ b/net/netfilter/nft_fwd_netdev.c
@@ -84,7 +84,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
unsigned int verdict = NF_STOLEN;
struct sk_buff *skb = pkt->skb;
struct net_device *dev;
- int neigh_table;
+ u8 neigh_table;
switch (priv->nfproto) {
case NFPROTO_IPV4: {
@@ -100,7 +100,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
}
iph = ip_hdr(skb);
ip_decrease_ttl(iph);
- neigh_table = NEIGH_ARP_TABLE;
+ neigh_table = AF_INET;
break;
}
case NFPROTO_IPV6: {
@@ -116,7 +116,7 @@ static void nft_fwd_neigh_eval(const struct nft_expr *expr,
}
ip6h = ipv6_hdr(skb);
ip6h->hop_limit--;
- neigh_table = NEIGH_ND_TABLE;
+ neigh_table = AF_INET6;
break;
}
default:
--
2.11.0
From: David Ahern <[email protected]>
Pass the table's key_len to the neighbor constructor so that
arp_constructor does not need to reference arp_tbl directly.
ndisc_constructor assumes the key length; arp_constructor could do the
same.
Signed-off-by: David Ahern <[email protected]>
---
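The calling convention introduced here can be sketched in userspace as
below. The model_* names and the simplified structs are hypothetical. The
length check in the constructor is an addition for the sketch and is not
part of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct neighbour. */
struct model_neighbour {
	unsigned char primary_key[16];
	int loopback;	/* stands in for IFF_LOOPBACK | IFF_POINTOPOINT */
};

/* Hypothetical stand-in for struct neigh_table. */
struct model_neigh_table {
	unsigned int key_len;
	int (*constructor)(struct model_neighbour *n, unsigned int key_len);
};

/* The constructor uses the key_len it is handed, not a global table. */
static int model_arp_constructor(struct model_neighbour *n,
				 unsigned int key_len)
{
	static const uint32_t inaddr_any;	/* zero, like INADDR_ANY */

	if (key_len != sizeof(inaddr_any))
		return -1;	/* sketch-only guard against a bad length */
	if (n->loopback)
		memcpy(n->primary_key, &inaddr_any, key_len);
	return 0;
}

/* Core code forwards tbl->key_len, as __neigh_create does in the patch. */
static int model_neigh_create(const struct model_neigh_table *tbl,
			      struct model_neighbour *n)
{
	if (tbl->constructor)
		return tbl->constructor(n, tbl->key_len);
	return 0;
}
```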
include/net/neighbour.h | 3 ++-
net/core/neighbour.c | 3 ++-
net/decnet/dn_neigh.c | 4 ++--
net/ipv4/arp.c | 6 +++---
net/ipv6/ndisc.c | 4 ++--
5 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 5bc4d79b4b3a..6cf9ce16eac8 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -198,7 +198,8 @@ struct neigh_table {
const struct net_device *dev,
__u32 *hash_rnd);
bool (*key_eq)(const struct neighbour *, const void *pkey);
- int (*constructor)(struct neighbour *);
+ int (*constructor)(struct neighbour *,
+ unsigned int key_len);
int (*pconstructor)(struct pneigh_entry *);
void (*pdestructor)(struct pneigh_entry *);
void (*proxy_redo)(struct sk_buff *skb);
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index e8630f9de24a..41841d8e4ea4 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -505,7 +505,8 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, const void *pkey,
dev_hold(dev);
/* Protocol specific setup. */
- if (tbl->constructor && (error = tbl->constructor(n)) < 0) {
+ if (tbl->constructor &&
+ (error = tbl->constructor(n, tbl->key_len)) < 0) {
rc = ERR_PTR(error);
goto out_neigh_release;
}
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 94b306f6d551..74112777beb0 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -49,7 +49,7 @@
#include <net/dn_neigh.h>
#include <net/dn_route.h>
-static int dn_neigh_construct(struct neighbour *);
+static int dn_neigh_construct(struct neighbour *, unsigned int key_len);
static void dn_neigh_error_report(struct neighbour *, struct sk_buff *);
static int dn_neigh_output(struct neighbour *neigh, struct sk_buff *skb);
@@ -108,7 +108,7 @@ struct neigh_table dn_neigh_table = {
.gc_thresh3 = 1024,
};
-static int dn_neigh_construct(struct neighbour *neigh)
+static int dn_neigh_construct(struct neighbour *neigh, unsigned int key_len)
{
struct net_device *dev = neigh->dev;
struct dn_neigh *dn = container_of(neigh, struct dn_neigh, n);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index fd4a380da9bb..7b27faefa01b 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -125,7 +125,7 @@
*/
static u32 arp_hash(const void *pkey, const struct net_device *dev, __u32 *hash_rnd);
static bool arp_key_eq(const struct neighbour *n, const void *pkey);
-static int arp_constructor(struct neighbour *neigh);
+static int arp_constructor(struct neighbour *neigh, unsigned int key_len);
static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb);
static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb);
static void parp_redo(struct sk_buff *skb);
@@ -221,7 +221,7 @@ static bool arp_key_eq(const struct neighbour *neigh, const void *pkey)
return neigh_key_eq32(neigh, pkey);
}
-static int arp_constructor(struct neighbour *neigh)
+static int arp_constructor(struct neighbour *neigh, unsigned int key_len)
{
__be32 addr;
struct net_device *dev = neigh->dev;
@@ -230,7 +230,7 @@ static int arp_constructor(struct neighbour *neigh)
u32 inaddr_any = INADDR_ANY;
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
- memcpy(neigh->primary_key, &inaddr_any, arp_tbl.key_len);
+ memcpy(neigh->primary_key, &inaddr_any, key_len);
addr = *(__be32 *)neigh->primary_key;
rcu_read_lock();
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 14b925f36099..5103d8641b04 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -77,7 +77,7 @@ static u32 ndisc_hash(const void *pkey,
const struct net_device *dev,
__u32 *hash_rnd);
static bool ndisc_key_eq(const struct neighbour *neigh, const void *pkey);
-static int ndisc_constructor(struct neighbour *neigh);
+static int ndisc_constructor(struct neighbour *neigh, unsigned int key_len);
static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb);
static void ndisc_error_report(struct neighbour *neigh, struct sk_buff *skb);
static int pndisc_constructor(struct pneigh_entry *n);
@@ -319,7 +319,7 @@ static bool ndisc_key_eq(const struct neighbour *n, const void *pkey)
return neigh_key_eq128(n, pkey);
}
-static int ndisc_constructor(struct neighbour *neigh)
+static int ndisc_constructor(struct neighbour *neigh, unsigned int key_len)
{
struct in6_addr *addr = (struct in6_addr *)&neigh->primary_key;
struct net_device *dev = neigh->dev;
--
2.11.0
From: David Ahern <[email protected]>
Convert uses of neigh_tables array to neigh_find_table. Maintain existing
family order using a new table_families array. A later patch removes the
neigh_tables array in favor of per-namespace accesses.
Signed-off-by: David Ahern <[email protected]>
---
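The dump loops converted by this patch follow one pattern: walk a fixed
family list, look up each table for the namespace, and skip families with
no table. A userspace sketch of that pattern is given here. The model_*
names are hypothetical and decnet is left out for brevity.

```c
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

#define MODEL_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Fixed order keeps dump output stable: ipv4 first, then ipv6. */
static const int model_table_families[] = { AF_INET, AF_INET6 };

struct model_table {
	int family;
};

struct model_net {
	struct model_table *arp_tbl;
	struct model_table *nd_tbl;
};

static struct model_table *model_find_table(const struct model_net *net,
					    int family)
{
	switch (family) {
	case AF_INET:
		return net->arp_tbl;
	case AF_INET6:
		return net->nd_tbl;
	}
	return NULL;
}

/* Record the family of each table that would be dumped; returns the count. */
static size_t model_dump_families(const struct model_net *net, int filter,
				  int *out, size_t cap)
{
	size_t i, n = 0;

	for (i = 0; i < MODEL_ARRAY_SIZE(model_table_families); i++) {
		const struct model_table *tbl =
			model_find_table(net, model_table_families[i]);

		if (!tbl)
			continue;	/* family not registered in this netns */
		if (filter && tbl->family != filter)
			continue;	/* caller asked for another family */
		if (n < cap)
			out[n++] = tbl->family;
	}
	return n;
}
```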
net/core/neighbour.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index b60087d7c0bc..afb2ee985dd1 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -62,6 +62,19 @@ static int pneigh_ifdown_and_unlock(struct neigh_table *tbl,
static const struct seq_operations neigh_stat_seq_ops;
#endif
+/* used for table dumps to maintain the legacy order of
+ * ipv4, ipv6, decnet
+ */
+static int table_families[] = {
+ AF_INET,
+#if IS_ENABLED(CONFIG_IPV6)
+ AF_INET6,
+#endif
+#if IS_ENABLED(CONFIG_DECNET)
+ AF_DECnet,
+#endif
+};
+
/*
Neighbour hash table buckets are protected with rwlock tbl->lock.
@@ -2046,8 +2059,8 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh,
ndtmsg = nlmsg_data(nlh);
- for (tidx = 0; tidx < NEIGH_NR_TABLES; tidx++) {
- tbl = neigh_tables[tidx];
+ for (tidx = 0; tidx < ARRAY_SIZE(table_families); tidx++) {
+ tbl = neigh_find_table(net, table_families[tidx]);
if (!tbl)
continue;
if (ndtmsg->ndtm_family && tbl->family != ndtmsg->ndtm_family)
@@ -2195,10 +2208,10 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
family = ((struct rtgenmsg *) nlmsg_data(cb->nlh))->rtgen_family;
- for (tidx = 0; tidx < NEIGH_NR_TABLES; tidx++) {
+ for (tidx = 0; tidx < ARRAY_SIZE(table_families); tidx++) {
struct neigh_parms *p;
- tbl = neigh_tables[tidx];
+ tbl = neigh_find_table(net, table_families[tidx]);
if (!tbl)
continue;
@@ -2453,6 +2466,7 @@ static int pneigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
{
+ struct net *net = sock_net(skb->sk);
struct neigh_table *tbl;
int t, family, s_t;
int proxy = 0;
@@ -2469,9 +2483,8 @@ static int neigh_dump_info(struct sk_buff *skb, struct netlink_callback *cb)
s_t = cb->args[0];
- for (t = 0; t < NEIGH_NR_TABLES; t++) {
- tbl = neigh_tables[t];
-
+ for (t = 0; t < ARRAY_SIZE(table_families); t++) {
+ tbl = neigh_find_table(net, table_families[t]);
if (!tbl)
continue;
if (t < s_t || (family && tbl->family != family))
--
2.11.0
From: David Ahern <[email protected]>
Convert IPv4 neighbor table to per-namespace.
This patch is a transition patch for the core neighbor code, so update
the init_net reference as needed for AF_INET. With the per-namespace
table, allow gc parameters to be changed per namespace.
Signed-off-by: David Ahern <[email protected]>
---
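In the lookup_neigh_parms hunk, p has not been assigned yet when
neigh_parms_net(p) is evaluated ahead of the loop. The userspace sketch
below assumes the intent is that, for a per-namespace table, the default
(device-less) parms entry matches in the caller's own namespace, while
tables that are still global keep matching only in init_net. The model_*
names are hypothetical, and a zero ifindex stands in for p->dev == NULL.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net and struct neigh_parms. */
struct model_net {
	int id;
};

struct model_parms {
	const struct model_net *net;
	int ifindex;	/* 0 means the default entry with no device */
};

static const struct model_parms *
model_lookup_parms(const struct model_parms *list, size_t n,
		   int per_netns_table, const struct model_net *init_net,
		   const struct model_net *net, int ifindex)
{
	/* Assumed intent: per-namespace tables use the caller's namespace. */
	const struct model_net *def_net = per_netns_table ? net : init_net;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct model_parms *p = &list[i];

		if ((p->ifindex && p->ifindex == ifindex && p->net == net) ||
		    (!p->ifindex && !ifindex && net == def_net))
			return p;
	}
	return NULL;
}
```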
include/net/arp.h | 2 +-
include/net/netns/ipv4.h | 1 +
net/core/neighbour.c | 22 +++++++-----
net/ipv4/arp.c | 88 ++++++++++++++++++++++++++++--------------------
4 files changed, 67 insertions(+), 46 deletions(-)
diff --git a/include/net/arp.h b/include/net/arp.h
index fae3561db10b..ec86b286f779 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -9,7 +9,7 @@
static inline struct neigh_table *ipv4_neigh_table(struct net *net)
{
- return neigh_find_table(net, AF_INET);
+ return net->ipv4.arp_tbl;
}
static inline struct neighbour *ipv4_neigh_create(struct net_device *dev,
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 661348f23ea5..bc1fab231500 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -51,6 +51,7 @@ struct netns_ipv4 {
struct ipv4_devconf *devconf_dflt;
struct ip_ra_chain __rcu *ra_chain;
struct mutex ra_mutex;
+ struct neigh_table *arp_tbl;
#ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops *rules_ops;
bool fib_has_custom_rules;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index afb2ee985dd1..95b9269e3f35 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1485,11 +1485,15 @@ EXPORT_SYMBOL(pneigh_enqueue);
static inline struct neigh_parms *lookup_neigh_parms(struct neigh_table *tbl,
struct net *net, int ifindex)
{
+ struct net *def_net = &init_net;
struct neigh_parms *p;
+ if (tbl->family == AF_INET)
+ def_net = neigh_parms_net(p);
+
list_for_each_entry(p, &tbl->parms_list, list) {
if ((p->dev && p->dev->ifindex == ifindex && net_eq(neigh_parms_net(p), net)) ||
- (!p->dev && !ifindex && net_eq(net, &init_net)))
+ (!p->dev && !ifindex && net_eq(net, def_net)))
return p;
}
@@ -1611,7 +1615,7 @@ void neigh_table_init(struct net *net, struct neigh_table *tbl)
switch (family) {
case AF_INET:
- neigh_tables[NEIGH_ARP_TABLE] = tbl;
+ net->ipv4.arp_tbl = tbl;
break;
case AF_INET6:
neigh_tables[NEIGH_ND_TABLE] = tbl;
@@ -1629,7 +1633,7 @@ int neigh_table_clear(struct net *net, struct neigh_table *tbl)
switch (family) {
case AF_INET:
- neigh_tables[NEIGH_ARP_TABLE] = NULL;
+ net->ipv4.arp_tbl = NULL;
break;
case AF_INET6:
neigh_tables[NEIGH_ND_TABLE] = NULL;
@@ -1669,7 +1673,7 @@ struct neigh_table *neigh_find_table(struct net *net, u8 family)
switch (family) {
case AF_INET:
- tbl = neigh_tables[NEIGH_ARP_TABLE];
+ tbl = net->ipv4.arp_tbl;
break;
case AF_INET6:
tbl = neigh_tables[NEIGH_ND_TABLE];
@@ -2173,10 +2177,12 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh,
}
err = -ENOENT;
- if ((tb[NDTA_THRESH1] || tb[NDTA_THRESH2] ||
- tb[NDTA_THRESH3] || tb[NDTA_GC_INTERVAL]) &&
- !net_eq(net, &init_net))
- goto errout_tbl_lock;
+ if (tbl->family != AF_INET) {
+ if ((tb[NDTA_THRESH1] || tb[NDTA_THRESH2] ||
+ tb[NDTA_THRESH3] || tb[NDTA_GC_INTERVAL]) &&
+ !net_eq(net, &init_net))
+ goto errout_tbl_lock;
+ }
if (tb[NDTA_THRESH1])
tbl->gc_thresh1 = nla_get_u32(tb[NDTA_THRESH1]);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 707b40f76852..61c1d02a8fad 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -152,38 +152,19 @@ static const struct neigh_ops arp_direct_ops = {
.connected_output = neigh_direct_output,
};
-struct neigh_table arp_tbl = {
- .family = AF_INET,
- .key_len = 4,
- .protocol = cpu_to_be16(ETH_P_IP),
- .hash = arp_hash,
- .key_eq = arp_key_eq,
- .constructor = arp_constructor,
- .proxy_redo = parp_redo,
- .id = "arp_cache",
- .parms = {
- .tbl = &arp_tbl,
- .reachable_time = 30 * HZ,
- .data = {
- [NEIGH_VAR_MCAST_PROBES] = 3,
- [NEIGH_VAR_UCAST_PROBES] = 3,
- [NEIGH_VAR_RETRANS_TIME] = 1 * HZ,
- [NEIGH_VAR_BASE_REACHABLE_TIME] = 30 * HZ,
- [NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ,
- [NEIGH_VAR_GC_STALETIME] = 60 * HZ,
- [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX,
- [NEIGH_VAR_PROXY_QLEN] = 64,
- [NEIGH_VAR_ANYCAST_DELAY] = 1 * HZ,
- [NEIGH_VAR_PROXY_DELAY] = (8 * HZ) / 10,
- [NEIGH_VAR_LOCKTIME] = 1 * HZ,
- },
- },
- .gc_interval = 30 * HZ,
- .gc_thresh1 = 128,
- .gc_thresh2 = 512,
- .gc_thresh3 = 1024,
+static int parms_data[NEIGH_VAR_DATA_MAX] = {
+ [NEIGH_VAR_MCAST_PROBES] = 3,
+ [NEIGH_VAR_UCAST_PROBES] = 3,
+ [NEIGH_VAR_RETRANS_TIME] = 1 * HZ,
+ [NEIGH_VAR_BASE_REACHABLE_TIME] = 30 * HZ,
+ [NEIGH_VAR_DELAY_PROBE_TIME] = 5 * HZ,
+ [NEIGH_VAR_GC_STALETIME] = 60 * HZ,
+ [NEIGH_VAR_QUEUE_LEN_BYTES] = SK_WMEM_MAX,
+ [NEIGH_VAR_PROXY_QLEN] = 64,
+ [NEIGH_VAR_ANYCAST_DELAY] = 1 * HZ,
+ [NEIGH_VAR_PROXY_DELAY] = (8 * HZ) / 10,
+ [NEIGH_VAR_LOCKTIME] = 1 * HZ,
};
-EXPORT_SYMBOL(arp_tbl);
int arp_mc_map(__be32 addr, u8 *haddr, struct net_device *dev, int dir)
{
@@ -1291,13 +1272,8 @@ static int arp_proc_init(void);
void __init arp_init(void)
{
- neigh_table_init(&init_net, &arp_tbl);
-
dev_add_pack(&arp_packet_type);
arp_proc_init();
-#ifdef CONFIG_SYSCTL
- neigh_sysctl_register(NULL, &arp_tbl.parms, NULL);
-#endif
register_netdevice_notifier(&arp_netdev_notifier);
}
@@ -1426,15 +1402,53 @@ static const struct seq_operations arp_seq_ops = {
static int __net_init arp_net_init(struct net *net)
{
+ struct neigh_table *arp_tbl;
+
+ arp_tbl = kzalloc(sizeof(*arp_tbl), GFP_KERNEL);
+ if (!arp_tbl)
+ return -ENOMEM;
+
if (!proc_create_net("arp", 0444, net->proc_net, &arp_seq_ops,
- sizeof(struct neigh_seq_state)))
+ sizeof(struct neigh_seq_state))) {
+ kfree(arp_tbl);
return -ENOMEM;
+ }
+
+ arp_tbl->family = AF_INET;
+ arp_tbl->key_len = 4;
+ arp_tbl->protocol = cpu_to_be16(ETH_P_IP);
+ arp_tbl->hash = arp_hash;
+ arp_tbl->key_eq = arp_key_eq;
+ arp_tbl->constructor = arp_constructor;
+ arp_tbl->proxy_redo = parp_redo;
+ arp_tbl->id = "arp_cache";
+ arp_tbl->gc_interval = 30 * HZ;
+ arp_tbl->gc_thresh1 = 128;
+ arp_tbl->gc_thresh2 = 512;
+ arp_tbl->gc_thresh3 = 1024;
+
+ arp_tbl->parms.tbl = arp_tbl;
+ arp_tbl->parms.reachable_time = 30 * HZ;
+ memcpy(arp_tbl->parms.data, parms_data, sizeof(parms_data));
+
+ neigh_table_init(net, arp_tbl);
+
+#ifdef CONFIG_SYSCTL
+ neigh_sysctl_register(NULL, &arp_tbl->parms, NULL);
+#endif
return 0;
}
static void __net_exit arp_net_exit(struct net *net)
{
+ struct neigh_table *arp_tbl = ipv4_neigh_table(net);
+
remove_proc_entry("arp", net->proc_net);
+#ifdef CONFIG_SYSCTL
+ neigh_sysctl_unregister(&arp_tbl->parms);
+#endif
+ neigh_table_clear(net, arp_tbl);
+ kfree(arp_tbl);
}
static struct pernet_operations arp_net_ops = {
--
2.11.0
From: David Ahern <[email protected]>
Convert neigh_table_init and neigh_table_clear to derive the table slot
from the table's address family instead of taking a NEIGH_*_TABLE index.
The macros carry the same information as the family and only serve as an
index into neigh_tables, so the functions can do that mapping internally.
Further, add net to the argument list of both functions and remove their
dependence on init_net.
Signed-off-by: David Ahern <[email protected]>
---
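Note for readers: the family-to-slot mapping described above can be
illustrated with a minimal user-space sketch. This is not kernel code. All
sketch_* names are hypothetical stand-ins, and unlike the patch (which
silently ignores an unknown family) the sketch returns -1 so the behaviour
is visible to the caller.

```c
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical stand-ins for the kernel's NEIGH_*_TABLE slot indices. */
enum sketch_slot { SKETCH_ARP_SLOT, SKETCH_ND_SLOT, SKETCH_NR_SLOTS };

/* Minimal stand-in for struct neigh_table: only the family matters here. */
struct sketch_table {
	int family;
	const char *id;
};

static struct sketch_table *sketch_tables[SKETCH_NR_SLOTS];

/* Map an address family to its slot, or return -1 for an unknown family. */
static int sketch_family_to_slot(int family)
{
	switch (family) {
	case AF_INET:
		return SKETCH_ARP_SLOT;
	case AF_INET6:
		return SKETCH_ND_SLOT;
	default:
		return -1;
	}
}

/* Register a table; the caller no longer passes a slot index. */
static int sketch_table_init(struct sketch_table *tbl)
{
	int slot = sketch_family_to_slot(tbl->family);

	if (slot < 0)
		return -1;
	sketch_tables[slot] = tbl;
	return 0;
}

/* Unregister a table, again keyed only by tbl->family. */
static int sketch_table_clear(struct sketch_table *tbl)
{
	int slot = sketch_family_to_slot(tbl->family);

	if (slot < 0)
		return -1;
	sketch_tables[slot] = NULL;
	return 0;
}

/* Look a table up by family; NULL if none is registered. */
static struct sketch_table *sketch_find_table(int family)
{
	int slot = sketch_family_to_slot(family);

	return slot < 0 ? NULL : sketch_tables[slot];
}
```

The point of the design is that callers only need to fill in tbl->family
once; the index is an internal detail that can later be replaced by a
per-namespace pointer without touching the callers again.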
include/net/neighbour.h | 4 ++--
net/core/neighbour.c | 40 ++++++++++++++++++++++++++++++++--------
net/decnet/dn_neigh.c | 4 ++--
net/ipv4/arp.c | 2 +-
net/ipv6/ndisc.c | 4 ++--
5 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 6cf9ce16eac8..b70afea05f86 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -304,8 +304,8 @@ static inline struct neighbour *__neigh_lookup_noref(struct neigh_table *tbl,
return ___neigh_lookup_noref(tbl, tbl->key_eq, tbl->hash, pkey, dev);
}
-void neigh_table_init(int index, struct neigh_table *tbl);
-int neigh_table_clear(int index, struct neigh_table *tbl);
+void neigh_table_init(struct net *net, struct neigh_table *tbl);
+int neigh_table_clear(struct net *net, struct neigh_table *tbl);
struct neighbour *neigh_lookup(struct neigh_table *tbl, const void *pkey,
struct net_device *dev);
struct neighbour *neigh_lookup_nodev(struct neigh_table *tbl, struct net *net,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 41841d8e4ea4..8bdaeb080ce4 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1548,14 +1548,15 @@ static struct lock_class_key neigh_table_proxy_queue_class;
static struct neigh_table *neigh_tables[NEIGH_NR_TABLES] __read_mostly;
-void neigh_table_init(int index, struct neigh_table *tbl)
+void neigh_table_init(struct net *net, struct neigh_table *tbl)
{
unsigned long now = jiffies;
+ u8 family = tbl->family;
unsigned long phsize;
INIT_LIST_HEAD(&tbl->parms_list);
list_add(&tbl->parms.list, &tbl->parms_list);
- write_pnet(&tbl->parms.net, &init_net);
+ write_pnet(&tbl->parms.net, net);
refcount_set(&tbl->parms.refcnt, 1);
tbl->parms.reachable_time =
neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
@@ -1565,8 +1566,8 @@ void neigh_table_init(int index, struct neigh_table *tbl)
panic("cannot create neighbour cache statistics");
#ifdef CONFIG_PROC_FS
- if (!proc_create_seq_data(tbl->id, 0, init_net.proc_net_stat,
- &neigh_stat_seq_ops, tbl))
+ if (!proc_create_seq_data(tbl->id, 0, net->proc_net_stat,
+ &neigh_stat_seq_ops, tbl))
panic("cannot create neighbour proc dir entry");
#endif
@@ -1595,13 +1596,36 @@ void neigh_table_init(int index, struct neigh_table *tbl)
tbl->last_flush = now;
tbl->last_rand = now + tbl->parms.reachable_time * 20;
- neigh_tables[index] = tbl;
+ switch (family) {
+ case AF_INET:
+ neigh_tables[NEIGH_ARP_TABLE] = tbl;
+ break;
+ case AF_INET6:
+ neigh_tables[NEIGH_ND_TABLE] = tbl;
+ break;
+ case AF_DECnet:
+ neigh_tables[NEIGH_DN_TABLE] = tbl;
+ break;
+ }
}
EXPORT_SYMBOL(neigh_table_init);
-int neigh_table_clear(int index, struct neigh_table *tbl)
+int neigh_table_clear(struct net *net, struct neigh_table *tbl)
{
- neigh_tables[index] = NULL;
+ u8 family = tbl->family;
+
+ switch (family) {
+ case AF_INET:
+ neigh_tables[NEIGH_ARP_TABLE] = NULL;
+ break;
+ case AF_INET6:
+ neigh_tables[NEIGH_ND_TABLE] = NULL;
+ break;
+ case AF_DECnet:
+ neigh_tables[NEIGH_DN_TABLE] = NULL;
+ break;
+ }
+
/* It is not clean... Fix it to unload IPv6 module safely */
cancel_delayed_work_sync(&tbl->gc_work);
del_timer_sync(&tbl->proxy_timer);
@@ -1617,7 +1641,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
kfree(tbl->phash_buckets);
tbl->phash_buckets = NULL;
- remove_proc_entry(tbl->id, init_net.proc_net_stat);
+ remove_proc_entry(tbl->id, net->proc_net_stat);
free_percpu(tbl->stats);
tbl->stats = NULL;
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 74112777beb0..6d5078ddffac 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -593,7 +593,7 @@ static const struct seq_operations dn_neigh_seq_ops = {
void __init dn_neigh_init(void)
{
- neigh_table_init(NEIGH_DN_TABLE, &dn_neigh_table);
+ neigh_table_init(&init_net, &dn_neigh_table);
proc_create_net("decnet_neigh", 0444, init_net.proc_net,
&dn_neigh_seq_ops, sizeof(struct neigh_seq_state));
}
@@ -601,5 +601,5 @@ void __init dn_neigh_init(void)
void __exit dn_neigh_cleanup(void)
{
remove_proc_entry("decnet_neigh", init_net.proc_net);
- neigh_table_clear(NEIGH_DN_TABLE, &dn_neigh_table);
+ neigh_table_clear(&init_net, &dn_neigh_table);
}
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 7b27faefa01b..707b40f76852 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1291,7 +1291,7 @@ static int arp_proc_init(void);
void __init arp_init(void)
{
- neigh_table_init(NEIGH_ARP_TABLE, &arp_tbl);
+ neigh_table_init(&init_net, &arp_tbl);
dev_add_pack(&arp_packet_type);
arp_proc_init();
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 5103d8641b04..6105530fe865 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1908,7 +1908,7 @@ int __init ndisc_init(void)
/*
* Initialize the neighbour table
*/
- neigh_table_init(NEIGH_ND_TABLE, &nd_tbl);
+ neigh_table_init(&init_net, &nd_tbl);
#ifdef CONFIG_SYSCTL
err = neigh_sysctl_register(NULL, &nd_tbl.parms,
@@ -1941,6 +1941,6 @@ void ndisc_cleanup(void)
#ifdef CONFIG_SYSCTL
neigh_sysctl_unregister(&nd_tbl.parms);
#endif
- neigh_table_clear(NEIGH_ND_TABLE, &nd_tbl);
+ neigh_table_clear(&init_net, &nd_tbl);
unregister_pernet_subsys(&ndisc_net_ops);
}
--
2.11.0
From: David Ahern <[email protected]>
Create a helper, ipv4_neigh_table, for retrieving a reference to the arp
table. Add additional wrappers for commonly used neigh_* functions to
avoid propagating the ipv4_neigh_table lookup all over the code.
Signed-off-by: David Ahern <[email protected]>
---
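Note for readers: the wrapper pattern can be shown with a small user-space
sketch. It is not kernel code and assumes the end state of the series, where
each namespace owns its own table; at this point in the series the helper
still resolves to the single global table. All sketch_* names are
hypothetical stand-ins for struct net, struct net_device and struct
neigh_table.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ENTRIES 8

/* Stand-in for struct neigh_table: a tiny fixed-size IPv4 cache. */
struct sketch_table {
	uint32_t keys[SKETCH_MAX_ENTRIES];
	int entries;
};

/* Stand-in for struct net: each namespace points at its own table. */
struct sketch_net {
	struct sketch_table *arp_tbl;
};

/* Stand-in for struct net_device: a device belongs to one namespace. */
struct sketch_dev {
	struct sketch_net *net;
};

/* Counterpart of ipv4_neigh_table(): resolve the table from the namespace. */
static struct sketch_table *sketch_ipv4_table(struct sketch_net *net)
{
	return net->arp_tbl;
}

/* Counterpart of ipv4_neigh_lookup(): returns the entry index or -1. */
static int sketch_ipv4_lookup(const struct sketch_dev *dev, uint32_t key)
{
	struct sketch_table *tbl = sketch_ipv4_table(dev->net);
	int i;

	for (i = 0; i < tbl->entries; i++)
		if (tbl->keys[i] == key)
			return i;
	return -1;
}

/* Counterpart of ipv4_neigh_create(): returns the index, or -1 when full. */
static int sketch_ipv4_create(const struct sketch_dev *dev, uint32_t key)
{
	struct sketch_table *tbl = sketch_ipv4_table(dev->net);
	int i = sketch_ipv4_lookup(dev, key);

	if (i >= 0)
		return i;
	if (tbl->entries == SKETCH_MAX_ENTRIES)
		return -1;
	tbl->keys[tbl->entries] = key;
	return tbl->entries++;
}
```

Because callers pass only the device, an entry created through a device in
one namespace is not visible through a device in another, and each table
keeps its own entry count.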
include/net/arp.h | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/include/net/arp.h b/include/net/arp.h
index 977aabfcdc03..7b503bedd9fb 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -10,6 +10,29 @@
extern struct neigh_table arp_tbl;
+static inline struct neigh_table *ipv4_neigh_table(struct net *net)
+{
+ return neigh_find_table(net, AF_INET);
+}
+
+static inline struct neighbour *ipv4_neigh_create(struct net_device *dev,
+ const void *pkey)
+{
+ return neigh_create(ipv4_neigh_table(dev_net(dev)), pkey, dev);
+}
+
+static inline struct neighbour *ipv4_neigh_create_noref(struct net_device *dev,
+ const void *pkey)
+{
+ return __neigh_create(ipv4_neigh_table(dev_net(dev)), pkey, dev, false);
+}
+
+static inline struct neighbour *ipv4_neigh_lookup(struct net_device *dev,
+ void *key)
+{
+ return neigh_lookup(ipv4_neigh_table(dev_net(dev)), key, dev);
+}
+
static inline u32 arp_hashfn(const void *pkey, const struct net_device *dev, u32 *hash_rnd)
{
u32 key = *(const u32 *)pkey;
@@ -23,7 +46,8 @@ static inline struct neighbour *__ipv4_neigh_lookup_noref(struct net_device *dev
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
key = INADDR_ANY;
- return ___neigh_lookup_noref(&arp_tbl, neigh_key_eq32, arp_hashfn, &key, dev);
+ return ___neigh_lookup_noref(ipv4_neigh_table(dev_net(dev)),
+ neigh_key_eq32, arp_hashfn, &key, dev);
}
static inline struct neighbour *__ipv4_neigh_lookup(struct net_device *dev, u32 key)
--
2.11.0
From: David Ahern <[email protected]>
Create a helper, ipv6_neigh_table, for retrieving a reference to the ipv6
neighbor table. Add additional wrappers for commonly used neigh_* functions
to avoid propagating the ipv6_neigh_table lookup all over the code.
For ipv6, the neighbor table may not exist (e.g., IPv6 not enabled at
build time or the module is not loaded) so NULL checks are needed before
use.
Signed-off-by: David Ahern <[email protected]>
---
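Note for readers: the NULL-check requirement can be illustrated with a
minimal user-space sketch. It is not kernel code; the sketch_* types are
hypothetical, and a NULL nd_tbl pointer models the case where IPv6 is not
built in or the module is not loaded.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_KEY_LEN 16	/* bytes in an IPv6 address */
#define SKETCH_MAX_ENTRIES 4

/* Stand-in for struct neighbour: only the key is kept. */
struct sketch_neigh {
	unsigned char addr[SKETCH_KEY_LEN];
};

/* Stand-in for struct neigh_table. */
struct sketch_table {
	struct sketch_neigh entries[SKETCH_MAX_ENTRIES];
	int count;
};

/* Stand-in for struct net; nd_tbl is NULL when IPv6 is unavailable. */
struct sketch_net {
	struct sketch_table *nd_tbl;
};

/* Stand-in for struct net_device. */
struct sketch_dev {
	struct sketch_net *net;
};

/* Counterpart of ipv6_neigh_table(): may legitimately return NULL. */
static struct sketch_table *sketch_ipv6_table(const struct sketch_net *net)
{
	return net->nd_tbl;
}

/* Counterpart of ipv6_neigh_lookup(): guard the table before using it. */
static struct sketch_neigh *sketch_ipv6_lookup(const struct sketch_dev *dev,
					       const void *pkey)
{
	struct sketch_table *tbl = sketch_ipv6_table(dev->net);
	struct sketch_neigh *n = NULL;
	int i;

	if (!tbl)
		return NULL;

	for (i = 0; i < tbl->count; i++) {
		if (!memcmp(tbl->entries[i].addr, pkey, SKETCH_KEY_LEN)) {
			n = &tbl->entries[i];
			break;
		}
	}
	return n;
}
```

Keeping the guard inside the wrapper means callers handle a missing table
exactly like a missing entry, so they do not need a separate check.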
include/net/ndisc.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 65 insertions(+), 4 deletions(-)
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index ddfbb591e2c5..078951ac54fd 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -374,17 +374,70 @@ static inline u32 ndisc_hashfn(const void *pkey, const struct net_device *dev, _
(p32[3] * hash_rnd[3]));
}
-static inline struct neighbour *__ipv6_neigh_lookup_noref(struct net_device *dev, const void *pkey)
+static inline struct neigh_table *ipv6_neigh_table(struct net *net)
{
- return ___neigh_lookup_noref(&nd_tbl, neigh_key_eq128, ndisc_hashfn, pkey, dev);
+ return neigh_find_table(net, AF_INET6);
}
-static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev, const void *pkey)
+static inline struct neighbour *ipv6_neigh_create(struct net_device *dev,
+ const void *pkey,
+ bool want_ref)
+{
+ struct neigh_table *tbl = ipv6_neigh_table(dev_net(dev));
+ struct neighbour *n = NULL;
+
+ if (tbl)
+ n = __neigh_create(tbl, pkey, dev, want_ref);
+
+ return n;
+}
+
+static inline struct neighbour *ipv6_neigh_lookup(struct net_device *dev,
+ const void *pkey)
+{
+ struct neigh_table *tbl = ipv6_neigh_table(dev_net(dev));
+ struct neighbour *n = NULL;
+
+ if (tbl)
+ n = neigh_lookup(tbl, pkey, dev);
+
+ return n;
+}
+
+static inline struct neighbour *__ipv6_neigh_lookup(struct net_device *dev,
+ const void *pkey, int creat)
+{
+ struct neigh_table *tbl = ipv6_neigh_table(dev_net(dev));
+ struct neighbour *n = NULL;
+
+ if (tbl)
+ n = __neigh_lookup(tbl, pkey, dev, creat);
+
+ return n;
+}
+
+static inline
+struct neighbour *___ipv6_neigh_lookup_noref(struct net_device *dev,
+ const void *pkey)
+{
+ struct neigh_table *tbl = ipv6_neigh_table(dev_net(dev));
+ struct neighbour *n = NULL;
+
+ if (tbl)
+ n = ___neigh_lookup_noref(tbl, neigh_key_eq128, ndisc_hashfn,
+ pkey, dev);
+
+ return n;
+}
+
+static inline
+struct neighbour *__ipv6_neigh_lookup_noref(struct net_device *dev,
+ const void *pkey)
{
struct neighbour *n;
rcu_read_lock_bh();
- n = __ipv6_neigh_lookup_noref(dev, pkey);
+ n = ___ipv6_neigh_lookup_noref(dev, pkey);
if (n && !refcount_inc_not_zero(&n->refcnt))
n = NULL;
rcu_read_unlock_bh();
@@ -409,6 +462,14 @@ static inline void __ipv6_confirm_neigh(struct net_device *dev,
rcu_read_unlock_bh();
}
+static inline struct pneigh_entry *ipv6_pneigh_lookup(struct net *net,
+ const void *key,
+ struct net_device *dev,
+ int creat)
+{
+ return pneigh_lookup(ipv6_neigh_table(net), net, key, dev, creat);
+}
+
int ndisc_init(void);
int ndisc_late_init(void);
--
2.11.0
From: David Ahern <[email protected]>
Fix up the last two users of the nd_tbl entry in the ipv6 stub and remove it.
Signed-off-by: David Ahern <[email protected]>
---
include/net/addrconf.h | 1 -
net/bridge/br_arp_nd_proxy.c | 2 +-
net/core/filter.c | 3 +--
net/ipv6/af_inet6.c | 1 -
4 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 5f43f7a70fe6..3108e9d6a066 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -256,7 +256,6 @@ struct ipv6_stub {
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
const struct in6_addr *solicited_addr,
bool router, bool solicited, bool override, bool inc_opt);
- struct neigh_table *nd_tbl;
};
extern const struct ipv6_stub *ipv6_stub __read_mostly;
diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 29a1e25fc169..9dd2a5dd721b 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -433,7 +433,7 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
return;
}
- n = neigh_lookup(ipv6_stub->nd_tbl, &msg->target, vlandev);
+ n = ipv6_neigh_lookup(vlandev, &msg->target);
if (n) {
struct net_bridge_fdb_entry *f;
diff --git a/net/core/filter.c b/net/core/filter.c
index b9ec916f4e3a..11d4af3886be 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4290,8 +4290,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
* not needed here. Can not use __ipv6_neigh_lookup_noref here
* because we need to get nd_tbl via the stub
*/
- neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
- ndisc_hashfn, dst, dev);
+ neigh = ___ipv6_neigh_lookup_noref(dev, dst);
if (!neigh)
return BPF_FIB_LKUP_RET_NO_NEIGH;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index c9535354149f..9e7d0f3d4edb 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -898,7 +898,6 @@ static const struct ipv6_stub ipv6_stub_impl = {
.ip6_mtu_from_fib6 = ip6_mtu_from_fib6,
.udpv6_encap_enable = udpv6_encap_enable,
.ndisc_send_na = ndisc_send_na,
- .nd_tbl = &nd_tbl,
};
static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = {
--
2.11.0
From: David Ahern <[email protected]>
Convert existing uses of arp_tbl to the helpers introduced in the previous
patch.
Signed-off-by: David Ahern <[email protected]>
---
net/bridge/br_arp_nd_proxy.c | 2 +-
net/ipv4/arp.c | 36 ++++++++++++++++++++----------------
net/ipv4/devinet.c | 8 ++++----
net/ipv4/fib_semantics.c | 2 +-
net/ipv4/ip_output.c | 2 +-
net/ipv4/route.c | 4 ++--
6 files changed, 29 insertions(+), 25 deletions(-)
diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 2cf7716254be..29a1e25fc169 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -183,7 +183,7 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
return;
}
- n = neigh_lookup(&arp_tbl, &tip, vlandev);
+ n = ipv4_neigh_lookup(vlandev, &tip);
if (n) {
struct net_bridge_fdb_entry *f;
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index e90c89ef8c08..fd4a380da9bb 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -678,6 +678,7 @@ static bool arp_is_garp(struct net *net, struct net_device *dev,
static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
{
+ struct neigh_table *tbl = ipv4_neigh_table(net);
struct net_device *dev = skb->dev;
struct in_device *in_dev = __in_dev_get_rcu(dev);
struct arphdr *arp;
@@ -827,7 +828,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
if (!dont_send && IN_DEV_ARPFILTER(in_dev))
dont_send = arp_filter(sip, tip, dev);
if (!dont_send) {
- n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
+ n = neigh_event_ns(tbl, sha, &sip, dev);
if (n) {
arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
sip, dev, tip, sha,
@@ -842,8 +843,8 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
(arp_fwd_proxy(in_dev, dev, rt) ||
arp_fwd_pvlan(in_dev, dev, rt, sip, tip) ||
(rt->dst.dev != dev &&
- pneigh_lookup(&arp_tbl, net, &tip, dev, 0)))) {
- n = neigh_event_ns(&arp_tbl, sha, &sip, dev);
+ pneigh_lookup(tbl, net, &tip, dev, 0)))) {
+ n = neigh_event_ns(tbl, sha, &sip, dev);
if (n)
neigh_release(n);
@@ -855,7 +856,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
dev->dev_addr, sha,
reply_dst);
} else {
- pneigh_enqueue(&arp_tbl,
+ pneigh_enqueue(tbl,
in_dev->arp_parms, skb);
goto out_free_dst;
}
@@ -866,7 +867,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
/* Update our ARP tables */
- n = __neigh_lookup(&arp_tbl, &sip, dev, 0);
+ n = __neigh_lookup(tbl, &sip, dev, 0);
addr_type = -1;
if (n || IN_DEV_ARP_ACCEPT(in_dev)) {
@@ -887,7 +888,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
/* postpone calculation to as late as possible */
inet_addr_type_dev_table(net, dev, sip) ==
RTN_UNICAST)))))
- n = __neigh_lookup(&arp_tbl, &sip, dev, 1);
+ n = __neigh_lookup(tbl, &sip, dev, 1);
}
if (n) {
@@ -1011,7 +1012,7 @@ static int arp_req_set_public(struct net *net, struct arpreq *r,
return -ENODEV;
}
if (mask) {
- if (!pneigh_lookup(&arp_tbl, net, &ip, dev, 1))
+ if (!pneigh_lookup(ipv4_neigh_table(net), net, &ip, dev, 1))
return -ENOBUFS;
return 0;
}
@@ -1063,7 +1064,7 @@ static int arp_req_set(struct net *net, struct arpreq *r,
break;
}
- neigh = __neigh_lookup_errno(&arp_tbl, &ip, dev);
+ neigh = __neigh_lookup_errno(ipv4_neigh_table(net), &ip, dev);
err = PTR_ERR(neigh);
if (!IS_ERR(neigh)) {
unsigned int state = NUD_STALE;
@@ -1098,7 +1099,7 @@ static int arp_req_get(struct arpreq *r, struct net_device *dev)
struct neighbour *neigh;
int err = -ENXIO;
- neigh = neigh_lookup(&arp_tbl, &ip, dev);
+ neigh = ipv4_neigh_lookup(dev, &ip);
if (neigh) {
if (!(neigh->nud_state & NUD_NOARP)) {
read_lock_bh(&neigh->lock);
@@ -1116,9 +1117,9 @@ static int arp_req_get(struct arpreq *r, struct net_device *dev)
static int arp_invalidate(struct net_device *dev, __be32 ip)
{
- struct neighbour *neigh = neigh_lookup(&arp_tbl, &ip, dev);
+ struct neigh_table *tbl = ipv4_neigh_table(dev_net(dev));
+ struct neighbour *neigh = neigh_lookup(tbl, &ip, dev);
int err = -ENXIO;
- struct neigh_table *tbl = &arp_tbl;
if (neigh) {
if (neigh->nud_state & ~NUD_NOARP)
@@ -1141,7 +1142,7 @@ static int arp_req_delete_public(struct net *net, struct arpreq *r,
__be32 mask = ((struct sockaddr_in *)&r->arp_netmask)->sin_addr.s_addr;
if (mask == htonl(0xFFFFFFFF))
- return pneigh_delete(&arp_tbl, net, &ip, dev);
+ return pneigh_delete(ipv4_neigh_table(net), net, &ip, dev);
if (mask)
return -EINVAL;
@@ -1248,13 +1249,13 @@ static int arp_netdev_event(struct notifier_block *this, unsigned long event,
switch (event) {
case NETDEV_CHANGEADDR:
- neigh_changeaddr(&arp_tbl, dev);
+ neigh_changeaddr(ipv4_neigh_table(dev_net(dev)), dev);
rt_cache_flush(dev_net(dev));
break;
case NETDEV_CHANGE:
change_info = ptr;
if (change_info->flags_changed & IFF_NOARP)
- neigh_changeaddr(&arp_tbl, dev);
+ neigh_changeaddr(ipv4_neigh_table(dev_net(dev)), dev);
break;
default:
break;
@@ -1273,7 +1274,7 @@ static struct notifier_block arp_netdev_notifier = {
*/
void arp_ifdown(struct net_device *dev)
{
- neigh_ifdown(&arp_tbl, dev);
+ neigh_ifdown(ipv4_neigh_table(dev_net(dev)), dev);
}
@@ -1403,10 +1404,13 @@ static int arp_seq_show(struct seq_file *seq, void *v)
static void *arp_seq_start(struct seq_file *seq, loff_t *pos)
{
+ struct net *net = seq_file_net(seq);
+
/* Don't want to confuse "arp -a" w/ magic entries,
* so we tell the generic iterator to skip NUD_NOARP.
*/
- return neigh_seq_start(seq, pos, &arp_tbl, NEIGH_SEQ_SKIP_NOARP);
+ return neigh_seq_start(seq, pos, ipv4_neigh_table(net),
+ NEIGH_SEQ_SKIP_NOARP);
}
/* ------------------------------------------------------------------------ */
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index d7585ab1a77a..07a57fd1a343 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -239,6 +239,7 @@ EXPORT_SYMBOL(in_dev_finish_destroy);
static struct in_device *inetdev_init(struct net_device *dev)
{
+ struct net *net = dev_net(dev);
struct in_device *in_dev;
int err = -ENOMEM;
@@ -247,11 +248,10 @@ static struct in_device *inetdev_init(struct net_device *dev)
in_dev = kzalloc(sizeof(*in_dev), GFP_KERNEL);
if (!in_dev)
goto out;
- memcpy(&in_dev->cnf, dev_net(dev)->ipv4.devconf_dflt,
- sizeof(in_dev->cnf));
+ memcpy(&in_dev->cnf, net->ipv4.devconf_dflt, sizeof(in_dev->cnf));
in_dev->cnf.sysctl = NULL;
in_dev->dev = dev;
- in_dev->arp_parms = neigh_parms_alloc(dev, &arp_tbl);
+ in_dev->arp_parms = neigh_parms_alloc(dev, ipv4_neigh_table(net));
if (!in_dev->arp_parms)
goto out_kfree;
if (IPV4_DEVCONF(in_dev->cnf, FORWARDING))
@@ -309,7 +309,7 @@ static void inetdev_destroy(struct in_device *in_dev)
RCU_INIT_POINTER(dev->ip_ptr, NULL);
devinet_sysctl_unregister(in_dev);
- neigh_parms_release(&arp_tbl, in_dev->arp_parms);
+ neigh_parms_release(ipv4_neigh_table(dev_net(dev)), in_dev->arp_parms);
arp_ifdown(dev);
call_rcu(&in_dev->rcu_head, in_dev_rcu_put);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f3c89ccf14c5..d91cf61e044e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -440,7 +440,7 @@ static int fib_detect_death(struct fib_info *fi, int order,
struct neighbour *n;
int state = NUD_NONE;
- n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
+ n = ipv4_neigh_lookup(fi->fib_dev, &fi->fib_nh[0].nh_gw);
if (n) {
state = n->nud_state;
neigh_release(n);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e2b6bd478afb..0e880d4b859e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -221,7 +221,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
- neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
+ neigh = ipv4_neigh_create_noref(dev, &nexthop);
if (!IS_ERR(neigh)) {
int res;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 74e1df60ab7f..56dfa77c19ab 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -448,7 +448,7 @@ static struct neighbour *ipv4_dst_neigh_lookup(const struct dst_entry *dst,
n = __ipv4_neigh_lookup(dev, *(__force u32 *)pkey);
if (n)
return n;
- return neigh_create(&arp_tbl, pkey, dev);
+ return ipv4_neigh_create(dev, pkey);
}
static void ipv4_confirm_neigh(const struct dst_entry *dst, const void *daddr)
@@ -770,7 +770,7 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
n = __ipv4_neigh_lookup(rt->dst.dev, new_gw);
if (!n)
- n = neigh_create(&arp_tbl, &new_gw, rt->dst.dev);
+ n = ipv4_neigh_create(rt->dst.dev, &new_gw);
if (!IS_ERR(n)) {
if (!(n->nud_state & NUD_VALID)) {
neigh_event_send(n, NULL);
--
2.11.0
From: David Ahern <[email protected]>
Remove the extern declarations of arp_tbl and nd_tbl from the headers.
There are no more users at this point.
Signed-off-by: David Ahern <[email protected]>
---
include/net/arp.h | 3 ---
include/net/ndisc.h | 2 --
2 files changed, 5 deletions(-)
diff --git a/include/net/arp.h b/include/net/arp.h
index 7b503bedd9fb..fae3561db10b 100644
--- a/include/net/arp.h
+++ b/include/net/arp.h
@@ -7,9 +7,6 @@
#include <linux/hash.h>
#include <net/neighbour.h>
-
-extern struct neigh_table arp_tbl;
-
static inline struct neigh_table *ipv4_neigh_table(struct net *net)
{
return neigh_find_table(net, AF_INET);
diff --git a/include/net/ndisc.h b/include/net/ndisc.h
index 078951ac54fd..6fc58a61acdd 100644
--- a/include/net/ndisc.h
+++ b/include/net/ndisc.h
@@ -72,8 +72,6 @@ struct net_proto_family;
struct sk_buff;
struct prefix_info;
-extern struct neigh_table nd_tbl;
-
struct nd_msg {
struct icmp6hdr icmph;
struct in6_addr target;
--
2.11.0
From: David Ahern <[email protected]>
Convert existing uses of nd_tbl to the helpers introduced in the previous
patch.
Signed-off-by: David Ahern <[email protected]>
---
net/ipv6/addrconf.c | 16 ++++++++++------
net/ipv6/ip6_output.c | 4 ++--
net/ipv6/ndisc.c | 41 +++++++++++++++++++++++------------------
net/ipv6/route.c | 12 +++++++-----
4 files changed, 42 insertions(+), 31 deletions(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1659a6b3cf42..104bf4c6c1f9 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -365,6 +365,8 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
{
+ struct net *net = dev_net(dev);
+ struct neigh_table *neigh_tbl = ipv6_neigh_table(net);
struct inet6_dev *ndev;
int err = -ENOMEM;
@@ -381,13 +383,13 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
ndev->dev = dev;
INIT_LIST_HEAD(&ndev->addr_list);
timer_setup(&ndev->rs_timer, addrconf_rs_timer, 0);
- memcpy(&ndev->cnf, dev_net(dev)->ipv6.devconf_dflt, sizeof(ndev->cnf));
+ memcpy(&ndev->cnf, net->ipv6.devconf_dflt, sizeof(ndev->cnf));
if (ndev->cnf.stable_secret.initialized)
ndev->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
ndev->cnf.mtu6 = dev->mtu;
- ndev->nd_parms = neigh_parms_alloc(dev, &nd_tbl);
+ ndev->nd_parms = neigh_parms_alloc(dev, neigh_tbl);
if (!ndev->nd_parms) {
kfree(ndev);
return ERR_PTR(err);
@@ -400,7 +402,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
if (snmp6_alloc_dev(ndev) < 0) {
netdev_dbg(dev, "%s: cannot allocate memory for statistics\n",
__func__);
- neigh_parms_release(&nd_tbl, ndev->nd_parms);
+ neigh_parms_release(neigh_tbl, ndev->nd_parms);
dev_put(dev);
kfree(ndev);
return ERR_PTR(err);
@@ -465,7 +467,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
return ndev;
err_release:
- neigh_parms_release(&nd_tbl, ndev->nd_parms);
+ neigh_parms_release(neigh_tbl, ndev->nd_parms);
ndev->dead = 1;
in6_dev_finish_destroy(ndev);
return ERR_PTR(err);
@@ -3787,9 +3789,11 @@ static int addrconf_ifdown(struct net_device *dev, int how)
/* Last: Shot the device (if unregistered) */
if (how) {
+ struct neigh_table *tbl = ipv6_neigh_table(net);
+
addrconf_sysctl_unregister(idev);
- neigh_parms_release(&nd_tbl, idev->nd_parms);
- neigh_ifdown(&nd_tbl, dev);
+ neigh_parms_release(tbl, idev->nd_parms);
+ neigh_ifdown(tbl, dev);
in6_dev_put(idev);
}
return 0;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8047fd41ba88..05e11f71da49 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -114,7 +114,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
if (unlikely(!neigh))
- neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+ neigh = ipv6_neigh_create(dst->dev, nexthop, false);
if (!IS_ERR(neigh)) {
sock_confirm_neigh(skb, neigh);
ret = neigh_output(neigh, skb);
@@ -462,7 +462,7 @@ int ip6_forward(struct sk_buff *skb)
/* XXX: idev->cnf.proxy_ndp? */
if (net->ipv6.devconf_all->proxy_ndp &&
- pneigh_lookup(&nd_tbl, net, &hdr->daddr, skb->dev, 0)) {
+ ipv6_pneigh_lookup(net, &hdr->daddr, skb->dev, 0)) {
int proxied = ip6_forward_proxy_check(skb);
if (proxied > 0)
return ip6_input(skb);
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index e640d2f3c55c..14b925f36099 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -729,14 +729,16 @@ static void ndisc_solicit(struct neighbour *neigh, struct sk_buff *skb)
static int pndisc_is_router(const void *pkey,
struct net_device *dev)
{
+ struct net *net = dev_net(dev);
+ struct neigh_table *neigh_tbl = ipv6_neigh_table(net);
struct pneigh_entry *n;
int ret = -1;
- read_lock_bh(&nd_tbl.lock);
- n = __pneigh_lookup(&nd_tbl, dev_net(dev), pkey, dev);
+ read_lock_bh(&neigh_tbl->lock);
+ n = __pneigh_lookup(neigh_tbl, net, pkey, dev);
if (n)
ret = !!(n->flags & NTF_ROUTER);
- read_unlock_bh(&nd_tbl.lock);
+ read_unlock_bh(&neigh_tbl->lock);
return ret;
}
@@ -764,6 +766,8 @@ static void ndisc_recv_ns(struct sk_buff *skb)
struct inet6_dev *idev = NULL;
struct neighbour *neigh;
int dad = ipv6_addr_any(saddr);
+ struct neigh_table *neigh_tbl;
+ struct net *net;
bool inc;
int is_router = -1;
u64 nonce = 0;
@@ -816,7 +820,10 @@ static void ndisc_recv_ns(struct sk_buff *skb)
inc = ipv6_addr_is_multicast(daddr);
- ifp = ipv6_get_ifaddr(dev_net(dev), &msg->target, dev, 1);
+ net = dev_net(dev);
+ neigh_tbl = ipv6_neigh_table(net);
+
+ ifp = ipv6_get_ifaddr(net, &msg->target, dev, 1);
if (ifp) {
have_ifp:
if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
@@ -851,8 +858,6 @@ static void ndisc_recv_ns(struct sk_buff *skb)
idev = ifp->idev;
} else {
- struct net *net = dev_net(dev);
-
/* perhaps an address on the master device */
if (netif_is_l3_slave(dev)) {
struct net_device *mdev;
@@ -888,7 +893,7 @@ static void ndisc_recv_ns(struct sk_buff *skb)
*/
struct sk_buff *n = skb_clone(skb, GFP_ATOMIC);
if (n)
- pneigh_enqueue(&nd_tbl, idev->nd_parms, n);
+ pneigh_enqueue(neigh_tbl, idev->nd_parms, n);
goto out;
}
} else
@@ -905,15 +910,15 @@ static void ndisc_recv_ns(struct sk_buff *skb)
}
if (inc)
- NEIGH_CACHE_STAT_INC(&nd_tbl, rcv_probes_mcast);
+ NEIGH_CACHE_STAT_INC(neigh_tbl, rcv_probes_mcast);
else
- NEIGH_CACHE_STAT_INC(&nd_tbl, rcv_probes_ucast);
+ NEIGH_CACHE_STAT_INC(neigh_tbl, rcv_probes_ucast);
/*
* update / create cache entry
* for the source address
*/
- neigh = __neigh_lookup(&nd_tbl, saddr, dev,
+ neigh = __neigh_lookup(neigh_tbl, saddr, dev,
!inc || lladdr || !dev->addr_len);
if (neigh)
ndisc_update(dev, neigh, lladdr, NUD_STALE,
@@ -1007,7 +1012,7 @@ static void ndisc_recv_na(struct sk_buff *skb)
in6_ifa_put(ifp);
return;
}
- neigh = neigh_lookup(&nd_tbl, &msg->target, dev);
+ neigh = ipv6_neigh_lookup(dev, &msg->target);
if (neigh) {
u8 old_flags = neigh->flags;
@@ -1023,7 +1028,7 @@ static void ndisc_recv_na(struct sk_buff *skb)
*/
if (lladdr && !memcmp(lladdr, dev->dev_addr, dev->addr_len) &&
net->ipv6.devconf_all->forwarding && net->ipv6.devconf_all->proxy_ndp &&
- pneigh_lookup(&nd_tbl, net, &msg->target, dev, 0)) {
+ ipv6_pneigh_lookup(net, &msg->target, dev, 0)) {
/* XXX: idev->cnf.proxy_ndp */
goto out;
}
@@ -1091,7 +1096,7 @@ static void ndisc_recv_rs(struct sk_buff *skb)
goto out;
}
- neigh = __neigh_lookup(&nd_tbl, saddr, skb->dev, 1);
+ neigh = __ipv6_neigh_lookup(skb->dev, saddr, 1);
if (neigh) {
ndisc_update(skb->dev, neigh, lladdr, NUD_STALE,
NEIGH_UPDATE_F_WEAK_OVERRIDE|
@@ -1384,8 +1389,8 @@ static void ndisc_router_discovery(struct sk_buff *skb)
*/
if (!neigh)
- neigh = __neigh_lookup(&nd_tbl, &ipv6_hdr(skb)->saddr,
- skb->dev, 1);
+ neigh = __ipv6_neigh_lookup(skb->dev, &ipv6_hdr(skb)->saddr, 1);
+
if (neigh) {
u8 *lladdr = NULL;
if (ndopts.nd_opts_src_lladdr) {
@@ -1768,7 +1773,7 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
switch (event) {
case NETDEV_CHANGEADDR:
- neigh_changeaddr(&nd_tbl, dev);
+ neigh_changeaddr(ipv6_neigh_table(dev_net(dev)), dev);
fib6_run_gc(0, net, false);
/* fallthrough */
case NETDEV_UP:
@@ -1783,10 +1788,10 @@ static int ndisc_netdev_event(struct notifier_block *this, unsigned long event,
case NETDEV_CHANGE:
change_info = ptr;
if (change_info->flags_changed & IFF_NOARP)
- neigh_changeaddr(&nd_tbl, dev);
+ neigh_changeaddr(ipv6_neigh_table(dev_net(dev)), dev);
break;
case NETDEV_DOWN:
- neigh_ifdown(&nd_tbl, dev);
+ neigh_ifdown(ipv6_neigh_table(dev_net(dev)), dev);
fib6_run_gc(0, net, false);
break;
case NETDEV_NOTIFY_PEERS:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 86a0e4333d42..17f01b8cb05f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -207,10 +207,10 @@ struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
struct neighbour *n;
daddr = choose_neigh_daddr(gw, skb, daddr);
- n = __ipv6_neigh_lookup(dev, daddr);
+ n = __ipv6_neigh_lookup_noref(dev, daddr);
if (n)
return n;
- return neigh_create(&nd_tbl, daddr, dev);
+ return ipv6_neigh_create(dev, daddr, true);
}
static struct neighbour *ip6_dst_neigh_lookup(const struct dst_entry *dst,
@@ -3392,7 +3392,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
*/
dst_confirm_neigh(&rt->dst, &ipv6_hdr(skb)->saddr);
- neigh = __neigh_lookup(&nd_tbl, &msg->target, skb->dev, 1);
+ neigh = __ipv6_neigh_lookup(skb->dev, &msg->target, 1);
if (!neigh)
return;
@@ -4064,9 +4064,11 @@ void rt6_sync_down_dev(struct net_device *dev, unsigned long event)
void rt6_disable_ip(struct net_device *dev, unsigned long event)
{
+ struct net *net = dev_net(dev);
+
rt6_sync_down_dev(dev, event);
- rt6_uncached_list_flush_dev(dev_net(dev), dev);
- neigh_ifdown(&nd_tbl, dev);
+ rt6_uncached_list_flush_dev(net, dev);
+ neigh_ifdown(ipv6_neigh_table(net), dev);
}
struct rt6_mtu_change_arg {
--
2.11.0
From: David Ahern <[email protected]>
The neighbor code already has an API for accessing neighbor caches by
address family. Export it for use by networking code. Add the
namespace as an input arg and make family a u8 rather than an int (all
existing callers pass ndm_family which is a u8).
Signed-off-by: David Ahern <[email protected]>
---
include/net/neighbour.h | 2 ++
net/core/neighbour.c | 7 ++++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 6c1eecd56a4d..5bc4d79b4b3a 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -229,6 +229,8 @@ enum {
NEIGH_LINK_TABLE = NEIGH_NR_TABLES /* Pseudo table for neigh_xmit */
};
+struct neigh_table *neigh_find_table(struct net *net, u8 family);
+
static inline int neigh_parms_family(struct neigh_parms *p)
{
return p->tbl->family;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index cbe85d8d4cc2..e8630f9de24a 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1625,7 +1625,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
}
EXPORT_SYMBOL(neigh_table_clear);
-static struct neigh_table *neigh_find_table(int family)
+struct neigh_table *neigh_find_table(struct net *net, u8 family)
{
struct neigh_table *tbl = NULL;
@@ -1643,6 +1643,7 @@ static struct neigh_table *neigh_find_table(int family)
return tbl;
}
+EXPORT_SYMBOL(neigh_find_table);
static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
@@ -1672,7 +1673,7 @@ static int neigh_delete(struct sk_buff *skb, struct nlmsghdr *nlh,
}
}
- tbl = neigh_find_table(ndm->ndm_family);
+ tbl = neigh_find_table(net, ndm->ndm_family);
if (tbl == NULL)
return -EAFNOSUPPORT;
@@ -1740,7 +1741,7 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr *nlh,
goto out;
}
- tbl = neigh_find_table(ndm->ndm_family);
+ tbl = neigh_find_table(net, ndm->ndm_family);
if (tbl == NULL)
return -EAFNOSUPPORT;
--
2.11.0
From: David Ahern <[email protected]>
Rename for consistency with the ipv6 name for the similar function. This
also allows ipv4_neigh_lookup to be reused for the wrapper to the arp table.
---
net/ipv4/route.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 1df6e97106d7..74e1df60ab7f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -153,7 +153,7 @@ static u32 *ipv4_cow_metrics(struct dst_entry *dst, unsigned long old)
return NULL;
}
-static struct neighbour *ipv4_neigh_lookup(const struct dst_entry *dst,
+static struct neighbour *ipv4_dst_neigh_lookup(const struct dst_entry *dst,
struct sk_buff *skb,
const void *daddr);
static void ipv4_confirm_neigh(const struct dst_entry *dst, const void *daddr);
@@ -170,7 +170,7 @@ static struct dst_ops ipv4_dst_ops = {
.update_pmtu = ip_rt_update_pmtu,
.redirect = ip_do_redirect,
.local_out = __ip_local_out,
- .neigh_lookup = ipv4_neigh_lookup,
+ .neigh_lookup = ipv4_dst_neigh_lookup,
.confirm_neigh = ipv4_confirm_neigh,
};
@@ -430,7 +430,7 @@ void rt_cache_flush(struct net *net)
rt_genid_bump_ipv4(net);
}
-static struct neighbour *ipv4_neigh_lookup(const struct dst_entry *dst,
+static struct neighbour *ipv4_dst_neigh_lookup(const struct dst_entry *dst,
struct sk_buff *skb,
const void *daddr)
{
@@ -2537,7 +2537,7 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
.update_pmtu = ipv4_rt_blackhole_update_pmtu,
.redirect = ipv4_rt_blackhole_redirect,
.cow_metrics = ipv4_rt_blackhole_cow_metrics,
- .neigh_lookup = ipv4_neigh_lookup,
+ .neigh_lookup = ipv4_dst_neigh_lookup,
};
struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_orig)
--
2.11.0
From: David Ahern <[email protected]>
Remove open use of arp_tbl and nd_tbl in favor of the new
ipv{4,6}_neigh_table helpers. Since the existence of the IPv6 table
is managed by the core networking, the IS_ENABLED checks for IPv6
can be removed in favor of "is the table non-NULL".
Signed-off-by: David Ahern <[email protected]>
---
drivers/infiniband/ulp/ipoib/ipoib_main.c | 14 ++++++---
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 35 +++++++++++-----------
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 11 ++-----
.../net/ethernet/mellanox/mlxsw/spectrum_router.c | 27 ++++++++---------
.../net/ethernet/mellanox/mlxsw/spectrum_span.c | 8 +++--
.../ethernet/netronome/nfp/flower/tunnel_conf.c | 2 +-
drivers/net/ethernet/rocker/rocker_main.c | 4 +--
drivers/net/ethernet/rocker/rocker_ofdpa.c | 2 +-
drivers/net/vrf.c | 4 +--
drivers/net/vxlan.c | 10 +++----
net/atm/clip.c | 14 +++++----
net/ieee802154/6lowpan/tx.c | 2 +-
12 files changed, 70 insertions(+), 63 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 26cde95bc0f3..4f798a7b9cc0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1298,6 +1298,8 @@ struct ipoib_neigh *ipoib_neigh_get(struct net_device *dev, u8 *daddr)
static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
{
+ struct net *net = dev_net(priv->dev);
+ struct neigh_table *arp_tbl = ipv4_neigh_table(net);
struct ipoib_neigh_table *ntbl = &priv->ntbl;
struct ipoib_neigh_hash *htbl;
unsigned long neigh_obsolete;
@@ -1318,7 +1320,7 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv *priv)
goto out_unlock;
/* neigh is obsolete if it was idle for two GC periods */
- dt = 2 * arp_tbl.gc_interval;
+ dt = 2 * arp_tbl->gc_interval;
neigh_obsolete = jiffies - dt;
/* handle possible race condition */
if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
@@ -1357,12 +1359,14 @@ static void ipoib_reap_neigh(struct work_struct *work)
{
struct ipoib_dev_priv *priv =
container_of(work, struct ipoib_dev_priv, neigh_reap_task.work);
+ struct net *net = dev_net(priv->dev);
+ struct neigh_table *arp_tbl = ipv4_neigh_table(net);
__ipoib_reap_neigh(priv);
if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
queue_delayed_work(priv->wq, &priv->neigh_reap_task,
- arp_tbl.gc_interval);
+ arp_tbl->gc_interval);
}
@@ -1514,6 +1518,8 @@ void ipoib_neigh_free(struct ipoib_neigh *neigh)
static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
{
+ struct net *net = dev_net(priv->dev);
+ struct neigh_table *arp_tbl = ipv4_neigh_table(net);
struct ipoib_neigh_table *ntbl = &priv->ntbl;
struct ipoib_neigh_hash *htbl;
struct ipoib_neigh __rcu **buckets;
@@ -1525,7 +1531,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
if (!htbl)
return -ENOMEM;
set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
- size = roundup_pow_of_two(arp_tbl.gc_thresh3);
+ size = roundup_pow_of_two(arp_tbl->gc_thresh3);
buckets = kcalloc(size, sizeof(*buckets), GFP_KERNEL);
if (!buckets) {
kfree(htbl);
@@ -1541,7 +1547,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
/* start garbage collection */
clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
queue_delayed_work(priv->wq, &priv->neigh_reap_task,
- arp_tbl.gc_interval);
+ arp_tbl->gc_interval);
return 0;
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 8e3c5b4b90ab..a2283a4b3a17 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -309,17 +309,19 @@ void mlx5e_remove_sqs_fwd_rules(struct mlx5e_priv *priv)
static void mlx5e_rep_neigh_update_init_interval(struct mlx5e_rep_priv *rpriv)
{
-#if IS_ENABLED(CONFIG_IPV6)
- unsigned long ipv6_interval = NEIGH_VAR(&nd_tbl.parms,
+ struct net_device *netdev = rpriv->netdev;
+ struct net *net = dev_net(netdev);
+ struct neigh_table *arp_table = ipv4_neigh_table(net);
+ struct neigh_table *nd_table = ipv6_neigh_table(net);
+
+ unsigned long ipv4_interval = NEIGH_VAR(&arp_table->parms,
DELAY_PROBE_TIME);
-#else
unsigned long ipv6_interval = ~0UL;
-#endif
- unsigned long ipv4_interval = NEIGH_VAR(&arp_tbl.parms,
- DELAY_PROBE_TIME);
- struct net_device *netdev = rpriv->netdev;
struct mlx5e_priv *priv = netdev_priv(netdev);
+ if (nd_table)
+ ipv6_interval = NEIGH_VAR(&nd_table->parms, DELAY_PROBE_TIME);
+
rpriv->neigh_update.min_interval = min_t(unsigned long, ipv6_interval, ipv4_interval);
mlx5_fc_update_sampling_interval(priv->mdev, rpriv->neigh_update.min_interval);
}
@@ -437,19 +439,22 @@ static int mlx5e_rep_netevent_event(struct notifier_block *nb,
struct net_device *netdev = rpriv->netdev;
struct mlx5e_priv *priv = netdev_priv(netdev);
struct mlx5e_neigh_hash_entry *nhe = NULL;
+ struct net *net = dev_net(netdev);
struct mlx5e_neigh m_neigh = {};
+ struct neigh_table *arp_table;
+ struct neigh_table *nd_table;
struct neigh_parms *p;
struct neighbour *n;
bool found = false;
+ arp_table = ipv4_neigh_table(net);
+ nd_table = ipv6_neigh_table(net);
+
switch (event) {
case NETEVENT_NEIGH_UPDATE:
n = ptr;
-#if IS_ENABLED(CONFIG_IPV6)
- if (n->tbl != &nd_tbl && n->tbl != &arp_tbl)
-#else
- if (n->tbl != &arp_tbl)
-#endif
+
+ if (n->tbl != nd_table && n->tbl != arp_table)
return NOTIFY_DONE;
m_neigh.dev = n->dev;
@@ -493,11 +498,7 @@ static int mlx5e_rep_netevent_event(struct notifier_block *nb,
* changes in the default table, we only care about changes
* done per device delay prob time parameter.
*/
-#if IS_ENABLED(CONFIG_IPV6)
- if (!p->dev || (p->tbl != &nd_tbl && p->tbl != &arp_tbl))
-#else
- if (!p->dev || p->tbl != &arp_tbl)
-#endif
+ if (!p->dev || (p->tbl != nd_table && p->tbl != arp_table))
return NOTIFY_DONE;
/* We are in atomic context and can't take RTNL mutex,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 0edf4751a8ba..f4f2bace496a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -45,7 +45,7 @@
#include <net/tc_act/tc_pedit.h>
#include <net/tc_act/tc_csum.h>
#include <net/vxlan.h>
-#include <net/arp.h>
+#include <net/neighbour.h>
#include "en.h"
#include "en_rep.h"
#include "en_tc.h"
@@ -999,13 +999,8 @@ void mlx5e_tc_update_neigh_used_value(struct mlx5e_neigh_hash_entry *nhe)
bool neigh_used = false;
struct neighbour *n;
- if (m_neigh->family == AF_INET)
- tbl = &arp_tbl;
-#if IS_ENABLED(CONFIG_IPV6)
- else if (m_neigh->family == AF_INET6)
- tbl = &nd_tbl;
-#endif
- else
+ tbl = neigh_find_table(dev_net(m_neigh->dev), m_neigh->family);
+ if (!tbl)
return;
list_for_each_entry(e, &nhe->encap_list, encap_list) {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index e51c8dc52f37..c5e6b2972d0f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2019,15 +2019,15 @@ mlxsw_sp_neigh_entry_lookup(struct mlxsw_sp *mlxsw_sp, struct neighbour *n)
static void
mlxsw_sp_router_neighs_update_interval_init(struct mlxsw_sp *mlxsw_sp)
{
- unsigned long interval;
+ /* mlxsw only works with init_net at the moment */
+ struct neigh_table *arp_table = ipv4_neigh_table(&init_net);
+ struct neigh_table *nd_table = ipv6_neigh_table(&init_net);
+ unsigned long interval = NEIGH_VAR(&arp_table->parms, DELAY_PROBE_TIME);
+
+ if (nd_table)
+ interval = min_t(unsigned long, interval,
+ NEIGH_VAR(&nd_table->parms, DELAY_PROBE_TIME));
-#if IS_ENABLED(CONFIG_IPV6)
- interval = min_t(unsigned long,
- NEIGH_VAR(&arp_tbl.parms, DELAY_PROBE_TIME),
- NEIGH_VAR(&nd_tbl.parms, DELAY_PROBE_TIME));
-#else
- interval = NEIGH_VAR(&arp_tbl.parms, DELAY_PROBE_TIME);
-#endif
mlxsw_sp->router->neighs_update.interval = jiffies_to_msecs(interval);
}
@@ -2050,7 +2050,7 @@ static void mlxsw_sp_router_neigh_ent_ipv4_process(struct mlxsw_sp *mlxsw_sp,
dipn = htonl(dip);
dev = mlxsw_sp->router->rifs[rif]->dev;
- n = neigh_lookup(&arp_tbl, &dipn, dev);
+ n = ipv4_neigh_lookup(dev, &dipn);
if (!n)
return;
@@ -2064,6 +2064,7 @@ static void mlxsw_sp_router_neigh_ent_ipv6_process(struct mlxsw_sp *mlxsw_sp,
char *rauhtd_pl,
int rec_index)
{
+ struct neigh_table *nd_table = ipv6_neigh_table(&init_net);
struct net_device *dev;
struct neighbour *n;
struct in6_addr dip;
@@ -2078,7 +2079,7 @@ static void mlxsw_sp_router_neigh_ent_ipv6_process(struct mlxsw_sp *mlxsw_sp,
}
dev = mlxsw_sp->router->rifs[rif]->dev;
- n = neigh_lookup(&nd_tbl, &dip, dev);
+ n = neigh_lookup(nd_table, &dip, dev);
if (!n)
return;
@@ -3721,7 +3722,7 @@ mlxsw_sp_nexthop4_group_create(struct mlxsw_sp *mlxsw_sp, struct fib_info *fi)
return ERR_PTR(-ENOMEM);
nh_grp->priv = fi;
INIT_LIST_HEAD(&nh_grp->fib_list);
- nh_grp->neigh_tbl = &arp_tbl;
+ nh_grp->neigh_tbl = ipv4_neigh_table(&init_net);
nh_grp->gateway = mlxsw_sp_fi_is_gateway(mlxsw_sp, fi);
nh_grp->count = fi->fib_nhs;
@@ -4919,9 +4920,7 @@ mlxsw_sp_nexthop6_group_create(struct mlxsw_sp *mlxsw_sp,
if (!nh_grp)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&nh_grp->fib_list);
-#if IS_ENABLED(CONFIG_IPV6)
- nh_grp->neigh_tbl = &nd_tbl;
-#endif
+ nh_grp->neigh_tbl = ipv6_neigh_table(&init_net);
mlxsw_sp_rt6 = list_first_entry(&fib6_entry->rt6_list,
struct mlxsw_sp_rt6, list);
nh_grp->gateway = mlxsw_sp_rt6_is_gateway(mlxsw_sp, mlxsw_sp_rt6->rt);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
index e42d640cddab..8d21a340cf47 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
@@ -365,6 +365,7 @@ mlxsw_sp_span_entry_gretap4_parms(const struct net_device *to_dev,
bool inherit_ttl = !tparm.iph.ttl;
union mlxsw_sp_l3addr gw = daddr;
struct net_device *l3edev;
+ struct neigh_table *tbl;
if (!(to_dev->flags & IFF_UP) ||
/* Reject tunnels with GRE keys, checksums, etc. */
@@ -376,9 +377,10 @@ mlxsw_sp_span_entry_gretap4_parms(const struct net_device *to_dev,
return mlxsw_sp_span_entry_unoffloadable(sparmsp);
l3edev = mlxsw_sp_span_gretap4_route(to_dev, &saddr.addr4, &gw.addr4);
+ tbl = ipv4_neigh_table(dev_net(l3edev));
return mlxsw_sp_span_entry_tunnel_parms_common(l3edev, saddr, daddr, gw,
tparm.iph.ttl,
- &arp_tbl, sparmsp);
+ tbl, sparmsp);
}
static int
@@ -466,6 +468,7 @@ mlxsw_sp_span_entry_gretap6_parms(const struct net_device *to_dev,
bool inherit_ttl = !tparm.hop_limit;
union mlxsw_sp_l3addr gw = daddr;
struct net_device *l3edev;
+ struct neigh_table *tbl;
if (!(to_dev->flags & IFF_UP) ||
/* Reject tunnels with GRE keys, checksums, etc. */
@@ -477,9 +480,10 @@ mlxsw_sp_span_entry_gretap6_parms(const struct net_device *to_dev,
return mlxsw_sp_span_entry_unoffloadable(sparmsp);
l3edev = mlxsw_sp_span_gretap6_route(to_dev, &saddr.addr6, &gw.addr6);
+ tbl = ipv6_neigh_table(dev_net(l3edev));
return mlxsw_sp_span_entry_tunnel_parms_common(l3edev, saddr, daddr, gw,
tparm.hop_limit,
- &nd_tbl, sparmsp);
+ tbl, sparmsp);
}
static int
diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index 78afe75129ab..2f52cbd01d31 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -201,7 +201,7 @@ void nfp_tunnel_keep_alive(struct nfp_app *app, struct sk_buff *skb)
if (!netdev)
continue;
- n = neigh_lookup(&arp_tbl, &ipv4_addr, netdev);
+ n = ipv4_neigh_lookup(netdev, &ipv4_addr);
if (!n)
continue;
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index aeafdb9ac015..6791941b3472 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -3098,9 +3098,9 @@ static int rocker_netevent_event(struct notifier_block *unused,
switch (event) {
case NETEVENT_NEIGH_UPDATE:
- if (n->tbl != &arp_tbl)
- return NOTIFY_DONE;
dev = n->dev;
+ if (n->tbl != ipv4_neigh_table(dev_net(dev)))
+ return NOTIFY_DONE;
if (!rocker_port_dev_check(dev))
return NOTIFY_DONE;
rocker_port = netdev_priv(dev);
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 6473cc68c2d5..5745d675aaeb 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1356,7 +1356,7 @@ static int ofdpa_port_ipv4_resolve(struct ofdpa_port *ofdpa_port,
int err = 0;
if (!n) {
- n = neigh_create(&arp_tbl, &ip_addr, dev);
+ n = ipv4_neigh_create(dev, &ip_addr);
if (IS_ERR(n))
return PTR_ERR(n);
}
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index f93547f257fb..304f452f619a 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -367,7 +367,7 @@ static int vrf_finish_output6(struct net *net, struct sock *sk,
nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
if (unlikely(!neigh))
- neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
+ neigh = ipv6_neigh_create(dst->dev, nexthop, false);
if (!IS_ERR(neigh)) {
sock_confirm_neigh(skb, neigh);
ret = neigh_output(neigh, skb);
@@ -575,7 +575,7 @@ static int vrf_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
- neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
+ neigh = ipv4_neigh_create_noref(dev, &nexthop);
if (!IS_ERR(neigh)) {
sock_confirm_neigh(skb, neigh);
ret = neigh_output(neigh, skb);
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..2ef9df11eaff 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1520,8 +1520,7 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
ipv4_is_multicast(tip))
goto out;
- n = neigh_lookup(&arp_tbl, &tip, dev);
-
+ n = ipv4_neigh_lookup(dev, &tip);
if (n) {
struct vxlan_fdb *f;
struct sk_buff *reply;
@@ -1675,8 +1674,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
ipv6_addr_is_multicast(&msg->target))
goto out;
- n = neigh_lookup(ipv6_stub->nd_tbl, &msg->target, dev);
-
+ n = ipv6_neigh_lookup(dev, &msg->target);
if (n) {
struct vxlan_fdb *f;
struct sk_buff *reply;
@@ -1736,7 +1734,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
if (!pskb_may_pull(skb, sizeof(struct iphdr)))
return false;
pip = ip_hdr(skb);
- n = neigh_lookup(&arp_tbl, &pip->daddr, dev);
+ n = ipv4_neigh_lookup(dev, &pip->daddr);
if (!n && (vxlan->cfg.flags & VXLAN_F_L3MISS)) {
union vxlan_addr ipa = {
.sin.sin_addr.s_addr = pip->daddr,
@@ -1757,7 +1755,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
return false;
pip6 = ipv6_hdr(skb);
- n = neigh_lookup(ipv6_stub->nd_tbl, &pip6->daddr, dev);
+ n = ipv6_neigh_lookup(dev, &pip6->daddr);
if (!n && (vxlan->cfg.flags & VXLAN_F_L3MISS)) {
union vxlan_addr ipa = {
.sin6.sin6_addr = pip6->daddr,
diff --git a/net/atm/clip.c b/net/atm/clip.c
index d795b9c5aea4..a85410000abc 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -155,10 +155,12 @@ static int neigh_check_cb(struct neighbour *n)
static void idle_timer_check(struct timer_list *unused)
{
- write_lock(&arp_tbl.lock);
- __neigh_for_each_release(&arp_tbl, neigh_check_cb);
+ struct neigh_table *tbl = ipv4_neigh_table(&init_net);
+
+ write_lock(&tbl->lock);
+ __neigh_for_each_release(tbl, neigh_check_cb);
mod_timer(&idle_timer, jiffies + CLIP_CHECK_INTERVAL * HZ);
- write_unlock(&arp_tbl.lock);
+ write_unlock(&tbl->lock);
}
static int clip_arp_rcv(struct sk_buff *skb)
@@ -465,7 +467,8 @@ static int clip_setentry(struct atm_vcc *vcc, __be32 ip)
rt = ip_route_output(&init_net, ip, 0, 1, 0);
if (IS_ERR(rt))
return PTR_ERR(rt);
- neigh = __neigh_lookup(&arp_tbl, &ip, rt->dst.dev, 1);
+ neigh = __neigh_lookup(ipv4_neigh_table(&init_net), &ip,
+ rt->dst.dev, 1);
ip_rt_put(rt);
if (!neigh)
return -ENOMEM;
@@ -836,7 +839,8 @@ static void *clip_seq_start(struct seq_file *seq, loff_t * pos)
{
struct clip_seq_state *state = seq->private;
state->ns.neigh_sub_iter = clip_seq_sub_iter;
- return neigh_seq_start(seq, pos, &arp_tbl, NEIGH_SEQ_NEIGH_ONLY);
+ return neigh_seq_start(seq, pos, ipv4_neigh_table(&init_net),
+ NEIGH_SEQ_NEIGH_ONLY);
}
static int clip_seq_show(struct seq_file *seq, void *v)
diff --git a/net/ieee802154/6lowpan/tx.c b/net/ieee802154/6lowpan/tx.c
index e6ff5128e61a..d472b08fbf25 100644
--- a/net/ieee802154/6lowpan/tx.c
+++ b/net/ieee802154/6lowpan/tx.c
@@ -64,7 +64,7 @@ int lowpan_header_create(struct sk_buff *skb, struct net_device *ldev,
} else {
__le16 short_addr = cpu_to_le16(IEEE802154_ADDR_SHORT_UNSPEC);
- n = neigh_lookup(&nd_tbl, &hdr->daddr, ldev);
+ n = ipv6_neigh_lookup(ldev, &hdr->daddr);
if (n) {
llneigh = lowpan_802154_neigh(neighbour_priv(n));
read_lock_bh(&n->lock);
--
2.11.0
On Tue, Jul 17, 2018 at 5:11 AM <[email protected]> wrote:
>
> From: David Ahern <[email protected]>
>
> Nikita Leshenko reported that neighbor entries in one namespace can
> evict neighbor entries in another. The problem is that the neighbor
> tables have entries across all namespaces without separate accounting
> and with global limits on when to scan for entries to evict.
It is nothing new, people including me already noticed this before.
>
> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> namespace and making the accounting and threshold limits per namespace.
The last discussion about this a long time ago concluded that neigh
table entries are controllable by remote, so after moving it to per netns,
it would be easier to DOS the host.
On 7/17/18 11:40 AM, Cong Wang wrote:
> On Tue, Jul 17, 2018 at 5:11 AM <[email protected]> wrote:
>>
>> From: David Ahern <[email protected]>
>>
>> Nikita Leshenko reported that neighbor entries in one namespace can
>> evict neighbor entries in another. The problem is that the neighbor
>> tables have entries across all namespaces without separate accounting
>> and with global limits on when to scan for entries to evict.
>
> It is nothing new, people including me already noticed this before.
>
>
>>
>> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
>> namespace and making the accounting and threshold limits per namespace.
>
>
> The last discussion about this a long time ago concluded that neigh
> table entries are controllable by remote, so after moving it to per netns,
> it would be easier to DOS the host.
>
There are still limits on the total number of entries and with
per-namespace limits an admin has better control.
On Tue, Jul 17, 2018 at 10:43 AM David Ahern <[email protected]> wrote:
>
> On 7/17/18 11:40 AM, Cong Wang wrote:
> > On Tue, Jul 17, 2018 at 5:11 AM <[email protected]> wrote:
> >>
> >> From: David Ahern <[email protected]>
> >>
> >> Nikita Leshenko reported that neighbor entries in one namespace can
> >> evict neighbor entries in another. The problem is that the neighbor
> >> tables have entries across all namespaces without separate accounting
> >> and with global limits on when to scan for entries to evict.
> >
> > It is nothing new, people including me already noticed this before.
> >
> >
> >>
> >> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> >> namespace and making the accounting and threshold limits per namespace.
> >
> >
> > The last discussion about this a long time ago concluded that neigh
> > table entries are controllable by remote, so after moving it to per netns,
> > it would be easier to DOS the host.
> >
>
> There are still limits on the total number of entries and with
> per-namespace limits an admin has better control.
Per-netns limit is *exactly* the problem here.
Quote from David Miller:
"
From: [email protected] (Eric W. Biederman)
Date: Wed, 25 Jun 2014 18:17:08 -0700
> I disagree that removing a global DOS prevention check is a benefit.
> Certainly large semantics changes like that should not happen without
> being discussed in the patch description.
Agreed, this is the most important core issue.
If we just make these things per netns, then as a result if you create
N namespaces we will allow N times more neighbour entries to be
sitting in the system at once.
Actually, I'm really surprised the limits get hit and this actually
causes problems.
"
You can see the original discussion here:
https://marc.info/?l=linux-netdev&m=140356141019653&w=2
On 7/17/18 11:53 AM, Cong Wang wrote:
> You can see the original discussion here:
> https://marc.info/?l=linux-netdev&m=140356141019653&w=2
>
Thanks for the reference.
I was surprised that the tables are still global. A number of objections
raised in that thread were due to a large patch tackling multiple
issues. This set is focused on one thing - moving the tables to net - and
does so in small incremental changes to make it easy to review.
One of DaveM's comments:
"Finally, another problem are permanent neigh entries as those cannot
be reclaimed, that might be part of the main problem here.
One idea wrt. permanent entries is that we could decide that, since
they are administratively added, they don't count against the
thresholds and limits."
This is another issue we have hit, and we came to the same conclusion:
permanent entries should not count in the gc numbers. We need to address
this for EVPN.
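The idea of not counting permanent entries can be sketched in a small userspace model. This is only an illustration under stated assumptions: `model_table`, `model_entry_add` and `MODEL_NUD_PERMANENT` are hypothetical names, and this is not the kernel's accounting code.

```c
#include <stdbool.h>

/* Hypothetical flag marking an administratively added entry. */
#define MODEL_NUD_PERMANENT 0x80

/* Hypothetical table: only reclaimable entries are counted. */
struct model_table {
	unsigned int gc_entries;	/* entries that gc may reclaim */
	unsigned int gc_thresh3;	/* hard cap on reclaimable entries */
};

/* Returns true if the entry may be added to the table. */
static bool model_entry_add(struct model_table *tbl, unsigned char nud_state)
{
	/* Permanent entries bypass the counter and the limit. */
	if (nud_state & MODEL_NUD_PERMANENT)
		return true;

	/* Reclaimable entries are refused once the cap is reached. */
	if (tbl->gc_entries >= tbl->gc_thresh3)
		return false;

	tbl->gc_entries++;
	return true;
}
```

In this model a large set of static entries never pushes the table toward its thresholds, so only learned entries compete for the limit.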
As for the per-namespace tables, it is 4 years later and over that time
Linux has gained support for a number of features: EVPN which is very mac
heavy, VRR which doubles mac entries (one against the VRR device and one
against the lower device) and NOS level features such as mlxsw which has to
ensure mac entries for nexthop gateways stay active. In addition there
are other features on the horizon - like the ability to use namespaces
to create virtual switches (what Cisco calls a VDC) where you absolutely
want isolation and do not want entries from one virtual switch to evict
entries from another. And of course the continued proliferation of
containerized workloads where isolation is desired.
I understand the concern about global resource and limits: as it stands
you have to increase the limits in init_net to the max expected and hope
for the best. With per namespace limits you can lower the limits of each
namespace to better control the impact on the total memory used.
Perhaps the defaults for namespaces after init_net could have really low
defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
intervention.
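The worst-case arithmetic behind this argument can be written down as a short sketch. It assumes that gc_thresh3 acts as the hard cap for each namespace; `ns_limits` and `worst_case_entries` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical per-namespace limits; fields mirror gc_thresh1/2/3. */
struct ns_limits {
	unsigned int gc_thresh1;
	unsigned int gc_thresh2;
	unsigned int gc_thresh3;	/* assumed hard cap for this namespace */
};

/* Upper bound on entries system-wide: the sum of every hard cap. */
static unsigned long worst_case_entries(const struct ns_limits *ns,
					size_t count)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < count; i++)
		total += ns[i].gc_thresh3;

	return total;
}
```

For example, with one namespace at 128 / 512 / 1024 and ten more at 16 / 32 / 64, the bound is 1024 + 10 * 64 = 1664 entries, compared with 11 * 1024 = 11264 if every namespace kept the larger values.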
On Tue, Jul 17, 2018 at 12:02 PM David Ahern <[email protected]> wrote:
> As for the per-namespace tables, it is 4 years later and over that time
> Linux supports a number of features: EVPN which is very mac heavy, VRR
> which doubles mac entries (one against the VRR device and one against
> the lower device) and NOS level features such as mlxsw which has to
> ensure mac entries for nexthop gateways stay active. In addition there
> are other features on the horizon - like the ability to use namespaces
> to create virtual switches (what Cisco calls a VDC) where you absolutely
> want isolation and do not want entries from one virtual switch to evict
> entries from another. And of course the continued proliferation of
> containerized workloads where isolation is desired.
As long as there is no change in the neigh table code base itself, these
can't address the concerns people raised before.
>
> I understand the concern about global resource and limits: as it stands
> you have to increase the limits in init_net to the max expected and hope
> for the best. With per namespace limits you can lower the limits of each
> namespace to better control the impact on the total memory used.
The problem is that the number of containers in a host is usually
not predictable.
Of course, you can say containers limit kernel memory too, but
memcg is not part of netns. I once told David Miller that cpuset is the
way to isolate per-CPU softnet_data, and he didn't like it. Based
on that I don't think you can convince him with memcg as a solution
here.
From: David Ahern <[email protected]>
Date: Tue, 17 Jul 2018 13:02:18 -0600
> I understand the concern about global resource and limits: as it stands
> you have to increase the limits in init_net to the max expected and hope
> for the best. With per namespace limits you can lower the limits of each
> namespace to better control the impact on the total memory used.
> Perhaps the defaults for namespaces after init_net could have really low
> defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
> intervention.
How does this work when a namespace creates another namespace?
Changing the defaults for non-init_net namespaces could work, but that
could be a surprise to some people.
>>>>> David Ahern <[email protected]> writes:
[email protected] wrote:
> Nikita Leshenko reported that neighbor entries in one namespace can
> evict neighbor entries in another. The problem is that the neighbor
> tables have entries across all namespaces without separate accounting
> and with global limits on when to scan for entries to evict.
> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> namespace and making the accounting and threshold limits per namespace.
This is a good improvement, thank you.
We absolutely need to keep a DOS against a single netns from causing
evictions in another netns.
Within a namespace there may be neighbour entries that are more
sure/valid/useful than others. I would like an API to be able to
mark them explicitly, but that could come later.
In particular, in the 802.15.4 case, NE that arrive via encrypted
channels should be preferred over entries that arrive over unencrypted
channels. This is needed for IETF 6tisch secure join work, for instance.
I believe that we could use network namespaces to implement this, though.
I had not considered that before, and I think that it will work, but
there might be something subtle that I've missed. (Alex?)
It appears that one can tune the amount of space on a per-namespace basis:
+ nd_tbl->gc_thresh1 = 128;
+ nd_tbl->gc_thresh2 = 512;
+ nd_tbl->gc_thresh3 = 1024;
> Remove open use of arp_tbl and nd_tbl in favor of the new
> ipv{4,6}_neigh_table helpers. Since the existence of the IPv6 table
> is managed by the core networking, the IS_ENABLED checks for IPv6
> can be removed in favor of "is the table non-NULL".
What's the advantage of changing this check? (I am ignorant)
On 7/18/18 6:54 PM, Michael Richardson wrote:
>> Remove open use of arp_tbl and nd_tbl in favor of the new
>> ipv{4,6}_neigh_table helpers. Since the existence of the IPv6 table
>> is managed by the core networking, the IS_ENABLED checks for IPv6
>> can be removed in favor of "is the table non-NULL".
>
> What's the advantage of changing this check? (I am ignorant)
>
Just makes the code simpler.
The current nd_tbl is a global owned by the ipv6 code
(net/ipv6/ndisc.c). If CONFIG_IPV6 is not enabled, then nd_tbl is not
defined which leads code referencing it to use if checks like this:
#if IS_ENABLED(CONFIG_IPV6)
if (!p->dev || (p->tbl != &nd_tbl && p->tbl != &arp_tbl))
#else
if (!p->dev || p->tbl != &arp_tbl)
#endif
With the neigh_find_table approach the IS_ENABLED can be removed in
favor of 'if (tbl)'. If tbl is set, then ipv6 is loaded and initialized.
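
To illustrate the point about replacing the IS_ENABLED checks, here is a
minimal userspace sketch. It is not the kernel code: struct neigh_table,
the FAM_* constants, table_register(), find_table() and tbl_is_ip_neigh()
are hypothetical stand-ins, assumed only to mimic a lookup in the spirit
of neigh_find_table() that returns NULL until a protocol has registered
its table.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel type; not the real definition. */
struct neigh_table {
	int family;
};

/* Hypothetical family indexes used only by this sketch. */
enum { FAM_V4, FAM_V6, FAM_MAX };

/* A slot stays NULL until the owning protocol registers its table. */
static struct neigh_table *tables[FAM_MAX];

static void table_register(struct neigh_table *tbl)
{
	tables[tbl->family] = tbl;
}

/* Mock lookup: returns NULL when the family has no table registered. */
static struct neigh_table *find_table(int family)
{
	if (family < 0 || family >= FAM_MAX)
		return NULL;
	return tables[family];
}

/*
 * Caller-side check without any #if IS_ENABLED(CONFIG_IPV6): if the
 * IPv6 table was never registered, the lookup yields NULL and the
 * comparison against a non-NULL tbl simply fails.
 */
static bool tbl_is_ip_neigh(const struct neigh_table *tbl)
{
	return tbl && (tbl == find_table(FAM_V4) || tbl == find_table(FAM_V6));
}
```

The design point is that "is IPv6 available" becomes a runtime question
answered by the lookup, so callers need only one code path.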
On 7/17/18 9:59 PM, David Miller wrote:
> From: David Ahern <[email protected]>
> Date: Tue, 17 Jul 2018 13:02:18 -0600
>
>> I understand the concern about global resource and limits: as it stands
>> you have to increase the limits in init_net to the max expected and hope
>> for the best. With per namespace limits you can lower the limits of each
>> namespace better control the total impact on the total memory used.
>> Perhaps the defaults for namespaces after init_net could have really low
>> defaults (e.g., 16 / 32 / 64 for gc_thresh 1/2/3) requiring admin
>> intervention.
>
> How does this work when a namespace creates another namespace?
>
> Changing the defaults for non-init_net namespaces could work, but that
> could be a surprise to some people.
>
Patches 14 (ipv4) and 15 (ipv6) currently use the existing hardcoded
values - not based on current init_net or anything else. This could be
changed to:
+ if (net_eq(net, &init_net)) {
+ arp_tbl->gc_thresh1 = 128;
+ arp_tbl->gc_thresh2 = 512;
+ arp_tbl->gc_thresh3 = 1024;
+ } else {
+ arp_tbl->gc_thresh1 = 16;
+ arp_tbl->gc_thresh2 = 32;
+ arp_tbl->gc_thresh3 = 64;
+ }
and update the documentation that any new network namespaces have lower
defaults.
As for any change in behavior: today neighbor entries from one namespace
can be removed due to actions in another so no obvious correlation. With
lower settings then gc could kick in and remove entries that otherwise
would not have been. The big hit would be to a new namespace where an
app inserts a lot of PERMANENT entries.
Chatting with Nikolay about this and he brought up a good corollary - ip
fragmentation. It really is a similar problem in that memory is consumed
as a result of packets received from an external entity. The ipfrag
sysctls are per namespace with a limit that non-init_net namespaces can
not set high_thresh > the current value of init_net. Potential memory
consumed by fragments scales with the number of namespaces which is the
primary concern with making neighbor tables per namespace.
If we kept the current default settings (128/512/1024) per namespace we
still have capped memory use, and the one user visible hit that comes to
mind is the namespace with a lot of PERM entries.
On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
>
> Chatting with Nikolay about this and he brought up a good corollary - ip
> fragmentation. It really is a similar problem in that memory is consumed
> as a result of packets received from an external entity. The ipfrag
> sysctls are per namespace with a limit that non-init_net namespaces can
> not set high_thresh > the current value of init_net. Potential memory
> consumed by fragments scales with the number of namespaces which is the
> primary concern with making neighbor tables per namespace.
Nothing new, already discussed:
https://marc.info/?l=linux-netdev&m=140391416215988&w=2
:)
On 7/19/18 11:12 AM, Cong Wang wrote:
> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
>>
>> Chatting with Nikolay about this and he brought up a good corollary - ip
>> fragmentation. It really is a similar problem in that memory is consumed
>> as a result of packets received from an external entity. The ipfrag
>> sysctls are per namespace with a limit that non-init_net namespaces can
>> not set high_thresh > the current value of init_net. Potential memory
>> consumed by fragments scales with the number of namespaces which is the
>> primary concern with making neighbor tables per namespace.
>
> Nothing new, already discussed:
> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>
> :)
>
Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
local memory resources due to received packets. bridge and vxlan fdb's
are fairly straightforward analogs to neighbor entries; they are per
device with no limits on the number of entries. Fragments have memory
limits per namespace. So neighbor tables are the only ones with this
strict limitation and concern on memory consumption.
I get the impression there is no longer a strong resistance against
moving the tables to per namespace, but deciding what is the right
approach to handle backwards compatibility. Correct? Changing the
accounting is inevitably going to be noticeable to some use case(s), but
with sysctl settings it is a simple runtime update once the user knows
to make the change.
neighbor entries round up to 512 byte allocations, so with the current
gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
limits per namespace seems high which is why I suggested a per-namespace
default of (16/32/64) which amounts to 32k per namespace limit by
default. Open to other suggestions as well.
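
The arithmetic above can be checked with a short sketch. The 512-byte
allocation size is the figure stated in this mail; the helper names and
the namespace count used in the note below are hypothetical and only
illustrate how the bound scales.

```c
#include <stddef.h>

/* Assumption from the thread: each entry rounds up to a 512-byte allocation. */
#define NEIGH_ALLOC_BYTES 512u

/* Worst-case bytes held by one table when it is filled up to gc_thresh3. */
static size_t neigh_table_max_bytes(unsigned int gc_thresh3)
{
	return (size_t)gc_thresh3 * NEIGH_ALLOC_BYTES;
}

/* Aggregate bound when each of nr_ns namespaces has its own table. */
static size_t neigh_total_max_bytes(unsigned int gc_thresh3, unsigned int nr_ns)
{
	return neigh_table_max_bytes(gc_thresh3) * nr_ns;
}
```

With gc_thresh3 = 1024 one table is bounded at 512k and with
gc_thresh3 = 64 at 32k. For a hypothetical host with 1000 namespaces
that is roughly 500 MiB versus roughly 31 MiB per protocol table.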
From: David Ahern <[email protected]>
Date: Tue, 24 Jul 2018 09:14:01 -0600
> I get the impression there is no longer a strong resistance against
> moving the tables to per namespace, but deciding what is the right
> approach to handle backwards compatibility. Correct? Changing the
> accounting is inevitably going to be noticeable to some use case(s), but
> with sysctl settings it is a simple runtime update once the user knows
> to make the change.
>
> neighbor entries round up to 512 byte allocations, so with the current
> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
> limits per namespace seems high which is why I suggested a per-namespace
> default of (16/32/64) which amounts to 32k per namespace limit by
> default. Open to other suggestions as well.
No objection from me about going to per-ns neigh tables.
About the defaults, I wonder if we can scale them to the amount of
memory given to the ns or something like that? I bet this will better
match the intended use of the ns.
On Tue, Jul 24, 2018 at 8:14 AM David Ahern <[email protected]> wrote:
>
> On 7/19/18 11:12 AM, Cong Wang wrote:
> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
> >>
> >> Chatting with Nikolay about this and he brought up a good corollary - ip
> >> fragmentation. It really is a similar problem in that memory is consumed
> >> as a result of packets received from an external entity. The ipfrag
> >> sysctls are per namespace with a limit that non-init_net namespaces can
> >> not set high_thresh > the current value of init_net. Potential memory
> >> consumed by fragments scales with the number of namespaces which is the
> >> primary concern with making neighbor tables per namespace.
> >
> > Nothing new, already discussed:
> > https://marc.info/?l=linux-netdev&m=140391416215988&w=2
> >
> > :)
> >
>
> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
> local memory resources due to received packets. bridge and vxlan fdb's
> are fairly straightforward analogs to neighbor entries; they are per
> device with no limits on the number of entries. Fragments have memory
> limits per namespace. So neighbor tables are the only ones with this
> strict limitation and concern on memory consumption.
>
> I get the impression there is no longer a strong resistance against
> moving the tables to per namespace, but deciding what is the right
> approach to handle backwards compatibility. Correct? Changing the
> accounting is inevitably going to be noticeable to some use case(s), but
> with sysctl settings it is a simple runtime update once the user knows
> to make the change.
This question definitely should go to Eric Biederman who was against
my proposal.
Let's add Eric into CC.
Cong Wang <[email protected]> writes:
> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <[email protected]> wrote:
>>
>> On 7/19/18 11:12 AM, Cong Wang wrote:
>> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
>> >>
>> >> Chatting with Nikolay about this and he brought up a good corollary - ip
>> >> fragmentation. It really is a similar problem in that memory is consumed
>> >> as a result of packets received from an external entity. The ipfrag
>> >> sysctls are per namespace with a limit that non-init_net namespaces can
>> >> not set high_thresh > the current value of init_net. Potential memory
>> >> consumed by fragments scales with the number of namespaces which is the
>> >> primary concern with making neighbor tables per namespace.
>> >
>> > Nothing new, already discussed:
>> > https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>> >
>> > :)
>> >
>>
>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>> local memory resources due to received packets. bridge and vxlan fdb's
>> are fairly straightforward analogs to neighbor entries; they are per
>> device with no limits on the number of entries. Fragments have memory
>> limits per namespace. So neighbor tables are the only ones with this
>> strict limitation and concern on memory consumption.
>>
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>
> This question definitely should go to Eric Biederman who was against
> my proposal.
>
> Let's add Eric into CC.
Given that the entries are per device and the devices are per-namespace,
semantically neighbours are already kept in a per-namespace manner. So
this is all about making the code not honor global resource limits.
Making the code not honor gc_thresh3.
Skimming through the code today the default for gc_thresh3 is 1024.
Which means that we limit the neighbour tables to 1024 entries per
protocol type.
There are some pretty compelling reasons especially with ipv4 to keep
the subnet size down. Arp storms are a real thing.
I don't know off the top of my head what the reasons are for limiting the
neighbour table sizes. I would be much more comfortable with a patchset
like this if we did some research and figured out the reasons why
we have a global limit. Then changed the code to remove those limits.
When the limits are gone, when the code can support large subnets
without tuning, and when we don't have to worry about someone scanning
all addresses in an ipv6 subnet and causing a DOS on working machines,
then I think it is completely appropriate to look to see if something
per network namespace needs to happen.
So please let's address the limits, not the fact that some specific
corner case ran into them.
If we are going to neuter gc_thresh3 let's go as far as removing it
entirely. If we are going to make the neighbour table per something
let's make it per network device. If we can afford the multiple hash
tables then a hash table per device is better. Perhaps we want to move
to rhash tables while we look at this, instead of an old hand-grown
version of a resizable hash table.
Unless I misread something all your patchset did is reshuffle code and
data structures so that gc_thresh3 does not apply across namespaces.
That does not feel like it really fixes anything. That just lies to
people.
Further unless I misread something you are increasing the number of
timers to 3 per namespace. If I create a thousand network
namespaces that feels like it will hurt system performance overall.
Eric
On 7/25/18 6:33 AM, Eric W. Biederman wrote:
> Cong Wang <[email protected]> writes:
>
>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <[email protected]> wrote:
>>>
>>> On 7/19/18 11:12 AM, Cong Wang wrote:
>>>> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
>>>>>
>>>>> Chatting with Nikolay about this and he brought up a good corollary - ip
>>>>> fragmentation. It really is a similar problem in that memory is consumed
>>>>> as a result of packets received from an external entity. The ipfrag
>>>>> sysctls are per namespace with a limit that non-init_net namespaces can
>>>>> not set high_thresh > the current value of init_net. Potential memory
>>>>> consumed by fragments scales with the number of namespaces which is the
>>>>> primary concern with making neighbor tables per namespace.
>>>>
>>>> Nothing new, already discussed:
>>>> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>>>>
>>>> :)
>>>>
>>>
>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>> local memory resources due to received packets. bridge and vxlan fdb's
>>> are fairly straightforward analogs to neighbor entries; they are per
>>> device with no limits on the number of entries. Fragments have memory
>>> limits per namespace. So neighbor tables are the only ones with this
>>> strict limitation and concern on memory consumption.
>>>
>>> I get the impression there is no longer a strong resistance against
>>> moving the tables to per namespace, but deciding what is the right
>>> approach to handle backwards compatibility. Correct? Changing the
>>> accounting is inevitably going to be noticeable to some use case(s), but
>>> with sysctl settings it is a simple runtime update once the user knows
>>> to make the change.
>>
>> This question definitely should go to Eric Biederman who was against
>> my proposal.
>>
>> Let's add Eric into CC.
>
> Given that the entries are per device and the devices are per-namespace,
> semantically neighbours are already kept in a per-namespace manner. So
> this is all about making the code not honoring global resource limits.
> Making the code not honor gc_thresh3.
>
> Skimming through the code today the default for gc_thresh3 is 1024.
> Which means that we limit the neighbour tables to 1024 entries per
> protocol type.
>
> There are some pretty compelling reasons especially with ipv4 to keep
> the subnet size down. Arp storms are a real thing.
>
> I don't know off the top of my head what the reasons for limiting the
> neighbour table sizes. I would be much more comfortable with a patchset
> like this if we did some research and figured out the reasons why
> we have a global limit. Then changed the code to remove those limits.
>
> When the limits are gone. When the code can support large subnets
> without tuning. We we don't have to worry about someone scanning an all
> addresses in an ipv6 subnet and causing a DOS on working machines.
> I think it is completely appropriate to look to see if something per
> network namespace needs to happen.
>
> So please let's address the limits, not the fact that some specific
> corner case ran into them.
>
> If we are going to neuter gc_thresh3 let's go as far as removing it
> entirely. If we are going to make the neighbour table per something
> let's make it per network device. If we can afford the multiple hash
> tables then a hash table per device is better. Perhaps we want to move
> to rhash tables while we look at this, instead of an old hand grown
> version of resizable hash table.
Given the use cases with an increasing number of devices (> 10,000),
per-device tables will have more problems than per namespace - in
reference to your concern in the last paragraph below.
>
> Unless I misread something all your patchset did is reshuffle code and
> data structures so that gc_thresh3 does not apply accross namespaces.
> That does not feel like it really fixes anything. That just lies to
> people.
This patch set fixes the lie that network namespaces provide complete
isolation when in fact one namespace can evict neighbor entries from
another. An arp storm you are concerned about in one namespace impacts
all containers.
It starts by removing the proliferation of open coded references to
arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
From there (patches 14-16) it makes the tables per-namespace and hence
makes the gc_thresh parameters which are per-table now
per-table-per-namespace.
So it removes the global thresholds because the global ones are just
wrong given the meaning of a network namespace and provides the more
appropriate per-namespace limits.
>
> Further unless I misread something you are increasing the number of
> timers to 3 per namespace. If I create create a thousand network
> namespaces that feels like it will hurt system performance overall.
It seems to me the timers are per neighbor entry not table. The per
table ones are for proxies.
David Ahern <[email protected]> writes:
> On 7/25/18 6:33 AM, Eric W. Biederman wrote:
>> Cong Wang <[email protected]> writes:
>>
>>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern <[email protected]> wrote:
>>>>
>>>> On 7/19/18 11:12 AM, Cong Wang wrote:
>>>>> On Thu, Jul 19, 2018 at 9:16 AM David Ahern <[email protected]> wrote:
>>>>>>
>>>>>> Chatting with Nikolay about this and he brought up a good corollary - ip
>>>>>> fragmentation. It really is a similar problem in that memory is consumed
>>>>>> as a result of packets received from an external entity. The ipfrag
>>>>>> sysctls are per namespace with a limit that non-init_net namespaces can
>>>>>> not set high_thresh > the current value of init_net. Potential memory
>>>>>> consumed by fragments scales with the number of namespaces which is the
>>>>>> primary concern with making neighbor tables per namespace.
>>>>>
>>>>> Nothing new, already discussed:
>>>>> https://marc.info/?l=linux-netdev&m=140391416215988&w=2
>>>>>
>>>>> :)
>>>>>
>>>>
>>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>>> local memory resources due to received packets. bridge and vxlan fdb's
>>>> are fairly straightforward analogs to neighbor entries; they are per
>>>> device with no limits on the number of entries. Fragments have memory
>>>> limits per namespace. So neighbor tables are the only ones with this
>>>> strict limitation and concern on memory consumption.
>>>>
>>>> I get the impression there is no longer a strong resistance against
>>>> moving the tables to per namespace, but deciding what is the right
>>>> approach to handle backwards compatibility. Correct? Changing the
>>>> accounting is inevitably going to be noticeable to some use case(s), but
>>>> with sysctl settings it is a simple runtime update once the user knows
>>>> to make the change.
>>>
>>> This question definitely should go to Eric Biederman who was against
>>> my proposal.
>>>
>>> Let's add Eric into CC.
>>
>> Given that the entries are per device and the devices are per-namespace,
>> semantically neighbours are already kept in a per-namespace manner. So
>> this is all about making the code not honoring global resource limits.
>> Making the code not honor gc_thresh3.
>>
>> Skimming through the code today the default for gc_thresh3 is 1024.
>> Which means that we limit the neighbour tables to 1024 entries per
>> protocol type.
>>
>> There are some pretty compelling reasons especially with ipv4 to keep
>> the subnet size down. Arp storms are a real thing.
>>
>> I don't know off the top of my head what the reasons for limiting the
>> neighbour table sizes. I would be much more comfortable with a patchset
>> like this if we did some research and figured out the reasons why
>> we have a global limit. Then changed the code to remove those limits.
>>
>> When the limits are gone. When the code can support large subnets
>> without tuning. We we don't have to worry about someone scanning an all
>> addresses in an ipv6 subnet and causing a DOS on working machines.
>> I think it is completely appropriate to look to see if something per
>> network namespace needs to happen.
>>
>> So please let's address the limits, not the fact that some specific
>> corner case ran into them.
>>
>> If we are going to neuter gc_thresh3 let's go as far as removing it
>> entirely. If we are going to make the neighbour table per something
>> let's make it per network device. If we can afford the multiple hash
>> tables then a hash table per device is better. Perhaps we want to move
>> to rhash tables while we look at this, instead of an old hand grown
>> version of resizable hash table.
>
> Given the uses cases with increasing number of devices (> 10,000),
> per-device tables will have more problems than per namespace - in
> reference to your concern in the last paragraph below.
>
>>
>> Unless I misread something all your patchset did is reshuffle code and
>> data structures so that gc_thresh3 does not apply accross namespaces.
>> That does not feel like it really fixes anything. That just lies to
>> people.
>
> This patch set fixes the lie that network namespaces provide complete
> isolation when in fact one namespace can evict neighbor entries from
> another. An arp storm you are concerned about in one namespace impacts
> all containers.
Network namespaces can not provide complete isolation. They share the
same kernel and they do not dedicate resources to each other.
Namespaces in general are about the names. They are about sharing a
machine efficiently.
I humbly suggest that anyone who wants ``complete'' isolation use a VM
at the very least.
I do think the limits on the neighbour table are quite likely too
strict. We should be able to relax them and continue to have a
networking stack that works for everyone.
> It starts by removing the proliferation of open coded references to
> arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
> From there (patches 14-16) it makes the tables per-namespace and hence
> makes the gc_thresh parameters which are per-table now
> per-table-per-namespace.
>
> So it removes the global thresholds because the global ones are just
> wrong given the meaning of a network namespace and provides the more
> appropriate per-namespace limits.
Absolutely NOT. Global thresholds are exactly correct given the fact
you are running on a single kernel.
Memory is not free (even though we are swimming in enough of it that memory
rarely matters). One of the few remaining challenges for containers
is finding ways to limit resources in such a way that one application
does not mess things up for another container during ordinary usage.
It looks like the neighbour tables absolutely are that kind of problem,
because the artificial limits are too strict. Completely giving up on
limits does not seem the right approach either. We need to fix the limits
we have (perhaps making them go away entirely), not just apply a
band-aid. Let's get to the bottom of this and make the system better.
>> Further unless I misread something you are increasing the number of
>> timers to 3 per namespace. If I create create a thousand network
>> namespaces that feels like it will hurt system performance overall.
>
> It seems to me the timers are per neighbor entry not table. The per
> table ones are for proxies.
It seems I misread that bit when I was refreshing my memory on what
everything is doing. If we can already have 1024 timers that makes
timers not a concern.
Eric
On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>
> Absolutely NOT. Global thresholds are exactly correct given the fact
> you are running on a single kernel.
>
> Memory is not free (Even though we are swimming in enough of it memory
> rarely matters). One of the few remaining challenges is for containers
> is finding was to limit resources in such a way that one application
> does not mess things up for another container during ordinary usage.
>
> It looks like the neighbour tables absolutely are that kind of problem,
> because the artificial limits are too strict. Completely giving up on
> limits does not seem right approach either. We need to fix the limits
> we have (perhaps making them go away entirely), not just apply a
> band-aid. Let's get to the bottom of this and make the system better.
Eric: yes, they all share the global resource of memory and there should
be limits on how many entries a remote entity can create.
Network namespaces can provide a separation such that one namespace does
not disrupt networking in another. It is absolutely appropriate to do
so. Your rigid stance is inconsistent given the basic meaning of a
network namespace and the parallels to this same problem -- bridges,
vxlans, and ip fragments. Only neighbor tables are not per-device or per
namespace; your insistence on global limits is missing the mark and wrong.
On 7/24/18 11:14 AM, David Miller wrote:
> From: David Ahern <[email protected]>
> Date: Tue, 24 Jul 2018 09:14:01 -0600
>
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>>
>> neighbor entries round up to 512 byte allocations, so with the current
>> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
>> limits per namespace seems high which is why I suggested a per-namespace
>> default of (16/32/64) which amounts to 32k per namespace limit by
>> default. Open to other suggestions as well.
>
> No objection from me about going to per-ns neigh tables.
>
> About the defaults, I wonder if we can scale them to the amount of
> memory given to the ns or something like that? I bet this will better
> match the intended use of the ns.
>
Not sure how to do that. I am not aware of memory allocations to a
network namespace. As I understand it containers use cgroups to control
memory use, but I am not aware of any direct ties to namespace.
David Ahern <[email protected]> writes:
> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>
>> Absolutely NOT. Global thresholds are exactly correct given the fact
>> you are running on a single kernel.
>>
>> Memory is not free (Even though we are swimming in enough of it memory
>> rarely matters). One of the few remaining challenges is for containers
>> is finding was to limit resources in such a way that one application
>> does not mess things up for another container during ordinary usage.
>>
>> It looks like the neighbour tables absolutely are that kind of problem,
>> because the artificial limits are too strict. Completely giving up on
>> limits does not seem right approach either. We need to fix the limits
>> we have (perhaps making them go away entirely), not just apply a
>> band-aid. Let's get to the bottom of this and make the system better.
>
> Eric: yes, they all share the global resource of memory and there should
> be limits on how many entries a remote entity can create.
>
> Network namespaces can provide a separation such that one namespace does
> not disrupt networking in another. It is absolutely appropriate to do
> so. Your rigid stance is inconsistent given the basic meaning of a
> network namespace and the parallels to this same problem -- bridges,
> vxlans, and ip fragments. Only neighbor tables are not per-device or per
> namespace; your insistence on global limits is missing the mark and wrong.
That is not what I said. Let me rephrase and see if you understand.
The problem appears to be one of lots of devices. Fundamentally if you use
lots of network devices today unless you adjust gc_thresh3 you will run
out of neighbour table entries.
The problem has a bigger scope than what you are looking at.
If you fix the core problem you won't see the problem in the context
of network namespaces either.
Default limits should be something that will never be hit unless
something goes crazy. We are hitting them. Therefore by definition
there is a bug in these limits.
And yes there is absolutely a place for global limits on things like
inodes, file descriptors etc, that does not care about which part of the
kernel you are in. However hitting those limits in normal operation is
a bug.
We have ourselves a bug.
Eric
p.s. I wrote the definition of network namespaces and it absolutely does
have room for global limits. One of the things Linus has periodically
yelled at me about is that there are not enough of them.
From: Eric W. Biederman
> Sent: 25 July 2018 18:38
...
> >> Further unless I misread something you are increasing the number of
> >> timers to 3 per namespace. If I create create a thousand network
> >> namespaces that feels like it will hurt system performance overall.
> >
> > It seems to me the timers are per neighbor entry not table. The per
> > table ones are for proxies.
>
> It seems I misread that bit when I was refreshing my memory on what
> everything is doing. If we can already have 1024 timers that makes
> timers not a concern.
Surely it is enough to just have a timestamp in each entry.
Deletion of expired items need not be done until insert (which
has the table suitably locked) bumps into an expired item.
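
As a rough illustration of the timestamp idea, the sketch below keeps a
stamp in each entry and reclaims an expired entry only when an insert
probes its slot. Everything here (the table size, MAX_AGE, the struct
and function names) is hypothetical and is not how the kernel neighbour
code works; it covers expiry only, not any retransmit handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 8u	/* hypothetical tiny table, power of two */
#define MAX_AGE 30u	/* hypothetical entry lifetime in seconds */

struct entry {
	bool used;
	uint32_t key;
	uint64_t stamp;	/* time the entry was last confirmed */
};

static struct entry table[SLOTS];

static bool expired(const struct entry *e, uint64_t now)
{
	return e->used && now - e->stamp > MAX_AGE;
}

/*
 * Insert with linear probing and no timer. A matching key is refreshed;
 * otherwise the first free or expired slot seen while probing is reused.
 * Returns false if every slot holds a live entry for another key.
 */
static bool lazy_insert(uint32_t key, uint64_t now)
{
	struct entry *reuse = NULL;

	for (size_t i = 0; i < SLOTS; i++) {
		struct entry *e = &table[(key + i) & (SLOTS - 1)];

		if (e->used && e->key == key) {
			e->stamp = now;
			return true;
		}
		if (!reuse && (!e->used || expired(e, now)))
			reuse = e;
	}
	if (!reuse)
		return false;

	reuse->used = true;
	reuse->key = key;
	reuse->stamp = now;
	return true;
}

/* A lookup treats an expired entry as absent but does not delete it. */
static bool lazy_lookup(uint32_t key, uint64_t now)
{
	for (size_t i = 0; i < SLOTS; i++) {
		const struct entry *e = &table[(key + i) & (SLOTS - 1)];

		if (e->used && e->key == key)
			return !expired(e, now);
	}
	return false;
}
```

The cost of this approach is that stale entries keep their memory until
an insert happens to probe them.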
David
David Laight <[email protected]> writes:
> From: Eric W. Biederman
>> Sent: 25 July 2018 18:38
> ...
>> >> Further unless I misread something you are increasing the number of
>> >> timers to 3 per namespace. If I create create a thousand network
>> >> namespaces that feels like it will hurt system performance overall.
>> >
>> > It seems to me the timers are per neighbor entry not table. The per
>> > table ones are for proxies.
>>
>> It seems I misread that bit when I was refreshing my memory on what
>> everything is doing. If we can already have 1024 timers that makes
>> timers not a concern.
>
> Surely it is enough to just have a timestamp in each entry.
> Deletion of expired items need not be done until insert (which
> has the table suitable locked) bumps into an expired item.
Part of the state machine the timer is used for is sending retransmits
before we get our first response, so the state machine is not just
about expiry and may not be simple enough to do without a timer.
I honestly don't remember the tradeoffs between the various kinds of
timers. If we already can have up to 1000 timers without problems I suspect
they are cheap enough that optimizing for fewer timers simply doesn't matter.
Certainly we are not in regression territory if we don't.
As for bumping into expired items on insert it would be a bad hash table
if we were bumping into any other entries frequently on insert.
Eric
On 07/17/2018 03:06 PM, [email protected] wrote:
> From: David Ahern <[email protected]>
>
> Nikita Leshenko reported that neighbor entries in one namespace can
> evict neighbor entries in another. The problem is that the neighbor
> tables have entries across all namespaces without separate accounting
> and with global limits on when to scan for entries to evict.
>
> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> namespace and making the accounting and threshold limits per namespace.
Dear David,
I prepared my own patch set to fix this problem and then found yours.
It looks perfect to me, and I hope David Miller will merge it soon;
however, I have found a few drawbacks:
1) I know that if a net_device exists it always has a correct net reference,
so dev_net(dev) will always be correct.
However, I am afraid that the device reference itself may not be correct in some places.
For example,
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
@@ -376,9 +377,10 @@ mlxsw_sp_span_entry_gretap4_parms(const struct net_device *to_dev,
 		return mlxsw_sp_span_entry_unoffloadable(sparmsp);
 	l3edev = mlxsw_sp_span_gretap4_route(to_dev, &saddr.addr4, &gw.addr4);
+	tbl = ipv4_neigh_table(dev_net(l3edev));
 	return mlxsw_sp_span_entry_tunnel_parms_common(l3edev, saddr, daddr, gw,
 						       tparm.iph.ttl,
-						       &arp_tbl, sparmsp);
+						       tbl, sparmsp);
 }
mlxsw_sp_span_entry_tunnel_parms_common() has an "if (!edev)" check inside,
so it seems l3edev can be NULL here, which would lead to a crash inside dev_net(l3edev).
There are a few other suspicious places and I think they should be carefully re-checked.
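The concern in point 1 can be illustrated with a small user-space sketch. The toy_* types and functions below are hypothetical stand-ins for struct net, struct net_device and the table wrapper; they are not the driver's actual code or a proposed fix. The point is only that the device pointer has to be tested before anything equivalent to dev_net() dereferences it.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net and struct net_device. */
struct toy_net { int arp_tbl; };
struct toy_dev { struct toy_net *net; };

static struct toy_net *toy_dev_net(const struct toy_dev *dev)
{
	return dev->net;	/* dereferences dev: dev must not be NULL */
}

/*
 * Returns the table of the device's namespace, or NULL when the route
 * lookup produced no device. The caller can then take its existing
 * "no device" path instead of crashing.
 */
static int *toy_ipv4_neigh_table_of(const struct toy_dev *dev)
{
	if (!dev)
		return NULL;
	return &toy_dev_net(dev)->arp_tbl;
}
```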
2) The modified arp_net_init() does not check the return value of neigh_sysctl_register() and lacks a correct rollback.
That was acceptable in arp_init(), because it was called only once at boot, but now it will be called
for each new net namespace, so it has a real chance of failing and leading to a crash or memory corruption.
3) The modified neigh_table_init() is called many times per netns, but it can panic if a memory allocation fails.
I think it should be reworked to return errors in such cases; its callers should check the result and add correct rollbacks.
4) Currently neigh_table_clear() always returns 0; I think it makes sense to change it to return void.
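Points 2 to 4 describe the usual pattern for per-namespace init code: return an error instead of panicking, undo earlier steps when a later one fails, and let the teardown return void. A rough user-space sketch of that pattern follows. Everything in it (toy_netns, toy_alloc, toy_fail_at, toy_net_init, toy_net_exit) is hypothetical, and malloc() merely stands in for the table allocation and the sysctl registration.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_netns {
	void *table;	/* stands in for the per-namespace neighbour table */
	void *sysctl;	/* stands in for the sysctl registration */
};

/* Fault injection: 0 fails the next call, 1 the one after, -1 never. */
static int toy_fail_at = -1;

static void *toy_alloc(void)
{
	if (toy_fail_at == 0)
		return NULL;
	if (toy_fail_at > 0)
		toy_fail_at--;
	return malloc(16);
}

/* Returns 0 or a negative errno; on failure nothing stays allocated. */
static int toy_net_init(struct toy_netns *ns)
{
	ns->table = toy_alloc();
	if (!ns->table)
		return -ENOMEM;

	ns->sysctl = toy_alloc();
	if (!ns->sysctl)
		goto free_table;

	return 0;

free_table:
	free(ns->table);
	ns->table = NULL;
	return -ENOMEM;
}

/* Teardown cannot fail, so it returns void. */
static void toy_net_exit(struct toy_netns *ns)
{
	free(ns->sysctl);
	free(ns->table);
	ns->sysctl = NULL;
	ns->table = NULL;
}
```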
Thank you,
Vasily Averin
On 8/12/18 12:46 AM, Vasily Averin wrote:
> On 07/17/2018 03:06 PM, [email protected] wrote:
>> From: David Ahern <[email protected]>
>>
>> Nikita Leshenko reported that neighbor entries in one namespace can
>> evict neighbor entries in another. The problem is that the neighbor
>> tables have entries across all namespaces without separate accounting
>> and with global limits on when to scan for entries to evict.
>>
>> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
>> namespace and making the accounting and threshold limits per namespace.
>
> Dear David,
> I prepared my own patch set to fix this problem and then found yours.
> It looks perfect to me, and I hope David Miller will merge it soon;
> however, I have found a few drawbacks:
>
Hi:
I just returned from an extended vacation. I will revive this topic in
the next few days.
Thanks for the comments. I will address them in the next version.
On 7/25/18 1:17 PM, Eric W. Biederman wrote:
> David Ahern <[email protected]> writes:
>
>> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>>
>>> Absolutely NOT. Global thresholds are exactly correct given the fact
>>> you are running on a single kernel.
>>>
>>> Memory is not free (even though we are swimming in enough of it that memory
>>> rarely matters). One of the few remaining challenges for containers
>>> is finding ways to limit resources in such a way that one application
>>> does not mess things up for another container during ordinary usage.
>>>
>>> It looks like the neighbour tables absolutely are that kind of problem,
>>> because the artificial limits are too strict. Completely giving up on
>>> limits does not seem like the right approach either. We need to fix the limits
>>> we have (perhaps making them go away entirely), not just apply a
>>> band-aid. Let's get to the bottom of this and make the system better.
>>
>> Eric: yes, they all share the global resource of memory and there should
>> be limits on how many entries a remote entity can create.
>>
>> Network namespaces can provide a separation such that one namespace does
>> not disrupt networking in another. It is absolutely appropriate to do
>> so. Your rigid stance is inconsistent given the basic meaning of a
>> network namespace and the parallels to this same problem -- bridges,
>> vxlans, and ip fragments. Only neighbor tables are not per-device or per
>> namespace; your insistence on global limits is missing the mark and wrong.
>
> That is not what I said. Let me rephrase and see if you understand.
>
> The problem appears to be one of lots of devices. Fundamentally, if you use
> lots of network devices today, unless you adjust gc_thresh3 you will run
> out of neighbour table entries.
>
> The problem has a bigger scope than what you are looking at.
>
> If you fix the core problem you won't see the problem in the context
> of network namespaces either.
>
> Default limits should be something that will never be hit unless
> something goes crazy. We are hitting them. Therefore by definition
> there is a bug in these limits.
I disagree that the problem is a global limit. It is trivial for users
to increase gc_thresh3. That does not solve the fundamental problem.
>
>
> And yes there is absolutely a place for global limits on things like
> inodes, file descriptors etc, that does not care about which part of the
> kernel you are in. However hitting those limits in normal operation is
> a bug.
>
> We have ourselves a bug.
I agree we have a bug; we disagree on what that bug is.
I am just back from vacation and re-read your responses. Nowhere do you
acknowledge the fundamental point of this patch set - that adding a new
neighbor entry in one namespace can evict an entry in another namespace
or, worse, networking in one namespace can fail due to table overflow
because of entries from another. That is a real problem.
It is not a matter of increasing the default gc_thresh3 to some number
N; it is ensuring that regardless of the value of gc_thresh3 one
namespace is not affected by another.
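To make the distinction concrete, here is a deliberately simplified user-space sketch of the two accounting schemes. It is not the neighbour code: TOY_THRESH3 only plays the role of gc_thresh3, the toy_* names are hypothetical, and forced garbage collection is reduced to an outright refusal.

```c
#define TOY_THRESH3 4	/* hypothetical hard limit, in the role of gc_thresh3 */

struct toy_ns { int entries; };

static int toy_global_entries;

/* Shared accounting: every namespace draws on the same counter. */
static int toy_add_global(struct toy_ns *ns)
{
	if (toy_global_entries >= TOY_THRESH3)
		return -1;	/* table overflow, whoever filled the table */
	toy_global_entries++;
	ns->entries++;
	return 0;
}

/* Per-namespace accounting: the limit applies to each namespace alone. */
static int toy_add_per_ns(struct toy_ns *ns)
{
	if (ns->entries >= TOY_THRESH3)
		return -1;
	ns->entries++;
	return 0;
}
```

With the shared counter, a namespace that has no entries of its own is refused once another namespace has filled the table, whatever value the limit has. With per-namespace counters only the namespace that filled its own table is refused.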
You created network namespaces and they provide isolation -- separate
tables essentially -- for devices, FIB entries, sockets, etc, but you
argue against completing the task with separate neighbor tables, which is
very strange given the impact (completely broken networking).
>
> Eric
>
> p.s. I wrote the definition of network namespaces and it absolutely does
> have room for global limits. One of the things Linus has periodically
> yelled at me about is that there are not enough of them.
>
David Ahern <[email protected]> writes:
> On 7/25/18 1:17 PM, Eric W. Biederman wrote:
>> David Ahern <[email protected]> writes:
>>
>>> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>>>>
>>>> Absolutely NOT. Global thresholds are exactly correct given the fact
>>>> you are running on a single kernel.
>>>>
>>>> Memory is not free (even though we are swimming in enough of it that memory
>>>> rarely matters). One of the few remaining challenges for containers
>>>> is finding ways to limit resources in such a way that one application
>>>> does not mess things up for another container during ordinary usage.
>>>>
>>>> It looks like the neighbour tables absolutely are that kind of problem,
>>>> because the artificial limits are too strict. Completely giving up on
>>>> limits does not seem like the right approach either. We need to fix the limits
>>>> we have (perhaps making them go away entirely), not just apply a
>>>> band-aid. Let's get to the bottom of this and make the system better.
>>>
>>> Eric: yes, they all share the global resource of memory and there should
>>> be limits on how many entries a remote entity can create.
>>>
>>> Network namespaces can provide a separation such that one namespace does
>>> not disrupt networking in another. It is absolutely appropriate to do
>>> so. Your rigid stance is inconsistent given the basic meaning of a
>>> network namespace and the parallels to this same problem -- bridges,
>>> vxlans, and ip fragments. Only neighbor tables are not per-device or per
>>> namespace; your insistence on global limits is missing the mark and wrong.
>>
>> That is not what I said. Let me rephrase and see if you understand.
>>
>> The problem appears to be one of lots of devices. Fundamentally, if you use
>> lots of network devices today, unless you adjust gc_thresh3 you will run
>> out of neighbour table entries.
>>
>> The problem has a bigger scope than what you are looking at.
>>
>> If you fix the core problem you won't see the problem in the context
>> of network namespaces either.
>>
>> Default limits should be something that will never be hit unless
>> something goes crazy. We are hitting them. Therefore by definition
>> there is a bug in these limits.
>
> I disagree that the problem is a global limit. It is trivial for users
> to increase gc_thresh3. That does not solve the fundamental problem.
>
>>
>>
>> And yes there is absolutely a place for global limits on things like
>> inodes, file descriptors etc, that does not care about which part of the
>> kernel you are in. However hitting those limits in normal operation is
>> a bug.
>>
>> We have ourselves a bug.
>
> I agree we have a bug; we disagree on what that bug is.
>
> I am just back from vacation and re-read your responses. Nowhere do you
> acknowledge the fundamental point of this patch set - that adding a new
> neighbor entry in one namespace can evict an entry in another namespace
> or, worse, networking in one namespace can fail due to table overflow
> because of entries from another. That is a real problem.
>
> It is not a matter of increasing the default gc_thresh3 to some number
> N; it is ensuring that regardless of the value of gc_thresh3 one
> namespace is not affected by another.
My suggestion is to look at the problem and its requirements and figure
out how to safely remove gc_thresh3 entirely. We do have to ensure
neighbour tables don't grow too large, but I expect we can do it in a way
that can scale from a small machine with few neighbours to a large
machine with many neighbours.
Perhaps the code just needs to limit, on a per-interface basis, the number
of neighbours that have never replied and that the code is still probing.
It still may make sense to have a global limit of perhaps a million
entries just because that would be an indicator that something has truly
gone weird.
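One possible reading of this suggestion, as a toy user-space sketch. The names and the per-interface value are hypothetical; only the "perhaps a million" global figure comes from the text above, and none of this is existing kernel behaviour.

```c
#define TOY_MAX_INCOMPLETE_PER_DEV 3	/* hypothetical per-interface cap */
#define TOY_GLOBAL_CAP 1000000L		/* "perhaps a million" sanity limit */

struct toy_iface {
	int incomplete;	/* neighbours being probed that never replied */
};

static long toy_total_entries;

/* Start probing a new neighbour on dev; returns -1 if a limit is hit. */
static int toy_neigh_create(struct toy_iface *dev)
{
	if (toy_total_entries >= TOY_GLOBAL_CAP)
		return -1;	/* something has truly gone weird */
	if (dev->incomplete >= TOY_MAX_INCOMPLETE_PER_DEV)
		return -1;	/* too many unanswered probes on this interface */
	dev->incomplete++;
	toy_total_entries++;
	return 0;
}

/* The neighbour replied, so it stops counting against the interface cap. */
static void toy_neigh_resolved(struct toy_iface *dev)
{
	dev->incomplete--;
}
```

In this sketch resolved neighbours are bounded only by the global cap, so an interface with many real neighbours is not penalised, while a flood of unanswered probes on one interface cannot crowd out the others.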
> You created network namespaces and they provide isolation -- separate
> tables essentially -- for devices, FIB entries, sockets, etc, but you
> argue against completing the task with separate neighbor tables, which is
> very strange given the impact (completely broken networking).
Namespaces provide isolation at the level of names. The objects still
share a kernel and compete for resources. Not competing for resources
would require each namespace to have its own dedicated pool of resources,
which over the whole machine would be much less efficient.
That is the fundamental design difference between namespaces and VMs
and it is why namespaces can be much cheaper and much more resource
efficient. Reserving your worst case resource usage ahead of time tends
to result in a lot of inefficiencies.
Eric