2005-09-08 17:13:47

by Suzanne Wood

[permalink] [raw]
Subject: [RFC][PATCH] identify in_dev_get rcu read-side critical sections


Please consider this request for suggestions on an attempt at a partial patch
based on the assumptions below to identify rcu read-side critical sections
for in_dev_get() defined in inetdevice.h. Thank you.

drivers/s390/net/lcs.c | 2 ++
drivers/s390/net/qeth_main.c | 4 ++++
include/linux/inetdevice.h | 2 +-
net/ipv4/arp.c | 17 +++++++++++------
net/ipv4/devinet.c | 6 +++++-
net/ipv4/icmp.c | 4 ++--
net/ipv4/igmp.c | 22 ++++++++++++++++++----
net/ipv4/ip_gre.c | 1 +
net/ipv4/ip_input.c | 6 ++++--
net/ipv4/route.c | 41 ++++++++++++++++++++++++++++-------------
10 files changed, 76 insertions(+), 29 deletions(-)

---------------------------------------------------

Assumptions made in this patch development:

1. Programmer who inserts rcu_lock/unlock around an in_dev_get or
__in_dev_get expects the protection of deferred destruction.

2. The marking of an rcu read-side critical section done by
the calling function has the vision of the extent of use of the
protected dereference.

3. Pairings of in_dev_get/in_dev_put and __in_dev_get/__in_dev_put
indicate the need to balance increment/decrement of refcnt,
so __in_dev_get which does not increment is likely paired in
error with __in_dev_put which decrements (unless in_dev_hold
or similar is in the path).

4. If programmer chooses __in_dev_get rather than in_dev_get,
the refcnt is not desired.

5. If a function returns an rcu protected pointer, the rcu_read_lock
is in place, so the rcu_read_unlock occurs in the caller.

Questions/Suggestions:

A pairing of in_dev_get with in_dev_put may indicate the addition
in the in_dev_put conditional of something like
call_rcu(&idev->rcu_head, in_dev->rcu_put)
after in_dev_finish_destroy or replace the latter with a call like
inetdev_destroy().

Differences between inetdevice.h in kernel 2.5.60 and linux-2.6.13-rc6
include the replacement in in_dev_get() of read_lock(&inetdev_lock)/unlock
with rcu_read_lock/unlock and the addition of the rcu_head to the
in_ifaddr struct.

In net/ipv4/route.c, consider __mkroute_output and follow the
three refcnt increments done to out_dev by dev_hold and in_dev_get,
then one decrement to in_dev. Correct refcnt may also be an issue in
xfrm4_policy.c to allow the atomic_dec_and_test in in_dev_put
to appropriately recognize the condition to free the in_device.

In net/ipv4/arp.c, arp_req_set() includes the following where,
with possible update, might this be risky to test the deref and
deref again:
if (__in_dev_get(dev)) {
__in_dev_get(dev)->cnf.proxy_arp = 1;
return 0;
}
and similarly in arp_req_delete? While the vast majority of uses
of __in_dev_get indicate rcu protection, the above may be the exception
that breaks the rule. It seems reasonable to have both in_dev_get
and __in_dev_get use rcu_dereference and have the difference between
them be the refcnt. For now, I'll submit a patch addressing only
in_dev_get and hope for feedback on patching __in_dev_get. Because
one way or another, there are __in_dev_get uses that probably need
to be addressed.

While this is a subjective attempt to understand programmer intent,
with your help and suggestions, the plan is to build an automated
checker. Thank you.

------------------------------------------------------------------

diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/drivers/s390/net/lcs.c linux-2.6.13-rc6/drivers/s390/net/lcs.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/drivers/s390/net/lcs.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/drivers/s390/net/lcs.c 2005-09-07 21:37:41.000000000 -0700
@@ -1270,6 +1270,7 @@ lcs_register_mc_addresses(void *data)
return 0;
LCS_DBF_TEXT(4, trace, "regmulti");

+ rcu_read_lock();
in4_dev = in_dev_get(card->dev);
if (in4_dev == NULL)
goto out;
@@ -1278,6 +1279,7 @@ lcs_register_mc_addresses(void *data)
lcs_set_mc_addresses(card, in4_dev);
read_unlock(&in4_dev->mc_list_lock);
in_dev_put(in4_dev);
+ rcu_read_unlock();

lcs_fix_multicast_list(card);
out:
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/drivers/s390/net/qeth_main.c linux-2.6.13-rc6/drivers/s390/net/qeth_main.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/drivers/s390/net/qeth_main.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/drivers/s390/net/qeth_main.c 2005-09-07 22:22:10.000000000 -0700
@@ -5441,6 +5441,7 @@ qeth_add_vlan_mc(struct qeth_card *card)
if (vg->vlan_devices[i] == NULL ||
!(vg->vlan_devices[i]->flags & IFF_UP))
continue;
+ rcu_read_lock();
in_dev = in_dev_get(vg->vlan_devices[i]);
if (!in_dev)
continue;
@@ -5448,6 +5449,7 @@ qeth_add_vlan_mc(struct qeth_card *card)
qeth_add_mc(card,in_dev);
read_unlock(&in_dev->mc_list_lock);
in_dev_put(in_dev);
+ rcu_read_unlock();
}
#endif
}
@@ -5458,6 +5460,7 @@ qeth_add_multicast_ipv4(struct qeth_card
struct in_device *in4_dev;

QETH_DBF_TEXT(trace,4,"chkmcv4");
+ rcu_read_lock();
in4_dev = in_dev_get(card->dev);
if (in4_dev == NULL)
return;
@@ -5466,6 +5469,7 @@ qeth_add_multicast_ipv4(struct qeth_card
qeth_add_vlan_mc(card);
read_unlock(&in4_dev->mc_list_lock);
in_dev_put(in4_dev);
+ rcu_read_unlock();
}

#ifdef CONFIG_QETH_IPV6
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/include/linux/inetdevice.h linux-2.6.13-rc6/include/linux/inetdevice.h
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/include/linux/inetdevice.h 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/include/linux/inetdevice.h 2005-08-19 14:27:02.000000000 -0700
@@ -148,7 +148,7 @@ in_dev_get(const struct net_device *dev)
struct in_device *in_dev;

rcu_read_lock();
- in_dev = dev->ip_ptr;
+ in_dev = rcu_dereference(dev->ip_ptr);
if (in_dev)
atomic_inc(&in_dev->refcnt);
rcu_read_unlock();
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/arp.c linux-2.6.13-rc6/net/ipv4/arp.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/arp.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/arp.c 2005-09-07 22:49:09.000000000 -0700
@@ -334,9 +334,10 @@ static void arp_solicit(struct neighbour
struct net_device *dev = neigh->dev;
u32 target = *(u32*)neigh->primary_key;
int probes = atomic_read(&neigh->probes);
- struct in_device *in_dev = in_dev_get(dev);
+ struct in_device *in_dev;

- if (!in_dev)
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) == NULL)
return;

switch (IN_DEV_ARP_ANNOUNCE(in_dev)) {
@@ -362,6 +363,8 @@ static void arp_solicit(struct neighbour

if (in_dev)
in_dev_put(in_dev);
+ rcu_read_unlock();
+
if (!saddr)
saddr = inet_select_addr(dev, target, RT_SCOPE_LINK);

@@ -543,11 +546,12 @@ static inline int arp_fwd_proxy(struct i
return 0;

/* place to check for proxy_arp for routes */
-
+ rcu_read_lock();
if ((out_dev = in_dev_get(rt->u.dst.dev)) != NULL) {
omi = IN_DEV_MEDIUM_ID(out_dev);
in_dev_put(out_dev);
}
+ rcu_read_unlock();
return (omi != imi && omi != -1);
}

@@ -710,7 +714,7 @@ static void parp_redo(struct sk_buff *sk
static int arp_process(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
- struct in_device *in_dev = in_dev_get(dev);
+ struct in_device *in_dev;
struct arphdr *arp;
unsigned char *arp_ptr;
struct rtable *rt;
@@ -723,8 +727,8 @@ static int arp_process(struct sk_buff *s
/* arp_rcv below verifies the ARP header and verifies the device
* is ARP'able.
*/
-
- if (in_dev == NULL)
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) == NULL)
goto out;

arp = skb->nh.arph;
@@ -918,6 +922,7 @@ static int arp_process(struct sk_buff *s
out:
if (in_dev)
in_dev_put(in_dev);
+ rcu_read_unlock();
kfree_skb(skb);
return 0;
}
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/devinet.c linux-2.6.13-rc6/net/ipv4/devinet.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/devinet.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/devinet.c 2005-09-08 00:38:01.000000000 -0700
@@ -372,12 +372,15 @@ static int inet_set_ifa(struct net_devic
return inet_insert_ifa(ifa);
}

+/* returns with rcu read-critical section open, so rcu_read_unlock() in caller */
+
struct in_device *inetdev_by_index(int ifindex)
{
struct net_device *dev;
struct in_device *in_dev = NULL;
read_lock(&dev_base_lock);
dev = __dev_get_by_index(ifindex);
+ rcu_read_lock();
if (dev)
in_dev = in_dev_get(dev);
read_unlock(&dev_base_lock);
@@ -409,7 +412,8 @@ static int inet_rtm_deladdr(struct sk_bu

if ((in_dev = inetdev_by_index(ifm->ifa_index)) == NULL)
goto out;
- __in_dev_put(in_dev);
+ in_dev_put(in_dev);
+ rcu_read_unlock();

for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
ifap = &ifa->ifa_next) {
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/icmp.c linux-2.6.13-rc6/net/ipv4/icmp.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/icmp.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/icmp.c 2005-09-07 23:37:08.000000000 -0700
@@ -890,10 +890,10 @@ static void icmp_address_reply(struct sk
if (skb->len < 4 || !(rt->rt_flags&RTCF_DIRECTSRC))
goto out;

+ rcu_read_lock();
in_dev = in_dev_get(dev);
if (!in_dev)
goto out;
- rcu_read_lock();
if (in_dev->ifa_list &&
IN_DEV_LOG_MARTIANS(in_dev) &&
IN_DEV_FORWARD(in_dev)) {
@@ -913,8 +913,8 @@ static void icmp_address_reply(struct sk
NIPQUAD(*mp), dev->name, NIPQUAD(rt->rt_src));
}
}
- rcu_read_unlock();
in_dev_put(in_dev);
+ rcu_read_unlock();
out:;
}

diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/igmp.c linux-2.6.13-rc6/net/ipv4/igmp.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/igmp.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/igmp.c 2005-09-08 00:35:52.000000000 -0700
@@ -864,10 +864,12 @@ int igmp_rcv(struct sk_buff *skb)
{
/* This basically follows the spec line by line -- see RFC1112 */
struct igmphdr *ih;
- struct in_device *in_dev = in_dev_get(skb->dev);
+ struct in_device *in_dev;
int len = skb->len;

- if (in_dev==NULL) {
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(skb->dev) == NULL) {
+ rcu_read_unlock();
kfree_skb(skb);
return 0;
}
@@ -875,6 +877,7 @@ int igmp_rcv(struct sk_buff *skb)
if (!pskb_may_pull(skb, sizeof(struct igmphdr)) ||
(u16)csum_fold(skb_checksum(skb, 0, len, 0))) {
in_dev_put(in_dev);
+ rcu_read_unlock();
kfree_skb(skb);
return 0;
}
@@ -907,6 +910,7 @@ int igmp_rcv(struct sk_buff *skb)
NETDEBUG(printk(KERN_DEBUG "New IGMP type=%d, why we do not know about it?\n", ih->type));
}
in_dev_put(in_dev);
+ rcu_read_unlock();
kfree_skb(skb);
return 0;
}
@@ -1307,8 +1311,8 @@ static struct in_device * ip_mc_find_dev
if (imr->imr_ifindex) {
idev = inetdev_by_index(imr->imr_ifindex);
if (idev)
- __in_dev_put(idev);
- return idev;
+ __in_dev_put(idev); /* decrement before return? */
+ return idev; /* caller should rcu_read_unlock */
}
if (imr->imr_address.s_addr) {
dev = ip_dev_find(imr->imr_address.s_addr);
@@ -2098,6 +2102,7 @@ void ip_mc_drop_socket(struct sock *sk)
(void) ip_mc_leave_src(sk, iml, in_dev);
ip_mc_dec_group(in_dev, iml->multi.imr_multiaddr.s_addr);
in_dev_put(in_dev);
+ rcu_read_unlock();
}
sock_kfree_s(sk, iml, sizeof(*iml));

@@ -2154,6 +2159,7 @@ static inline struct ip_mc_list *igmp_mc
state->dev;
state->dev = state->dev->next) {
struct in_device *in_dev;
+ rcu_read_lock();
in_dev = in_dev_get(state->dev);
if (!in_dev)
continue;
@@ -2165,6 +2171,7 @@ static inline struct ip_mc_list *igmp_mc
}
read_unlock(&in_dev->mc_list_lock);
in_dev_put(in_dev);
+ rcu_read_unlock();
}
return im;
}
@@ -2183,11 +2190,13 @@ static struct ip_mc_list *igmp_mc_get_ne
state->in_dev = NULL;
break;
}
+ rcu_read_lock();
state->in_dev = in_dev_get(state->dev);
if (!state->in_dev)
continue;
read_lock(&state->in_dev->mc_list_lock);
im = state->in_dev->mc_list;
+ rcu_read_unlock();
}
return im;
}
@@ -2317,6 +2326,7 @@ static inline struct ip_sf_list *igmp_mc
state->dev;
state->dev = state->dev->next) {
struct in_device *idev;
+ rcu_read_lock();
idev = in_dev_get(state->dev);
if (unlikely(idev == NULL))
continue;
@@ -2334,6 +2344,7 @@ static inline struct ip_sf_list *igmp_mc
}
read_unlock(&idev->mc_list_lock);
in_dev_put(idev);
+ rcu_read_unlock();
}
return psf;
}
@@ -2356,11 +2367,14 @@ static struct ip_sf_list *igmp_mcf_get_n
state->idev = NULL;
goto out;
}
+ rcu_read_lock();
state->idev = in_dev_get(state->dev);
if (!state->idev)
+ rcu_read_unlock();
continue;
read_lock(&state->idev->mc_list_lock);
state->im = state->idev->mc_list;
+ rcu_read_unlock();
}
if (!state->im)
break;
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/ip_gre.c linux-2.6.13-rc6/net/ipv4/ip_gre.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/ip_gre.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/ip_gre.c 2005-09-08 00:42:38.000000000 -0700
@@ -1121,6 +1121,7 @@ static int ipgre_close(struct net_device
ip_mc_dec_group(in_dev, t->parms.iph.daddr);
in_dev_put(in_dev);
}
+ rcu_read_unlock();
}
return 0;
}
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/ip_input.c linux-2.6.13-rc6/net/ipv4/ip_input.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/ip_input.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/ip_input.c 2005-09-08 00:45:42.000000000 -0700
@@ -330,8 +330,9 @@ static inline int ip_rcv_finish(struct s

opt = &(IPCB(skb)->opt);
if (opt->srr) {
- struct in_device *in_dev = in_dev_get(dev);
- if (in_dev) {
+ struct in_device *in_dev;
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) != NULL) {
if (!IN_DEV_SOURCE_ROUTE(in_dev)) {
if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit())
printk(KERN_INFO "source route option %u.%u.%u.%u -> %u.%u.%u.%u\n",
@@ -341,6 +342,7 @@ static inline int ip_rcv_finish(struct s
}
in_dev_put(in_dev);
}
+ rcu_read_unlock();
if (ip_options_rcv_srr(skb))
goto drop;
}
diff -urpNa -X dontdiff /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/route.c linux-2.6.13-rc6/net/ipv4/route.c
--- /mnt/shared/linuxKernel2_6/linux-2.6.13-rc6/net/ipv4/route.c 2005-08-07 11:18:56.000000000 -0700
+++ linux-2.6.13-rc6/net/ipv4/route.c 2005-09-08 08:39:01.000000000 -0700
@@ -1112,14 +1112,15 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
u32 saddr, u8 tos, struct net_device *dev)
{
int i, k;
- struct in_device *in_dev = in_dev_get(dev);
+ struct in_device *in_dev;
struct rtable *rth, **rthp;
u32 skeys[2] = { saddr, 0 };
int ikeys[2] = { dev->ifindex, 0 };

tos &= IPTOS_RT_MASK;

- if (!in_dev)
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) == NULL)
return;

if (new_gw == old_gw || !IN_DEV_RX_REDIRECTS(in_dev)
@@ -1171,6 +1172,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
if (rt == NULL) {
ip_rt_put(rth);
in_dev_put(in_dev);
+ rcu_read_unlock();
return;
}

@@ -1223,6 +1225,7 @@ void ip_rt_redirect(u32 old_gw, u32 dadd
}
}
in_dev_put(in_dev);
+ rcu_read_unlock();
return;

reject_redirect:
@@ -1236,6 +1239,7 @@ reject_redirect:
NIPQUAD(saddr), NIPQUAD(daddr), tos);
#endif
in_dev_put(in_dev);
+ rcu_read_unlock();
}

static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
@@ -1284,9 +1288,9 @@ static struct dst_entry *ipv4_negative_a
void ip_rt_send_redirect(struct sk_buff *skb)
{
struct rtable *rt = (struct rtable*)skb->dst;
- struct in_device *in_dev = in_dev_get(rt->u.dst.dev);
-
- if (!in_dev)
+ struct in_device *in_dev;
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(rt->u.dst.dev)) == NULL)
return;

if (!IN_DEV_TX_REDIRECTS(in_dev))
@@ -1327,6 +1331,7 @@ void ip_rt_send_redirect(struct sk_buff
}
out:
in_dev_put(in_dev);
+ rcu_read_unlock();
}

static int ip_error(struct sk_buff *skb)
@@ -1482,10 +1487,12 @@ static void ipv4_dst_ifdown(struct dst_e
struct rtable *rt = (struct rtable *) dst;
struct in_device *idev = rt->idev;
if (dev != &loopback_dev && idev && idev->dev == dev) {
- struct in_device *loopback_idev = in_dev_get(&loopback_dev);
- if (loopback_idev) {
+ struct in_device *loopback_idev;
+ rcu_read_lock();
+ if (loopback_idev = in_dev_get(&loopback_dev) != NULL) {
rt->idev = loopback_idev;
in_dev_put(idev);
+ rcu_read_unlock();
}
}
}
@@ -1593,12 +1600,12 @@ static int ip_route_input_mc(struct sk_b
unsigned hash;
struct rtable *rth;
u32 spec_dst;
- struct in_device *in_dev = in_dev_get(dev);
+ struct in_device *in_dev;
u32 itag = 0;

/* Primary sanity checks. */
-
- if (in_dev == NULL)
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) == NULL)
return -EINVAL;

if (MULTICAST(saddr) || BADCLASS(saddr) || LOOPBACK(saddr) ||
@@ -1656,15 +1663,18 @@ static int ip_route_input_mc(struct sk_b
RT_CACHE_STAT_INC(in_slow_mc);

in_dev_put(in_dev);
+ rcu_read_unlock();
hash = rt_hash_code(daddr, saddr ^ (dev->ifindex << 5), tos);
return rt_intern_hash(hash, rth, (struct rtable**) &skb->dst);

e_nobufs:
in_dev_put(in_dev);
+ rcu_read_unlock();
return -ENOBUFS;

e_inval:
in_dev_put(in_dev);
+ rcu_read_unlock();
return -EINVAL;
}

@@ -1714,6 +1724,7 @@ static inline int __mkroute_input(struct
u32 spec_dst, itag;

/* get a working reference to the output device */
+ rcu_read_lock()
out_dev = in_dev_get(FIB_RES_DEV(*res));
if (out_dev == NULL) {
if (net_ratelimit())
@@ -1796,6 +1807,7 @@ static inline int __mkroute_input(struct
cleanup:
/* release the working reference to the output device */
in_dev_put(out_dev);
+ rcu_read_unlock();
return err;
}

@@ -1899,7 +1911,7 @@ static int ip_route_input_slow(struct sk
u8 tos, struct net_device *dev)
{
struct fib_result res;
- struct in_device *in_dev = in_dev_get(dev);
+ struct in_device *in_dev;
struct flowi fl = { .nl_u = { .ip4_u =
{ .daddr = daddr,
.saddr = saddr,
@@ -1919,8 +1931,8 @@ static int ip_route_input_slow(struct sk
int free_res = 0;

/* IP on this device is disabled. */
-
- if (!in_dev)
+ rcu_read_lock();
+ if ((in_dev = in_dev_get(dev)) == NULL)
goto out;

/* Check for the most weird martians, which can be not detected
@@ -1983,6 +1995,7 @@ static int ip_route_input_slow(struct sk

done:
in_dev_put(in_dev);
+ rcu_read_unlock();
if (free_res)
fib_res_put(&res);
out: return err;
@@ -2174,6 +2187,7 @@ static inline int __mkroute_output(struc
flags |= RTCF_LOCAL;

/* get work reference to inet device */
+ rcu_read_lock();
in_dev = in_dev_get(dev_out);
if (!in_dev)
return -EINVAL;
@@ -2270,6 +2284,7 @@ static inline int __mkroute_output(struc
*result = rth;
cleanup:
/* release work reference to inet device */
+ rcu_read_unlock();
in_dev_put(in_dev);

return err;


2005-09-27 20:56:34

by David Miller

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

From: Suzanne Wood <[email protected]>
Date: Thu, 8 Sep 2005 10:12:52 -0700 (PDT)

> Please consider this request for suggestions on an attempt at a partial patch
> based on the assumptions below to identify rcu read-side critical sections
> for in_dev_get() defined in inetdevice.h. Thank you.

Thanks a lot for your patch, and patience in waiting for it
to get reviewed :-)

I agree with the changes to add rcu_dereference() use.
Those were definitely lacking and needed.

This following case is clever and correct, though. It is from
the net/ipv4/devinet.c part of your patch:

@@ -409,7 +412,8 @@ static int inet_rtm_deladdr(struct sk_bu

if ((in_dev = inetdev_by_index(ifm->ifa_index)) == NULL)
goto out;
- __in_dev_put(in_dev);
+ in_dev_put(in_dev);
+ rcu_read_unlock();

for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
ifap = &ifa->ifa_next) {

Everyone gets fooled by a certain invariant in the Linux networking
locking. If the RTNL semaphore is held, _all_ device and address
configuration changes are blocked. IP addresses cannot be removed,
devices cannot be brought up or down, routes cannot be added or
deleted, etc. The RTNL semaphore serializes all of these operations.
And it is held during inet_rtm_deladdr() here.

So we _know_ that if inetdev_by_index() returns non-NULL someone
else (the device itself) holds at least _one_ reference to that
object which cannot go away, because all such actions would need
to take the RTNL semaphore which we hold.

So __in_dev_put() is safe here.

Arguably, it's being overly clever for questionable gain.
It definitely deserves a comment, how about that? :-)

Finally, about adding rcu_read_{lock,unlock}() around even
in_dev_{get,put}(). I bet that really isn't needed but I cannot
articulate why we can get away without it. For example, if we
are given a pair executed in a function like:

in_dev_get();

...

in_dev_put();

who cares if we preempt? The local function's execution holds the
necessary reference, so the object's refcount cannot ever fall to
zero.

We can't get any RCU callbacks invoked, as a result, so we don't
need the rcu_read_{lock,unlock}() calls here.

The in_dev_put() uses atomic_dec_and_test(), which provides a memory
barrier, so no out-of-order cpu memory references to the object
can escape past the decrement to zero of the object reference count.

In short, I think adding rcu_read_{lock,unlock}() is very heavy
handed and unnecessary.

2005-09-28 00:22:42

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

Many thanks for all you've provided here.

> From [email protected] Tue Sep 27 14:36:32 2005

> I agree with the changes to add rcu_dereference() use.
> Those were definitely lacking and needed.

> This following case is clever and correct, though. It is from
> the net/ipv4/devinet.c part of your patch:

> @@ -409,7 +412,8 @@ static int inet_rtm_deladdr(struct sk_bu
>
> if ((in_dev = inetdev_by_index(ifm->ifa_index)) == NULL)
> goto out;
> - __in_dev_put(in_dev);
> + in_dev_put(in_dev);
> + rcu_read_unlock();
>
> for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
> ifap = &ifa->ifa_next) {

> Everyone gets fooled by a certain invariant in the Linux networking
> locking. If the RTNL semaphore is held, _all_ device and address
> configuration changes are blocked. IP addresses cannot be removed,
> devices cannot be brought up or down, routes cannot be added or
> deleted, etc. The RTNL semaphore serializes all of these operations.
> And it is held during inet_rtm_deladdr() here.

> So we _know_ that if inetdev_by_index() returns non-NULL someone
> else (the device itself) holds at least _one_ reference to that
> object which cannot go away, because all such actions would need
> to take the RTNL semaphore which we hold.

> So __in_dev_put() is safe here.

In this case, you want the refcnt decremented without the unnecessary
test that in_dev_put() would incur. I was concerned about the
pairings of __in_dev_get which doesn't increment refcnt with
__in_dev_put which decrements. Didn't mean to address that til
after some feedback, but thank you for clarifying my error here
since I can't trace any pairing with the use of __in_dev_put
in inet_rtm_deladdr.

> Arguably, it's being overly clever for questionable gain.
> It definitely deserves a comment, how about that? :-)

> Finally, about adding rcu_read_{lock,unlock}() around even
> in_dev_{get,put}(). I bet that really isn't needed but I cannot
> articulate why we can get away without it. For example, if we
> are given a pair executed in a function like:

> in_dev_get();

> ...

> in_dev_put();

> who cares if we preempt? The local function's execution holds the
> necessary reference, so the object's refcount cannot ever fall to
> zero.

> We can't get any RCU callbacks invoked, as a result, so we don't
> need the rcu_read_{lock,unlock}() calls here.

> The in_dev_put() uses atomic_dec_and_test(), which provides a memory
> barrier, so no out-of-order cpu memory references to the object
> can escape past the decrement to zero of the object reference count.

> In short, I think adding rcu_read_{lock,unlock}() is very heavy
> handed and unnecessary.

In Paul McKenney's reference at
http://www.rdrop.com/users/paulmck/RCU/whatisRCU.html
"Reference counts may be used in conjunction with RCU to maintain
longer-term references to data structures." So you're right. I
was basing those rcu_read_lock extents on the idea that the calling
function has the vision of the need for protection of an
rcu_dereference'd pointer. Paul has also provided further insight
into discriminating between read-side and update-side uses of
rcu_dereference which I need to incorporate.

Many thanks again and I'll try for a better submission.

2005-09-28 02:56:09

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

David S. Miller <[email protected]> wrote:
>
> I agree with the changes to add rcu_dereference() use.
> Those were definitely lacking and needed.

Actually I'm not so sure that they are all needed. I only looked
at the very first one in the patch which is in in_dev_get(). That
one certainly isn't necessary because the old value of ip_ptr
is valid as long as the reference count does not hit zero.

The later is guaranteed by the increment in in_dev_get().

Because the pervasiveness of reference counting in the network stack,
I believe that we should scrutinise the other bits in the patch too
to make sure that they are all needed.

In general, using rcu_dereference/rcu_assign_pointer does not
guarantee correct code. We really need to look at each case
individually.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-28 14:50:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Wed, Sep 28, 2005 at 12:55:45PM +1000, Herbert Xu wrote:
> David S. Miller <[email protected]> wrote:
> >
> > I agree with the changes to add rcu_dereference() use.
> > Those were definitely lacking and needed.
>
> Actually I'm not so sure that they are all needed. I only looked
> at the very first one in the patch which is in in_dev_get(). That
> one certainly isn't necessary because the old value of ip_ptr
> is valid as long as the reference count does not hit zero.
>
> The later is guaranteed by the increment in in_dev_get().
>
> Because the pervasiveness of reference counting in the network stack,
> I believe that we should scrutinise the other bits in the patch too
> to make sure that they are all needed.
>
> In general, using rcu_dereference/rcu_assign_pointer does not
> guarantee correct code. We really need to look at each case
> individually.

Yep, these two APIs are only part of the solution.

The reference-count approach is only guaranteed to work if the kernel
thread that did the reference-count increment is later referencing that
same data element. Otherwise, one has the following possible situation
on DEC Alpha:

o CPU 0 initializes and inserts a new element into the data
structure, using rcu_assign_pointer() to provide any needed
memory barriers. (Or, if RCU is not being used, under the
appropriate update-side lock.)

o CPU 1 acquires a reference to this new element, presumably
using either a lock or rcu_read_lock() and rcu_dereference()
in order to do so safely. CPU 1 then increments the reference
count.

o CPU 2 picks up a pointer to this new element, but in a way
that relies on the reference count having been incremented,
without using locking, rcu_read_lock(), rcu_dereference(),
and so on.

This CPU can then see the pre-initialized contents of the
newly inserted data structure (again, but only on DEC Alpha).

Again, if the same kernel thread that incremented the reference count
is later accessing it, no problem, even on Alpha.

Thanx, Paul

2005-09-28 22:11:54

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Wed, Sep 28, 2005 at 07:51:10AM -0700, Paul E. McKenney wrote:
>
> The reference-count approach is only guaranteed to work if the kernel
> thread that did the reference-count increment is later referencing that
> same data element. Otherwise, one has the following possible situation
> on DEC Alpha:

You're quite right. Without the rcu_dereference users of in_dev_get()
may see pre-initialisation contents of in_dev.

So these barriers are definitely needed.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-29 16:03:24

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

The original motivation for this patch was in __in_dev_get
usage. I'll try to test a build, but should submittals be
incremental? first addressing in_dev_get, then
__in_dev_get? What seems resolved so far follows.

The exchange below suggests that it is equally important
to have the rcu_dereference() in __in_dev_get(), so the
idea of the only difference between in_dev_get and
__in_dev_get being the refcnt may be accepted.

Correct usage may be a question with the mismatched
definitions (in terms of refcnt) of __in_dev_get()
and __in_dev_put() that superficially appear
paired and this may merit a comment. If interested,
examples are mentioned in
http://www.uwsg.iu.edu/hypermail/linux/kernel/0509.1/0184.html
and
http://www.ussg.iu.edu/hypermail/linux/kernel/0509.3/0757.html

But when the refcnt is employed for the DEC Alpha,
rcu-protection or other locking must be in place for
multiple CPUs, which apparently affirms the value
of the marking of an rcu read-side critical section
done by the calling function which has the vision of the
extent of use of the protected dereference.

Is this all reasonable to you?
Thank you very much.

----- Original Message -----
From: Paul E. McKenney
Sent: Wednesday, September 28, 2005 7:51 AM


> On Wed, Sep 28, 2005 at 12:55:45PM +1000, Herbert Xu wrote:
>> David S. Miller wrote:
>> >
>> > I agree with the changes to add rcu_dereference() use.
>> > Those were definitely lacking and needed.
>>
>> Actually I'm not so sure that they are all needed. I only looked
>> at the > guarantee correct code. We really need to look at each case
>> individually.
>
> Yep, these two APIs are only part of the solution.
>
> The reference-count approach is only guaranteed to work if the kernel
> thread that did the reference-count increment is later referencing that
> same data element. Otherwise, one has the following possible situation
> on DEC Alpha:
>
> o CPU 0 initializes and inserts a new element into the data
> structure, using rcu_assign_pointer() to provide any needed
> memory barriers. (Or, if RCU is not being used, under the
> appropriate update-side lock.)
>
> o CPU 1 acquires a reference to this new element, presumably
> using either a lock or rcu_read_lock() and rcu_dereference()
> in order to do so safely. CPU 1 then increments the reference
> count.
>
> o CPU 2 picks up a pointer to this new element, but in a way
> that relies on the reference count having been incremented,
> without using locking, rcu_read_lock(), rcu_dereference(),
> and so on.
>
> This CPU can then see the pre-initialized contents of the
> newly inserted data structure (again, but only on DEC Alpha).
>
> Again, if the same kernel thread that incremented the reference count
> is later accessing it, no problem, even on Alpha.
>
> Thanx, Paul
>

2005-09-29 21:29:29

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 09:02:29AM -0700, Suzanne Wood wrote:
>
> The exchange below suggests that it is equally important
> to have the rcu_dereference() in __in_dev_get(), so the
> idea of the only difference between in_dev_get and
> __in_dev_get being the refcnt may be accepted.

With __in_dev_get() it's the caller's responsibility to ensure
that RCU works correctly. Therefore if any rcu_dereference is
needed it should be done by the caller.

Some callers of __in_dev_get() don't need rcu_dereference at all
because they're protected by the rtnl.

BTW, could you please move the rcu_dereference in in_dev_get()
into the if clause? The barrier is not needed when ip_ptr is
NULL.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-29 23:31:06

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

> Date: Fri, 30 Sep 2005 07:28:36 +1000
> From: Herbert Xu <[email protected]>

> On Thu, Sep 29, 2005 at 09:02:29AM -0700, Suzanne Wood wrote:
> >
> > The exchange below suggests that it is equally important
> > to have the rcu_dereference() in __in_dev_get(), so the
> > idea of the only difference between in_dev_get and
> > __in_dev_get being the refcnt may be accepted.

> With __in_dev_get() it's the caller's responsibility to ensure
> that RCU works correctly. Therefore if any rcu_dereference is
> needed it should be done by the caller.

This sounds reasonable to me. Does everyone agree?

> Some callers of __in_dev_get() don't need rcu_dereference at all
> because they're protected by the rtnl.

> BTW, could you please move the rcu_dereference in in_dev_get()
> into the if clause? The barrier is not needed when ip_ptr is
> NULL.

The trouble with that may be that there are three events, the
dereference, the assignment, and the conditional test. The
rcu_dereference() is meant to assure deferred destruction
throughout.

Thank you very much for your suggestions.

2005-09-29 23:40:30

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

> From suzannew Thu Sep 29 16:30:28 2005

> > From: Herbert Xu 30 Sep 2005 07:28

> > BTW, could you please move the rcu_dereference in in_dev_get()
> > into the if clause? The barrier is not needed when ip_ptr is
> > NULL.

> The trouble with that may be that there are three events, the
> dereference, the assignment, and the conditional test. The
> rcu_dereference() is meant to assure deferred destruction
> throughout.

Sorry, I was thinking in terms of the rcu_read_lock, so this is misstated.

2005-09-30 00:00:37

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

Sorry to be thinking on-line, but if you mean this:

if (in_dev = rcu_dereference(dev->ip_ptr))

I think that's fine.

> From suzannew Thu Sep 29 16:39:57 2005

> > From suzannew Thu Sep 29 16:30:28 2005

> > > From: Herbert Xu 30 Sep 2005 07:28

> > > BTW, could you please move the rcu_dereference in in_dev_get()
> > > into the if clause? The barrier is not needed when ip_ptr is
> > > NULL.

> > The trouble with that may be that there are three events, the
> > dereference, the assignment, and the conditional test. The
> > rcu_dereference() is meant to assure deferred destruction
> > throughout.

2005-09-30 00:24:09

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 04:59:56PM -0700, Suzanne Wood wrote:
> Sorry to be thinking on-line, but if you mean this:
>
> if (in_dev = rcu_dereference(dev->ip_ptr))
>
> I think that's fine.

Close. What I had in mind is

rcu_read_lock();
in_dev = dev->ip_ptr;
if (in_dev) {
in_dev = rcu_dereference(in_dev);
atomic_inc(&in_dev->refcnt);
}
rcu_read_unlock();
return in_dev;

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-30 00:27:43

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 05:23:46PM -0700, Paul E. McKenney wrote:
>
> Is there any case where __in_dev_get() might be called without
> needing to be wrapped with rcu_dereference()? If so, then I
> agree (FWIW, given my meagre knowledge of Linux networking).

Yes. All paths that call __in_dev_get() under the rtnl do not
need rcu_dereference (or any RCU at all) since the rtnl prevents
any ip_ptr modification from occuring.

> However, rcu_dereference() only generates a memory barrier on DEC
> Alpha, so there is normally no penalty for using it in the NULL-pointer
> case. So, when using rcu_dereference() unconditionally simplifies
> the code, it may make sense to "just do it".

Here is what the code would look like:

rcu_read_lock();
in_dev = dev->ip_ptr;
if (in_dev) {
in_dev = rcu_dereference(in_dev);
atomic_inc(&in_dev->refcnt);
}
rcu_read_unlock();
return in_dev;

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-30 00:22:19

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 04:30:28PM -0700, Suzanne Wood wrote:
>
> > BTW, could you please move the rcu_dereference in in_dev_get()
> > into the if clause? The barrier is not needed when ip_ptr is
> > NULL.
>
> The trouble with that may be that there are three events, the
> dereference, the assignment, and the conditional test. The
> rcu_dereference() is meant to assure deferred destruction
> throughout.

The deferred destruction is guaranteed here by the reference count.
The only purpose served by rcu_dereference() in in_dev_get() is to
prevent the user from seeing pre-initialisation data.

When the pointer is NULL, you can't see any data at all, let alone
pre-initialisation data.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-30 00:23:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 04:30:28PM -0700, Suzanne Wood wrote:
> > Date: Fri, 30 Sep 2005 07:28:36 +1000
> > From: Herbert Xu <[email protected]>
>
> > On Thu, Sep 29, 2005 at 09:02:29AM -0700, Suzanne Wood wrote:
> > >
> > > The exchange below suggests that it is equally important
> > > to have the rcu_dereference() in __in_dev_get(), so the
> > > idea of the only difference between in_dev_get and
> > > __in_dev_get being the refcnt may be accepted.
>
> > With __in_dev_get() it's the caller's responsibility to ensure
> > that RCU works correctly. Therefore if any rcu_dereference is
> > needed it should be done by the caller.
>
> This sounds reasonable to me. Does everyone agree?

Is there any case where __in_dev_get() might be called without
needing to be wrapped with rcu_dereference()? If so, then I
agree (FWIW, given my meagre knowledge of Linux networking).

If all __in_dev_get() invocations need to be wrapped in
rcu_dereference(), then it seems to me that there would be
motivation to bury rcu_dereference() in __in_dev_get().

> > Some callers of __in_dev_get() don't need rcu_dereference at all
> > because they're protected by the rtnl.
>
> > BTW, could you please move the rcu_dereference in in_dev_get()
> > into the if clause? The barrier is not needed when ip_ptr is
> > NULL.
>
> The trouble with that may be that there are three events, the
> dereference, the assignment, and the conditional test. The
> rcu_dereference() is meant to assure deferred destruction
> throughout.

One only needs an rcu_dereference() once on the data-flow path from
fetching the RCU-protected pointer to dereferencing that pointer.
If the pointer is NULL, there is no way you can dereference it,
so, technically, Herbert is quite correct.

However, rcu_dereference() only generates a memory barrier on DEC
Alpha, so there is normally no penalty for using it in the NULL-pointer
case. So, when using rcu_dereference() unconditionally simplifies
the code, it may make sense to "just do it".

Thanx, Paul

2005-09-30 00:36:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Fri, Sep 30, 2005 at 10:27:19AM +1000, Herbert Xu wrote:
> On Thu, Sep 29, 2005 at 05:23:46PM -0700, Paul E. McKenney wrote:
> >
> > Is there any case where __in_dev_get() might be called without
> > needing to be wrapped with rcu_dereference()? If so, then I
> > agree (FWIW, given my meagre knowledge of Linux networking).
>
> Yes. All paths that call __in_dev_get() under the rtnl do not
> need rcu_dereference (or any RCU at all) since the rtnl prevents
> any ip_ptr modification from occuring.
>
> > However, rcu_dereference() only generates a memory barrier on DEC
> > Alpha, so there is normally no penalty for using it in the NULL-pointer
> > case. So, when using rcu_dereference() unconditionally simplifies
> > the code, it may make sense to "just do it".
>
> Here is what the code would look like:
>
> rcu_read_lock();
> in_dev = dev->ip_ptr;
> if (in_dev) {
> in_dev = rcu_dereference(in_dev);
> atomic_inc(&in_dev->refcnt);
> }
> rcu_read_unlock();
> return in_dev;

How about:

rcu_read_lock();
in_dev = dev->ip_ptr;
if (rcu_dereference(in_dev)) {
atomic_inc(&in_dev->refcnt);
}
rcu_read_unlock();
return in_dev;

Admittedly only saves one line, but...

Thanx, Paul

2005-09-30 01:04:39

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 05:36:42PM -0700, Paul E. McKenney wrote:
>
> > rcu_read_lock();
> > in_dev = dev->ip_ptr;
> > if (in_dev) {
> > in_dev = rcu_dereference(in_dev);
> > atomic_inc(&in_dev->refcnt);
> > }
> > rcu_read_unlock();
> > return in_dev;
>
> How about:
>
> rcu_read_lock();
> in_dev = dev->ip_ptr;
> if (rcu_dereference(in_dev)) {
> atomic_inc(&in_dev->refcnt);
> }
> rcu_read_unlock();
> return in_dev;

With this the barrier will taken even when in_dev is NULL.

I agree this isn't such a big deal since it only impacts Alpha and then
only when in_dev is NULL. But as we already do the branch anyway to
increment the reference count, we might as well make things a little
better for Alpha.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-09-30 01:07:19

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

In reviewing the 44 kernel uses of __in_dev_get and seeing many
cases in 13 of 20 C code files of insertions of rcu_read_lock with
and without the rcu_dereference that is indicated, so it does appear
often to be programmer intent.

Of the programs using __in_dev_get that don't include rcu, devinet.c
and igmp.c use an rtnl lock. Five other programs that use __in_dev_get
without rcu have rtnl locking in the program source code, but I need
to actually look further into the call tree to say more.

Are there three cases then? RCU protection with refcnt, RCU without refcnt,
and the bare cast of the dereference?

Thank you very much for getting it back on track.

> From [email protected] Thu Sep 29 17:23:18 2005

> Is there any case where __in_dev_get() might be called without
> needing to be wrapped with rcu_dereference()? If so, then I
> agree (FWIW, given my meagre knowledge of Linux networking).

> If all __in_dev_get() invocations need to be wrapped in
> rcu_dereference(), then it seems to me that there would be
> motivation to bury rcu_dereference() in __in_dev_get().

> > > Some callers of __in_dev_get() don't need rcu_dereference at all
> > > because they're protected by the rtnl.
> >
> > > BTW, could you please move the rcu_dereference in in_dev_get()
> > > into the if clause? The barrier is not needed when ip_ptr is
> > > NULL.
> >
> > The trouble with that may be that there are three events, the
> > dereference, the assignment, and the conditional test. The
> > rcu_dereference() is meant to assure deferred destruction
> > throughout.

> One only needs an rcu_dereference() once on the data-flow path from
> fetching the RCU-protected pointer to dereferencing that pointer.
> If the pointer is NULL, there is no way you can dereference it,
> so, technically, Herbert is quite correct.

> However, rcu_dereference() only generates a memory barrier on DEC
> Alpha, so there is normally no penalty for using it in the NULL-pointer
> case. So, when using rcu_dereference() unconditionally simplifies
> the code, it may make sense to "just do it".

> Thanx, Paul

2005-09-30 01:15:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Fri, Sep 30, 2005 at 11:04:04AM +1000, Herbert Xu wrote:
> On Thu, Sep 29, 2005 at 05:36:42PM -0700, Paul E. McKenney wrote:
> >
> > > rcu_read_lock();
> > > in_dev = dev->ip_ptr;
> > > if (in_dev) {
> > > in_dev = rcu_dereference(in_dev);
> > > atomic_inc(&in_dev->refcnt);
> > > }
> > > rcu_read_unlock();
> > > return in_dev;
> >
> > How about:
> >
> > rcu_read_lock();
> > in_dev = dev->ip_ptr;
> > if (rcu_dereference(in_dev)) {
> > atomic_inc(&in_dev->refcnt);
> > }
> > rcu_read_unlock();
> > return in_dev;
>
> With this the barrier will taken even when in_dev is NULL.
>
> I agree this isn't such a big deal since it only impacts Alpha and then
> only when in_dev is NULL. But as we already do the branch anyway to
> increment the reference count, we might as well make things a little
> better for Alpha.

OK, how about this instead?

rcu_read_lock();
in_dev = dev->ip_ptr;
if (in_dev) {
atomic_inc(&rcu_dereference(in_dev)->refcnt);
}
rcu_read_unlock();
return in_dev;

Thanx, Paul

2005-09-30 01:19:51

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 06:16:03PM -0700, Paul E. McKenney wrote:
>
> OK, how about this instead?
>
> rcu_read_lock();
> in_dev = dev->ip_ptr;
> if (in_dev) {
> atomic_inc(&rcu_dereference(in_dev)->refcnt);
> }
> rcu_read_unlock();
> return in_dev;

Looks great. Thanks Paul.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-10-01 01:14:16

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Thu, Sep 29, 2005 at 06:06:50PM -0700, Suzanne Wood wrote:
>
> Are there three cases then? RCU protection with refcnt, RCU without refcnt,
> and the bare cast of the dereference?

Correct.

The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
introduces __in_dev_get_rcu() to cover the second case.

1) RCU with refcnt should use in_dev_get().
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().

There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition. I've marked it as such so that we remember to fix it.

This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.

Signed-off-by: Herbert Xu <[email protected]>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Attachments:
(No filename) (987.00 B)
p (16.64 kB)
Download all attachments

2005-10-01 06:57:08

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

Many thanks for addressing this issue.

----- Original Message -----
> From: Herbert Xu <[email protected]>
> Date: Sat, 1 Oct 2005 11:13:12 +1000
>
> On Thu, Sep 29, 2005 at 06:06:50PM -0700, Suzanne Wood wrote:
>>
>> Are there three cases then? RCU protection with refcnt, RCU without refcnt,
>> and the bare cast of the dereference?
>
> Correct.
>
> The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
> introduces __in_dev_get_rcu() to cover the second case.
>
> 1) RCU with refcnt should use in_dev_get().
> 2) RCU without refcnt should use __in_dev_get_rcu().
> 3) All others must hold RTNL and use __in_dev_get_rtnl().
>

The naming to indicate purpose looks good and the leading underscores
in the prior work did seem to imply wrapping to add functionality.

But it is interesting to have discarded what was developed yesterday
to minimize rcu_dereference impact:

>> ----- Original message -----
>> From: Herbert Xu <[email protected]>
>> Date: Fri, 30 Sep 2005 11:19:07 +1000
>>
>> On Thu, Sep 29, 2005 at 06:16:03PM -0700, Paul E. McKenney wrote:
>>>
>>> OK, how about this instead?
>>>
>>> rcu_read_lock();
>>> in_dev = dev->ip_ptr;
>>> if (in_dev) {
>>> atomic_inc(&rcu_dereference(in_dev)->refcnt);
>>> }
>>> rcu_read_unlock();
>>> return in_dev;
>>
>> Looks great.

while adding a function call level by wrapping __in_dev_get_rcu
with in_dev_get as suggested here.

> --
> diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
> --- a/include/linux/inetdevice.h
> +++ b/include/linux/inetdevice.h
> @@ -142,13 +142,21 @@ static __inline__ int bad_mask(u32 mask,
>
> #define endfor_ifa(in_dev) }
>
> +static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
> +{
> + struct in_device *in_dev = dev->ip_ptr;
> + if (in_dev)
> + in_dev = rcu_dereference(in_dev);
> + return in_dev;
> +}
> +
> static __inline__ struct in_device *
> in_dev_get(const struct net_device *dev)
> {
> struct in_device *in_dev;
>
> rcu_read_lock();
> - in_dev = dev->ip_ptr;
> + in_dev = __in_dev_get_rcu(dev);
> if (in_dev)
> atomic_inc(&in_dev->refcnt);
> rcu_read_unlock();
> @@ -156,7 +164,7 @@ in_dev_get(const struct net_device *dev)
> }
>
> static __inline__ struct in_device *
> -__in_dev_get(const struct net_device *dev)
> +__in_dev_get_rtnl(const struct net_device *dev)
> {
> return (struct in_device*)dev->ip_ptr;
> }

The other thing I'd hoped to address in pktgen.c was
removing the __in_dev_put() which decrements refcnt
while __in_dev_get_rcu() does not increment.

> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -1667,7 +1667,7 @@ static void pktgen_setup_inject(struct p
> struct in_device *in_dev;
>
> rcu_read_lock();
> - in_dev = __in_dev_get(pkt_dev->odev);
> + in_dev = __in_dev_get_rcu(pkt_dev->odev);
> if (in_dev) {
> if (in_dev->ifa_list) {
> pkt_dev->saddr_min = in_dev->ifa_list->ifa_address;

pkt_dev->saddr_max = pkt_dev->saddr_min;
}
- __in_dev_put(in_dev);
}
rcu_read_unlock();

Thank you very much again for resolving this by clarifying both the
RCU and RTNL usages.

2005-10-01 07:13:15

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Fri, Sep 30, 2005 at 11:56:41PM -0700, Suzanne Wood wrote:
>
> But it is interesting to have discarded what was developed yesterday
> to minimize rcu_dereference impact:
>
> >> ----- Original message -----
> >> From: Herbert Xu <[email protected]>
> >> Date: Fri, 30 Sep 2005 11:19:07 +1000
> >>
> >> On Thu, Sep 29, 2005 at 06:16:03PM -0700, Paul E. McKenney wrote:
> >>>
> >>> OK, how about this instead?
> >>>
> >>> rcu_read_lock();
> >>> in_dev = dev->ip_ptr;
> >>> if (in_dev) {
> >>> atomic_inc(&rcu_dereference(in_dev)->refcnt);
> >>> }
> >>> rcu_read_unlock();
> >>> return in_dev;
> >>
> >> Looks great.
>
> while adding a function call level by wrapping __in_dev_get_rcu
> with in_dev_get as suggested here.

It might look different, but it should compile to the same result.
GCC should be smart enough to combine the two branches and produce a
memory barrier only when in_dev is not NULL.

> The other thing I'd hoped to address in pktgen.c was
> removing the __in_dev_put() which decrements refcnt
> while __in_dev_get_rcu() does not increment.

Well spotted.

Here is a patch on top of the last one to fix this bogus decrement.

Signed-off-by: Herbert Xu <[email protected]>

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Attachments:
(No filename) (1.41 kB)
p (336.00 B)
Download all attachments

2005-10-01 18:01:05

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

With the spotlight leaving in_dev_get, we have the parallel question
of in_dev_put() and __in_dev_put() both defined with refcnt
decrement, but the preceding underscore may lend itself to an
inadvertant pairing and refcnt inaccuracy.

In the following cscope of linux-2.6.14-rc2 prior to Herbert's patches
which remove the occurence in pktgen.c and the others are
paired with __in_dev_get_rtnl(), except xfrm4_policy.c():

C symbol: __in_dev_put
File Function Line
File Function Line
0 inetdevice.h <global> 172 #define __in_dev_put(idev) atomic_dec(&(idev)->refcnt)
1 pktgen.c pktgen_setup_inject 1687 __in_dev_put(in_dev);
2 devinet.c inet_rtm_deladdr 412 __in_dev_put(in_dev);
3 igmp.c igmp_gq_timer_expire 686 __in_dev_put(in_dev);
4 igmp.c igmp_ifc_timer_expire 698 __in_dev_put(in_dev);
5 igmp.c igmp_heard_query 801 __in_dev_put(in_dev);
6 igmp.c ip_mc_down 1228 __in_dev_put(in_dev);
7 igmp.c ip_mc_down 1231 __in_dev_put(in_dev);
8 igmp.c ip_mc_find_dev 1310 __in_dev_put(idev);
9 xfrm4_policy.c xfrm4_dst_ifdown 280 __in_dev_put(loopback_idev);

It may not be reasonable to rename __in_dev_put for its parallel definition
since its current usage is with __in_dev_get_rtnl() which does not increment
refcnt.

It is also probably good to retain the old __in_dev_get()
and __in_dev_put() and deprecate them.

Old business: this is probably stating the obvious,
with Herbert's patches in place, none of the test patch
I'd submitted 8 Sept should take effect.

Thank you.
Suzanne

----- Original Message -----
From: "Herbert Xu" <[email protected]>
Sent: Saturday, October 01, 2005 12:12 AM

> On Fri, Sep 30, 2005 at 11:56:41PM -0700, Suzanne Wood wrote:
>>
>> The other thing I'd hoped to address in pktgen.c was
>> removing the __in_dev_put() which decrements refcnt
>> while __in_dev_get_rcu() does not increment.
>
> Well spotted.
>
> Here is a patch on top of the last one to fix this bogus decrement.
>
> Signed-off-by: Herbert Xu <[email protected]>

2005-10-01 18:03:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Sat, Oct 01, 2005 at 05:12:48PM +1000, Herbert Xu wrote:
> On Fri, Sep 30, 2005 at 11:56:41PM -0700, Suzanne Wood wrote:
> >
> > But it is interesting to have discarded what was developed yesterday
> > to minimize rcu_dereference impact:
> >
> > >> ----- Original message -----
> > >> From: Herbert Xu <[email protected]>
> > >> Date: Fri, 30 Sep 2005 11:19:07 +1000
> > >>
> > >> On Thu, Sep 29, 2005 at 06:16:03PM -0700, Paul E. McKenney wrote:
> > >>>
> > >>> OK, how about this instead?
> > >>>
> > >>> rcu_read_lock();
> > >>> in_dev = dev->ip_ptr;
> > >>> if (in_dev) {
> > >>> atomic_inc(&rcu_dereference(in_dev)->refcnt);
> > >>> }
> > >>> rcu_read_unlock();
> > >>> return in_dev;
> > >>
> > >> Looks great.
> >
> > while adding a function call level by wrapping __in_dev_get_rcu
> > with in_dev_get as suggested here.
>
> It might look different, but it should compile to the same result.
> GCC should be smart enough to combine the two branches and produce a
> memory barrier only when in_dev is not NULL.
>
> > The other thing I'd hoped to address in pktgen.c was
> > removing the __in_dev_put() which decrements refcnt
> > while __in_dev_get_rcu() does not increment.
>
> Well spotted.
>
> Here is a patch on top of the last one to fix this bogus decrement.

Both this and Herbert's prior patch look good to me!

Thanx, Paul

> Signed-off-by: Herbert Xu <[email protected]>
>
> Thanks,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -1673,7 +1673,6 @@ static void pktgen_setup_inject(struct p
> pkt_dev->saddr_min = in_dev->ifa_list->ifa_address;
> pkt_dev->saddr_max = pkt_dev->saddr_min;
> }
> - __in_dev_put(in_dev);
> }
> rcu_read_unlock();
> }

2005-10-01 18:37:42

by Suzanne Wood

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

Please excuse this restatement of an earlier concern.

> From suzannew Sat Oct 1 11:00:28 2005

> With the spotlight leaving in_dev_get, we have the parallel question
> of in_dev_put() and __in_dev_put() both defined with refcnt
> decrement, but the preceding underscore may lend itself to an
> inadvertant pairing and refcnt inaccuracy.

Dave Miller already addressed this question in
http://www.ussg.iu.edu/hypermail/linux/kernel/0509.3/0757.html

> It may not be reasonable to rename __in_dev_put for its parallel definition
> since its current usage is with __in_dev_get_rtnl() which does not increment
> refcnt.

But the following may be worth considering.

> It is also probably good to retain the old __in_dev_get()
and deprecate it.

Thank you.

2005-10-01 19:29:59

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC][PATCH] identify in_dev_get rcu read-side critical sections

On Sat, Oct 01, 2005 at 11:37:14AM -0700, Suzanne Wood wrote:
>
> But the following may be worth considering.
>
> > It is also probably good to retain the old __in_dev_get()
> and deprecate it.

I think it's better to get rid of it actually so that if there are
external modules using this they can make sure that they're using
the right function.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt