2003-02-19 17:38:01

by Simon Kirby

Subject: Longstanding networking / SMP issue? (duplextest)

Hi all,

About a year ago I wrote a tool to test for Ethernet duplex mismatches,
based on a theory that "ping -f" does not work very well for this because
it is mostly synchronous in nature. My idea was to send a small number of
packets all at once, and then wait for all of the responses. This would
increase the chances that the link or links between the sender and
receiver were transmitting and receiving at the same time, exploiting the
duplex mismatch and causing packet loss.

The program, "duplextest", can be found here:

http://blue.netnation.com/sim/projects/duplextest/

Over the year, this program has been very useful in finding mismatches.
However, I have noticed occasionally that it appears to be showing packet
loss in cases where there should be no packet loss.

Investigating this issue, I found that even a "tcpdump -n icmp" running
on the host being duplextested reports that the host itself is not
replying to all of the ICMP packets: the tcpdump shows more requests
coming in than responses going out, exactly matching what duplextest sees
remotely.

Baffled by this, I did a bunch of testing and tried to narrow down which
servers were experiencing this problem. A sweep of some of our server
blocks showed that only SMP boxes were having the problem; however, not
all SMP boxes were affected.

I tried to narrow down whether specific SMP CPUs were responsible for the
problem, and found that it tended to happen on some ASUS P2B-DS boards,
but then also found it happening on Intel Xeon boards, Tyan boards, etc.,
with PIIs, PIIIs, and Xeon CPUs. We have a few SMP Athlons, but I
haven't seen any problems with them.

After some playing around, I decided to try binding the receiving NIC's
IRQ to one CPU with /proc/irq/x/smp_affinity. After I did this, the
packet loss vanished! I bound the NIC's IRQ to the other CPU, and
packet loss was still gone. As soon as I echoed "3" (both CPUs) back
into smp_affinity, the loss returned.

All of the problems I've seen are on servers with Intel EtherExpress Pro
100+ cards, but most of our servers have these cards, so it's hard
to tell if this is a factor. I see exactly the same problem with both the
eepro100 driver and the e100 driver.

Another strange thing about this problem is that it seems to come and go.
About a month ago, a development box was also having the problem, but now
it appears to have stopped. I assumed the kernel version was perhaps the
reason, but placing the exact kernel image from the working box onto a
box experiencing the problem did not cure it. 2.4.20 is running on most
of the boxes which are seeing the problem, but the problem has been
happening for at least a year (I noticed it when I first wrote
duplextest).

It is difficult to reproduce this problem, but I am releasing duplextest
here in hopes that others will see the same issue. It is easy to
distinguish a true duplex mismatch from this problem by running "tcpdump
-n icmp" on the remote box and seeing whether it does indeed show the box
responding to all requests (if so, the problem is likely duplex or at the
link layer).

The interfaces show no errors or overruns after the problem:

# ip -s -s link ls eth0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:e0:81:02:c0:cb brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped overrun mcast
    2624545296     179127520  0       0       0       0
    RX errors:     length     crc     frame   fifo    missed
                   0          0       0       0       0
    TX: bytes      packets    errors  dropped carrier collsns
    212425629      133630424  0       0       0       0
    TX errors:     aborted    fifo    window  heartbeat
                   0          0       0       0

Does anybody else see this problem, or have any ideas? The fact that
"tcpdump -n icmp" on the receiving box shows the box itself not
responding to all ICMP ECHO packets is strange. As far as I can tell
there is no rate limiting in effect, so I don't see any explanation for
this.

Thanks,

Simon-

[ Simon Kirby ][ Network Operations ]
[ [email protected] ][ NetNation Communications ]
[ Opinions expressed are not necessarily those of my employer. ]


2003-02-19 21:22:55

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Simon Kirby <[email protected]>
Date: Wed, 19 Feb 2003 09:47:57 -0800

eepro100 driver and the e100 driver.

Do you see it with other cards? That would be the interesting
clue :-)

2003-02-20 07:44:38

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: 20 Feb 2003 08:52:46 +0100

That's probably because of the lazy ICMP socket locking used for the
ICMP socket. When an ICMP transmit is already in progress, the next ICMP
triggered from a softirq (e.g. ECHO-REQUEST) is dropped
(see icmp_xmit_lock_bh() in net/ipv4/icmp.c).

Good spotting Andi.

2003-02-20 07:42:44

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

Simon Kirby <[email protected]> writes:
>
> Baffled by this, I did a bunch of testing and tried to narrow down which
> servers were experiencing this problem. A sweep of some of our server
> blocks showed that only SMP boxes were having the problem; however, not
> all SMP boxes were affected.

That's probably because of the lazy ICMP socket locking used for the
ICMP socket. When an ICMP transmit is already in progress, the next ICMP
triggered from a softirq (e.g. ECHO-REQUEST) is dropped
(see icmp_xmit_lock_bh() in net/ipv4/icmp.c).

There is also a similar problem with ICMPs getting passed to a TCP
socket: when the socket is already locked, the ICMP is dropped. You can
check for the latter by looking at the LockDroppedIcmps counter in
"netstat -s".

I guess it would be useful to cover the first drop case in that counter
too; try this untested patch and then check the counter in netstat -s.

--- linux-2.4.20/net/ipv4/icmp.c-o 2002-08-03 02:39:46.000000000 +0200
+++ linux-2.4.20/net/ipv4/icmp.c 2003-02-20 08:51:21.000000000 +0100
@@ -196,8 +196,10 @@
static int icmp_xmit_lock_bh(void)
{
if (!spin_trylock(&icmp_socket->sk->lock.slock)) {
- if (icmp_xmit_holder == smp_processor_id())
+ if (icmp_xmit_holder == smp_processor_id()) {
+ NET_INC_STATS_BH(LockDroppedIcmps);
return -EAGAIN;
+ }
spin_lock(&icmp_socket->sk->lock.slock);
}
icmp_xmit_holder = smp_processor_id();



-Andi

2003-02-20 09:10:41

by Simon Kirby

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Thu, Feb 20, 2003 at 08:52:46AM +0100, Andi Kleen wrote:

> That's probably because of the lazy ICMP socket locking used for the
> ICMP socket. When an ICMP transmit is already in progress, the next ICMP
> triggered from a softirq (e.g. ECHO-REQUEST) is dropped
> (see icmp_xmit_lock_bh() in net/ipv4/icmp.c).

Hmm...and this is considered desired behavior? It seems like an odd way
of handling packets intended to test latency and reliability. :)

This is most likely the cause, but I will test tomorrow to confirm.

Thanks,

Simon-

[ Simon Kirby ][ Network Operations ]
[ [email protected] ][ NetNation Communications ]
[ Opinions expressed are not necessarily those of my employer. ]

2003-02-20 09:24:22

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Thu, Feb 20, 2003 at 01:20:43AM -0800, Simon Kirby wrote:
> On Thu, Feb 20, 2003 at 08:52:46AM +0100, Andi Kleen wrote:
>
> > That's probably because of the lazy ICMP socket locking used for the
> > ICMP socket. When an ICMP transmit is already in progress, the next ICMP
> > triggered from a softirq (e.g. ECHO-REQUEST) is dropped
> > (see icmp_xmit_lock_bh() in net/ipv4/icmp.c).
>
> Hmm...and this is considered desired behavior? It seems like an odd way
> of handling packets intended to test latency and reliability. :)

IP is best-effort. Dropping packets in odd cases to make locking simpler
is not unreasonable. Would you prefer a slower kernel?

-Andi

2003-02-20 10:02:35

by Eric Dumazet

Subject: Re: Longstanding networking / SMP issue? (duplextest)


> On Thu, Feb 20, 2003 at 01:20:43AM -0800, Simon Kirby wrote:
> > On Thu, Feb 20, 2003 at 08:52:46AM +0100, Andi Kleen wrote:
> >
> > > That's probably because of the lazy ICMP socket locking used for the
> > > ICMP socket. When an ICMP transmit is already in progress, the next
> > > ICMP triggered from a softirq (e.g. ECHO-REQUEST) is dropped
> > > (see icmp_xmit_lock_bh() in net/ipv4/icmp.c).
> >
> > Hmm...and this is considered desired behavior? It seems like an odd way
> > of handling packets intended to test latency and reliability. :)
>
> IP is best-effort. Dropping packets in odd cases to make locking simpler
> is not unreasonable. Would you prefer a slower kernel?

Yes, IP is best-effort. But this argument can't explain why IP on Linux
works better if we disable SMP...

I'm sorry, Andi, but "IP is best effort" sounds very lazy in this case.

SMP should give the machine more power, not drop frames because
'another CPU' is busy and holds a lock we *need*.

Eric

2003-02-20 10:44:32

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

> Yes, IP is best-effort. But this argument can't explain why IP on Linux
> works better if we disable SMP...

It has nothing to do with SMP. The lazy locking can drop packets on UP
kernels too in extreme cases, and also with preempt enabled.

-Andi

2003-02-20 10:53:17

by Eric Dumazet

Subject: Re: Longstanding networking / SMP issue? (duplextest)



> > Yes, IP is best-effort. But this argument can't explain why IP on Linux
> > works better if we disable SMP...
>
> It has nothing to do with SMP. The lazy locking can drop packets on UP
> kernels too in extreme cases, and also with preempt enabled.
>

Well, I too noticed that binding NIC IRQs to one CPU:

echo 1 > /proc/irq/18/smp_affinity

helps a lot in normal cases.

Some of us want SMP machines because the application needs a lot of CPU,
but we also need to avoid dropping frames and wasting limited network
bandwidth.

Thanks

2003-02-21 04:30:28

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: Thu, 20 Feb 2003 10:34:22 +0100

On Thu, Feb 20, 2003 at 01:20:43AM -0800, Simon Kirby wrote:
> Hmm...and this is considered desired behavior? It seems like an odd way
> of handling packets intended to test latency and reliability. :)

IP is best-effort. Dropping packets in odd cases to make locking simpler
is not unreasonable. Would you prefer a slower kernel?

True.

But this is a quality of implementation issue and I doubt the kernel
would be slower if we fixed this silly behavior.

Frankly, the locking is due to lazyness, rather than a specific design
decision. So let's fix it.

2003-02-21 07:17:15

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Thu, Feb 20, 2003 at 08:24:38PM -0800, David S. Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Thu, 20 Feb 2003 10:34:22 +0100
>
> On Thu, Feb 20, 2003 at 01:20:43AM -0800, Simon Kirby wrote:
> > Hmm...and this is considered desired behavior? It seems like an odd way
> > of handling packets intended to test latency and reliability. :)
>
> IP is best-effort. Dropping packets in odd cases to make locking simpler
> is not unreasonable. Would you prefer a slower kernel?
>
> True.
>
> But this is a quality of implementation issue and I doubt the kernel
> would be slower if we fixed this silly behavior.
>
> Frankly, the locking is due to lazyness, rather than a specific design
> decision. So let's fix it.

For icmp_xmit_lock it can only be done in a limited fashion - you are
always restricted by the buffer size of the ICMP socket. Also, I don't
know how to lock the socket from BH context nicely - the only simple way
is probably the trick from the retransmit timer: just try again in a
jiffy, but that could cause nasty queueing under high load.

Fixing the error drop behaviour of TCP will be somewhat nasty too.

In both cases you'll need a retry timer (unreliable) or a dedicated ICMP
backlog (complicated).

-Andi

2003-02-21 09:49:09

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: Fri, 21 Feb 2003 08:27:19 +0100

For icmp_xmit_lock it can only be done in a limited fashion - you are
always restricted by the buffer size of the ICMP socket. Also, I don't
know how to lock the socket from BH context nicely - the only simple way
is probably the trick from the retransmit timer: just try again in a
jiffy, but that could cause nasty queueing under high load.

Fixing the error drop behaviour of TCP will be somewhat nasty too.

In both cases you'll need a retry timer (unreliable) or a dedicated ICMP
backlog (complicated).

The big problem is that we have one ICMP socket on UP and still only
one on SMP. That's just dumb; we should make this a per-cpu thing.

I suspect this will fix the original bug report.

I don't think the TCP case is much of an issue. TCP retries things
etc.

2003-02-21 10:12:57

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

> The big problem is that we have one ICMP socket on UP and still only
> one on SMP. That's just dumb; we should make this a per-cpu thing.

Doesn't work with preemption, does it? How do you keep it on a single CPU
when it runs in process context?

-Andi

2003-02-21 10:17:17

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: Fri, 21 Feb 2003 11:22:58 +0100

Doesn't work with preemption, does it? How do you keep it on a single CPU
when it runs in process context?

What runs in process context? ICMP responses are generated from BH's.

2003-02-21 10:36:38

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Fri, Feb 21, 2003 at 02:11:25AM -0800, David S. Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Fri, 21 Feb 2003 11:22:58 +0100
>
> Doesn't work with preemption, does it? How do you keep it on a single CPU
> when it runs in process context?
>
> What runs in process context? ICMP responses are generated from BH's.

Hmm, I was thinking about TCP process context, but you're right: it can
only generate ICMPs for dead ports, and that happens while still in
tcp_v4_rcv().

At least ipip.c will send ICMPs from process context (ipip_tunnel_xmit);

netfilter will probably do so too.

-Andi

2003-02-21 15:09:16

by Bill Davidsen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Thu, 20 Feb 2003, Andi Kleen wrote:

> On Thu, Feb 20, 2003 at 01:20:43AM -0800, Simon Kirby wrote:
> > On Thu, Feb 20, 2003 at 08:52:46AM +0100, Andi Kleen wrote:
> >
> > > That's probably because of the lazy ICMP socket locking used for the
> > > ICMP socket. When an ICMP transmit is already in progress, the next
> > > ICMP triggered from a softirq (e.g. ECHO-REQUEST) is dropped
> > > (see icmp_xmit_lock_bh() in net/ipv4/icmp.c).
> >
> > Hmm...and this is considered desired behavior? It seems like an odd way
> > of handling packets intended to test latency and reliability. :)
>
> IP is best-effort. Dropping packets in odd cases to make locking simpler
> is not unreasonable. Would you prefer a slower kernel?

Software is not a zero-sum exercise, so "fast" and "correct" are not
mutually exclusive.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-02-23 09:18:37

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: Fri, 21 Feb 2003 11:45:41 +0100

At least ipip.c will send ICMPs from process context (ipip_tunnel_xmit);

netfilter will probably do so too.

We can easily make them all run in BH (protected) context :-)

This makes the locking much simpler. Add to this the per-cpu socket
idea I proposed and we arrive at the patch below (against 2.5.x,
easily ported to 2.4.x; just remove the cpu_possible() stuff).

Comments?

--- net/ipv4/icmp.c.~1~ Sat Feb 22 23:41:53 2003
+++ net/ipv4/icmp.c Sun Feb 23 01:25:43 2003
@@ -223,57 +223,29 @@
static struct icmp_control icmp_pointers[NR_ICMP_TYPES+1];

/*
- * The ICMP socket. This is the most convenient way to flow control
+ * The ICMP socket(s). This is the most convenient way to flow control
* our ICMP output as well as maintain a clean interface throughout
* all layers. All Socketless IP sends will soon be gone.
+ *
+ * On SMP we have one ICMP socket per-cpu.
*/
-struct socket *icmp_socket;
-
-/* ICMPv4 socket is only a bit non-reenterable (unlike ICMPv6,
- which is strongly non-reenterable). A bit later it will be made
- reenterable and the lock may be removed then.
- */
-
-static int icmp_xmit_holder = -1;
-
-static int icmp_xmit_lock_bh(void)
-{
- int rc;
- if (!spin_trylock(&icmp_socket->sk->lock.slock)) {
- rc = -EAGAIN;
- if (icmp_xmit_holder == smp_processor_id())
- goto out;
- spin_lock(&icmp_socket->sk->lock.slock);
- }
- rc = 0;
- icmp_xmit_holder = smp_processor_id();
-out:
- return rc;
-}
+static struct socket *__icmp_socket[NR_CPUS];
+#define icmp_socket __icmp_socket[smp_processor_id()]

-static __inline__ int icmp_xmit_lock(void)
+static __inline__ void icmp_xmit_lock(void)
{
- int ret;
local_bh_disable();
- ret = icmp_xmit_lock_bh();
- if (ret)
- local_bh_enable();
- return ret;
-}

-static void icmp_xmit_unlock_bh(void)
-{
- icmp_xmit_holder = -1;
- spin_unlock(&icmp_socket->sk->lock.slock);
+ if (!spin_trylock(&icmp_socket->sk->lock.slock))
+ BUG();
}

-static __inline__ void icmp_xmit_unlock(void)
+static void icmp_xmit_unlock(void)
{
- icmp_xmit_unlock_bh();
+ spin_unlock(&icmp_socket->sk->lock.slock);
local_bh_enable();
}

-
/*
* Send an ICMP frame.
*/
@@ -404,10 +376,11 @@
struct rtable *rt = (struct rtable *)skb->dst;
u32 daddr;

- if (ip_options_echo(&icmp_param->replyopts, skb) ||
- icmp_xmit_lock_bh())
+ if (ip_options_echo(&icmp_param->replyopts, skb))
goto out;

+ icmp_xmit_lock();
+
icmp_param->data.icmph.checksum = 0;
icmp_out_count(icmp_param->data.icmph.type);

@@ -434,7 +407,7 @@
icmp_push_reply(icmp_param, &ipc, rt);
ip_rt_put(rt);
out_unlock:
- icmp_xmit_unlock_bh();
+ icmp_xmit_unlock();
out:;
}

@@ -519,8 +492,7 @@
}
}

- if (icmp_xmit_lock())
- goto out;
+ icmp_xmit_lock();

/*
* Construct source address and options.
@@ -1141,19 +1113,30 @@
void __init icmp_init(struct net_proto_family *ops)
{
struct inet_opt *inet;
- int err = sock_create(PF_INET, SOCK_RAW, IPPROTO_ICMP, &icmp_socket);
+ int i;

- if (err < 0)
- panic("Failed to create the ICMP control socket.\n");
- icmp_socket->sk->allocation = GFP_ATOMIC;
- icmp_socket->sk->sndbuf = SK_WMEM_MAX * 2;
- inet = inet_sk(icmp_socket->sk);
- inet->ttl = MAXTTL;
- inet->pmtudisc = IP_PMTUDISC_DONT;
-
- /* Unhash it so that IP input processing does not even
- * see it, we do not wish this socket to see incoming
- * packets.
- */
- icmp_socket->sk->prot->unhash(icmp_socket->sk);
+ for (i = 0; i < NR_CPUS; i++) {
+ int err;
+
+ if (!cpu_possible(i))
+ continue;
+
+ err = sock_create(PF_INET, SOCK_RAW, IPPROTO_ICMP,
+ &__icmp_socket[i]);
+
+ if (err < 0)
+ panic("Failed to create the ICMP control socket.\n");
+
+ __icmp_socket[i]->sk->allocation = GFP_ATOMIC;
+ __icmp_socket[i]->sk->sndbuf = SK_WMEM_MAX * 2;
+ inet = inet_sk(__icmp_socket[i]->sk);
+ inet->ttl = MAXTTL;
+ inet->pmtudisc = IP_PMTUDISC_DONT;
+
+ /* Unhash it so that IP input processing does not even
+ * see it, we do not wish this socket to see incoming
+ * packets.
+ */
+ __icmp_socket[i]->sk->prot->unhash(__icmp_socket[i]->sk);
+ }
}
--- net/ipv6/icmp.c.~1~ Sat Feb 22 23:48:49 2003
+++ net/ipv6/icmp.c Sun Feb 23 01:27:09 2003
@@ -67,10 +67,11 @@
DEFINE_SNMP_STAT(struct icmpv6_mib, icmpv6_statistics);

/*
- * ICMP socket for flow control.
+ * ICMP socket(s) for flow control.
*/

-struct socket *icmpv6_socket;
+static struct socket *__icmpv6_socket[NR_CPUS];
+#define icmpv6_socket __icmpv6_socket[smp_processor_id()]

static int icmpv6_rcv(struct sk_buff *skb);

@@ -87,39 +88,16 @@
__u32 csum;
};

-
-static int icmpv6_xmit_holder = -1;
-
-static int icmpv6_xmit_lock_bh(void)
+static __inline__ void icmpv6_xmit_lock(void)
{
- if (!spin_trylock(&icmpv6_socket->sk->lock.slock)) {
- if (icmpv6_xmit_holder == smp_processor_id())
- return -EAGAIN;
- spin_lock(&icmpv6_socket->sk->lock.slock);
- }
- icmpv6_xmit_holder = smp_processor_id();
- return 0;
-}
-
-static __inline__ int icmpv6_xmit_lock(void)
-{
- int ret;
local_bh_disable();
- ret = icmpv6_xmit_lock_bh();
- if (ret)
- local_bh_enable();
- return ret;
-}
-
-static void icmpv6_xmit_unlock_bh(void)
-{
- icmpv6_xmit_holder = -1;
- spin_unlock(&icmpv6_socket->sk->lock.slock);
+ if (!spin_trylock(&icmpv6_socket->sk->lock.slock))
+ BUG();
}

static __inline__ void icmpv6_xmit_unlock(void)
{
- icmpv6_xmit_unlock_bh();
+ spin_unlock(&icmpv6_socket->sk->lock.slock);
local_bh_enable();
}

@@ -341,8 +319,7 @@
fl.uli_u.icmpt.type = type;
fl.uli_u.icmpt.code = code;

- if (icmpv6_xmit_lock())
- return;
+ icmpv6_xmit_lock();

if (!icmpv6_xrlim_allow(sk, type, &fl))
goto out;
@@ -415,15 +392,14 @@
fl.uli_u.icmpt.type = ICMPV6_ECHO_REPLY;
fl.uli_u.icmpt.code = 0;

- if (icmpv6_xmit_lock_bh())
- return;
+ icmpv6_xmit_lock();

ip6_build_xmit(sk, icmpv6_getfrag, &msg, &fl, msg.len, NULL, -1,
MSG_DONTWAIT);
ICMP6_INC_STATS_BH(Icmp6OutEchoReplies);
ICMP6_INC_STATS_BH(Icmp6OutMsgs);

- icmpv6_xmit_unlock_bh();
+ icmpv6_xmit_unlock();
}

static void icmpv6_notify(struct sk_buff *skb, int type, int code, u32 info)
@@ -626,26 +602,47 @@
int __init icmpv6_init(struct net_proto_family *ops)
{
struct sock *sk;
- int err;
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ int err;
+
+ if (!cpu_possible(i))
+ continue;

- err = sock_create(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6, &icmpv6_socket);
- if (err < 0) {
- printk(KERN_ERR
- "Failed to initialize the ICMP6 control socket (err %d).\n",
- err);
- icmpv6_socket = NULL; /* for safety */
- return err;
+ err = sock_create(PF_INET6, SOCK_RAW, IPPROTO_ICMPV6,
+ &__icmpv6_socket[i]);
+ if (err < 0) {
+ int j;
+
+ printk(KERN_ERR
+ "Failed to initialize the ICMP6 control socket "
+ "(err %d).\n",
+ err);
+ for (j = 0; j < i; j++) {
+ if (!cpu_possible(j))
+ continue;
+ sock_release(__icmpv6_socket[j]);
+ __icmpv6_socket[j] = NULL; /* for safety */
+ }
+ return err;
+ }
+
+ sk = __icmpv6_socket[i]->sk;
+ sk->allocation = GFP_ATOMIC;
+ sk->sndbuf = SK_WMEM_MAX*2;
+ sk->prot->unhash(sk);
}

- sk = icmpv6_socket->sk;
- sk->allocation = GFP_ATOMIC;
- sk->sndbuf = SK_WMEM_MAX*2;
- sk->prot->unhash(sk);

if (inet6_add_protocol(&icmpv6_protocol, IPPROTO_ICMPV6) < 0) {
printk(KERN_ERR "Failed to register ICMP6 protocol\n");
- sock_release(icmpv6_socket);
- icmpv6_socket = NULL;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ sock_release(__icmpv6_socket[i]);
+ __icmpv6_socket[i] = NULL;
+ }
return -EAGAIN;
}

@@ -654,8 +651,14 @@

void icmpv6_cleanup(void)
{
- sock_release(icmpv6_socket);
- icmpv6_socket = NULL; /* For safety. */
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_possible(i))
+ continue;
+ sock_release(__icmpv6_socket[i]);
+ __icmpv6_socket[i] = NULL; /* For safety. */
+ }
inet6_del_protocol(&icmpv6_protocol, IPPROTO_ICMPV6);
}

2003-02-23 09:55:02

by Christoph Hellwig

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Sun, Feb 23, 2003 at 01:12:17AM -0800, David S. Miller wrote:
> +static struct socket *__icmp_socket[NR_CPUS];
> +#define icmp_socket __icmp_socket[smp_processor_id()]

This should be per-cpu data

> -static __inline__ int icmp_xmit_lock(void)
> +static __inline__ void icmp_xmit_lock(void)
> {
> - int ret;
> local_bh_disable();
> - ret = icmp_xmit_lock_bh();
> - if (ret)
> - local_bh_enable();
> - return ret;
> -}
>
> -static void icmp_xmit_unlock_bh(void)
> -{
> - icmp_xmit_holder = -1;
> - spin_unlock(&icmp_socket->sk->lock.slock);
> + if (!spin_trylock(&icmp_socket->sk->lock.slock))
> + BUG();

unlikely()?

> -static __inline__ void icmp_xmit_unlock(void)
> +static void icmp_xmit_unlock(void)
> {
> - icmp_xmit_unlock_bh();
> + spin_unlock(&icmp_socket->sk->lock.slock);
> local_bh_enable();

spin_unlock_bh

> + icmp_xmit_lock();

Hmm, and I guess the code would be much more readable if you used
the spin_lock call directly. The implicit icmp_socket doesn't
really help readability either.

2003-02-23 10:05:53

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Christoph Hellwig <[email protected]>
Date: Sun, 23 Feb 2003 10:02:34 +0000

> + icmp_xmit_lock();

Hmm, and I guess the code would be much more readable if you used
the spin_lock call directly. The implicit icmp_socket doesn't
really help readability either.

Sure it does, it encapsulates the icmp_socket->sk->foo->far->fum
dereferencing into one place instead of 4.

2003-02-23 10:03:24

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Christoph Hellwig <[email protected]>
Date: Sun, 23 Feb 2003 10:02:34 +0000

On Sun, Feb 23, 2003 at 01:12:17AM -0800, David S. Miller wrote:
> +static struct socket *__icmp_socket[NR_CPUS];
> +#define icmp_socket __icmp_socket[smp_processor_id()]

This should be per-cpu data

My bad. Thanks for spotting this.

2003-02-23 10:20:33

by Andi Kleen

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Sun, Feb 23, 2003 at 01:55:11AM -0800, David S. Miller wrote:
> From: Christoph Hellwig <[email protected]>
> Date: Sun, 23 Feb 2003 10:02:34 +0000
>
> On Sun, Feb 23, 2003 at 01:12:17AM -0800, David S. Miller wrote:
> > +static struct socket *__icmp_socket[NR_CPUS];
> > +#define icmp_socket __icmp_socket[smp_processor_id()]
>
> This should be per-cpu data
>
> My bad. Thanks for spotting this.

Won't work if the IPv4 code is ever made modular.

-Andi

2003-02-23 10:27:10

by Christoph Hellwig

Subject: Re: Longstanding networking / SMP issue? (duplextest)

On Sun, Feb 23, 2003 at 11:30:39AM +0100, Andi Kleen wrote:
> > My bad. Thanks for spotting this.
>
> Won't work if the IPv4 code is ever made modular.

I think Rusty is working on making it work for modules. Anyway, I think
that would be one of the minor problems in making ipv4 modular :)

2003-02-23 10:29:33

by David Miller

Subject: Re: Longstanding networking / SMP issue? (duplextest)

From: Andi Kleen <[email protected]>
Date: Sun, 23 Feb 2003 11:30:39 +0100

On Sun, Feb 23, 2003 at 01:55:11AM -0800, David S. Miller wrote:
> My bad. Thanks for spotting this.

Won't work if the IPv4 code is ever made modular.

Right, which is why I can't use per_cpu on the ipv6 side.