2005-01-13 14:54:18

by Scott Doty

Subject: 2.4.28(+?): Strange ARP problem

Hi,

We use Linux extensively here at Sonic.net. Our web servers have two
NICs -- a NIC with a public IP address, and a NIC on our SAN (with NetApps).

When we tried to upgrade to 2.4.28, we encountered a problem with NetApp
reachability, which turns out to have been a problem with ARP: we
were seeing two ARP entries for the NetApp IPs. One would be correct, and
one would be "incomplete".

Occasionally, a system would glom onto the incomplete entry, and NFS
connectivity would tank. This doesn't happen with 2.4.27.
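
To check a host for this symptom, one could scan the kernel's ARP table for
addresses that have both a complete and an incomplete entry. The following is
a minimal sketch, assuming the input text uses the /proc/net/arp column layout
(IP address, HW type, Flags, ...) and that a cleared 0x2 bit in Flags marks an
incomplete entry. The function name, the ARP_FLAG_COMPLETE macro, and the
256-address limit are illustrative choices, not anything from this thread.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IPS 256
#define ARP_FLAG_COMPLETE 0x02  /* same value as ATF_COM in <net/if_arp.h> */

/*
 * Count the IP addresses that appear with both a complete and an
 * incomplete entry in text laid out like /proc/net/arp.
 * Illustrative helper: addresses beyond MAX_IPS are ignored.
 */
static int count_conflicting_ips(const char *text)
{
    char ips[MAX_IPS][16];
    int complete[MAX_IPS] = {0}, incomplete[MAX_IPS] = {0};
    int n = 0, conflicts = 0, i;
    const char *line = strchr(text, '\n');  /* skip the header line */

    while (line && *++line) {
        char ip[16];
        unsigned int hw_type, flags;

        /* Columns: IP address, HW type (hex), Flags (hex), ... */
        if (sscanf(line, "%15s %x %x", ip, &hw_type, &flags) == 3) {
            /* Find this IP among those already seen, or add it. */
            for (i = 0; i < n && strcmp(ips[i], ip) != 0; i++)
                ;
            if (i == n && n < MAX_IPS)
                strcpy(ips[n++], ip);
            if (i < n) {
                if (flags & ARP_FLAG_COMPLETE)
                    complete[i]++;
                else
                    incomplete[i]++;
            }
        }
        line = strchr(line, '\n');
    }

    for (i = 0; i < n; i++)
        if (complete[i] && incomplete[i])
            conflicts++;
    return conflicts;
}
```

A caller would read /proc/net/arp into a NUL-terminated buffer and pass it in;
a nonzero result means at least one address has the conflicting pair of
entries described above.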

We'd like to upgrade to 2.4.29-rc2, but we have much trepidation about doing
so. I certainly don't want to treat the list as "our own personal help
desk" (as warned about in the FAQ), but was hoping someone could shed some
light on the problem. I think either I or one of our guys can write a
patch to fix it, if someone would point us in the right direction.

Thank you,

-Scott


2005-01-13 15:46:15

by Marcelo Tosatti

Subject: Re: 2.4.28(+?): Strange ARP problem

On Thu, Jan 13, 2005 at 06:50:29AM -0800, Scott Doty wrote:
> Hi,
>
> We use Linux extensively here at Sonic.net. Our web servers have two
> NICs -- a NIC with a public IP address, and a NIC on our SAN (with NetApps).
>
> When we tried to upgrade to 2.4.28, we encountered a problem with NetApp
> reachability, which turns out to have been a problem with ARP: we
> were seeing two ARP entries for the NetApp IPs. One would be correct, and
> one would be "incomplete".
>
> Occasionally, a system would glom onto the incomplete entry, and NFS
> connectivity would tank. This doesn't happen with 2.4.27.
>
> We'd like to upgrade to 2.4.29-rc2, but we have much trepidation about doing
> so. I certainly don't want to treat the list as "our own personal help
> desk" (as warned about in the FAQ), but was hoping someone could shed some
> light on the problem. I think either I or one of our guys can write a
> patch to fix it, if someone would point us in the right direction.
>
> Thank you,

Scott,

I have no idea what might be causing such a regression. I see a few
ARP-related changelog entries in v2.4.28-rc2:

o [IPV4]: Set ARP hw type correctly for BOOTP over FDDI
o [IPV4]: Permit the official ARP hw type in SIOCSARP for FDDI

Maybe you can try earlier v2.4.28's (-rc1 for one) to check where
the problem starts to happen?

David, Herbert, any ideas?


2005-01-13 16:23:38

by Scott Doty

Subject: Re: 2.4.28(+?): Strange ARP problem

On Thu, Jan 13, 2005 at 10:09:00AM -0200, Marcelo Tosatti wrote:
> Maybe you can try earlier v2.4.28's (-rc1 for one) to check where
> the problem starts to happen?

Good plan, we'll try different kernel versions to pin down exactly where the
problem starts.

-Scott

2005-01-13 21:06:40

by Herbert Xu

Subject: Re: 2.4.28(+?): Strange ARP problem

On Thu, Jan 13, 2005 at 10:09:00AM -0200, Marcelo Tosatti wrote:
>
> Maybe you can try earlier v2.4.28's (-rc1 for one) to check where
> the problem starts to happen?

The symptom sounds like the bug in the 2.4 backport of the neighbour
hash updates. In neigh_create, hash_val needs to be computed inside
the lock (and after the growing), not outside.

Someone even posted a patch for it. I'll dig it up tonight if it
doesn't show up by then.
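
A simplified userspace sketch of the ordering issue described above: if the
bucket index is computed before the lock is taken and the table then grows,
the new entry lands in a bucket chosen with the old mask, so later lookups
miss it and a second entry for the same key can be created. All names here
(struct table, table_create, and so on) are hypothetical stand-ins rather
than the kernel's neigh_create code, and a pthread mutex stands in for the
kernel lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified model; these are not the kernel's neighbour structures. */
struct entry { uint32_t key; struct entry *next; };

struct table {
    pthread_mutex_t lock;
    struct entry **buckets;
    unsigned int mask;      /* bucket count - 1; changes when the table grows */
    unsigned int entries;
};

static unsigned int hash_key(uint32_t key) { return key ^ (key >> 16); }

static int table_init(struct table *t, unsigned int nbuckets)
{
    assert(nbuckets && !(nbuckets & (nbuckets - 1)));  /* power of two */
    t->buckets = calloc(nbuckets, sizeof(*t->buckets));
    if (!t->buckets)
        return -1;
    t->mask = nbuckets - 1;
    t->entries = 0;
    return pthread_mutex_init(&t->lock, NULL);
}

/* Double the bucket array and rehash. Caller must hold t->lock. */
static void table_grow(struct table *t)
{
    unsigned int i, new_mask = (t->mask + 1) * 2 - 1;
    struct entry **nb = calloc(new_mask + 1, sizeof(*nb));

    if (!nb)
        return;             /* keep the old array if allocation fails */
    for (i = 0; i <= t->mask; i++) {
        struct entry *e = t->buckets[i];
        while (e) {
            struct entry *next = e->next;
            unsigned int b = hash_key(e->key) & new_mask;
            e->next = nb[b];
            nb[b] = e;
            e = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->mask = new_mask;
}

static struct entry *table_lookup(struct table *t, uint32_t key)
{
    struct entry *e;

    pthread_mutex_lock(&t->lock);
    for (e = t->buckets[hash_key(key) & t->mask]; e; e = e->next)
        if (e->key == key)
            break;
    pthread_mutex_unlock(&t->lock);
    return e;
}

/* hash_inside_lock == 0 models the buggy ordering, nonzero the fixed one. */
static struct entry *table_create(struct table *t, uint32_t key,
                                  int hash_inside_lock)
{
    struct entry *e = malloc(sizeof(*e));
    unsigned int b;

    if (!e)
        return NULL;
    e->key = key;
    b = hash_key(key) & t->mask;        /* buggy: index chosen too early */
    pthread_mutex_lock(&t->lock);
    if (t->entries > t->mask)
        table_grow(t);                  /* may change t->mask */
    if (hash_inside_lock)
        b = hash_key(key) & t->mask;    /* fixed: index uses the final mask */
    e->next = t->buckets[b];
    t->buckets[b] = e;
    t->entries++;
    pthread_mutex_unlock(&t->lock);
    return e;
}

/* Count entries for key across every bucket, including misplaced ones. */
static unsigned int table_count(struct table *t, uint32_t key)
{
    unsigned int i, n = 0;
    struct entry *e;

    pthread_mutex_lock(&t->lock);
    for (i = 0; i <= t->mask; i++)
        for (e = t->buckets[i]; e; e = e->next)
            if (e->key == key)
                n++;
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

In this model, starting from two buckets and creating keys 1, 2 and 3 with
hash_inside_lock set to 0 leaves key 3 in a stale bucket, so table_lookup
misses it and a retry adds a duplicate. With hash_inside_lock set to 1 the
same sequence yields a single, findable entry.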

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-01-13 22:21:35

by Herbert Xu

Subject: Re: 2.4.28(+?): Strange ARP problem

On Thu, Jan 13, 2005 at 02:01:00PM -0800, Scott Doty wrote:
>
> I just built a patch from your description -- it's attached.

Yep, that looks good. Please let us know if that fixes your problems.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-01-14 00:19:54

by Scott Doty

Subject: Re: 2.4.28(+?): Strange ARP problem

On Fri, Jan 14, 2005 at 08:01:42AM +1100, Herbert Xu wrote:
> On Thu, Jan 13, 2005 at 10:09:00AM -0200, Marcelo Tosatti wrote:
> >
> > Maybe you can try earlier v2.4.28's (-rc1 for one) to check where
> > the problem starts to happen?
>
> The symptom sounds like the bug in the 2.4 backport of the neighbour
> hash updates. In neigh_create, hash_val needs to be computed inside
> the lock (and after the growing), not outside.
>
> Someone even posted a patch for it. I'll dig it up tonight if it
> doesn't show up by then.

I just built a patch from your description -- it's attached.

-Scott


Attachments:
patch-arp (606.00 B)