2002-02-27 19:05:06

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?

Hello!

> I think his analysis is alright but he patch is questionable.

Yes. "if (tb) tcp_tw_put(tw)" cannot be right, no doubts.

Seems, it is enough to remove from bind hash _before_ established.

The idea was that bind hash is pure slave of another state, so that
it need not refcounting at all. Note that adding the second increment
does not help: when we verify that leakage (the situation, when
bucket is in bind hash, but has no timer running) is impossible
we immediately arrive to elimination of the refcount.

Raghu, could you check the variant with inverted order of removal?
Do you see holes? From my side... I need to think more. :-)

Alexey


2002-02-27 19:47:52

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?


--- [email protected] wrote:
> Hello!
>
> > I think his analysis is alright but he patch is questionable.
>
> Yes. "if (tb) tcp_tw_put(tw)" cannot be right, no doubts.

In timewait_kill(), tb is set only _after_ it has been removed from the
established (if it exist there). In _hashdance() tw is inserted into
established _before_ it is inserted into bindhash. So the above is one way of
saying do tw_put() when it has been deleted from _both_ the lists.
Also note that patch removes "if (!tw->pprev) return;" thingy. Infact this
return sort of implied you never thought tw could be deleted from ehash and
not from bind hash.

> Seems, it is enough to remove from bind hash _before_ established.
>
> The idea was that bind hash is pure slave of another state, so that
> it need not refcounting at all. Note that adding the second increment
> does not help: when we verify that leakage (the situation, when
> bucket is in bind hash, but has no timer running) is impossible
> we immediately arrive to elimination of the refcount.
>
> Raghu, could you check the variant with inverted order of removal?
> Do you see holes? From my side... I need to think more. :-)

Just the inteverted order of removal will fix _this_ perticular case.
But we still end up doing tw_put() in timewait_kill() even though tw is still
in the bind list (just got inserted on the other processor). This seems
conceptually incorrect or confusing. The refcnt increment in tw_hashdance()
is for these two lists and tw_put() in timewait_kill() should correspond to
the deletion from _both_ the lists.

If you want to avoid timewait_kill() getting called twice altogether.. then
its a different issue.

Raghu.

> Alexey


__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

2002-02-27 20:14:14

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?

Hello!

> inverted order of insertion into the lists in tw_hashdance() is probably
> cleaner fix than inverted order of removal..

Why are they not equivalent? Good question? :-)

Alexey

2002-02-27 20:06:12

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?


inverted order of insertion into the lists in tw_hashdance() is probably
cleaner fix than inverted order of removal.. (that is.. if you dont like "if
(tb) put();" or double refcnting :-)

--- Raghu Angadi <[email protected]> wrote:
>
> --- [email protected] wrote:
> > Hello!
> >
> > > I think his analysis is alright but he patch is questionable.
> >
> > Yes. "if (tb) tcp_tw_put(tw)" cannot be right, no doubts.
>
> In timewait_kill(), tb is set only _after_ it has been removed from the
> established (if it exist there). In _hashdance() tw is inserted into
> established _before_ it is inserted into bindhash. So the above is one way
> of
> saying do tw_put() when it has been deleted from _both_ the lists.
> Also note that patch removes "if (!tw->pprev) return;" thingy. Infact this
> return sort of implied you never thought tw could be deleted from ehash and
> not from bind hash.
>
> > Seems, it is enough to remove from bind hash _before_ established.
> >
> > The idea was that bind hash is pure slave of another state, so that
> > it need not refcounting at all. Note that adding the second increment
> > does not help: when we verify that leakage (the situation, when
> > bucket is in bind hash, but has no timer running) is impossible
> > we immediately arrive to elimination of the refcount.
> >
> > Raghu, could you check the variant with inverted order of removal?
> > Do you see holes? From my side... I need to think more. :-)
>
> Just the inteverted order of removal will fix _this_ perticular case.
> But we still end up doing tw_put() in timewait_kill() even though tw is
> still
> in the bind list (just got inserted on the other processor). This seems
> conceptually incorrect or confusing. The refcnt increment in tw_hashdance()
> is for these two lists and tw_put() in timewait_kill() should correspond to
> the deletion from _both_ the lists.
>
> If you want to avoid timewait_kill() getting called twice altogether.. then
> its a different issue.
>
> Raghu.
>
> > Alexey
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Greetings - Send FREE e-cards for every occasion!
> http://greetings.yahoo.com


__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

2002-02-27 20:09:21

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?

Hello!

> the deletion from _both_ the lists.

No, it is only for insertion to established hash table.

References from bind hash are not considered as references.
Look, if socket will sit only in bind table nobody ever will see it.
Where is the the reference? :-) It just must _not_ stay in bind hash,
if no other references remained, that's invariant which we should provide
now. If we will fail, we are in troubles.

So, its absence in bind hash must be guaranteed to the time of destruction.
Look at this from another aspect: imagine you increment refcnt when
adding to binding table. OK. So, what does guarantee that bucket
will not remain in bind hash forever? And "it will not" is equivalent
to "refcnt is not useful".

Anyway, I will think on this at night, I am not ready to tell how to
do this right.


> If you want to avoid timewait_kill() getting called twice altogether.

Sorry, I did not understand what do you mean here. It can be called
twice or three times or more. This is impossible to avoid without adding
spinlock to timewait bucket.

Alexey

2002-02-27 20:28:17

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?


--- [email protected] wrote:
> So, its absence in bind hash must be guaranteed to the time of destruction.
> Look at this from another aspect: imagine you increment refcnt when
> adding to binding table.

> OK. So, what does guarantee that bucket
> will not remain in bind hash forever?
ease of finding the bug :). 'cause this would leave tw in the list withought
deleting it.. that would blow tw_bind_bucket_cache pool and that is so much
easier to find. would've been fixed in very early days.. probably in 2.3.0.1
:->. Still not argueing for double refcount. Reversing the removal order
would work too.. now we will have "if (!tw->tb) return;" instead.

> And "it will not" is equivalent
> to "refcnt is not useful".
>
> Anyway, I will think on this at night, I am not ready to tell how to
> do this right.
>
>
> > If you want to avoid timewait_kill() getting called twice altogether.
>
> Sorry, I did not understand what do you mean here. It can be called
> twice or three times or more. This is impossible to avoid without adding
> spinlock to timewait bucket.

I didn't think you would want to avoid multiple calls to tw_kill() either.

Raghu.
> Alexey


__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

2002-02-27 20:33:57

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?


--- [email protected] wrote:
> Hello!
>
> > inverted order of insertion into the lists in tw_hashdance() is probably
> > cleaner fix than inverted order of removal..
>
> Why are they not equivalent? Good question? :-)

they are. with "if (!tb->tb) return;" instead of "if (!tb->pprev) return;".
silly me was thinking of literal cut-n-paste invertion of the oder :->

> Alexey


__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

2002-02-28 05:55:45

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?

Hi,

I would like to bring your attention to 2nd oops posted in the first mail for
this thread (http://www.uwsg.iu.edu/hypermail/linux/kernel/0202.1/1193.html).
Initially I thought both might be same because they were both releated bind
hash lists. But I have to began suspect it. The first oops is because of race
caused by tw structures left out in the list.. for which a patch is being
discussed here.

The second one seems to be purely because of a corrupted tb
(tcp_bind_hashbucket).. there is no race there.. tb just got pointers into
arbitary place. Also we saw second oops sometimes with relatively less load.
These are two locations it we have seen (with different bad values for
affending pointers each time it occured):

net/ipv4/tcp_ipv4.c:tcp_v4_get_port():

spin_lock(&head->lock);
for (tb = head->chain; tb; tb = tb->next)
1======> if (tb->port == rover)
/*tb is a bad pointer, not NULL */ goto next;
break;
next:
spin_unlock(&head->lock);
/*
more stuff removed
*/
struct sock *sk2 = tb->owners;
int sk_reuse = sk->reuse;
for( ; sk2 != NULL; sk2 = sk2->bind_next) {
if (sk != sk2 &&
2======> sk->bound_dev_if == sk2->bound_dev_if) {
/* sk2 (tb->owners) is a bad pointer */

I will think more about if the problem that causes 1st oops can cause this as
well.

Thanks,
Raghu.

--- [email protected] wrote:
> Hello!
>
> > the deletion from _both_ the lists.
>
> No, it is only for insertion to established hash table.
>
> References from bind hash are not considered as references.
> Look, if socket will sit only in bind table nobody ever will see it.
> Where is the the reference? :-) It just must _not_ stay in bind hash,
> if no other references remained, that's invariant which we should provide
> now. If we will fail, we are in troubles.
>
> So, its absence in bind hash must be guaranteed to the time of destruction.
> Look at this from another aspect: imagine you increment refcnt when
> adding to binding table. OK. So, what does guarantee that bucket
> will not remain in bind hash forever? And "it will not" is equivalent
> to "refcnt is not useful".
>
> Anyway, I will think on this at night, I am not ready to tell how to
> do this right.
>
>
> > If you want to avoid timewait_kill() getting called twice altogether.
>
> Sorry, I did not understand what do you mean here. It can be called
> twice or three times or more. This is impossible to avoid without adding
> spinlock to timewait bucket.
>
> Alexey


__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

2002-03-01 19:09:09

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?

Hello!

> > > inverted order of insertion into the lists in tw_hashdance() is probably
> > > cleaner fix than inverted order of removal..
> >
> > Why are they not equivalent? Good question? :-)
>
> they are. with "if (!tb->tb) return;" instead of "if (!tb->pprev) return;".
> silly me was thinking of literal cut-n-paste invertion of the oder :->

However, you were 100% right. They are really not equivalent. :-)

Right solution is to exchange order of insertion. tcp_timewait_kill()
is right and need not changes.


Proof follows:

Main invariant: that guy who inserts/removes socket to/from established
hash table, must make this for binding table. That guy who did not find
socket in established hash, must not touch binding table.

In this case concurrent tcp_timewait_kill()s are happy: socket will
be removed from binding table once and only once, when it is removed
from established hash. The second tcp_timewait_kill() does not find
socket in established table and it must _not_ touch binding table,
despite of socket can be there at the moment.

The mess happens while concurrent remove and insert: insert adds
socket to established table and then to binding table. Racing remove
removes it from established table, but cannot satisfy invariant
because socket is still not in binding table. (This place should
be asserted with a BUG() for future)

The second statement, which completes the proof: removing is possible
only after the socket is added to established hash table (it is evident,
until this time bucket is private to creator).

Alexey

2002-03-04 20:49:35

by Raghu Angadi

[permalink] [raw]
Subject: Re: Fw: memory corruption in tcp bind hash buckets on SMP?


We have been load testing the kernel with this patch (reverse the insertion
order in __tcp_tw_hashdance(). It seems to work fine till now.

--- [email protected] wrote:
> Hello!
>
> The mess happens while concurrent remove and insert: insert adds
> socket to established table and then to binding table. Racing remove
> removes it from established table, but cannot satisfy invariant
> because socket is still not in binding table. (This place should
> be asserted with a BUG() for future)

I think we should just remove the conditional for tw->tb in
tcp_timewait_kill(). That way its clear to the reader that we expect tb to be
set by this time. If it is NULL the code will anyway oops there.. we dont
need BUG().

Raghu.

> The second statement, which completes the proof: removing is possible
> only after the socket is added to established hash table (it is evident,
> until this time bucket is private to creator).
>
> Alexey


__________________________________________________
Do You Yahoo!?
Yahoo! Sports - sign up for Fantasy Baseball
http://sports.yahoo.com

2002-03-04 23:30:03

by David Miller

[permalink] [raw]
Subject: Re: memory corruption in tcp bind hash buckets on SMP?

From: Raghu Angadi <[email protected]>
Date: Mon, 4 Mar 2002 12:48:32 -0800 (PST)

We have been load testing the kernel with this patch (reverse the insertion
order in __tcp_tw_hashdance(). It seems to work fine till now.

Where was "this patch" posted? I never saw anyone post
any code, all anyone did was merely describe the change :)

2002-03-05 00:54:57

by Raghu Angadi

[permalink] [raw]
Subject: Re: memory corruption in tcp bind hash buckets on SMP?


1st patch: reverses the insertion order in __tcp_tw_hashdance(). This is what
we tested

2nd patch: removes conditional around tw->tb in tcp_timewait_kill(). I have
only made sure it compiles. "tw-tb == NULL;" is not strictly required.. but
esures oops if we end up in similar race again.

patch 1:
--------

--- linux-2.4.7-10/net/ipv4/tcp_minisocks.c Thu Sep 6 13:11:49 2001
+++ linux-2.4.7-10-mod/net/ipv4/tcp_minisocks.c Fri Mar 1 15:41:26 2002
@@ -304,9 +304,25 @@
struct tcp_bind_hashbucket *bhead;
struct sock **head, *sktw;

+
+ /* Step 1: Put TW into bind hash. Original socket stays there too.
+ Note, that any socket with sk->num!=0 MUST be bound in binding
+ cache, even if it is closed.
+ */
+ bhead = &tcp_bhash[tcp_bhashfn(sk->num)];
+ spin_lock(&bhead->lock);
+ tw->tb = (struct tcp_bind_bucket *)sk->prev;
+ BUG_TRAP(sk->prev!=NULL);
+ if ((tw->bind_next = tw->tb->owners) != NULL)
+ tw->tb->owners->bind_pprev = &tw->bind_next;
+ tw->tb->owners = (struct sock*)tw;
+ tw->bind_pprev = &tw->tb->owners;
+ spin_unlock(&bhead->lock);
+
write_lock(&ehead->lock);

- /* Step 1: Remove SK from established hash. */
+ /* Step 2: Remove SK from established hash. */
if (sk->pprev) {
if(sk->next)
sk->next->pprev = sk->pprev;
@@ -315,7 +331,7 @@
sock_prot_dec_use(sk->prot);
}

- /* Step 2: Hash TW into TIMEWAIT half of established hash table. */
+ /* Step 3: Hash TW into TIMEWAIT half of established hash table. */
head = &(ehead + tcp_ehash_size)->chain;
sktw = (struct sock *)tw;
if((sktw->next = *head) != NULL)
@@ -325,20 +341,6 @@
atomic_inc(&tw->refcnt);

write_unlock(&ehead->lock);
-
- /* Step 3: Put TW into bind hash. Original socket stays there too.
- Note, that any socket with sk->num!=0 MUST be bound in binding
- cache, even if it is closed.
- */
- bhead = &tcp_bhash[tcp_bhashfn(sk->num)];
- spin_lock(&bhead->lock);
- tw->tb = (struct tcp_bind_bucket *)sk->prev;
- BUG_TRAP(sk->prev!=NULL);
- if ((tw->bind_next = tw->tb->owners) != NULL)
- tw->tb->owners->bind_pprev = &tw->bind_next;
- tw->tb->owners = (struct sock*)tw;
- tw->bind_pprev = &tw->tb->owners;
- spin_unlock(&bhead->lock);
}

/*


patch 2:
--------

--- linux-2.4.7-10/net/ipv4/tcp_minisocks.c Thu Sep 6 13:11:49 2001
+++ linux-2.4.7-10-mod/net/ipv4/tcp_minisocks.c Mon Mar 4 16:41:35 2002
@@ -75,17 +75,16 @@
/* Disassociate with bind bucket. */
bhead = &tcp_bhash[tcp_bhashfn(tw->num)];
spin_lock(&bhead->lock);
- if ((tb = tw->tb) != NULL) {
- if(tw->bind_next)
- tw->bind_next->bind_pprev = tw->bind_pprev;
- *(tw->bind_pprev) = tw->bind_next;
- tw->tb = NULL;
- if (tb->owners == NULL) {
- if (tb->next)
- tb->next->pprev = tb->pprev;
- *(tb->pprev) = tb->next;
- kmem_cache_free(tcp_bucket_cachep, tb);
- }
+ tb = tw->tb;
+ if(tw->bind_next)
+ tw->bind_next->bind_pprev = tw->bind_pprev;
+ *(tw->bind_pprev) = tw->bind_next;
+ tw->tb = NULL;
+ if (tb->owners == NULL) {
+ if (tb->next)
+ tb->next->pprev = tb->pprev;
+ *(tb->pprev) = tb->next;
+ kmem_cache_free(tcp_bucket_cachep, tb);
}
spin_unlock(&bhead->lock);


--- "David S. Miller" <[email protected]> wrote:
> From: Raghu Angadi <[email protected]>
> Date: Mon, 4 Mar 2002 12:48:32 -0800 (PST)
>
> We have been load testing the kernel with this patch (reverse the
> insertion
> order in __tcp_tw_hashdance(). It seems to work fine till now.
>
> Where was "this patch" posted? I never saw anyone post
> any code, all anyone did was merely describe the change :)


__________________________________________________
Do You Yahoo!?
Yahoo! Sports - sign up for Fantasy Baseball
http://sports.yahoo.com

2002-03-05 04:33:04

by David Miller

[permalink] [raw]
Subject: Re: memory corruption in tcp bind hash buckets on SMP?


Thanks, I've added both of these patches to my tree(s).