LinuxLists.cc - Re: [RFC] Use RCU for tcp

2004-09-02 05:42:16

Subject: Re: [RFC] Use RCU for tcp_ehash lookup

On Tue, 31 Aug 2004 18:29:41 +0530
Srivatsa Vaddagiri <[email protected]> wrote:

> Some notes on the patch:
>
> - Although readprofile shows improvement in tick count for
> __tcp_v4_lookup_established, I haven't come across any benchmarks that is
> benefited noticeably by the lock-free lookup. I have tried httperf, netperf
> and simple file transfer tests so far.
>
> This could possibly be because the hash table size on the machines I was
> testing was high (tcp_ehash_size = 128K), leading to low contention rate on
> the hash bucket locks. Also because of the fact that lookup could happen in
> parallel to socket input packet processing.
>
> I would be interested to know if anyone has seen high-rate of lock contention
> for hash bucket lock. Such workloads would benefit from the lock-free lookup.

The reason you don't see any improvement is that the ehash table is
pretty write heavy.

I'm not totally against your patch, I just don't think that the TCP established
hash table qualifies as "read heavy" as per what RCU is truly effective for.

> - I presume that one of the reasons for keeping the hash table so big is to
> keep lock contention low (& to reduce the size of hash chains). If the lookup
> is made lock-free, then could the size of the hash table be reduced (without
> adversely impacting performance)?

It's large so that the hash itself is effective, not for locking reasons.

> - Biggest problem I had converting over to RCU was the refcount race between
> sock_put and sock_hold. sock_put might see the refcount go to zero and decide
> to free the object, while on some other CPU, sock_get's are pending against
> the same object. The patch handles the race by deciding to free the object
> only from the RCU callback.

That's exactly what I was concerned about when I saw that you had attempted
this change. It is incredibly important for state changes and updates to
be seen as atomic by the packet input processing engine. It would be illegal
for a cpu running TCP input to see a socket in two tables at the same time
(for example, in the main established area and in the second half for TIME_WAIT
buckets).

If the visibility of the socket is wrong, sockets could be erroneously
be reset during the transition from established to TIME_WAIT state.
Beware!

> - Socket table lookups that happens thr', say /proc/net/tcp or tcpdiag_dump, is
> not lock-free yet. This is because of movement of socket performed in
> __tcp_tw_hashdance, between established half to time-wait half.
> There is a window during this movement, when the same socket is present
> on both time-wait half as well as established half. I felt that it is not
> good to have /proc/net/tcp report two instances of the same socket. Hence
> I resorted to have /proc/net/tcp and tcpdiag_dump doing the lookup using
> a spinlock.

/proc/net/tcp should simply not be used by people, we
have the netlink interface to get socket listings which
actually scales.

Leaving /proc/net/tcp readable on servers with real users is
a DoS waiting to happen.

> Note that __tcp_v4_lookup_established should not be affected by the above
> movement because I found it scans the established half first and _then_ the
> time wait half. So even if the same socket is present in both established half
> and time wait half, __tcp_v4_lookup_established will lookup only one of them
> (& not both).

I hope this is true.

2004-09-02 16:36:49

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [RFC] Use RCU for tcp_ehash lookup

On Wed, Sep 01, 2004 at 10:41:08PM -0700, David S. Miller wrote:
> On Tue, 31 Aug 2004 18:29:41 +0530
> Srivatsa Vaddagiri <[email protected]> wrote:
> > - Biggest problem I had converting over to RCU was the refcount race between
> > sock_put and sock_hold. sock_put might see the refcount go to zero and decide
> > to free the object, while on some other CPU, sock_get's are pending against
> > the same object. The patch handles the race by deciding to free the object
> > only from the RCU callback.
>
> That's exactly what I was concerned about when I saw that you had attempted
> this change. It is incredibly important for state changes and updates to
> be seen as atomic by the packet input processing engine. It would be illegal
> for a cpu running TCP input to see a socket in two tables at the same time
> (for example, in the main established area and in the second half for TIME_WAIT
> buckets).
>
> If the visibility of the socket is wrong, sockets could be erroneously
> be reset during the transition from established to TIME_WAIT state.
> Beware!

If the usages is too write-intensive, then RCU will certainly be less
likely to work well. But there is nothing quite like actually trying
it to see how it works. ;-)

That aside, it -is- possible to make such state changes appear atomic,
even when moving elements from one list to another. One way of doing
this is to atomically replace the element with a "tombstone" element.
Normal pointer writes suffice. The "tombstone" is set up so that searches
for the outgoing element will stall (e.g., spin or sleep, depending
on the environment). The element is moved to its destination list.
At this point, searches for the element in the old list will still
stall, while searches for the element in the new list will succeed.
The tombstone is now marked so that CPUs stall on it now resume, but
indicating failure to find the element in the old list.

Of course, this approach makes writes more expensive than they otherwise
would be, so, again, RCU is best for read-intensive uses. ;-)

The fact that this data structure is not very read-intensive is due
to the fact that short-lived TCP connections are quite common, right?
Or am I missing the finer points of this data structure's workings?

Thanx, Paul

2004-09-03 04:04:55

by Srivatsa Vaddagiri

[permalink] [raw]

Subject: Re: [RFC] Use RCU for tcp_ehash lookup

On Wed, Sep 01, 2004 at 10:41:08PM -0700, David S. Miller wrote:
> The reason you don't see any improvement is that the ehash table is
> pretty write heavy.

In my simple one-file-transfer-test-at-a-time, it should have been read-mostly.
Probably the fact lookups are not serialized wrt input pakcet processing
may have shadowed the benefits of lock-free lookup. However perhaps
if I have multiple file transfer sessions in progress (one per cpu maybe),
then the benefit of reduced time spent in looking up a socket, could be passed
on to threads doing network input.

> I'm not totally against your patch, I just don't think that the TCP established
> hash table qualifies as "read heavy" as per what RCU is truly effective for.

IMHO the benefits of lock-free will be seen only in such scenarios, i.e where
read_lock ended up having to spin-wait on a update to finish. In the lock-free
case, there is no such wait.

> That's exactly what I was concerned about when I saw that you had attempted
> this change. It is incredibly important for state changes and updates to
> be seen as atomic by the packet input processing engine. It would be illegal
> for a cpu running TCP input to see a socket in two tables at the same time
> (for example, in the main established area and in the second half for TIME_WAIT
> buckets).
>
> If the visibility of the socket is wrong, sockets could be erroneously
> be reset during the transition from established to TIME_WAIT state.
> Beware!

This is precisely the reason why I changed the order of movement in
__tcp_tw_hashdance. Earlier, it was removing the socket from the
established half and _then_ adding it to time-wait half. This would
have lead to a window where the socket is neither in established-half
not in the time-wait half. A packet arriving in this window (& doing
lock-free lookup) would have been dropped.

Hence I reversed the order of movement to add in time-wait first
before removing from established half.

> > Note that __tcp_v4_lookup_established should not be affected by the above
> > movement because I found it scans the established half first and _then_ the
> > time wait half. So even if the same socket is present in both established half
> > and time wait half, __tcp_v4_lookup_established will lookup only one of them
> > (& not both).
>
> I hope this is true.

AFAICS it is true! If __tcp_v4_lookup_established finds it in the established
half, it does no further lookup in the time-wait half.

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017