LinuxLists.cc - [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

2008-07-17 12:17:26

Subject: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

This allows less strict control of access to the qdisc attached to a
netdev_queue. It is even allowed to enqueue into a qdisc which is
in the process of being destroyed. The RCU handler will toss out
those packets.

We will need this to handle sharing of a qdisc amongst multiple
TX queues. In such a setup the lock has to be shared, so will
be inside of the qdisc itself. At which point the netdev_queue
lock cannot be used to hard synchronize access to the ->qdisc
pointer.

One operation we have to keep inside of qdisc_destroy() is the list
deletion. It is the only piece of state visible after the RCU quiesce
period, so we have to undo it early and under the appropriate locking.

The operations in the RCU handler do not need any looking because the
qdisc tree is no longer visible to anything at that point.

Signed-off-by: David S. Miller <[email protected]>
---
net/sched/sch_generic.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e078c5..082db8a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -545,6 +545,17 @@ EXPORT_SYMBOL(qdisc_reset);
static void __qdisc_destroy(struct rcu_head *head)
{
struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
+ const struct Qdisc_ops *ops = qdisc->ops;
+
+ gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+ if (ops->reset)
+ ops->reset(qdisc);
+ if (ops->destroy)
+ ops->destroy(qdisc);
+
+ module_put(ops->owner);
+ dev_put(qdisc_dev(qdisc));
+
kfree((char *) qdisc - qdisc->padded);
}

@@ -552,21 +563,12 @@ static void __qdisc_destroy(struct rcu_head *head)

void qdisc_destroy(struct Qdisc *qdisc)
{
- const struct Qdisc_ops *ops = qdisc->ops;
-
if (qdisc->flags & TCQ_F_BUILTIN ||
!atomic_dec_and_test(&qdisc->refcnt))
return;

list_del(&qdisc->list);
- gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
- if (ops->reset)
- ops->reset(qdisc);
- if (ops->destroy)
- ops->destroy(qdisc);

- module_put(ops->owner);
- dev_put(qdisc_dev(qdisc));
call_rcu(&qdisc->q_rcu, __qdisc_destroy);
}
EXPORT_SYMBOL(qdisc_destroy);
--
1.5.6.2.255.gbed62

2008-07-21 16:43:18

by Herbert Xu

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

On Mon, Jul 21, 2008 at 09:25:56AM -0700, David Miller wrote:
>
> Where are these places they are going to "jump all over"? :-)

Well consider the case where you have 4 queues, but a large number
of flows per second (>= 1000). No matter how good your hash is,
there is just no way of squeezing 1000 flows into 4 queues without
getting loads of collisions :)

So let's assume that these flows have been distributed uniformly
by both the RX hash and the TX hash such that each queue is handling
~250 flows. If the TX hash does not match the result produced by
the RX hash, you're going to get a hell lot of contention once you
get into the NIC driver on the TX side.

This is because for NICs like the ones from Intel ones you have to
protect the TX queue accesses so that only one CPU touches a given
queue at any point in time.

The end result is either the driver being bogged down by lock or
TX queue contention, or the mid-layer will have to redistribute
skb's to the right CPUs in which case the synchronisation cost is
simply moved over there.

> If the TX hash is good enough (current one certainly isn't and I will
> work on fixing that), it is likely to spread the accesses enough that
> there won't be many collisions to matter.

I agree that what you've got here makes total sense for a host.
But I think there is room for something different for routers.

> We could provide the option, but it is so dangerous and I also see no
> real tangible benfit from it.

The benefit as far as I can see is that this would allow a packet's
entire journey through Linux to stay on exactly one CPU. There will
be zero memory written by multiple CPUs as far as that packet is
concerned.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-07-21 15:22:35

by Herbert Xu

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

On Mon, Jul 21, 2008 at 08:09:01AM -0700, David Miller wrote:
>
> > Actually you've hit it on the head, as an alternative to TX hashing
> > on the packet content, we need to allow TX queue selection based on
> > the current CPU ID.
>
> This we should avoid, it would allow reordering within a flow.

Not if the RX hashing is flow-based...

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-07-17 13:35:39

by jamal

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

On Thu, 2008-17-07 at 15:03 +0200, Patrick McHardy wrote:

> Actions are also visible
> globally, so this might still be a problem, not sure though since
> they don't refer to their parent (haven't thought about it much yet).

Actions are fine because they are intended to be globaly shared.
[i.e A classifier on ethx with qdiscA:Y (in/egress) can share an action
with classifer on ethy with qdiscB:Z (eg/ingress)].
Like you i need to digest the patches to understand the impact on the
rest but one thing i did notice was the last patch (replacement of
pfifo_fast):
prioritization based on TOS/DSCP (setsockopt) would no longer work, some
user space code may suffer (routing daemons likely). One suggestion to
fix it is to load pfifo qdisc (which does what fifo_fast is attempting)
for drivers that are h/ware multiq capable.

cheers,
jamal

2008-07-17 13:48:26

by Patrick McHardy

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

David Miller wrote:
> From: Patrick McHardy <[email protected]>
> Date: Thu, 17 Jul 2008 15:03:35 +0200
>
>> Still working my way through the patches, but this one caught my
>> eye (we had this before and it caused quite a few problems).
>
> Indeed, it's the most delicate change.
>
> Thanks for the info about all the tricky bits in this area.

One thought that occured to me - we could avoid all the visiblity
issues wrt. dev->qdisc_list by simply getting rid of it :)

If we move the qdisc list from the device to the root Qdisc itself,
it would become invisible automatically as soon as we assign a new
root qdisc to the netdev_queue. Iteration would become slightly
more complicated since we'd have to iterate over all netdev_queues,
but I think it should avoid most of the problems I mentioned
(besides the u32_list thing).

2008-07-21 16:16:28

by Herbert Xu

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

On Mon, Jul 21, 2008 at 08:26:27AM -0700, David Miller wrote:
>
> It is totally unwise to do CPU based TX hashing.

Right I'm not suggesting having this as a default. However,
if you have a finely tuned system (e.g., a router) where you've
pinned all you RX queues to specific CPUs and your local apps
as well then it would make sense to provide this as an alternative.

If this alternative doesn't exist, then unless the RX hash happens
to match the TX hash, for routing at least the packets are going
to jump all over the place which isn't nice.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-07-20 14:58:18

by jamal

[permalink] [raw]

Subject: Re: [PATCH 20/31]: pkt_sched: Perform bulk of qdisc destruction in RCU.

On Fri, 2008-18-07 at 10:10 -0700, Roland Dreier wrote:

> This is definitely true, but it is good to keep in mind that in the near
> future we will start to see things look a little like multiple "virtual
> wires." This is because of new ethernet standards like per-priority
> pause, which makes it possible that one hardware ring on a NIC can
> transmit while another ring is paused (possibly because of congestion
> far off in the network).

Thats essentially what i am arguing for.
[I think some, not all, of the wireless qos schemes also have similar
scheduling].

My understanding of these wired "datacentre/virtualization" schemes you
describe is they are strict prio based. When the low prio "virtual wire"
is contending for the "physical wire" with a higher prio "virtual wire",
the high prio always wins.
We just need to make sure this behavior is also maintained whatever
buffering scheme is used within or above the driver(qdisc level).

cheers,
jamal

2008-07-17 22:24:47