This allows less strict control of access to the qdisc attached to a
netdev_queue. It is even allowed to enqueue into a qdisc which is
in the process of being destroyed. The RCU handler will toss out
those packets.
We will need this to handle sharing of a qdisc amongst multiple
TX queues. In such a setup the lock has to be shared, so will
be inside of the qdisc itself. At which point the netdev_queue
lock cannot be used to hard synchronize access to the ->qdisc
pointer.
One operation we have to keep inside of qdisc_destroy() is the list
deletion. It is the only piece of state visible after the RCU quiesce
period, so we have to undo it early and under the appropriate locking.
The operations in the RCU handler do not need any locking because the
qdisc tree is no longer visible to anything at that point.
Signed-off-by: David S. Miller <[email protected]>
---
net/sched/sch_generic.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e078c5..082db8a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -545,6 +545,17 @@ EXPORT_SYMBOL(qdisc_reset);
static void __qdisc_destroy(struct rcu_head *head)
{
struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
+ const struct Qdisc_ops *ops = qdisc->ops;
+
+ gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+ if (ops->reset)
+ ops->reset(qdisc);
+ if (ops->destroy)
+ ops->destroy(qdisc);
+
+ module_put(ops->owner);
+ dev_put(qdisc_dev(qdisc));
+
kfree((char *) qdisc - qdisc->padded);
}
@@ -552,21 +563,12 @@ static void __qdisc_destroy(struct rcu_head *head)
void qdisc_destroy(struct Qdisc *qdisc)
{
- const struct Qdisc_ops *ops = qdisc->ops;
-
if (qdisc->flags & TCQ_F_BUILTIN ||
!atomic_dec_and_test(&qdisc->refcnt))
return;
list_del(&qdisc->list);
- gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
- if (ops->reset)
- ops->reset(qdisc);
- if (ops->destroy)
- ops->destroy(qdisc);
- module_put(ops->owner);
- dev_put(qdisc_dev(qdisc));
call_rcu(&qdisc->q_rcu, __qdisc_destroy);
}
EXPORT_SYMBOL(qdisc_destroy);
--
1.5.6.2.255.gbed62
On Mon, Jul 21, 2008 at 09:25:56AM -0700, David Miller wrote:
>
> Where are these places they are going to "jump all over"? :-)
Well consider the case where you have 4 queues, but a large number
of flows per second (>= 1000). No matter how good your hash is,
there is just no way of squeezing 1000 flows into 4 queues without
getting loads of collisions :)
So let's assume that these flows have been distributed uniformly
by both the RX hash and the TX hash such that each queue is handling
~250 flows. If the TX hash does not match the result produced by
the RX hash, you're going to get a hell of a lot of contention once you
get into the NIC driver on the TX side.
This is because for NICs like the Intel ones, you have to protect
the TX queue accesses so that only one CPU touches a given queue at
any point in time.
The end result is either the driver being bogged down by lock or
TX queue contention, or the mid-layer having to redistribute
skb's to the right CPUs, in which case the synchronisation cost is
simply moved over there.
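To make that concrete, here is a toy user-space simulation of the point
(not kernel code; the multipliers and queue count are arbitrary). Two
unrelated hashes stand in for the NIC's RX hash and the stack's TX hash,
spreading 1000 flows over 4 queues; the table printed at the end shows
each TX queue taking packets from several different "RX CPUs", which is
exactly where the TX lock starts bouncing between processors.

#include <stdio.h>
#include <stdint.h>

#define NR_QUEUES 4
#define NR_FLOWS  1000

static unsigned pick_queue(uint32_t flow, uint32_t mult)
{
	/* Fibonacci-style hash: top two bits select one of 4 queues */
	return (uint32_t)(flow * mult) >> 30;
}

int main(void)
{
	static int feeders[NR_QUEUES][NR_QUEUES];

	for (uint32_t flow = 0; flow < NR_FLOWS; flow++) {
		unsigned rx_cpu = pick_queue(flow, 2654435761u); /* RX queue == CPU */
		unsigned tx_q   = pick_queue(flow, 2246822519u); /* unrelated TX hash */
		feeders[tx_q][rx_cpu]++;
	}

	for (int t = 0; t < NR_QUEUES; t++) {
		printf("TX queue %d fed by:", t);
		for (int r = 0; r < NR_QUEUES; r++)
			printf("  cpu%d=%d", r, feeders[t][r]);
		printf("\n");
	}
	return 0;
}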
> If the TX hash is good enough (current one certainly isn't and I will
> work on fixing that), it is likely to spread the accesses enough that
> there won't be many collisions to matter.
I agree that what you've got here makes total sense for a host.
But I think there is room for something different for routers.
> We could provide the option, but it is so dangerous and I also see no
> real tangible benefit from it.
The benefit as far as I can see is that this would allow a packet's
entire journey through Linux to stay on exactly one CPU. There will
be zero memory written by multiple CPUs as far as that packet is
concerned.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jul 21, 2008 at 08:09:01AM -0700, David Miller wrote:
>
> > Actually you've hit it on the head, as an alternative to TX hashing
> > on the packet content, we need to allow TX queue selection based on
> > the current CPU ID.
>
> This we should avoid, it would allow reordering within a flow.
Not if the RX hashing is flow-based...
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Thu, 2008-17-07 at 15:03 +0200, Patrick McHardy wrote:
> Actions are also visible
> globally, so this might still be a problem, not sure though since
> they don't refer to their parent (haven't thought about it much yet).
Actions are fine because they are intended to be globally shared.
[i.e., a classifier on ethx with qdiscA:Y (in/egress) can share an action
with a classifier on ethy with qdiscB:Z (eg/ingress)].
Like you, I need to digest the patches to understand the impact on the
rest, but one thing I did notice was the last patch (replacement of
pfifo_fast):
prioritization based on TOS/DSCP (setsockopt) would no longer work, and some
user space code may suffer (routing daemons likely). One suggestion to
fix it is to load a pfifo qdisc (which does what fifo_fast is attempting)
for drivers that are h/ware multiq capable.
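For reference, this is roughly the behaviour that gets lost: pfifo_fast
maps skb->priority (which the stack derives from TOS/DSCP) into one of
three bands and always drains band 0 before band 1 before band 2. The
table below is paraphrased from sch_generic.c; treat it as an
illustrative sketch, not the authoritative source.

#include <stdio.h>

#define TC_PRIO_MAX 15

/* band 0 is dequeued first, then band 1, then band 2 */
static const unsigned char prio2band[TC_PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

static unsigned int pfifo_fast_band(unsigned int skb_priority)
{
	return prio2band[skb_priority & TC_PRIO_MAX];
}

int main(void)
{
	/* e.g. TC_PRIO_INTERACTIVE (6, from IPTOS_LOWDELAY) lands in band 0 */
	printf("prio 6 -> band %u, prio 0 -> band %u\n",
	       pfifo_fast_band(6), pfifo_fast_band(0));
	return 0;
}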
cheers,
jamal
David Miller wrote:
> From: Patrick McHardy <[email protected]>
> Date: Thu, 17 Jul 2008 15:03:35 +0200
>
>> Still working my way through the patches, but this one caught my
>> eye (we had this before and it caused quite a few problems).
>
> Indeed, it's the most delicate change.
>
> Thanks for the info about all the tricky bits in this area.
One thought that occurred to me - we could avoid all the visibility
issues wrt. dev->qdisc_list by simply getting rid of it :)
If we move the qdisc list from the device to the root Qdisc itself,
it would become invisible automatically as soon as we assign a new
root qdisc to the netdev_queue. Iteration would become slightly
more complicated since we'd have to iterate over all netdev_queues,
but I think it should avoid most of the problems I mentioned
(besides the u32_list thing).
On Mon, Jul 21, 2008 at 08:26:27AM -0700, David Miller wrote:
>
> It is totally unwise to do CPU based TX hashing.
Right I'm not suggesting having this as a default. However,
if you have a finely tuned system (e.g., a router) where you've
pinned all you RX queues to specific CPUs and your local apps
as well then it would make sense to provide this as an alternative.
If this alternative doesn't exist, then unless the RX hash happens
to match the TX hash, for routing at least the packets are going
to jump all over the place which isn't nice.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, 2008-18-07 at 10:10 -0700, Roland Dreier wrote:
> This is definitely true, but it is good to keep in mind that in the near
> future we will start to see things look a little like multiple "virtual
> wires." This is because of new ethernet standards like per-priority
> pause, which makes it possible that one hardware ring on a NIC can
> transmit while another ring is paused (possibly because of congestion
> far off in the network).
That's essentially what I am arguing for.
[I think some, not all, of the wireless qos schemes also have similar
scheduling].
My understanding of these wired "datacentre/virtualization" schemes you
describe is they are strict prio based. When the low prio "virtual wire"
is contending for the "physical wire" with a higher prio "virtual wire",
the high prio always wins.
We just need to make sure this behavior is also maintained whatever
buffering scheme is used within or above the driver (qdisc level).
cheers,
jamal
From: Patrick McHardy <[email protected]>
Date: Thu, 17 Jul 2008 16:02:20 +0200
> jamal wrote:
> > On Thu, 2008-17-07 at 15:03 +0200, Patrick McHardy wrote:
> >
> > Like you i need to digest the patches to understand the impact on the
> > rest but one thing i did notice was the last patch (replacement of
> > pfifo_fast):
> > prioritization based on TOS/DSCP (setsockopt) would no longer work, some
> > user space code may suffer (routing daemons likely). One suggestion to
> > fix it is to load pfifo qdisc (which does what fifo_fast is attempting)
> > for drivers that are h/ware multiq capable.
>
> That would perform prioritization within each qdisc, the individual
> qdiscs would still be transmitted using separate HW queues though.
I think from certain perspectives it frankly doesn't matter.
It's not like the skb->priority field lets the SKB bypass the packets
already in the TX ring of the chip with a lower priority.
It is true that, once the TX ring is full, the skb->priority thus
begins to have an influence on which packets are moved from the
qdisc to the TX ring of the device.
However, I wonder if we're so sure that we want to give normal users
that kind of powers. Let's say for example that you set the highest
priority possible in the TOS socket option, and you do this for a ton
of UDP sockets, and you just blast packets out as fast as possible.
This backlogs the device TX ring, and if done effectively enough could
keep other sockets blocked out of the device completely.
Are we really really sure it's OK to let users do this? :)
To me, as a default, I think TOS and DSCP really means just on-wire
priority.
If we absolutely want to, we can keep the old pfifo_fast around and use
it (shared on multiq) if a certain sysctl knob is set.
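For context, "setting the TOS socket option" is just this from user
space (a minimal sketch, error handling omitted). The stack maps the
TOS value into skb->priority, which is what the old pfifo_fast bands
keyed off.

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>

int make_low_delay_udp_socket(void)
{
	int tos = IPTOS_LOWDELAY;	/* 0x10: historically mapped to the top band */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd >= 0)
		setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
	return fd;
}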
David Miller wrote:
> From: Patrick McHardy <[email protected]>
> Date: Thu, 17 Jul 2008 15:48:22 +0200
>
>> One thought that occurred to me - we could avoid all the visibility
>> issues wrt. dev->qdisc_list by simply getting rid of it :)
>>
>> If we move the qdisc list from the device to the root Qdisc itself,
>> it would become invisible automatically as soon as we assign a new
>> root qdisc to the netdev_queue. Iteration would become slightly
>> more complicated since we'd have to iterate over all netdev_queues,
>> but I think it should avoid most of the problems I mentioned
>> (besides the u32_list thing).
>
> What might make sense is to have a special Qdisc_root structure which
> is simply:
>
> struct Qdisc_root {
> struct Qdisc qd;
> struct list_head qdisc_list;
> };
>
> Everything about tree level synchronization would be type explicit.
Device level grafting is also explicit, so that looks like
a clean way.
> Yes, as you say, the qdisc iteration would get slightly ugly. But
> that doesn't seem to be a huge deal.
>
> But it seems a clean solution to the child qdisc visibility problem.
>
> About u32_list, that thing definitely needs some spinlock. The
> consultation of that list, and refcount mods, only occur during config
> operations. So it's not like we have to grab this lock in the data
> paths.
>
> If we really want to sweep this problem under the rug, there is another
> way. Have the qdisc_destroy() RCU handler kick off a workqueue, and
> grab the RTNL semaphore there during the final destruction calls. :-)
That would be the safe way. The RCU destruction used to cause us
bugs for at least two years, but I actually believe the Qdisc_root
thing will work :)
On Mon, Jul 21, 2008 at 09:56:25AM -0400, jamal wrote:
>
> Ok - You may be able to pull it off if you have the exact same hashing on
> hardware RX as on the transmit stateless filtering.
> I can see a traffic pattern that could be cooked to give some good
> numbers in such a scenario;->
Actually you've hit it on the head, as an alternative to TX hashing
on the packet content, we need to allow TX queue selection based on
the current CPU ID.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jul 21, 2008 at 09:08:44AM -0400, jamal wrote:
>
> The one thing i am unsure of still:
> I think it would be cleaner to just stop a single queue (instead of all)
> when one hardware queue fills up. i.e if there is no congestion on the
> other hardware queues, packets should continue to be fed to their
> hardware queues and not be buffered at qdisc level.
Right you don't necessarily have to stop all queues but you do
need to direct all packets into the qdisc.
> Parallelization would work if you can get X CPUs to send to X hardware
> queues concurrently. Feasible in a static host setup like a virtualization
> environment where you can tie a VM to a CPU. Not very feasible in
> routing where you are driven to a random hardware tx queue by arriving
> packets.
It should work just fine for routing assuming your card does
multi-queue RX.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: jamal <[email protected]>
Date: Fri, 18 Jul 2008 09:27:21 -0400
> On Fri, 2008-18-07 at 09:10 -0400, jamal wrote:
>
> > wire still even on multiple rings;->
> > If Woz (really) showed up at 9am and the Broussards at 3 am[1] on that
> > single (congestion-buffering) FIFO waiting for the shop/wire to open up,
> > then Woz should jump the queue (if he deserves it) when shop opens at
> > 10am.
>
> Sorry, the url of Woz allegedly jumping the queue:
> http://news.cnet.com/8301-1023_3-9989823-93.html?hhTest=1%C3%A2%C2%88%
> C2%82=rss&subj=news&tag=2547-1_3-0-20
The fundamental issue is what we believe qdiscs schedule: do they
schedule a device, or do they schedule what their namesake implies,
"queues"?
Logically once we have multiple queues, we schedule queues.
Therefore what probably makes sense is that for mostly stateless
priority queueing such as pfifo_fast, doing prioritization at the
queue level is OK.
But where non-trivial classification occurs, we have to either:
1) Make the queue selection match what the classifiers would do
exactly.
OR
2) Point all the queues at a single device global qdisc.
What we have now implements #2. Later on we can try to do something
as sophisticated as #1.
On Mon, Jul 21, 2008 at 09:51:24AM -0700, David Miller wrote:
>
> How so? If the TX hash is well distributed, which it should be,
> it is at least going to approximate the distribution provided by
> the RX hash.
This is a matter of probabilities :) In general, if the TX hash
and the RX hash are completely unrelated, then the end result of
the hash distribution should be independent of each other (independent
in the probabilistic sense). That is, for the flows which have
been RX hashed into one queue, they should be hashed on average
across all queues by the TX hash. Conversely, those that have
been hashed into one TX queue would be distributed across all
RX queues.
Now if you've bound each RX queue to a specific CPU, then this
means for a given TX queue, its packets are going to come from
all the CPUs (well all those involved in network reception anyway).
As each TX queue is designed to be accessed by only one CPU at
a time, somebody somewhere has to pay for all this synchronisation :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: "Tomas Winkler" <[email protected]>
Date: Sun, 20 Jul 2008 20:34:18 +0300
> On Sun, Jul 20, 2008 at 8:25 PM, David Miller <[email protected]> wrote:
> > From: jamal <[email protected]>
> > Date: Sun, 20 Jul 2008 11:16:03 -0400
> >
> >> IMO, in the case of multiple hardware queues per physical wire,
> >> and such a netdevice already has a built-in hardware scheduler (they all
> >> seem to have this feature) then if we can feed the hardware queues
> >> directly, theres no need for any intermediate buffer(s).
> >> In such a case, to compare with qdisc arch, its like the root qdisc is
> >> in hardware.
> >
> > They tend to implement round-robin or some similar fairness algorithm
> > amongst the queues, with zero concern about packet priorities.
> >
> > It really is just like a bunch of queues to the physical layer,
> > fairly shared.
> >
> > These things are built for parallelization, not prioritization.
>
> Except wireless where HW has prioritizing scheduler per physical non-wire.
> Tomas
I know that. We're talking about multiqueue ethernet NICs.
On Mon, 2008-21-07 at 19:58 +0800, Herbert Xu wrote:
> I think I get you now. You're suggesting that we essentially
> do what Dave has right now in the non-contending case, i.e.,
> bypassing the qdisc so we get fully parallel processing until
> one of the hardware queues seizes up.
Yes. That way there is no need for intermediate queueing. As it is
now, packets first get queued to the qdisc, then we dequeue and send to the
driver even when the driver would be happy to take it. That approach is fine
if you want to support non-work-conserving schedulers on single-hwqueue
hardware.
> At that point you'd stop all queues and make every packet go
> through the software qdisc to ensure ordering. This continues
> until all queues have vacancies again.
I always visualize these as a single netdevice per hardware tx queue.
If I understood correctly, ordering is taken care of already in the
current patches because the stateless filter selects a hardware queue.
Dave has those queues sitting at the qdisc level (as pfifo) - which seems
better in retrospect (than what I was thinking, that they should sit in
the driver) because one could decide they want to shape packets in the
future on a per-virtual-customer-sharing-a-virtual-wire basis and attach an
HTB instead.
The one thing i am unsure of still:
I think it would be cleaner to just stop a single queue (instead of all)
when one hardware queue fills up. i.e if there is no congestion on the
other hardware queues, packets should continue to be fed to their
hardware queues and not be buffered at qdisc level.
> If this is what you're suggesting, then I think that will offer
> pretty much the same behaviour as what we've got, while still
> offering at least some (perhaps even most, but that is debatable)
> of the benefits of multi-queue.
>
> At this point I don't think this is something that we need right
> now, but it would be good to make sure that the architecture
> allows such a thing to be implemented in future.
I think it is a pretty good first start (I am a lot more optimistic to
be honest).
Parallelization would work if you can get X CPUs to send to X hardware
queues concurrently. Feasible in a static host setup like a virtualization
environment where you can tie a VM to a CPU. Not very feasible in
routing where you are driven to a random hardware tx queue by arriving
packets.
cheers,
jamal
On Sun, 2008-20-07 at 16:59 -0700, David Miller wrote:
> Every time this topic comes up, you insist on them having to match.
> And I have no idea why.
I don't insist on them matching, just on correctness, i.e., if you say you
have RR, then the scheduling needs to meet those requirements, not an
estimate - that's all.
> The problem is that the bottleneck is the qdisc itself since all those
> CPUs synchronize on its lock. We can't have a shared qdisc for the
> device and get full parallelization.
>
> That's why we're having one qdisc per TX queue, so that they all don't
> bunch up on the qdisc lock.
That last sentence I have no issues with - it is what I thought wasn't
happening;-> I misunderstood it to be a single fifo shared by all
hardware tx queues from the beginning (otherwise I wouldn't be posting).
We are in sync I think, a single pfifo per TX queue is the way to go. I
was suggesting it goes in the driver, but this is cleaner: in the
future, one could actually replace the pfifo with another qdisc,
since the single virtual wire becomes equivalent to a single virtual
netdevice.
> Otherwise, there is zero point in all of these TX multiqueue features
> in the hardware if we can't parallelize things fully.
Parallelization is achievable in the ideal case.
cheers,
jamal
On Mon, 2008-21-07 at 11:17 +0800, Herbert Xu wrote:
> Unfortunately the scenario that I wrote this for requires frequent
> addition/removal.
Aha - makes absolute sense then;->
> Only if you also want to share it :) In the end I patched it to
> not share it which is much easier.
I am trying to visualize: if you don't share, you must have 256K copies
then? Assuming also you have a fast lookup, since that was the design intent.
> Of course if you're volunteering to write the dynamic hash table
> for actions then I'd happily switch back to sharing :)
It is a unique need like you said earlier (and would require a
medium-size surgery).
How about this: if a second user shows up with such a need I could do
it.
If you knew you had 256K entries, then you could make
NAT_TAB_MASK (256K - 1) and you would get an expected O(1) lookup if
you don't specify indices.
I know you've patched it already - haven't quite understood how, and your
current solution may be better - but one other way is to have a Kconfig
option which lets the user set the size of the nat hash table at
kernel compile time. So then a change
of the sort:
#ifdef CONFIG_HASH_SIZE
#define NAT_TAB_MASK (CONFIG_HASH_SIZE - 1)
#else
#define NAT_TAB_MASK 15
#endif
What do you think?
cheers,
jamal
From: Herbert Xu <[email protected]>
Date: Mon, 21 Jul 2008 21:58:46 +0800
> On Mon, Jul 21, 2008 at 09:56:25AM -0400, jamal wrote:
> >
> > Ok - You may be able to pull it off if you have the exact same hashing on
> > hardware RX as on the transmit stateless filtering.
> > I can see a traffic pattern that could be cooked to give some good
> > numbers in such a scenario;->
>
> Actually you've hit it on the head, as an alternative to TX hashing
> on the packet content, we need to allow TX queue selection based on
> the current CPU ID.
This we should avoid, it would allow reordering within a flow.
From: jamal <[email protected]>
Date: Sun, 20 Jul 2008 11:16:03 -0400
> IMO, in the case of multiple hardware queues per physical wire,
> and such a netdevice already has a built-in hardware scheduler (they all
> seem to have this feature) then if we can feed the hardware queues
> directly, theres no need for any intermediate buffer(s).
> In such a case, to compare with qdisc arch, its like the root qdisc is
> in hardware.
They tend to implement round-robin or some similar fairness algorithm
amongst the queues, with zero concern about packet priorities.
It really is just like a bunch of queues to the physical layer,
fairly shared.
These things are built for parallelization, not prioritization.
On Sun, 2008-20-07 at 22:20 +0800, Herbert Xu wrote:
> Not all actions :) That nat action for example wasn't intended to
> be shared at all. In fact I still need to submit a patch to make
> it skip the shared hash as otherwise it simply won't scale as the
> number of nat actions increases (e.g., to 256K).
True, sharing in the case of nat will cause scaling challenges because
there is per-action locking. So you don't want to share in that case.
Let me clarify the global "sharedness" of actions, because I don't think
there is an issue:
All actions (on a per-type hash table basis) have an index.
You create filter rule X and specify action nat.
You may specify the index of the action when you create the filter X.
If you then create another filter rule Y, also using the same action
index, then that nat action is shared between rule X and rule Y[1].
If you dont specify the index a new nat action is created.
So in essence, if you create 256K rules each with an action, then
as long as you don't specify the action index, you should be fine
because none will be shared.
The only scaling thing I can think of is to try and make the nat action
hash table large to reduce the initial lookup. Other than that, once the
action is bound to a filter, the lookup cost is zero.
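To pin down the lookup-or-create behaviour above, here is a small
compile-and-run sketch. The types and the plain list are invented for
illustration; this is not the kernel's actual per-type action hash code.

#include <stdlib.h>

struct action {
	unsigned int   index;
	unsigned int   refcnt;
	struct action *next;
};

static struct action *actions;          /* stands in for the per-type table */
static unsigned int   last_index;

static struct action *action_lookup(unsigned int index)
{
	for (struct action *a = actions; a; a = a->next)
		if (a->index == index)
			return a;
	return NULL;
}

/* index == 0 means "no index specified by the user" */
struct action *action_get(unsigned int index)
{
	struct action *a = index ? action_lookup(index) : NULL;

	if (a) {                            /* shared between filter rules */
		a->refcnt++;
		return a;
	}
	a = calloc(1, sizeof(*a));          /* otherwise create a fresh one */
	if (!a)
		return NULL;
	a->index  = index ? index : ++last_index;
	a->refcnt = 1;
	a->next   = actions;
	actions   = a;
	return a;
}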
cheers,
jamal
[1] This is useful for two reasons:
a) memory saving purposes: if you don't care that much about performance,
or on a uniprocessor machine, one action would just be sufficient for
many rules.
b) accounting purposes: as you know, qdiscs/filters/actions are
per-device. Over the years, a need has arisen from some users to have
"per system" accounting (refer to the IMQ/IFB approach). E.g., if I wanted
the policer action to account for ingress eth0 and egress eth1 for a
user, I couldn't do it without some acrobatics.
From: jamal <[email protected]>
Date: Sun, 20 Jul 2008 18:32:50 -0400
> pfifo_fast would be a bad choice in that case, but even a pfifo cannot
> guarantee proper RR because it would present packets in a FIFO order
> (for example, the first 10 could go to hardware queue1 and the next to
> hardware queue2).
Jamal, I have to wonder why are you so hung up on matching our
software qdiscs with whatever fairness algorithm the hardware happens
to implement internally for the TX queues?
Every time this topic comes up, you insist on them having to match.
And I have no idea why.
It's largely irrelevant and the TX queue can be viewed purely as a
buffer between us and the device, nearly a black hole we stick packets
into.
> > These things are built for parallelization, not prioritization.
>
> Total parallelization happens in the ideal case. If X CPUs classify
> packets going to X different hardware queues, each CPU grabs only locks
> for that hardware queue.
The problem is that the bottleneck is the qdisc itself since all those
CPUs synchronize on its lock. We can't have a shared qdisc for the
device and get full parallelization.
That's why we're having one qdisc per TX queue, so that they all don't
bunch up on the qdisc lock.
Otherwise, there is zero point in all of these TX multiqueue features
in the hardware if we can't parallelize things fully.
David Miller wrote:
> From: Patrick McHardy <[email protected]>
> Date: Thu, 17 Jul 2008 16:02:20 +0200
>
>> jamal wrote:
>>> prioritization based on TOS/DSCP (setsockopt) would no longer work, some
>>> user space code may suffer (routing daemons likely). One suggestion to
>>> fix it is to load pfifo qdisc (which does what fifo_fast is attempting)
>>> for drivers that are h/ware multiq capable.
>> That would perform prioritization within each qdisc, the individual
>> qdiscs would still be transmitted using separate HW queues though.
>
> I think from certain perspectives it frankly doesn't matter.
>
> It's not like the skb->priority field lets the SKB bypass the packets
> already in the TX ring of the chip with a lower priority.
>
> It is true that, once the TX ring is full, the skb->priority thus
> begins to have an influence on which packets are moved from the
> qdisc to the TX ring of the device.
>
> However, I wonder if we're so sure that we want to give normal users
> that kind of powers. Let's say for example that you set the highest
> priority possible in the TOS socket option, and you do this for a ton
> of UDP sockets, and you just blast packets out as fast as possible.
> This backlogs the device TX ring, and if done effectively enough could
> keep other sockets blocked out of the device completely.
>
> Are we really really sure it's OK to let users do this? :)
>
> To me, as a default, I think TOS and DSCP really means just on-wire
> priority.
>
> If we absolutely want to, we can keep the old pfifo_fast around and use
> it (shared on multiq) if a certain sysctl knob is set.
No, I fully agree that this is too much detail :) It's highly
unlikely that this default behaviour is important on a per
packet level :) I just meant to point out that using a pfifo
is not going to be the same behaviour as previously.
David Miller wrote:
> This allows less strict control of access to the qdisc attached to a
> netdev_queue. It is even allowed to enqueue into a qdisc which is
> in the process of being destroyed. The RCU handler will toss out
> those packets.
>
> We will need this to handle sharing of a qdisc amongst multiple
> TX queues. In such a setup the lock has to be shared, so will
> be inside of the qdisc itself. At which point the netdev_queue
> lock cannot be used to hard synchronize access to the ->qdisc
> pointer.
>
> One operation we have to keep inside of qdisc_destroy() is the list
> deletion. It is the only piece of state visible after the RCU quiesce
> period, so we have to undo it early and under the appropriate locking.
>
> The operations in the RCU handler do not need any locking because the
> qdisc tree is no longer visible to anything at that point.
Still working my way through the patches, but this one caught my
eye (we had this before and it caused quite a few problems).
One of the problems is that only the uppermost qdisc is destroyed
immediately, child qdiscs are still visible on qdisc_list and are
removed without any locking from the RCU callback. There are also
visibility issues for classifiers and actions deeper down in the
hierarchy.
The previous way to work around this was quite ugly. qdisc_destroy()
walked the entire hierarchy to unlink inner classes immediately
from the qdisc_list (commit 85670cc1f changed it to what we do now).
That fixed visibility issues for everything visible only through
qdiscs (child qdiscs and classifiers). Actions are also visible
globally, so this might still be a problem, not sure though since
they don't refer to their parent (haven't thought about it much yet).
Another problem we had earlier with this was that qdiscs previously
assumed changes (destruction) would only happen in process context
and thus didn't disable BHs when taking a read_lock for walking the
hierarchy (deadlocking with write_lock in BH context). This seems to
be handled correctly in your tree by always disabling BHs.
The remaining problem is data that was previously only used and
modified under the RTNL (u32_list is one example). Modifications
during destruction now need protection against concurrent use in
process context.
I still need to get a better understanding of how things work now,
so I won't suggest a fix until then :)
From: Herbert Xu <[email protected]>
Date: Tue, 22 Jul 2008 00:43:06 +0800
> So let's assume that these flows have been distributed uniformly
> by both the RX hash and the TX hash such that each queue is handling
> ~250 flows. If the TX hash does not match the result produced by
> the RX hash, you're going to get a hell of a lot of contention once you
> get into the NIC driver on the TX side.
How so? If the TX hash is well distributed, which it should be,
it is at least going to approximate the distribution provided by
the RX hash.
> The benefit as far as I can see is that this would allow a packet's
> entire journey through Linux to stay on exactly one CPU. There will
> be zero memory written by multiple CPUs as far as that packet is
> concerned.
So will the existing one, enough of the time for it to matter,
and yes even on a router or firewall.
At least this is my belief :-)
From: Patrick McHardy <[email protected]>
Date: Thu, 17 Jul 2008 15:03:35 +0200
> The remaining problem is data that was previously only used and
> modified under the RTNL (u32_list is one example). Modifications
> during destruction now need protection against concurrent use in
> process context.
I took a closer look at u32_list.
Its members seem to be keyed by qdisc :-)
All this thing is doing is making up for the lack of a classifier
private pointer in the Qdisc.
It's tough because classifiers of different types can be queued
up to a qdisc, so it's not like we can add one traffic classifier
private pointer to struct Qdisc and be done with it.
However, I could not find anything other than u32 that seems to need
some global state that works like this.
So what I've done is checked in the following change for now to make
u32 delete operation safe entirely within a qdisc tree's namespace.
pkt_sched: Get rid of u32_list.
The u32_list is just an indirect way of maintaining a reference
to a U32 node on a per-qdisc basis.
Just add an explicit node pointer for u32 to struct Qdisc and do
away with this global list.
Signed-off-by: David S. Miller <[email protected]>
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 0a158ff..8a44386 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -56,6 +56,8 @@ struct Qdisc
int (*reshape_fail)(struct sk_buff *skb,
struct Qdisc *q);
+ void *u32_node;
+
/* This field is deprecated, but it is still used by CBQ
* and it will live until better solution will be invented.
*/
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 4d75544..527db25 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -75,7 +75,6 @@ struct tc_u_hnode
struct tc_u_common
{
- struct tc_u_common *next;
struct tc_u_hnode *hlist;
struct Qdisc *q;
int refcnt;
@@ -87,8 +86,6 @@ static const struct tcf_ext_map u32_ext_map = {
.police = TCA_U32_POLICE
};
-static struct tc_u_common *u32_list;
-
static __inline__ unsigned u32_hash_fold(__be32 key, struct tc_u32_sel *sel, u8 fshift)
{
unsigned h = ntohl(key & sel->hmask)>>fshift;
@@ -287,9 +284,7 @@ static int u32_init(struct tcf_proto *tp)
struct tc_u_hnode *root_ht;
struct tc_u_common *tp_c;
- for (tp_c = u32_list; tp_c; tp_c = tp_c->next)
- if (tp_c->q == tp->q)
- break;
+ tp_c = tp->q->u32_node;
root_ht = kzalloc(sizeof(*root_ht), GFP_KERNEL);
if (root_ht == NULL)
@@ -307,8 +302,7 @@ static int u32_init(struct tcf_proto *tp)
return -ENOBUFS;
}
tp_c->q = tp->q;
- tp_c->next = u32_list;
- u32_list = tp_c;
+ tp->q->u32_node = tp_c;
}
tp_c->refcnt++;
@@ -402,14 +396,8 @@ static void u32_destroy(struct tcf_proto *tp)
if (--tp_c->refcnt == 0) {
struct tc_u_hnode *ht;
- struct tc_u_common **tp_cp;
- for (tp_cp = &u32_list; *tp_cp; tp_cp = &(*tp_cp)->next) {
- if (*tp_cp == tp_c) {
- *tp_cp = tp_c->next;
- break;
- }
- }
+ tp->q->u32_node = NULL;
for (ht = tp_c->hlist; ht; ht = ht->next) {
ht->refcnt--;
From: Herbert Xu <[email protected]>
Date: Mon, 21 Jul 2008 19:58:33 +0800
> I think I get you now. You're suggesting that we essentially
> do what Dave has right now in the non-contending case, i.e.,
> bypassing the qdisc so we get fully parallel processing until
> one of the hardware queues seizes up.
>
> At that point you'd stop all queues and make every packet go
> through the software qdisc to ensure ordering. This continues
> until all queues have vacancies again.
>
> If this is what you're suggesting, then I think that will offer
> pretty much the same behaviour as what we've got, while still
> offering at least some (perhaps even most, but that is debatable)
> of the benefits of multi-queue.
>
> At this point I don't think this is something that we need right
> now, but it would be good to make sure that the architecture
> allows such a thing to be implemented in future.
Doing something like the noqueue_qdisc (bypassing the qdisc entirely)
is very attractive because it eliminates the qdisc lock. We're only
left with the TX lock.
This whole idea of doing an optimized send to the device when the
TX queue has space is indeed tempting.
But we can't do it for qdiscs that measure rates, enforce limits, etc.
And if the device kicks back at us with an error, we have to
perform the normal ->enqueue() path.
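A minimal sketch of that idea and its fallback path, with invented types
(this is not the real netdev/qdisc interfaces, just the shape of the
logic being discussed):

#include <stdio.h>

struct fake_skb { int id; };

struct fake_txq {
	int stopped;                               /* TX ring full */
	int (*hard_start_xmit)(struct fake_skb *); /* driver transmit hook */
};

struct fake_qdisc {
	int trivial;                               /* pfifo-like, no rate state */
	int (*enqueue)(struct fake_skb *, struct fake_qdisc *);
};

static int dev_xmit_sketch(struct fake_skb *skb, struct fake_txq *txq,
			   struct fake_qdisc *q)
{
	if (q->trivial && !txq->stopped) {
		/* optimistic direct send: no qdisc involvement at all */
		if (txq->hard_start_xmit(skb) == 0)
			return 0;
		/* device kicked back (ring filled up meanwhile) ... */
	}
	/* ... so fall back to the normal path and let the qdisc buffer it */
	return q->enqueue(skb, q);
}

static int stub_xmit(struct fake_skb *skb)
{
	printf("sent skb %d directly\n", skb->id);
	return 0;
}

static int stub_enqueue(struct fake_skb *skb, struct fake_qdisc *q)
{
	(void)q;
	printf("queued skb %d in the qdisc\n", skb->id);
	return 0;
}

int main(void)
{
	struct fake_txq txq = { .stopped = 0, .hard_start_xmit = stub_xmit };
	struct fake_qdisc q = { .trivial = 1, .enqueue = stub_enqueue };
	struct fake_skb a = { 1 }, b = { 2 };

	dev_xmit_sketch(&a, &txq, &q);	/* ring has room: bypasses the qdisc */
	txq.stopped = 1;
	dev_xmit_sketch(&b, &txq, &q);	/* ring full: goes through ->enqueue() */
	return 0;
}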
On Sun, Jul 20, 2008 at 11:35:19AM -0400, jamal wrote:
>
> All actions (on a per-type hash table basis) have an index.
> You create filter rule X and specify action nat.
> You may specify the index of the action when you create the filter X.
> If you then create another filter rule Y, also using the same action
> index, then that nat action is shared between rule X and rule Y[1].
This is exactly what I want to get rid of because otherwise even
if no index was specified we'll still do a hash insertion which
simply falls apart with a small hash table. Using a large hash
table on the other hand is bad for people who only have a few rules.
> [1] This is useful for two reasons:
> a) memory saving purposes: if you don't care that much about performance,
> or on a uniprocessor machine, one action would just be sufficient for
> many rules.
> b) accounting purposes: as you know, qdiscs/filters/actions are
> per-device. Over the years, a need has arisen from some users to have
> "per system" accounting (refer to the IMQ/IFB approach). E.g., if I wanted
> the policer action to account for ingress eth0 and egress eth1 for a
> user, I couldn't do it without some acrobatics.
We could do a dynamic table but so far I'm not convinced that
it's worth anybody's effort to implement :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Herbert Xu <[email protected]>
Date: Mon, 21 Jul 2008 23:22:20 +0800
> On Mon, Jul 21, 2008 at 08:09:01AM -0700, David Miller wrote:
> >
> > > Actually you've hit it on the head, as an alternative to TX hashing
> > > on the packet content, we need to allow TX queue selection based on
> > > the current CPU ID.
> >
> > This we should avoid, it would allow reordering within a flow.
>
> Not if the RX hashing is flow-based...
Can you control the process scheduler and where it decides
to run the process running sendmsg() too?
That's the problem.
It is totally unwise to do CPU based TX hashing.
From: Patrick McHardy <[email protected]>
Date: Thu, 17 Jul 2008 15:03:35 +0200
> Still working my way through the patches, but this one caught my
> eye (we had this before and it caused quite a few problems).
Indeed, it's the most delicate change.
Thanks for the info about all the tricky bits in this area.
On Sun, Jul 20, 2008 at 8:25 PM, David Miller <[email protected]> wrote:
> From: jamal <[email protected]>
> Date: Sun, 20 Jul 2008 11:16:03 -0400
>
>> IMO, in the case of multiple hardware queues per physical wire,
>> and such a netdevice already has a built-in hardware scheduler (they all
>> seem to have this feature) then if we can feed the hardware queues
>> directly, theres no need for any intermediate buffer(s).
>> In such a case, to compare with qdisc arch, its like the root qdisc is
>> in hardware.
>
> They tend to implement round-robin or some similar fairness algorithm
> amongst the queues, with zero concern about packet priorities.
>
> It really is just like a bunch of queues to the physical layer,
> fairly shared.
>
> These things are built for parallelization, not prioritization.
Except wireless where HW has prioritizing scheduler per physical non-wire.
Tomas
From: Herbert Xu <[email protected]>
Date: Tue, 22 Jul 2008 00:16:16 +0800
> On Mon, Jul 21, 2008 at 08:26:27AM -0700, David Miller wrote:
> >
> > It is totally unwise to do CPU based TX hashing.
>
> Right I'm not suggesting having this as a default. However,
> if you have a finely tuned system (e.g., a router) where you've
> pinned all you RX queues to specific CPUs and your local apps
> as well then it would make sense to provide this as an alternative.
>
> If this alternative doesn't exist, then unless the RX hash happens
> to match the TX hash, for routing at least the packets are going
> to jump all over the place which isn't nice.
Where are these places they are going to "jump all over"? :-)
You can view the RX and TX queues as roughly independent namespaces.
If the TX hash is good enough (current one certainly isn't and I will
work on fixing that), it is likely to spread the accesses enough that
there won't be many collisions to matter.
If you really think about it, this is even pointless. Packets that
hit RX queue A will always hit TX queue B, and so on and so forth.
Therefore there will be no new level of separation afforded by using
some kind of strict RX to TX mapping.
We could provide the option, but it is so dangerous and I also see no
real tangible benefit from it.
On Mon, Jul 21, 2008 at 07:14:39AM -0400, jamal wrote:
>
> > Only if you also want to share it :) In the end I patched it to
> > not share it which is much easier.
>
> I am trying to visualize: if you dont share, you must have 256K copies
> then? Assuming also you have a fast lookup since that was design intent.
They can't be shared anyway because each of those 256K rules NATs
to a different IP address.
> #ifdef CONFIG_HASH_SIZE
> #define NAT_TAB_MASK (CONFIG_HASH_SIZE - 1)
> #else
> #define NAT_TAB_MASK 15
> #endif
>
> What do you think?
Sorry, I think I'll have to poke my eyes out :)
But yeah if we ever get a generic dynamic hash table implementation
then I'd be happy for act_nat to use that.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, 2008-21-07 at 21:19 +0800, Herbert Xu wrote:
> It should work just fine for routing assuming your card does
> multi-queue RX.
Ok - You may be able to pull it off if you have the exact same hashing on
hardware RX as on the transmit stateless filtering.
I can see a traffic pattern that could be cooked to give some good
numbers in such a scenario;->
cheers,
jamal
On Mon, 2008-21-07 at 19:36 +0800, Herbert Xu wrote:
> Sorry, I think I'll have to poke my eyes out :)
But i still need your eyes to poke at my code Herbert;->
> But yeah if we ever get a generic dynamic hash table implementation
> then I'd be happy for act_nat to use that.
Ok, i have added it to my todo.
cheers,
jamal
On Mon, 2008-21-07 at 08:11 +0800, Herbert Xu wrote:
> This is exactly what I want to get rid of because otherwise even
> if no index was specified we'll still do a hash insertion which
> simply falls apart with a small hash table. Using a large hash
> table on the other hand is bad for people who only have a few rules.
True.
But note: this is only during rule creation - once you create the
rule (user space to kernel path), there is no more hash table reference.
The fast path already has a filter with actions attached, and it is a mere
pointer dereference.
> We could do a dynamic table but so far I'm not convinced that
> it's worth anybody's effort to implement :)
If user<->kernel performance insertion/deletion is important, it is
worth it.
For example:
Dave implemented dynamic hash tables on xfrm (voip setup time with ipsec
is a metric used in the industry in that case). The only operational
problem I had with xfrm was the lack of an upper bound on how large a table
can grow; I would rather user space be told ENOMEM than have it continue to
grow in some cases (I actually implemented a patch which put a stop
after a certain number of sad/spd - but I don't expect hugs if I were to
post it;->).
cheers,
jamal
On Fri, 2008-18-07 at 01:48 +0200, Patrick McHardy wrote:
> David Miller wrote:
> > I think from certain perspectives it frankly doesn't matter.
> >
> > It's not like the skb->priority field lets the SKB bypass the packets
> > already in the TX ring of the chip with a lower priority.
> >
> > It is true that, once the TX ring is full, the skb->priority thus
> > begins to have an influence on which packets are moved from the
> > qdisc to the TX ring of the device.
> >
Indeed QoS is irrelevant unless there is congestion.
The question is whether the packets sitting on the fifo qdisc are being
sorted fairly when congestion kicks in. Remember there is still a single
wire even with multiple rings;->
If Woz (really) showed up at 9am and the Broussards at 3 am[1] on that
single (congestion-buffering) FIFO waiting for the shop/wire to open up,
then Woz should jump the queue (if he deserves it) when shop opens at
10am.
If queues are building up, then by definition you have congestion
somewhere - IOW some resource (wire bandwidth, code-efficiency/cpu,
bus, remote being slow etc) is not keeping up.
I am sorry, I haven't read the patches sufficiently to answer that question,
but I suspect that stashing the packets into different hardware queues
already solves this, since the hardware does whatever scheduling it needs
to on the rings.
> > However, I wonder if we're so sure that we want to give normal users
> > that kind of powers. Let's say for example that you set the highest
> > priority possible in the TOS socket option, and you do this for a ton
> > of UDP sockets, and you just blast packets out as fast as possible.
> > This backlogs the device TX ring, and if done effectively enough could
> > keep other sockets blocked out of the device completely.
> >
> > Are we really really sure it's OK to let users do this? :)
We do today - if it is a concern, one could make the setsockopts
preferential (for example via SELinux or setting caps in the kernel, etc.).
> > To me, as a default, I think TOS and DSCP really means just on-wire
> > priority.
Agreed - with the caveat above on congestion, i.e., it is still a single
wire even with multiple rings.
> > If we absolutely want to, we can keep the old pfifo_fast around and use
> > it (shared on multiq) if a certain sysctl knob is set.
>
> No, I fully agree that this is too much detail :) It's highly
> unlikely that this default behaviour is important on a per
> packet level :) I just meant to point out that using a pfifo
> is not going to be the same behaviour as previously.
IMO, if non-multiq drivers continue to work as before with the prios,
then nice. multiq could be tuned over a period of time.
cheers,
jamal
From: Herbert Xu <[email protected]>
Date: Sun, 20 Jul 2008 22:32:35 +0800
> Now I understand that having a pfifo is an anathema to multi-queue TX
> by definition. However, even if we couldn't preserve the full
> semantics by default, if we could at least preserve the ordering
> within each queue it would still be a plus.
That's how I basically feel right now as well.
So we can revert that change that killed pfifo_fast
if we want.
From: jamal <[email protected]>
Date: Mon, 21 Jul 2008 07:20:01 -0400
> Actually, can i modify that thought and go back to my initial contention
> now that things are making more sense?;->
> A single s/ware queue per hardware transmit queue is good - but that
> being pfifo_fast would be a lot better. It would, at a minimum, keep
> things as they were for non-multiq and is a sane choice for any virtual
> wire.
Ok, I'll revert the change that changed pfifo_fast into plain fifo_fast.
Thanks.
jamal <[email protected]> wrote:
>
> Actions are fine because they are intended to be globally shared.
> [i.e., a classifier on ethx with qdiscA:Y (in/egress) can share an action
> with a classifier on ethy with qdiscB:Z (eg/ingress)].
Not all actions :) That nat action for example wasn't intended to
be shared at all. In fact I still need to submit a patch to make
it skip the shared hash as otherwise it simply won't scale as the
number of nat actions increases (e.g., to 256K).
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Patrick McHardy <[email protected]>
Date: Thu, 17 Jul 2008 15:48:22 +0200
> One thought that occurred to me - we could avoid all the visibility
> issues wrt. dev->qdisc_list by simply getting rid of it :)
>
> If we move the qdisc list from the device to the root Qdisc itself,
> it would become invisible automatically as soon as we assign a new
> root qdisc to the netdev_queue. Iteration would become slightly
> more complicated since we'd have to iterate over all netdev_queues,
> but I think it should avoid most of the problems I mentioned
> (besides the u32_list thing).
What might make sense is to have a special Qdisc_root structure which
is simply:
struct Qdisc_root {
struct Qdisc qd;
struct list_head qdisc_list;
};
Everything about tree level synchronization would be type explicit.
Yes, as you say, the qdisc iteration would get slightly ugly. But
that doesn't seem to be a huge deal.
But it seems a clean solution to the child qdisc visibility problem.
About u32_list, that thing definitely needs some spinlock. The
consultation of that list, and refcount mods, only occur during config
operations. So it's not like we have to grab this lock in the data
paths.
If we really want to sweep this problem under the rug, there is another
way. Have the qdisc_destroy() RCU handler kick off a workqueue, and
grab the RTNL semaphore there during the final destruction calls. :-)
jamal <[email protected]> wrote:
>
>> Otherwise, there is zero point in all of these TX multiqueue features
>> in the hardware if we can't parallelize things fully.
>
> parallelization is achievable in the ideal case.
I think I get you now. You're suggesting that we essentially
do what Dave has right now in the non-contending case, i.e.,
bypassing the qdisc so we get fully parallel processing until
one of the hardware queues seizes up.
At that point you'd stop all queues and make every packet go
through the software qdisc to ensure ordering. This continues
until all queues have vacancies again.
If this is what you're suggesting, then I think that will offer
pretty much the same behaviour as what we've got, while still
offering at least some (perhaps even most, but that is debatable)
of the benefits of multi-queue.
At this point I don't think this is something that we need right
now, but it would be good to make sure that the architecture
allows such a thing to be implemented in future.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Sun, 2008-20-07 at 22:20 -0400, jamal wrote:
> We are in sync i think, a single pfifo per TX queue is the way to go. I
> was suggesting it goes in the driver, but this is cleaner: In the
Actually, can i modify that thought and go back to my initial contention
now that things are making more sense?;->
A single s/ware queue per hardware transmit queue is good - but that
being pfifo_fast would be a lot better. It would, at a minimum, keep
things as they were for non-multiq and is a sane choice for any virtual
wire.
cheers,
jamal
On Sun, Jul 20, 2008 at 10:33:57PM -0400, jamal wrote:
>
> True.
> But note: this is only during rule creation - once you create the
> rule (user space to kernel path), there is no more hash table reference.
> The fast path already has a filter with actions attached, and it is a mere
> pointer dereference.
Unfortunately the scenario that I wrote this for requires frequent
addition/removal.
> > We could do a dynamic table but so far I'm not convinced that
> > it's worth anybody's effort to implement :)
>
> If user<->kernel performance insertion/deletion is important, it is
> worth it.
Only if you also want to share it :) In the end I patched it to
not share it which is much easier.
> For example:
> Dave implemented dynamic hash tables on xfrm (voip setup time with ipsec
> is a metric used in the industry in that case) . The only operational
> problem i had with xfrm was lack of an upper bound of how large a table
> can grow; i would rather user space be told ENOMEM than continuing to
> grow in some cases (I actually implemented a patch which put a stop
> after a certain number of sad/spd - but i dont expect hugs if i was to
> post it;->).
Of course if you're volunteering to write the dynamic hash table
for actions then I'd happily switch back to sharing :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, 2008-18-07 at 14:05 -0700, David Miller wrote:
> The fundamental issue is what we believe qdiscs schedule: do they
> schedule a device, or do they schedule what their namesake implies,
> "queues"?
In the simple case of a single hardware queue/ring, the mapping between
a hardware queue and "physical wire" is one-to-one.
So in that case one could argue the root qdiscs are scheduling a device.
> Logically once we have multiple queues, we schedule queues.
> Therefore what probably makes sense is that for mostly stateless
> priority queueing such as pfifo_fast, doing prioritization at the
> queue level is OK.
IMO, in the case of multiple hardware queues per physical wire,
where such a netdevice already has a built-in hardware scheduler (they all
seem to have this feature), then if we can feed the hardware queues
directly, there's no need for any intermediate buffer(s).
In such a case, to compare with the qdisc arch, it's like the root qdisc is
in hardware.
The only need for intermediate s/ware queues is for cases of congestion.
If you could have a single software queue for each hardware queue, then
you still have the obligation of correctness to make sure higher prio
hardware rings get fed with packets first (depending on the hardware's
scheduling capability).
> But where non-trivial classification occurs, we have to either:
>
> 1) Make the queue selection match what the classifiers would do
> exactly.
>
> OR
>
> 2) Point all the queues at a single device global qdisc.
>
> What we have now implements #2. Later on we can try to do something
> as sophisticated as #1.
Sure; I think you could achieve the goals by using the single queue with
a software pfifo_fast which maps skb->prio to hardware queues. Such a
pfifo_fast may even sit in the driver. This queue will always be empty
unless you have congestion. The other thing is to make sure there is an
upper bound on the size of this queue; otherwise a remote bug could
cause it to grow infinitely and consume all memory.
cheers,
jamal
jamal wrote:
> On Thu, 2008-17-07 at 15:03 +0200, Patrick McHardy wrote:
>
>> Actions are also visible
>> globally, so this might still be a problem, not sure though since
>> they don't refer to their parent (haven't thought about it much yet).
>
> Actions are fine because they are intended to be globally shared.
> [i.e., a classifier on ethx with qdiscA:Y (in/egress) can share an action
> with a classifier on ethy with qdiscB:Z (eg/ingress)].
Yes, in that case it's not a problem. The case where it behaves
differently than now is when only a single reference exists.
> Like you i need to digest the patches to understand the impact on the
> rest but one thing i did notice was the last patch (replacement of
> pfifo_fast):
> prioritization based on TOS/DSCP (setsockopt) would no longer work, some
> user space code may suffer (routing daemons likely). One suggestion to
> fix it is to load pfifo qdisc (which does what fifo_fast is attempting)
> for drivers that are h/ware multiq capable.
That would perform prioritization within each qdisc, the individual
qdiscs would still be transmitted using separate HW queues though.
> The question is whether the packets sitting on the fifo qdisc are being
> sorted fairly when congestion kicks in. Remember there is still a single
> wire still even on multiple rings;->
This is definitely true, but it is good to keep in mind that in the near
future we will start to see things look a little like multiple "virtual
wires." This is because of new ethernet standards like per-priority
pause, which makes it possible that one hardware ring on a NIC can
transmit while another ring is paused (possibly because of congestion
far off in the network).
- R.
On Sun, 2008-20-07 at 10:25 -0700, David Miller wrote:
> They tend to implement round-robin or some similar fairness algorithm
> amongst the queues, with zero concern about packet priorities.
pfifo_fast would be a bad choice in that case, but even a pfifo cannot
guarantee proper RR because it would present packets in a FIFO order
(for example, the first 10 could go to hardware queue1 and the next to
hardware queue2).
My view: I think you need a software queue per hardware queue.
Maybe even have these queues reside in the driver; that way you take care
of congestion and it doesn't matter if the hardware is RR or strict prio
(and you don't need the pfifo or pfifo_fast anymore).
The use case would be something along these lines:
A packet comes in, you classify it and find it's for queue1, grab the
per-hardware-queue1 lock, find that hardware queue1 is overloaded, and
stash the packet in s/ware queue1 instead. If it wasn't congested, it would
go on hardware queue1.
When hardware queue1 becomes available and is netif-woken, you pick first
from s/ware queue1 (and batching could apply cleanly here) and send those
packets to hardware queue1.
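In rough pseudo-C (all names invented purely to pin the scheme down, not
real driver or stack interfaces), that flow would look like:

struct pkt;

struct hwq {
	int  full;                           /* hardware ring has no room    */
	int  (*xmit)(struct pkt *);          /* hand packet to the ring      */
	void (*backlog_add)(struct pkt *);   /* per-queue software queue     */
	struct pkt *(*backlog_pop)(void);
};

/* classify() picked this hwq; called with only this queue's lock held */
void tx_one(struct hwq *q, struct pkt *p)
{
	if (q->full)
		q->backlog_add(p);           /* stash in s/w queue N only    */
	else
		q->xmit(p);                  /* uncongested: straight to h/w */
}

/* driver calls this when the ring frees up and the queue is woken */
void tx_wake(struct hwq *q)
{
	struct pkt *p;

	q->full = 0;
	while (!q->full && (p = q->backlog_pop()) != NULL)
		q->xmit(p);                  /* drain the backlog first (batchable) */
}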
> It really is just like a bunch of queues to the physical layer,
> fairly shared.
I am surprised prioritization is not an issue. [My understanding of the
intel/cisco datacentre cabal is that they serve virtual machines using
virtual wires; I would think in such scenarios you'd have some customers
who pay more than others].
> These things are built for parallelization, not prioritization.
Total parallelization happens in the ideal case. If X CPUs classify
packets going to X different hardware queues, each CPU grabs only locks
for that hardware queue. In virtualization, where only one customer's
traffic is going to a specific hardware queue, things would work well.
A non-virtualization scenario may result in collisions in which two or more
CPUs contend for the same hardware queue (either transmitting or
netif-waking, etc.).
cheers,
jamal
On Fri, 2008-18-07 at 09:10 -0400, jamal wrote:
> wire still even on multiple rings;->
> If Woz (really) showed up at 9am and the Broussards at 3 am[1] on that
> single (congestion-buffering) FIFO waiting for the shop/wire to open up,
> then Woz should jump the queue (if he deserves it) when shop opens at
> 10am.
Sorry, here is the URL of Woz allegedly jumping the queue:
http://news.cnet.com/8301-1023_3-9989823-93.html?hhTest=1%C3%A2%C2%88%C2%82=rss&subj=news&tag=2547-1_3-0-20
cheers,
jamal
David Miller <[email protected]> wrote:
>
> Are we really really sure it's OK to let users do this? :)
Well, security questions aside, I have seen this used on routers
to ensure certain traffic (e.g., VOIP) gets preferential treatment
when it comes to transmission.
Of course some of these routers set up custom qdiscs so they wouldn't
be affected, but there may be ones out there relying on the default
pfifo.
Now I understand that having a pfifo is anathema to multi-queue TX
by definition. However, even if we couldn't preserve the full
semantics by default, if we could at least preserve the ordering
within each queue it would still be a plus.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jul 21, 2008 at 10:08:21AM -0700, David Miller wrote:
>
> Can I at least get some commitment that someone will test
> that this really is necessary before we add the CPU ID
> hash option?
Sure, I'll be testing some related things on this front so I'll
try to produce some results that compare these two cases.
For the purposes of demonstrating this I'll choose the worst-case
scenario: that is, I'll ensure that the TX hash result exactly
distributes each RX hash preimage across all possible TX hash
values.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Herbert Xu <[email protected]>
Date: Tue, 22 Jul 2008 01:02:33 +0800
> On Mon, Jul 21, 2008 at 09:51:24AM -0700, David Miller wrote:
> >
> > How so? If the TX hash is well distributed, which it should be,
> > it is at least going to approximate the distribution provided by
> > the RX hash.
>
> This is a matter of probabilities :) In general, if the TX hash
> and the RX hash are completely unrelated, then the resulting
> hash distributions should be independent of each other (independent
> in the probabilistic sense). That is, the flows which have
> been RX hashed into one queue should be hashed on average
> across all queues by the TX hash. Conversely, those that have
> been hashed into one TX queue would be distributed across all
> RX queues.
Theoretically perhaps you are right.
Can I at least get some commitment that someone will test
that this really is necessary before we add the CPU ID
hash option?
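A throwaway user-space sketch of the probabilistic argument quoted above;
the mixing function, seeds and queue count are arbitrary choices for
illustration, not the kernel's actual hashes:

/*
 * Hash the same flows with two unrelated hash functions and count how
 * the flows that landed in RX queue 0 spread across the TX queues.
 */
#include <stdio.h>
#include <stdint.h>

#define NFLOWS	100000
#define NQUEUES	4

static uint32_t mix(uint32_t x, uint32_t seed)
{
	/* Simple avalanche mixer (murmur-style finalizer). */
	x ^= seed;
	x *= 2654435761u;
	x ^= x >> 16;
	x *= 2246822519u;
	x ^= x >> 13;
	return x;
}

int main(void)
{
	unsigned int spread[NQUEUES] = { 0 };
	unsigned int rx0 = 0;
	uint32_t flow;

	for (flow = 0; flow < NFLOWS; flow++) {
		if (mix(flow, 0x12345678u) % NQUEUES == 0) {	/* RX queue 0 */
			rx0++;
			spread[mix(flow, 0x9abcdef0u) % NQUEUES]++;
		}
	}
	printf("%u flows RX-hashed to queue 0, TX spread:\n", rx0);
	for (unsigned int i = 0; i < NQUEUES; i++)
		printf("  tx queue %u: %u flows (~%u%%)\n",
		       i, spread[i], rx0 ? spread[i] * 100 / rx0 : 0);
	return 0;
}

Each TX queue should end up with roughly a quarter of the flows that
were RX-hashed to queue 0, which is the "spread across all queues"
behaviour that independence of the two hashes implies.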
On Fri, 2008-22-08 at 03:42 -0700, David Miller wrote:
> I had another idea in the back of my head that I wanted to mention.
>
> The SKB has a hash, and sources set the hash. It doesn't matter
> where in the stack it comes from.
>
> At transmit time, the select queue logic in dev_queue_xmit() will use
> the SKB hash if one has been set already. Otherwise it will do the
> hashing it does currently.
>
> So in the simplest case for forwarding, the RX side puts the RSS
> hash or whatever into this SKB hash location.
>
Makes sense.
In the case Herbert described, I am assuming that the RSS side could be
configured to select a processor statically (MSI or otherwise), and that
the hash value will be that of the processor id, ergo the tx queue
selected.
> Then, taking things to the next level, protocols set hashes for
> locally created packets. And this leads to being able to delete
> the simple_tx_hash() code entirely and also we won't have to add
> support for every protocol on the planet to that function. :)
>
Well, conntrack could be useful (as long as you don't make it mandatory)
since it already keeps track of flows.
When a flow is unresolved, one could add it to some default queue.
cheers,
jamal
From: Herbert Xu <[email protected]>
Date: Fri, 22 Aug 2008 17:41:15 +1000
> On Fri, Aug 22, 2008 at 12:16:20AM -0700, David Miller wrote:
> >
> > I suppose you're talking about the grand unified flow cache
> > that never gets implemented right? I classify that in the
> > same category as net channels at the moment, theoretically
> > very interesting but no practical implementation in sight.
>
> In any case, we could implement what I suggested even without
> a flow cache, by simply storing the info in the dst. This means
> that all flows on the same dst will end up in the same queue, which
> is not as good as what a flow cache could give, but it's useful
> enough to be an option.
I had another idea in the back of my head that I wanted to mention.
The SKB has a hash, and sources set the hash. It doesn't matter
where in the stack it comes from.
At transmit time, the select queue logic in dev_queue_xmit() will use
the SKB hash if one has been set already. Otherwise it will do the
hashing it does currently.
So in the simplest case for forwarding, the RX side puts the RSS
hash or whatever into this SKB hash location.
Then, taking things to the next level, protocols set hashes for
locally created packets. And this leads to being able to delete
the simple_tx_hash() code entirely and also we won't have to add
support for every protocol on the planet to that function. :)
TCP sockets could maybe even do something clever, like using the
incoming SKB hash of the SYN and SYN+ACK packets for all subsequent
SKBs sent by that socket.
Another nice side effect, we have the choice of allowing IPSEC
encapsulation of locally generated frames to not change the TX queue.
This would have to be controlled by a sysctl or similar because I
can see both ways being useful in different circumstances.
You get the idea.
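A rough sketch of what transmit-side selection could look like under
that scheme; skb->hash, pick_tx_queue() and compute_header_hash() are
hypothetical stand-ins here, not code in this tree:

/*
 * Assume an skb->hash field that the RX path (RSS) or the protocol may
 * have filled in.  Queue selection uses it when present and only falls
 * back to hashing the headers (what simple_tx_hash() does today) when
 * nothing set it.
 */
static u16 pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
{
	u32 hash = skb->hash;			/* hypothetical field */

	if (!hash)
		hash = compute_header_hash(skb);	/* hypothetical fallback */

	/* Scale the 32-bit hash onto the device's TX queues. */
	return (u16)(((u64)hash * dev->real_num_tx_queues) >> 32);
}

If nothing ever sets the hash, behaviour degenerates to today's header
hashing, so protocols that never opt in lose nothing.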
On Fri, 2008-22-08 at 16:56 +1000, Herbert Xu wrote:
> On Tue, Jul 22, 2008 at 01:11:41AM +0800, Herbert Xu wrote:
> > On Mon, Jul 21, 2008 at 10:08:21AM -0700, David Miller wrote:
> I haven't had a chance to do the test yet but I've just had an
> idea of how we can get the best of both worlds.
>
> The problem with always directing traffic based on the CPU alone
> is that processes move around and we don't want to introduce packet
> reordering because of that.
Assuming multi-rx queues with configurable MSI or otherwise to map
to a receive processor, then in the case of routing/bridging or
otherfavoriteformofforwarding:
If you tie static filters to a specific cpu, that will always work.
So no reordering there.
For local traffic I can see migration/reordering happening.
> The problem with hashing based on packet headers alone is that
> it doesn't take CPU affinity into account at all so we may end
> up with a situation where one thread out of a thread pool (e.g.,
> a web server) has n sockets which are hashed to n different
> queues.
Indeed. In the forwarding case, the problem is not reordering; rather,
all flows will always end up on the same cpu. So you may end up
just overloading one cpu while the other 1023 stay idle.
My earlier statement was that you could cook traffic scenarios where all
1024 are fully utilized (the operative term is "cook");->
> So here's the idea: we determine the tx queue for a flow based
> on the CPU on which we saw its first packet. Once we have decided
> on a queue we store that in a dst object (see below). This
> ensures that all subsequent packets of that flow end up in
> the same queue so there is no reordering. It also avoids the
> problem where traffic generated by one CPU gets scattered across
> queues.
Won't work with a static multi-rx nic; iirc, changing those filters is
_expensive_, so you want it to stay static.
cheers,
jamal
On Fri, Aug 22, 2008 at 03:42:17AM -0700, David Miller wrote:
>
> I had another idea in the back of my head that I wanted to mention.
>
> The SKB has a hash, and sources set the hash. It doesn't matter
> where in the stack it comes from.
Yes this sounds great!
> So in the simplest case for forwarding, the RX side puts the RSS
> hash or whatever into this SKB hash location.
This works on all the routers.
> Then, taking things to the next level, protocols set hashes for
> locally created packets. And this leads to being able to delete
> the simple_tx_hash() code entirely and also we won't have to add
> support for every protocol on the planet to that function. :)
And this takes care of the hosts.
Best of all it's per flow and we don't even need to add any new
infrastructure :)
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Jul 22, 2008 at 01:11:41AM +0800, Herbert Xu wrote:
> On Mon, Jul 21, 2008 at 10:08:21AM -0700, David Miller wrote:
> >
> > Can I at least get some commitment that someone will test
> > that this really is necessary before we add the CPU ID
> > hash option?
>
> Sure, I'll be testing some related things on this front so I'll
> try to produce some results that compare these two cases.
I haven't had a chance to do the test yet but I've just had an
idea of how we can get the best of both worlds.
The problem with always directing traffic based on the CPU alone
is that processes move around and we don't want to introduce packet
reordering because of that.
The problem with hashing based on packet headers alone is that
it doesn't take CPU affinity into account at all so we may end
up with a situation where one thread out of a thread pool (e.g.,
a web server) has n sockets which are hashed to n different
queues.
So here's the idea: we determine the tx queue for a flow based
on the CPU on which we saw its first packet. Once we have decided
on a queue we store that in a dst object (see below). This
ensures that all subsequent packets of that flow end up in
the same queue so there is no reordering. It also avoids the
problem where traffic generated by one CPU gets scattered across
queues.
Of course to make this work we need to restart the flow cache
project so that we have somewhere to store this txq assignment.
The good thing is that a flow cache would be of benefit for IPsec
users too and I hear that there is some interest in doing that
in the immediate future. So perhaps we can combine efforts and
use it for txq assignment as well.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Aug 22, 2008 at 12:16:20AM -0700, David Miller wrote:
>
> I suppose you're talking about the grand unified flow cache
> that never gets implemented right? I classify that in the
> same category as net channels at the moment, theoretically
> very interesting but no practical implementation in sight.
In any case, we could implement what I suggested even without
a flow cache, by simply storing the info in the dst. This means
that all flows on the same dst will end up in the same queue, which
is not as good as what a flow cache could give, but it's useful
enough to be an option.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
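A hypothetical sketch combining the two ideas above (first-packet CPU
picks the queue, the choice is cached in the dst); the txq_cached field
and the fallback helper do not exist, skb_dst() is used loosely for the
skb's cached route, and the unlocked update race on a shared dst is
glossed over:

/*
 * First packet of a flow binds the flow's dst to the queue matching
 * the CPU it was seen on; later packets reuse that choice, so there is
 * no reordering even if the sending process migrates.
 */
static u16 flow_select_txq(struct net_device *dev, struct sk_buff *skb)
{
	struct dst_entry *dst = skb_dst(skb);

	if (dst) {
		if (!dst->txq_cached) {	/* hypothetical field, 0 = unset */
			/* xmit path runs with bh disabled, so the cpu id is stable */
			dst->txq_cached = (smp_processor_id() %
					   dev->real_num_tx_queues) + 1;
		}
		return dst->txq_cached - 1;
	}

	/* No dst: fall back to hashing the headers as today. */
	return fallback_hash_txq(dev, skb);	/* hypothetical */
}

Caching per dst rather than per flow means all flows sharing a route
share one queue, which is the limitation Herbert notes above.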
From: Herbert Xu <[email protected]>
Date: Fri, 22 Aug 2008 16:56:55 +1000
... Storing initial TXQ assignment in dst entry idea ...
> Of course to make this work we need to restart the flow cache
> project so that we have somewhere to store this txq assignment.
>
> The good thing is that a flow cache would be of benefit for IPsec
> users too and I hear that there is some interest in doing that
> in the immediate future. So perhaps we can combine efforts and
> use it for txq assignment as well.
IPSEC already has a flow cache, doesn't it? :)
I suppose you're talking about the grand unified flow cache
that never gets implemented right? I classify that in the
same category as net channels at the moment, theoretically
very interesting but no practical implementation in sight.