2005-04-08 10:20:51

by Tarhon-Onu Victor

[permalink] [raw]
Subject: ACPI/HT or Packet Scheduler BUG?


I am not subscribed to this list so please CC me if you post a
reply, if you need additional info or if you suggest a patch (in which
I would be very interested).

Occurrence:
- the kernels must be compiled with Hyper Threading support (and
the CPU/MB must support it);
- a (tc) process is reading statistics from the htb tree while another
one tries to delete, or deletes, the tree.
The bug will not occur if the system is booted with acpi=off or
if it's not SMP (HT) capable. I reproduced the bug on 2.6.10-1.770_FCsmp
(Fedora Core update package) and vanilla 2.6.11, 2.6.11.5, 2.6.11.6,
2.6.12-rc1 and 2.6.12-rc2 compiled with SMP and ACPI support in order to
enable Hyper Threading (with and without preempt, with or without SMT
support). The compiler I used is GCC version 3.4.2 (from Fedora Core 3).

Result: almost every time, the system hangs but can still
execute SysRq commands.

How I tested
~~~~~~~~~~~~
On a console I was running a script that initializes a htb tree
on an interface (dummy0) in an endless loop:
while /bin/true; do . htbdown.dummy0.sh; done
...and on another console I ran tc -s, also in an endless loop:
while /bin/true; do tc -s class sh dev dummy0; done

After a while (sometimes after 2 runs of the htbdown.dummy0.sh
script, sometimes after 20) the system hangs. It still responds to SysRq
commands and I was able to see the two tc processes running. Sometimes I
still get ping replies from it, and one time, just once in 2 days, I was
still able to log in remotely.
There are no printk messages or other warnings or errors
printed in the system log or kernel log. It just hangs, and it seems like
something is consuming all the CPU power: when I was still able to log in
I noticed that one of the two virtual CPUs was at 100% servicing system
interrupts and the other was at 100% system time. I wasn't able to strace
either of the two running tc's.
What I was able to paste with the mouse from my console, what I
captured in a typescript, and the script that initializes the htb tree on
dummy0 can be found at ftp://blackblue.iasi.rdsnet.ro/pub/various/k/ .
The test host is a 3.0GHz Intel Prescott and I first noticed the
bug on a system with a 2.8GHz Intel Northwood, both having motherboards
with an Intel chipset (865GBF). I am not able to test it in other SMP
environments (Intel Xeon or Itanium, AMD Opteron, dual P3, etc.), so I'm
not able to tell if it's an ACPI bug, an SMP bug or a Packet Scheduler
bug.




2005-04-12 12:49:14

by Tarhon-Onu Victor

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Fri, 8 Apr 2005, Tarhon-Onu Victor wrote:

> I am not subscribed to this list so please CC me if you post a reply,
> if you need additional info or if you suggest a patch (in which I would be
> very interested).
>
[snip]
> (Intel Xeon or Itanium, AMD Opteron, Dual P3, etc), so I'm not able to tell
> if it's an ACPI bug, a SMP bug or a Packet Scheduler bug.

It seems that this bug is a packet scheduler one and that it was
introduced in 2.6.10-rc2. In the summary of changes from 2.6.10-rc1 to
2.6.10-rc2 there are a lot of changes announced for the packet
scheduler.
I removed all the packet scheduler changes from the incremental
patch from 2.6.10-rc1 to 2.6.10-rc2, applied the result to 2.6.10-rc1,
and the resulting 2.6.10-rc2-without-sched-changes kernel does not hang.
So the problem should be looked for in those changes to the pkt
sched API; the patch containing only those changes is at
ftp://blackblue.iasi.rdsnet.ro/pub/various/k/patch-2.6.10-sched_changes-from_rc1-to-rc2.gz


2005-04-14 15:46:44

by Tarhon-Onu Victor

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Tue, 12 Apr 2005, Tarhon-Onu Victor wrote:

> So the problem should be looked in that changes to the pkt sched API,
> the patch containing only those changes is at

The bug is in this portion of code from net/sched/sch_generic.c,
in the qdisc_destroy() function:

==
	list_for_each_entry(cq, &cql, list)
		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
				if (q->ops->cl_ops == NULL)
					list_del_init(&q->list);
				else
					list_move_tail(&q->list, &cql);
			}
	list_for_each_entry_safe(cq, n, &cql, list)
		list_del_init(&cq->list);
==

...and it happens when q->ops->cl_ops is NULL and
list_del_init(&q->list) is executed.

The code in include/linux/list.h looks OK; it seems that one
of the two iterations (list_for_each_entry() and
list_for_each_entry_safe()) enters an endless loop when an element is
removed from the list under some circumstances.
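
For reference, the relevant list.h constructs look roughly like this
(a simplified reproduction from memory, not an exact quote of the
2.6.10 headers). Note that the "_safe" variant only guards against the
loop body itself deleting the current entry; it does not help when
another CPU removes entries concurrently:

==
/* Simplified shape of the two iterators (prefetch hints omitted). */
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = list_entry(pos->member.next, typeof(*pos), member))

/* Caches the next entry so the loop body may delete the current one. */
#define list_for_each_entry_safe(pos, n, head, member)			\
	for (pos = list_entry((head)->next, typeof(*pos), member),	\
	     n = list_entry(pos->member.next, typeof(*pos), member);	\
	     &pos->member != (head);					\
	     pos = n, n = list_entry(n->member.next, typeof(*pos), member))

/* list_del_init() unlinks the entry and then re-initializes it to
 * point at itself, so a walker that is already positioned on that
 * entry never advances past it. */
==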

--
Tarhon-Onu Victor
Area Technical Coordinator
RDS Iasi - Network Operations Center
http://www.rdsnet.ro, http://www.rdstel.ro, http://www.rdslink.ro
Phone: +40-232-218385; Fax: +40-232-260099

2005-04-15 21:37:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Thu, 2005-04-14 at 18:46 +0300, Tarhon-Onu Victor wrote:
> On Tue, 12 Apr 2005, Tarhon-Onu Victor wrote:
>
> > So the problem should be looked in that changes to the pkt sched API,
> > the patch containing only those changes is at
>
> The bug is in this portion of code from net/sched/sch_generic.c,
> in the qdisc_destroy() function:
>
> ==
> list_for_each_entry(cq, &cql, list)
> list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
> if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
> if (q->ops->cl_ops == NULL)
> list_del_init(&q->list);
> else
> list_move_tail(&q->list, &cql);
> }
> list_for_each_entry_safe(cq, n, &cql, list)
> list_del_init(&cq->list);
> ==
>
> ...and it happens when q->ops->cl_ops is NULL and
> list_del_init(&q->list) is executed.
>
> The stuff from include/linux/list.h looks ok, it seems like one
> of those two iterations (list_for_each_entry() and
> list_for_each_entry_safe()) enters an endless loop when an element is
> removed from the list under some circumstances.

There's a comment above qdisc_destroy that says:

/* Under dev->queue_lock and BH! */

I'm not so sure this is the case. I've included the email addresses of
those listed as authors of sch_generic.c and sch_htb.c; hopefully they
are the ones who can help (if not, sorry to bother you).

The list.h code is fine, but if another task is walking this list when
the list_del_init is done, there's a chance that the reading task
reaches the deleted item just as it is being deleted and has been
pointed at itself: p->next == p. That walker would then go into an
infinite loop.
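
A user-space sketch of the failure mode described above (illustration
only; the struct and helper below are simplified stand-ins for the
kernel's list.h, and the program bounds the loop so it can terminate):

==
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Same semantics as the kernel's list_del_init(): unlink the entry,
 * then make it point at itself. */
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;
}

int main(void)
{
	struct list_head head, a, b;
	struct list_head *p;
	int steps = 0;

	/* Build head <-> a <-> b <-> head. */
	head.next = &a; a.prev = &head;
	a.next = &b;    b.prev = &a;
	b.next = &head; head.prev = &b;

	p = head.next;		/* the "reader" is currently on a       */
	list_del_init(&a);	/* the "writer" deletes a concurrently   */

	/* The reader resumes walking: a->next == &a, so it never
	 * reaches the list head again. */
	while (p != &head && steps < 5) {
		printf("visiting %p\n", (void *)p);
		p = p->next;
		steps++;
	}
	printf("%s\n", steps == 5 ? "still looping (would spin forever)"
				  : "reached head");
	return 0;
}
==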

The reason SysRq still works is that this loop doesn't disable
interrupts. But put a local_irq_save around that list walk and run your
test; I bet you won't be able to do anything but power off with the big
button.
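
A sketch of the experiment suggested above, assuming "that list" means
the nested walk in the qdisc_destroy() snippet quoted earlier
(illustration only, not a proposed fix):

==
	unsigned long flags;

	/* With interrupts disabled on this CPU, a spin inside the walk
	 * would also block SysRq, confirming that this is where the CPU
	 * gets stuck. */
	local_irq_save(flags);
	list_for_each_entry(cq, &cql, list)
		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
				if (q->ops->cl_ops == NULL)
					list_del_init(&q->list);
				else
					list_move_tail(&q->list, &cql);
			}
	local_irq_restore(flags);
==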

Hope someone can help. I don't know the queue disciplines well enough to
make a proper fix.

-- Steve


2005-04-15 21:44:23

by jamal

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?


Didn't see the beginning of this thread - please post network-related
questions on netdev instead of lkml.

The real cause seems to be an ARP issue, from what I saw in the oops
posted a while back:
--
[4294692.342000] Call Trace:
[4294692.342000] [<c0104d76>] show_stack+0xa6/0xe0
[4294692.342000] [<c0104f2b>] show_registers+0x15b/0x1f0
[4294692.342000] [<c01051a1>] die+0x141/0x2d0
[4294692.342000] [<c011e13e>] do_page_fault+0x22e/0x6a6
[4294692.342000] [<c0104817>] error_code+0x4f/0x54
[4294692.342000] [<c04236da>] qdisc_restart+0xba/0x730
[4294692.342000] [<c04136fe>] dev_queue_xmit+0x13e/0x640
[4294692.342000] [<c0454c4c>] arp_solicit+0xfc/0x210
[4294692.342000] [<c041a6ee>] neigh_timer_handler+0x13e/0x320
[4294692.342000] [<c0137450>] run_timer_softirq+0x130/0x490
[4294692.342000] [<c0131ad2>] __do_softirq+0x42/0xa0
[4294692.342000] [<c01066e1>] do_softirq+0x51/0x60
-----

Is this the same issue?
Can you describe how you reproduce this issue, kernel version, etc.?

cheers,
jamal

On Fri, 2005-15-04 at 17:37 -0400, Steven Rostedt wrote:
> On Thu, 2005-04-14 at 18:46 +0300, Tarhon-Onu Victor wrote:
> > On Tue, 12 Apr 2005, Tarhon-Onu Victor wrote:
> >
> > > So the problem should be looked in that changes to the pkt sched API,
> > > the patch containing only those changes is at
> >
> > The bug is in this portion of code from net/sched/sch_generic.c,
> > in the qdisc_destroy() function:
> >
> > ==
> > list_for_each_entry(cq, &cql, list)
> > list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
> > if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
> > if (q->ops->cl_ops == NULL)
> > list_del_init(&q->list);
> > else
> > list_move_tail(&q->list, &cql);
> > }
> > list_for_each_entry_safe(cq, n, &cql, list)
> > list_del_init(&cq->list);
> > ==
> >
> > ...and it happens when q->ops->cl_ops is NULL and
> > list_del_init(&q->list) is executed.
> >
> > The stuff from include/linux/list.h looks ok, it seems like one
> > of those two iterations (list_for_each_entry() and
> > list_for_each_entry_safe()) enters an endless loop when an element is
> > removed from the list under some circumstances.
>
> There's a comment above qdisc_destroy that says:
>
> /* Under dev->queue_lock and BH! */
>
> I'm not so sure this is the case. I've included the emails of those
> listed as Authors of sch_generic.c and sch_htb.c, hopefully they are the
> ones who can help (if not, sorry to bother you).
>
> The list.h is fine, but if another task goes down this list when it
> list_del_init is done, there's a chance that the reading task can get to
> the deleted item just as it is being deleted, and has pointed itself to
> itself. p->next == p. This would go into an infinite loop.
>
> The reason sysrq works is because this doesn't stop interrupts. But put
> a local_irq_save around that list and run your test, I bet you won't be
> able to do anything, but power off with the big button.
>
> Hope someone can help. I don't know the queue disciplines well enough to
> make a proper fix.
>
> -- Steve
>
>
>

2005-04-15 21:54:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

Don't think the issue is the same. This problem is happening to
Tarhon-Onu Victor. I'm including his previous two posts from LKML here
so you can read the whole thing in one letter. He states that the
problem started in 2.6.10-rc2, and looking at a diff between rc1 and
rc2, the list walk was added in qdisc_destroy. Take a look at his
scripts and you may see that on an SMP machine this may have a race
condition.

-- Steve

From April 8th:

[snip - quoted in full at the top of this thread]

From April 12th:

[snip - quoted in full above]



On Fri, 2005-04-15 at 17:44 -0400, jamal wrote:
> Didnt see the beginings of this thread - please post on netdev instead
> of lkml network related questions.
>
> The real cause seems to be an ARP issue from what i saw in the oops
> posted a while back:
> --
> [4294692.342000] Call Trace:
> [4294692.342000] [<c0104d76>] show_stack+0xa6/0xe0
> [4294692.342000] [<c0104f2b>] show_registers+0x15b/0x1f0
> [4294692.342000] [<c01051a1>] die+0x141/0x2d0
> [4294692.342000] [<c011e13e>] do_page_fault+0x22e/0x6a6
> [4294692.342000] [<c0104817>] error_code+0x4f/0x54
> [4294692.342000] [<c04236da>] qdisc_restart+0xba/0x730
> [4294692.342000] [<c04136fe>] dev_queue_xmit+0x13e/0x640
> [4294692.342000] [<c0454c4c>] arp_solicit+0xfc/0x210
> [4294692.342000] [<c041a6ee>] neigh_timer_handler+0x13e/0x320
> [4294692.342000] [<c0137450>] run_timer_softirq+0x130/0x490
> [4294692.342000] [<c0131ad2>] __do_softirq+0x42/0xa0
> [4294692.342000] [<c01066e1>] do_softirq+0x51/0x60
> -----
>
> Is this the same issue?
> Can you describe how you create this issue; kernel version etc.
>
> cheers,
> jamal
>
> [snip - Steven's earlier message, quoted in full above]

2005-04-15 22:55:42

by Thomas Graf

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

* Steven Rostedt <[email protected]> 2005-04-15 17:54
> > ==
> > list_for_each_entry(cq, &cql, list)
> > list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
> > if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
> > if (q->ops->cl_ops == NULL)
> > list_del_init(&q->list);
> > else
> > list_move_tail(&q->list, &cql);
> > }
> > list_for_each_entry_safe(cq, n, &cql, list)
> > list_del_init(&cq->list);
> > ==
> >
> > ...and it happens when q->ops->cl_ops is NULL and
> > list_del_init(&q->list) is executed.
> >
> > The stuff from include/linux/list.h looks ok, it seems like one
> > of those two iterations (list_for_each_entry() and
> > list_for_each_entry_safe()) enters an endless loop when an element is
> > removed from the list under some circumstances.
>
> There's a comment above qdisc_destroy that says:
>
> /* Under dev->queue_lock and BH! */
>
> I'm not so sure this is the case.

It's not; we should change the comment. qdisc_destroy calls for inner
leaf qdiscs, coming from the RCU-scheduled __qdisc_destroy ->
qdisc::destroy() -> qdisc_destroy() path, will not hold queue_lock, but
that shouldn't be a problem since those qdiscs cannot be found anymore.

Another case where it's not locked is upon deletion of a class, where
the class deletes its inner qdisc, although there is only one way this
can happen and that is under the rtnl semaphore, so there is no way we
can have a dumper running at the same time.

A wild guess would be that the problem is one of the many places with
wrong locking where dev->queue_lock is acquired but qdisc_tree_lock is
not. tc_dump_qdisc assumes that locking qdisc_tree_lock is enough, which
is absolutely right; still, most callers of qdisc_destroy only take
dev->queue_lock for the modification of the tree, which _was_ fine
before we used list.h since the unlinking was atomic. This is no news,
and Patrick's patch in 2.6.10 ought to have fixed this by unlinking
inner leaf qdiscs from dev->qdisc_list and thus preventing them from
being found by anyone during unlocked calls to qdisc_destroy.

So... we might have an unlocked qdisc_destroy call for a qdisc that was
not properly unlinked from dev->qdisc_list.
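
In other words (an illustrative user-space sketch; the two mutexes are
stand-ins for qdisc_tree_lock and dev->queue_lock, and the function
names are made up): if the dumper serializes on one lock and the
destroyer on another, they never exclude each other, and the dumper can
land on an entry that list_del_init() has just pointed at itself.

==
#include <pthread.h>

struct list_head { struct list_head *next, *prev; };

static pthread_mutex_t tree_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct list_head qdisc_list = { &qdisc_list, &qdisc_list };

/* Dumper: takes only tree_lock (like tc_dump_qdisc). */
static void dump_all(void)
{
	struct list_head *p;

	pthread_mutex_lock(&tree_lock);
	for (p = qdisc_list.next; p != &qdisc_list; p = p->next)
		;	/* read-only walk; spins if p points at itself */
	pthread_mutex_unlock(&tree_lock);
}

/* Destroyer: takes only queue_lock, so it does not exclude dump_all()
 * while it rewrites the next/prev pointers. */
static void destroy_one(struct list_head *e)
{
	pthread_mutex_lock(&queue_lock);
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = e;		/* list_del_init() semantics */
	pthread_mutex_unlock(&queue_lock);
}
==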

2005-04-16 01:52:35

by Herbert Xu

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Fri, Apr 15, 2005 at 10:54:22PM +0000, Thomas Graf wrote:
>
> Another case were it's not locked is upon a deletion of a class where
> the class deletes its inner qdisc, although there is only one case
> how this can happen and that's under rtnl semaphore so there is no
> way we can have a dumper at the same time.

Sorry, that's where tc went astray :)

The assumption that the rtnl is held during dumping is false. It is
only true for the initial dump call. All subsequent dumps are not
protected by the rtnl.

The solution is certainly not to place the dumpers under rtnl :)
The dump operation is read-only and should not need any exclusive
locks.

In fact the whole qdisc locking is a mess. It's a cross-breed
between locking with ad-hoc reference counting and RCU. What's
more, the RCU is completely useless too because for the case
where we actually have a queue we still end up taking the spin
lock on each transmit! I think someone's been benchmarking the
loopback device again :)

It needs to be reengineered.

Here is a quick'n'dirty fix to the problem at hand. What happened
between 2.6.10-rc1 and 2.6.10-rc2 is that qdisc_destroy started changing
the next pointer of qdisc entries, which totally confuses the readers
because qdisc_destroy doesn't always take the tree lock.

This patch tries to ensure that all top-level calls to qdisc_destroy
come under the tree lock. As Thomas correctly pointed out, most of the
other qdisc_destroy calls occur after the top qdisc has been unlinked
from the device qdisc_list. However, someone should go through each one
of the remaining ones (they're all in the individual sch_*
implementations) and make sure that this assumption is really true.

Signed-off-by: Herbert Xu <[email protected]>

If anyone has cycles to spare and a stomach strong enough for
this stuff, here is your chance :)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Attachments:
(No filename) (2.03 kB)
p (966.00 B)

2005-04-16 05:02:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Sat, 2005-04-16 at 11:49 +1000, Herbert Xu wrote:


> Here is a quick'n'dirty fix to the problem at hand. What happened
> between 2.6.10-rc1 and 2.6.10-rc2 is that qdisc_destroy started
> changing the next pointer of qdisc entries which totally confuses
> the readers because qdisc_destroy doesn't always take the tree lock.
>
> This patch tries to ensure that all top-level calls to qdisc_destroy
> come under the tree lock. As Thomas correctedly pointed out, most
> of the other qdisc_destroy calls occur after the top qdisc has been
> unlinked from the device qdisc_list. However, someone should go
> through each one of the remaining ones (they're all in the individual
> sch_* implementations) and make sure that this assumption is really
> true.
>
> Signed-off-by: Herbert Xu <[email protected]>
>
> If anyone has cycles to spare and a stomach strong enough for
> this stuff, here is your chance :)
>

FYI,

I ran the test case that Tarhon-Onu had, but I had to change his tc
invocation from batch mode to single lines since the version of tc I
have segfaults on newlines. Anyway, I did see the lockup with 2.6.11.2
after 7 iterations. I applied your patch, and it ran for 30 iterations
before I manually killed it. I didn't test any more than that, but this
seems to be the quick fix for now.

-- Steve


2005-04-16 11:06:24

by Thomas Graf

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

* Herbert Xu <[email protected]> 2005-04-16 11:49
> On Fri, Apr 15, 2005 at 10:54:22PM +0000, Thomas Graf wrote:
> >
> > Another case were it's not locked is upon a deletion of a class where
> > the class deletes its inner qdisc, although there is only one case
> > how this can happen and that's under rtnl semaphore so there is no
> > way we can have a dumper at the same time.
>
> Sorry, that's where tc went astray :)
>
> The assumption that the rtnl is held during dumping is false. It is
> only true for the initial dump call. All subsequent dumps are not
> protected by the rtnl.

Ahh.. it's the unlocked subsequent dump calls that are _still_ running
when the destroy is invoked. That's where Patrick and I went wrong
when we tried to fix this issue. We set our t0 to qdisc_destroy and
didn't really consider any prior unlocked tasks still running.

> In fact the whole qdisc locking is a mess. It's a cross-breed
> between locking with ad-hoc reference counting and RCU. What's
> more, the RCU is completely useless too because for the case
> where we actually have a queue we still end up taking the spin
> lock on each transmit! I think someone's been benchmarking the
> loopback device again :)

It's not completely useless; it speeds up the deletion of classful
qdiscs having some depth. However, it's not worth the locking trouble,
I guess.

> Here is a quick'n'dirty fix to the problem at hand.

I think it's pretty clean but it doesn't solve the problem completely,
see below.

> This patch tries to ensure that all top-level calls to qdisc_destroy
> come under the tree lock. As Thomas correctedly pointed out, most
> of the other qdisc_destroy calls occur after the top qdisc has been
> unlinked from the device qdisc_list. However, someone should go
> through each one of the remaining ones (they're all in the individual
> sch_* implementations) and make sure that this assumption is really
> true.

qdisc_destroy can still be invoked without qdisc_tree_lock via the
deletion of a class when it calls qdisc_destroy to destroy its
leaf qdisc.

> If anyone has cycles to spare and a stomach strong enough for
> this stuff, here is your chance :)

I will look into this.

2005-04-16 11:15:25

by Herbert Xu

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Sat, Apr 16, 2005 at 01:06:39PM +0200, Thomas Graf wrote:
>
> qdisc_destroy can still be invoked without qdisc_tree_lock via the
> deletion of a class when it calls qdisc_destroy to destroy its
> leaf qdisc.

Indeed. Fortunately HTB seems to be safe, as it calls sch_tree_lock,
which is another name for qdisc_tree_lock. CBQ, on the other hand,
needs a little tweak.

> I will look into this.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-04-16 11:25:47

by Herbert Xu

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Sat, Apr 16, 2005 at 01:06:39PM +0200, Thomas Graf wrote:
>
> It's not completely useless, it speeds up the deletion classful
> qdiscs having some depth. However, it's not worth the locking
> troubles I guess.

RCU is meant to optimise the common reader path. In this case
that's the packet transmission code. Unfortunately it fails
miserably when judged by that criterion.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-04-16 11:34:28

by Thomas Graf

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

* Herbert Xu <[email protected]> 2005-04-16 21:23
> On Sat, Apr 16, 2005 at 01:06:39PM +0200, Thomas Graf wrote:
> >
> > It's not completely useless, it speeds up the deletion classful
> > qdiscs having some depth. However, it's not worth the locking
> > troubles I guess.
>
> RCU is meant to optimise the common reader path. In this case
> that's the packet transmission code. Unfortunately it fails
> miserably when judged by that criterion.

There is one case where it can help latency, which is per-flow qdiscs
or any other scenario implying hundreds or thousands of leaf qdiscs,
where the destruction of one such qdisc tree will take quite some CPU
to traverse all the classes under dev->queue_lock. I don't have any
numbers on this, but I don't completely dislike the method of hiding
the qdiscs under the lock and doing the expensive traversal unlocked.

2005-04-16 16:04:40

by jamal

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Sat, 2005-16-04 at 13:34 +0200, Thomas Graf wrote:
> * Herbert Xu <[email protected]> 2005-04-16 21:23
> > On Sat, Apr 16, 2005 at 01:06:39PM +0200, Thomas Graf wrote:
> > >
> > > It's not completely useless, it speeds up the deletion classful
> > > qdiscs having some depth. However, it's not worth the locking
> > > troubles I guess.
> >
> > RCU is meant to optimise the common reader path. In this case
> > that's the packet transmission code. Unfortunately it fails
> > miserably when judged by that criterion.
>
> There is one case where it can do good for latency which is for
> per flow qdiscs or any other scenarios implying hundreds or
> thousands of leaf qdiscs where a destroyage of one such qdisc
> tree will take up quite some cpu to traverse all the classes
> under dev->queue_lock. I don't have any numbers on this, but
> I don't completely dislike the method of hiding the qdiscs under
> the lock and do the expensive traveling unlocked.

The rule of "optimize for the common case" fails miserably here
because this is not a common case/usage of qdiscs.
I have a feeling, though, that the patch went in due to
dude-optimizing-loopback, as pointed out by Herbert.
It could also be that it was done because RCU-is-so-cool; I don't
recall.
It may be worth reverting to the earlier scheme if it is going to
continue to be problematic.

cheers,
jamal

2005-04-16 18:20:59

by Thomas Graf

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

* jamal <[email protected]> 2005-04-16 12:04
> The rule of "optimize for the common" fails miserably in this case
> because this is not a common case/usage of qdiscs.

I tend to agree. OTOH, I use exactly such setups... ;->

> I have a feeling though that the patch went in due to
> dude-optimizing-loopback as pointed by Herbert.

I checked, it was in fact during the lockless loopback
optimizations.

> Maybe worth reverting to the earlier scheme if it is going to continue
> to be problematic.

Let me first check and see how the locking can best be done; it
doesn't match the principles in sch_generic.c at the moment anyway, so
once we know how to do the locking efficiently and how to remove the
error proneness, we can see whether the optimization fits in without
problems and make a call.

2005-04-17 17:47:31

by Patrick McHardy

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

Herbert Xu wrote:
> On Sat, Apr 16, 2005 at 01:06:39PM +0200, Thomas Graf wrote:
>
>>qdisc_destroy can still be invoked without qdisc_tree_lock via the
>>deletion of a class when it calls qdisc_destroy to destroy its
>>leaf qdisc.
>
> Indeed. Fortuantely HTB seems to be safe as it calls sch_tree_lock
> which is another name for qdisc_tree_lock. CBQ on the other hand
> needs to have a little tweak.

HTB also needs to be fixed. Destruction is usually deferred via the
refcnt until ->put(), and htb_put() doesn't lock the tree. Same for
HFSC and CBQ.
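
A sketch of the pattern described above (hypothetical names, not the
actual HTB/HFSC/CBQ code): when the last reference is dropped in the
class's ->put() handler, the deferred destruction runs without the tree
lock unless the handler takes it explicitly.

==
/* Illustrative only: 'my_class' and 'my_put' are made-up stand-ins for
 * a classful qdisc's class structure and its ->put() class operation. */
struct my_class {
	int		refcnt;
	struct Qdisc	*q;		/* inner (leaf) qdisc */
};

static void my_put(struct Qdisc *sch, unsigned long arg)
{
	struct my_class *cl = (struct my_class *)arg;

	if (--cl->refcnt == 0) {
		/* Destruction was deferred until the last ->put().
		 * Nothing on this path takes sch_tree_lock() /
		 * qdisc_tree_lock, so the destroy below can run while a
		 * dumper is walking dev->qdisc_list. */
		qdisc_destroy(cl->q);
		kfree(cl);
	}
}
==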

Regards
Patrick

2005-04-17 21:41:00

by Herbert Xu

[permalink] [raw]
Subject: Re: ACPI/HT or Packet Scheduler BUG?

On Sun, Apr 17, 2005 at 07:46:16PM +0200, Patrick McHardy wrote:
>
> HTB also needs to be fixed. Destruction is usually defered by the
> refcnt until ->put(), htb_put() doesn't lock the tree. Same for
> HFSC and CBQ.

Yes you're absolutely right.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt