2000-12-16 19:40:01

by Alan

Subject: Linux 2.2.19pre2


Merge the Andrea VM changes. Nothing else is in this pre-patch, so we can be
sure we know if it broke something. The other stuff is still in my in-queue,
so don't resend it.

2.2.19pre2
o Drop the page aging for a moment to merge the
Andrea VM
o Merge Andrea's VM-global patch (Andrea Arcangeli)

2.2.19pre1
o Basic page aging (Neil Schemenauer)
| This is a beginning to trying to get the VM right
| Next stage is to go through Andrea's stuff and sort
| it out the way I want it.
o E820 memory detect backport from 2.4 (Michael Chen)
o Fix cs46xx refusing to run on emachines400 (me)
o Fix parport docs (Tim Waugh)
o Fix USB serial name reporting (me)
o Fix else warning in initio scsi (John Fort)
o Fix incorrect timeout (that couldn't occur
fortunately) in sched.c (Andrew Archibald)
o Fix A20 fix credits (Christian Lademann)
o Support for OnStream SC-x0 tape drives (Willem Riede,
Kurt Garloff)
o Intel 815 added to the AGPGART code (Robert M Love)
o 3Ware scsi fixes (Arnaldo Carvalho de Melo)
o Clean up scsi_init_malloc no mem case (Arnaldo Carvalho de Melo)
o Fix dead module parameter in ip_masq_user.c (Keith Owens)
o Switch max_files and friends to a struct to
be sure they stay together (Tigran Aivazian)
o Update microcode driver (Tigran Aivazian)
o Fix free memory dereference in lance driver (Eli Carter)
o ISOfs fixes (Andries Brouwer)
o Watchdog driver for Advantech boards (Marek Michalkiewicz)
o ISDN updates (Karsten Keil)
o Docs fix (Pavel Rabel)
o wake_one semantics for accept() (Andrew Morton)
o Masquerade updates (Juanjo Ciarlante)
o Add support for long serialnums on the Metricom (Alex Belits)
o Onboard ethernet driver for the Intel 'Panther'
boards (Ard van Breemen, Andries Brouwer)
o VIA686a timer reset to 18Hz background (Vojtech Pavlik)
o 3c527 driver rewrite (Richard Procter)
| This supersedes my driver because
| - it works for more people
| - he has time to use his MCA box to debug it
o Minix subpartition support (Anand Krishnamurthy
Rajeev Pillai)
o Remove unused() crap from DRM. You will need
to hand load agp as well if needed (me)


--
Alan Cox <[email protected]>
Red Hat Kernel Hacker & Linux 2.2 Maintainer
Brainbench MVP for TCP/IP
http://www.linux.org.uk/diary
http://www.brainbench.com


2000-12-17 11:27:47

by Petri Kaukasoina

Subject: Re: Linux 2.2.19pre2

I guess the new memory detect does not work correctly with my old work
horse. It is a 100 MHz pentium with 56 Megs RAM. AMIBIOS dated 10/10/94 with
a version number of 51-000-0001169_00111111-101094-SIS550X-H.

2.2.18 reports:
Memory: 55536k/57344k available (624k kernel code, 412k reserved, 732k data, 40k init)

2.2.19pre2 reports:
Memory: 53000k/54784k available (628k kernel code, 408k reserved, 708k data, 40k init)

57344k is 56 Megs which is correct.
54784k is only 53.5 Megs.

2000-12-17 16:11:17

by Kurt Garloff

Subject: Re: Linux 2.2.19pre2

Hi,

On Sun, Dec 17, 2000 at 12:56:56PM +0200, Petri Kaukasoina wrote:
> I guess the new memory detect does not work correctly with my old work
> horse. It is a 100 MHz pentium with 56 Megs RAM. AMIBIOS dated 10/10/94 with
> a version number of 51-000-0001169_00111111-101094-SIS550X-H.
>
> 2.2.18 reports:
> Memory: 55536k/57344k available (624k kernel code, 412k reserved, 732k data, 40k init)
>
> 2.2.19pre2 reports:
> Memory: 53000k/54784k available (628k kernel code, 408k reserved, 708k data, 40k init)
>
> 57344k is 56 Megs which is correct.
> 54784k is only 53.5 Megs.

It's this patch that changes things for you:
o E820 memory detect backport from 2.4 (Michael Chen)

The E820 memory detection parses a list from the BIOS, which specifies
the amount of memory, holes, reserved regions, ...
Apparently, your BIOS does not do it completely correctly; otherwise you
should have had crashes before ...
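In rough outline, the detection boils down to a walk like the following. This is a simplified sketch with invented struct and function names, not the actual arch/i386 code: only ranges the BIOS marks as usable RAM are counted, so a BIOS that reports an incomplete map makes the kernel see less RAM than is installed.

```c
#include <assert.h>

#define E820_RAM      1   /* usable RAM */
#define E820_RESERVED 2   /* hole / reserved region */

/* Invented, simplified stand-in for the BIOS-provided map entries. */
struct e820_entry {
    unsigned long long addr;  /* start of range, in bytes */
    unsigned long long size;  /* length of range, in bytes */
    int type;
};

/* Highest byte covered by any range the BIOS marked usable.  Memory
 * the BIOS forgot to report simply never shows up here. */
unsigned long long e820_end_of_ram(const struct e820_entry *map, int n)
{
    unsigned long long end_of_ram = 0;
    for (int i = 0; i < n; i++) {
        if (map[i].type != E820_RAM)
            continue;
        unsigned long long end = map[i].addr + map[i].size;
        if (end > end_of_ram)
            end_of_ram = end;
    }
    return end_of_ram;
}
```

For a map like the one Petri's BIOS reports (usable ranges of 0x9fc00 bytes at 0 and 0x3480000 bytes at 0x100000), this gives 0x3580000 bytes, i.e. 54784k, matching the shrunken total in his dmesg.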

Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG SCSI, Security



2000-12-20 11:03:40

by Petri Kaukasoina

Subject: Re: Linux 2.2.19pre2

On Sun, Dec 17, 2000 at 04:38:02PM +0100, Kurt Garloff wrote:
> On Sun, Dec 17, 2000 at 12:56:56PM +0200, Petri Kaukasoina wrote:
> > I guess the new memory detect does not work correctly with my old work
> > horse. It is a 100 MHz pentium with 56 Megs RAM. AMIBIOS dated 10/10/94 with
> > a version number of 51-000-0001169_00111111-101094-SIS550X-H.
> >
> > 2.2.18 reports:
> > Memory: 55536k/57344k available (624k kernel code, 412k reserved, 732k data, 40k init)
> >
> > 2.2.19pre2 reports:
> > Memory: 53000k/54784k available (628k kernel code, 408k reserved, 708k data, 40k init)
> >
> > 57344k is 56 Megs which is correct.
> > 54784k is only 53.5 Megs.
>
> It's this patch that changes things for you:
> o E820 memory detect backport from 2.4 (Michael Chen)
>
> The E820 memory detection parses a list from the BIOS, which specifies
> the amount of memory, holes, reserved regions, ...
> Apparently, your BIOS does not do it completely correctly; otherwise you
> should have had crashes before ...

OK, I booted 2.4.0-test12 which even prints that list:

BIOS-provided physical RAM map:
BIOS-e820: 000000000009fc00 @ 0000000000000000 (usable)
BIOS-e820: 0000000000000400 @ 000000000009fc00 (reserved)
BIOS-e820: 0000000003480000 @ 0000000000100000 (usable)
BIOS-e820: 0000000000100000 @ 00000000fec00000 (reserved)
BIOS-e820: 0000000000100000 @ 00000000fee00000 (reserved)
BIOS-e820: 0000000000010000 @ 00000000ffff0000 (reserved)
Memory: 52232k/54784k available (831k kernel code, 2164k reserved, 62k data, 168k init, 0k highmem)

The last three reserved lines correspond to the missing 2.5 Megs. What are
they? 2.2.18 sees all 56 Megs and works ok and after adding mem=56M on the
kernel command line even 2.2.19pre2 works ok with all the 56 Megs. No
crashes.

2000-12-20 11:16:54

by Kurt Garloff

Subject: Re: Linux 2.2.19pre2

On Wed, Dec 20, 2000 at 12:32:54PM +0200, Petri Kaukasoina wrote:
> OK, I booted 2.4.0-test12 which even prints that list:
>
> BIOS-provided physical RAM map:
> BIOS-e820: 000000000009fc00 @ 0000000000000000 (usable)
> BIOS-e820: 0000000000000400 @ 000000000009fc00 (reserved)
> BIOS-e820: 0000000003480000 @ 0000000000100000 (usable)
> BIOS-e820: 0000000000100000 @ 00000000fec00000 (reserved)
> BIOS-e820: 0000000000100000 @ 00000000fee00000 (reserved)
> BIOS-e820: 0000000000010000 @ 00000000ffff0000 (reserved)
> Memory: 52232k/54784k available (831k kernel code, 2164k reserved, 62k data, 168k init, 0k highmem)
>
> The last three reserved lines correspond to the missing 2.5 Megs. What are
> they?

Data reserved by your system for whatever purpose.
Most probably ACPI data or similar.

> 2.2.18 sees all 56 Megs and works ok and after adding mem=56M on the
> kernel command line even 2.2.19pre2 works ok with all the 56 Megs. No
> crashes.

If you had ACPI events, you would probably run into trouble.
Apart from this, chances are that the reserved data is not needed by Linux
and never accessed by the BIOS, so you may get away with using the reserved
memory.
The safe way is to respect the BIOS' RAM map.

Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG SCSI, Security



2000-12-20 14:00:01

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sat, Dec 16, 2000 at 07:11:47PM +0000, Alan Cox wrote:
> o E820 memory detect backport from 2.4 (Michael Chen)

It's broken, it will crash machines:

	for (i = 0; i < e820.nr_map; i++) {
		unsigned long start, end;
		/* RAM? */
		if (e820.map[i].type != E820_RAM)
			continue;
		start = PFN_UP(e820.map[i].addr);
		end = PFN_DOWN(e820.map[i].addr + e820.map[i].size);
		if (start >= end)
			continue;
		if (end > max_pfn)
			max_pfn = end;
	}
	memory_end = (max_pfn << PAGE_SHIFT);

this will treat non-RAM holes as RAM. On 2.4.x we do a different thing, that
is we collect the max_pfn but then we don't assume that there are no holes
between 1M and max_pfn ;), we instead fill the bootmem allocator _only_ with
E820_RAM segments.
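The fixed approach can be sketched like so (a simplified model with invented helper names; the real code hands each range to the bootmem allocator): only E820_RAM segments reach the allocator, so a hole below max_pfn is never treated as usable.

```c
#include <assert.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)
#define PFN_UP(x)   ((unsigned long)(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT))
#define PFN_DOWN(x) ((unsigned long)((x) >> PAGE_SHIFT))

#define E820_RAM      1
#define E820_RESERVED 2

struct e820_entry {
    unsigned long long addr, size;
    int type;
};

/* Hand each RAM segment to the allocator individually and return the
 * number of pages registered.  Holes and reserved regions are skipped
 * instead of being assumed to be RAM up to max_pfn. */
unsigned long register_ram_pages(const struct e820_entry *map, int n,
                                 void (*free_range)(unsigned long, unsigned long))
{
    unsigned long total = 0;
    for (int i = 0; i < n; i++) {
        if (map[i].type != E820_RAM)
            continue;
        unsigned long start = PFN_UP(map[i].addr);
        unsigned long end   = PFN_DOWN(map[i].addr + map[i].size);
        if (start >= end)
            continue;
        free_range(start, end);   /* stand-in for the bootmem calls */
        total += end - start;
    }
    return total;
}

/* no-op callback, so the walk can be exercised in isolation */
void noop_free(unsigned long s, unsigned long e) { (void)s; (void)e; }
```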

I was in the process of fixing this (I also just backported the thinkpad
%edx clobber fix), but if somebody is going to work on this please let
me know so we stay in sync.

> o wake_one semantics for accept() (Andrew Morton)

I dislike the implementation. I stick with my faster and nicer implementation
that has been included in aa kernels for some time (2.2.18aa2 included it, for
example). Andrew, I guess you didn't notice I had already implemented wake-one for
accept. (I just ported it on top of 2.2.19pre2 after backing out the wake-one in
pre2.)

Andrea

2000-12-20 15:23:17

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> > o wake_one semantics for accept() (Andrew Morton)
>
> I dislike the implementation. I stick with my faster and nicer implementation
> that has been included in aa kernels for some time (2.2.18aa2 included it, for
> example). Andrew, I guess you didn't notice I had already implemented wake-one for
> accept. (I just ported it on top of 2.2.19pre2 after backing out the wake-one in
> pre2.)

Yes, I noticed your implementation a few weeks back.

'fraid I never liked the use of the TASK_EXCLUSIVE bit in
task_struct.state. It's an enumerated state, not an aggregation
of flags. Most of the kernel treats it as an enumerated state
and so will happily clear the TASK_EXCLUSIVE bit without masking it
off. Fragile.

If a task is on two waitqueues at the same time it becomes a bug:
if the outer waitqueue is non-exclusive and the inner is exclusive,
the task suddenly becomes exclusive on the outer one and converts
it from wake-all to wake-some!

Is it faster? Not sure. You've saved a cacheline read in __wake_up
(which was in fact a preload, if you look at what comes later) at the
cost of having to mask out the bit in current->state every time
we schedule().
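The fragility argument above can be reduced to a toy model (names and values here are illustrative, not the real 2.2 definitions): any code that assigns an enumerated state outright silently wipes a flag that was OR-ed into the same field.

```c
#include <assert.h>

/* Toy model of keeping an "exclusive" flag inside the state word.
 * Values are illustrative, not the real 2.2 definitions. */
#define TASK_RUNNING       0
#define TASK_INTERRUPTIBLE 1
#define TASK_EXCLUSIVE     0x40000000   /* flag OR-ed into the state */

struct toy_task { long state; };

/* a waiter marks itself exclusive before sleeping */
void toy_sleep_exclusive(struct toy_task *t)
{
    t->state = TASK_INTERRUPTIBLE | TASK_EXCLUSIVE;
}

/* most code treats state as an enumeration and assigns it outright,
 * which silently clears the flag without ever masking it off */
void toy_set_running(struct toy_task *t)
{
    t->state = TASK_RUNNING;
}

int toy_is_exclusive(const struct toy_task *t)
{
    return (t->state & TASK_EXCLUSIVE) != 0;
}
```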

Anyway, it's academic. davem would prefer that we do it properly
and move the `exclusive' flag into the waitqueue head to avoid the
task-on-two-waitqueues problem, as was done in 2.4. I think he's
right. Do you?

-

2000-12-20 15:56:04

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Thu, Dec 21, 2000 at 01:57:15AM +1100, Andrew Morton wrote:
> If a task is on two waitqueues at the same time it becomes a bug:
> if the outer waitqueue is non-exclusive and the inner is exclusive,

Your 2.2.x won't allow that either. You set `current->task_exclusive = 1'
and so you will get exclusive wakeups in both waitqueues. You simply cannot
register in two waitqueues and expect a non-exclusive wakeup in one and an
exclusive wakeup in the other one.

The only design difference (not implementation difference) between my patch and
your patch is that you have to clear task_exclusive explicitly. You
moved the EXCLUSIVE bitflag out of the current->state field. That gives no
advantage and it looks uglier to me. The robustness point doesn't hold
IMHO: as soon as current->state has been changed by somebody else
you don't care anymore whether it was an exclusive wakeup or not.

About the fact that I mask out the exclusive bit in schedule: that's zero cost
compared to a cacheline miss. It also depends on whether you have more wakeups or
schedules (with accept(2) usage there are going to be more schedules than
wakeups, but in other usages that might not be the case), but ok, the
performance point was nearly irrelevant ;).

> Anyway, it's academic. davem would prefer that we do it properly
> and move the `exclusive' flag into the waitqueue head to avoid the
> task-on-two-waitqueues problem, as was done in 2.4. I think he's

The fact that you could mix non-exclusive and exclusive wakeups in the same waitqueue
was a feature, not a misfeature. Then of course you cannot register in two
waitqueues, one with wake-one and one with wake-all, but who does that anyway?
Definitely not an issue for 2.2.x.

I think the real reason for separating the two things, as davem proposed, is
that otherwise we cannot register for a LIFO wake-one in O(1) as we needed
for accept.

Other thing about your patch: adding TASK_EXCLUSIVE to
wake_up/wake_up_interruptible is useless. You kind of mixed the two things at
the source level; in your patch TASK_EXCLUSIVE should not be defined. Last
thing: the wmb() in accept wasn't necessary. At that point you don't care at all
what the wakeup can see.

Andrea

2000-12-20 18:22:28

by Rik van Riel

Subject: Re: Linux 2.2.19pre2

On Wed, 20 Dec 2000, Andrea Arcangeli wrote:
> On Thu, Dec 21, 2000 at 01:57:15AM +1100, Andrew Morton wrote:
> > If a task is on two waitqueues at the same time it becomes a bug:
> > if the outer waitqueue is non-exclusive and the inner is exclusive,
>
> Your 2.2.x won't allow that either. You set
> `current->task_exclusive = 1' and so you will get exclusive
> wakeups in both waitqueues. You simply cannot register in
> two waitqueues and expect a non-exclusive wakeup in one and an
> exclusive wakeup in the other one.

Why not? Having a wake-all wakeup on one event and a
wake-one wakeup on another kind of event seems perfectly
reasonable to me at a first glance. Is there something
I've overlooked ?

> > Anyway, it's academic. davem would prefer that we do it properly
> > and move the `exclusive' flag into the waitqueue head to avoid the
> > task-on-two-waitqueues problem, as was done in 2.4. I think he's
>
> The fact you could mix non-exclusive and exlusive wakeups in the
> same waitqueue was a feature not a misfeature. Then of course
> you cannot register in two waitqueues one with wake-one and one
> with wake-all but who does that anyways?

The "who does that anyways" argument could also be said about
mixing exclusive and non-exclusive wakeups in the same waitqueue.
Doing this seems rather confusing ... would you know any application
which needs this ?

> I think the real reason for separating the two things as davem
> proposed is that otherwise we cannot register for a LIFO
> wake-one in O(1) as we needed for accept.

Do you have any reason to assume Davem and Linus lied about
why they changed the semantics? ;) I'm pretty sure the reason
why Linus and Davem changed the semantics is the same reason
they mailed to this list ... ;)


I think it would be good to have the same semantics in 2.2 as
we have in 2.4. Using different semantics will probably only
lead to confusion. If there is a good reason to go with the
semantics Andrea proposed over the semantics Linus and Davem
have in 2.4, we should probably either change 2.4 or use the
2.4 semantics in 2.2 anyway just to avoid the confusion...

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-20 18:42:37

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Wed, Dec 20, 2000 at 03:48:06PM -0200, Rik van Riel wrote:
> On Wed, 20 Dec 2000, Andrea Arcangeli wrote:
> > On Thu, Dec 21, 2000 at 01:57:15AM +1100, Andrew Morton wrote:
> > > If a task is on two waitqueues at the same time it becomes a bug:
> > > if the outer waitqueue is non-exclusive and the inner is exclusive,
> >
> > Your 2.2.x won't allow that either. You set
> > `current->task_exclusive = 1' and so you will get exclusive
> > wakeups in both waitqueues. You simply cannot register in
> > two waitqueues and expect a non-exclusive wakeup in one and an
> > exclusive wakeup in the other one.
>
> Why not? Having a wake-all wakeup on one event and a
> wake-one wakeup on another kind of event seems perfectly
> reasonable to me at a first glance. Is there something
> I've overlooked ?

I think you overlooked something, never mind. I only said kernel 2.2.19pre2 won't allow
that either. I'm not saying it's impossible to implement or unreasonable.

> > > Anyway, it's academic. davem would prefer that we do it properly
> > > and move the `exclusive' flag into the waitqueue head to avoid the
> > > task-on-two-waitqueues problem, as was done in 2.4. I think he's
> >
> > The fact that you could mix non-exclusive and exclusive wakeups in the
> > same waitqueue was a feature, not a misfeature. Then of course
> > you cannot register in two waitqueues, one with wake-one and one
> > with wake-all, but who does that anyway?
>
> The "who does that anyways" argument could also be said about
> mixing exclusive and non-exclusive wakeups in the same waitqueue.
> Doing this seems rather confusing ... would you know any application
> which needs this ?

wake-one accept(2) in 2.2.x, unless you want to rewrite part of the TCP stack to
still get select wake-all right (the reason they were mixed throughout 2.3.x is
the same reason we _need_ to somehow mix non-exclusive and exclusive waiters in
the same waitqueue in 2.2.x to provide wake-one in accept).

And in 2.2.x the "who does that anyway" argument is much stronger, since _only_
accept is using the exclusive wakeup.

> I think it would be good to have the same semantics in 2.2 as
> we have in 2.4. [..]

2.3.0 was born to allow O(1) insertion in the waitqueue, because that change
alone generated a huge amount of details that were not possible to handle in
2.2.x.

> [..] If there is a good reason to go with the
> semantics Andrea proposed [..]

NOTE: I'm only talking about 2.2.19pre2, not 2.4.x. I didn't suggest anything
for 2.4.x and I'm totally fine with two different waitqueues. I even wanted to
differentiate them too in mid 2.3.x, to fix accept being FIFO while still
allowing insertion in the waitqueue in O(1), but I didn't have the time to rework
the TCP stack and the rest of the wake-one users.

Andrea

2000-12-21 11:06:25

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> The fact that you could mix non-exclusive and exclusive wakeups in the same waitqueue
> was a feature, not a misfeature. Then of course you cannot register in two
> waitqueues, one with wake-one and one with wake-all, but who does that anyway?
> Definitely not an issue for 2.2.x.

Definitely? Let's think about that.

> I think the real reason for separating the two things as davem proposed is
> that otherwise we cannot register for a LIFO wake-one in O(1) as we needed
> for accept.

Yes. In other words, if we try to do O(1) LIFO, we can cause lost wakeups.

> Other thing about your patch, adding TASK_EXCLUSIVE to
> wake_up/wake_up_interruptible is useless.

This enables wake_up_all().



Anyway, this is all just noise.

The key question is: which of the following do we want?

a) A simple, specific accept()-accelerator, and 2.2 remains without
an exclusive wq API or

b) A general purpose exclusive wq mechanism which does not correctly
support waiting on two queues simultaneuously where one is
exclusive or

c) A general purpose exclusive wq mechanism which _does_ support it.

Each choice has merit! You seem to want b). davem wants c).

And given that 2.2 has maybe 2-4 years life left in it, I'd
agree with David. Let's do it once and do it right while the issue
is fresh in our minds.

Yes?

-

2000-12-21 15:51:01

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Thu, Dec 21, 2000 at 09:38:43PM +1100, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > The fact that you could mix non-exclusive and exclusive wakeups in the same waitqueue
> > was a feature, not a misfeature. Then of course you cannot register in two
> > waitqueues, one with wake-one and one with wake-all, but who does that anyway?
> > Definitely not an issue for 2.2.x.
>
> Definitely? Let's think about that.

If you can name one showstopper bottleneck in 2.2.x that needs to register
in two waitqueues at the same time, one exclusive and one non-exclusive, then I
will think about it.

> > I think the real reason for separating the two things as davem proposed is
> > that otherwise we cannot register for a LIFO wake-one in O(1) as we needed
> > for accept.
>
> Yes. In other words, if we try to do O(1) LIFO, we can cause lost wakeups.

We can do LIFO in O(N) just fine, no lost wakeups with the "mix" way. The only
problem is additional complexity in the insertion operation.

> > Other thing about your patch, adding TASK_EXCLUSIVE to
> > wake_up/wake_up_interruptible is useless.
>
> This enables wake_up_all().

It is useless as it is in 2.2.19pre2: there's no wake_up_all in 2.2.19pre2.

> Anyway, this is all just noise.
>
> The key question is: which of the following do we want?
>
> a) A simple, specific accept()-accelerator, and 2.2 remains without
> an exclusive wq API or

To make the accelerator we need minimal wake-one support. So a) doesn't
make sense to me.

> b) A general purpose exclusive wq mechanism which does not correctly
> support waiting on two queues simultaneously where one is
> exclusive or

That's what we had throughout 2.3.x, and it is better suited for 2.2.x
as it allows doing a) with an obviously right approach and minimal effort.

> c) A general purpose exclusive wq mechanism which _does_ support it.
>
> Each choice has merit! You seem to want b). davem wants c).

I have not read any email from DaveM proposing c) for 2.2.19pre3 (or 2.2.x in
general). Are you sure he wasn't talking about 2.4.x?

BTW, you also implemented b) in 2.2.19pre2 ;)

> And given that 2.2 has maybe 2-4 years life left in it, I'd

I hope not ;).

> agree with David. Let's do it once and do it right while the issue
> is fresh in our minds.

I disagree. The changes to separate the two waitqueues, even only for accept, are
not suitable for 2.2.x. Alan's filter should forbid those changes. They're simply
not worthwhile because I cannot even think of one bottleneck-showstopper place
that registers in two waitqueues and wants wake-one from one and wake-all from
the other in 2.2.x.

Andrea

2000-12-21 17:38:51

by Rik van Riel

Subject: Re: Linux 2.2.19pre2

On Thu, 21 Dec 2000, Andrea Arcangeli wrote:

> > The key question is: which of the following do we want?
> >
> > a) A simple, specific accept()-accelerator, and 2.2 remains without
> > an exclusive wq API or
>
> To make the accelerator we need minimal wake-one support. So a) doesn't
> make sense to me.
>
> > b) A general purpose exclusive wq mechanism which does not correctly
> > support waiting on two queues simultaneously where one is
> > exclusive or
>
> That's what we had throughout 2.3.x, and it is better suited for 2.2.x
> as it allows doing a) with an obviously right approach and minimal effort.
>
> > c) A general purpose exclusive wq mechanism which _does_ support it.
> >
> > Each choice has merit! You seem to want b). davem wants c).
>
> I have not read any email from DaveM proposing c) for
> 2.2.19pre3 (or 2.2.x in general). Are you sure he wasn't talking
> about 2.4.x?

c) will also implement a) in an obviously right and simple way.

I've still not seen ANY reason why we'd want 2.2 to have different
wake-one semantics from 2.4...

> > And given that 2.2 has maybe 2-4 years life left in it, I'd
>
> I hope not ;).

People are still converting their 2.0 systems to 2.2 as
we speak. I really doubt that 2.2 has _less_ than 3 years
of life left ... as much as I hate this idea ;)

> > agree with David. Let's do it once and do it right while the issue
> > is fresh in our minds.
>
> I disagree. The changes to separate the two waitqueues, even only
> for accept, are not suitable for 2.2.x. Alan's filter should forbid
> those changes. They're simply not worthwhile because I cannot
> even think of ...

They're not worthwhile just because you can't think of an example ?

The same could be said of `b)' above. Do you have an example where
that is the preferable semantics ?

If both are equally preferable (ie. nobody can think of any example
where the corner case is being used), why do you keep insisting on
giving 2.2 different wake-one semantics from 2.4 ?

[these wake-one semantics may become rather important for people
backporting drivers to 2.2 over the next years ... or to something
else which nobody has even thought about ... 2-4 years is a long
time, so 2.2 maintainability is still an issue]

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-21 18:15:42

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Thu, Dec 21, 2000 at 03:07:08PM -0200, Rik van Riel wrote:
> c) will also implement a) in an obviously right and simple way.

So go ahead. If you think it's so simple and obviously right, you can post
a patch here against 2.2.19pre2 that implements C), to show real facts.

My B is here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre2/wake-one-2

Then we will see how obviously right and simple C) is compared to B).

I don't need to see C) implemented to judge how obviously right and simple
it is, but if you think I'm wrong again: go ahead.

It would also be nice if you could show a real life
showstopper-production-bottleneck where we need C) to fix it. I cannot see any
useful usage of C in production 2.2.x.

Having the same waitqueues in 2.2.x and 2.4.x is an irrelevant point (keeping
the same API and semantics is much better than anything else for 2.2.x, unless
there's some serious showstopper that isn't possible to fix with B), and that I
still cannot see).

People backporting drivers from 2.4.x will use wake-all, as they had to do
during the whole of 2.3.x; that's obviously safe and trivial. If they know what
they're doing they can also use the 2.2.x wake-one API, if their task is
registered in only 1 waitqueue (as in 99% of the usages I'm aware of, given
the whole of 2.3.x implemented B too).

Andrea

2000-12-21 18:26:46

by Rik van Riel

Subject: Re: Linux 2.2.19pre2

On Thu, 21 Dec 2000, Andrea Arcangeli wrote:

> It would also be nice if you could show a real life
> showstopper-production-bottleneck where we need C) to fix it. I
> cannot see any useful usage of C in production 2.2.x.

Me neither. I'm just wondering why the 2.2 semantics
would be different from the 2.4 ones while we still have the choice
of just cut'n'pasting over the (working) code from 2.4...

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-12-21 20:54:22

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Wed, Dec 20, 2000 at 02:28:58PM +0100, Andrea Arcangeli wrote:
> I was in the process of fixing this (I also just backported the thinkpad
> %edx clobber fix), but if somebody is going to work on this please let
> me know so we stay in sync.

Ok, this should fix the e820 memory detection, against 2.2.19pre2:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre2/e820-fix-1

While fixing the code I noticed some bugs were inherited from 2.4.x, so I forward
ported the fixes to 2.4.0-test13-pre3:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.0-test13-pre3/e820-fix-1

I also include them below so they're handy for Linus:

diff -urN 2.4.0-test13-pre3/arch/i386/kernel/setup.c 2.4.0-test13-pre3-e820/arch/i386/kernel/setup.c
--- 2.4.0-test13-pre3/arch/i386/kernel/setup.c	Thu Dec 14 22:33:59 2000
+++ 2.4.0-test13-pre3-e820/arch/i386/kernel/setup.c	Thu Dec 21 21:12:47 2000
@@ -477,7 +477,7 @@
 		if (start < 0x100000ULL && end > 0xA0000ULL) {
 			if (start < 0xA0000ULL)
 				add_memory_region(start, 0xA0000ULL-start, type);
-			if (end < 0x100000ULL)
+			if (end <= 0x100000ULL)
 				continue;
 			start = 0x100000ULL;
 			size = end - start;
@@ -518,7 +518,8 @@
 
 		e820.nr_map = 0;
 		add_memory_region(0, LOWMEMSIZE(), E820_RAM);
-		add_memory_region(HIGH_MEMORY, mem_size << 10, E820_RAM);
+		add_memory_region(HIGH_MEMORY, (mem_size << 10)-HIGH_MEMORY,
+				  E820_RAM);
 	}
 	printk("BIOS-provided physical RAM map:\n");
 	print_memory_map(who);


The above patches don't include the thinkpad fix from Marc Joosen;
a backport of that bugfix is available separately here (because it's
orthogonal to the other bugfixes):

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre2/thinkpad-e820-mjoosen-1

Andrea

2000-12-22 07:59:07

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> > > Other thing about your patch, adding TASK_EXCLUSIVE to
> > > wake_up/wake_up_interruptible is useless.
> >
> > This enables wake_up_all().
>
> It is useless as it is in 2.2.19pre2: there's no wake_up_all in 2.2.19pre2.

#define wake_up_all(x) __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE)

> > Anyway, this is all just noise.
> >
> > The key question is: which of the following do we want?
> >
> > a) A simple, specific accept()-accelerator, and 2.2 remains without
> > an exclusive wq API or
>
> To make the accellerator we need a minimal wake-one support. So a) doesn't
> make sense to me.

It makes heaps of sense. We've introduced into 2.2 an API
which has the same appearance as one in 2.4, but which is
subtly broken wrt the 2.4 one.

I suggest you change the names to something other than
add_waitqueue_exclusive() and TASK_EXCLUSIVE, add a
cautionary comment and then go ahead with your patch.

Except for this bit, which looks slightly fatal:

	/*
	 * We can drop the read-lock early if this
	 * is the only/last process.
	 */
	if (next == head) {
		read_unlock(&waitqueue_lock);
		wake_up_process(p);
		goto out;
	}

Once the waitqueue_lock has been dropped, the task at `p'
is free to remove itself from the waitqueue and exit. This
CPU can then try to wake up a non-existent task, no?
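The hazard can be seen in a toy single-threaded model (invented names): dereferencing the waiter is only safe while the lock still pins it on the queue, i.e. the wakeup must happen before the unlock.

```c
#include <assert.h>

/* Toy model (invented names): `pinned` stands for "the queue lock is
 * held, so this waiter cannot dequeue itself and exit".  The quoted
 * code unpins (unlocks) first and only then dereferences p, which is
 * exactly the race. */
struct toy_waiter {
    int pinned;
    int woken;
};

/* Safe ordering: wake the process before dropping the lock. */
void toy_wake_then_unlock(struct toy_waiter *p)
{
    assert(p->pinned);   /* lock held: p is guaranteed to still exist */
    p->woken = 1;        /* wake_up_process(p) under the lock */
    p->pinned = 0;       /* read_unlock(&waitqueue_lock) */
}
```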

-

2000-12-22 13:51:19

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Fri, Dec 22, 2000 at 06:33:00PM +1100, Andrew Morton wrote:
> add_waitqueue_exclusive() and TASK_EXCLUSIVE, add a

There's no add_waitqueue_exclusive in my patch.

> Except for this bit, which looks slightly fatal:
>
> 	/*
> 	 * We can drop the read-lock early if this
> 	 * is the only/last process.
> 	 */
> 	if (next == head) {
> 		read_unlock(&waitqueue_lock);
> 		wake_up_process(p);
> 		goto out;
> 	}
>
> Once the waitqueue_lock has been dropped, the task at `p'
> is free to remove itself from the waitqueue and exit. This
> CPU can then try to wake up a non-existent task, no?

Yes, that was an unlikely-to-happen SMP race inherited from 2.2.18 and all
previous 2.2.x vanilla kernels. Thanks.

Andrea

2000-12-23 07:22:24

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> [ 2.2 waitqueue stuff ]
>

Andrea, it occurs to me...

__wake_up() is using read_lock(waitqueue_lock). This means that
two CPUs could simultaneously run __wake_up(). They could both
find the same task and they could both try to wake it up. The
net result would be two wakeup events, but only one task woken
up. That's a lost wakeup.

Now, it's probably the case that the 2.2 network serialisation
prevents this from happening - I haven't looked. If not,
then we need to use spin_lock_irqsave.

Alternatively, we can teach wake_up_process to return a success
value if it successfully moved a task onto the runqueue. Test that
in __wake_up and keep on going if it's zero.
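That second alternative can be sketched as follows (a single-threaded toy model with invented names; in the real kernel the test-and-set in wake_up_process would have to run under the runqueue lock so two CPUs cannot both "succeed" on the same task):

```c
#include <assert.h>

#define TASK_RUNNING       0
#define TASK_INTERRUPTIBLE 1

struct toy_task { int state; };

/* returns 1 only if this call actually moved the task to the runqueue */
int toy_wake_up_process(struct toy_task *p)
{
    if (p->state == TASK_RUNNING)
        return 0;               /* already woken: no wakeup delivered */
    p->state = TASK_RUNNING;
    return 1;
}

/* exclusive wakeup: keep scanning until one wakeup is really
 * delivered, so a concurrent wakeup of the same waiter cannot
 * swallow this wakeup event */
int toy_wake_one(struct toy_task **queue, int n)
{
    for (int i = 0; i < n; i++)
        if (toy_wake_up_process(queue[i]))
            return i;           /* index of the waiter we woke */
    return -1;                  /* nobody needed waking */
}
```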

2.4 needs this as well, BTW, to fix an SMP race where a task is on two
waitqueues at the same time. I did a patch which took the "wake_up_process
returns success value" approach and it felt clean. I haven't submitted
it due to general end-of-year patch exhaustion :)

If we elect to not address this problem in 2.2 and to rely upon the network
locking then we need to put a BIG FAT warning in the code somewhere, because
if someone else tries to use the new wake-one capability, they may not be
so lucky.

-

2000-12-23 18:43:13

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sat, Dec 23, 2000 at 05:56:42PM +1100, Andrew Morton wrote:
> If we elect to not address this problem in 2.2 and to rely upon the network

I see. There are two races:

1) race inside __wake_up when it's run on the same waitqueue: 2.2.19pre3
is affected as well as 2.2.18aa2, and 2.4.x is affected
as well when #defining USE_RW_WAIT_QUEUE_SPINLOCK to 1.
2) race between two parallel __wake_up running on different waitqueues
(both 2.2.x and 2.4.x are affected)

1) could be fixed trivially by making the waitqueue_lock a spinlock, but
this way doesn't solve 2). And if we solve 2) properly, then 1) gets fixed
as well.

I agree the right fix for 2) (and in turn for 1) ) is to count only the
exclusive wake_up_process() calls that actually move the task onto the
runqueue; if the task was already on the runqueue we must not consider it
an exclusive wakeup (so we'll try again to wake the next exclusive-wakeup
waiter). This will fix both races. Since the fix is self-contained in
__wake_up it's fine for 2.2.19pre3 as well, and we can keep using a
read-write lock.

Those races are of course orthogonal to the issue we discussed previously
in this thread: a task registered in two waitqueues, wanting an exclusive
wakeup from one waitqueue and a wake-all from the other (to address that
we need to move the wake-one information from the task struct to the
waitqueue_head, and I still think that shouldn't be addressed in 2.2.x;
2.2.x is fine with per-task-struct wake-one information).

Should I take care of the 2.2.x fix, or will you take care of it? I'm not
using the wake-one patch in 2.2.19pre3 because I don't like it (starting
from the useless wmb() in accept), so if you want to take care of
2.2.19pre3 yourself I'd suggest applying the wake-one patch in my
ftp-patch area against 2.2.19pre3 first. Otherwise give me an ack and I'll
extend my wake-one patch myself to ignore the wake_up_process() calls that
don't move the task onto the runqueue.

Thanks,
Andrea

2000-12-24 00:49:29

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> On Sat, Dec 23, 2000 at 05:56:42PM +1100, Andrew Morton wrote:
> > If we elect to not address this problem in 2.2 and to rely upon the network
>
> I see. There are two races:
>
> 1) race inside __wake_up when it's run on the same waitqueue: 2.2.19pre3
> is affected as well as 2.2.18aa2, and 2.4.x is affected
> as well when #defining USE_RW_WAIT_QUEUE_SPINLOCK to 1.

mm... I think we should kill the USE_RW_WAIT_QUEUE_SPINLOCK option.

> 2) race between two parallel __wake_up running on different waitqueues
> (both 2.2.x and 2.4.x are affected)
>
> 1) could be fixed trivially by making the waitqueue_lock a spinlock, but
> this way doesn't solve 2). And if we solve 2) properly, then 1) gets fixed as
> well.

I don't understand the problem with 2) in 2.2. Every task on both
waitqueues gets woken up. Won't it sort itself out OK?

For 2.4, 2) is an issue because we can have tasks on two waitqueues at the
same time, with a mix of exclusive and non. Putting a global spinlock
into __wake_up_common would fix it, but was described as "horrid" by
you-know-who :)

> I agree the right fix for 2) (and in turn for 1) ) is to count the number of
> exclusive wake_up_process that moves the task in the runqueue, if the task was
> just in the runqueue we must not consider it as an exclusive wakeup (so in turn
> we'll try again to wakeup the next exclusive-wakeup waiter). This will
> fix both races. Since the fix is self contained in __wake_up it's fine
> for 2.2.19pre3 as well and we can keep using a read_write lock then.

I really like this approach. It fixes another problem in 2.4:

Example:

static struct request *__get_request_wait(request_queue_t *q, int rw)
{
        register struct request *rq;
        DECLARE_WAITQUEUE(wait, current);

        add_wait_queue_exclusive(&q->wait_for_request, &wait);
        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                /* WINDOW HERE */
                spin_lock_irq(&io_request_lock);
                rq = get_request(q, rw);
                spin_unlock_irq(&io_request_lock);
                if (rq)
                        break;
                generic_unplug_device(q);
                schedule();
        }
        remove_wait_queue(&q->wait_for_request, &wait);
        current->state = TASK_RUNNING;
        return rq;
}

If this task enters the schedule() and is then woken, and another
wakeup is sent to the waitqueue while this task is executing in
the marked window, __wake_up_common() will try to wake this
task a second time and will then stop looking for tasks to wake.

The outcome: two wakeups sent to the queue, but only one task woken.

I haven't thought about it super-hard, but I think that if
__wake_up_common's exclusive-mode handling were changed
as you describe, so that it keeps on scanning the queue until it has
*definitely* moved a task onto the runqueue then this
problem goes away.

> Those races of course are orthogonal with the issue we discussed previously
> in this thread: a task registered in two waitqueues and wanting an exclusive
> wakeup from one waitqueue and a wake-all from the other waitqueue (for
> addressing that we need to move the wake-one information from the task struct
> to the waitqueue_head and I still think that shouldn't be addressed in 2.2.x,
> 2.2.x is fine with a per-task-struct wake-one information)

OK by me, as long as people don't incautiously start using the
capability for other things.

> Should I take care of the 2.2.x fix, or will you take care of it? I'm not using
> the wake-one patch in 2.2.19pre3 because I don't like it (starting from the
> useless wmb() in accept) so if you want to take care of 2.2.19pre3 yourself I'd
> suggest to apply the wake-one patch against 2.2.19pre3 in my ftp-patch area
> first. Otherwise give me an ack and I'll extend myself my wake-one patch to
> ignore the wake_up_process()es that doesn't move the task in the runqueue.

ack.

I'll take another look at the 2.4 patch and ask you to review that
when I've finished with the netdevice wetworks, if that's
OK.

-

2000-12-24 01:24:49

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sun, Dec 24, 2000 at 11:23:33AM +1100, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 1) could be fixed trivially by making the waitqueue_lock a spinlock, but
> > this way doesn't solve 2). And if we solve 2) properly, then 1) gets fixed as

BTW (following up on myself), actually making the lock a spinlock (not a
read-write lock) would fix 2) as well (waitqueue_lock is global in 2.2.x;
I was thinking of the per-waitqueue lock of 2.4.x ;).

> > well.
>
> I don't understand the problem with 2) in 2.2? Every task on both waitqueues
> gets woken up. Won't it sort itself out OK?

Not every task, if it's a wake-one on both waitqueues. The problem should
be the same in 2.2.x and 2.4.x. But whether such usage makes sense is
uncertain...

> For 2.4, 2) is an issue because we can have tasks on two waitqueues at the
> same time, with a mix of exclusive and non. Putting a global spinlock
> into __wake_up_common would fix it, but was described as "horrid" by
> you-know-who :)

Yes. And that wouldn't fix the race number 3) below.

> > I agree the right fix for 2) (and in turn for 1) ) is to count the number of
> > exclusive wake_up_process that moves the task in the runqueue, if the task was
> > just in the runqueue we must not consider it as an exclusive wakeup (so in turn
> > we'll try again to wakeup the next exclusive-wakeup waiter). This will
> > fix both races. Since the fix is self contained in __wake_up it's fine
> > for 2.2.19pre3 as well and we can keep using a read_write lock then.
>
> I really like this approach. It fixes another problem in 2.4:
>
> Example:
>
> static struct request *__get_request_wait(request_queue_t *q, int rw)
> {
>         register struct request *rq;
>         DECLARE_WAITQUEUE(wait, current);
>
>         add_wait_queue_exclusive(&q->wait_for_request, &wait);
>         for (;;) {
>                 __set_current_state(TASK_UNINTERRUPTIBLE);
>                 /* WINDOW HERE */
>                 spin_lock_irq(&io_request_lock);
>                 rq = get_request(q, rw);
>                 spin_unlock_irq(&io_request_lock);

note that the above is racy and can lose a wakeup: 2.4.x needs
set_current_state there (not __set_current_state). spin_lock isn't a
two-way barrier, it only forbids stuff leaking out of the critical
section. So on some architectures (though not the alpha, for example) the
CPU could reorder the code this way:

spin_lock_irq()
rq = get_request
__set_current_state
spin_unlock_irq

That is, the CPU would invert the order of the operations. This needs to
be fixed too (luckily it's a one-liner).

>                 if (rq)
>                         break;
>                 generic_unplug_device(q);
>                 schedule();
>         }
>         remove_wait_queue(&q->wait_for_request, &wait);
>         current->state = TASK_RUNNING;
>         return rq;
> }
>
> If this task enters the schedule() and is then woken, and another
> wakeup is sent to the waitqueue while this task is executing in
> the marked window, __wake_up_common() will try to wake this
> task a second time and will then stop looking for tasks to wake.
>
> The outcome: two wakeups sent to the queue, but only one task woken.

Correct.

And btw, this race is new: it must have been introduced in late
2.4.0-test1X or so. I'm sure it couldn't happen in any 2.3.x or
2.4.0-testX, because there the wakeup was atomically clearing the
exclusive bit from task->state.

Still talking about late 2.4.x changes: why has add_wait_queue_exclusive
gone into kernel/fork.c?? That's obviously not the right place :).

> I haven't thought about it super-hard, but I think that if
> __wake_up_common's exclusive-mode handling were changed
> as you describe, so that it keeps on scanning the queue until it has
> *definitely* moved a task onto the runqueue then this
> problem goes away.

Yes, that's true.

> > Those races of course are orthogonal with the issue we discussed previously
> > in this thread: a task registered in two waitqueues and wanting an exclusive
> > wakeup from one waitqueue and a wake-all from the other waitqueue (for
> > addressing that we need to move the wake-one information from the task struct
> > to the waitqueue_head and I still think that shouldn't be addressed in 2.2.x,
> > 2.2.x is fine with a per-task-struct wake-one information)
>
> OK by me, as long as people don't uncautiously start using the
> capability for other things.
>
> > Should I take care of the 2.2.x fix, or will you take care of it? I'm not using
> > the wake-one patch in 2.2.19pre3 because I don't like it (starting from the
> > useless wmb() in accept) so if you want to take care of 2.2.19pre3 yourself I'd
> > suggest to apply the wake-one patch against 2.2.19pre3 in my ftp-patch area
> > first. Otherwise give me an ack and I'll extend myself my wake-one patch to
> > ignore the wake_up_process()es that don't move the task onto the runqueue.
>
> ack.
>
> I'll take another look at the 2.4 patch and ask you to review that
> when I've finished with the netdevice wetworks, if that's
> OK.

OK. Thanks for the help.

Andrea

2000-12-24 02:54:40

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> ...
> >                 if (rq)
> >                         break;
> >                 generic_unplug_device(q);
> >                 schedule();
> >         }
> >         remove_wait_queue(&q->wait_for_request, &wait);
> >         current->state = TASK_RUNNING;
> >         return rq;
> > }
> >
> > If this task enters the schedule() and is then woken, and another
> > wakeup is sent to the waitqueue while this task is executing in
> > the marked window, __wake_up_common() will try to wake this
> > task a second time and will then stop looking for tasks to wake.
> >
> > The outcome: two wakeups sent to the queue, but only one task woken.
>
> Correct.
>
> And btw such race is new and it must been introduced in late 2.4.0-test1X or
> so, I'm sure it couldn't happen in whole 2.3.x and 2.4.0-testX because the
> wakeup was clearing atomically the exclusive bit from the task->state.
>

This could happen with the old scheme where exclusiveness
was stored in the task, not the waitqueue.

From test4:

        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE | TASK_EXCLUSIVE);
                /* WINDOW */
                spin_lock_irq(&io_request_lock);
                rq = get_request(q, rw);
                spin_unlock_irq(&io_request_lock);
                if (rq)
                        break;
                generic_unplug_device(q);
                schedule();
        }

As soon as we set TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE, this
task becomes visible to __wake_up_common and can soak
up a second wakeup. I assume this hasn't been a reported problem
because request queues get lots of wakeups sent to them?

Still, changing the wakeup code in the way we've discussed
seems the way to go. It also makes the extra wake_up()
at the end of x86's __down() unnecessary, which is a small
performance win - semaphores are currently wake-two.

But Linus had a different reason why that wakeup was there.
Need to dig out the email and stare at it. But I don't see a
good reason to muck with the semaphores at this time.

-

2000-12-24 04:52:40

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sun, Dec 24, 2000 at 01:28:59PM +1100, Andrew Morton wrote:
> This could happen with the old scheme where exclusiveness
> was stored in the task, not the waitqueue.
>
> From test4:
>
>         for (;;) {
>                 __set_current_state(TASK_UNINTERRUPTIBLE | TASK_EXCLUSIVE);
>                 /* WINDOW */
>                 spin_lock_irq(&io_request_lock);
>                 rq = get_request(q, rw);
>                 spin_unlock_irq(&io_request_lock);
>                 if (rq)
>                         break;
>                 generic_unplug_device(q);
>                 schedule();
>         }
>
> As soon as we set TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE, this
> task becomes visible to __wake_up_common and can soak
> up a second wakeup. [..]

I disagree: no wakeup could be lost in test4 in the above section of code
(besides the __set_current_state that should be set_current_state ;), and
that's not an issue on either x86 or alpha anyway, where spin_lock is a
full memory barrier.

Let's examine what happens in test4:

1) before the `for' loop, the task is already visible to __wake_up_common

2) as soon as we set the task state to TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE
the task _starts_ asking for an exclusive wakeup from __wake_up_common

At this point consider two different cases:

3a) assume we get a wakeup in the /* WINDOW */: the task->state becomes
runnable from under us (no problem, that's expected). And if we really got
a wakeup in /* WINDOW */, then get_request will also return a valid rq,
we'll break the loop and we'll use the resource. No wakeup is lost in
this case. So far so good. This was the one-wakeup case.

3b) Assume _two_ wakeups happen in /* WINDOW */: the first one just makes
us RUNNABLE as in the 3a) case; the second one will wake up any _other_
wake-one task _after_ us in the wait queue (normal FIFO order), because
the first wakeup implicitly made us RUNNABLE (no longer TASK_EXCLUSIVE,
so we were no longer asking for an exclusive wakeup).

The task after us may as well be running in /* WINDOW */, and in that
case it will likewise get an `rq' from get_request (as in case 3a,
without having to sleep). All right.

No wakeup could be lost in any case. Two wakeups happened and two tasks
got their I/O requests, even though both wakeups happened in /* WINDOW */.
The example with N wakeups and N tasks is equivalent.

This race was introduced in test1X, because there the task keeps asking
for an exclusive wakeup even after getting a wakeup, until it has time to
clean up and unregister itself explicitly.

And btw, 2.2.19pre3 has the same race, while the alternate wake-one
implementation in 2.2.18aa2 doesn't have it (for the same reasons).

And /* WINDOW */ is not the only window for this race: the window is the
whole section where the task is registered in the waitqueue in exclusive
mode:

add_wait_queue_exclusive
/* all is WINDOW here */
remove_wait_queue

Until remove_wait_queue runs explicitly in the task's context, any second
wakeup will keep waking up only that task (and so will be lost). It
should therefore trigger very easily under high load, since two wakeups
will happen much faster than the task can be rescheduled to run
remove_wait_queue and clean up.

Any deadlocks in test1X could be very well explained by this race condition.

> Still, changing the wakeup code in the way we've discussed
> seems the way to go. [..]

I agree. I'm quite convinced it's the right way too, and of course I see
why we moved the exclusive information into the waitqueue instead of
keeping it in the task struct: to provide sane semantics in the long
run ;).

Andrea

2000-12-24 05:42:53

by Andrew Morton

Subject: Re: Linux 2.2.19pre2

Andrea Arcangeli wrote:
>
> On Sun, Dec 24, 2000 at 01:28:59PM +1100, Andrew Morton wrote:
> > This could happen with the old scheme where exclusiveness
> > was stored in the task, not the waitqueue.
> >
> > From test4:
> >
> >         for (;;) {
> >                 __set_current_state(TASK_UNINTERRUPTIBLE | TASK_EXCLUSIVE);
> >                 /* WINDOW */
> >                 spin_lock_irq(&io_request_lock);
> >                 rq = get_request(q, rw);
> >                 spin_unlock_irq(&io_request_lock);
> >                 if (rq)
> >                         break;
> >                 generic_unplug_device(q);
> >                 schedule();
> >         }
> >
> > As soon as we set TASK_UNINTERRUPTIBLE|TASK_EXCLUSIVE, this
> > task becomes visible to __wake_up_common and can soak
> > up a second wakeup. [..]
>
> I disagree: no wakeup could be lost in test4 in the above section of code
> (besides the __set_current_state that should be set_current_state ;), and
> that's not an issue on either x86 or alpha anyway, where spin_lock is a
> full memory barrier.
>
> Let's examine what happens in test4:
>
> [ snip ]

I was talking about a different scenario:

        add_wait_queue_exclusive(&q->wait_for_request, &wait);
        for (;;) {
                __set_current_state(TASK_UNINTERRUPTIBLE);
                /* WINDOW */
                spin_lock_irq(&io_request_lock);
                rq = get_request(q, rw);
                spin_unlock_irq(&io_request_lock);
                if (rq)
                        break;
                generic_unplug_device(q);
                schedule();
        }
        remove_wait_queue(&q->wait_for_request, &wait);

Suppose there are two tasks sleeping in the schedule().

A wakeup comes. One task wakes. It loops around and reaches
the window. At this point in time, another wakeup gets sent
to the waitqueue. It gets directed to the task which just
woke up! It should have been directed to a sleeping task,
not the current one which just *looks* like it's sleeping,
because it's in state TASK_UNINTERRUPTIBLE.

This can happen in test4.

> This race is been introduced in test1X because there the task keeps asking for
> an exclusive wakeup even after getting a wakeup: until it has the time to
> cleanup and unregister explicitly.

No, I think it's the same in test4 and test1X. In current kernels
it's still the case that the woken task gets atomically switched into
state TASK_RUNNING, and then becomes "hidden" from the wakeup code.

The problem is, the task bogusly sets itself back into
TASK_UNINTERRUPTIBLE for a very short period and becomes
eligible for another wakeup.

There's not much which ll_rw_blk.c can do about this.

> And btw, 2.2.19pre3 has the same race, while the alternate wake-one
> implementation in 2.2.18aa2 doesn't have it (for the same reasons).

The above scenario can happen in all kernels.

> And /* WINDOW */ is not the only window for such race: the window is the whole
> section where the task is registered in the waitqueue in exclusive mode:
>
> add_wait_queue_exclusive
> /* all is WINDOW here */
> remove_wait_queue
>
> Until remove_wait_queue runs explicitly in the task context any second wakeup
> will keep waking up only such task (so in turn it will be lost). So it should
> trigger very easily under high load since two wakeups will happen much faster
> compared to the task that needs to be rescheduled in order run
> remove_wait_queue and cleanup.
>
> Any deadlocks in test1X could be very well explained by this race condition.

Yes, but I haven't seen a "stuck in D state" report for a while.
I assume this is because this waitqueue gets lots of wakeups sent to it.

The ones in networking I expect are protected by other locking
which serialises the wakeups.

The one in the x86 semaphores avoids it by sending an extra wakeup
to the waitqueue on the way out.

> > Still, changing the wakeup code in the way we've discussed
> > seems the way to go. [..]
>
> I agree. I'm quite convinced it's right way too and of course I see why we
> moved the exclusive information in the waitqueue instead of keeping it in the
> task struct to provide sane semantics in the long run ;).

Linus suggested at one point that we clear the waitqueue's
WQ_FLAG_EXCLUSIVE bit when we wake it up, so it falls back
to non-exclusive mode. This was to address one of the
task-on-two-waitqueues problems...

-

2000-12-24 15:13:52

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sun, Dec 24, 2000 at 04:17:10PM +1100, Andrew Morton wrote:
> I was talking about a different scenario:
>
>         add_wait_queue_exclusive(&q->wait_for_request, &wait);
>         for (;;) {
>                 __set_current_state(TASK_UNINTERRUPTIBLE);
>                 /* WINDOW */
>                 spin_lock_irq(&io_request_lock);
>                 rq = get_request(q, rw);
>                 spin_unlock_irq(&io_request_lock);
>                 if (rq)
>                         break;
>                 generic_unplug_device(q);
>                 schedule();
>         }
>         remove_wait_queue(&q->wait_for_request, &wait);
>
> Suppose there are two tasks sleeping in the schedule().
>
> A wakeup comes. One task wakes. It loops aound and reaches
> the window. At this point in time, another wakeup gets sent
> to the waitqueue. It gets directed to the task which just
> woke up![..]

Ok, this is a very minor window compared to the one I described, but yes,
that could happen in test4 too.

> I assume this is because this waitqueue gets lots of wakeups sent to it.

It only gets the strictly necessary number of wakeups.

> Linus suggested at one point that we clear the waitqueue's
> WQ_FLAG_EXCLUSIVE bit when we wake it up, [..]

.. and then set it again after checking whether a new request is
available, just before schedule(). That would avoid the above race (and
the one I mentioned in a previous email), but it doesn't address the lost
wakeups, for example when setting USE_RW_WAIT_QUEUE_SPINLOCK to 1.

Counting as wakeups only the ones that move the task onto the runqueue
will get rid of the races altogether, and it looks right conceptually, so
I prefer it.

Andrea

2000-12-24 16:10:51

by Andrea Arcangeli

Subject: Re: Linux 2.2.19pre2

On Sun, Dec 24, 2000 at 11:23:33AM +1100, Andrew Morton wrote:
> ack.

This patch against 2.2.19pre3 should fix all races. (note that wait->flags
doesn't need to be initialized in the critical section in test1X too)

ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre3/wake-one-3

Comments?

Andrea

2000-12-29 17:41:01

by Andrea Arcangeli

Subject: wake-one-3 bug (affected 2.2.19pre3aa[123])

On Sun, Dec 24, 2000 at 04:40:09PM +0100, Andrea Arcangeli wrote:
> On Sun, Dec 24, 2000 at 11:23:33AM +1100, Andrew Morton wrote:
> > ack.
>
> This patch against 2.2.19pre3 should fix all races. (note that wait->flags
> doesn't need to be initialized in the critical section in test1X too)
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre3/wake-one-3
>
> Comments?

Whoops, it had a bug (I overlooked the use of __add_wait_queue in
sleep_on). The bug was reproduced and fixed by Chris Mason, and his fix
is obviously right; see:

--- 2.2.19pre3aa3/kernel/sched.c.~1~ Wed Dec 27 04:49:37 2000
+++ 2.2.19pre3aa3/kernel/sched.c Fri Dec 29 17:03:09 2000
@@ -1018,6 +1018,7 @@

#define SLEEP_ON_HEAD \
wait.task = current; \
+ wait.flags = 0; \
write_lock_irqsave(&waitqueue_lock,flags); \
__add_wait_queue(p, &wait); \
write_unlock(&waitqueue_lock);

New patch (revision n.4) against vanilla 2.2.19pre3 is here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.19pre3/wake-one-4

Since 2.2.19pre3aa[123] included the wake-one-3 patch, they can also
leave tasks in D state blocked in sleep_on*. So if you're running
2.2.19pre3aa[123] you should either upgrade to 2.2.19pre3aa4 or apply the
above inlined one-liner on top of the 2.2.19pre3aa[123] sources and
recompile (just running make bzImage again will be enough).

Andrea