2001-12-06 19:01:52

by Marcelo Tosatti

Subject: Linux 2.4.17-pre5


I'm going to release -pre versions more often from now on so people can
"see" what I'm doing with less latency: I hope that makes developers'
lives easier.

So here goes pre5 with quite some changes...

pre5:

- 8139too fixes (Andreas Dilger)
- sym53c8xx_2 update (Gerard Roudier)
- loopback deadlock bugfix (Jan Kara)
- Yet another devfs update (Richard Gooch)
- Enable K7 SSE (John Clemens)
- Make grab_cache_page return NULL instead of
ERR_PTR: callers expect NULL on failure (Christoph Hellwig)
- Make ide-{disk,floppy} compile without
PROCFS support (Robert Love)
- Another ymfpci update (Pete Zaitcev)
- indent NCR5380.{c,h}, g_NCR5380.{c,h}, plus
NCR5380 fix (Alan Cox)
- SPARC32/64 update (David S. Miller)
- Fix atyfb warnings (David S. Miller)
- Make bootmem init code correctly align
bootmem data (David S. Miller)
- Networking updates (David S. Miller)
- Fix scanning luns > 7 on SCSI-3 devices (Michael Clark)
- Add sparse lun hint for Chaparral G8324
Fibre-SCSI controller (Michael Clark)
- Really apply sg changes (me)
- Parport updates (Tim Waugh)
- ReiserFS updates (Vladimir V. Saveliev)
- Make AGP code scan all kinds of devices:
they are not always video ones (Alan Cox)
- EXPORT_NO_SYMBOLS in floppy.c (Alan Cox)
- Pentium IV Hyperthreading support (Alan Cox)

pre4:

- Added missing tcp_diag.c and tcp_diag.h (me)

pre3:

- Enable ppro errata workaround (Dave Jones)
- Update tmpfs documentation (Christoph Rohland)
- Fritz!PCIv2 ISDN card support (Kai Germaschewski)
- Really apply ymfpci changes (Pete Zaitcev)
- USB update (Greg KH)
- Adds detection of more eepro100 cards (Troy A. Griffitts)
- Make ftruncate64() compliant with SuS (Andrew Morton)
- ATI64 fb driver update (Geert Uytterhoeven)
- Coda fixes (Jan Harkes)
- devfs update (Richard Gooch)
- Fix ad1848 breakage in -pre2 (Alan Cox)
- Network updates (David S. Miller)
- Add cramfs locking (Christoph Hellwig)
- Move locking of page_table_lock on expand_stack
before accessing any vma field (Manfred Spraul)
- Make time monotonic with gettimeofday (Andi Kleen)
- Add MODULE_LICENSE(GPL) to ide-tape.c (Mikael Pettersson)
- Minor cs46xx ioctl fix (Thomas Woller)

pre2:

- Remove userland header from bonding driver (David S. Miller)
- Create a SLAB for page tables on i386 (Christoph Hellwig)
- Unregister devices at shaper unload time (David S. Miller)
- Remove several unused variables from various
places in the kernel (David S. Miller)
- Fix slab code to not blindly trust cc_data():
it may be not valid on some platforms (David S. Miller)
- Fix RTC driver bug (David S. Miller)
- SPARC 32/64 update (David S. Miller)
- W9966 V4L driver update (Jakob Jemi)
- ad1848 driver fixes (Alan Cox/Daniel T. Cobra)
- PCMCIA update (David Hinds)
- Fix PCMCIA problem with multiple PCI busses (Paul Mackerras)
- Correctly free per-process signal struct (Dave McCracken)
- IA64 PAL/signal headers cleanup (Nathan Myers)
- ymfpci driver cleanup (Pete Zaitcev)
- Change NLS "licenses" to be "GPL/BSD" instead
of only BSD (Robert Love)
- Fix serial module use count (Russell King)
- Update sg to 3.1.22 (Douglas Gilbert)
- ieee1394 update (Ben Collins)
- ReiserFS fixes (Nikita Danilov)
- Update ACPI documentation (Patrick Mochel)
- Smarter atime update (Andrew Morton)
- Correctly mark ext2 sb as dirty and sync it (Andrew Morton)
- IrDA update (Jean Tourrilhes)
- Count locked buffers at
balance_dirty_state(): Helps interactivity under
heavy IO workloads (Andrew Morton)
- USB update (Greg KH)
- ide-scsi locking fix (Christoph Hellwig)

pre1:

- Change USB maintainer (Greg Kroah-Hartman)
- Speeling fix for rd.c (From Ralf Baechle's tree)
- Updated URL for bigphysmem patch in v4l docs (Adrian Bunk)
- Add buggy 440GX to broken pirq blacklist (Arjan Van de Ven)
- Add new entry to Sound blaster ISAPNP list (Arjan Van de Ven)
- Remove crap character from Configure.help (Niels Kristian Bech Jensen)
- Backout erroneous change to lookup_exec_domain (Christoph Hellwig)
- Update osst sound driver to 1.65 (Willem Riede)
- Fix i810 sound driver problems (Andris Pavenis)
- Add AF_LLC define in network headers (Arnaldo Carvalho de Melo)
- block_size cleanup on some SCSI drivers (Erik Andersen)
- Added missing MODULE_LICENSE("GPL") in some
modules (Andreas Krennmair)
- Add ->show_options() to super_ops and
implement NFS method (Alexander Viro)
- Updated i8k driver (Massimo Dal Zoto)
- devfs update (Richard Gooch)



2001-12-06 18:59:02

by Alan

Subject: Re: Linux 2.4.17-pre5

> - Pentium IV Hyperthreading support (Alan Cox)

Actually that one is various Intel people not me 8)

2001-12-06 19:57:41

by Matthias Andree

Subject: Re: Linux 2.4.17-pre5

On Thu, 06 Dec 2001, Marcelo Tosatti wrote:

> pre5:
...
> - Networking updates (David S. Miller)

Would you deem it feasible to elaborate on these? "Networking updates"
is quite opaque and does not carry any information useful to me (at
least). Or is there a place I haven't known of until now where I can
get that information?

2001-12-06 20:15:21

by Rik van Riel

Subject: Re: Linux 2.4.17-pre5

On Thu, 6 Dec 2001, Matthias Andree wrote:
> On Thu, 06 Dec 2001, Marcelo Tosatti wrote:
>
> > pre5:
> ...
> > - Networking updates (David S. Miller)
>
> Would you deem it feasible to elaborate on these?
> Or is there a place I haven't known of until now where I can
> get that information?

Wait for Dave at the lokal Krispy Kream and don't allow
him access to the donuts until he writes a changelog ?

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-06 20:24:25

by Jeff Garzik

Subject: Re: Linux 2.4.17-pre5

Rik van Riel wrote:
>
> On Thu, 6 Dec 2001, Matthias Andree wrote:
> > On Thu, 06 Dec 2001, Marcelo Tosatti wrote:
> >
> > > pre5:
> > ...
> > > - Networking updates (David S. Miller)
> >
> > Would you deem it feasible to elaborate on these?
> > Or is there a place I haven't known of until now where I can
> > get that information?
>
> Wait for Dave at the lokal Krispy Kream and don't allow
> him access to the donuts until he writes a changelog ?

Or maybe "rc2log | fmt -72 | less"

changelog is in cvs after all... http://vger.samba.org/

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-12-06 20:24:23

by Marcelo Tosatti

Subject: Re: Linux 2.4.17-pre5



On Thu, 6 Dec 2001, Matthias Andree wrote:

> On Thu, 06 Dec 2001, Marcelo Tosatti wrote:
>
> > pre5:
> ...
> > - Networking updates (David S. Miller)
>
> Would you deem it feasible to elaborate on these? "Networking updates"
> is quite opaque and does not carry any information useful to me (at
> least). Or is there a place I haven't known of until now where I can
> get that information?

I could just put that information together with the changelog, but then it
becomes fscking huge.

Ask davem for more details about it...

2001-12-06 20:46:57

by Luca Montecchiani

Subject: Re: Linux 2.4.17-pre5

Hisax compile fix :

--- drivers/isdn/hisax/config.c.orig Thu Dec 6 21:34:23 2001
+++ drivers/isdn/hisax/config.c Thu Dec 6 21:34:31 2001
@@ -485,7 +485,7 @@
if (strlen(str) < HISAX_IDSIZE)
strcpy(HiSaxID, str);
else
- printk(KERN_WARNING "HiSax: ID too long!")
+ printk(KERN_WARNING "HiSax: ID too long!");
} else
strcpy(HiSaxID, "HiSax");


ciao,
luca
--
----------------------------------------------------------
Luca Montecchiani <[email protected]>
http://www.geocities.com/montecchiani
SpeakFreely:sflwl -hlwl.fourmilab.ch luca@ ICQ:17655604
-------------------=(Linux since 1995)=-------------------

2001-12-06 21:01:13

by David Miller

Subject: Re: Linux 2.4.17-pre5

From: Rik van Riel <[email protected]>
Date: Thu, 6 Dec 2001 18:14:23 -0200 (BRST)

Wait for Dave at the lokal Krispy Kream and don't allow
him access to the donuts until he writes a changelog ?

Actually, Marcelo gets a full changelog.

2001-12-06 21:15:03

by Ben Greear

Subject: Re: Linux 2.4.17-pre5

Perhaps Dave could summarize it in < 50 lines. That would
be a whole heap better than having to read the patch to try
to figure out what changed....

Marcelo Tosatti wrote:

>
> On Thu, 6 Dec 2001, Matthias Andree wrote:
>
>
>>On Thu, 06 Dec 2001, Marcelo Tosatti wrote:
>>
>>
>>>pre5:
>>>
>>...
>>
>>>- Networking updates (David S. Miller)
>>>
>>Would you deem it feasible to elaborate on these? "Networking updates"
>>is quite opaque and does not carry any information useful to me (at
>>least). Or is there a place I haven't known of until now where I can
>>get that information?
>>
>
> I could just put that information together with the changelog, but then it
> becomes fscking huge.
>
> Ask davem for more details about it...


--
Ben Greear <[email protected]> <Ben_Greear AT excite.com>
President of Candela Technologies Inc http://www.candelatech.com
ScryMUD: http://scry.wanfear.com http://scry.wanfear.com/~greear


2001-12-06 21:59:36

by David Miller

Subject: Re: Linux 2.4.17-pre5

From: Ben Greear <[email protected]>
Date: Thu, 06 Dec 2001 14:14:12 -0700

Perhaps Dave could summarize it in < 50 lines. That would
be a whole heap better than having to read the patch to try
to figure out what changed....

My summary to him was much less than 50 lines.
In fact, here it is:

1) DecNET doc and code fixes from its maintainer, Steven
Whitehouse.

2) You accidentally reverted earlier socket.h LLC additions.
I assume it's because the networking patch I sent you
had it, yet it was already in your tree, and when Patch
complained you told it "treat as -R". :(

This should fix that.

3) VLAN fixes, in particular stop OOPS on module unload.
Also fix the build when VLAN is non-modular.

4) ip_fw_compat_redir can lose its timer, fix from netfilter
maintainers.

5) ipt_unclean module handles ECN bits incorrectly.
Fix from netfilter maintainers.

6) Ipv4 TCP error handling looks up listening socket
children incorrectly. src/dest need to be reversed
in such cases.

IPv6 has the same bug, but Alexey needs some more time
to clean up that stuff.

7) SunRPC's csum_partial_copy_to_page_cache() does not handle
odd lengths correctly. Checksums need to be combined
using csum_block_add() and friends in order to handle this
odd length case.
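
Item 7 may be easier to follow with a small userspace illustration. The sketch below is not the kernel code: fold(), partial_sum() and block_add() are hypothetical helpers that only mimic the idea behind csum_block_add(), namely that a partial ones-complement sum computed for a block starting at an odd byte offset has to be byte-swapped before it is added to the running total.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a wide accumulator down to a 16-bit ones-complement sum. */
static uint16_t fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum a buffer as big-endian 16-bit words, as if it started at an even
 * offset; a trailing odd byte is padded with a zero low byte. */
static uint16_t partial_sum(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint64_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint64_t)buf[i] << 8;
    return fold(sum);
}

/* Add the partial sum of a block that starts `offset` bytes into the
 * message. At an odd offset every byte of the block sits in the other
 * half of its 16-bit word, so the partial sum is byte-swapped first. */
static uint16_t block_add(uint16_t total, uint16_t block, size_t offset)
{
    if (offset & 1)
        block = (uint16_t)((block << 8) | (block >> 8));
    return fold((uint64_t)total + block);
}
```

For the bytes 01..07 split after the third byte, adding the two partial sums directly gives a different result from summing the whole buffer, while block_add() with offset 3 reproduces it.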

2001-12-06 22:25:12

by Matthias Andree

Subject: Re: Linux 2.4.17-pre5

On Thu, 06 Dec 2001, David S. Miller wrote:

> My summary to him was much less than 50 lines.
> In fact, here it is:

<snip>

Thanks a lot.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-12-06 23:39:57

by Dave Jones

Subject: Re: Linux 2.4.17-pre5

On Thu, 6 Dec 2001, Alan Cox wrote:

> > - Pentium IV Hyperthreading support (Alan Cox)
> Actually that one is various Intel people not me 8)

Wouldn't it be better to see such things proven right in 2.5 first ?

Random things like this still appearing in 2.4 that haven't shown
up in 2.5 yet is a little disturbing. Ok its small, and there'll be
more 2.4pre's to get it right if anything is wrong with it, but
the whole forward-porting features thing just seems so... backwards.

regards,
Dave.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2001-12-07 00:00:37

by Alan

Subject: Re: Linux 2.4.17-pre5

> > Actually that one is various Intel people not me 8)
>
> Wouldn't it be better to see such things proven right in 2.5 first ?

o 2.5 isnt going to be usable for that kind of thing in the near future
o There is no code that is "new" for normal paths (in fact Marcelo
wanted a change for the only "definitely harmless" one there was)
>
> Random things like this still appearing in 2.4 that haven't shown
> up in 2.5 yet is a little disturbing. Ok its small, and there'll be

It isnt viable to do driver work or small test critical work in 2.5 yet. The
same happened in 2.2/2.3 so I'm not too worried

2001-12-07 00:30:54

by Stephan von Krawczynski

Subject: Re: Linux 2.4.17-pre5

> Hisax compile fix :
>
> --- drivers/isdn/hisax/config.c.orig Thu Dec 6 21:34:23 2001
> +++ drivers/isdn/hisax/config.c Thu Dec 6 21:34:31 2001
> @@ -485,7 +485,7 @@
> if (strlen(str) < HISAX_IDSIZE)
> strcpy(HiSaxID, str);
> else
> - printk(KERN_WARNING "HiSax: ID too long!")
> + printk(KERN_WARNING "HiSax: ID too long!");
> } else
> strcpy(HiSaxID, "HiSax");
>
>

Ah, shit. Thanks luca, this was my fault. Never cut'n'paste via mouse
on important occasions.
Sorry guys, this was my fault, not Marcelo's.

BTW, for the further ongoing of this patch, I ran into the question if


MODULE_PARM(type, "1-(16)i");

would be a valid statement. I guess not. But if not, could some kind
soul please explain to me how to get rid of the braces "(" ")" given
in definitions from CONFIG stuff.

E.g.:

CONFIG_ME_BEING_DUMB (16)

entering above MODULE_PARM construction via a definition. I searched
the source tree a bit, but did not find any hints.

Regards,
Stephan

PS: I _will_ re-check the next patches several times, I swear ... :-)




2001-12-07 03:43:47

by Keith Owens

Subject: Re: Linux 2.4.17-pre5

On Fri, 07 Dec 2001 01:12:53 +0100,
Stephan von Krawczynski <[email protected]> wrote:
>BTW, for the further ongoing of this patch, I ran into the question if
>
>MODULE_PARM(type, "1-(16)i");
>
>would be a valid statement. I guess not. But if not, could some kind
>soul please explain to me how to get rid of the braces "(" ")" given
>in definitions from CONFIG stuff.
>
>E.g.:
>
>CONFIG_ME_BEING_DUMB (16)

Don't do that. CONFIG numbers are expected to be plain numbers, not
expressions, e.g. CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16, not (16).
Given the fragility of CML1 I would not be surprised if (16) broke some
of the shell scripts.


2001-12-07 11:56:22

by Stephan von Krawczynski

Subject: Re: Linux 2.4.17-pre5

On Fri, 07 Dec 2001 14:43:06 +1100
Keith Owens <[email protected]> wrote:

> On Fri, 07 Dec 2001 01:12:53 +0100,
> Stephan von Krawczynski <[email protected]> wrote:
> >BTW, for the further ongoing of this patch, I ran into the question if
> >
> >MODULE_PARM(type, "1-(16)i");
> >
> >would be a valid statement. I guess not. But if not, could some kind
> >soul please explain to me how to get rid of the braces "(" ")" given
> >in definitions from CONFIG stuff.
> >
> >E.g.:
> >
> >CONFIG_ME_BEING_DUMB (16)
>
> Don't do that. CONFIG numbers are expected to be plain numbers, not
> expressions, e.g. CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16, not (16).
> Given the fragility of CML1 I would not be surprised if (16) broke some
> of the shell scripts.

Huh!!

There is a problem: I made a (really small) patch to Config.in saying:

int ' Maximum number of cards supported by HiSax' CONFIG_HISAX_MAX_CARDS 8

If I check this in the source, it gives me CONFIG_HISAX_MAX_CARDS as (8)

Can you check this out please. It doesn't look like I could do anything against
this.
How do you make your CONFIG-definitions come back without "()" ?

Regards,
Stephan

2001-12-07 13:35:56

by Keith Owens

Subject: Re: Linux 2.4.17-pre5

On Fri, 7 Dec 2001 12:55:30 +0100,
Stephan von Krawczynski <[email protected]> wrote:
>There is a problem: I made a (really small) patch to Config.in saying:
>
> int ' Maximum number of cards supported by HiSax' CONFIG_HISAX_MAX_CARDS 8
>
>If I check this in the source, it gives me CONFIG_HISAX_MAX_CARDS as (8)

Yuck! CML1 outputs integers as #define CONFIG_foo (number) instead of
just number. CML2 does not do that, I was looking at CML2. Add this
to drivers/isdn/Makefile

CFLAGS_foo.o += -DMAX_CARDS=$(subst (,,$(subst ),,$(CONFIG_HISAX_MAX_CARDS)))

In foo.c, use MAX_CARDS instead of CONFIG_HISAX_MAX_CARDS. Change foo
to the name of the object that you are working on. When you build, it
should say -DMAX_CARDS=8.

2001-12-07 14:26:04

by Stephan von Krawczynski

Subject: Re: Linux 2.4.17-pre5

On Sat, 08 Dec 2001 00:35:14 +1100
Keith Owens <[email protected]> wrote:

> On Fri, 7 Dec 2001 12:55:30 +0100,
> Stephan von Krawczynski <[email protected]> wrote:
> >There is a problem: I made a (really small) patch to Config.in saying:
> >
> > int ' Maximum number of cards supported by HiSax' CONFIG_HISAX_MAX_CARDS 8
> >
> >If I check this in the source, it gives me CONFIG_HISAX_MAX_CARDS as (8)
>
> Yuck! CML1 outputs integers as #define CONFIG_foo (number) instead of
> just number. CML2 does not do that, I was looking at CML2. Add this
> to drivers/isdn/Makefile
>
> CFLAGS_foo.o += -DMAX_CARDS=$(subst (,,$(subst ),,$(CONFIG_HISAX_MAX_CARDS)))
>
> In foo.c, use MAX_CARDS instead of CONFIG_HISAX_MAX_CARDS. Change foo
> to the name of the object that you are working on. When you build, it
> should say -DMAX_CARDS=8.

Thanks for this hint, but it is not all that easy. Problem is the definition is
needed for _all_ files in the hisax-subtree, to be more precise for all
currently including hisax.h. I am not very fond of the idea to add additional
conditions to the availability of the HISAX_MAX_CARDS symbol, especially if
they are located in the Makefile.
Anyway, how would you generate this additional -D for all files inside a
certain directory? Obviously the stuff should at least be put inside the
hisax-Makefile, and not one layer above in isdn-Makefile.

I tried "CFLAGS += ..." but that does not work.

Thanks for help,
Stephan

2001-12-07 16:40:43

by Stephan von Krawczynski

Subject: Re: Linux 2.4.17-pre5 / Have fun with make

On Sat, 08 Dec 2001 00:35:14 +1100
Keith Owens <[email protected]> wrote:
>
> CFLAGS_foo.o += -DMAX_CARDS=$(subst (,,$(subst ),,$(CONFIG_HISAX_MAX_CARDS)))
>
> In foo.c, use MAX_CARDS instead of CONFIG_HISAX_MAX_CARDS. Change foo
> to the name of the object that you are working on. When you build, it
> should say -DMAX_CARDS=8.

Keith, it is getting weird right now. Your above suggestion does not work, it
does not even execute, because the braces obviously confuse it.
Now I come up with a _working_ solution, but to be honest, I don't dare to give
away the patch, because it looks like this:

EXTRA_CFLAGS += -DHISAX_MAX_CARDS=$(subst ,,$(CONFIG_HISAX_MAX_CARDS))

You read it right, no substitution is going on. BUT my test shows this executes
to e.g. "8" and not "(8)". This is what I wanted, but it looks like bs.
Alan would you please tell me if this looks like clean make usage to you, or
how you would drop "()" around integer definitions coming from CONFIG. Warning:
if you think this is ok, I will send the patch ;-) I only want a confirmation
from "Mr. Clean Code" ;-))

Regards,
Stephan

2001-12-07 17:24:08

by Marcelo Tosatti

Subject: Re: Linux 2.4.17-pre5



On Fri, 7 Dec 2001, Dave Jones wrote:

> On Thu, 6 Dec 2001, Alan Cox wrote:
>
> > > - Pentium IV Hyperthreading support (Alan Cox)
> > Actually that one is various Intel people not me 8)
>
> Wouldn't it be better to see such things proven right in 2.5 first ?
>
> Random things like this still appearing in 2.4 that haven't shown
> up in 2.5 yet is a little disturbing. Ok its small, and there'll be
> more 2.4pre's to get it right if anything is wrong with it, but
> the whole forward-porting features thing just seems so... backwards.

The patch does not touch "normal" x86 code: We're just using a new feature
of P4.

If any user reports problems with "hyperthreading" we can disable it by
default...

2001-12-07 22:22:23

by Keith Owens

Subject: Re: Linux 2.4.17-pre5 / Have fun with make

On Fri, 7 Dec 2001 17:39:54 +0100,
Stephan von Krawczynski <[email protected]> wrote:
>On Sat, 08 Dec 2001 00:35:14 +1100
>Keith Owens <[email protected]> wrote:
>>
>> CFLAGS_foo.o += -DMAX_CARDS=$(subst (,,$(subst ),,$(CONFIG_HISAX_MAX_CARDS)))
>>
>> In foo.c, use MAX_CARDS instead of CONFIG_HISAX_MAX_CARDS. Change foo
>> to the name of the object that you are working on. When you build, it
>> should say -DMAX_CARDS=8.
>
>Keith, it is getting weird right now. Your above suggestion does not work, it
>does not even execute, because the braces obviously confuse it.

That's what I get for typing code late at night and not testing it.
The correct implementation of that line is probably

lp:=(
rp:=)

CFLAGS_foo.o += -DMAX_CARDS=$(subst $(lp),,$(subst $(rp),,$(CONFIG_HISAX_MAX_CARDS)))

But as you found, you don't need that anyway.

>Now I come up with a _working_ solution, but to be honest, I don't dare to give
>away the patch, because it looks like this:
>
>EXTRA_CFLAGS += -DHISAX_MAX_CARDS=$(subst ,,$(CONFIG_HISAX_MAX_CARDS))

EXTRA_CFLAGS += -DHISAX_MAX_CARDS=$(CONFIG_HISAX_MAX_CARDS)

will work just as well. The reason that you do not get '(8)' that way
is because CML1 generates inconsistent output. In .config the line
says CONFIG_HISAX_MAX_CARDS=8, in include/linux/autoconf.h it says
#define CONFIG_HISAX_MAX_CARDS (8). The makefiles use .config, the
source code uses autoconf.h.
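
To see why the parentheses matter for the MODULE_PARM question earlier in the thread, here is a minimal C sketch of preprocessor stringification. The STR()/STR_() helpers and the two accessor functions are illustrative assumptions, not kernel macros; the two defines merely stand in for what autoconf.h and a -D flag from the makefile would supply.

```c
/* Two-level stringification so the argument is macro-expanded first. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Stand-in for what CML1 writes into include/linux/autoconf.h. */
#define CONFIG_HISAX_MAX_CARDS (8)

/* Stand-in for -DHISAX_MAX_CARDS=8 coming from the makefile. */
#define HISAX_MAX_CARDS 8

/* Range string built from the autoconf.h value: keeps the parentheses. */
const char *parm_from_autoconf(void)
{
    return "1-" STR(CONFIG_HISAX_MAX_CARDS) "i";
}

/* Range string built from the makefile value: a plain number. */
const char *parm_from_makefile(void)
{
    return "1-" STR(HISAX_MAX_CARDS) "i";
}
```

The first function yields "1-(8)i" and the second "1-8i", which is why taking the value from .config through the makefile sidesteps the problem.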

Inconsistency of inconsistencies, saith the preacher; all is inconsistency.

2001-12-08 04:57:04

by M. Edward Borasky

Subject: RE: Linux 2.4.17-pre5

They have Krispy Kremes in Europe??
--
Take Your Trading to the Next Level!
M. Edward Borasky, Meta-Trading Coach

[email protected]
http://www.meta-trading-coach.com
http://groups.yahoo.com/group/meta-trading-coach

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of David S. Miller
> Sent: Thursday, December 06, 2001 12:59 PM
> To: [email protected]
> Cc: [email protected]; [email protected]
> Subject: Re: Linux 2.4.17-pre5
>
>
> From: Rik van Riel <[email protected]>
> Date: Thu, 6 Dec 2001 18:14:23 -0200 (BRST)
>
> Wait for Dave at the lokal Krispy Kream and don't allow
> him access to the donuts until he writes a changelog ?
>
> Actually, Marcelo gets a full changelog.

2001-12-08 05:42:15

by David Miller

Subject: Re: Linux 2.4.17-pre5

From: "M. Edward Borasky" <[email protected]>
Date: Fri, 7 Dec 2001 20:56:46 -0800

They have Krispy Kremes in Europe??

No, a US only phenomenon at the moment :-)

2001-12-08 23:42:57

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

On Fri, 7 Dec 2001 00:09:12 +0000 (GMT)
Alan Cox <[email protected]> wrote:

> > > Actually that one is various Intel people not me 8)
> >
> > Wouldn't it be better to see such things proven right in 2.5 first ?
>
> o 2.5 isnt going to be usable for that kind of thing in the near future
> o There is no code that is "new" for normal paths (in fact Marcelo
> wanted a change for the only "definitely harmless" one there was)

The sched.c change is also useless (ie. only harmful). Anton and I looked at
adapting the scheduler for hyperthreading, but it looks like the recent
changes have had the side effect of making hyperthreading + the current
scheduler "good enough". If someone wants an in-depth analysis of (1) what
is required to make the "right" decision for hyperthread scheduling with the
current scheduler (much more than the current wedge) and (2) why it doesn't
really matter anyway, please ask.

Anton, can you put the dbench graphs somewhere public?

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-09 00:22:55

by Alan

Subject: Re: Linux 2.4.17-pre5

> The sched.c change is also useless (ie. only harmful). Anton and I looked at
> adapting the scheduler for hyperthreading, but it looks like the recent
> changes have had the side effect of making hyperthreading + the current

I trust Intels own labs over you on this one. In fact there is still
additional work to do to get mm pairing per chip not per cpu unit. Thats
intels patch based on intels work. I suspect they know what their chip
needs.

Alan

2001-12-09 01:22:46

by Anton Blanchard

Subject: Linux HMT analysis


> Anton, can you put the dbench graphs somewhere public?

Here they are:

http://samba.org/~anton/linux/HMT/

The machine is a 4 way RS64 (ppc64) box, with HMT enabled so Linux
thinks it has 8 cpus.

Since HMT is not an intel only problem it would be nice to solve this in
a slightly more generic way than #if defined(__i386__) && defined(CONFIG_SMP).
Otherwise there will shortly be yet another hack in the scheduler
surrounded by #ifdef CONFIG_PPC64_HMT :)

It's pretty obvious what they are trying to achieve (it's always
preferable to schedule 2 tasks on separate physical cpus rather than
sharing the same one), but their change does not seem to have the
required outcome.

Do we have any results showing the improvement this change made or did
we just accept the changes?

Anton

2001-12-09 01:30:35

by Alan

Subject: Re: Linux HMT analysis

> Since HMT is not an intel only problem it would be nice to solve this in
> a slightly more generic way than #if defined(__i386__) && defined(CONFIG_SMP).
> Otherwise there will shortly be yet another hack in the scheduler
> surrounded by #ifdef CONFIG_PPC64_HMT :)

Right but instead of saying "it doesnt work" it might be worth saying "it
doesnt work for me on ppc64". The latter I can well believe

> It's pretty obvious what they are trying to achieve (it's always
> preferable to schedule 2 tasks on separate physical cpus rather than
> sharing the same one), but their change does not seem to have the
> required outcome.

It's trying to ensure we make use of idle cpu units. Right now it doesn't
also consider the matching mm check to be per chip not per scheduling unit.
I've suggested that to Intel but not yet had any definitive answer stating
that it is an improvement.

[Actually it isnt always preferable there is a reverse argument in two
cases

1. Where both threads share the same mm
2. When you are trying to save power over performance

]

> Do we have any results showing the improvement this change made or did
> we just accept the changes?

Its based on real world runs with things like Oracle with a constraint that
the change must be clear and provably correct to get into a stable kernel
tree. For 2.5 we know the scheduler has to be rewritten somewhat and getting
it generically right is very important.

Alan

2001-12-09 01:32:16

by Alan

Subject: Re: Linux HMT analysis

> > Otherwise there will shortly be yet another hack in the scheduler
> > surrounded by #ifdef CONFIG_PPC64_HMT :)

Oh and a PS. Can you send me the PPC64_HMT scheduler hack to look at. If its
sane for 2.4 I can then see if the intel guys think it works for them too.

Alan

2001-12-09 01:58:36

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

In message <[email protected]> you write:
> > The sched.c change is also useless (ie. only harmful). Anton and I looked at
> > adapting the scheduler for hyperthreading, but it looks like the recent
> > changes have had the side effect of making hyperthreading + the current
>
> I trust Intels own labs over you on this one.

This is voodoo optimization. I don't care WHO did it.

Marcelo, drop the patch. Please delay scheduler hacks until they can
be verified to actually do something.

Given another chip with similar technology (eg. PPC's Hardware Multi
Threading) and the same patch, dbench runs 1 - 10 on 4-way makes NO
POSITIVE DIFFERENCE.

http://samba.org/~anton/linux/HMT/

> I suspect they know what their chip needs.

I find your faith in J. Random Intel Engineer fascinating.

================

The current scheduler actually works quite well if you number your
CPUs right, and to fix the corner cases takes more than this change.
First some simple terminology: let's assume we have two "sides" to
each CPU (ie. each CPU has two IDs, smp_num_cpus()/2 apart):

0 1 2 3
4 5 6 7

The current scheduler code reschedule_idle()s (pushes) from 0 to 3
first anyway, so if we're less than 50% utilized it tends to "just
work". Note that it doesn't stop the schedule() (pulls) on 4 - 7 from
grabbing a process to run even if there is a fully idle CPU, so it's
far from perfect.

Now let's look at the performance-problematic case: dbench 5.

Without HMT/hyperthread:
Fifth process not scheduled at all.

When any of the first four processes schedule(), the fifth
process is pulled onto that processor.

With HMT/hyperthread:
Fifth process scheduled on 4 (shared with 0).

When processes on 1, 2, or 3 schedule(), that processor sits
idle, while processor 0/4 is doing double work (ie. only 2 in
5 chance that the right process will schedule() first).

Finally, 0 or 4 will schedule() then wakeup, and be pulled
onto another CPU (unless they are all busy again).

The result is that dbench 5 runs significantly SLOWER with
hyperthreading than without. We really want to pull a process off a
cpu it is running on, if we are completely idle and it is running on a
double-used CPU. Note that dbench 6 is almost back to normal
performance (since the probability of the right process scheduling
first becomes 4 in 6).
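
The "2 in 5" and "4 in 6" figures follow from counting how many running processes sit on a doubly used physical CPU. A small sketch of that arithmetic, assuming two-sided CPUs and processes placed so that every physical CPU gets one process before any gets a second (the function name is made up for illustration):

```c
/* Number of processes that share a physical CPU with another process,
 * for nproc runnable processes on nphys two-sided CPUs. Returns 0 when
 * nothing is doubled up and is capped at 2 * nphys. */
int procs_on_doubled_cpus(int nproc, int nphys)
{
    int doubled = nproc - nphys;   /* physical CPUs running two processes */

    if (doubled <= 0)
        return 0;
    if (doubled > nphys)
        doubled = nphys;
    return 2 * doubled;
}
```

With 5 processes on 4 CPUs that gives 2 out of 5, and with 6 processes 4 out of 6: the odds that the process which calls schedule() first is one of those on a doubly used CPU.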

Now, the Intel hack changes reschedule_idle() to push onto the first
completely idle CPU above all others. Nice idea: the only problem is
finding a load where that actually happens, since we push onto low
numbers first anyway. If we have an average of <= 4 running
processes, they spread out nicely, and if we have an average of > 4
then there are no fully idle processors and this hack is useless.
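
A rough model of the two push policies discussed above, using the 0-3 / 4-7 numbering from the diagram. This is a sketch under stated assumptions, not the sched.c code: busy[] is a hypothetical snapshot of which logical CPUs are running something, and both function names are invented here.

```c
#include <stdbool.h>

#define NR_LOGICAL 8                       /* 4 physical CPUs, two sides each */
#define SIBLING(cpu) (((cpu) + NR_LOGICAL / 2) % NR_LOGICAL)

/* Plain policy: push to the lowest-numbered idle logical CPU. */
int pick_first_idle(const bool busy[NR_LOGICAL])
{
    int cpu;

    for (cpu = 0; cpu < NR_LOGICAL; cpu++)
        if (!busy[cpu])
            return cpu;
    return -1;                             /* everything is busy */
}

/* Policy as described for the patch: prefer a logical CPU whose other
 * side is idle too, and fall back to the plain policy otherwise. */
int pick_fully_idle(const bool busy[NR_LOGICAL])
{
    int cpu;

    for (cpu = 0; cpu < NR_LOGICAL; cpu++)
        if (!busy[cpu] && !busy[SIBLING(cpu)])
            return cpu;
    return pick_first_idle(busy);
}
```

When the running processes are already spread over 0 to 3, both functions pick the same CPU; they differ only when a low-numbered CPU is idle while its other side is busy.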

Clear?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-09 02:34:12

by Davide Libenzi

Subject: Re: Linux 2.4.17-pre5

On Sun, 9 Dec 2001, Rusty Russell wrote:

> With HMT/hyperthread:
> Fifth process scheduled on 4 (shared with 0).
>
> When processes on 1, 2, or 3 schedule(), that processor sits
> idle, while processor 0/4 is doing double work (ie. only 2 in
> 5 chance that the right process will schedule() first).
>
> Finally, 0 or 4 will schedule() then wakeup, and be pulled
> onto another CPU (unless they are all busy again).

It's not easy to get this right anyway.
Using the scheduler I'm working on and setting a trigger load level of 2,
as soon as the idle task is scheduled it will grab the task waiting on the
other cpu and run it.
But this is not always right and, more difficult, it's very hard to
tell when it is right to behave that way and when it is not.
Think about a task that has built its own cache image on that cpu and
is very likely to be woken up again very soon.
By picking up another task you're going to trash its cache image.
What I'm thinking of is to have the idle task, instead of staying halt()ed
waiting for an IPI, be woken up at each timer tick to check the overall
balancing status.
Each time an imbalance is discovered, a counter (on the idle cpu) is
incremented and, when this counter rises above a certain value, the
move is actually performed.
That way we have a single value that makes the scheduler behave
differently depending on its setting.
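
The counter idea can be sketched in a few lines of C. Everything here is hypothetical: the struct, its field names and the choice to reset the counter once the system looks balanced again are assumptions made for illustration, since the description above does not say what happens to the counter in that case.

```c
#include <stdbool.h>

struct idle_balance {
    int imbalance_ticks;   /* consecutive ticks on which an imbalance was seen */
    int threshold;         /* tunable: how long to wait before moving a task */
};

/* Called from the idle CPU on every timer tick. Returns true when the
 * idle CPU should now pull a task from the overloaded CPU. */
bool idle_tick(struct idle_balance *ib, bool imbalanced)
{
    if (!imbalanced) {
        ib->imbalance_ticks = 0;   /* assumption: a balanced tick resets the count */
        return false;
    }
    if (++ib->imbalance_ticks > ib->threshold) {
        ib->imbalance_ticks = 0;   /* start over after a migration */
        return true;
    }
    return false;
}
```

A low threshold migrates eagerly; a high one protects the cache image of a task that is about to wake up again.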




- Davide


2001-12-09 06:20:59

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

In message <[email protected]> you write:
> It's not easy to get this right anyway.

Balancing the pull and push mechanisms in the scheduler while trying
to predict the future? "Not easy" is an excellent description.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-09 09:48:23

by Arjan van de Ven

Subject: Re: Linux 2.4.17-pre5

In article <[email protected]> you wrote:
> On Fri, 7 Dec 2001 00:09:12 +0000 (GMT)
> Alan Cox <[email protected]> wrote:

>> > > Actually that one is various Intel people not me 8)
>> >
>> > Wouldn't it be better to see such things proven right in 2.5 first ?
>>
>> o 2.5 isnt going to be usable for that kind of thing in the near future
>> o There is no code that is "new" for normal paths (in fact Marcelo
>> wanted a change for the only "definitely harmless" one there was)

> The sched.c change is also useless (ie. only harmful).

The intention seems to be to avoid the situation where one "pair" is
executing 2 processes while other "pair"s are fully idle. It makes a
difference for the "system is < 50% busy" case, NOT for the "system is very
busy" case....

2001-12-09 10:01:55

by Eran Mann

Subject: Re: Linux 2.4.17-pre5



David S. Miller wrote:

>
> 3) VLAN fixes, in particular stop OOPS on module unload.
> Also fix the build when VLAN is non-modular.
>

Small correction. The oops is not on module unload, but rather on
removal of VLAN device.
--
Eran Mann Direct : 972-4-9936230
Senior Software Engineer Fax : 972-4-9890430
Optical Access Email : [email protected]




2001-12-09 16:07:54

by Alan

Subject: Re: Linux 2.4.17-pre5

> > I trust Intels own labs over you on this one.
> This is voodoo optimization. I don't care WHO did it.

Why don't you spend some time making the PPC64 port actually follow
basic things like the coding standard. Its not voodoo optimisation, its
benchmarked work from Intel.

> Given another chip with similar technology (eg. PPC's Hardware Multi
> Threading) and the same patch, dbench runs 1 - 10 on 4-way makes NO
> POSITIVE DIFFERENCE.

Well let me guess. Perhaps the PPC hardware MT is different. Real numbers
have been done. Getting uppity because we have HT code in that happens to
clash with your work isn't helpful. The fact that the IBM PPC64 port is
9 months behind in this area doesn't mean the rest of us can wait. When the
PPC64 port is usable, mergable and resembles actual Linux code then this
can be looked at again for 2.4.

Perhaps you'd like to submit your PPC64 HT patches to the list today
so that they can be tried comparatively on the Intel HT and we can see if
its a better generic solution ?

For 2.5 the scheduler needs a rewrite anyway so its a non issue there.

Alan

2001-12-09 16:16:14

by Alan

Subject: Re: Linux 2.4.17-pre5

> Using the scheduler i'm working on and setting a trigger load level of 2,
> as soon as the idle is scheduled it'll go to grab the task waiting on the
> other cpu and it'll make it running.

That rapidly gets you thrashing around as I suspect you've found.

I'm currently using the following rule in wake up

	if(current->mm->runnable > 0)	/* One already running ? */
		cpu = current->mm->last_cpu;
	else
		cpu = idle_cpu();
	else
		cpu = cpu_num[fast_fl1(runnable_set)]

that is
	If we are running threads with this mm on a cpu throw them at the
	same core
	If there is an idle CPU use it
	Take the mask of currently executing priority levels, find the last
	set bit (lowest pri) being executed, and look up a cpu running at
	that priority

Then the idle stealing code will do the rest of the balancing, but at least
it converges towards each mm living on one cpu core.
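
One way to read the three-step rule above as compilable C. This is only a
sketch: the struct layouts, the 32-level priority range and the convention
that a higher bit means a lower priority are assumptions, and
last_set_bit() stands in for fast_fl1().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIO_LEVELS 32          /* assumed number of priority levels */

/* Hypothetical stand-ins for the fields used in the pseudocode. */
struct mm_sketch {
    int runnable;                   /* threads of this mm currently running */
    int last_cpu;                   /* CPU one of them last ran on */
};

struct cpu_sketch {
    int idle;                       /* non-zero if the CPU runs its idle task */
    int prio;                       /* level of the running task; higher = lower pri */
};

/* Index of the highest set bit in mask, or -1 if mask is zero. */
static int last_set_bit(unsigned int mask)
{
    int bit = -1;

    while (mask != 0) {
        mask >>= 1;
        bit++;
    }
    return bit;
}

static int pick_wakeup_cpu(const struct mm_sketch *mm,
                           const struct cpu_sketch *cpus, int ncpus)
{
    unsigned int runnable_set = 0;
    int cpu_num[MAX_PRIO_LEVELS];
    int cpu;

    assert(mm != NULL && cpus != NULL && ncpus > 0);

    /* 1. Threads with this mm are already running: join them. */
    if (mm->runnable > 0)
        return mm->last_cpu;

    /* 2. Otherwise use an idle CPU if there is one. */
    for (cpu = 0; cpu < ncpus; cpu++)
        if (cpus[cpu].idle)
            return cpu;

    /* 3. Otherwise pick a CPU running at the lowest executing priority. */
    for (cpu = 0; cpu < ncpus; cpu++) {
        int prio = cpus[cpu].prio;

        assert(prio >= 0 && prio < MAX_PRIO_LEVELS);
        runnable_set |= 1u << prio;
        cpu_num[prio] = cpu;
    }
    return cpu_num[last_set_bit(runnable_set)];
}
```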

Alan

2001-12-09 19:47:18

by Davide Libenzi

Subject: Re: Linux 2.4.17-pre5

On Sun, 9 Dec 2001, Alan Cox wrote:

> > Using the scheduler i'm working on and setting a trigger load level of 2,
> > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > other cpu and it'll make it running.
>
> That rapidly gets you thrashing around as I suspect you've found.

Not really because i can make the same choices inside the idle code, out
of the fast path, without slowing the currently running cpu ( the waker ).


> I'm currently using the following rule in wake up
>
> if(current->mm->runnable > 0) /* One already running ? */
> cpu = current->mm->last_cpu;
> else
> cpu = idle_cpu();
> else
> cpu = cpu_num[fast_fl1(runnable_set)]
>
> that is
> If we are running threads with this mm on a cpu throw them at the
> same core
> If there is an idle CPU use it
> Take the mask of currently executing priority levels, find the last
> set bit (lowest pri) being executed, and look up a cpu running at
> that priority
>
> Then the idle stealing code will do the rest of the balancing, but at least
> it converges towards each mm living on one cpu core.

I've done a lot of experiments balancing the cost of moving tasks with
related tlb flushes and cache image trashing, with the cost of actually
leaving a cpu idle for a given period of time.
For example in a dual cpu the cost of leaving a cpu idle for more than
40-50 ms is higher than that of immediately filling the idle cpu with a
stolen task ( trigger rq length == 2 ).
This picture should vary a lot with big SMP systems, that's why i'm
looking for a biased solution where it's easy to adjust the scheduler
behavior based on the underlying architecture.
For example, by leaving balancing decisions inside the idle code we'll
have a bit more time to consider the different moving costs/metrics that
will be present, for example, in NUMA machines.
By measuring the cost of moving with the cpu idle time we'll have a pretty
good granularity and we could say, for example, that the tolerable cost of
moving a task on a given architecture is 40 ms idle time.
This means that if during 4 consecutive timer ticks ( on 100 HZ archs )
the idle cpu has found an "unbalanced" system, it's allowed to steal a
task to run on it.
Or better, it's allowed to steal a task from a cpu set that has a
"distance" <= 40 ms from its own set.

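The arithmetic above (40 ms of tolerated idle time at 100 HZ gives 4 timer
ticks) can be sketched as follows. SKETCH_HZ and both function names are
made up for illustration, and the "distance" is assumed to be expressed in
milliseconds of idle time, as in the text.

```c
#include <assert.h>

#define SKETCH_HZ 100   /* timer ticks per second on a "100 HZ" arch */

/* Convert a tolerated idle time in milliseconds into whole timer ticks. */
static unsigned int ms_to_ticks(unsigned int ms)
{
    return (ms * SKETCH_HZ) / 1000;
}

/*
 * An idle CPU that has seen `unbalanced_ticks` consecutive unbalanced
 * ticks may steal from a CPU set whose distance, in ms of idle time,
 * has been covered by that many ticks.
 */
static int may_steal_from(unsigned int unbalanced_ticks,
                          unsigned int distance_ms)
{
    return unbalanced_ticks >= ms_to_ticks(distance_ms);
}
```

With these definitions a CPU set 40 ms away becomes eligible after 4
ticks, while a more distant (for example NUMA) set at 100 ms needs 10.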




- Davide


2001-12-09 20:55:05

by Marcelo Tosatti

Subject: Re: Linux 2.4.17-pre5



On Sun, 9 Dec 2001, Rusty Russell wrote:

> In message <[email protected]> you write:
> > > The sched.c change is also useless (ie. only harmful). Anton and I
> > > looked at adapting the scheduler for hyperthreading, but it looks like
> > > the recent changes have had the side effect of making hyperthreading +
> > > the current
> >
> > I trust Intels own labs over you on this one.
>
> This is voodoo optimization. I don't care WHO did it.
>
> Marcelo, drop the patch. Please delay scheduler hacks until they can
> be verified to actually do something.
>
> Given another chip with similar technology (eg. PPC's Hardware Multi
> Threading) and the same patch, dbench runs 1 - 10 on 4-way makes NO
> POSITIVE DIFFERENCE.
>
> http://samba.org/~anton/linux/HMT/
>
> > I suspect they know what their chip needs.
>
> I find your faith in J. Random Intel Engineer fascinating.
>
> ================
>
> The current scheduler actually works quite well if you number your
> CPUs right, and to fix the corner cases takes more than this change.
> First some simple terminology: let's assume we have two "sides" to
> each CPU (ie. each CPU has two IDs, smp_num_cpus()/2 apart):
>
> 0 1 2 3
> 4 5 6 7
>
> The current scheduler code reschedule_idle()s (pushes) from 0 to 3
> first anyway, so if we're less than 50% utilized it tends to "just
> work". Note that it doesn't stop the schedule() (pulls) on 4 - 7 from
> grabbing a process to run even if there is a fully idle CPU, so it's
> far from perfect.
>
> Now let's look at the performance-problematic case: dbench 5.
>
> Without HMT/hyperthread:
> Fifth process not scheduled at all.
>
> When any of the first four processes schedule(), the fifth
> process is pulled onto that processor.
>
> With HMT/hyperthread:
> Fifth process scheduled on 4 (shared with 0).
>
> When processes on 1, 2, or 3 schedule(), that processor sits
> idle, while processor 0/4 is doing double work (ie. only 2 in
> 5 chance that the right process will schedule() first).
>
> Finally, 0 or 4 will schedule() then wakeup, and be pulled
> onto another CPU (unless they are all busy again).
>
> The result is that dbench 5 runs significantly SLOWER with
> hyperthreading than without. We really want to pull a process off a
> cpu it is running on, if we are completely idle and it is running on a
> double-used CPU. Note that dbench 6 is almost back to normal
> performance, since the probability of the right process scheduling
> first becomes 4 in 6).
>
> Now, the Intel hack changes reschedule_idle() to push onto the first
> completely idle CPU above all others. Nice idea: the only problem is
> finding a load where that actually happens, since we push onto low
> numbers first anyway. If we have an average of <= 4 running
> processes, they spread out nicely, and if we have an average of > 4
> then there are no fully idle processes and this hack is useless.
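
The "2 in 5" and "4 in 6" figures in the quoted analysis can be reproduced
with a small sketch. It assumes that sibling IDs are filled only after
every core has one process, and that each running process is equally
likely to be the first to call schedule(); the function name is invented
here.

```c
#include <assert.h>

/*
 * Number of processes that sit on a doubly-used core when nprocs
 * processes run on ncores two-way cores. The chance that the "right"
 * process schedules first is this count divided by nprocs.
 */
static int procs_on_shared_cores(int nprocs, int ncores)
{
    assert(ncores > 0 && nprocs >= 0 && nprocs <= 2 * ncores);

    return nprocs > ncores ? 2 * (nprocs - ncores) : 0;
}
```

For dbench 5 on a 4-way this gives 2 (a 2 in 5 chance), and for dbench 6
it gives 4 (4 in 6).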

Rusty,

I've applied the Intel HT code because it is non intrusive.

If you really want to see that change removed, please show me (sensible)
benchmark numbers where the code actually fuckups performance.

Thanks


2001-12-09 23:45:41

by Mike Kravetz

Subject: Re: Linux 2.4.17-pre5

On Sun, Dec 09, 2001 at 04:24:59PM +0000, Alan Cox wrote:
> I'm currently using the following rule in wake up
>
> if(current->mm->runnable > 0) /* One already running ? */
> cpu = current->mm->last_cpu;
> else
> cpu = idle_cpu();
> else
> cpu = cpu_num[fast_fl1(runnable_set)]
>
> that is
> If we are running threads with this mm on a cpu throw them at the
> same core
> If there is an idle CPU use it
> Take the mask of currently executing priority levels, find the last
> set bit (lowest pri) being executed, and look up a cpu running at
> that priority
>
> Then the idle stealing code will do the rest of the balancing, but at least
> it converges towards each mm living on one cpu core.

This implies that the idle loop will poll looking for work to do.
Is that correct? Davide's scheduler also does this. I believe
the current default idle loop (at least for i386) does as little
as possible and stops executing instructions. Comments in the code
mention power consumption. Should we be concerned with this?

--
Mike

2001-12-09 23:48:41

by Davide Libenzi

Subject: Re: Linux 2.4.17-pre5

On Sun, 9 Dec 2001, Mike Kravetz wrote:

> On Sun, Dec 09, 2001 at 04:24:59PM +0000, Alan Cox wrote:
> > I'm currently using the following rule in wake up
> >
> > if(current->mm->runnable > 0) /* One already running ? */
> > cpu = current->mm->last_cpu;
> > else
> > cpu = idle_cpu();
> > else
> > cpu = cpu_num[fast_fl1(runnable_set)]
> >
> > that is
> > If we are running threads with this mm on a cpu throw them at the
> > same core
> > If there is an idle CPU use it
> > Take the mask of currently executing priority levels, find the last
> > set bit (lowest pri) being executed, and look up a cpu running at
> > that priority
> >
> > Then the idle stealing code will do the rest of the balancing, but at least
> > it converges towards each mm living on one cpu core.
>
> This implies that the idle loop will poll looking for work to do.
> Is that correct? Davide's scheduler also does this. I believe
> the current default idle loop (at least for i386) does as little
> as possible and stops executing instructions. Comments in the code
> mention power consumption. Should we be concerned with this?

My idea is not to poll ( due to energy issues ) but to wake up idles (
kernel/timer.c ) at every timer tick to let them monitor the overall
balancing status.




- Davide


2001-12-09 23:50:51

by Alan

Subject: Re: Linux 2.4.17-pre5

> This implies that the idle loop will poll looking for work to do.
> Is that correct? Davide's scheduler also does this. I believe
> the current default idle loop (at least for i386) does as little
> as possible and stops executing instructions. Comments in the code
> mention power consumption. Should we be concerned with this?

You can poll or IPI. An IPI has the problem that IPI's are horribly slow
on Pentium II/III. On the other hand the Athlon and PIV seem to both have
that bit sorted.

Its really an implementation detail as to whether you poll for work or
someone kicks you. Since we know what the other processors are doing and
who is idle we know when we need to kick them.

Alan

2001-12-10 00:21:23

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

In message <[email protected]> you write:
> Its not voodoo optimisation, its benchmarked work from Intel.

At the very least, please pass this paraphrase on to the Intel people.
I asserted:

If you number each CPU so its two IDs are smp_num_cpus()/2
apart, you will NOT need to put some crappy hack in the
scheduler to pack your CPUs correctly.

> Perhaps you'd like to submit your PPC64 HT patches to the list today
> so that they can be tried comparatively on the Intel HT and we can see if
> its a better generic solution ?

I apologize: clearly my previous post was far too long, as you
obviously did not read it. There is no sched.c patch.

> For 2.5 the scheduler needs a rewrite anyway so its a non issue there.

Disagree. Without widespread understanding of how the simple
scheduler works, writing a more complex one is doomed.

The Intel people, whom you assure me "know what their chip needs"
obviously have trouble understanding the subtleties of the current
scheduler. What hope the rest of us?

Rusty.
PS. Alan, go back and READ my analysis, or this will be a VERY
long thread.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-10 00:33:24

by Alan

Subject: Re: Linux 2.4.17-pre5

> If you number each CPU so its two IDs are smp_num_cpus()/2
> apart, you will NOT need to put some crappy hack in the
> scheduler to pack your CPUs correctly.

Which is a major change to the x86 tree and an invasive one. Right now the
X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
buried in the x86 arch code there isnt something that assumes this or mixes
a logical and physical cpu id wrongly in error.

At best you are exploiting an obscure quirk of the current scheduler that is
quite likely to break the moment someone factors power management into the
idling equation (turning cpus off and on is more expensive so if you idle
a cpu you want to keep it the idle one for PM). Congratulations on your
zen like mastery of the scheduler algorithm. Now tell me it wont change in
that property.

> > For 2.5 the scheduler needs a rewrite anyway so its a non issue there.
>
> Disagree. Without widespread understanding of how the simple
> scheduler works, writing a more complex one is doomed.

The simple scheduler doesn't work. I've run about 20 schedulers on playing
cards, and at the point you are shuffling things around and its clear what
is happening its actually hard not to start laughing at the current
scheduler once you hit a serious load or serious amounts of processors.

Its a great scheduler for a single or dual processor 486/pentium type box
running a home environment. It gets a bit flaky by the time its running
oracle on a 4 way, it gets very flaky by the time its running lotus back
ends on an 8 way. It doesn't take lunacy like java, broken JVM implementations
and volcanomark to make it go astray

Alan

2001-12-10 02:07:23

by Martin J. Bligh

Subject: Re: Linux 2.4.17-pre5

>> If you number each CPU so its two IDs are smp_num_cpus()/2
>> apart, you will NOT need to put some crappy hack in the
>> scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.

I don't think it matters if you do a 1:1 map or not, since the NUMA-Q boxes work
fine without assuming this (I don't use physical APIC id's at all, except for from
the I/O APIC to just broadcast), and I don't think anyone else does either after
we bootstrap.

It shouldn't be all that hard to check. Mentally mark every time we look at the
physical APIC id, and which variables are set from that and thus "tainted". I
did this once - I don't think it's very many at all.

I don't think changing the order we look at phys_cpu_present_map should
make much of a difference.

On the other hand, relying on the "arbitrary" cpu numbers either way doesn't
seem like the best of ideas ;-)

Martin.

2001-12-10 05:32:44

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

In message <[email protected]> you write:
> > If you number each CPU so its two IDs are smp_num_cpus()/2
> > apart, you will NOT need to put some crappy hack in the
> > scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.

Agreed, but does the current x86 code map them like this or not?
If it does, I'm curious as to why they saw a problem which this fixed.

I've been playing with this on and off for months, and trying to
understand what is happening. I posted my findings, and I'd really
like to get some feedback from others doing the same thing.

BUT I CAN'T DO THAT WHEN THERE'S NO DISCUSSION ABOUT PATCHES FROM
ANONYMOUS SOURCES WHICH GET MERGED! FUCK ARGHH FUCK FUCK FUCK.

(BTW, "I trust Intel engineers" is NOT discussion).

> Congratulations on your zen like mastery of the scheduler algorithm.

8) I just try to understand what I've seen on real hardware. It leads
to my belief that HMT cannot be a win on # processes = # CPUS + 1
situations on a non-preemptible scheduler.

> The simple scheduler doesn't work. I've run about 20 schedulers on playing
> cards, and at the point you are shuffling things around and its clear what
> is happening its actually hard not to start laughing at the current
> scheduler once you hit a serious load or serious amounts of processors.

Ack. Even in the limited case of trying to get HMT to work reasonably
in simple cases, scheduling changes are not simply "change the
goodness() function and it alters behavior". 8(

BTW: Alchemy, Voodoo, Zen and Cards. Maybe you should start hacking
on something more deterministic? 8)

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-10 05:41:15

by Rusty Russell

Subject: Re: Linux 2.4.17-pre5

In message <2899076331.1007921423@[10.10.1.2]> you write:
> On the other hand, relying on the "arbitrary" cpu numbers either way doesn't
> seem like the best of ideas ;-)

Well, putting a comment in reschedule_idle() saying "we fill from the
bottom, and some ports rely on this" might be nice 8)

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2001-12-10 08:20:16

by Alan

Subject: Re: Linux 2.4.17-pre5

> Agreed, but does the current x86 code map them like this or not?
> If it does, I'm curious as to why they saw a problem which this fixed.

The current x86 code maps the logical cpus as with the physical ones. In
other words its how they come off the mainboard. Which for HT seems to
be with each HT as (n, n+1)
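
To make the difference between the two numberings concrete, here is a
sketch that places tasks on the lowest-numbered logical CPUs (the "fill
from the bottom" behaviour discussed earlier) and counts how many cores
end up running two tasks. The function names and the 32-core limit are
assumptions for illustration only.

```c
#include <assert.h>

#define MAX_CORES 32    /* arbitrary limit for this sketch */

/*
 * Core that a logical CPU belongs to.
 * adjacent != 0: siblings are numbered (n, n+1).
 * adjacent == 0: siblings are ncpus/2 apart.
 */
static int core_of(int cpu, int ncpus, int adjacent)
{
    return adjacent ? cpu / 2 : cpu % (ncpus / 2);
}

/* Cores running two tasks after ntasks tasks fill CPUs 0..ntasks-1. */
static int doubled_cores(int ntasks, int ncpus, int adjacent)
{
    int per_core[MAX_CORES] = { 0 };
    int cpu, core, doubled = 0;

    assert(ncpus > 0 && ncpus % 2 == 0 && ncpus / 2 <= MAX_CORES);
    assert(ntasks >= 0 && ntasks <= ncpus);

    for (cpu = 0; cpu < ntasks; cpu++)
        per_core[core_of(cpu, ncpus, adjacent)]++;
    for (core = 0; core < ncpus / 2; core++)
        if (per_core[core] == 2)
            doubled++;
    return doubled;
}
```

With 8 logical CPUs and 4 tasks, the ncpus/2-apart numbering leaves no
core doubled, while the (n, n+1) numbering doubles up two cores and leaves
two cores fully idle.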

> understand what is happening. I posted my findings, and I'd really
> like to get some feedback from others doing the same thing.

I never saw your stuff.

> BUT I CAN'T DO THAT WHEN THERE'S NO DISCUSSION ABOUT PATCHES FROM
> ANONYMOUS SOURCES WHICH GET MERGED! FUCK ARGHH FUCK FUCK FUCK.

A mailing list doesn't scale to that. I do have a cunning-plan (tm) but
that requires some work and while its doable for 2.2 or 2.4 I know that
making Linus do or change any tiny bit of his behaviour isn't going to
happen which rather limits the behaviour.

Think about

mail patch to linus-patches@...

linus-patches@ is a script that does

find the diff
find the paths in the diff
regexp them against the list of notifications
email each matching notification a copy

with the regexps including

* [email protected]

so its like mailing Linus but anyone who cares can web add/remove themselves
from the cc list, and its invisible to Linus too.
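
The matching step of such a script could look roughly like this. It is a
sketch only, written in C with POSIX regex.h to stay in the language of
the rest of the thread: the struct, the function names and the example
addresses in the usage note are hypothetical. It only recognises "+++ "
header lines of a unified diff and does not strip a trailing timestamp.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* One hypothetical subscription: an extended regexp and an address. */
struct notification {
    const char *path_regex;
    const char *address;
};

/*
 * If `line` is a "+++ " header of a unified diff, return a pointer to
 * the path that follows it, otherwise NULL.
 */
static const char *diff_path(const char *line)
{
    if (strncmp(line, "+++ ", 4) != 0)
        return NULL;
    return line + 4;
}

/*
 * Store in `out` the addresses whose regexp matches `path`.
 * Returns the number of addresses stored (at most `max`).
 */
static size_t match_notifications(const char *path,
                                  const struct notification *subs,
                                  size_t nsubs,
                                  const char **out, size_t max)
{
    size_t found = 0;
    size_t i;

    for (i = 0; i < nsubs && found < max; i++) {
        regex_t re;

        /* Skip patterns that fail to compile. */
        if (regcomp(&re, subs[i].path_regex, REG_EXTENDED | REG_NOSUB) != 0)
            continue;
        if (regexec(&re, path, 0, NULL, 0) == 0)
            out[found++] = subs[i].address;
        regfree(&re);
    }
    return found;
}
```

A subscription with the pattern ".*" plays the role of the catch-all
entry: it receives a copy of every patch.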

> BTW: Alchemy, Voodoo, Zen and Cards. Maybe you should start hacking
> on something more deterministic? 8)

Well actually Alchemy is MIPS and nothing to do with me. Trying to turn a
Voodoo card into a video4linux overlay is, Zen is ftp.linux.org.uk and I
hack card drivers. So 3 out of 4.

Alan

2001-12-10 23:13:07

by James Cleverdon

Subject: Re: Linux 2.4.17-pre5

On Monday 10 December 2001 12:28 am, Alan Cox wrote:
> > Agreed, but does the current x86 code map them like this or not?
> > If it does, I'm curious as to why they saw a problem which this fixed.
>
> The current x86 code maps the logical cpus as with the physical ones. In
> other words its how they come off the mainboard. Which for HT seems to
> be with each HT as (n, n+1)

Yes. Intel has defined the LSB of the physical APIC ID to be the
"hyperthreading" bit. Even numbered IDs are real CPUs; odd IDs are the
virtual CPUs. (Or, as wli calls them, Schwarzenegger and Di Vito. ;^)
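
In code, the LSB convention and the kind of reordering the BIOS folks
needed look roughly like this. It is a sketch with made-up function names;
the real ACPI table entries carry more than a bare ID.

```c
#include <assert.h>
#include <stddef.h>

/* Odd physical APIC IDs are the second ("virtual") CPU of a pair. */
static int is_virtual_cpu(unsigned int apic_id)
{
    return apic_id & 1u;
}

/*
 * Stable partition of an ID list: real (even) IDs first, virtual (odd)
 * IDs after them, keeping the original order inside each group.
 */
static void real_cpus_first(unsigned int *ids, size_t n)
{
    size_t next_real = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        unsigned int id = ids[i];

        if (is_virtual_cpu(id))
            continue;
        /* Everything in [next_real, i) is virtual: shift it up by one. */
        for (j = i; j > next_real; j--)
            ids[j] = ids[j - 1];
        ids[next_real++] = id;
    }
}
```

For 16 logical CPUs numbered 0..15, an OS that on-lines only the first 8
table entries then gets IDs 0, 2, ..., 14, i.e. eight real CPUs instead
of four.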

This may complicate Rusty's zen scheduler scheme. It certainly has made life
complicated for the BIOS folks. They had to sort all the real CPUs to the
front of the ACPI table, lest those folks so benighted as to run the crippled
version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
eight.

Anyway, with Intel's new numbering scheme you only get two real CPUs per
logical cluster of 4, which is kind of a pain....

> > understand what is happening. I posted my findings, and I'd really
> > like to get some feedback from others doing the same thing.

[ Snip! ]

>
> Alan


--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]

2001-12-10 23:22:27

by Alan

Subject: Re: Linux 2.4.17-pre5

> This may complicate Rusty's zen scheduler scheme. It certainly has made life
> complicated for the BIOS folks. They had to sort all the real CPUs to the
> front of the ACPI table, lest those folks so benighted as to run the crippled
> version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
> eight.

Rotfl, oh that is beautiful

2001-12-11 09:19:53

by Robert Varga

Subject: Re: Linux 2.4.17-pre5

On Mon, Dec 10, 2001 at 11:30:24PM +0000, Alan Cox wrote:
> > This may complicate Rusty's zen scheduler scheme. It certainly has made life
> > complicated for the BIOS folks. They had to sort all the real CPUs to the
> > front of the ACPI table, lest those folks so benighted as to run the crippled
> > version of Win2K (which only on-lines 8 CPUs) only get four real CPUs out of
> > eight.
>
> Rotfl, oh that is beautiful

As it happens a guy from microsoft sitting next to me as I read this
claims the DataCenter version of W2K has no limitation on number of
processors.

--
Kind regards,
Robert Varga
------------------------------------------------------------------------------
[email protected] http://hq.sk/~nite/gpgkey.txt



2001-12-11 09:21:33

by Eric W. Biederman

Subject: Re: Linux 2.4.17-pre5

Alan Cox <[email protected]> writes:

> > If you number each CPU so its two IDs are smp_num_cpus()/2
> > apart, you will NOT need to put some crappy hack in the
> > scheduler to pack your CPUs correctly.
>
> Which is a major change to the x86 tree and an invasive one. Right now the
> X86 is doing a 1:1 mapping, and I can offer Marcelo no proof that somewhere
> buried in the x86 arch code there isnt something that assumes this or mixes
> a logical and physical cpu id wrongly in error.

Actually we don't do a 1:1 physical to logical mapping. I currently
have a board that has physical id's of: 0:6 and logical id's of 0:1
with no changes to the current x86 code.
>
> At best you are exploiting an obscure quirk of the current scheduler that is
> quite likely to break the moment someone factors power management into the
> idling equation (turning cpus off and on is more expensive so if you idle
> a cpu you want to keep it the idle one for PM). Congratulations on your
> zen like mastery of the scheduler algorithm. Now tell me it wont change in
> that property.

The idea of a cpu priority for filling sounds like a nice one. Even
if we don't use the cpu id bits for it.

Eric

2001-12-11 09:24:03

by David Weinehall

Subject: Re: Linux 2.4.17-pre5

On Tue, Dec 11, 2001 at 10:16:41AM +0100, Robert Varga wrote:
> On Mon, Dec 10, 2001 at 11:30:24PM +0000, Alan Cox wrote:
> > > This may complicate Rusty's zen scheduler scheme. It certainly
> > > has made life complicated for the BIOS folks. They had to sort
> > > all the real CPUs to the front of the ACPI table, lest those folks
> > > so benighted as to run the crippled version of Win2K (which only
> > > on-lines 8 CPUs) only get four real CPUs out of eight.
> >
> > Rotfl, oh that is beautiful
>
> As it happens a guy from microsoft sitting next to me as I read this
> claims the DataCenter version of W2K has no limitation on number of
> processors.

Well, Datacenter is the non-crippled version. You get to pay a lot for
getting a non-crippled version, though. Soooo lame.


/David
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </

2001-12-11 23:05:59

by Alan

Subject: Re: Linux 2.4.17-pre5

> Actually we don't do a 1:1 physical to logical mapping. I currently
> have a board that has physical id's of: 0:6 and logical id's of 0:1
> with no changes to the current x86 code.

I mistook the physical to apic ones. My fault

/*
 * On x86 all CPUs are mapped 1:1 to the APIC space.
 * This simplifies scheduling and IPI sending and
 * compresses data structures.
 */
static inline int cpu_logical_map(int cpu)
{
	return cpu;
}
static inline int cpu_number_map(int cpu)
{
	return cpu;
}

2001-12-20 10:07:40

by Pavel Machek

Subject: Re: Linux 2.4.17-pre5

Hi!

> > Using the scheduler i'm working on and setting a trigger load level of 2,
> > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > other cpu and it'll make it running.
>
> That rapidly gets you thrashing around as I suspect you've found.
>
> I'm currently using the following rule in wake up
>
> if(current->mm->runnable > 0) /* One already running ? */
> cpu = current->mm->last_cpu;

Is this really a win?

I mean, if I have two tasks that can run from L2 cache, I want them on
different physical CPUs even if they share current->mm, no?
Pavel

--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky

2001-12-20 19:07:29

by Davide Libenzi

Subject: Re: Linux 2.4.17-pre5

On Wed, 19 Dec 2001, Pavel Machek wrote:

> Hi!
>
> > > Using the scheduler i'm working on and setting a trigger load level of 2,
> > > as soon as the idle is scheduled it'll go to grab the task waiting on the
> > > other cpu and it'll make it running.
> >
> > That rapidly gets you thrashing around as I suspect you've found.
> >
> > I'm currently using the following rule in wake up
> >
> > if(current->mm->runnable > 0) /* One already running ? */
> > cpu = current->mm->last_cpu;
>
> Is this really a win?
>
> I mean, if I have two tasks that can run from L2 cache, I want them on
> different physical CPUs even if they share current->mm, no?

It depends. If you've two CPU and these two tasks are the only ones
running, yes running them on separate CPUs is ok because the parallelism
that you'll get is going to pay you back for the cache issues.
And this is the automatic behavior that you'll get with sane schedulers.
But as a general rule, matching MMs should lead to an attempt to run them
on the same CPU ( give preference, not force ).
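
"Give preference, not force" can be sketched as a score in which a
matching mm earns a small bonus that an idle CPU always outweighs. The
weights, the struct and the function name are assumptions for
illustration, not values from any real scheduler.

```c
#include <assert.h>
#include <stddef.h>

#define IDLE_BONUS 2    /* hypothetical weight: an idle CPU wins */
#define MM_BONUS   1    /* hypothetical weight: same mm is a tie-breaker */

struct cpu_choice {
    int idle;           /* non-zero if the CPU is idle */
    int current_mm;     /* identifier of the mm running there */
};

/* Return the best-scoring CPU for a task; ties go to the lowest number. */
static int choose_cpu(int task_mm, const struct cpu_choice *cpus, int ncpus)
{
    int best = 0, best_score = -1;
    int cpu;

    assert(cpus != NULL && ncpus > 0);

    for (cpu = 0; cpu < ncpus; cpu++) {
        int score = 0;

        if (cpus[cpu].idle)
            score += IDLE_BONUS;
        else if (cpus[cpu].current_mm == task_mm)
            score += MM_BONUS;
        if (score > best_score) {
            best_score = score;
            best = cpu;
        }
    }
    return best;
}
```

With two CPUs and only two tasks sharing an mm, the second task goes to
the idle CPU; once every CPU is busy, the CPU already running that mm is
preferred.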




- Davide