2004-01-26 23:24:16

by Pete Zaitcev

Subject: Cset 1.1490.4.201 - dasd naming

Hi, Martin:

In a recent changeset in Linus' tree, there's your diff which blows up
the dasd naming scheme, with the comment:
- Change dasd names from "dasdx" to "dasd_<busid>_".

This breaks mkinitrd, nash, and mount by label (not to mention every
zipl.conf out there, because root= aliases to /sys/block/%s).
Would you please explain what exactly you were thinking when you
submitted that patch?

-- Pete

P.S. Cset
http://linux.bkbits.net:8080/linux-2.5/[email protected]?nav=index.html|src/|src/drivers|src/drivers/s390|src/drivers/s390/block|related/drivers/s390/block/dasd_genhd.c


2004-01-27 08:52:41

by Martin Schwidefsky

Subject: Re: Cset 1.1490.4.201 - dasd naming


Hi Pete,

> In a recent changeset in Linus' tree, there's your diff which blows up
> the dasd naming scheme, with the comment:
> - Change dasd names from "dasdx" to "dasd_<busid>_".
We plan to do this for tape and other ccw devices as well (where applicable).

> This breaks mkinitrd, nash, and mount by label (not to mention every
> zipl.conf out there, because root= aliases to /sys/block/%s).
> Would you please explain what exactly you were thinking when you
> submitted that patch?
The reason for this change is the requirement to have persistent device
names. The /dev/dasdxyz naming schema heavily depends on the order in
which the devices are added. Not good for persistent names. This change
affects four things: 1) the internal name, 2) the name of the sysfs
directory, 3) the root= parameter and 4) the hotplug events for dasd
devices.
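
To illustrate why index-based names are not persistent, here is a minimal sketch of both schemes. It assumes the old names use sd-style letter suffixes (dasda ... dasdz, dasdaa, ...) assigned in registration order. The function names are hypothetical and this is not the driver's code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Old style (assumed): the name depends only on the registration index,
 * so it changes whenever the order of device discovery changes. */
int dasd_name_by_index(unsigned int index, char *buf, size_t len)
{
    char suffix[8];                 /* up to 7 letters plus NUL */
    size_t pos = sizeof(suffix) - 1;
    long long n = (long long)index;

    assert(buf != NULL);
    suffix[pos] = '\0';
    do {
        if (pos == 0)
            return -1;
        suffix[--pos] = (char)('a' + n % 26);
        n = n / 26 - 1;             /* bijective base 26: z is followed by aa */
    } while (n >= 0);

    int r = snprintf(buf, len, "dasd%s", suffix + pos);
    return (r >= 0 && (size_t)r < len) ? 0 : -1;
}

/* New style from the changeset comment: "dasd_<busid>_". A partition
 * number would be appended, as in root=/dev/dasd_0.0.0202_1. */
int dasd_name_by_busid(const char *busid, char *buf, size_t len)
{
    assert(busid != NULL && buf != NULL);
    int r = snprintf(buf, len, "dasd_%s_", busid);
    return (r >= 0 && (size_t)r < len) ? 0 : -1;
}
```

Under the first scheme a disk is renamed whenever a device ahead of it is added or removed; the bus-ID form stays the same regardless of order.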

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]



2004-01-27 09:26:04

by Eric Dumazet

Subject: linux-2.6.1 x86_64 : STACK_TOP and text/data

Hi all

Does anybody know why STACK_TOP is defined as 0xc0000000 on x86_64?

This means that stack-allocated variables are all in the first 4GB
quadrant of memory.
As the default virtual addresses of a program's text/data are in this
same quadrant, some programming errors could go undetected.
(Some programmers could still cast pointers to 'unsigned int', for
example, and this could 'work'.)
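
A minimal sketch of the failure mode described above, assuming a 32-bit 'unsigned int' (as on x86_64 Linux). The helper names are hypothetical and not taken from any kernel or libc source:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a careless (unsigned int) cast of a 64-bit pointer value:
 * only the low 32 bits are kept. */
uint64_t narrow_like_uint_cast(uint64_t addr)
{
    return (uint64_t)(uint32_t)addr;
}

/* Nonzero if the address survives the round trip unchanged. */
int survives_uint_cast(uint64_t addr)
{
    return narrow_like_uint_cast(addr) == addr;
}

/* Self-check with the addresses quoted in this thread. */
void check_thread_examples(void)
{
    assert(survives_uint_cast(0xc0000000ULL - 16));   /* just below STACK_TOP */
    assert(!survives_uint_cast(0x120000000ULL));      /* Tru64 text base */
    assert(!survives_uint_cast(0x140000000ULL));      /* Tru64 data base */
}
```

An address just below 0xc0000000 round-trips intact, so the bug stays hidden; with a layout above 4GB the same cast loses the high bits and the error shows up immediately.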

Tru64 has a different strategy :
Program text starts at 0x120000000
Program data starts at 0x140000000
Stack is just under text, but still not in the first 4GB quadrant.

This way, programmer errors are likely to be detected at development time.

Another point is that the BSS zone (heap) cannot exceed 3GB in x86_64 mode,
since the brk hits the stack.
libc malloc then falls back to using a lot of arenas, which is suboptimal
in terms of vmas.

Strangely, in ia32 emulation mode, the stack is placed at the 4GB limit!

Thank you
Eric Dumazet

2004-01-27 18:24:58

by Andi Kleen

Subject: Re: linux-2.6.1 x86_64 : STACK_TOP and text/data

dada1 <[email protected]> writes:

> Does anybody know why STACK_TOP is defined as 0xc0000000 on x86_64?

STACK_TOP is only for 32bit a.out executables running on x86-64.
ELF 32bit and 64bit programs use different defaults.

-Andi

2004-01-27 18:57:34

by Eric Dumazet

Subject: Re: linux-2.6.1 x86_64 : STACK_TOP and text/data

Andi Kleen wrote:

> STACK_TOP is only for 32bit a.out executables running on x86-64
>
> ELF 32bit and 64bit programs use different defaults.
>
> -Andi

Hi Andi

I'm afraid not, Andi.

I changed include/asm-x86_64/a.out.h

#define STACK_TOP 0x10c0000000 /* instead of 0xc0000000 */

then, after reboot :

file /sbin/init
/sbin/init: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for
GNU/Linux 2.4.0, dynamically linked (uses shared libs), stripped


cat /proc/1/maps

00400000-00408000 r-xp 00000000 03:01 556032 /sbin/init
00508000-00509000 rw-p 00008000 03:01 556032 /sbin/init
00509000-0052a000 rwxp 00000000 00:00 0
10bfffe000-10c0000000 rwxp fffffffffffff000 00:00 0
2a95556000-2a95569000 r-xp 00000000 03:01 637734 /lib64/ld-2.3.2.so
2a95569000-2a9556a000 rw-p 00000000 00:00 0
2a95669000-2a9566a000 rw-p 00013000 03:01 637734 /lib64/ld-2.3.2.so
2a9566a000-2a957a2000 r-xp 00000000 03:01 637741 /lib64/libc-2.3.2.so
2a957a2000-2a9586a000 ---p 00138000 03:01 637741 /lib64/libc-2.3.2.so
2a9586a000-2a958a7000 rw-p 00100000 03:01 637741 /lib64/libc-2.3.2.so
2a958a7000-2a958ac000 rw-p 00000000 00:00 0
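
A listing like the one above can also be checked mechanically. The following is only a sketch with hypothetical helper names; it parses the "start-end" prefix of a /proc/<pid>/maps line and reports whether the mapping lies entirely below the 4GB boundary:

```c
#include <assert.h>
#include <stdio.h>

#define FOUR_GB 0x100000000ULL

/* Parse the "start-end" prefix of one /proc/<pid>/maps line.
 * Returns 0 on success, -1 if the line does not match. */
int parse_maps_range(const char *line,
                     unsigned long long *start, unsigned long long *end)
{
    assert(line != NULL && start != NULL && end != NULL);
    return sscanf(line, "%llx-%llx", start, end) == 2 ? 0 : -1;
}

/* Nonzero if the whole mapping lies below the 4GB boundary. */
int mapping_below_4gb(const char *line)
{
    unsigned long long start, end;

    if (parse_maps_range(line, &start, &end) != 0)
        return 0;
    return end <= FOUR_GB;
}
```

Applied to the output above, the /sbin/init text and data mappings are below 4GB, while the 10bfffe000-10c0000000 mapping (the relocated stack) is not.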

See you

2004-01-27 19:31:04

by Andi Kleen

Subject: Re: linux-2.6.1 x86_64 : STACK_TOP and text/data

On Tue, 27 Jan 2004 19:57:23 +0100
dada1 <[email protected]> wrote:

> Andi Kleen wrote:
>
> > STACK_TOP is only for 32bit a.out executables running on x86-64
> >
> > ELF 32bit and 64bit programs use different defaults.
> >
> > -Andi
>
> Hi Andi
>
> I'm afraid not, Andi.

You're right. Thanks for reporting this. This seems to be a 2.6
specific bug, it didn't happen in 2.4.

I will fix it. It should definitely use PAGE_OFFSET for 64bit
processes and 4GB for !3GB 32bit processes.

-Andi

2004-01-27 19:46:32

by Eric Dumazet

Subject: Re: linux-2.6.1 x86_64 : STACK_TOP and text/data

Andi Kleen wrote:

> You're right. Thanks for reporting this. This seems to be a 2.6
> specific bug, it didn't happen in 2.4.
>
> I will fix it. It should definitely use PAGE_OFFSET for 64bit
> processes and 4GB for !3GB 32bit processes.
>
> -Andi

Another thing I noticed in the latest glibc CVS (nptl):

Thread stacks are also allocated in the low 4GB quadrant:

nptl/sysdeps/x86_64/pthreaddef.h
/* We prefer to have the stack allocated in the low 4GB since this
allows faster context switches. */
#define ARCH_MAP_FLAGS MAP_32BIT

Is this really true?
Is memory allocated in the low 4GB faster on x86_64 (64bit kernel,
64bit user program)?

Thank you

Eric Dumazet


2004-01-27 20:00:07

by Andi Kleen

Subject: Re: linux-2.6.1 x86_64 : STACK_TOP and text/data

On Tue, 27 Jan 2004 20:45:24 +0100
dada1 <[email protected]> wrote:

> Another thing I noticed in the latest glibc CVS (nptl):
>
> Thread stacks are also allocated in the low 4GB quadrant:
>
> nptl/sysdeps/x86_64/pthreaddef.h
> /* We prefer to have the stack allocated in the low 4GB since this
> allows faster context switches. */
> #define ARCH_MAP_FLAGS MAP_32BIT
>
> Is this really true?
> Is memory allocated in the low 4GB faster on x86_64 (64bit kernel,
> 64bit user program)?

That only applies to areas set by set_thread_area() and referenced by
segment registers. For pointers <4GB it can use a faster method at
context switch.

They probably do that because they put the thread local data at the
bottom of the stack and it has to be referenced using %gs.
They should use a fallback if the MAP_32BIT allocation fails.

I suspect they would be better off if they allocated the thread local
data separately. The 2.4 kernel used to do the same, but switched to
separate allocation because this gives better cache colouring
(stacks tend to be aligned too much and use only parts of the cache)

MAP_32BIT only allocates in the first 2GB BTW, it's really MAP_31BIT.
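
A fallback of the kind suggested above might look like the following sketch. alloc_stack_prefer_low is a hypothetical helper, not glibc's code; MAP_32BIT is only defined on x86-64 Linux, hence the #ifdef:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS and MAP_32BIT in glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes, preferring a low address via MAP_32BIT and falling
 * back to an unconstrained mapping if that fails or is unavailable.
 * Sets *got_low to 1 when the MAP_32BIT attempt succeeded.
 * Returns NULL if both attempts fail. */
void *alloc_stack_prefer_low(size_t len, int *got_low)
{
    void *p;

    assert(len > 0 && got_low != NULL);
    *got_low = 0;

#ifdef MAP_32BIT
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    if (p != MAP_FAILED) {
        *got_low = 1;
        return p;
    }
#endif

    /* Fallback: let the kernel place the mapping anywhere. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller that gets got_low == 0 would have to use the slower path for its segment base, which is the trade-off described above.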

-Andi

2004-01-28 18:06:17

by Pete Zaitcev

Subject: Re: Cset 1.1490.4.201 - dasd naming

On Tue, 27 Jan 2004 09:52:27 +0100
"Martin Schwidefsky" <[email protected]> wrote:

> > - Change dasd names from "dasdx" to "dasd_<busid>_".

> > This breaks mkinitrd, nash, and mount by label (not to mention every
> > zipl.conf out there, because root= aliases to /sys/block/%s).
> > Would you please explain what exactly you were thinking when you
> > submitted that patch?

> The reason for this change is the requirement to have persistent device
> names. The /dev/dasdxyz naming schema heavily depends on the order in
> which the devices are added. Not good for persistent names. This change
> affects four things: 1) the internal name, 2) the name of the sysfs
> directory, 3) the root= parameter and 4) the hotplug events for dasd
> devices.

Martin, it is your architecture to break as you wish, but my gut feeling
is that you'd never get away with this if you did it on anything using
common use peripherals. This is a return to times of UNIX v6 and /dev/rk1a.
The chief penguin repeatedly stated that he wanted to see /dev/diskN
or similar (defined by a userland policy).

Considering Fedora Core 2, I do not know if we have time to repair the
damage. For the moment, I am applying a reverse patch.

Running a statically built dasd and having root=/dev/dasd_0.0.0202_1
in zipl.conf works fine. I am considering if we can get away with just
that and fixing mount by label to work. It's not the end of the world,
but it's annoying.

Is there a story of a real world deployment where the 2.4 scheme was
a hindrance which you could share? Honestly, I'm surprised you bring
the matter of "persistent names" instead of, say, exhaustion of
address range and majors.

-- Pete

2004-01-28 18:25:42

by Martin Schwidefsky

Subject: Re: Cset 1.1490.4.201 - dasd naming


Hi Pete,

> Martin, it is your architecture to break as you wish, but my gut feeling
> is that you'd never get away with this if you did it on anything using
> common use peripherals. This is a return to times of UNIX v6 and /dev/rk1a.
> The chief penguin repeatedly stated that he wanted to see /dev/diskN
> or similar (defined by a userland policy).
The idea was to get rid of the dasdxyz names which are not intuitive.

> Considering Fedora Core 2, I do not know if we have time to repair the
> damage. For the moment, I am applying a reverse patch.
Ok.

> Is there a story of a real world deployment where the 2.4 scheme was
> a hindrance which you could share? Honestly, I'm surprised you bring
> the matter of "persistent names" instead of, say, exhaustion of
> address range and majors.
That is probably the main argument to go back to the old names. After
udev and friends are in place it is not important how the disk is named
internally. The only place where it would surface is on the root=
parameter.

I'll discuss this with Horst again to see if we really need the
dasd_<busid>_ names or if we can live with the old style names on the
root= parameter.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]



2004-01-28 19:12:06

by Arnd Bergmann

Subject: Re: Cset 1.1490.4.201 - dasd naming


> That is probably the main argument to go back to the old names. After
> udev and friends are in place it is not important how the disk is named
> internally. The only place where it would surface is on the root=
> parameter.

Even for root=, it probably does not matter as long as udev is used
in the initrd/initramfs. The main argument against the new naming is
that udev can trivially create these or other persistent names, while
it's very hard for udev to calculate the compatible names.

Arnd <><

2004-01-29 18:28:06

by Martin Schwidefsky

Subject: Re: Cset 1.1490.4.201 - dasd naming

Hi Pete,

> > Is there a story of a real world deployment where the 2.4 scheme was
> > a hindrance which you could share? Honestly, I'm surprised you bring
> > the matter of "persistent names" instead of, say, exhaustion of
> > address range and majors.
> That is probably the main argument to go back to the old names. After
> udev and friends are in place it is not important how the disk is named
> internally. The only place where it would surface is on the root=
> parameter.
>
> I'll discuss this with Horst again to see if we really need the
> dasd_<busid>_ names or if we can live with the old style names on the
> root= parameter.

We discussed udev and friends today again and we decided to go back to
the old dasdxyz names. We'll send the patch with the next update to
Andrew.

blue skies,
Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: [email protected]