What's the rationale for the current assignment of GDT entries? In
particular, this section:
* 0 - null
* 1 - reserved
* 2 - reserved
* 3 - reserved
*
* 4 - unused <==== new cacheline
* 5 - unused
*
* ------- start of TLS (Thread-Local Storage) segments:
*
* 6 - TLS segment #1 [ glibc's TLS segment ]
* 7 - TLS segment #2 [ Wine's %fs Win32 segment ]
* 8 - TLS segment #3
* 9 - reserved
* 10 - reserved
* 11 - reserved
What are entries 1-3 and 9-11 reserved for? Must they be unused for
some reason, or is there some proposed use that has not been implemented yet?
Also, is there a particular reason kernel GDT entries start at 12?
Would there be a problem in using either 4 or 5 for a kernel GDT descriptor?
I'm asking because I'd like to use one of these entries for the PDA
descriptor, so that it is on the same cache line as the TLS
descriptors. That way, the entry/exit segment register reloads would
still only need to touch two GDT cache lines. Would there be a real
problem in doing this?
Thanks,
J
On Wed, 2006-09-13 at 11:58 -0700, Jeremy Fitzhardinge wrote:
> What's the rationale for the current assignment of GDT entries? In
> particular, this section:
>
> * 0 - null
> * 1 - reserved
> * 2 - reserved
> * 3 - reserved
> *
> * 4 - unused <==== new cacheline
> * 5 - unused
> *
> * ------- start of TLS (Thread-Local Storage) segments:
> *
> * 6 - TLS segment #1 [ glibc's TLS segment ]
> * 7 - TLS segment #2 [ Wine's %fs Win32 segment ]
> * 8 - TLS segment #3
> * 9 - reserved
> * 10 - reserved
> * 11 - reserved
>
>
> What are entries 1-3 and 9-11 reserved for? Must they be unused for
> some reason, or is there some proposed use that has not been implemented yet?
I don't know the exact details on these; I do know that several GDT
entries tend to be used by BIOSes in their APM implementations and thus
are better off not being used. That might be the underlying reason
here....
Ar Mer, 2006-09-13 am 21:16 +0200, ysgrifennodd Arjan van de Ven:
> I don't know the exact details on these; I do know that several GDT
> entries tend to be used by BIOSes in their APM implementations and thus
> are better off not being used.
That's 0x40, which tends to get used as if it were a real-mode base for
BIOS accesses, even via the protected-mode interface.
Alan
On Wed, 13 Sep 2006, Jeremy Fitzhardinge wrote:
> What's the rationale for the current assignment of GDT entries? In
> particular, this section:
>
> * 0 - null
> * 1 - reserved
> * 2 - reserved
> * 3 - reserved
> *
> * 4 - unused <==== new cacheline
> * 5 - unused
> *
> * ------- start of TLS (Thread-Local Storage) segments:
> *
> * 6 - TLS segment #1 [ glibc's TLS segment ]
> * 7 - TLS segment #2 [ Wine's %fs Win32 segment ]
> * 8 - TLS segment #3
> * 9 - reserved
> * 10 - reserved
> * 11 - reserved
>
>
> What are entries 1-3 and 9-11 reserved for? Must they be unused for
> some reason, or is there some proposed use that has not been implemented yet?
>
On the ix86, the first descriptor in the GDT is not used. There are
TWO 32-bit words for each GDT entry. The GDT numbers are offsets from
the first, so they are numbered as offsets; they are multiplied by 8,
the size of a GDT entry, by the processor when they are used to set
segment registers. This table is only accessed by the CPU when the LGDT
instruction is executed. When a segment register is set, the invisible
part of the segment, the top 16 bits, contains the information extracted
from the GDT, so no further access is necessary. This means that
it has nothing to do with cache lines.
The entries 1 through 3 are used during the boot sequence, see
setup.S, search for "gdt" around line 983.
> Also, is there a particular reason kernel GDT entries start at 12?
> Would there be a problem in using either 4 or 5 for a kernel GDT descriptor?
>
> I'm asking because I'd like to use one of these entries for the PDA
> descriptor, so that it is on the same cache line as the TLS
> descriptors. That way, the entry/exit segment register reloads would
> still only need to touch two GDT cache lines. Would there be a real
> problem in doing this?
>
You can add other GDT entries, up to 8192, if you want. If you set the
base, limit, type, etc. to something different from the
kernel DS, SS, CS, etc., then you need to reload the segment registers,
and if the base is different, the code offsets will be WRONG, so you
will need to give the linker the new relocation information.
I can't imagine a reason why you'd want to do this.
> Thanks,
> J
>
Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.66 BogoMips).
New book: http://www.AbominableFirebug.com/
_
****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to [email protected] - and destroy all copies of this information, including any attachments, without reading or disclosing them.
Thank you.
Alan Cox wrote:
> Ar Mer, 2006-09-13 am 21:16 +0200, ysgrifennodd Arjan van de Ven:
>
>> I don't know the exact details on these; I do know that several GDT
>> entries tend to be used by BIOSes in their APM implementations and thus
>> are better off not being used.
>>
>
> That's 0x40, which tends to get used as if it were a real-mode base for
> BIOS accesses, even via the protected-mode interface.
>
Do you mean descriptor entry 8? Because that's TLS #3, not reserved...
J
linux-os (Dick Johnson) wrote:
> The entries 1 through 3 are used during the boot sequence, see
> setup.S, search for "gdt" around line 983.
>
OK, but that's an early GDT used during boot, which shouldn't have any
bearing on the GDT of the running kernel.
> I can't imagine a reason why you'd want to do this.
>
I'm looking at packing all the descriptors together so they share a
cache line, and therefore reduce the likelihood of a cache miss when
loading a segment register.
J
Arjan van de Ven wrote:
> I don't know the exact details on these; I do know that several GDT
> entries tend to be used by BIOSes in their APM implementations and thus
> are better off not being used. That might be the underlying reason
> here....
>
Hm, I see.
Also, thinking about this a bit more, it would be most helpful to move
the PDA descriptor onto the same cache line as the other descriptors
used in the kernel - ie, somewhere in the range of 8-15 (assuming 64
byte line size):
* 8 - TLS segment #3
* 9 - reserved
* 10 - reserved
* 11 - reserved
*
* ------- start of kernel segments:
*
* 12 - kernel code segment
* 13 - kernel data segment
* 14 - default user CS
* 15 - default user DS
This seems pretty wasteful of the GDT cache line, since the kernel+user
cs/ds share a cache line with 3 reserved entries and the never-used
TLS #3 descriptor. If it were OK to put the PDA in one of 9, 10, 11,
then that would be good. Unfortunately the next cache line is clogged
up with PNP and APM stuff, which I presume is not movable.
In fact, if we assume that "reserved" means "unusable", it looks like
none of the GDT's cache lines can be freed up to lay out the most
commonly used descriptors into a single cache line:
line 0: NULL descriptor, 3 reserved, 2 unused, 2 TLS
line 1: 1 TLS, 3 reserved, kernel+user code+data
line 2: TSS, LDT, PNPBIOS, APMBIOS
line 3: APMBIOS, ESPFIX, 4 unused, doublefault TSS
Otherwise line 1 would be ideal for putting 3 TLS, kernel+user code+data
and PDA into, thereby making 99.999% of GDT descriptor uses come from
one cache line.
But anyway, what breaks if I put the PDA in 11?
J
On Wed, 13 Sep 2006, Jeremy Fitzhardinge wrote:
> linux-os (Dick Johnson) wrote:
>> The entries 1 through 3 are used during the boot sequence, see
>> setup.S, search for "gdt" around line 983.
>>
>
> OK, but that's an early GDT used during boot, which shouldn't have any
> bearing on the GDT of the running kernel.
>
>> I can't imagine a reason why you'd want to do this.
>>
>
> I'm looking at packing all the descriptors together so they share a
> cache line, and therefore reduce the likelihood of a cache miss when
> loading a segment register.
>
> J
You can certainly see if what you do works, but the last time I
looked, you need the linear address-space mapped by these to load
a new GDT, which needs to be untranslated in the TLB (unity mapped).
You can also search the kernel to see if they are required to
get into and get out of VM86 mode, for the dosemu users. Basically,
you need to change to what you want, see if it boots, then test
everything that might use these entries. It's scary and a lot of
work, probably the reason why nobody's bothered to muck with the
GDT. There is also a "specific" entry, used for pseudo-real mode
to access the BIOS for some APM stuff. I don't remember the number,
but that shouldn't be changed because the BIOS hard-codes it for
setting segments.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.66 BogoMips).
New book: http://www.AbominableFirebug.com/
Jeremy Fitzhardinge wrote:
> Arjan van de Ven wrote:
>> I don't know the exact details on these; I do know that several GDT
>> entries tend to be used by BIOSes in their APM implementations and thus
>> are better off not being used. That might be the underlying reason
>> here....
>>
>
> Hm, I see.
>
> Also, thinking about this a bit more, it would be most helpful to move
> the PDA descriptor onto the same cache line as the other descriptors
> used in the kernel - ie, somewhere in the range of 8-15 (assuming 64
> byte line size):
>
> * 8 - TLS segment #3
> * 9 - reserved
> * 10 - reserved
> * 11 - reserved
> *
> * ------- start of kernel segments:
> *
> * 12 - kernel code segment
> * 13 - kernel data segment
> * 14 - default user CS
> * 15 - default user DS
>
> This seems pretty wasteful of the GDT cache line, since the
> kernel+user cs/ds share a cache line with 3 reserved entries and
> the never-used TLS #3 descriptor. If it were OK to put the PDA in
> one of 9,10,11, then that would be
TLS #3 overlaps BIOS 0x40, but code which calls borken APM / PnP BIOS
and sets up protected mode 0x40 GDT segment does so by swapping out the
TLS segment with the identity simulation of physical 0x400 offset,
swapping it back afterwards. Short of bugs in that code (which there
are, btw), you shouldn't need to be concerned with it.
I believe 9,10,11 are reserved for future users like yourself or
expanded TLS segments. I think a bank of 3 TLS segments in the GDT is
working fine now (does NPTL even use more than one?).
> good. Unfortunately the next cache line is clogged up with PNP and
> APM stuff, which I presume is not movable.
Totally movable, actually, just means breaking module dependencies.
>
> In fact, if we assume that "reserved" means "unusable", it looks like
> none of the GDT's cache lines can be freed up to lay out the most
> commonly used descriptors into a single cache line:
>
> line 0: NULL descriptor, 3 reserved, 2 unused, 2 TLS
> line 1: 1 TLS, 3 reserved, kernel+user code+data
> line 2: TSS, LDT, PNPBIOS, APMBIOS
> line 3: APMBIOS, ESPFIX, 4 unused, doublefault TSS
>
> Otherwise line 1 would be ideal for putting 3 TLS, kernel+user
> code+data and PDA into, thereby making 99.999% of GDT descriptor uses
> come from one cache line.
That change is visible to userspace, unfortunately.
>
> But anyway, what breaks if I put the PDA in 11?
Nothing.
Zach
Ar Mer, 2006-09-13 am 13:59 -0700, ysgrifennodd Zachary Amsden:
> TLS #3 overlaps BIOS 0x40, but code which calls borken APM / PnP BIOS
> and sets up protected mode 0x40 GDT segment does so by swapping out the
> TLS segment with the identity simulation of physical 0x400 offset,
> swapping it back afterwards. Short of bugs in that code (which there
> are, btw), you shouldn't need to be concerned with it.
Care to elucidate ?
Zachary Amsden wrote:
> I believe 9,10,11 are reserved for future users like yourself or
> expanded TLS segments. I think a bank of 3 TLS segments in the GDT is
> working fine now (does NPTL even use more than one?).
Nope. And there's a comment that wine uses one more. I think the third
is completely unused.
Does this mean that "reserved" is actually synonymous with "unused" in
asm/segment.h?
>> Otherwise line 1 would be ideal for putting 3 TLS, kernel+user
>> code+data and PDA into, thereby making 99.999% of GDT descriptor uses
>> come from one cache line.
>
> That change is visible to userspace, unfortunately.
Don't think it matters much. 32-bit processes on x86-64 seem perfectly
happy with the TLS being in a different place. I think the ABI is
defined in terms of "use the selector for the entry that
set_thread_area/clone returns", and so is not a constant. But I agree
it would be better not to.
Hm, moving user cs/ds would be pretty visible too... Hm, and it would
have a greater chance of breaking stuff if they changed, compared to
moving the TLS...
So is there any reason for "kernel entries start at 12"? If there's no
reason for it, then we can pack everything useful into 1-5.
>> But anyway, what breaks if I put the PDA in 11?
>
> Nothing.
OK then.
J
On Wed, 13 Sep 2006, Jeremy Fitzhardinge wrote:
> *
> * 4 - unused <==== new cacheline
> * 5 - unused
These _used_ to be the "user CS/DS" respectively, but that got changed
around by me when I did the "sysenter" support.
The sysenter logic (or, more properly, the sysexit one) requires that the
user code segment number is the same as the kernel code segment +2 (ie
"+16" in actual selector term). And the user data segment needs to be +3.
So with sysenter, we needed a block of four contiguous segments: kernel
code, kernel data, user code, user data (in that order).
There are other possible things to do, but what we did was to move the
user segments up to just above the kernel ones (which we left in place).
> * 6 - TLS segment #1 [ glibc's TLS segment ]
> * 7 - TLS segment #2 [ Wine's %fs Win32 segment ]
> * 8 - TLS segment #3
> * 9 - reserved
> * 10 - reserved
> * 11 - reserved
These are really reserved, I think we left them that way on purpose so
that if we wanted to, we can allow more of the contiguous per-thread
state. And segment #8 (ie 0x40) is special (TLS segment #3), of course.
Anybody who wants to emulate windows or use the BIOS needs to use that for
their "common BIOS area" thing, iirc.
I think it's generally a good idea to keep the low segment reserved (or at
least free to use for whatever user code), since if there are any special
magic segment descriptor numbers, they tend to be in that low range. The
#8/0x40 thing is just an example.
> What are entries 1-3 and 9-11 reserved for? Must they be unused for some
> reason, or is there some proposed use that has not been implemented yet?
>
> Also, is there a particular reason kernel GDT entries start at 12? Would
> there be a problem in using either 4 or 5 for a kernel GDT descriptor?
See above. The kernel and user segments have to be moved as a block of
four, and obviously we'd like to keep them in the same cacheline too.
Also, the cacheline that contains segment #8/0x40 is not available, so
that together with keeping low segments for user space explains why it's
at segment numbers #12-15 (selectors 0x60/0x68/0x73/0x7b).
But I don't think anything but 0x40 is "set in stone".
Linus
Linus Torvalds wrote:
> These _used_ to be the "user CS/DS" respectively, but that got changed
> around by me when I did the "sysenter" support.
>
So does this mean that moving the user-visible cs/ds isn't likely to
break stuff, if it has been done before?
> The sysenter logic (or, more properly, the sysexit one) requires that the
> user code segment number is the same as the kernel code segment +2 (ie
> "+16" in actual selector term). And the user data segment needs to be +3.
>
Yep, I'm aware of that constraint.
> And segment #8 (ie 0x40) is special (TLS segment #3), of course.
> Anybody who wants to emulate windows or use the BIOS needs to use that for
> their "common BIOS area" thing, iirc.
>
Do you mean that something like dosemu/Wine needs to be able to use GDT
#8? Or is it only used in kernel code?
> See above. The kernel and user segments have to be moved as a block of
> four, and obviously we'd like to keep them in the same cacheline too.
> Also, the cacheline that contains segment #8/0x40 is not available,
Why's that? That cacheline (assuming 64 byte line size) already
contains the user/kernel/cs/ds descriptors.
I'm thinking of putting together a patch to change the descriptor use to:
8 - TLS #1
9 - TLS #2
10 - TLS #3
11 - Kernel PDA
12 - Kernel CS
13 - Kernel DS
14 - User CS
15 - User DS
This has the advantage of leaving the user cs/ds unchanged. From what
people have said so far, this should be OK, other than making the heavily
used TLS #1 share the BIOS common area entry number. If this needs to
be usable by userspace for something special, then making it TLS #1
won't fly...
Alternatively, maybe:
0 - NULL
1 - Kernel PDA
2 - Kernel CS
3 - Kernel DS
4 - User CS
5 - User DS
6 - TLS #1
7 - TLS #2
which moves the user cs/ds, but avoids #8.
J
On Wed, 13 Sep 2006, Jeremy Fitzhardinge wrote:
>
> So does this mean that moving the user-visible cs/ds isn't likely to break
> stuff, if it has been done before?
Yes. I _think_ we could do it. It's been done before, and nobody noticed.
That said, it may actually be that programs have since become much more
aware of segments, for a rather perverse reason: the TLS stuff. Old
programs are all very much coded and compiled for a totally flat model,
and as such they really don't know _anything_ about segments. But with
more TLS stuff, it's possible that a modern threaded program is at least
aware of _some_ of it.
In other words - I _suspect_ we can move things around, but it would
require some rather heavy testing, at least. Especially programs like Wine
might react badly.
> > And segment #8 (ie 0x40) is special (TLS segment #3), of course. Anybody who
> > wants to emulate windows or use the BIOS needs to use that for their "common
> > BIOS area" thing, iirc.
>
> Do you mean that something like dosemu/Wine needs to be able to use GDT #8?
> Or is it only used in kernel code?
Both. I think the APM BIOS callbacks use GDT#8 too. As long as it's not
one of the really _core_ kernel segments, that's ok (you can swap it
around and nobody will care). But it would be a total disaster (I suspect)
if GDT#8 was the kernel code segment, for example. Suddenly the "switch
things around temporarily" is not as trivial any more, and involves nasty
nasty things.
[ BUT! I haven't ever really had much to do with those BIOS callbacks, and
I'm too lazy to check, so this is all from memory. ]
> > See above. The kernel and user segments have to be moved as a block of four,
> > and obviously we'd like to keep them in the same cacheline too. Also, the
> > cacheline that contains segment #8/0x40 is not available,
>
> Why's that? That cacheline (assuming 64 byte line size) already contains the
> user/kernel/cs/ds descriptors.
Right. That's what I'm saying. We should move them all together, and we
should keep them as aligned as they are now.
> I'm thinking of putting together a patch to change the descriptor use to:
>
> 8 - TLS #1
> 9 - TLS #2
> 10 - TLS #3
So I'd not be surprised if moving the TLS segments around would break
something.
> 11 - Kernel PDA
But you keep the four basic ones in the same place:
> 12 - Kernel CS
> 13 - Kernel DS
> 14 - User CS
> 15 - User DS
So that's obviously ok at least for _those_.
> Alternatively, maybe:
>
> 0 - NULL
> 1 - Kernel PDA
> 2 - Kernel CS
> 3 - Kernel DS
> 4 - User CS
> 5 - User DS
> 6 - TLS #1
> 7 - TLS #2
>
> which moves the user cs/ds, but avoids #8.
I don't like that one, exactly because now the four most common segments
(which get accessed for all system calls) are no longer in the same
32-byte cacheline.
[ Unless we start playing games with offsetting the GDT or something..
Quite frankly, I'd rather keep it simple and obvious. ]
Now, most systems have a 64-byte cacheline these days (and some have a
split 128-byte one), and maybe we'll never go back to the "good old days"
with 32-byte lines, so maybe this is a total non-issue. But fitting in the
same 32-byte aligned thing would still count as a "good thing" in my book.
That said, numbers talk, bullshit walks. If the above just works a lot
better for all modern CPU's that all have 64-byte cachelines (because now
_everything_ is in that bigger cacheline), and if you can show that with
numbers, and nothing breaks in practice, then hey..
Linus
Linus Torvalds wrote:
> So I'd not be surprised if moving the TLS segments around would break
> something.
>
I don't think so. 32-bit code running on x86-64 has different TLS
selectors, and everything seems to work there...
> That said, numbers talk, bullshit walks. If the above just works a lot
> better for all modern CPU's that all have 64-byte cachelines (because now
> _everything_ is in that bigger cacheline), and if you can show that with
> numbers, and nothing breaks in practice, then hey..
>
My goal would be to do a minimal change which packs all the useful stuff
together in a 64-byte line. Ideally it would just use two 32-byte
lines, but I don't think that's as important.
Caching effects are pretty hard to measure anyway, and with something as
deeply x86-microarchitectural as this, I could imagine lots of other CPU
cleverness which could obscure any simple measurement. But packing
things into a line certainly can't hurt.
I'll put something together, and see how it goes...
J
Alan Cox wrote:
> Ar Mer, 2006-09-13 am 13:59 -0700, ysgrifennodd Zachary Amsden:
>
>> TLS #3 overlaps BIOS 0x40, but code which calls borken APM / PnP BIOS
>> and sets up protected mode 0x40 GDT segment does so by swapping out the
>> TLS segment with the identity simulation of physical 0x400 offset,
>> swapping it back afterwards. Short of bugs in that code (which there
>> are, btw), you shouldn't need to be concerned with it.
>>
>
> Care to elucidate ?
>
I believe the current max use case for GDT descriptors is Wine. Wine
compiled against TLS glibc uses entry zero for libc, and allocates
another GDT entry for the first thread created by NTDLL (although I have
no idea why, since there is fallback code to use LDT allocation instead,
and all subsequent allocations happen via the LDT - perhaps some kernel
mode DLL thing insists on having the first thread in the GDT?) DOSemu,
by the way, only uses the LDT.
But there is no reason userspace can't allocate 3 TLS descriptors in the
GDT per thread. If it did, the overlap with 0x40 (descriptor #8, the
real-mode BIOS simulation of physical address 0x400, the BIOS data area)
causes a problem. Fortunately, APM and PnP take care to fix this by
swapping in and out the descriptors. Unfortunately, they don't get it
quite right.
Selected code snippets (PnP):
/*
* PnP BIOSes are generally not terribly re-entrant.
* Also, don't rely on them to save everything correctly.
*/
        if (pnp_bios_is_utter_crap)
                return PNP_FUNCTION_NOT_SUPPORTED;

        cpu = get_cpu();
        save_desc_40 = get_cpu_gdt_table(cpu)[0x40 / 8];
        get_cpu_gdt_table(cpu)[0x40 / 8] = bad_bios_desc; <---- set up fake BIOS descriptor for 0x400
/* On some boxes IRQ's during PnP BIOS calls are deadly. */
spin_lock_irqsave(&pnp_bios_lock, flags);
... now inline assembler
"pushl %%fs\n\t"
"pushl %%gs\n\t"
"pushfl\n\t"
"movl %%esp, pnp_bios_fault_esp\n\t"
"movl $1f, pnp_bios_fault_eip\n\t"
"lcall %5,%6\n\t"
"1:popfl\n\t"
"popl %%gs\n\t" <---- (**)
"popl %%fs\n\t" <---- (**)
... now restore the original GDT descriptor
spin_unlock_irqrestore(&pnp_bios_lock, flags);
get_cpu_gdt_table(cpu)[0x40 / 8] = save_desc_40;
put_cpu();
But it is too late - damage is already done (at **), since %fs or %gs
could have had a reference to TLS descriptor #3, and they get reloaded
_before_ the GDT is restored. Thus any userspace process that uses TLS
descriptor #3 in FS or GS and makes a BIOS call to PnP may get corrupted
data loaded into the hidden state of FS / GS selectors.
APM has a similar problem. Both are easily fixable, but there has been
too much flux in this area recently to get a stable patch for these
problems, and the problems are exceedingly unlikely, since I don't know
of a single userspace program using TLS descriptor #3, much less one
that makes use of APM or PnP facilities. There is the possibility
however, that such a program could sleep, run the idle thread, which
makes a call into some of these BIOS facilities, and then reschedules
the same program thread - which means FS/GS never get reloaded, thus
maintaining their corrupted values. It is worth fixing, just not a high
priority. I had a patch that fixed both APM and PnP at one time, but it
is covered with mold and now looks like a science experiment. Shall I
apply disinfectant?
Zach
On Wed, 13 Sep 2006 17:25:53 -0700 Zachary Amsden <[email protected]> wrote:
>
> It is worth fixing, just not a high
> priority. I had a patch that fixed both APM and PnP at one time, but it
> is covered with mold and now looks like a science experiment. Shall I
> apply disinfectant?
Yes, please.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/
Linus Torvalds writes:
> On Wed, 13 Sep 2006, Jeremy Fitzhardinge wrote:
>> So does this mean that moving the user-visible cs/ds isn't
>> likely to break stuff, if it has been done before?
>
> Yes. I _think_ we could do it. It's been done before, and nobody noticed.
>
> That said, it may actually be that programs have since become much more
> aware of segments, for a rather perverse reason: the TLS stuff. Old
> programs are all very much coded and compiled for a totally flat model,
> and as such they really don't know _anything_ about segments. But with
> more TLS stuff, it's possible that a modern threaded program is at least
> aware of _some_ of it.
We actually have an ABI problem right now because of this.
Note that i386 and x86_64 use different GDT slots.
As far as I can tell, users need to hard-code the mapping
from TLS slot to segment number. They use 0,1,2 to ask the
kernel to set things up (via set_thread_area), but can't
just pop that into %fs or %gs.
So a 32-bit app using set_thread_area can work on i386 or x86_64,
but not both. I guess glibc gets %gs set up free via clone() with
the right flags, and thus does not need to determine the kernel.
For anything involving set_thread_area though, it gets nasty.
Typical hacks that result from this:
call uname() and look for "x86_64"
see if the addresses of local variables exceed 0xbfffffff
examine /proc/1/maps
check for a /lib64 directory
change SSE register 8 in a signal handler frame and see if it sticks
checksum the vdso code
...
Please save us from these foul hacks.
Jeremy Fitzhardinge writes:
> Zachary Amsden wrote:
>> I believe 9,10,11 are reserved for future users like yourself or
>> expanded TLS segments. I think a bank of 3 TLS segments in the
>> GDT is working fine now (does NPTL even use more than one?).
>
> Nope. And there's a comment that wine uses one more. I think
> the third is completely unused.
I use the third. The sucky thing is that I need to determine if
the kernel is 64-bit to know what I must load into the segment
register. Fortunately this code is not yet out in the wild, so
you can still fix the ABI situation for me at least.
>>> Otherwise line 1 would be ideal for putting 3 TLS, kernel+user
>>> code+data and PDA into, thereby making 99.999% of GDT descriptor
>>> uses come from one cache line.
>>
>> That change is visible to userspace, unfortunately.
>
> Don't think it matters much. 32-bit processes on x86-64 seem
> perfectly happy with the TLS being in a different place.
Heh. I wish. Well, OK, but only because I detect the kernel!
> I think the ABI is defined in terms of "use the selector for
> the entry that set_thread_area/clone returns", and so is not
> a constant. But I agree it would be better not to.
>
> Hm, moving user cs/ds would be pretty visible too... Hm, and
> it would have a greater chance of breaking stuff if they changed,
> compared to moving the TLS...
I think that would be a lower chance, not a greater chance.
Reasons why an app might care:
a. identify a 64-bit kernel
b. far jumps between 32-bit and 64-bit code
c. reload of ds/es after a string operation on thread-private data
Perhaps i386 should change to match x86_64.
"Albert Cahalan" <[email protected]> writes:
> I think that would be a lower chance, not a greater chance.
> Reasons why an app might care:
>
> a. identify a 64-bit kernel
> b. far jumps between 32-bit and 64-bit code
> c. reload of ds/es after a string operation on thread-private data
>
> Perhaps i386 should change to match x86_64.
I agree that the difference is annoying.
However I just wrote a user space implementation of fork that
is capable of copying a process from an i386-only kernel to an x86_64
kernel, and executing there without having to detect the kernel type.
It didn't take hacks to accomplish that.
The basic syscall is:
int set_thread_area (struct user_desc *u_info);
struct user_desc {
        unsigned int  entry_number;
        unsigned long base_addr;
        unsigned int  limit;
        unsigned int  seg_32bit:1;
        unsigned int  contents:2;
        unsigned int  read_exec_only:1;
        unsigned int  limit_in_pages:1;
        unsigned int  seg_not_present:1;
        unsigned int  useable:1;
};
If entry_number is -1 the kernel finds a free gdt entry and
sets up the segment and returns with entry_number set to the
segment number.
Eric
Albert Cahalan wrote:
> We actually have an ABI problem right now because of this.
> Note that i386 and x86_64 use different GDT slots.
>
> As far as I can tell, users need to hard-code the mapping
> from TLS slot to segment number. They use 0,1,2 to ask the
> kernel to set things up (via set_thread_area), but can't
> just pop that into %fs or %gs.
That's not true at all. The program I posted earlier in this thread
uses set_thread_area() to allocate a GDT slot, and it works on both
native 32 bit and 32-under-64. The entry_number field in the struct
user_desc is an actual entry number, so you can easily construct a
selector from it.
> Typical hacks that result from this:
>
> call uname() and look for "x86_64"
> see of the addresses of local variables exceed 0xbfffffff
> examine /proc/1/maps
> check for a /lib64 directory
> change SSE register 8 in a signal handler frame and see if it sticks
> checksum the vdso code
> ...
>
> Please save us from these foul hacks.
Er, that all looks completely unnecessary.
J
On 9/14/06, Eric W. Biederman <[email protected]> wrote:
> I agree that the difference is annoying.
>
> However I just wrote a user space implementation of fork that
> is capable of copying a process from an i386-only kernel to an x86_64
> kernel, and executing there without having to detect the kernel type.
>
> It didn't take hacks to accomplish that.
>
> The basic syscall is:
> int set_thread_area (struct user_desc *u_info);
> struct user_desc {
> unsigned int entry_number;
> unsigned long base_addr;
> unsigned int limit;
> unsigned int seg_32bit:1;
> unsigned int contents:2;
> unsigned int read_exec_only:1;
> unsigned int limit_in_pages:1;
> unsigned int seg_not_present:1;
> unsigned int useable:1;
> };
>
> If entry_number is -1 the kernel finds a free gdt entry and
> sets up the segment and returns with entry_number set to the
> segment number.
Eeeeeew.
So if I grabbed the first two slots before glibc got to
mess with them, glibc wouldn't break horribly?
If I grabbed one slot and glibc grabbed another, Wine
would be OK with the third instead of the second?
So basically it's not allowed to just grab the 3rd slot?
What if I want to find out what is already in use?
Am I supposed to iterate over all 8191 possible
GDT entries? How do I even tell how many slots
are available without using them all up?
Eeeeeeew. Well this was documented exactly nowhere.
The man page is even vague about entry_number,
meaning I had to dig in the kernel source (AMD manual
by my side) to find if that was a GDT slot or TLS slot,
as array index or byte offset, with or without the low bits
all set up for loading into the segment register, loaded for
me or not, etc.
Albert Cahalan wrote:
> Eeeeeew.
>
> So if I grabbed the first two slots before glibc got to
> mess with them, glibc wouldn't break horribly?
> If I grabbed one slot and glibc grabbed another, Wine
> would be OK with the third instead of the second?
Glibc should allocate a slot just the same way, just like wine does as
well. Glibc just usually gets its slot allocated first.
>
> So basically it's not allowed to just grab the 3rd slot?
You can, but you should be prepared for it to fail as well.
>
> What if I want to find out what is already in use?
> Am I supposed to iterate over all 8191 possible
> GDT entries? How do I even tell how many slots
> are available without using them all up?
There are only 32 possible GDT entries in 32-bit i386 Linux, and only
three of them are usable for userspace. You can't find out which slots
are in use, but you can cause one to be allocated and returned to you.
This seems like a perfectly reasonable API to me, why do you think it is
so ugly?
Zach
Albert Cahalan wrote:
> So if I grabbed the first two slots before glibc got to
> mess with them, glibc wouldn't break horribly?
glibc would be happy with anything it got; if you grabbed all 3 TLS
slots it would probably be upset.
> If I grabbed one slot and glibc grabbed another, Wine
> would be OK with the third instead of the second?
Presumably.
> So basically it's not allowed to just grab the 3rd slot?
Eh? You mean there's no "allocate and return TLS slot #N" operation?
No, but all the TLS slots should be interchangeable. Once you've got
your entry numbers and worked out your selector values, you can just use
them.
> What if I want to find out what is already in use?
> Am I supposed to iterate over all 8191 possible
> GDT entries? How do I even tell how many slots
> are available without using them all up?
The kernel reserves 3 slots in the GDT for usermode use, which are
per-thread. If you want more segment descriptors, you can always
allocate an LDT.
> Eeeeeeew. Well this was documented exactly nowhere.
> The man page is even vague about entry_number,
man set_thread_area has this as paragraph 2:
When set_thread_area() is passed an entry_number of -1, it uses a free
TLS entry. If set_thread_area() finds a free TLS entry, the value of
u_info->entry_number is set upon return to show which entry was
changed.
which seems pretty clear to me. A quick run with strace on any binary
shows this in action:
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7fb06c0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1}) = 0
J
On 9/14/06, Zachary Amsden <[email protected]> wrote:
> Albert Cahalan wrote:
> > So basically it's not allowed to just grab the 3rd slot?
>
> You can, but you should be prepared for it to fail as well.
Without knowing details of the kernel's GDT, how?
> > What if I want to find out what is already in use?
> > Am I supposed to iterate over all 8191 possible
> > GDT entries? How do I even tell how many slots
> > are available without using them all up?
>
> There are only 32 possible GDT entries in 32-bit i386 Linux, and only
> three of them are usable for userspace. You can't find out which slots
> are in use, but you can cause one to be allocated and returned to you.
> This seems like a perfectly reasonable API to me, why do you think it is
> so ugly?
Eh, "returned to you" doesn't work for me. I need to
figure out what other code (not written by me) uses.
I may need to "borrow" a slot if all three slots are in
use. Without using evil knowledge of the GDT, how
am I to do that? I don't know what slots might have
been allocated by other libraries.
Albert Cahalan wrote
>>
>> There are only 32 possible GDT entries in 32-bit i386 Linux, and only
>> three of them are usable for userspace. You can't find out which slots
>> are in use, but you can cause one to be allocated and returned to you.
>> This seems like a perfectly reasonable API to me, why do you think it is
>> so ugly?
>
> Eh, "returned to you" doesn't work for me. I need to
> figure out what other code (not written by me) uses.
I don't understand. Why do you need to figure that out? You need a
selector, you ask for one, and you get assigned one. It is that
simple. You can't figure out what other code uses, and the kernel has
no way to tell you, because that is an application level allocation
problem, not a kernel responsibility. The kernel has no visibility into
userspace intentions regarding segment usage.
> I may need to "borrow" a slot if all three slots are in
> use. Without using evil knowledge of the GDT, how
> am I to do that? I don't know what slots might have
> been allocated by other libraries.
What kind of libraries are you using? Unless this is really, really,
special purpose, they are going to allocate at most one, and that is
only if you use TLS libraries.
If all three slots are in use (i.e. your allocation fails), you'll have
to allocate an LDT selector, just like wine:
void wine_ldt_init_fs( unsigned short sel, const LDT_ENTRY *entry )
{
    if ((sel & ~3) == (global_fs_sel & ~3))
    {
#ifdef __linux__
        struct modify_ldt_s ldt_info;
        int ret;

        ldt_info.entry_number = sel >> 3;
        fill_modify_ldt_struct( &ldt_info, entry );
        if ((ret = set_thread_area( &ldt_info )) < 0) perror( "set_thread_area" );
#elif defined(__APPLE__)
        int ret = thread_set_user_ldt( wine_ldt_get_base(entry),
                                       wine_ldt_get_limit(entry), 0 );
        if (ret == -1) perror( "thread_set_user_ldt" );
        else assert( ret == global_fs_sel );
#endif /* __APPLE__ */
    }
    else /* LDT selector */
    {
        internal_set_entry( sel, entry );   /* <---- just like this */
    }
    wine_set_fs( sel );
}
Zach
On Wednesday 13 September 2006 20:58, Jeremy Fitzhardinge wrote:
> What's the rationale for the current assignment of GDT entries? In
> particular, this section:
AFAIK it was mostly for APM and various BIOS bugs. IIRC Wine had
some special requirements at some point too, but I can't remember them right
now. On x86-64 I use all GDT entries, although there are a few special
ordering restrictions due to the semantics of SYSCALL. I ignored Wine too and
so far nobody has complained, so whatever requirements they have they can't
be that important.
> I'm asking because I'd like to use one of these entries for the PDA
> descriptor, so that it is on the same cache line as the TLS
> descriptors. That way, the entry/exit segment register reloads would
> still only need to touch two GDT cache lines. Would there be a real
> problem in doing this?
The only way to find out would be to do it. It's quite possible that all
the systems with APM BIOS that needed it are long beyond their MTBF.
-Andi
Ar Mer, 2006-09-13 am 17:25 -0700, ysgrifennodd Zachary Amsden:
> that makes use of APM or PnP facilities. There is the possibility
> however, that such a program could sleep, run the idle thread, which
> makes a call into some of these BIOS facilities, and then reschedules
> the same program thread - which means FS/GS never get reloaded, thus
> maintaining their corrupted values. It is worth fixing, just not a high
> priority. I had a patch that fixed both APM and PnP at one time, but it
> is covered with mold and now looks like a science experiment. Shall I
> apply disinfectant?
I think that would be useful, or just post up the mouldy one for someone
else to rework. If someone is hitting that kind of bug it's going to be
pretty horrible to track down.
On Wed, 13 Sep 2006 23:11:05 -0700, Jeremy Fitzhardinge wrote:
>Albert Cahalan wrote:
>> We actually have an ABI problem right now because of this.
>> Note that i386 and x86_64 use different GDT slots.
>>
>> As far as I can tell, users need to hard-code the mapping
>> from TLS slot to segment number. They use 0,1,2 to ask the
>> kernel to set things up (via set_thread_area), but can't
>> just pop that into %fs or %gs.
>
>That's not true at all. The program I posted earlier in this thread
>uses set_thread_area() to allocate a GDT slot, and it works on both
>native 32 bit and 32-under-64.
The i386 TLS API has three components:
(1) set_thread_area(entry_number == -1):
allocates and sets up the first available TLS entry and
copies the chosen GDT index back to user-space
(2) set_thread_area(6 <= entry_number && entry_number <= 8):
allocates and sets up the indicated GDT entry
(3) get_thread_area(6 <= entry_number && entry_number <= 8):
retrieves the contents of the indicated GDT entry
Only (1) works in x86-64's ia32 emulation, the other two fail
with EINVAL because x86-64 only accepts GDT indices 12 to 14
for TLS entries. glibc only uses (1).
If you move the i386 TLS GDT entries to other indices then you
break (2) and (3) also on i386.
It's not difficult to design a better i386 TLS API that avoids
requiring user-space to know the actual GDT indices (just use
logical TLS indices and always copy the GDT index to user-space),
but unfortunately that doesn't help us now because the TLS GDT
indices must remain fixed as long as the current API is supported.
I _personally_ could certainly handle a post-2.6.18 kernel where
the improved API (new syscalls) is in place, the GDT indices have
been moved, and consequently components (2) and (3) of the old API
are broken. However, this still implies breaking binary compatibility,
which is not something to be done lightly.
(What's _really_ sad is that the implementation of the i386 TLS API
internally operates on logical TLS indices, it's just the syscall
interface that insists on requiring actual GDT indices from user-space.)
/Mikael
Mikael Pettersson wrote:
> The i386 TLS API has three components:
>
> (1) set_thread_area(entry_number == -1):
> allocates and sets up the first available TLS entry and
> copies the chosen GDT index back to user-space
> (2) set_thread_area(6 <= entry_number && entry_number <= 8):
> allocates and sets up the indicated GDT entry
> (3) get_thread_area(6 <= entry_number && entry_number <= 8):
> retrieves the contents of the indicated GDT entry
>
> Only (1) works in x86-64's ia32 emulation, the other two fail
> with EINVAL because x86-64 only accepts GDT indices 12 to 14
> for TLS entries. glibc only uses (1).
>
> If you move the i386 TLS GDT entries to other indices then you
> break (2) and (3) also on i386.
>
(2) and (3) are always OK if you pass them the result of (1) - i.e. to
update or read back a previously allocated descriptor. Neither is useful
without having done (1) first. The fact that 32-on-32 and 32-on-64
differ here means that nothing can (and apparently nothing does) depend
on hardcoded knowledge of the TLS descriptor indices anyway.
> It's not difficult to design a better i386 TLS API that avoids
> requiring user-space to know the actual GDT indices (just use
> logical TLS indices and always copy the GDT index to user-space),
> but unfortunately that doesn't help us
>
You still need the real indices to construct a selector to put into a
segment register - i.e., to actually do something useful. Changing the API
to use abstract "TLS indices" would also require a call to return the
"TLS base", which hardly seems like an improvement.
Also, there's no inherent reason why the TLS indices should be
contiguous; it happens to be true, but there's nothing useful userspace
can do with that knowledge. Allowing them to be discontiguous may be
helpful, for example, in packing the most-used TLS entries (i.e. #1) into
a hot cache line, while putting the lesser-used ones elsewhere. The
current API could deal with this without needing to change.
J
Jeremy Fitzhardinge writes:
> Mikael Pettersson wrote:
> > The i386 TLS API has three components:
> >
> > (1) set_thread_area(entry_number == -1):
> > allocates and sets up the first available TLS entry and
> > copies the chosen GDT index back to user-space
> > (2) set_thread_area(6 <= entry_number && entry_number <= 8):
> > allocates and sets up the indicated GDT entry
> > (3) get_thread_area(6 <= entry_number && entry_number <= 8):
> > retrieves the contents of the indicated GDT entry
> >
> > Only (1) works in x86-64's ia32 emulation, the other two fail
> > with EINVAL because x86-64 only accepts GDT indices 12 to 14
> > for TLS entries. glibc only uses (1).
> >
> > If you move the i386 TLS GDT entries to other indices then you
> > break (2) and (3) also on i386.
> >
>
> (2) and (3) are always OK if you pass it the result of (1) - ie to
> update or readback a previously allocated descriptor. Neither is useful
> without having done (1) first.
In the real world a process' state is influenced by code I have little
control over, usually glibc and other libraries, and fork(). Using (3)
I can inspect parts of my process' state that I did not initialize myself.
> The fact that 32-on-32 and 32-on-64
> differ here means that nothing can (and apparently nothing does) depend
> on hardcoded knowledge of the TLS descriptor indices anyway.
No, it means that x86-64's ia32 emulation was implemented by someone
who either didn't realize the difference, or didn't care (because
"only glibc matters").
> > It's not difficult to design a better i386 TLS API that avoids
> > requiring user-space to know the actual GDT indices (just use
> > logical TLS indices and always copy the GDT index to user-space),
> > but unfortunately that doesn't help us
> >
>
> You still need the real indicies to construct a selector to put into a
> segment register - ie, actually do something useful.
Sure.
> Changing the API
> to use abstract "TLS indicies" would also require a call to return the
> "TLS base", which hardly seems like an improvement.
The TLS base can obviously be zero.
User-space asks to access TLS #n (for allocs #n can be -1).
The kernel maps that to GDT index #m.
The kernel stores #m in the user-space buffer.
User-space maps #m to a selector.
> Also, there's no inherent reason why the TLS indicies should be
> contigious; it happens to be true, but there's nothing useful userspace
> can do with that knowledge. Allowing them to be discontigious may be
> helpful, for example, in packing the most used TLS entries (ie #1) into
> a hot cache line, while putting the lesser-used ones elsewhere. The
> current API could deal with this without needing to change.
I have said nothing that would prevent the use of sparse TLS GDT indices.
Look, I'm not saying the current API is perfect, far from it. But it does
have valid usage modes which are broken in x86-64's ia32 emulation, and
will break on i386 if you reallocate the TLS GDT indices. This is a fact.
This is why I'm asking that if you change things (thus breaking binary
compatibility even more), that a corrected API be placed in new syscalls.
That is, instead of forcing user-space to do

    uname
    if (version >= 2.6.N)
        call {set,get}_thread_area with new-style parameters
    else
        call {set,get}_thread_area with old-style parameters

it should do

    call new_{set,get}_thread_area with new-style parameters
    if (ENOSYS)
        call old_{set,get}_thread_area with old-style parameters
/Mikael
Mikael Pettersson wrote:
> > Changing the API
> > to use abstract "TLS indicies" would also require a call to return the
> > "TLS base", which hardly seems like an improvement.
>
> The TLS base can obviously be zero.
>
> User-space asks to access TLS #n (for allocs #n can be -1).
> The kernel maps that to GDT index #m.
> The kernel stores #m in the user-space buffer.
> User-space maps #m to a selector.
>
I'm missing why this is a substantial improvement over the current
interface (or functionally different at all). What does this proposal
let you do that the current one doesn't?
> Look, I'm not saying the current API is perfect, far from it. But it does
> have valid usage modes which are broken in x86-64's ia32 emulation, and
> will break on i386 if you reallocate the TLS GDT indices. This is a fact.
>
Hm, well it's a "fact" in that they use different segment descriptors,
but you'd be hard pressed to say that was a breakage. set_thread_area
was added in 2.5.29 (Jul 2002), and x86-64 added support in 2.5.43 (Oct
2002), so the current behaviour is pretty much as it has always been.
If you have a program that expects something different, you either wrote
it in Jul-Oct 2002, or you made an unsustainable assumption about how
set_thread_area() works.
> Look, I'm not saying the current API is perfect, far from it. But it does
> have valid usage modes which are broken in x86-64's ia32 emulation, and
> will break on i386 if you reallocate the TLS GDT indices. This is a fact.
>
You seem to have a specific use-case in mind; do you have a program
which would like to use a new interface? Would you mind spelling it
out, and describe why the current interface doesn't work for you?
J