Hello,
I would like to know why the ARCH_SET_GS action of sys_arch_prctl, write the
MSR MSR_KERNEL_GS_BASE and not the MSR MSR_GS_BASE when the variable "doit"
equals 1? Is that a bug?
In other words, why the following code :
...
if (doit) {
load_gs_index(0);
ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr)
}
...
is not the following one :
...
if (doit) {
ret = checking_wrmsrl(MSR_GS_BASE, addr)
load_gs_index(0);
}
...
I copy for clarity the beginning of the function "do_arch_prctl" :
long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
{
int ret = 0;
int doit = task == current;
int cpu;
switch (code) {
case ARCH_SET_GS:
if (addr >= TASK_SIZE_OF(task))
return -EPERM;
cpu = get_cpu();
/* handle small bases via the GDT because that's faster to
switch. */
if (addr <= 0xffffffff) {
set_32bit_tls(task, GS_TLS, addr);
if (doit) {
load_TLS(&task->thread, cpu);
load_gs_index(GS_TLS_SEL);
}
task->thread.gsindex = GS_TLS_SEL;
task->thread.gs = 0;
} else {
task->thread.gsindex = 0;
task->thread.gs = addr;
if (doit) {
load_gs_index(0);
ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
}
}
put_cpu();
break;
[...]
Regards,
Eric
On Tue, 18 Nov 2008 15:33:32 +0100
Eric Lacombe <[email protected]> wrote:
> Hello,
>
> I would like to know why the ARCH_SET_GS action of sys_arch_prctl,
> write the MSR MSR_KERNEL_GS_BASE and not the MSR MSR_GS_BASE when the
> variable "doit" equals 1? Is that a bug?
>
I don't think it is.
The trick is that we use "swapgs" on entering/leaving the kernel, and
that will "swap" gs with the MSR, so when we return to userspace, GS
gets loaded from the MSR_KERNEL_GS_BASE ...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Le mardi 18 novembre 2008 15:45:56, vous avez ?crit?:
> On Tue, 18 Nov 2008 15:33:32 +0100
>
> Eric Lacombe <[email protected]> wrote:
> > Hello,
> >
> > I would like to know why the ARCH_SET_GS action of sys_arch_prctl,
> > write the MSR MSR_KERNEL_GS_BASE and not the MSR MSR_GS_BASE when the
> > variable "doit" equals 1? Is that a bug?
>
> I don't think it is.
> The trick is that we use "swapgs" on entering/leaving the kernel, and
> that will "swap" gs with the MSR, so when we return to userspace, GS
> gets loaded from the MSR_KERNEL_GS_BASE ...
Yeah when we enter the kernel swapgs is used, so the MSR_GS_BASE is switched
with the MSR_KERNEL_GS_BASE.
In fact, what I certainly misunderstand is why load_gs_index use swapgs
inside.
>From that function, I trust that only when gs is loaded, its hidden part is
loaded with the MSR_GS_BASE.
ENTRY(native_load_gs_index)
CFI_STARTPROC
pushf
CFI_ADJUST_CFA_OFFSET 8
DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
SWAPGS
gs_change:
movl %edi,%gs
2: mfence /* workaround */
SWAPGS
popf
CFI_ADJUST_CFA_OFFSET -8
ret
CFI_ENDPROC
ENDPROC(native_load_gs_index)
Regards,
Eric
In fact, if what I thought from that function was ok, we will have in fact the
order of the two following lines inversed :
...
load_gs_index(0);
ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
...
So that we would have :
...
ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
load_gs_index(0);
...
Regards,
Eric
Le mardi 18 novembre 2008 18:20:03 Eric Lacombe, vous avez ?crit?:
> Le mardi 18 novembre 2008 15:45:56, vous avez ?crit?:
> > On Tue, 18 Nov 2008 15:33:32 +0100
> >
> > Eric Lacombe <[email protected]> wrote:
> > > Hello,
> > >
> > > I would like to know why the ARCH_SET_GS action of sys_arch_prctl,
> > > write the MSR MSR_KERNEL_GS_BASE and not the MSR MSR_GS_BASE when the
> > > variable "doit" equals 1? Is that a bug?
> >
> > I don't think it is.
> > The trick is that we use "swapgs" on entering/leaving the kernel, and
> > that will "swap" gs with the MSR, so when we return to userspace, GS
> > gets loaded from the MSR_KERNEL_GS_BASE ...
>
> Yeah when we enter the kernel swapgs is used, so the MSR_GS_BASE is
> switched with the MSR_KERNEL_GS_BASE.
>
> In fact, what I certainly misunderstand is why load_gs_index use swapgs
> inside.
> From that function, I trust that only when gs is loaded, its hidden part is
> loaded with the MSR_GS_BASE.
>
> ENTRY(native_load_gs_index)
> CFI_STARTPROC
> pushf
> CFI_ADJUST_CFA_OFFSET 8
> DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
> SWAPGS
> gs_change:
> movl %edi,%gs
> 2: mfence /* workaround */
> SWAPGS
> popf
> CFI_ADJUST_CFA_OFFSET -8
> ret
> CFI_ENDPROC
> ENDPROC(native_load_gs_index)
>
> Regards,
>
> Eric
I look at the Intel docs (vol. 3A) again, and see that in 64 bits mode the
hidden field gs.base are physically mapped to the MSR, so it seems that in
order to load gs.base we don't need to load gs (like in 32 bits mode), but
rather we only need to load the MSR.
So I don't understand the purpose of load_gs_index in that context :
if (doit) {
load_gs_index(0);
ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
}
Why don't we only load the MSR ?
What is the purpose of calling load_gs_index with 0 as parameter ?
Thanks in advance for your response,
Eric
> ENTRY(native_load_gs_index)
> CFI_STARTPROC
> pushf
> CFI_ADJUST_CFA_OFFSET 8
> DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI))
> SWAPGS
> gs_change:
> movl %edi,%gs
> 2: mfence /* workaround */
> SWAPGS
> popf
> CFI_ADJUST_CFA_OFFSET -8
> ret
> CFI_ENDPROC
> ENDPROC(native_load_gs_index)
Eric Lacombe wrote:
> I look at the Intel docs (vol. 3A) again, and see that in 64 bits mode the
> hidden field gs.base are physically mapped to the MSR, so it seems that in
> order to load gs.base we don't need to load gs (like in 32 bits mode), but
> rather we only need to load the MSR.
>
> So I don't understand the purpose of load_gs_index in that context :
>
> if (doit) {
> load_gs_index(0);
> ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
> }
>
> Why don't we only load the MSR ?
> What is the purpose of calling load_gs_index with 0 as parameter ?
>
Because %gs of 0 means "base too large, go to MSR". If you have a
32-bit base, then loading it into the gdt and loading %gs with the right
selector is faster. wrmsr/rdmsr are slow instructions.
J
Le mercredi 19 novembre 2008 02:07:23 Jeremy Fitzhardinge, vous avez ?crit?:
> Eric Lacombe wrote:
> > I look at the Intel docs (vol. 3A) again, and see that in 64 bits mode
> > the hidden field gs.base are physically mapped to the MSR, so it seems
> > that in order to load gs.base we don't need to load gs (like in 32 bits
> > mode), but rather we only need to load the MSR.
> >
> > So I don't understand the purpose of load_gs_index in that context :
> >
> > if (doit) {
> > load_gs_index(0);
> > ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
> > }
> >
> > Why don't we only load the MSR ?
> > What is the purpose of calling load_gs_index with 0 as parameter ?
>
> Because %gs of 0 means "base too large, go to MSR". If you have a
> 32-bit base, then loading it into the gdt and loading %gs with the right
> selector is faster. wrmsr/rdmsr are slow instructions.
Ok, thanks, so I suppose now that only doing :
asm volatile("movl %0,%%gs" :: "r" (0));
could corrupt the address of the PDA that resides actually in the MSR_GS_BASE.
And that's why load_gs_index is used as it contains "swapgs" before and after
the "mov to gs".
Is that correct?
Regards,
Eric
Eric Lacombe wrote:
> Ok, thanks, so I suppose now that only doing :
> asm volatile("movl %0,%%gs" :: "r" (0));
> could corrupt the address of the PDA that resides actually in the MSR_GS_BASE.
> And that's why load_gs_index is used as it contains "swapgs" before and after
> the "mov to gs".
>
> Is that correct?
>
Yes, loading a selector into a segment register will load the lower 32
bits of the base from the ldt/gdt into the msr and zero the rest.
J
Thanks for your answer, I've got one last question ;)
In the ARCH_GET_GS, can you explain the line 834 to 838?
In fact, at first sight I thought that just the line 836 was sufficient, but I
obviously miss the case where MSR_KERNEL_GS_BASE does not reflect the value
requested, hence my question.
828 case ARCH_GET_GS: {
829 unsigned long base;
830 unsigned gsindex;
831 if (task->thread.gsindex == GS_TLS_SEL)
832 base = read_32bit_tls(task, GS_TLS);
833 else if (doit) {
834 asm("movl %%gs,%0" : "=r" (gsindex));
835 if (gsindex)
836 rdmsrl(MSR_KERNEL_GS_BASE, base);
837 else
838 base = task->thread.gs;
839 }
840 else
841 base = task->thread.gs;
Thanks,
Eric
Eric Lacombe wrote:
> Thanks for your answer, I've got one last question ;)
> In the ARCH_GET_GS, can you explain the line 834 to 838?
>
> In fact, at first sight I thought that just the line 836 was sufficient, but I
> obviously miss the case where MSR_KERNEL_GS_BASE does not reflect the value
> requested, hence my question.
>
I think the rationale is that rdmsr is slow, so reading the value from
the task context is faster where possible.
It's not a very pretty part of the architecture :/
J
> 828 case ARCH_GET_GS: {
> 829 unsigned long base;
> 830 unsigned gsindex;
> 831 if (task->thread.gsindex == GS_TLS_SEL)
> 832 base = read_32bit_tls(task, GS_TLS);
> 833 else if (doit) {
> 834 asm("movl %%gs,%0" : "=r" (gsindex));
> 835 if (gsindex)
> 836 rdmsrl(MSR_KERNEL_GS_BASE, base);
> 837 else
> 838 base = task->thread.gs;
> 839 }
> 840 else
> 841 base = task->thread.gs;
>
> Thanks,
>
> Eric
>
>
>
Le jeudi 20 novembre 2008 01:07:42 Jeremy Fitzhardinge, vous avez ?crit?:
> Eric Lacombe wrote:
> > Thanks for your answer, I've got one last question ;)
> > In the ARCH_GET_GS, can you explain the line 834 to 838?
> >
> > In fact, at first sight I thought that just the line 836 was sufficient,
> > but I obviously miss the case where MSR_KERNEL_GS_BASE does not reflect
> > the value requested, hence my question.
>
> I think the rationale is that rdmsr is slow, so reading the value from
> the task context is faster where possible.
But in this case why not doing instead:
828 case ARCH_GET_GS: {
829 unsigned long base;
830 unsigned gsindex;
831 if (task->thread.gsindex == GS_TLS_SEL)
832 base = read_32bit_tls(task, GS_TLS);
840 else
841 base = task->thread.gs;
> > 828 case ARCH_GET_GS: {
> > 829 unsigned long base;
> > 830 unsigned gsindex;
> > 831 if (task->thread.gsindex == GS_TLS_SEL)
> > 832 base = read_32bit_tls(task, GS_TLS);
> > 833 else if (doit) {
> > 834 asm("movl %%gs,%0" : "=r" (gsindex));
> > 835 if (gsindex)
> > 836 rdmsrl(MSR_KERNEL_GS_BASE, base);
> > 837 else
> > 838 base = task->thread.gs;
> > 839 }
> > 840 else
> > 841 base = task->thread.gs;
and as I see with ARCH_GET_FS we have :
817 case ARCH_GET_FS: {
818 unsigned long base;
819 if (task->thread.fsindex == FS_TLS_SEL)
820 base = read_32bit_tls(task, FS_TLS);
821 else if (doit)
822 rdmsrl(MSR_FS_BASE, base);
823 else
824 base = task->thread.fs;
825 ret = put_user(base, (unsigned long __user *)addr);
826 break;
827 }
So it seems that the "rdmsrl(MSR_FS_BASE, base);" could be faster than an
access to the memory, else why bother with the "doit" case?
Regards,
Eric
Hello,
Does the "doit case" (line 822 in ARCH_GET_FS, function do_arch_prctl) exist
for performance reasons? Else, why "task->thread.fs" (line 824) does not
contain the fs base in the "doit case"?
Can someone explain _precisely_ the lines 835 through 838 (ARCH_GET_GS)?
(I thought that just the line 836 was sufficient, but I
obviously miss the case where MSR_KERNEL_GS_BASE does not reflect the value
requested)
Thanks again for all your previous answers.
Eric
828 case ARCH_GET_GS: {
829 unsigned long base;
830 unsigned gsindex;
831 if (task->thread.gsindex == GS_TLS_SEL)
832 base = read_32bit_tls(task, GS_TLS);
833 else if (doit) {
834 asm("movl %%gs,%0" : "=r" (gsindex));
835 if (gsindex)
836 rdmsrl(MSR_KERNEL_GS_BASE, base);
837 else
838 base = task->thread.gs;
839 }
840 else
841 base = task->thread.gs;
842 ret = put_user(base, (unsigned long __user *)addr);
843 break;
844 }
---
817 case ARCH_GET_FS: {
818 unsigned long base;
819 if (task->thread.fsindex == FS_TLS_SEL)
820 base = read_32bit_tls(task, FS_TLS);
821 else if (doit)
822 rdmsrl(MSR_FS_BASE, base);
823 else
824 base = task->thread.fs;
825 ret = put_user(base, (unsigned long __user *)addr);
826 break;
827 }
Eric Lacombe wrote:
> Hello,
>
> Does the "doit case" (line 822 in ARCH_GET_FS, function do_arch_prctl) exist
> for performance reasons? Else, why "task->thread.fs" (line 824) does not
> contain the fs base in the "doit case"?
>
"doit" gets set when you're operating on yourself. If you're operating
on another process, then you need to use their task structure values
rather than the current process's values. If you're doing it to
yourself, then the task structure may be out of date because its only
updated on a context switch.
> Can someone explain _precisely_ the lines 835 through 838 (ARCH_GET_GS)?
> (I thought that just the line 836 was sufficient, but I
> obviously miss the case where MSR_KERNEL_GS_BASE does not reflect the value
> requested)
>
gsindex and gs store the same information in two ways. gsindex is the
GDT selector number which contains the (32-bit) base address, and gs is
the raw 64-bit base address. If gsindex != 0 then it prevails,
otherwise gs contains the right value.
When you load %gs with a selector, the MSR is updated with the value
from the GDT. Rather than parsing the GDT entry manually to get the
encoded base address, the code in 836 fetches it out of the MSR. On the
other hand, if you're using a raw gs base (ie gsindex==0), then you can
simply read the base directly from the task structure. rdmsr would work
as well, but be less efficient.
J
> Thanks again for all your previous answers.
>
> Eric
>
> 828 case ARCH_GET_GS: {
> 829 unsigned long base;
> 830 unsigned gsindex;
> 831 if (task->thread.gsindex == GS_TLS_SEL)
> 832 base = read_32bit_tls(task, GS_TLS);
> 833 else if (doit) {
> 834 asm("movl %%gs,%0" : "=r" (gsindex));
> 835 if (gsindex)
> 836 rdmsrl(MSR_KERNEL_GS_BASE, base);
> 837 else
> 838 base = task->thread.gs;
> 839 }
> 840 else
> 841 base = task->thread.gs;
> 842 ret = put_user(base, (unsigned long __user *)addr);
> 843 break;
> 844 }
>
> ---
>
> 817 case ARCH_GET_FS: {
> 818 unsigned long base;
> 819 if (task->thread.fsindex == FS_TLS_SEL)
> 820 base = read_32bit_tls(task, FS_TLS);
> 821 else if (doit)
> 822 rdmsrl(MSR_FS_BASE, base);
> 823 else
> 824 base = task->thread.fs;
> 825 ret = put_user(base, (unsigned long __user *)addr);
> 826 break;
> 827 }
>
>
Le lundi 24 novembre 2008 19:22:18 Jeremy Fitzhardinge, vous avez ?crit?:
> Eric Lacombe wrote:
> > Hello,
> >
> > Does the "doit case" (line 822 in ARCH_GET_FS, function do_arch_prctl)
> > exist for performance reasons? Else, why "task->thread.fs" (line 824)
> > does not contain the fs base in the "doit case"?
>
> "doit" gets set when you're operating on yourself. If you're operating
> on another process, then you need to use their task structure values
> rather than the current process's values. If you're doing it to
> yourself, then the task structure may be out of date because its only
> updated on a context switch.
The task_struct is also updated in sys_arch_prctl (ARCH_SET_FS and
ARCH_SET_GS), so not just on a context switch.
How the task structure could be out of date wrt thread.gs and thread.fs?
What could be a typical scenario that could induced gs or fs to be modified and
not thread.gs and thread.fs?
Why we have a difference between ARCH_GET_GS :
> 833 else if (doit) {
> 834 asm("movl %%gs,%0" : "=r" (gsindex));
> 835 if (gsindex)
> 836 rdmsrl(MSR_KERNEL_GS_BASE, base);
> 837 else
> 838 base = task->thread.gs;
> 839 }
and ARCH_GET_FS :
> 821 else if (doit)
> 822 rdmsrl(MSR_FS_BASE, base);
If I follow what you say, why can't we have the same optimization with in
ARCH_GET_FS?
thanks,
Eric