FYI, we noticed the following commit:
commit: 69218e47994da614e7af600bf06887750ab6657a ("x86: Remap GDT tables in the fixmap section")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
in testcase: trinity
with following parameters:
runtime: 300s
test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/
on test machine: qemu-system-i386 -enable-kvm -smp 2 -m 320M
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+--------------------------------------------------------------+------------+------------+
| | f06bdd4001 | 69218e4799 |
+--------------------------------------------------------------+------------+------------+
| boot_successes | 3 | 0 |
| boot_failures | 5 | 8 |
| BUG:kernel_reboot-without-warning_in_boot_stage | 4 | |
| WARNING:at_arch/x86/include/asm/fpu/internal.h:#fpu__restore | 1 | |
| BUG:kernel_hang_in_boot_stage | 0 | 8 |
+--------------------------------------------------------------+------------+------------+
[ 0.000000] sysrq: sysrq always enabled.
[ 0.000000] PID hash table entries: 2048 (order: 1, 8192 bytes)
[ 0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[ 0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[ 0.000000] Initializing CPU#0
[ 0.000000] allocated 331644 bytes of page_ext
[ 0.000000] Initializing HighMem for node 0 (00000000:00000000)
[ 0.000000] Memory: 267332K/327160K available (14061K kernel code, 1104K rwdata, 4696K rodata, 2460K init, 13208K bss, 59828K reserved, 0K cma-reserved, 0K highmem)
[ 0.000000] virtual kernel memory layout:
[ 0.000000] fixmap : 0xffa74000 - 0xfffff000 (5676 kB)
[ 0.000000] pkmap : 0xff400000 - 0xff800000 (4096 kB)
[ 0.000000] vmalloc : 0x547e0000 - 0xff3fe000 (2732 MB)
[ 0.000000] lowmem : 0x40000000 - 0x53fe0000 ( 319 MB)
[ 0.000000] .init : 0x4236a000 - 0x425d1000 (2460 kB)
[ 0.000000] .data : 0x41dbb72c - 0x42368300 (5810 kB)
[ 0.000000] .text : 0x41000000 - 0x41dbb72c (14061 kB)
Elapsed time: 480
BUG: kernel hang in boot stage
initrds=(
/osimage/yocto/yocto-tiny-i386-2016-04-22.cgz
/lkp/scheduled/vm-lkp-hsw01-yocto-i386-25/trinity-300s-yocto-tiny-i386-2016-04-22.cgz-69218e47994da614e7af600bf06887750ab6657a-20170319-117821-bls48g-0.cgz
/lkp/lkp/lkp-i386.cgz
To reproduce:
git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Xiaolong
This error happens even with Andy TLS fix on 32-bit (GDT is on fixmap
but not readonly). I am looking into it.
KVM internal error. Suberror: 3
extra data[0]: 80000b0e
extra data[1]: 31
EAX=00000001 EBX=9f9121f3 ECX=4330b100 EDX=fffff000
ESI=547e0000 EDI=ffa74000 EBP=42273ef8 ESP=42273ef8
EIP=4105a0a2 EFL=00210002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00c0f300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0068 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00c0f300 DPL=3 DS [-WA]
FS =00d8 1232d000 ffffffff 00809300 DPL=0 DS16 [-WA]
GS =00e0 5492d4c0 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 ffffffff 00c00000
TR =0080 5492b340 0000206b 00008b00 DPL=0 TSS32-busy
GDT= ffa94000 000000ff
IDT= fffba000 000007ff
CR0=80050033 CR2=ff874000 CR3=02610000 CR4=00000690
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=f0 d7 00 b8 01 00 00 00 8b 15 ec 48 29 42 8a 8a 00 50 87 ff <88>
8a 00 50 87 ff 31 c0 5d c3 55 89 e5 57 56 53 e8 8d f0 d7 00 89 c6 a1
00 ff dd 42 05 00
On Mon, Mar 20, 2017 at 9:57 PM, kernel test robot
<[email protected]> wrote:
>
> FYI, we noticed the following commit:
>
> commit: 69218e47994da614e7af600bf06887750ab6657a ("x86: Remap GDT tables in the fixmap section")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: trinity
> with following parameters:
>
> runtime: 300s
>
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
>
>
> on test machine: qemu-system-i386 -enable-kvm -smp 2 -m 320M
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +--------------------------------------------------------------+------------+------------+
> | | f06bdd4001 | 69218e4799 |
> +--------------------------------------------------------------+------------+------------+
> | boot_successes | 3 | 0 |
> | boot_failures | 5 | 8 |
> | BUG:kernel_reboot-without-warning_in_boot_stage | 4 | |
> | WARNING:at_arch/x86/include/asm/fpu/internal.h:#fpu__restore | 1 | |
> | BUG:kernel_hang_in_boot_stage | 0 | 8 |
> +--------------------------------------------------------------+------------+------------+
>
> [ 0.000000] sysrq: sysrq always enabled.
> [ 0.000000] PID hash table entries: 2048 (order: 1, 8192 bytes)
> [ 0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
> [ 0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
> [ 0.000000] Initializing CPU#0
> [ 0.000000] allocated 331644 bytes of page_ext
> [ 0.000000] Initializing HighMem for node 0 (00000000:00000000)
> [ 0.000000] Memory: 267332K/327160K available (14061K kernel code, 1104K rwdata, 4696K rodata, 2460K init, 13208K bss, 59828K reserved, 0K cma-reserved, 0K highmem)
> [ 0.000000] virtual kernel memory layout:
> [ 0.000000] fixmap : 0xffa74000 - 0xfffff000 (5676 kB)
> [ 0.000000] pkmap : 0xff400000 - 0xff800000 (4096 kB)
> [ 0.000000] vmalloc : 0x547e0000 - 0xff3fe000 (2732 MB)
> [ 0.000000] lowmem : 0x40000000 - 0x53fe0000 ( 319 MB)
> [ 0.000000] .init : 0x4236a000 - 0x425d1000 (2460 kB)
> [ 0.000000] .data : 0x41dbb72c - 0x42368300 (5810 kB)
> [ 0.000000] .text : 0x41000000 - 0x41dbb72c (14061 kB)
>
>
> Elapsed time: 480
> BUG: kernel hang in boot stage
>
> initrds=(
> /osimage/yocto/yocto-tiny-i386-2016-04-22.cgz
> /lkp/scheduled/vm-lkp-hsw01-yocto-i386-25/trinity-300s-yocto-tiny-i386-2016-04-22.cgz-69218e47994da614e7af600bf06887750ab6657a-20170319-117821-bls48g-0.cgz
> /lkp/lkp/lkp-i386.cgz
>
>
> To reproduce:
>
> git clone https://github.com/01org/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>
>
>
> Thanks,
> Xiaolong
--
Thomas
On Tue, Mar 21, 2017 at 11:16 AM, Thomas Garnier <[email protected]> wrote:
> This error happens even with Andy TLS fix on 32-bit (GDT is on fixmap
> but not readonly). I am looking into it.
>
> KVM internal error. Suberror: 3
> extra data[0]: 80000b0e
> extra data[1]: 31
If I read that right, it's extra data[1] 0x31, which EXIT_REASON_EPT_MISCONFIG.
I'm not seeing how the A bit in a GDT entry could have anything to do
with it. I'm assuming it happens even without Andy's patch?
Linus
On Tue, Mar 21, 2017 at 12:20 PM, Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Mar 21, 2017 at 11:16 AM, Thomas Garnier <[email protected]> wrote:
> > This error happens even with Andy TLS fix on 32-bit (GDT is on fixmap
> > but not readonly). I am looking into it.
> >
> > KVM internal error. Suberror: 3
> > extra data[0]: 80000b0e
> > extra data[1]: 31
>
> If I read that right, it's extra data[1] 0x31, which EXIT_REASON_EPT_MISCONFIG.
>
> I'm not seeing how the A bit in a GDT entry could have anything to do
> with it. I'm assuming it happens even without Andy's patch?
Correct.
>
> Linus
--
Thomas
The issue seems to be related to exceptions happening in close pages
to the fixmap GDT remapping.
The original page fault happen in do_test_wp_bit which set a fixmap
entry to test WP flag. If I grow the number of processors supported
increasing the distance between the remapped GDT page and the WP test
page, the error does not reproduce.
I am still looking at the exact distance between repro and no-repro as
well as the exact root cause.
On Tue, Mar 21, 2017 at 12:23 PM, Thomas Garnier <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 12:20 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> On Tue, Mar 21, 2017 at 11:16 AM, Thomas Garnier <[email protected]> wrote:
>> > This error happens even with Andy TLS fix on 32-bit (GDT is on fixmap
>> > but not readonly). I am looking into it.
>> >
>> > KVM internal error. Suberror: 3
>> > extra data[0]: 80000b0e
>> > extra data[1]: 31
>>
>> If I read that right, it's extra data[1] 0x31, which EXIT_REASON_EPT_MISCONFIG.
>>
>> I'm not seeing how the A bit in a GDT entry could have anything to do
>> with it. I'm assuming it happens even without Andy's patch?
>
> Correct.
>
>>
>> Linus
>
>
>
>
> --
> Thomas
--
Thomas
On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
> The issue seems to be related to exceptions happening in close pages
> to the fixmap GDT remapping.
>
> The original page fault happen in do_test_wp_bit which set a fixmap
> entry to test WP flag. If I grow the number of processors supported
> increasing the distance between the remapped GDT page and the WP test
> page, the error does not reproduce.
>
> I am still looking at the exact distance between repro and no-repro as
> well as the exact root cause.
Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
cover 8k entries, which at 8 bytes each would be 64kB.
So somebody trying to load an invalid segment (say, 0xffff) might end
up causing an access to the GDT base + 64k - 8.
It is also possible that the CPU might do a page table writability
check *before* it does the limit check. That would sound odd, though.
Might be a CPU errata.
Linus
On Tue, Mar 21, 2017 at 2:11 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
>> The issue seems to be related to exceptions happening in close pages
>> to the fixmap GDT remapping.
>>
>> The original page fault happen in do_test_wp_bit which set a fixmap
>> entry to test WP flag. If I grow the number of processors supported
>> increasing the distance between the remapped GDT page and the WP test
>> page, the error does not reproduce.
>>
>> I am still looking at the exact distance between repro and no-repro as
>> well as the exact root cause.
>
> Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
> cover 8k entries, which at 8 bytes each would be 64kB.
The QEMU barf says the GDT limit is 0xff, for better or for worse.
>
> So somebody trying to load an invalid segment (say, 0xffff) might end
> up causing an access to the GDT base + 64k - 8.
>
> It is also possible that the CPU might do a page table writability
> check *before* it does the limit check. That would sound odd, though.
> Might be a CPU errata.
>
I added a global TLB flush right after __set_fixmap(), with no effect.
I instrumented the code a bit and I see:
[ 0.000000] Checking if this processor honours the WP bit even in
supervisor mode...
[ 0.000000] Will do WP test: PA 258b000 VA ff874000 GDTRW 547e0000
GDTRO ffa94000
KVM internal error. Suberror: 3
extra data[0]: 80000b0e
extra data[1]: 31
EAX=00000001 EBX=cbb13bc3 ECX=00000000 EDX=fffff000
ESI=547e0000 EDI=ffa94000 EBP=42201f4c ESP=42201f4c
EIP=4105819d EFL=00210006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00c0f300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0068 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00c0f300 DPL=3 DS [-WA]
FS =00d8 123b2000 ffffffff 00809300 DPL=0 DS16 [-WA]
GS =00e0 5492d300 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 ffffffff 00000000
TR =0080 5492b180 0000206b 00008b00 DPL=0 TSS32-busy
GDT= ffa94000 000000ff
IDT= fffba000 000007ff
CR0=80050033 CR2=ff874000 CR3=0258b000 CR4=00040690
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=58 d1 00 b8 01 00 00 00 8b 15 ac 13 22 42 8a 8a 00 50 87 ff <88>
8a 00 50 87 ff 31 c0 5d c3 90 90 90 90 90 90 90 90 90 55 2d 84 02 00
00 89 e5 e8 c3 05
The faulting instruction is, as expected:
e: 8a 8a 00 50 87 ff mov -0x78b000(%rdx),%cl
14:* 88 8a 00 50 87 ff mov %cl,-0x78b000(%rdx)
<-- trapping instruction
CR2 is what we expect. It would be nice to see the GPA and GLA for
the EPT misconfiguration, but KVM doesn't appear to show it.
I doubt we're looking at an erratum here. QEMU TCG triple-faults:
[ 0.000000] Will do WP test: PA 258b000 VA ff874000 GDTRW 547e0000
GDTRO ffa94000
check_exception old: 0xffffffff new 0xe [#PF]
0: v=0e e=0003 i=0 cpl=0 IP=0060:000000004105819d
pc=000000004105819d SP=0068:0000000042201f4c CR2=00000000ff874000
EAX=00000001 EBX=88eed8df ECX=00000000 EDX=fffff000
ESI=547e0000 EDI=ffa94000 EBP=42201f4c ESP=42201f4c
EIP=4105819d EFL=00200006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 123b2000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 5492d300 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 5492b180 0000206b 00008900 DPL=0 TSS32-avl
GDT= ffa94000 000000ff
IDT= fffba000 000007ff
CR0=80050033 CR2=ff874000 CR3=0258b000 CR4=00000690
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000004 CCD=42201f3c CCO=ADDL
EFER=0000000000000000
check_exception old: 0xe new 0xd [#GP]
1: v=08 e=0000 i=0 cpl=0 IP=0060:000000004105819d
pc=000000004105819d SP=0068:0000000042201f4c
env->regs[R_EAX]=0000000000000001
EAX=00000001 EBX=88eed8df ECX=00000000 EDX=fffff000
ESI=547e0000 EDI=ffa94000 EBP=42201f4c ESP=42201f4c
EIP=4105819d EFL=00200006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
CS =0060 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0068 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =007b 00000000 ffffffff 00cff300 DPL=3 DS [-WA]
FS =00d8 123b2000 ffffffff 008f9300 DPL=0 DS16 [-WA]
GS =00e0 5492d300 00000018 00409100 DPL=0 DS [--A]
LDT=0000 00000000 00000000 00008200 DPL=0 LDT
TR =0080 5492b180 0000206b 00008900 DPL=0 TSS32-avl
GDT= ffa94000 000000ff
IDT= fffba000 000007ff
CR0=80050033 CR2=ff874000 CR3=0258b000 CR4=00000690
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000004 CCD=42201f3c CCO=ADDL
EFER=0000000000000000
check_exception old: 0x8 new 0xd
Triple fault
There's presumably something genuinely wrong with our GDT.
On Tue, Mar 21, 2017 at 3:32 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 2:11 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
>>> The issue seems to be related to exceptions happening in close pages
>>> to the fixmap GDT remapping.
>>>
>>> The original page fault happen in do_test_wp_bit which set a fixmap
>>> entry to test WP flag. If I grow the number of processors supported
>>> increasing the distance between the remapped GDT page and the WP test
>>> page, the error does not reproduce.
>>>
>>> I am still looking at the exact distance between repro and no-repro as
>>> well as the exact root cause.
>>
>> Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
>> cover 8k entries, which at 8 bytes each would be 64kB.
>
> The QEMU barf says the GDT limit is 0xff, for better or for worse.
>
>>
>> So somebody trying to load an invalid segment (say, 0xffff) might end
>> up causing an access to the GDT base + 64k - 8.
>>
>> It is also possible that the CPU might do a page table writability
>> check *before* it does the limit check. That would sound odd, though.
>> Might be a CPU errata.
>>
>
> There's presumably something genuinely wrong with our GDT.
This is suspicious. I added this code in test_wp_bit:
if (memcmp(get_current_gdt_ro(), get_current_gdt_rw(), 4096) != 0) {
pr_err("Oh crap\n");
BUG_ON(1);
}
It printed "Oh crap" and blew up. Methinks something's wrong with the
fixmap. Is it possible that we're crossing a PMD boundary and failing
to translate the addresses right?
On Tue, Mar 21, 2017 at 4:51 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 3:32 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Mar 21, 2017 at 2:11 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
>>>> The issue seems to be related to exceptions happening in close pages
>>>> to the fixmap GDT remapping.
>>>>
>>>> The original page fault happen in do_test_wp_bit which set a fixmap
>>>> entry to test WP flag. If I grow the number of processors supported
>>>> increasing the distance between the remapped GDT page and the WP test
>>>> page, the error does not reproduce.
>>>>
>>>> I am still looking at the exact distance between repro and no-repro as
>>>> well as the exact root cause.
>>>
>>> Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
>>> cover 8k entries, which at 8 bytes each would be 64kB.
>>
>> The QEMU barf says the GDT limit is 0xff, for better or for worse.
>>
>>>
>>> So somebody trying to load an invalid segment (say, 0xffff) might end
>>> up causing an access to the GDT base + 64k - 8.
>>>
>>> It is also possible that the CPU might do a page table writability
>>> check *before* it does the limit check. That would sound odd, though.
>>> Might be a CPU errata.
>>>
>>
>
>> There's presumably something genuinely wrong with our GDT.
>
> This is suspicious. I added this code in test_wp_bit:
>
> if (memcmp(get_current_gdt_ro(), get_current_gdt_rw(), 4096) != 0) {
> pr_err("Oh crap\n");
> BUG_ON(1);
> }
>
> It printed "Oh crap" and blew up. Methinks something's wrong with the
> fixmap. Is it possible that we're crossing a PMD boundary and failing
> to translate the addresses right?
I might be that. We crash when the PKMAP_BASE is just after the FIX_WP_TEST.
I will continue testing couple scenarios and design a fix. Moving the
GDT FIXMAP at the beginning or align the base (or pad the end).
--
Thomas
On Tue, Mar 21, 2017 at 5:41 PM, Thomas Garnier <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 4:51 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Mar 21, 2017 at 3:32 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Tue, Mar 21, 2017 at 2:11 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>> On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
>>>>> The issue seems to be related to exceptions happening in close pages
>>>>> to the fixmap GDT remapping.
>>>>>
>>>>> The original page fault happen in do_test_wp_bit which set a fixmap
>>>>> entry to test WP flag. If I grow the number of processors supported
>>>>> increasing the distance between the remapped GDT page and the WP test
>>>>> page, the error does not reproduce.
>>>>>
>>>>> I am still looking at the exact distance between repro and no-repro as
>>>>> well as the exact root cause.
>>>>
>>>> Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
>>>> cover 8k entries, which at 8 bytes each would be 64kB.
>>>
>>> The QEMU barf says the GDT limit is 0xff, for better or for worse.
>>>
>>>>
>>>> So somebody trying to load an invalid segment (say, 0xffff) might end
>>>> up causing an access to the GDT base + 64k - 8.
>>>>
>>>> It is also possible that the CPU might do a page table writability
>>>> check *before* it does the limit check. That would sound odd, though.
>>>> Might be a CPU errata.
>>>>
>>>
>>
>>> There's presumably something genuinely wrong with our GDT.
>>
>> This is suspicious. I added this code in test_wp_bit:
>>
>> if (memcmp(get_current_gdt_ro(), get_current_gdt_rw(), 4096) != 0) {
>> pr_err("Oh crap\n");
>> BUG_ON(1);
>> }
>>
>> It printed "Oh crap" and blew up. Methinks something's wrong with the
>> fixmap. Is it possible that we're crossing a PMD boundary and failing
>> to translate the addresses right?
>
> I might be that. We crash when the PKMAP_BASE is just after the FIX_WP_TEST.
>
> I will continue testing couple scenarios and design a fix. Moving the
> GDT FIXMAP at the beginning or align the base (or pad the end).
>
Talk about barking up the wrong tree...
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f8e22dbad86c..c564f62c7a8d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -462,7 +464,8 @@ pgprot_t pg_fixmap_gdt_flags = PAGE_KERNEL;
static inline void setup_fixmap_gdt(int cpu)
{
__set_fixmap(get_cpu_gdt_ro_index(cpu),
- __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
+ slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
+ pg_fixmap_gdt_flags);
}
/* Load the original GDT from the per-cpu structure */
This makes UP boot for me, but SMP (2 cpus) is still busted.
On Tue, Mar 21, 2017 at 9:27 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Mar 21, 2017 at 5:41 PM, Thomas Garnier <[email protected]> wrote:
>> On Tue, Mar 21, 2017 at 4:51 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Tue, Mar 21, 2017 at 3:32 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Tue, Mar 21, 2017 at 2:11 PM, Linus Torvalds
>>>> <[email protected]> wrote:
>>>>> On Tue, Mar 21, 2017 at 1:25 PM, Thomas Garnier <[email protected]> wrote:
>>>>>> The issue seems to be related to exceptions happening in close pages
>>>>>> to the fixmap GDT remapping.
>>>>>>
>>>>>> The original page fault happen in do_test_wp_bit which set a fixmap
>>>>>> entry to test WP flag. If I grow the number of processors supported
>>>>>> increasing the distance between the remapped GDT page and the WP test
>>>>>> page, the error does not reproduce.
>>>>>>
>>>>>> I am still looking at the exact distance between repro and no-repro as
>>>>>> well as the exact root cause.
>>>>>
>>>>> Hmm. Have we set the GDT limit incorrectly, somehow? The GDT *can*
>>>>> cover 8k entries, which at 8 bytes each would be 64kB.
>>>>
>>>> The QEMU barf says the GDT limit is 0xff, for better or for worse.
>>>>
>>>>>
>>>>> So somebody trying to load an invalid segment (say, 0xffff) might end
>>>>> up causing an access to the GDT base + 64k - 8.
>>>>>
>>>>> It is also possible that the CPU might do a page table writability
>>>>> check *before* it does the limit check. That would sound odd, though.
>>>>> Might be a CPU errata.
>>>>>
>>>>
>>>
>>>> There's presumably something genuinely wrong with our GDT.
>>>
>>> This is suspicious. I added this code in test_wp_bit:
>>>
>>> if (memcmp(get_current_gdt_ro(), get_current_gdt_rw(), 4096) != 0) {
>>> pr_err("Oh crap\n");
>>> BUG_ON(1);
>>> }
>>>
>>> It printed "Oh crap" and blew up. Methinks something's wrong with the
>>> fixmap. Is it possible that we're crossing a PMD boundary and failing
>>> to translate the addresses right?
>>
>> I might be that. We crash when the PKMAP_BASE is just after the FIX_WP_TEST.
>>
>> I will continue testing couple scenarios and design a fix. Moving the
>> GDT FIXMAP at the beginning or align the base (or pad the end).
>>
>
> Talk about barking up the wrong tree...
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index f8e22dbad86c..c564f62c7a8d 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -462,7 +464,8 @@ pgprot_t pg_fixmap_gdt_flags = PAGE_KERNEL;
> static inline void setup_fixmap_gdt(int cpu)
> {
> __set_fixmap(get_cpu_gdt_ro_index(cpu),
> - __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
> + slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
> + pg_fixmap_gdt_flags);
> }
>
> /* Load the original GDT from the per-cpu structure */
>
> This makes UP boot for me, but SMP (2 cpus) is still busted.
This change fixed boot for me:
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index b65155cc3760..4e30707d9f9a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,7 +104,9 @@ enum fixed_addresses {
FIX_GDT_REMAP_BEGIN,
FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
- __end_of_permanent_fixed_addresses,
+ __end_of_permanent_fixed_addresses =
+ (FIX_GDT_REMAP_END + PTRS_PER_PTE - 1) &
+ -PTRS_PER_PTE,
Just ensure PKMAP_BASE & FIX_WP_TEST are on a different PMD.
I don't think that the right fix but it might help understand the
exact root cause.
--
Thomas
* Thomas Garnier <[email protected]> wrote:
> > static inline void setup_fixmap_gdt(int cpu)
> > {
> > __set_fixmap(get_cpu_gdt_ro_index(cpu),
> > - __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
> > + slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
> > + pg_fixmap_gdt_flags);
> > }
> >
> > /* Load the original GDT from the per-cpu structure */
> >
> > This makes UP boot for me, but SMP (2 cpus) is still busted.
>
> This change fixed boot for me:
>
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index b65155cc3760..4e30707d9f9a 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -104,7 +104,9 @@ enum fixed_addresses {
> FIX_GDT_REMAP_BEGIN,
> FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
>
> - __end_of_permanent_fixed_addresses,
> + __end_of_permanent_fixed_addresses =
> + (FIX_GDT_REMAP_END + PTRS_PER_PTE - 1) &
> + -PTRS_PER_PTE,
>
> Just ensure PKMAP_BASE & FIX_WP_TEST are on a different PMD.
>
> I don't think that the right fix but it might help understand the
> exact root cause.
Could this be related to the permission bits in the PMD itself getting out of sync
with the PTEs? WP test marks a page writable/unwritable, and maybe we mess up the
restoration. If they are on separate PMDs then this is worked around because the
fixmap GDT is on a separate PMD.
Thanks,
Ingo
On Wed, Mar 22, 2017 at 12:36 AM, Ingo Molnar <[email protected]> wrote:
>
> * Thomas Garnier <[email protected]> wrote:
>
>> > static inline void setup_fixmap_gdt(int cpu)
>> > {
>> > __set_fixmap(get_cpu_gdt_ro_index(cpu),
>> > - __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
>> > + slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
>> > + pg_fixmap_gdt_flags);
>> > }
>> >
>> > /* Load the original GDT from the per-cpu structure */
>> >
>> > This makes UP boot for me, but SMP (2 cpus) is still busted.
>>
>> This change fixed boot for me:
>>
>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>> index b65155cc3760..4e30707d9f9a 100644
>> --- a/arch/x86/include/asm/fixmap.h
>> +++ b/arch/x86/include/asm/fixmap.h
>> @@ -104,7 +104,9 @@ enum fixed_addresses {
>> FIX_GDT_REMAP_BEGIN,
>> FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
>>
>> - __end_of_permanent_fixed_addresses,
>> + __end_of_permanent_fixed_addresses =
>> + (FIX_GDT_REMAP_END + PTRS_PER_PTE - 1) &
>> + -PTRS_PER_PTE,
>>
>> Just ensure PKMAP_BASE & FIX_WP_TEST are on a different PMD.
>>
>> I don't think that the right fix but it might help understand the
>> exact root cause.
>
> Could this be related to the permission bits in the PMD itself getting out of sync
> with the PTEs? WP test marks a page writable/unwritable, and maybe we mess up the
> restoration. If they are on separate PMDs then this is worked around because the
> fixmap GDT is on a separate PMD.
>
I don't think so. I think it's a pair of bugs related to the way that
percpu areas are virtually mapped.
Bug 1: __pa is totally bogus on percpu pointers. Oddly, we have one
older instance of exactly the same bug (on the same GDT address) in
the kernel. I'll send a patch.
Bug 2: Nothing syncs a freshly-set-up CPU's percpu area into
initial_page_table. This makes access to the gdt fail in
startup_32_smp. This looks like a longstanding bug, and I don't see
what it has to do with Thomas' series. I'm still mulling over what to
do about it.
--Andy
On Wed, Mar 22, 2017 at 9:33 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Mar 22, 2017 at 12:36 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Thomas Garnier <[email protected]> wrote:
>>
>>> > static inline void setup_fixmap_gdt(int cpu)
>>> > {
>>> > __set_fixmap(get_cpu_gdt_ro_index(cpu),
>>> > - __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
>>> > + slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
>>> > + pg_fixmap_gdt_flags);
>>> > }
>>> >
>>> > /* Load the original GDT from the per-cpu structure */
>>> >
>>> > This makes UP boot for me, but SMP (2 cpus) is still busted.
>>>
>>> This change fixed boot for me:
>>>
>>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>>> index b65155cc3760..4e30707d9f9a 100644
>>> --- a/arch/x86/include/asm/fixmap.h
>>> +++ b/arch/x86/include/asm/fixmap.h
>>> @@ -104,7 +104,9 @@ enum fixed_addresses {
>>> FIX_GDT_REMAP_BEGIN,
>>> FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
>>>
>>> - __end_of_permanent_fixed_addresses,
>>> + __end_of_permanent_fixed_addresses =
>>> + (FIX_GDT_REMAP_END + PTRS_PER_PTE - 1) &
>>> + -PTRS_PER_PTE,
>>>
>>> Just ensure PKMAP_BASE & FIX_WP_TEST are on a different PMD.
>>>
>>> I don't think that the right fix but it might help understand the
>>> exact root cause.
>>
>> Could this be related to the permission bits in the PMD itself getting out of sync
>> with the PTEs? WP test marks a page writable/unwritable, and maybe we mess up the
>> restoration. If they are on separate PMDs then this is worked around because the
>> fixmap GDT is on a separate PMD.
>>
>
> I don't think so. I think it's a pair of bugs related to the way that
> percpu areas are virtually mapped.
>
> Bug 1: __pa is totally bogus on percpu pointers. Oddly, we have one
> older instance of exactly the same bug (on the same GDT address) in
> the kernel. I'll send a patch.
>
> Bug 2: Nothing syncs a freshly-set-up CPU's percpu area into
> initial_page_table. This makes access to the gdt fail in
> startup_32_smp. This looks like a longstanding bug, and I don't see
> what it has to do with Thomas' series. I'm still mulling over what to
> do about it.
Why do you think padding the fixmap also fix the problem? That's the
thing I don't get.
With the padding the PA is now correct and the memcmp check also
succeed. That's odd.
>
> --Andy
--
Thomas
On Wed, Mar 22, 2017 at 9:38 AM, Thomas Garnier <[email protected]> wrote:
> On Wed, Mar 22, 2017 at 9:33 AM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Mar 22, 2017 at 12:36 AM, Ingo Molnar <[email protected]> wrote:
>>>
>>> * Thomas Garnier <[email protected]> wrote:
>>>
>>>> > static inline void setup_fixmap_gdt(int cpu)
>>>> > {
>>>> > __set_fixmap(get_cpu_gdt_ro_index(cpu),
>>>> > - __pa(get_cpu_gdt_rw(cpu)), pg_fixmap_gdt_flags);
>>>> > + slow_virt_to_phys(get_cpu_gdt_rw(cpu)),
>>>> > + pg_fixmap_gdt_flags);
>>>> > }
>>>> >
>>>> > /* Load the original GDT from the per-cpu structure */
>>>> >
>>>> > This makes UP boot for me, but SMP (2 cpus) is still busted.
>>>>
>>>> This change fixed boot for me:
>>>>
>>>> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
>>>> index b65155cc3760..4e30707d9f9a 100644
>>>> --- a/arch/x86/include/asm/fixmap.h
>>>> +++ b/arch/x86/include/asm/fixmap.h
>>>> @@ -104,7 +104,9 @@ enum fixed_addresses {
>>>> FIX_GDT_REMAP_BEGIN,
>>>> FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
>>>>
>>>> - __end_of_permanent_fixed_addresses,
>>>> + __end_of_permanent_fixed_addresses =
>>>> + (FIX_GDT_REMAP_END + PTRS_PER_PTE - 1) &
>>>> + -PTRS_PER_PTE,
>>>>
>>>> Just ensure PKMAP_BASE & FIX_WP_TEST are on a different PMD.
>>>>
>>>> I don't think that the right fix but it might help understand the
>>>> exact root cause.
>>>
>>> Could this be related to the permission bits in the PMD itself getting out of sync
>>> with the PTEs? WP test marks a page writable/unwritable, and maybe we mess up the
>>> restoration. If they are on separate PMDs then this is worked around because the
>>> fixmap GDT is on a separate PMD.
>>>
>>
>> I don't think so. I think it's a pair of bugs related to the way that
>> percpu areas are virtually mapped.
>>
>> Bug 1: __pa is totally bogus on percpu pointers. Oddly, we have one
>> older instance of exactly the same bug (on the same GDT address) in
>> the kernel. I'll send a patch.
>>
>> Bug 2: Nothing syncs a freshly-set-up CPU's percpu area into
>> initial_page_table. This makes access to the gdt fail in
>> startup_32_smp. This looks like a longstanding bug, and I don't see
>> what it has to do with Thomas' series. I'm still mulling over what to
>> do about it.
>
> Why do you think padding the fixmap also fix the problem? That's the
> thing I don't get.
>
> With the padding the PA is now correct and the memcmp check also
> succeed. That's odd.
>
Not sure. There are some complicated heuristics in the percpu code
that determine how it's allocated, and the padding might be affecting
those heuristics.
--Andy