Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: b12d691ea5e01db42ccf3b4207e57cb3ce7cfe91 ("i915: fix remap_io_sg to verify the pgprot")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: igt
version: igt-x86_64-fdff4bba-1_20210508
with the following parameters:
group: group-01
ucode: 0xe2
on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 28G memory
caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>
[ 778.550996] kernel BUG at mm/memory.c:2183!
[ 778.551001] invalid opcode: 0000 [#1] SMP PTI
[ 778.555616]
[ 778.559012] CPU: 2 PID: 5959 Comm: prime_mmap_cohe Tainted: G I 5.12.0-11231-gb12d691ea5e0 #1
[ 778.559014] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[ 778.559015] RIP: 0010:remap_pfn_range_notrack (kbuild/src/consumer/mm/memory.c:2183 kbuild/src/consumer/mm/memory.c:2211 kbuild/src/consumer/mm/memory.c:2233 kbuild/src/consumer/mm/memory.c:2255 kbuild/src/consumer/mm/memory.c:2311)
[ 778.591640] Code: 4c 89 f7 e8 2d 96 da ff 84 c0 0f 84 b5 00 00 00 4c 89 f0 48 8b 15 b3 01 38 01 48 c1 e0 0c 4d 85 e4 75 8d 48 21 c2 31 c0 eb a8 <0f> 0b 48 8b 7c 24 48 48 89 c6 4c 89 8c 24 80 00 00 00 48 89 c5 e8
All code
========
0: 4c 89 f7 mov %r14,%rdi
3: e8 2d 96 da ff callq 0xffffffffffda9635
8: 84 c0 test %al,%al
a: 0f 84 b5 00 00 00 je 0xc5
10: 4c 89 f0 mov %r14,%rax
13: 48 8b 15 b3 01 38 01 mov 0x13801b3(%rip),%rdx # 0x13801cd
1a: 48 c1 e0 0c shl $0xc,%rax
1e: 4d 85 e4 test %r12,%r12
21: 75 8d jne 0xffffffffffffffb0
23: 48 21 c2 and %rax,%rdx
26: 31 c0 xor %eax,%eax
28: eb a8 jmp 0xffffffffffffffd2
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 8b 7c 24 48 mov 0x48(%rsp),%rdi
31: 48 89 c6 mov %rax,%rsi
34: 4c 89 8c 24 80 00 00 mov %r9,0x80(%rsp)
3b: 00
3c: 48 89 c5 mov %rax,%rbp
3f: e8 .byte 0xe8
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 8b 7c 24 48 mov 0x48(%rsp),%rdi
7: 48 89 c6 mov %rax,%rsi
a: 4c 89 8c 24 80 00 00 mov %r9,0x80(%rsp)
11: 00
12: 48 89 c5 mov %rax,%rbp
15: e8 .byte 0xe8
[ 778.610405] RSP: 0000:ffffc9000892fc28 EFLAGS: 00010286
[ 778.615627] RAX: 800000011f651227 RBX: 00007f1f92ba6000 RCX: 000fffffc0000000
[ 778.622757] RDX: 000000011f651000 RSI: 0000000000000001 RDI: 000000000011f651
[ 778.629888] RBP: ffffea001d6395a8 R08: 8000000000000027 R09: ffff888000000d10
[ 778.637018] R10: ffff888759fcdbc0 R11: 0000000759ffffff R12: 8000000000000027
[ 778.644150] R13: 00007f1f92ba4000 R14: 000000000011f652 R15: ffff888758e56d20
[ 778.651283] FS: 00007f1f92bb9c00(0000) GS:ffff888759c80000(0000) knlGS:0000000000000000
[ 778.659371] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 778.665114] CR2: 00007f1f92aa4000 CR3: 000000075230e006 CR4: 00000000003706e0
[ 778.672245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 778.679375] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 778.686504] Call Trace:
[ 778.688951] remap_pfn_range (kbuild/src/consumer/mm/memory.c:2342)
[ 778.692700] remap_io_sg (kbuild/src/consumer/drivers/gpu/drm/i915/i915_mm.c:71) i915
[ 778.696751] vm_fault_cpu (kbuild/src/consumer/drivers/gpu/drm/i915/gem/i915_gem_mman.c:263) i915
[ 778.700978] __do_fault (kbuild/src/consumer/mm/memory.c:3646)
[ 778.704385] do_fault (kbuild/src/consumer/mm/memory.c:4002 kbuild/src/consumer/mm/memory.c:4080)
[ 778.707700] __handle_mm_fault (kbuild/src/consumer/mm/memory.c:4325 kbuild/src/consumer/mm/memory.c:4460)
[ 778.711802] ? __might_fault (kbuild/src/consumer/mm/memory.c:5030)
[ 778.715555] handle_mm_fault (kbuild/src/consumer/mm/memory.c:4558)
[ 778.719393] do_user_addr_fault (kbuild/src/consumer/include/linux/sched/signal.h:403 kbuild/src/consumer/arch/x86/mm/fault.c:1392)
[ 778.723584] exc_page_fault (kbuild/src/consumer/arch/x86/include/asm/irqflags.h:40 kbuild/src/consumer/arch/x86/include/asm/irqflags.h:75 kbuild/src/consumer/arch/x86/mm/fault.c:1483 kbuild/src/consumer/arch/x86/mm/fault.c:1531)
[ 778.727335] ? asm_exc_page_fault (kbuild/src/consumer/arch/x86/include/asm/idtentry.h:577)
[ 778.731438] asm_exc_page_fault (kbuild/src/consumer/arch/x86/include/asm/idtentry.h:577)
[ 778.735446] RIP: 0033:0x7f1f965a4b0d
[ 778.739031] Code: 01 00 00 48 83 fa 40 77 77 c5 fe 7f 44 17 e0 c5 fe 7f 07 c5 f8 77 c3 66 0f 1f 44 00 00 c5 f8 77 48 89 d1 40 0f b6 c6 48 89 fa <f3> aa 48 89 d0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 39 d1
All code
========
0: 01 00 add %eax,(%rax)
2: 00 48 83 add %cl,-0x7d(%rax)
5: fa cli
6: 40 77 77 rex ja 0x80
9: c5 fe 7f 44 17 e0 vmovdqu %ymm0,-0x20(%rdi,%rdx,1)
f: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
13: c5 f8 77 vzeroupper
16: c3 retq
17: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
1d: c5 f8 77 vzeroupper
20: 48 89 d1 mov %rdx,%rcx
23: 40 0f b6 c6 movzbl %sil,%eax
27: 48 89 fa mov %rdi,%rdx
2a:* f3 aa rep stos %al,%es:(%rdi) <-- trapping instruction
2c: 48 89 d0 mov %rdx,%rax
2f: c3 retq
30: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
37: 00 00 00 00
3b: 66 90 xchg %ax,%ax
3d: 48 39 d1 cmp %rdx,%rcx
Code starting with the faulting instruction
===========================================
0: f3 aa rep stos %al,%es:(%rdi)
2: 48 89 d0 mov %rdx,%rax
5: c3 retq
6: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
d: 00 00 00 00
11: 66 90 xchg %ax,%ax
13: 48 39 d1 cmp %rdx,%rcx
[ 778.757796] RSP: 002b:00007fff8213d518 EFLAGS: 00010206
[ 778.763016] RAX: 00000000000000c5 RBX: 00007f1f92aa4000 RCX: 0000000000100000
[ 778.770146] RDX: 00007f1f92aa4000 RSI: 00000000000000c5 RDI: 00007f1f92aa4000
[ 778.777276] RBP: 00005626b8459f50 R08: 0000000000000004 R09: 0000000100000000
[ 778.784404] R10: fffffffffffffba5 R11: 00007f1f965a4b30 R12: 0000000000000006
[ 778.791535] R13: 00005626b95d6c40 R14: 00005626b95db1b0 R15: 0000000000000000
[ 778.798668] Modules linked in: ipmi_devintf ipmi_msghandler btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c sd_mod t10_pi sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp i915 kvm_intel kvm vfio_mdev vfio_iommu_type1 vfio irqbypass crct10dif_pclmul mdev crc32_pclmul intel_gtt crc32c_intel ghash_clmulni_intel drm_kms_helper mei_wdt syscopyarea ahci sysfillrect rapl sysimgblt libahci fb_sys_fops intel_cstate mei_me wmi_bmof drm intel_uncore libata mei joydev intel_pch_thermal wmi video intel_pmc_core acpi_pad ip_tables
[ 778.848365] ---[ end trace 13b8b23960c0dd23 ]---
[ 778.852982] RIP: 0010:remap_pfn_range_notrack (kbuild/src/consumer/mm/memory.c:2183 kbuild/src/consumer/mm/memory.c:2211 kbuild/src/consumer/mm/memory.c:2233 kbuild/src/consumer/mm/memory.c:2255 kbuild/src/consumer/mm/memory.c:2311)
[ 778.858381] Code: 4c 89 f7 e8 2d 96 da ff 84 c0 0f 84 b5 00 00 00 4c 89 f0 48 8b 15 b3 01 38 01 48 c1 e0 0c 4d 85 e4 75 8d 48 21 c2 31 c0 eb a8 <0f> 0b 48 8b 7c 24 48 48 89 c6 4c 89 8c 24 80 00 00 00 48 89 c5 e8
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang
On Tue, May 18, 2021 at 4:26 PM kernel test robot <[email protected]> wrote:
>
> commit: b12d691ea5e01db42ccf3b4207e57cb3ce7cfe91 ("i915: fix remap_io_sg to verify the pgprot")
> [...]
> [ 778.550996] kernel BUG at mm/memory.c:2183!
> [ 778.559015] RIP: 0010:remap_pfn_range_notrack (kbuild/src/consumer/mm/memory.c:2183 kbuild/src/consumer/mm/memory.c:2211 kbuild/src/consumer/mm/memory.c:2233 kbuild/src/consumer/mm/memory.c:2255 kbuild/src/consumer/mm/memory.c:2311)
> [ 778.688951] remap_pfn_range (kbuild/src/consumer/mm/memory.c:2342)
> [ 778.692700] remap_io_sg (kbuild/src/consumer/drivers/gpu/drm/i915/i915_mm.c:71) i915
Yeah, so that BUG_ON() checks that there isn't any old mapping there.
You can't just remap over an old one, but it does seem like that is
exactly what commit b12d691ea5e0 ("i915: fix remap_io_sg to verify the
pgprot") ends up doing.
So the code used to just do "apply_to_page_range()", which admittedly
was odd too. But it didn't mind having old mappings and re-applying
something over them.
Converting it to use remap_pfn_range() does look better, but it kind
of depends on it only ever being done *once*. But the caller seems to very
much remap the whole vma at fault time, so...
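To make the difference concrete, here is a minimal userspace sketch of the two behaviours. It is a toy model under stated assumptions, not kernel code: every model_* name is hypothetical, a zero array slot stands in for pte_none(), and the remap variant returns -1 where the kernel would hit BUG_ON(!pte_none(*pte)).

```c
/* Userspace toy model, not kernel code: all model_* names are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NPTES 4

/* A zero entry stands in for pte_none(); anything else is "already mapped". */
typedef struct {
    uint64_t pte[MODEL_NPTES];
} model_vma;

/* Like the old apply_to_page_range() path: overwrite whatever is there. */
static int model_apply_range(model_vma *vma, uint64_t first_pfn)
{
    for (size_t i = 0; i < MODEL_NPTES; i++)
        vma->pte[i] = first_pfn + i;
    return 0;
}

/*
 * Like remap_pfn_range(): every slot must be empty first. The kernel
 * enforces this with a BUG_ON(); the model returns -1 instead of
 * crashing so that the failure can be observed.
 */
static int model_remap_range(model_vma *vma, uint64_t first_pfn)
{
    for (size_t i = 0; i < MODEL_NPTES; i++) {
        if (vma->pte[i] != 0)
            return -1; /* stands in for the BUG at mm/memory.c:2183 */
        vma->pte[i] = first_pfn + i;
    }
    return 0;
}

/* Fault handler that maps the whole range on every fault. */
static int model_fault(model_vma *vma, bool use_remap)
{
    return use_remap ? model_remap_range(vma, 0x1000)
                     : model_apply_range(vma, 0x1000);
}
```

In this model, calling model_fault() twice on the same model_vma succeeds both times with use_remap set to false, but the second call returns -1 with use_remap set to true, which is the situation the report appears to show.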
I don't know what the right thing to do here is, because I don't know
the invalidation logic and when faults happen.
I see that there is another thread about different issues on the
intel-gfx list. Adding a few people to this kernel test robot thread
too.
I'd be inclined to revert the commits as "not ready yet", but it would
be better if somebody can go "yeah, this should be done properly like
X".
Anybody?
Linus
On Tue, May 18, 2021 at 04:58:31PM -1000, Linus Torvalds wrote:
> On Tue, May 18, 2021 at 4:26 PM kernel test robot <[email protected]> wrote:
> >
> > commit: b12d691ea5e01db42ccf3b4207e57cb3ce7cfe91 ("i915: fix remap_io_sg to verify the pgprot")
> > [...]
> > [ 778.550996] kernel BUG at mm/memory.c:2183!
> > [ 778.559015] RIP: 0010:remap_pfn_range_notrack (kbuild/src/consumer/mm/memory.c:2183 kbuild/src/consumer/mm/memory.c:2211 kbuild/src/consumer/mm/memory.c:2233 kbuild/src/consumer/mm/memory.c:2255 kbuild/src/consumer/mm/memory.c:2311)
> > [ 778.688951] remap_pfn_range (kbuild/src/consumer/mm/memory.c:2342)
> > [ 778.692700] remap_io_sg (kbuild/src/consumer/drivers/gpu/drm/i915/i915_mm.c:71) i915
>
> Yeah, so that BUG_ON() checks that there isn't any old mapping there.
>
> You can't just remap over an old one, but it does seem like that is
> exactly what commit b12d691ea5e0 ("i915: fix remap_io_sg to verify the
> pgprot") ends up doing.
>
> So the code used to just do "apply_to_page_range()", which admittedly
> was odd too. But it didn't mind having old mappings and re-applying
> something over them.
>
> Converting it to use remap_pfn_range() does look better, but it kind
> of depends on it only ever being done *once*. But the caller seems to very
> much remap the whole vma at fault time, so...
>
> I don't know what the right thing to do here is, because I don't know
> the invalidation logic and when faults happen.
>
> I see that there is another thread about different issues on the
> intel-gfx list. Adding a few people to this kernel test robot thread
> too.
>
> I'd be inclined to revert the commits as "not ready yet", but it would
> be better if somebody can go "yeah, this should be done properly like
> X".
I think reverting just this commit for now is the best thing.