While running the LTP mm test suite on i386 or qemu_i386, this kernel warning
has been observed on the stable 5.4 through stable 5.7 branches, on mainline
5.8.0-rc4, and on linux-next.
metadata:
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git commit: dcb7fd82c75ee2d6e6f9d8cc71c52519ed52e258
kernel-config:
https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/kernel.config
vmlinux: https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/vmlinux.xz
system.map: https://builds.tuxbuild.com/zFkvGKyEbfYdZOMzwrbl-w/System.map
steps to reproduce:
- boot i386 kernel or qemu_i386 device
- cd /opt/ltp
- ./runltp -f mm
code snippet:
file: mm/mremap.c
#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
...
/*
* The destination pmd shouldn't be established, free_pgtables()
* should have released it.
*/
if (WARN_ON(!pmd_none(*new_pmd)))
return false;
------------[ cut here ]------------
[ 720.642524] WARNING: CPU: 3 PID: 27091 at mm/mremap.c:211
move_page_tables+0x6ef/0x720
[ 720.650437] Modules linked in: x86_pkg_temp_thermal
[ 720.655314] CPU: 3 PID: 27091 Comm: true Not tainted 5.8.0-rc4 #1
[ 720.661408] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[ 720.668886] EIP: move_page_tables+0x6ef/0x720
[ 720.673245] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00
25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6
fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f
84 4d
[ 720.691990] EAX: f3250bf8 EBX: f3250bfc ECX: 8c46a067 EDX: 000002fe
[ 720.698249] ESI: bfc00000 EDI: f3250bf8 EBP: eea31dd4 ESP: eea31d74
[ 720.704514] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 720.711353] CR0: 80050033 CR2: b7d9e3b0 CR3: 33250000 CR4: 003406d0
[ 720.717634] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 720.723952] DR6: fffe0ff0 DR7: 00000400
[ 720.727843] Call Trace:
[ 720.730325] setup_arg_pages+0x22b/0x310
[ 720.734277] ? get_random_u32+0x49/0x70
[ 720.738167] ? get_random_u32+0x49/0x70
[ 720.742006] load_elf_binary+0x30e/0x10e0
[ 720.746019] __do_execve_file+0x4c3/0x9f0
[ 720.750031] __ia32_sys_execve+0x25/0x30
[ 720.753959] do_syscall_32_irqs_on+0x3d/0x250
[ 720.758317] ? do_user_addr_fault+0x1a0/0x3c0
[ 720.762701] ? __prepare_exit_to_usermode+0x50/0x1a0
[ 720.767703] ? prepare_exit_to_usermode+0x8/0x20
[ 720.772349] do_fast_syscall_32+0x39/0xb0
[ 720.776361] do_SYSENTER_32+0x15/0x20
[ 720.780028] entry_SYSENTER_32+0x9f/0xf2
[ 720.783951] EIP: 0xb7f1d549
[ 720.786741] Code: Bad RIP value.
[ 720.789965] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 720.796225] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 720.802489] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 720.809276] ---[ end trace bc662e76cba2d081 ]---
[ 724.934902] ------------[ cut here ]------------
[ 738.099993] WARNING: CPU: 1 PID: 29339 at mm/mremap.c:211
move_page_tables+0x6ef/0x720
[ 738.107928] Modules linked in: x86_pkg_temp_thermal
[ 738.112860] CPU: 1 PID: 29339 Comm: true Tainted: G W
5.8.0-rc4 #1
[ 738.120374] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[ 738.127904] EIP: move_page_tables+0x6ef/0x720
[ 738.132316] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00
25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6
fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f
84 4d
[ 738.151112] EAX: e0f3bbf8 EBX: e0f3bbfc ECX: 8c4b8067 EDX: 000002fe
[ 738.157422] ESI: bfc00000 EDI: e0f3bbf8 EBP: f3809dd4 ESP: f3809d74
[ 738.163756] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 738.170587] CR0: 80050033 CR2: b7d9e3b0 CR3: 20f3b000 CR4: 003406d0
[ 738.176895] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 738.183179] DR6: fffe0ff0 DR7: 00000400
[ 738.187019] Call Trace:
[ 738.189489] setup_arg_pages+0x22b/0x310
[ 738.193423] ? get_random_u32+0x49/0x70
[ 738.197280] ? get_random_u32+0x49/0x70
[ 738.201119] load_elf_binary+0x30e/0x10e0
[ 738.205159] __do_execve_file+0x4c3/0x9f0
[ 738.209198] __ia32_sys_execve+0x25/0x30
[ 738.213122] do_syscall_32_irqs_on+0x3d/0x250
[ 738.217484] ? do_user_addr_fault+0x1a0/0x3c0
[ 738.221841] ? __prepare_exit_to_usermode+0x50/0x1a0
[ 738.226808] ? prepare_exit_to_usermode+0x8/0x20
[ 738.231427] do_fast_syscall_32+0x39/0xb0
[ 738.235438] do_SYSENTER_32+0x15/0x20
[ 738.239107] entry_SYSENTER_32+0x9f/0xf2
[ 738.243030] EIP: 0xb7f1d549
[ 738.245830] Code: Bad RIP value.
[ 738.249067] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 738.255331] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 738.261595] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 738.268382] ---[ end trace bc662e76cba2d083 ]---
[ 770.832465] true (1218) used greatest stack depth: 5252 bytes left
[ 781.451530] ------------[ cut here ]------------
[ 784.798244] WARNING: CPU: 3 PID: 3053 at mm/mremap.c:211
move_page_tables+0x6ef/0x720
[ 784.806090] Modules linked in: x86_pkg_temp_thermal
[ 784.811007] CPU: 3 PID: 3053 Comm: true Tainted: G W
5.8.0-rc4 #1
[ 784.818449] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[ 784.825928] EIP: move_page_tables+0x6ef/0x720
[ 784.830341] Code: 85 0c ff ff ff 8b 45 08 8b 4d d4 05 00 00 40 00
25 00 00 c0 ff 2b 45 08 39 c1 0f 46 c1 89 45 d4 01 f0 89 45 cc e9 d6
fa ff ff <0f> 0b 80 7d be 00 0f 84 80 fd ff ff e9 02 fe ff ff a8 80 0f
84 4d
[ 784.849138] EAX: e0663bf8 EBX: e0663bfc ECX: 8c463067 EDX: 000002fe
[ 784.855456] ESI: bfc00000 EDI: e0663bf8 EBP: f3e39dd4 ESP: f3e39d74
[ 784.861750] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
[ 784.868532] CR0: 80050033 CR2: b7d9e3b0 CR3: 20663000 CR4: 003406d0
[ 784.874843] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 784.881160] DR6: fffe0ff0 DR7: 00000400
[ 784.885053] Call Trace:
[ 784.887522] setup_arg_pages+0x22b/0x310
[ 784.891483] ? get_random_u32+0x49/0x70
[ 784.895347] ? get_random_u32+0x49/0x70
[ 784.899232] load_elf_binary+0x30e/0x10e0
[ 784.903295] __do_execve_file+0x4c3/0x9f0
[ 784.907359] __ia32_sys_execve+0x25/0x30
[ 784.911339] do_syscall_32_irqs_on+0x3d/0x250
[ 784.915750] ? do_user_addr_fault+0x1a0/0x3c0
[ 784.920161] ? __prepare_exit_to_usermode+0x50/0x1a0
[ 784.925180] ? prepare_exit_to_usermode+0x8/0x20
[ 784.929799] do_fast_syscall_32+0x39/0xb0
[ 784.933862] do_SYSENTER_32+0x15/0x20
[ 784.937580] entry_SYSENTER_32+0x9f/0xf2
[ 784.941557] EIP: 0xb7f1d549
[ 784.944399] Code: Bad RIP value.
[ 784.947624] EAX: ffffffda EBX: bfbbc7c0 ECX: 08067420 EDX: bfbbc9f4
[ 784.953944] ESI: 08058a14 EDI: bfbbc7c9 EBP: bfbbc868 ESP: bfbbc798
[ 784.960261] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 784.967097] ---[ end trace bc662e76cba2d085 ]---
[ 792.443236] ------------[ cut here ]------------
full test log,
https://lkft.validation.linaro.org/scheduler/job/1541788#L9308
--
Linaro LKFT
https://lkft.linaro.org
On Thu, Jul 9, 2020 at 7:28 AM Naresh Kamboju <[email protected]> wrote:
>
> While running LTP mm test suite on i386 or qemu_i386 this kernel warning
> has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4
> and linux next.
Are you able to correlate this with any particular test case in LTP, or does
it happen for random processes?
In the log you linked to, it happens once for ksm05.c and multiple times for
thp01.c, sources here:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/ksm/ksm05.c
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c
Is it always these tests that trigger the warning, or sometimes others?
When you say it happens with linux-5.4 stable, does that mean you don't see
it with older versions? What is the last known working version?
I also see that you give the virtual machine 16GB of RAM, but as you are
running a 32-bit kernel without PAE, only 2.3GB end up being available,
and some other LTP tests in the log run out of memory.
You could check if the behavior changes if you give the kernel less memory,
e.g. 768MB (lowmem only), or enable CONFIG_X86_PAE to let it use the
entire 16GB.
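As a rough sketch of the lowmem experiment (entirely hypothetical: the qemu flags beyond -m, the kernel image, and the rootfs path are placeholders, not taken from the report), the script below just builds and prints the command one might run:

```shell
# Hypothetical: construct and print the qemu command for the lowmem-only
# test. "-m 768M" keeps a non-PAE i386 guest within lowmem; bzImage and
# the rootfs are placeholders.
qemu_cmd='qemu-system-i386 -m 768M -kernel bzImage'
qemu_cmd="$qemu_cmd -append \"console=ttyS0 root=/dev/vda rw\" -nographic"
echo "$qemu_cmd"
```

The alternative experiment would instead rebuild the kernel with CONFIG_X86_PAE=y and keep the full 16GB.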
> full test log,
> https://lkft.validation.linaro.org/scheduler/job/1541788#L9308
Arnd
On Wed, Jul 8, 2020 at 10:28 PM Naresh Kamboju
<[email protected]> wrote:
>
> While running LTP mm test suite on i386 or qemu_i386 this kernel warning
> has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4
> and linux next.
Hmm
If this is repeatable, would you mind making the warning also print
out the old range and new addresses and pmd value?
Something like the attached (UNTESTED!) patch.
Linus
On Thu, 9 Jul 2020 at 13:55, Arnd Bergmann <[email protected]> wrote:
>
> On Thu, Jul 9, 2020 at 7:28 AM Naresh Kamboju <[email protected]> wrote:
> >
> > While running LTP mm test suite on i386 or qemu_i386 this kernel warning
> > has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4
> > and linux next.
>
> Are you able to correlate this with any particular test case in LTP, or does
> it happen for random processes?
>
> In the log you linked to, it happens once for ksm05.c and multiple times for
> thp01.c, sources here:
>
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/ksm/ksm05.c
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c
>
> Is it always these tests that trigger the warning, or sometimes others?
These two test cases are causing this warning multiple times on i386.
>
> When you say it happens with linux-5.4 stable, does that mean you don't see
> it with older versions? What is the last known working version?
I do not see it on stable-4.19 and earlier versions.
Sorry, I was not able to pin down the last known working commit or version.
It started happening from the first stable-rc 5.0 release.
I have evidence [1] showing it on 5.0.1.
>
> I also see that you give the virtual machine 16GB of RAM, but as you are
> running a 32-bit kernel without PAE, only 2.3GB end up being available,
> and some other LTP tests in the log run out of memory.
>
> You could check if the behavior changes if you give the kernel less memory,
> e.g. 768MB (lowmem only), or enable CONFIG_X86_PAE to let it use the
> entire 16GB.
The warning still occurs after enabling the PAE config,
but the oom-killer messages are gone. Thank you.
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
Full test log (the oom-killer messages are gone, but the kernel warning is still there):
https://lkft.validation.linaro.org/scheduler/job/1552606#L10357
build location:
https://builds.tuxbuild.com/puilcMcGVwzFMN5fDUhY4g/
[1] https://qa-reports.linaro.org/lkft/linux-stable-rc-5.0-oe/build/v5.0.1/testrun/1324990/suite/ltp-mm-tests/test/ksm02/log
---
[ 775.646689] WARNING: CPU: 3 PID: 10858 at
/srv/oe/build/tmp-lkft-glibc/work-shared/intel-core2-32/kernel-source/mm/mremap.c:211
move_page_tables+0x553/0x570
[ 775.647006] Modules linked in: fuse
[ 775.647006] CPU: 3 PID: 10858 Comm: true Not tainted 5.0.1 #1
[ 775.647006] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 775.647006] EIP: move_page_tables+0x553/0x570
- Naresh
On Fri, 10 Jul 2020 at 00:42, Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Jul 8, 2020 at 10:28 PM Naresh Kamboju
> <[email protected]> wrote:
> >
> > While running LTP mm test suite on i386 or qemu_i386 this kernel warning
> > has been noticed from stable 5.4 to stable 5.7 branches and mainline 5.8.0-rc4
> > and linux next.
>
> Hmm
>
> If this is repeatable, would you mind making the warning also print
> out the old range and new addresses and pmd value?
I applied your patch and re-tested.
The warning triggered 10 times.
old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Here is the crash output log,
thp01.c:98: PASS: system didn't crash.
[ 741.507000] ------------[ cut here ]------------
[ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211
move_page_tables.cold+0x0/0x2b
[ 741.519812] Modules linked in: x86_pkg_temp_thermal fuse
[ 741.525163] CPU: 1 PID: 15173 Comm: true Not tainted 5.8.0-rc4 #1
[ 741.531313] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[ 741.538760] EIP: move_page_tables.cold+0x0/0x2b
[ 741.543337] Code: b1 a0 03 00 00 81 c1 cc 04 00 00 bb ea ff ff ff
51 68 e0 bc 68 d8 c6 05 dc 29 97 d8 01 e8 13 26 e9 ff 83 c4 0c e9 70
ea ff ff <0f> 0b 52 50 ff 75 08 ff 75 b4 ff 75 d4 68 3c bd 68 d8 e8 f4
25 e9
[ 741.562140] EAX: 7d530067 EBX: e9c90ff8 ECX: 00000000 EDX: 00000000
[ 741.568456] ESI: 00000000 EDI: 7d5ba007 EBP: cef67dd0 ESP: cef67d28
[ 741.574776] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[ 741.581623] CR0: 80050033 CR2: b7d53f50 CR3: 107da000 CR4: 003406f0
[ 741.587941] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 741.594259] DR6: fffe0ff0 DR7: 00000400
[ 741.598159] Call Trace:
[ 741.600694] setup_arg_pages+0x22b/0x310
[ 741.604654] ? _raw_spin_unlock_irqrestore+0x45/0x50
[ 741.609677] ? trace_hardirqs_on+0x4b/0x110
[ 741.613930] ? get_random_u32+0x4e/0x80
[ 741.617809] ? get_random_u32+0x4e/0x80
[ 741.621687] load_elf_binary+0x31e/0x10f0
[ 741.625714] ? __do_execve_file+0x5b4/0xbf0
[ 741.629917] ? find_held_lock+0x24/0x80
[ 741.633839] __do_execve_file+0x5a8/0xbf0
[ 741.637893] __ia32_sys_execve+0x2a/0x40
[ 741.641875] do_syscall_32_irqs_on+0x3d/0x2c0
[ 741.646246] ? find_held_lock+0x24/0x80
[ 741.650105] ? lock_release+0x8a/0x260
[ 741.653890] ? __might_fault+0x41/0x80
[ 741.657660] do_fast_syscall_32+0x60/0xf0
[ 741.661691] do_SYSENTER_32+0x15/0x20
[ 741.665373] entry_SYSENTER_32+0x9f/0xf2
[ 741.669328] EIP: 0xb7f38549
[ 741.672140] Code: Bad RIP value.
[ 741.675430] EAX: ffffffda EBX: bfe19bf0 ECX: 08067420 EDX: bfe19e24
[ 741.681708] ESI: 08058a14 EDI: bfe19bf9 EBP: bfe19c98 ESP: bfe19bc8
[ 741.687991] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[ 741.694804] irq event stamp: 23911
[ 741.698253] hardirqs last enabled at (23929): [<d756f075>]
console_unlock+0x4a5/0x610
[ 741.706181] hardirqs last disabled at (23946): [<d756ec5a>]
console_unlock+0x8a/0x610
[ 741.714041] softirqs last enabled at (23962): [<d82b975c>]
__do_softirq+0x2dc/0x3da
[ 741.721849] softirqs last disabled at (23973): [<d74a8275>]
call_on_stack+0x45/0x50
[ 741.729513] ---[ end trace 170f646c1b6225e0 ]---
[ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Build link: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/
vmlinux: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/vmlinux.xz
system.map: https://builds.tuxbuild.com/1cwiUvFIB4M0hPyB1eA3cA/System.map
full test log,
https://lkft.validation.linaro.org/scheduler/job/1554181#L10557
- Naresh
>
> Something like the attached (UNTESTED!) patch.
>
> Linus
On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju <[email protected]> wrote:
>
> Your patch applied and re-tested.
> warning triggered 10 times.
>
> old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Hmm.. It's not even the overlapping case, it's literally just "move
exactly 2MB of page tables exactly one pmd down". Which should be the
nice efficient case where we can do it without modifying the lower
page tables at all, we just move the PMD entry.
There shouldn't be anything in the new address space from bfa00000-bfdfffff.
That PMD value obviously says differently, but it looks like a nice
normal PMD value, nothing bad there.
I'm starting to think that the issue might be that this is because the
stack segment is special. Not only does it have the growsdown flag,
but that whole thing has the magic guard page logic.
So I wonder if we have installed a guard page _just_ below the old
stack, so that we have populated that pmd because of that.
We used to have an _actual_ guard page and then play nasty games with
vm_start logic. We've gotten rid of that, though, and now we have that
"stack_guard_gap" logic that _should_ mean that vm_start is always
exact and proper (and that free_pgtables() should have emptied it, but
maybe we have some case we forgot about).
> [ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
> [ 741.598159] Call Trace:
> [ 741.600694] setup_arg_pages+0x22b/0x310
> [ 741.621687] load_elf_binary+0x31e/0x10f0
> [ 741.633839] __do_execve_file+0x5a8/0xbf0
> [ 741.637893] __ia32_sys_execve+0x2a/0x40
> [ 741.641875] do_syscall_32_irqs_on+0x3d/0x2c0
> [ 741.657660] do_fast_syscall_32+0x60/0xf0
> [ 741.661691] do_SYSENTER_32+0x15/0x20
> [ 741.665373] entry_SYSENTER_32+0x9f/0xf2
> [ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
Nothing looks bad, and the ELF loading phase memory map should be
really quite simple.
The only half-way unusual thing is that you have basically exactly 2MB
of stack at execve time (easy enough to tune by just setting argv/env
right), and it's moved down by exactly 2MB.
And that latter thing is just due to randomization, see
arch_align_stack() in arch/x86/kernel/process.c.
So that would explain why it doesn't happen every time.
What happens if you apply the attached patch to *always* force the 2MB
shift (rather than moving the stack by a random amount), and then run
the other program (t.c -> compiled to "a.out").
The comment should be obvious. But it's untested, I might have gotten
the math wrong. I don't run in a 32-bit environment.
Linus
On Fri, 10 Jul 2020 at 10:55, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju <[email protected]> wrote:
> >
> > Your patch applied and re-tested.
> > warning triggered 10 times.
> >
> > old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
>
> Hmm.. It's not even the overlapping case, it's literally just "move
> exactly 2MB of page tables exactly one pmd down". Which should be the
> nice efficient case where we can do it without modifying the lower
> page tables at all, we just move the PMD entry.
>
> There shouldn't be anything in the new address space from bfa00000-bfdfffff.
>
> That PMD value obviously says differently, but it looks like a nice
> normal PMD value, nothing bad there.
>
> I'm starting to think that the issue might be that this is because the
> stack segment is special. Not only does it have the growsdown flag,
> but that whole thing has the magic guard page logic.
>
> So I wonder if we have installed a guard page _just_ below the old
> stack, so that we have populated that pmd because of that.
>
> We used to have an _actual_ guard page and then play nasty games with
> vm_start logic. We've gotten rid of that, though, and now we have that
> "stack_guard_gap" logic that _should_ mean that vm_start is always
> exact and proper (and that pgtbales_free() should have emptied it, but
> maybe we have some case we forgot about.
>
> > [ 741.511684] WARNING: CPU: 1 PID: 15173 at mm/mremap.c:211 move_page_tables.cold+0x0/0x2b
> > [ 741.598159] Call Trace:
> > [ 741.600694] setup_arg_pages+0x22b/0x310
> > [ 741.621687] load_elf_binary+0x31e/0x10f0
> > [ 741.633839] __do_execve_file+0x5a8/0xbf0
> > [ 741.637893] __ia32_sys_execve+0x2a/0x40
> > [ 741.641875] do_syscall_32_irqs_on+0x3d/0x2c0
> > [ 741.657660] do_fast_syscall_32+0x60/0xf0
> > [ 741.661691] do_SYSENTER_32+0x15/0x20
> > [ 741.665373] entry_SYSENTER_32+0x9f/0xf2
> > [ 741.734151] old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
>
> Nothing looks bad, and the ELF loading phase memory map should be
> really quite simple.
>
> The only half-way unusual thing is that you have basically exactly 2MB
> of stack at execve time (easy enough to tune by just setting argv/env
> right), and it's moved down by exactly 2MB.
>
> And that latter thing is just due to randomization, see
> arch_align_stack() in arch/x86/kernel/process.c.
>
> So that would explain why it doesn't happen every time.
>
> What happens if you apply the attached patch to *always* force the 2MB
> shift (rather than moving the stack by a random amount), and then run
> the other program (t.c -> compiled to "a.out").
I applied your patch and started the test in a loop for a million iterations,
but it only ran 35 times; the test seems to have hit a timeout after 1 hour.
kernel messages printed while testing a.out
a.out (480) used greatest stack depth: 4872 bytes left
On other device
kworker/dying (172) used greatest stack depth: 5044 bytes left
I am re-running the test with a longer timeout (4 hours) and will share my findings.
ref:
https://lkft.validation.linaro.org/scheduler/job/1555132#L1515
- Naresh
On Fri, Jul 10, 2020 at 10:48 AM Naresh Kamboju
<[email protected]> wrote:
>
> I have applied your patch and test started in a loop for a million times
> but the test ran for 35 times. Seems like the test got a timeout after 1 hour.
That just means that my test-case was wrong (in the sense that what it
was testing wasn't what was triggering things).
It might be wrong because I got the stack usage calculations wrong,
but it might be wrong simply because the bug you've triggered with LTP
needs something else to trigger.
> Re-running test with long timeouts 4 hours and will share findings.
That test was more intended to trigger the issue reliably, if it was
just a "stack is special, needs that exact 2MB aligned case to be
problematic".
So re-running my "t.c" test for a long time with that kernel patch
that removes the stack randomization isn't going to matter. Either it
triggers the case, or not.
I don't actually see any case where we play with vm_start the way we
used to (it was basically removed with commit 1be7107fbe18: "mm:
larger stack guard gap, between vmas"). So it was kind of a long shot
anyway.
Linus
On Sat, 11 Jul 2020 at 01:35, Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Jul 10, 2020 at 10:48 AM Naresh Kamboju
> <[email protected]> wrote:
I have started bisecting this problem and found the first bad commit
commit 9f132f7e145506efc0744426cb338b18a54afc3b
Author: Joel Fernandes (Google) <[email protected]>
Date: Thu Jan 3 15:28:41 2019 -0800
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Moving page-tables at the PMD-level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: William Kucharski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e260460210e1..6185d4f33296 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -172,6 +172,7 @@ config X86
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_MIXED_BREAKPOINTS_REGS
select HAVE_MOD_ARCH_SPECIFIC
+ select HAVE_MOVE_PMD
select HAVE_NMI
select HAVE_OPROFILE
select HAVE_OPTPROBES
After reverting the above patch, the reported kernel warning is fixed on
the Linus mainline tree at 5.8.0-rc4.
--
Linaro LKFT
https://lkft.linaro.org
On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju
<[email protected]> wrote:
>
> I have started bisecting this problem and found the first bad commit
Thanks for the effort. Bisection is often a great tool to figure out
what's wrong.
Sadly, in this case:
> commit 9f132f7e145506efc0744426cb338b18a54afc3b
> Author: Joel Fernandes (Google) <[email protected]>
> Date: Thu Jan 3 15:28:41 2019 -0800
>
> mm: select HAVE_MOVE_PMD on x86 for faster mremap
Yeah, that's just the commit that enables the code, not the commit
that introduces the fundamental problem.
That said, this is a prime example of why I absolutely detest patch
series that do this kind of thing, and are several patches that create
new functionality, followed by one patch to enable it.
If you can't get things working incrementally, maybe you shouldn't do
them at all. Doing a big series of "hidden work" and then enabling it
later is wrong.
In this case, though, the real patch that did the code isn't that kind
of "big series of hidden work" patch series, it's just the (single)
previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large
regions").
So your bisection is useful, it's just that it really points to that
previous commit, and it's where this code was introduced.
It's also worth noting that that commit doesn't really *break*
anything, since it just falls back to the old behavior when it warns.
So to "fix" your test-case, we could just remove the WARN_ON.
But the WARN_ON() is still worrisome, because the thing it tests for
really _should_ be true.
Now, we actually have a known bug in this area that is talked about
elsewhere: the way unmap does the free_pgtables() is
        /* Detach vmas from rbtree */
        detach_vmas_to_be_unmapped(mm, vma, prev, end);

        if (downgrade)
                mmap_write_downgrade(mm);

        unmap_region(mm, vma, prev, start, end);
(and unmap_region() is what does the free_pgtables() that should have
cleared the PMD).
And the problem with the "downgrade" is that another thread can change
the beginning of the next vma when it's a grow-down region (or the end
of the prev one if it's a grow-up).
See commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
munmap") for the source of that
But that requires an _actual_ munmap() system call (the others set
"downgrade" to false - I'm not entirely sure why), and it requires
another thread to be attached to that VM in order to actually do that
racy concurrent stack size change.
And neither of those two cases will be true for the execve() path.
It's a new VM, with just one thread attached, so no threaded munmap()
going on there.
The fact that it seems to happen with
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c
makes me think it's somehow related to THP mappings, but I don't see
why those would matter. All the same pmd freeing should still have
happened, afaik.
And the printout I asked for a few days back for when it triggered
clearly showed a normal non-huge pmd ("val: 7d530067" is just
"accessed, dirty, rw, user and present", which is a perfectly normal
page directory entry for 4kB pages, and we could move the whole thing
and move 2MB (or 4MB) of aligned virtual memory in one go).
Some race with THP splitting and pgtable_free()? I can't see how
anything would race in execve(), or how anything would have touched
that address below the stack in the first place..
Kirill, Oleg, any reaction to this? Neither of you were on the
original email, I think; it's this one:
https://lore.kernel.org/lkml/CA+G9fYt+6OeibZMD0fP=O3nqFbcN3O4xcLkjq0mpQbZJ2uxB9w@mail.gmail.com/
and I really think it is harmless in that when the warning triggers,
we just go back to the page-by-page code, but while I think the
WARN_ON() should probably be downgraded to a WARN_ON_ONCE(), I do
think it's showing _something_.
I just can't see how this would trigger for execve(). That's just
about the _simplest_ case for us: single-threaded, mappings set up
purely by load_elf_binary() etc.
I'm clearly missing something.
Linus
On Sat, Jul 11, 2020 at 11:12 AM Linus Torvalds
<[email protected]> wrote:
>
> The fact that it seems to happen with
>
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/thp/thp01.c
>
> makes me think it's somehow related to THP mappings, but I don't see
> why those would matter. All the same pmd freeing should still have
> happened, afaik.
No, I think the only reason it happens with that test-case is simply
because that test-case tests many different argument sizes, and as
part of that it ends up then hitting the "stack is exactly one pmd in
size" case when the stack alignment is right too.
So the THP part is almost certainly a red herring.
Maybe. Considering that I don't see what's wrong, anything is possible.
Linus
On Sat, Jul 11, 2020 at 11:12:58AM -0700, Linus Torvalds wrote:
> On Sat, Jul 11, 2020 at 10:27 AM Naresh Kamboju
> <[email protected]> wrote:
> >
> > I have started bisecting this problem and found the first bad commit
>
> Thanks for the effort. Bisection is often a great tool to figure out
> what's wrong.
>
> Sadly, in this case:
>
> > commit 9f132f7e145506efc0744426cb338b18a54afc3b
> > Author: Joel Fernandes (Google) <[email protected]>
> > Date: Thu Jan 3 15:28:41 2019 -0800
> >
> > mm: select HAVE_MOVE_PMD on x86 for faster mremap
>
> Yeah, that's just the commit that enables the code, not the commit
> that introduces the fundamental problem.
>
> That said, this is a prime example of why I absolutely detest patch
> series that do this kind of thing, and are several patches that create
> new functionality, followed by one patch to enable it.
>
> If you can't get things working incrementally, maybe you shouldn't do
> them at all. Doing a big series of "hidden work" and then enabling it
> later is wrong.
>
> In this case, though, the real patch that did the code isn't that kind
> of "big series of hidden work" patch series, it's just the (single)
> previous commit 2c91bd4a4e2e ("mm: speed up mremap by 20x on large
> regions").
>
> So your bisection is useful, it's just that it really points to that
> previous commit, and it's where this code was introduced.
Right, I think I should have squashed the enabling of the config and the
introduction of the feature into the same patch, but as you pointed out, that
probably would not have made a difference with this bisect, since it is a
single patch.
> It's also worth noting that that commit doesn't really *break*
> anything, since it just falls back to the old behavior when it warns.
Agreed, I am also of the opinion that the patch likely surfaces an existing
issue rather than introducing a new one.
> So to "fix" your test-case, we could just remove the WARN_ON.
>
> But the WARN_ON() is still worrisome, because the thing it tests for
> really _should_ be true.
I'll get some tracing going in an emulated i386 environment and try to figure
out exactly what is going on before the warning triggers. Thanks for the other
debug hints in this thread!
thanks,
- Joel
On Sat, Jul 11, 2020 at 11:12:58AM -0700, Linus Torvalds wrote:
> Yeah, that's just the commit that enables the code, not the commit
> that introduces the fundamental problem.
>
> That said, this is a prime example of why I absolutely detest patch
> series that do this kind of thing, and are several patches that create
> new functionality, followed by one patch to enable it.
>
> If you can't get things working incrementally, maybe you shouldn't do
> them at all. Doing a big series of "hidden work" and then enabling it
> later is wrong.
I'm struggling with exactly this for my current THP-in-pagecache patches.
There are about fifty patches, each fixing something that won't work if
the page is a THP. And then there's the one patch which actually starts
creating THPs, and that's the one patch any bisect will point to.
But I don't see any other way to do it. It's not like I can put THPs
in the page cache before fixing the things that won't work.
This probably wasn't the kind of thing you had in mind when you wrote
the above, but if you had some advice for my situation, I'd welcome it.
On Sun, Jul 12, 2020 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
>
> But I don't see any other way to do it. It's not like I can put THPs
> in the page cache before fixing the things that won't work.
I agree that sometimes there are bootstrapping issues. Incremental and
explanatory commits are still better than one big commit that
introduces a whole new feature and enables it.
But if at all possible, at least limit the scope of the new feature
first, enabling the simplest possible cases as they become possible so
that there's some incremental testing, and so that bisection can say
"ok, that baseline worked, but then when XYZ happened, things went
sideways".
And even when it's a new feature - if it needs cleanup patches to
other things first, please do that. In fact, please do that as a
completely independent series that goes into a previous kernel release
entirely, so that the cleanup and preparatory patches can be
independently verified by a lot of people who run that _previous_
kernel, so that the baseline of that cleanup phase is something as
stable as possible.
Linus
On Thu, Jul 09, 2020 at 10:22:21PM -0700, Linus Torvalds wrote:
> On Thu, Jul 9, 2020 at 9:29 PM Naresh Kamboju <[email protected]> wrote:
> >
> > Your patch applied and re-tested.
> > warning triggered 10 times.
> >
> > old: bfe00000-c0000000 new: bfa00000 (val: 7d530067)
>
> Hmm.. It's not even the overlapping case, it's literally just "move
> exactly 2MB of page tables exactly one pmd down". Which should be the
> nice efficient case where we can do it without modifying the lower
> page tables at all, we just move the PMD entry.
Hi Linus,
I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it.
The issue is solely within execve() itself and the way it allocates/copies the
temporary stack.
It is indeed an overlapping case: the stack is large enough that the old
and new ranges overlap. The VMA grows quite a bit because of all the page
faults that happen while copying the args/env. Then, during the move of the
overlapped region, it finds that a PMD is already allocated.
With the patch below applied, the warning has not fired in 30 minutes of
testing so far.
Naresh, could you also test the below patch on your setup?
thanks,
- Joel
---8<-----------------------
From: Joel Fernandes <[email protected]>
Subject: [PATCH] fs/exec: Fix stack overlap during stack move on i386
When running LTP's thp01 test, it is observed that a warning fires in
move_page_tables() because a PMD is already allocated.
This happens because there is an address space overlap between the
temporary stack created and the range it is being moved to when the
move_page_tables() is requested. During the move_page_tables() loop, it
picks the same valid PMD that was already allocated for the temporary
stack. This loop requires the PMD to be new or it warns. Making sure
the new location of the stack is non-overlapping with the old location
makes the warning go away.
Fixes: b6a2fea39318e ("mm: variable length argument support")
Reported-by: Naresh Kamboju <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
fs/exec.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a703278..a270205228a1a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -755,6 +755,10 @@ int setup_arg_pages(struct linux_binprm *bprm,
stack_shift = vma->vm_end - stack_top;
+ /* Ensure the temporary stack is shifted by at least its size */
+ if (stack_shift < (vma->vm_end - vma->vm_start))
+ stack_shift = (vma->vm_end - vma->vm_start);
+
bprm->p -= stack_shift;
mm->arg_start = bprm->p;
#endif
--
2.27.0.383.g050319c2ae-goog
On Sun, Jul 12, 2020 at 2:50 PM Joel Fernandes <[email protected]> wrote:
>
> I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it.
> The issue is solely within execve() itself and the way it allocates/copies the
> temporary stack.
>
> It is actually indeed an overlapping case because the length of the
> stack is big enough to cause overlap. The VMA grows quite a bit because of
> all the page faults that happen due to the copy of the args/env. Then during
> the move of overlapped region, it finds that a PMD is already allocated.
Oh, ok, I think I see.
So the issue is that while move_normal_pmd() _itself_ will be clearing
the old pmd entries when it copies them, the 4kB copies that happened
_before_ this time will not have cleared the old pmd that they were
in.
So we've had a deeper stack, and we've already copied some of it one
page at a time, and we're done with those 4kB entries, but now we hit
a 2MB-aligned case and want to move that down. But when it does that,
it hits the (by now hopefully empty) pmd that used to contain the 4kB
entries we copied earlier.
So we've cleared the page table, but we've simply never called
pgtable_clear() here, because the page table got cleared in
move_ptes() when it did that
pte = ptep_get_and_clear(mm, old_addr, old_pte);
on the previous old pte entries.
That makes sense to me. Does that match what you see? Because when I
said it wasn't overlapped, I was really talking about that one single
pmd move itself not being overlapped.
> The below patch fixes it and is not warning anymore in 30 minutes of testing
> so far.
So I'm not hugely happy with the patch, I have to admit.
Why?
Because:
> + /* Ensure the temporary stack is shifted by at least its size */
> + if (stack_shift < (vma->vm_end - vma->vm_start))
> + stack_shift = (vma->vm_end - vma->vm_start);
This basically overrides the random stack shift done by arch_align_stack().
Yeah, yeah, arch_align_stack() is kind of misnamed. It _used_ to do
what the name implies it does, which on x86 is to just give the
minimum ABI-specified 16-byte stack alignment.
But these days, what it really does is say "align the stack pointer,
but also shift it down by a random amount within 8kB so that the start
of the stack isn't some repeatable thing that an attacker can
trivially control with the right argv/envp size.."
I don't think it works right anyway because we then PAGE_ALIGN() it,
but whatever.
But it looks like what you're doing means that now the size of the
stack will override that shift, and that means that now the
randomization is entirely undone. No?
Plus you do it after we've also checked that we haven't grown the
stack down to below mmap_min_addr().
But maybe I misread that patch.
But I do feel like you figured out why the bug happened, now we're
just discussing whether the patch is the right thing to do.
Maybe saying "doing the pmd copies for the initial stack isn't
important, so let's just note this as a special case and get rid of
the WARN_ON()" might be an alternative solution.
The reason I worried was that I felt like we didn't understand why the
WARN_ON() happened. Now that I do, I think we could just say "ok,
don't warn, we know that this can happen, and it's harmless".
Anybody else have any opinions?
Linus
Hi Linus,
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
> On Sun, Jul 12, 2020 at 2:50 PM Joel Fernandes <[email protected]> wrote:
> >
> > I reproduced Naresh's issue on a 32-bit x86 machine and the below patch fixes it.
> > The issue is solely within execve() itself and the way it allocates/copies the
> > temporary stack.
> >
> > It is actually indeed an overlapping case because the length of the
> > stack is big enough to cause overlap. The VMA grows quite a bit because of
> > all the page faults that happen due to the copy of the args/env. Then during
> > the move of overlapped region, it finds that a PMD is already allocated.
>
> Oh, ok, I think I see.
>
> So the issue is that while move_normal_pmd() _itself_ will be clearing
> the old pmd entries when it copies them, the 4kB copies that happened
> _before_ this time will not have cleared the old pmd that they were
> in.
>
> So we've had a deeper stack, and we've already copied some of it one
> page at a time, and we're done with those 4kB entries, but now we hit
> a 2MB-aligned case and want to move that down. But when it does that,
> it hits the (by now hopefully empty) pmd that used to contain the 4kB
> entries we copied earlier.
>
> So we've cleared the page table, but we've simply never called
> pgtable_clear() here, because the page table got cleared in
> move_ptes() when it did that
>
> pte = ptep_get_and_clear(mm, old_addr, old_pte);
>
> on the previous old pte entries.
>
> That makes sense to me. Does that match what you see? Because when I
> said it wasn't overlapped, I was really talking about that one single
> pmd move itself not being overlapped.
This matches exactly what you mentioned.
Here is some analysis with specific numbers:
Some time during execve(), copy_strings() causes page faults. During this,
the VMA grows and a new PMD is allocated on each page fault:
copy_strings: Copying args/env old process's memory 8067420 to new stack bfb0eec6 (len 4096)
handle_mm_fault: vma: bfb0d000 c0000000
handle_mm_fault: pmd_alloc: ad: bfb0dec6 ptr: f46d0bf8
Here we see that the pmd entry's (pmd_t *) pointer value is f46d0bf8 and
the fault was on address bfb0dec6.
Similarly, tracing the pattern of the other copy_strings() faults gives the
following mapping:
address space pmd entry pointer
0xbfe00000-0xc0000000 f4698ff8
0xbfc00000-0xbfe00000 f4698ff0
0xbfa00000-0xbfc00000 f4698fe8
This is all from tracing the copy_strings().
Then later, the kernel decides to move the VMA down. The VMA's total size is
5MB. The stack is initially at a 1MB-aligned address: 0xbfb00000. exec
requests move_page_tables() to move it down by 4MB, that is, to 0xbf700000,
which is also 1MB aligned. This is an overlapping move.
move_page_tables() starts, and I see the following prints before the warning
fires. The plan of attack is: first copy 1MB using move_ptes(), like you
said. Then it hits a 2MB-aligned boundary and starts move_normal_pmd(). For
both move_ptes() and move_normal_pmd(), a pmd_alloc() happens first, which
is printed below:
move_page_tables: move_page_tables old=bfb00000 (len:5242880) new=bf700000
move_page_tables: pmd_alloc: ad: bf700000 ptr: f4698fd8
move_page_tables: move_ptes old=bfb00000->bfc00000 new=bf700000
move_page_tables: pmd_alloc: ad: bf800000 ptr: f4698fe0
move_page_tables: move_normal_pmd: old: bfc00000-c0000000 new: bf800000 (val: 0)
move_page_tables: pmd_alloc: ad: bfa00000 ptr: f4698fe8
move_page_tables: move_normal_pmd: old: bfe00000-c0000000 new: bfa00000 (val: 34164067)
So basically, it did 1MB worth of move_ptes(), and then twice 2MB worth of
move_normal_pmd(). Since the shift was only 4MB, it hit an already allocated
PMD during the second move_normal_pmd(): that call sees that the destination
0xbfa00000 was already allocated earlier (for the old stack, as per the
copy_strings() trace above), and the warning fires.
As you said, this is harmless.
One thing I observed from code reading is that move_page_tables() is (in the
case of mremap) only called for non-overlapping moves (through mremap_to()
or move_vma(), as the case may be). It makes me nervous that we are doing an
overlapping move and could cause some bug inadvertently.
Could you let me know if there is any reason why we should not make the new
stack's location non-overlapping, just to keep things simple? (Assuming we
fix the overriding issues you mentioned below.)
> > The below patch fixes it and is not warning anymore in 30 minutes of testing
> > so far.
>
> So I'm not hugely happy with the patch, I have to admit.
>
> Why?
>
> Because:
>
> > + /* Ensure the temporary stack is shifted by at least its size */
> > + if (stack_shift < (vma->vm_end - vma->vm_start))
> > + stack_shift = (vma->vm_end - vma->vm_start);
>
> This basically overrides the random stack shift done by arch_align_stack().
>
> Yeah, yeah, arch_align_stack() is kind of misnamed. It _used_ to do
> what the name implies it does, which on x86 is to just give the
> minimum ABI-specified 16-byte stack alignment.
> But these days, what it really does is say "align the stack pointer,
> but also shift it down by a random amount within 8kB so that the start
> of the stack isn't some repeatable thing that an attacker can
> trivially control with the right argv/envp size.."
Got it, thanks. I will work on improving the patch along these lines.
> I don't think it works right anyway because we then PAGE_ALIGN() it,
> but whatever.
:)
> But it looks like what you're doing means that now the size of the
> stack will override that shift, and that means that now the
> randomization is entirely undone. No?
Yes, true. I will work on fixing it.
> Plus you do it after we've also checked that we haven't grown the
> stack down to below mmap_min_addr().
>
> But maybe I misread that patch.
>
> But I do feel like you figured out why the bug happened, now we're
> just discussing whether the patch is the right thing to do.
Yes.
> Maybe saying "doing the pmd copies for the initial stack isn't
> important, so let's just note this as a special case and get rid of
> the WARN_ON()" might be an alternative solution.
Personally, I feel it is better to keep the warning just so in the future we
can detect any bugs.
thanks,
- Joel
On Sun, Jul 12, 2020 at 7:53 PM Joel Fernandes <[email protected]> wrote:
>
> > But I do feel like you figured out why the bug happened, now we're
> > just discussing whether the patch is the right thing to do.
>
> Yes.
>
> > Maybe saying "doing the pmd copies for the initial stack isn't
> > important, so let's just note this as a special case and get rid of
> > the WARN_ON()" might be an alternative solution.
>
> Personally, I feel it is better to keep the warning just so in the future we
> can detect any bugs.
I don't disagree; the warning didn't happen to find a bug now, but it
did find a case where we might be able to do better.
So now that I feel we understand the issue, and it's not a horrible
problem, just a (very hard to trigger) warning, I don't think there's
any huge hurry.
I think I will, for now, change the WARN_ON() to a
WARN_ON_ONCE() (so that it doesn't flood the logs if somebody triggers
this odd special case maliciously), and add a note to the code about
how this happens, for posterity.
And if/when you figure out a better way to fix it, we can update the note.
Ok?
Linus
On Sun, Jul 12, 2020 at 08:51:26PM -0700, Linus Torvalds wrote:
> > > Maybe saying "doing the pmd copies for the initial stack isn't
> > > important, so let's just note this as a special case and get rid of
> > > the WARN_ON()" might be an alternative solution.
> >
> > Personally, I feel it is better to keep the warning just so in the future we
> > can detect any bugs.
>
> I don't disagree; the warning didn't happen to find a bug now, but it
> did find a case where we might be able to do better.
>
> So now that I feel we understand the issue, and it's not a horrible
> problem, just a (very hard to trigger) warning, I don't think there's
> any huge hurry.
>
> I think I will, for now, change the WARN_ON() to a
> WARN_ON_ONCE() (so that it doesn't flood the logs if somebody triggers
> this odd special case maliciously), and add a note to the code about
> how this happens, for posterity.
>
> And if/when you figure out a better way to fix it, we can update the note.
>
> Ok?
Yes, that sounds great to me.
thanks,
- Joel
On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
> Anybody else have any opinions?
Maybe we just shouldn't allow move_normal_pmd() if the ranges overlap?
Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if
we see an established PMD without 'overlaps' being true.
Untested patch:
diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..e33fcee541fe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long extent, next, old_end;
struct mmu_notifier_range range;
pmd_t *old_pmd, *new_pmd;
+ bool overlaps;
+
+ /*
+ * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+ * In this case we cannot use move_normal_pmd() because destination pmd
+ * might be established page table: move_ptes() doesn't free page
+ * table.
+ */
+ if (old_addr > new_addr)
+ overlaps = old_addr - new_addr < len;
+ else
+ overlaps = new_addr - old_addr < len;
old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
- } else if (extent == PMD_SIZE) {
+ } else if (!overlaps && extent == PMD_SIZE) {
#ifdef CONFIG_HAVE_MOVE_PMD
/*
* If the extent is PMD-sized, try to speed the move by
--
Kirill A. Shutemov
On Tue, 14 Jul 2020 at 13:03, Kirill A. Shutemov <[email protected]> wrote:
>
> On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
> > Anybody else have any opinions?
>
> Maybe we just shouldn't allow move_normal_pmd() if the ranges overlap?
>
> Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if
> we see an established PMD without 'overlaps' being true.
>
> Untested patch:
I applied this patch on top of Linus's mainline tree and tested it on i386.
The reported warning did not happen while running the LTP mm tests [1].
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 5dd572d57ca9..e33fcee541fe 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long extent, next, old_end;
> struct mmu_notifier_range range;
> pmd_t *old_pmd, *new_pmd;
> + bool overlaps;
> +
> + /*
> + * shift_arg_pages() can call move_page_tables() on overlapping ranges.
> + * In this case we cannot use move_normal_pmd() because destination pmd
> + * might be established page table: move_ptes() doesn't free page
> + * table.
> + */
> + if (old_addr > new_addr)
> + overlaps = old_addr - new_addr < len;
> + else
> + overlaps = new_addr - old_addr < len;
>
> old_end = old_addr + len;
> flush_cache_range(vma, old_addr, old_end);
> @@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> split_huge_pmd(vma, old_pmd, old_addr);
> if (pmd_trans_unstable(old_pmd))
> continue;
> - } else if (extent == PMD_SIZE) {
> + } else if (!overlaps && extent == PMD_SIZE) {
> #ifdef CONFIG_HAVE_MOVE_PMD
> /*
> * If the extent is PMD-sized, try to speed the move by
> --
> Kirill A. Shutemov
ref:
https://lkft.validation.linaro.org/scheduler/job/1561734#L1598
- Naresh
Hi Kirill,
On Tue, Jul 14, 2020 at 10:33:06AM +0300, Kirill A. Shutemov wrote:
> On Sun, Jul 12, 2020 at 03:58:06PM -0700, Linus Torvalds wrote:
> > Anybody else have any opinions?
>
> Maybe we just shouldn't allow move_normal_pmd() if the ranges overlap?
>
> Other option: pass 'overlaps' down to move_normal_pmd() and only WARN() if
> we see an established PMD without 'overlaps' being true.
I was thinking we should not call move_page_tables() with overlapping ranges
at all, just to keep things simple. I am concerned about other issues, for
example: if you move a range forward, you can end up overwriting part of the
source range.
Allow me some time to develop a proper patch; I have it on my list and will
try to get to it this week.
I think we can also add a patch to detect the overlap, as you did, and warn
in such a situation.
Thoughts?
thanks,
- Joel
> Untested patch:
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 5dd572d57ca9..e33fcee541fe 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -245,6 +245,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> unsigned long extent, next, old_end;
> struct mmu_notifier_range range;
> pmd_t *old_pmd, *new_pmd;
> + bool overlaps;
> +
> + /*
> + * shift_arg_pages() can call move_page_tables() on overlapping ranges.
> + * In this case we cannot use move_normal_pmd() because destination pmd
> + * might be established page table: move_ptes() doesn't free page
> + * table.
> + */
> + if (old_addr > new_addr)
> + overlaps = old_addr - new_addr < len;
> + else
> + overlaps = new_addr - old_addr < len;
>
> old_end = old_addr + len;
> flush_cache_range(vma, old_addr, old_end);
> @@ -282,7 +294,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> split_huge_pmd(vma, old_pmd, old_addr);
> if (pmd_trans_unstable(old_pmd))
> continue;
> - } else if (extent == PMD_SIZE) {
> + } else if (!overlaps && extent == PMD_SIZE) {
> #ifdef CONFIG_HAVE_MOVE_PMD
> /*
> * If the extent is PMD-sized, try to speed the move by
> --
> Kirill A. Shutemov
On Tue, Jul 14, 2020 at 9:08 AM Joel Fernandes <[email protected]> wrote:
>
> I was thinking we should not call move_page_tables() with overlapping ranges
> at all, just to keep things simple.
No, we're not breaking the existing stack movement code just to keep
things simple.
The rule is "make it as simple as possible, but no simpler".
And "as possible" in the case of Linux means "no breaking of old
interfaces". The stack randomization movement most certainly counts.
Linus
On Tue, Jul 14, 2020 at 12:11 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Jul 14, 2020 at 9:08 AM Joel Fernandes <[email protected]> wrote:
> >
> > I was thinking we should not call move_page_tables() with overlapping ranges
> > at all, just to keep things simple.
>
> No, we're not breaking the existing stack movement code just to keep
> things simple.
>
> The rule is "make it as simple as possible, but no simpler".
>
> And "as possible" in the case of Linux means "no breaking of old
> interfaces". The stack randomization movement most certainly counts.
Hi Linus,
I think you misunderstood me. I was not advocating breaking the stack
movement code or breaking stack randomization; I was going to try to
see if I could keep that working while not having to do an overlapping
move. That is what I would do in a future patch (until then, as you
mentioned, you would switch the WARN_ON to WARN_ON_ONCE).
I agree with you that we keep things as simple as possible while not
breaking functionality.
Cheers,
- Joel
On Tue, Jul 14, 2020 at 11:12 AM Joel Fernandes <[email protected]> wrote:
>
> I think you misunderstood me. I was not advocating breaking the stack
> movement code or breaking stack randomization, I was going to try to
> see if I could keep that working while not having to do an overlapping
> move.
I'm not really seeing how you'd do that with a big stack that gets
close to the stack ulimit.
Except by avoiding randomization.
But the existing randomization may be so bad that it doesn't much
matter. And I do think we limit the execve stack to a reasonably small
fraction of the whole ulimit. So worth exploring, I guess.
The current code with "align_stack" doing randomization could also do
with a lot of clarification. The code is odd.
Linus