Hi,
I noticed a regression report on Bugzilla [1]. Quoting from it:
> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
>
> $ uname -a
> Linux jacob 6.4.1-gentoo #1 SMP PREEMPT_DYNAMIC Sat Jul 1 19:02:42 EDT 2023 x86_64 AMD Ryzen 9 7950X3D 16-Core Processor AuthenticAMD GNU/Linux
> $ cat repro.c
> #define _GNU_SOURCE
> #include <sched.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> void *threadSafeAlloc(size_t n) {
>     static size_t end_index = 0;
>     static char buffer[1 << 25];
>     size_t start_index = __atomic_load_n(&end_index, __ATOMIC_SEQ_CST);
>     while (1) {
>         if (start_index + n > sizeof(buffer)) _exit(1);
>         if (__atomic_compare_exchange_n(&end_index, &start_index, start_index + n, 1, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) return buffer + start_index;
>     }
> }
>
> int thread(void *arg) {
>     size_t i;
>     size_t n = 1 << 7;
>     char *items;
>     (void)arg;
>     while (1) {
>         items = threadSafeAlloc(n);
>         for (i = 0; i != n; i += 1) items[i] = '@';
>         for (i = 0; i != n; i += 1) if (items[i] != '@') _exit(2);
>     }
> }
>
> int main(void) {
>     static size_t stacks[2][1 << 9];
>     size_t i;
>     for (i = 0; i != 2; i += 1) clone(&thread, &stacks[i] + 1, CLONE_THREAD | CLONE_VM | CLONE_SIGHAND, NULL);
>     while (1) {
>         if (fork() == 0) _exit(0);
>         (void)wait(NULL);
>     }
> }
> $ cc repro.c
> $ ./a.out
> $ echo $?
> 2
>
> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
>
> With this reproducer in hand, I was able to perform the following bisection:
>
> git bisect start
> # status: waiting for both good and bad commits
> # bad: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
> git bisect bad 6995e2de6891c724bfeb2db33d7b87775f913ad1
> # status: waiting for good commit(s), bad commit known
> # good: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
> git bisect good 457391b0380335d5e9a5babdec90ac53928b23b4
> # good: [d42b1c47570eb2ed818dc3fe94b2678124af109d] Merge tag 'devicetree-for-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
> git bisect good d42b1c47570eb2ed818dc3fe94b2678124af109d
> # bad: [58390c8ce1bddb6c623f62e7ed36383e7fa5c02f] Merge tag 'iommu-updates-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
> git bisect bad 58390c8ce1bddb6c623f62e7ed36383e7fa5c02f
> # good: [888d3c9f7f3ae44101a3fd76528d3dd6f96e9fd0] Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
> git bisect good 888d3c9f7f3ae44101a3fd76528d3dd6f96e9fd0
> # bad: [86e98ed15b3e34460d1b3095bd119b6fac11841c] Merge tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
> git bisect bad 86e98ed15b3e34460d1b3095bd119b6fac11841c
> # bad: [7fa8a8ee9400fe8ec188426e40e481717bc5e924] Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 7fa8a8ee9400fe8ec188426e40e481717bc5e924
> # bad: [0120dd6e4e202e19a0e011e486fb2da40a5ea279] zram: make zram_bio_discard more self-contained
> git bisect bad 0120dd6e4e202e19a0e011e486fb2da40a5ea279
> # good: [fce0b4213edb960859dcc65ea414c8efb11948e1] mm/page_alloc: add helper for checking if check_pages_enabled
> git bisect good fce0b4213edb960859dcc65ea414c8efb11948e1
> # bad: [59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8] mm: avoid passing 0 to __ffs()
> git bisect bad 59f876fb9d68a4d8c20305d7a7a0daf4ee9478a8
> # good: [0050d7f5ee532f92e8ab1efcec6547bfac527973] afs: split afs_pagecache_valid() out of afs_validate()
> git bisect good 0050d7f5ee532f92e8ab1efcec6547bfac527973
> # good: [2ac0af1b66e3b66307f53b1cc446514308ec466d] mm: fall back to mmap_lock if vma->anon_vma is not yet set
> git bisect good 2ac0af1b66e3b66307f53b1cc446514308ec466d
> # skip: [0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a] mm/mmap: free vm_area_struct without call_rcu in exit_mmap
> git bisect skip 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a
> # skip: [70d4cbc80c88251de0a5b3e8df3275901f1fa99a] powerc/mm: try VMA lock-based page fault handling first
> git bisect skip 70d4cbc80c88251de0a5b3e8df3275901f1fa99a
> # good: [444eeb17437a0ef526c606e9141a415d3b7dfddd] mm: prevent userfaults to be handled under per-vma lock
> git bisect good 444eeb17437a0ef526c606e9141a415d3b7dfddd
> # bad: [e06f47a16573decc57498f2d02f9af3bb3e84cf2] s390/mm: try VMA lock-based page fault handling first
> git bisect bad e06f47a16573decc57498f2d02f9af3bb3e84cf2
> # skip: [0bff0aaea03e2a3ed6bfa302155cca8a432a1829] x86/mm: try VMA lock-based page fault handling first
> git bisect skip 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> # skip: [cd7f176aea5f5929a09a91c661a26912cc995d1b] arm64/mm: try VMA lock-based page fault handling first
> git bisect skip cd7f176aea5f5929a09a91c661a26912cc995d1b
> # good: [52f238653e452e0fda61e880f263a173d219acd1] mm: introduce per-VMA lock statistics
> git bisect good 52f238653e452e0fda61e880f263a173d219acd1
> # bad: [c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f] mm: separate vma->lock from vm_area_struct
> git bisect bad c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f
> # only skipped commits left to test
> # possible first bad commit: [c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f] mm: separate vma->lock from vm_area_struct
> # possible first bad commit: [0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a] mm/mmap: free vm_area_struct without call_rcu in exit_mmap
> # possible first bad commit: [70d4cbc80c88251de0a5b3e8df3275901f1fa99a] powerc/mm: try VMA lock-based page fault handling first
> # possible first bad commit: [cd7f176aea5f5929a09a91c661a26912cc995d1b] arm64/mm: try VMA lock-based page fault handling first
> # possible first bad commit: [0bff0aaea03e2a3ed6bfa302155cca8a432a1829] x86/mm: try VMA lock-based page fault handling first
>
> I do not usually see any kernel log output while running the program, just occasional logs about user space segfaults.
See Bugzilla for the full thread.
Jacob: Can you repeat the bisection, please? Why did you skip the VMA
lock-based page fault commits in your bisection?
Anyway, I'm adding it to regzbot:
#regzbot introduced: 0bff0aaea03e2a..c7f8f31c00d187 https://bugzilla.kernel.org/show_bug.cgi?id=217624
#regzbot title: Memory corruption in multithreaded user space program while calling fork (possibly caused by trying VMA lock-based page fault)
Thanks.
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217624
--
An old man doll... just what I always wanted! - Clara
On 02.07.23 14:27, Bagas Sanjaya wrote:
> I notice a regression report on Bugzilla [1]. Quoting from it:
>
>> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> [...]
>> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
>>
>> With this reproducer in hand, I was able to perform the following bisection:
> [...]
>
> See Bugzilla for the full thread.
Additional details from
https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
```
I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
reverted no longer causes any memory corruption with either my
reproducer or the original program.
```
FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
That's the same commit that causes build problems with go:
https://lore.kernel.org/all/[email protected]/
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot introduced: 0bff0aaea03e2a3
On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> On 02.07.23 14:27, Bagas Sanjaya wrote:
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> >
> >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> > [...]
> >> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
> >>
> >> With this reproducer in hand, I was able to perform the following bisection:
> > [...]
> >
> > See Bugzilla for the full thread.
>
> Additional details from
> https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
>
> ```
> I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> reverted no longer causes any memory corruption with either my
> reproducer or the original program.
> ```
>
> FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
>
> That's the same commit that causes build problems with go:
>
> https://lore.kernel.org/all/[email protected]/
Thanks! I'll investigate this later today. After discussing with
Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
the issue is fixed. I'll post a patch shortly.
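[To make the proposal concrete: "disable ... by default" means changing
the option's entry in mm/Kconfig so it is no longer enabled unless
explicitly selected. A rough sketch of the idea -- illustrative only,
the real hunk is whatever Suren's posted patch says (v6.4 used
"def_bool y"):
```
config PER_VMA_LOCK
	bool "Enable per-VMA locking during page fault handling"
	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
	help
	  Allow per-vma locks, which can reduce mmap_lock contention
	  during page fault handling.
```
]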
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot introduced: 0bff0aaea03e2a3
On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
> >
> > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > >
> > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> > > [...]
> > >> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
> > >>
> > >> With this reproducer in hand, I was able to perform the following bisection:
> > > [...]
> > >
> > > See Bugzilla for the full thread.
> >
> > Additional details from
> > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> >
> > ```
> > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > reverted no longer causes any memory corruption with either my
> > reproducer or the original program.
> > ```
> >
> > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
> >
> > That's the same commit that causes build problems with go:
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Thanks! I'll investigate this later today. After discussing with
> Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> the issue is fixed. I'll post a patch shortly.
Posted at: https://lore.kernel.org/all/[email protected]/
>
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > --
> > Everything you wanna know about Linux kernel regression tracking:
> > https://linux-regtracking.leemhuis.info/about/#tldr
> > If I did something stupid, please tell me, as explained on that page.
> >
> > #regzbot introduced: 0bff0aaea03e2a3
On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> > Leemhuis) <[email protected]> wrote:
> > >
> > > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > >
> > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> > > > [...]
> > > >> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
> > > >>
> > > >> With this reproducer in hand, I was able to perform the following bisection:
> > > > [...]
> > > >
> > > > See Bugzilla for the full thread.
> > >
> > > Additional details from
> > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> > >
> > > ```
> > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > > reverted no longer causes any memory corruption with either my
> > > reproducer or the original program.
> > > ```
> > >
> > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
> > >
> > > That's the same commit that causes build problems with go:
> > >
> > > https://lore.kernel.org/all/[email protected]/
> >
> > Thanks! I'll investigate this later today. After discussing with
> > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > the issue is fixed. I'll post a patch shortly.
>
> Posted at: https://lore.kernel.org/all/[email protected]/
As that change fixes something in 6.4, why not cc: stable on it as well?
thanks,
greg k-h
On Mon, Jul 3, 2023 at 11:44 AM Greg KH <[email protected]> wrote:
>
> On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> > > Leemhuis) <[email protected]> wrote:
> > > >
> > > > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > > >
> > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> > > > > [...]
> > > > >> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
> > > > >>
> > > > >> With this reproducer in hand, I was able to perform the following bisection:
> > > > > [...]
> > > > >
> > > > > See Bugzilla for the full thread.
> > > >
> > > > Additional details from
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> > > >
> > > > ```
> > > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > > > reverted no longer causes any memory corruption with either my
> > > > reproducer or the original program.
> > > > ```
> > > >
> > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
> > > >
> > > > That's the same commit that causes build problems with go:
> > > >
> > > > https://lore.kernel.org/all/[email protected]/
> > >
> > > Thanks! I'll investigate this later today. After discussing with
> > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > the issue is fixed. I'll post a patch shortly.
> >
> > Posted at: https://lore.kernel.org/all/[email protected]/
>
> As that change fixes something in 6.4, why not cc: stable on it as well?
Sorry, I thought that since per-VMA locks were introduced in 6.4 and
this patch is fixing 6.4, I didn't need to send it to stable for older
versions. Did I miss something?
Thanks,
Suren.
>
> thanks,
>
> greg k-h
On Tue, Jul 04, 2023 at 12:45:39AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jul 3, 2023 at 11:44 AM Greg KH <[email protected]> wrote:
> >
> > On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> > > > Leemhuis) <[email protected]> wrote:
> > > > >
> > > > > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > > > >
> > > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but random crashes in a user space program. After a lot of reduction, I have come up with the following reproducer program:
> > > > > > [...]
> > > > > >> After tuning the various parameters for my computer, exit code 2, which indicates that memory corruption was detected, occurs approximately 99% of the time. Exit code 1, which occurs approximately 1% of the time, means it ran out of statically-allocated memory before reproducing the issue, and increasing the memory usage any more only leads to diminishing returns. There is also something like a 0.1% chance that it segfaults due to memory corruption elsewhere than in the statically-allocated buffer.
> > > > > >>
> > > > > >> With this reproducer in hand, I was able to perform the following bisection:
> > > > > > [...]
> > > > > >
> > > > > > See Bugzilla for the full thread.
> > > > >
> > > > > Additional details from
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> > > > >
> > > > > ```
> > > > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > > > > reverted no longer causes any memory corruption with either my
> > > > > reproducer or the original program.
> > > > > ```
> > > > >
> > > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > > > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]]
> > > > >
> > > > > That's the same commit that causes build problems with go:
> > > > >
> > > > > https://lore.kernel.org/all/[email protected]/
> > > >
> > > > Thanks! I'll investigate this later today. After discussing with
> > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > the issue is fixed. I'll post a patch shortly.
> > >
> > > Posted at: https://lore.kernel.org/all/[email protected]/
> >
> > As that change fixes something in 6.4, why not cc: stable on it as well?
>
> Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> patch is fixing 6.4 I didn't need to send it to stable for older
> versions. Did I miss something?
6.4.y is a stable kernel tree right now, so yes, it needs to be included
there :)
On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > the issue is fixed. I'll post a patch shortly.
> > > >
> > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > >
> > > As that change fixes something in 6.4, why not cc: stable on it as well?
> >
> > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > patch is fixing 6.4 I didn't need to send it to stable for older
> > versions. Did I miss something?
>
> 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> there :)
I'm in wait-a-few-days-mode on this. To see if we have a backportable
fix rather than disabling the feature in -stable.
On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
>
> > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > > the issue is fixed. I'll post a patch shortly.
> > > > >
> > > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > > >
> > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > >
> > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > versions. Did I miss something?
> >
> > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > there :)
>
> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> fix rather than disabling the feature in -stable.
Ok, I think we have a fix posted at [2], and it applies cleanly to the
6.4.y stable branch as well. However, fork() performance might slightly
regress, so disabling per-VMA locks by default for now seems
preferable even with this fix (see the discussion at
https://lore.kernel.org/all/[email protected]/).
IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
to stable@vger?
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
>
On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan <[email protected]> wrote:
> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> >
> > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > >
> > > > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > > >
> > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > versions. Did I miss something?
> > >
> > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > there :)
> >
> > I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > fix rather than disabling the feature in -stable.
>
> Ok, I think we have a fix posted at [2] and it's cleanly applies to
> 6.4.y stable branch as well. However fork() performance might slightly
> regress, therefore disabling per-VMA locks by default for now seems to
> be preferable even with this fix (see discussion at
> https://lore.kernel.org/all/[email protected]/).
> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> to stable@vger?
>
> [1] https://lore.kernel.org/all/[email protected]/
This one isn't sufficient for .configs which already have
PER_VMA_LOCK=y. Using `depends on BROKEN' would be better.
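[For reference: BROKEN is a Kconfig symbol that nothing ever enables,
so a `depends on BROKEN' line makes an option unselectable and also
switches it off in existing .configs on the next oldconfig run -- which
is Andrew's point above. A hypothetical sketch of what he suggests:
```
config PER_VMA_LOCK
	def_bool y
	# the BROKEN dependency can be dropped once the corruption is fixed
	depends on BROKEN
	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
```
]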
> [2] https://lore.kernel.org/all/[email protected]/
>
We're still awaiting tester input on this?
I think a clean new fully-changelogged two-patch series would be the
best way to handle this. Please ensure that the [0/2] intro clearly
explains what we're proposing here, and why.
Also, "might slightly regress" is a bit weak. These things are
measurable, no? Because a better solution would be to fix 6.4.x and
mainline and leave it at that.
On Tue, Jul 4, 2023 at 2:28 PM Andrew Morton <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan <[email protected]> wrote:
>
> > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
> > >
> > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> > >
> > > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > > >
> > > > > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > > > > >
> > > > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > > > >
> > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > > versions. Did I miss something?
> > > >
> > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > > there :)
> > >
> > > I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > > fix rather than disabling the feature in -stable.
> >
> > Ok, I think we have a fix posted at [2] and it's cleanly applies to
> > 6.4.y stable branch as well. However fork() performance might slightly
> > regress, therefore disabling per-VMA locks by default for now seems to
> > be preferable even with this fix (see discussion at
> > https://lore.kernel.org/all/[email protected]/).
> > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> > to stable@vger?
> >
> > [1] https://lore.kernel.org/all/[email protected]/
>
> This one isn't sufficient for .configs which already have
> PER_VMA_LOCK=y. Using `depends on BROKEN' would be better.
>
> > [2] https://lore.kernel.org/all/[email protected]/
> >
>
> We're still awaiting tester input on this?
Yeah, and it seems to be negative... Anyway, I'll post a dependency on BROKEN.
>
> I think a clean new fully-changelogged two-patch series would be the
> best way to handle this. Please ensure that the [0/2] intro clearly
> explains what we're proposing here, and why.
>
> Also, "might slightly regress" is a bit weak. These things are
> measurable, no? Because a better solution would be to fix 6.4.x and
> mainline and leave it at that.
>
On Tue, Jul 4, 2023 at 3:04 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Jul 4, 2023 at 2:28 PM Andrew Morton <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan <[email protected]> wrote:
> >
> > > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
> > > >
> > > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> > > >
> > > > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > > > >
> > > > > > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > > > > > >
> > > > > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > > > > >
> > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > > > versions. Did I miss something?
> > > > >
> > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > > > there :)
> > > >
> > > > I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > > > fix rather than disabling the feature in -stable.
> > >
> > > Ok, I think we have a fix posted at [2] and it's cleanly applies to
> > > 6.4.y stable branch as well. However fork() performance might slightly
> > > regress, therefore disabling per-VMA locks by default for now seems to
> > > be preferable even with this fix (see discussion at
> > > https://lore.kernel.org/all/[email protected]/).
> > > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> > > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> > > to stable@vger?
> > >
> > > [1] https://lore.kernel.org/all/[email protected]/
> >
> > This one isn't sufficient for .configs which already have
> > PER_VMA_LOCK=y. Using `depends on BROKEN' would be better.
> >
> > > [2] https://lore.kernel.org/all/[email protected]/
> > >
> >
> > We're still awaiting tester input on this?
>
> Yeah, and it seems to be negative... Anyway, I'll post a dependency on BROKEN.
I posted the patchset at
https://lore.kernel.org/all/[email protected]/
CC'ing stable@vger with the cover letter explaining the situation. The
negative report might have been a fluke, so let's wait for more
testing. In the meantime we can disable the feature by applying the
last patch in that series.
>
> >
> > I think a clean new fully-changelogged two-patch series would be the
> > best way to handle this. Please ensure that the [0/2] intro clearly
> > explains what we're proposing here, and why.
Done.
> >
> > Also, "might slightly regress" is a bit weak. These things are
> > measurable, no? Because a better solution would be to fix 6.4.x and
> > mainline and leave it at that.
They are measurable, and the numbers were included in the fix I
posted. I added the numbers to the new cover letter as well.
Thanks,
Suren.
> >
On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> >
> > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > >
> > > > > > Posted at: https://lore.kernel.org/all/[email protected]/
> > > > >
> > > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > > >
> > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > versions. Did I miss something?
> > >
> > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > there :)
> >
> > I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > fix rather than disabling the feature in -stable.
>
> Ok, I think we have a fix posted at [2] and it's cleanly applies to
> 6.4.y stable branch as well. However fork() performance might slightly
> regress, therefore disabling per-VMA locks by default for now seems to
> be preferable even with this fix (see discussion at
> https://lore.kernel.org/all/[email protected]/).
> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> to stable@vger?
We can't do anything for stable until it lands in Linus's tree, so if
you didn't happen to have the stable@ tag in the patch, just email us
the git SHA1 and I can pick it up that way.
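[For reference, the stable@ tag Greg mentions is a line next to the
Signed-off-by in the patch itself, per
Documentation/process/stable-kernel-rules.rst, e.g.:
```
Cc: <[email protected]> # 6.4.x
```
]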
thanks,
greg k-h
On 05.07.23 09:08, Greg KH wrote:
> On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote:
>> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
>>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
>>>>>>>> Thanks! I'll investigate this later today. After discussing with
>>>>>>>> Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
>>>>>>>> the issue is fixed. I'll post a patch shortly.
>>>>>>>
>>>>>>> Posted at: https://lore.kernel.org/all/[email protected]/
>>>>>>
>>>>>> As that change fixes something in 6.4, why not cc: stable on it as well?
>>>>>
>>>>> Sorry, I thought since per-VMA locks were introduced in 6.4 and this
>>>>> patch is fixing 6.4 I didn't need to send it to stable for older
>>>>> versions. Did I miss something?
>>>>
>>>> 6.4.y is a stable kernel tree right now, so yes, it needs to be included
>>>> there :)
>>>
>>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
>>> fix rather than disabling the feature in -stable.
Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
Greg said below, and that we already have three reports I know of, I'd
prefer to fix this in mainline sooner rather than later --
especially as Arch Linux and openSUSE Tumbleweed have likely switched to
6.4.y already or will do so soon.
>> Ok, I think we have a fix posted at [2] and it's cleanly applies to
>> 6.4.y stable branch as well. However fork() performance might slightly
>> regress, therefore disabling per-VMA locks by default for now seems to
>> be preferable even with this fix (see discussion at
>> https://lore.kernel.org/all/[email protected]/).
>> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
>> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
>> to stable@vger?
>
> We can't do anything for stable until it lands in Linus's tree, so if
> you didn't happen to have the stable@ tag in the patch, just email us
> the git SHA1 and I can pick it up that way.
>
> thanks,
>
> greg k-h
Ciao, Thorsten
On Wed, Jul 05, 2023 at 10:51:57AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 05.07.23 09:08, Greg KH wrote:
> > On Tue, Jul 04, 2023 at 01:22:54PM -0700, Suren Baghdasaryan wrote:
> >> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton <[email protected]> wrote:
> >>> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH <[email protected]> wrote:
> >>>>>>>> Thanks! I'll investigate this later today. After discussing with
> >>>>>>>> Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> >>>>>>>> the issue is fixed. I'll post a patch shortly.
> >>>>>>>
> >>>>>>> Posted at: https://lore.kernel.org/all/[email protected]/
> >>>>>>
> >>>>>> As that change fixes something in 6.4, why not cc: stable on it as well?
> >>>>>
> >>>>> Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> >>>>> patch is fixing 6.4 I didn't need to send it to stable for older
> >>>>> versions. Did I miss something?
> >>>>
> >>>> 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> >>>> there :)
> >>>
> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> >>> fix rather than disabling the feature in -stable.
>
> Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> Greg said below and that we already had three reports I know of I'd
> prefer if we could fix this rather sooner than later in mainline --
> especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> 6.4.y already or will do so soon.
Ick, yeah, and Fedora should be switching soon too, and I want to drop
support for 6.3.y "any day now". Is there just a revert we can do now
first to resolve the regression and then work on fixing this up "better"
for 6.6-rc1?
thanks,
greg k-h
On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten Leemhuis)" <[email protected]> wrote:
> >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> >>> fix rather than disabling the feature in -stable.
>
> Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> Greg said below and that we already had three reports I know of I'd
> prefer if we could fix this rather sooner than later in mainline --
> especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> 6.4.y already or will do so soon.
I'll send today's 2-patch series to Linus today or tomorrow.
On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton <[email protected]> wrote:
>
> On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten Leemhuis)" <[email protected]> wrote:
>
> > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > >>> fix rather than disabling the feature in -stable.
> >
> > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > Greg said below and that we already had three reports I know of I'd
> > prefer if we could fix this rather sooner than later in mainline --
> > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > 6.4.y already or will do so soon.
>
> I'll send today's 2-patch series to Linus today or tomorrow.
I need to make a correction to the patch fixing the issue (the first
one in the series) and will post it within an hour. Thanks!
On Wed, Jul 5, 2023 at 9:14 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton <[email protected]> wrote:
> >
> > On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten Leemhuis)" <[email protected]> wrote:
> >
> > > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
> > > >>> fix rather than disabling the feature in -stable.
> > >
> > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
> > > Greg said below and that we already had three reports I know of I'd
> > > prefer if we could fix this rather sooner than later in mainline --
> > > especially as Arch Linux and openSUSE Tumbleweed likely have switched to
> > > 6.4.y already or will do so soon.
> >
> > I'll send today's 2-patch series to Linus today or tomorrow.
>
> I need to make a correction to the patch fixing the issue (the first
> one in the series) and will post it within an hour. Thanks!
The corrected patch is posted at
https://lore.kernel.org/all/[email protected]/.
The other patch is unchanged but I posted v3 of the entire patchset
anyway.
Andrew, could you please replace the old version with this one before
sending it to Linus?
Thanks,
Suren.
[adding Linus to the list of recipients to ensure the fix makes it into
-rc1 (and can finally be backported to -stable).
Linus, here is the backstory, as I assume you haven't seen this yet:
CONFIG_PER_VMA_LOCK (which defaults to Y; merged for v6.4-rc1 in
0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling first"))
sometimes causes memory corruption reported here:
https://lore.kernel.org/all/[email protected]/
https://bugzilla.kernel.org/show_bug.cgi?id=217624
The plan since early this week is to mark CONFIG_PER_VMA_LOCK as broken;
latest patch that does this is this one afaics:
https://lore.kernel.org/all/[email protected]/
But that change or something similar hasn't reached you yet afaics;
note, this is the second patch of a series with two patches]
On 05.07.23 17:49, Andrew Morton wrote:
> On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten Leemhuis)" <[email protected]> wrote:
>
>>>>> I'm in wait-a-few-days-mode on this. To see if we have a backportable
>>>>> fix rather than disabling the feature in -stable.
>>
>> Andrew, how long will you remain in "wait-a-few-days-mode"? Given what
>> Greg said below and that we already had three reports I know of I'd
>> prefer if we could fix this rather sooner than later in mainline --
>> especially as Arch Linux and openSUSE Tumbleweed likely have switched to
>> 6.4.y already or will do so soon.
>
> I'll send today's 2-patch series to Linus today or tomorrow.
That afaics has not happened yet. :-(
This makes me regret that I did not CC Linus earlier. I always feel like
a snitcher when I do that. But in retrospect it seems it would have
been the right thing to do given the problem, as I suspect Linus would
have quickly applied the patch or marked the feature as broken himself.
So thanks to this (and a handful of earlier, similar situations) I have
now fully made my peace with feeling like a snitcher (I always knew that
it's kinda part of the position). When something in me says "Ick, this
looks bad to my untrained eyes" I'll immediately CC Linus.
Linus, if I take things too far, just let me know. But I assume you get a
lot of mails and won't mind a few more.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
On Sat, 8 Jul 2023 at 04:35, Thorsten Leemhuis
<[email protected]> wrote:
>
> The plan since early this week is to mark CONFIG_PER_VMA_LOCK as broken;
> latest patch that does this is this one afaics:
Bah.
Both marking it as broken and the pending fix seem excessive.
Why isn't the trivial fix just to say "yes, fork() gets the mmap_lock
for writing for a reason, and that reason is that it acts kind of like
mprotect()".
And then just do what those functions do.
IOW, why isn't the fix just to do
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	for_each_vma(old_vmi, mpnt) {
 		struct file *file;

+		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
 			continue;
and be done with this? Yes, we could move it down a bit more, ignoring
the VM_DONTCOPY vma's, but they are so uncommon as to not matter, so
who cares?
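[The "move it down a bit more" variant Linus alludes to would look
roughly like this -- hypothetical placement, not a posted patch:
```c
	for_each_vma(old_vmi, mpnt) {
		struct file *file;

		if (mpnt->vm_flags & VM_DONTCOPY) {
			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
			continue;
		}
		/* write-lock only the VMAs we actually copy; Linus notes
		 * VM_DONTCOPY VMAs are rare enough that locking them too,
		 * as in the diff above, would also be fine */
		vma_start_write(mpnt);
		/* ... remainder of the dup_mmap() copy loop ... */
	}
```
]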
Linus
On Sat, 8 Jul 2023 at 10:39, Andrew Morton <[email protected]> wrote:
>
> That was the v1 fix, but after some discussion
> (https://lkml.kernel.org/r/[email protected])
> it was decided to take the "excessive" approach.
That makes absolutely _zero_ sense.
It seems to be complete voodoo programming.
To some degree I don't care what happens in stable kernels, but
there's no way we'll do that kind of thing in mainline without some
logic or reason, when it makes no sense.
flush_cache_dup_mm() is entirely irrelevant to the whole issue, for
several reasons, the core one being that it only matters on broken
virtually indexed caches, so none of the architectures that do per-vma
locking.
And the argument that "After the mmap_write_lock_killable(), there
will still be a period where page faults can happen" may be true
(that's kind of the *point* of per-vma locking), but it's irrelevant.
It's true for *all* users of mmap_write_lock_killable, whether in fork
or anywhere else. What makes fork() so magically special?
It's why we have that vma_start_write(), to say "I'm now modifying
*this* vma, so stop accessing it in parallel".
Because no, flush_cache_dup_mm() is not the magical reason to do that thing.
Maybe there is something else going on, but no, we don't write crazy
code without a reason for it. That's completely unmaintainable,
because people will look at that code, not understand it (because
there is nothing to understand) and be afraid to touch it. For no
actual reason.
The obvious place to say "I'm now starting to modify the vma" is when
you actually start to modify the vma.
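[Background on vma_start_write(): it is the v6.4 primitive that marks a
VMA write-locked so that lockless page faults on it, taken via
lock_vma_under_rcu(), bail out and retry under the mmap_lock. A
simplified sketch, not the verbatim include/linux/mm.h code:
```c
static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	/* caller holds mmap_lock for write; nothing to do if this VMA
	 * was already write-locked in the current mmap_lock cycle */
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	down_write(&vma->vm_lock->lock);	/* drain current readers */
	vma->vm_lock_seq = mm_lock_seq;		/* new readers back off */
	up_write(&vma->vm_lock->lock);
}
```
]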
> Also, this change needs a couple more updates:
Those updates seem sane, and come with explanations of why they exist.
Looks fine to me.
Suren, please send me the proper fixes. Not the voodoo one. The ones
you can explain.
And if stable wants to do something else, then that's fine. But for
the development kernel, we have two options:
- fix the PER_VMA_LOCK code
- decide that it's not worth it, and just revert it all
and honestly, I'm ok with that second option, simply because this has
all been way too much pain.
But no, we don't mark it broken thinking we can't deal with it, or do
random non-sensible code we can't explain.
Linus
On Sat, 8 Jul 2023 10:29:42 -0700 Linus Torvalds <[email protected]> wrote:
> On Sat, 8 Jul 2023 at 04:35, Thorsten Leemhuis
> <[email protected]> wrote:
> >
> > The plan since early this week is to mark CONFIG_PER_VMA_LOCK as broken;
> > latest patch that does this is this one afaics:
>
> Bah.
>
> Both marking it as broken and the pending fix seems excessive.
>
> Why isn't the trivial fix just to say "yes, fork() gets the mmap_lock
> for writing for a reason, and that reason is that it acts kind of like
> mprotect()".
>
> And then just do what those functions do.
>
> IOW, why isn't the fix just to do
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	for_each_vma(old_vmi, mpnt) {
>  		struct file *file;
>
> +		vma_start_write(mpnt);
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>  			continue;
>
> and be done with this? Yes, we could move it down a bit more, ignoring
> the VM_DONTCOPY vma's, but they are so uncommon as to not matter, so
> who cares?
That was the v1 fix, but after some discussion
(https://lkml.kernel.org/r/[email protected])
it was decided to take the "excessive" approach.
Also, this change needs a couple more updates:
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]
So I'm thinking it's best to disable the feature in 6.4.x and reenable
it for 6.5 once all this is sorted out.
On Sat, 8 Jul 2023 at 11:40, Suren Baghdasaryan <[email protected]> wrote:
>
> My understanding was that flush_cache_dup_mm() is there to ensure
> nothing is in the cache, so locking VMAs before doing that would
> ensure that no page faults would pollute the caches after we flushed
> them. Is that reasoning incorrect?
It is indeed incorrect.
The VIVT caches are fundamentally broken, and we have various random
hacks for them to make them work in legacy situations.
And that flush_cache_dup_mm() is exactly that: a band-aid to make sure
that when we do a fork(), any previous writes that are dirty in the
caches will have made it to memory, so that they will show up in the
*new* process that has a different virtual mapping.
BUT!
This has nothing to do with page faults, or other threads.
If you have a threaded application that does fork(), it can - and will
- dirty the VIVT caches *during* the fork, and so the whole
"flush_cache_dup_mm()" is completely and fundamentally race wrt any
*new* activity.
It's not even what it is trying to deal with. All it tries to do is to
make sure that the newly forked child AT LEAST sees all the changes
that the parent did up to the point of the fork. Anything after that
is simply not relevant at all.
So think of all this not as some kind of absolute synchronization and
cache coherency (because you will never get that on a VIVT
architecture anyway), but as a "for the simple cases, this will at
least get you the expected behavior".
But as mentioned, for the issue of PER_VMA_LOCK, this is all *doubly*
irrelevant. Not only was it not relevant to begin with (ie that cache
flush only synchronizes parent -> child, not other-threads -> child),
but VIVT caches don't even exist on any relevant architecture because
they are fundamentally broken in so many other ways.
So all our "synchronize caches by hand" is literally just band-aid for
legacy architectures. I think it's mostly things like the old broken
MIPS chips, some sparc32, pa-risc: the "old RISC" stuff, where people
simplified the hardware a bit too much.
VIVT is lovely for hardware people because they get a shortcut. But
it's "lovely" in the same way that "PI=3" is lovely. It's simpler -
but it's _wrong_.
And it's almost entirely useless if you ever do SMP. I guarantee we
have tons of races with it for very fundamental reasons - the problems
it causes for software are not fixable, they are "hidable for the
simple case".
So you'll also find things like flush_dcache_page(), which flushes
writes to a page to memory. And exactly like the fork() case, it's
*not* real cache coherency, and it's *not* some kind of true global
serialization.
It's used in cases where we have a particular user that wants the
changes *it* made to be made visible. And exactly like
flush_cache_dup_mm(), it cannot deal with concurrent changes that
other threads make.
> Ok, I think these two are non-controversial:
> https://lkml.kernel.org/r/[email protected]
> https://lkml.kernel.org/r/[email protected]
These look sane to me. I wonder if the vma_start_write() should have
been somewhere else, but at least it makes sense in context, even if I
get the feeling that maybe it should have been done in some helper
earlier.
As it is, we randomly do it in other helpers like vm_flags_set(), and
I've often had the reaction that these vma_start_write() calls are
randomly sprinkled around without any clear _design_ for where they
are.
> and the question now is how we fix the fork() case:
> https://lore.kernel.org/all/[email protected]/
> (if my above explanation makes sense to you)
See above. That patch is nonsensical. Trying to order
flush_cache_dup_mm() is not about page faults, and is fundamentally
not doable with threads anyway.
> https://lore.kernel.org/all/[email protected]/
This is the one that makes sense to me.
Linus
On Sat, Jul 8, 2023 at 12:23 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, 8 Jul 2023 at 12:17, Suren Baghdasaryan <[email protected]> wrote:
> >
> > Do you want me to disable per-VMA locks by default as well?
>
> No. I seriously believe that if the per-vma locking is so broken that
> it needs to be disabled in a development kernel, we should just admit
> failure, and revert it all.
>
> And not in a "revert it for a later attempt" kind of way.
>
> So it would be a "revert it because it added insurmountable problems
> that we couldn't figure out" thing that implies *not* trying it again
> in that form at all, and much soul-searching before somebody decides
> that they have a more maintainable model for it all.
Got it. I hope that's not the case; so far we haven't received any
indication that the fixes are insufficient.
>
> If stable decides that the fixes are not back-portable, and the whole
> thing needs to be disabled for stable, that's one thing. But if we
> decide that in mainline, it's a "this was a failure" thing.
The patches applied cleanly to the 6.4.y stable branch the last time I
checked, so this should not be a problem.
Thanks,
Suren.
>
> Linus
On Sat, Jul 8, 2023 at 11:05 AM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, 8 Jul 2023 at 10:39, Andrew Morton <[email protected]> wrote:
> >
> > That was the v1 fix, but after some discussion
> > (https://lkml.kernel.org/r/[email protected])
> > it was decided to take the "excessive" approach.
>
> That makes absolutely _zero_ sense.
>
> It seems to be complete voodoo programming.
>
> To some degree I don't care what happens in stable kernels, but
> there's no way we'll do that kind of thing in mainline without some
> logic or reason, when it makes no sense.
>
> flush_cache_dup_mm() is entirely irrelevant to the whole issue, for
> several reason, but the core one being that it only matters on broken
> virtually indexed caches, so none of the architectures that do per-vma
> locking.
>
> And the argument that "After the mmap_write_lock_killable(), there
> will still be a period where page faults can happen" may be true
> (that's kind of the *point* of per-vma locking), but it's irrelevant.
>
> It's true for *all* users of mmap_write_lock_killable, whether in fork
> or anywhere else. What makes fork() so magically special?
>
> It's why we have that vma_start_write(), to say "I'm now modifying
> *this* vma, so stop accessing it in parallel".
>
> Because no, flush_cache_dup_mm() is not the magical reason to do that thing.
My understanding was that flush_cache_dup_mm() is there to ensure
nothing is in the cache, so locking VMAs before doing that would
ensure that no page faults would pollute the caches after we flushed
them. Is that reasoning incorrect?
>
> Maybe there is something else going on, but no, we don't write crazy
> code without a reason for it. That's completely unmaintainable,
> because people will look at that code, not understand it (because
> there is nothing to understand) and be afraid to touch it. For no
> actual reason.
>
> The obvious place to say "I'm now starting to modify the vma" is when
> you actually start to modify the vma.
>
> > Also, this change needs a couple more updates:
>
> Those updates seem sane, and come with explanations of why they exist.
> Looks fine to me.
>
> Suren, please send me the proper fixes. Not the voodoo one. The ones
> you can explain.
Ok, I think these two are non-controversial:
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]
and the question now is how we fix the fork() case:
https://lore.kernel.org/all/[email protected]/
(if my above explanation makes sense to you)
or
https://lore.kernel.org/all/[email protected]/
Please let me know which ones and I'll send you the patchset including
these patches.
Thanks,
Suren.
>
> And if stable wants to do something else, then that's fine. But for
> the development kernel,. we have two options:
>
> - fix the PER_VMA_LOCK code
>
> - decide that it's not worth it, and just revert it all
>
> and honestly, I'm ok with that second option, simply because this has
> all been way too much pain.
>
> But no, we don't mark it broken thinking we can't deal with it, or do
> random non-sensible code code we can't explain.
>
> Linus
On Sat, 8 Jul 2023 at 12:17, Suren Baghdasaryan <[email protected]> wrote:
>
> Do you want me to disable per-VMA locks by default as well?
No. I seriously believe that if the per-vma locking is so broken that
it needs to be disabled in a development kernel, we should just admit
failure, and revert it all.
And not in a "revert it for a later attempt" kind of way.
So it would be a "revert it because it added insurmountable problems
that we couldn't figure out" thing that implies *not* trying it again
in that form at all, and much soul-searching before somebody decides
that they have a more maintainable model for it all.
If stable decides that the fixes are not back-portable, and the whole
thing needs to be disabled for stable, that's one thing. But if we
decide that in mainline, it's a "this was a failure" thing.
Linus
On Sat, Jul 8, 2023 at 12:06 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, 8 Jul 2023 at 11:40, Suren Baghdasaryan <[email protected]> wrote:
> >
> > My understanding was that flush_cache_dup_mm() is there to ensure
> > nothing is in the cache, so locking VMAs before doing that would
> > ensure that no page faults would pollute the caches after we flushed
> > them. Is that reasoning incorrect?
>
> It is indeed incorrect.
>
> The VIVT caches are fundamentally broken, and we have various random
> hacks for them to make them work in legacy situations.
>
> And that flush_cache_dup_mm() is exactly that: a band-aid to make sure
> that when we do a fork(), any previous writes that are dirty in the
> caches will have made it to memory, so that they will show up in the
> *new* process that has a different virtual mapping.
>
> BUT!
>
> This has nothing to do with page faults, or other threads.
>
> If you have a threaded application that does fork(), it can - and will
> - dirty the VIVT caches *during* the fork, and so the whole
> "flush_cache_dup_mm()" is completely and fundamentally race wrt any
> *new* activity.
>
> It's not even what it is trying to deal with. All it tries to do is to
> make sure that the newly forked child AT LEAST sees all the changes
> that the parent did up to the point of the fork. Anything after that
> is simply not relevant at all.
>
> So think of all this not as some kind of absolute synchronization and
> cache coherency (because you will never get that on a VIVT
> architecture anyway), but as a "for the simple cases, this will at
> least get you the expected behavior".
>
> But as mentioned, for the issue of PER_VMA_LOCK, this is all *doubly*
> irrelevant. Not only was it not relevant to begin with (ie that cache
> flush only synchronizes parent -> child, not other-threads -> child),
> but VIVT caches don't even exist on any relevant architecture because
> they are fundamentally broken in so many other ways.
>
> So all our "synchronize caches by hand" is literally just band-aid for
> legacy architectures. I think it's mostly things like the old broken
> MIPS chips, some sparc32, pa-risc: the "old RISC" stuff, where people
> simplified the hardware a bit too much.
>
> VIVT is lovely for hardware people becasue they get a shortcut. But
> it's "lovely" in the same way that "PI=3" is lovely. It's simpler -
> but it's _wrong_.
>
> And it's almost entirely useless if you ever do SMP. I guarantee we
> have tons of races with it for very fundamental reasons - the problems
> it causes for software are not fixable, they are "hidable for the
> simple case".
>
> So you'll also find things like dcache_page_flush(), which flushes
> writes to a page to memory. And exactly like the fork() case, it's
> *not* real cache coherency, and it's *not* some kind of true global
> serialization.
>
> It's used in cases where we have a particular user that wants the
> changes *it* made to be made visible. And exactly like
> flush_cache_dup_mm(), it cannot deal with concurrent changes that
> other threads make.
Thanks for the explanation! It's quite educational.
>
> > Ok, I think these two are non-controversial:
> > https://lkml.kernel.org/r/[email protected]
> > https://lkml.kernel.org/r/[email protected]
>
> These look sane to me. I wonder if the vma_start_write() should have
> been somewhere else, but at least it makes sense in context, even if I
> get the feeling that maybe it should have been done in some helper
> earlier.
>
> As it is, we randomly do it in other helpers like vm_flags_set(), and
> I've often had the reaction that these vma_start_write() calls are
> randomly sprinked around without any clear _design_ for where they
> are.
We write-lock a VMA before any modification. I tried to provide an
explanation for each such locking site in my comments/patch
descriptions, but I guess I haven't done a good job at that...
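[One concrete instance of the rule Suren states: in v6.4 the vm_flags
helpers write-lock the VMA before touching the flags (lightly abridged
from include/linux/mm.h):
```c
static inline void vm_flags_set(struct vm_area_struct *vma,
				vm_flags_t flags)
{
	vma_start_write(vma);	/* exclude lockless faults on this VMA */
	ACCESS_PRIVATE(vma, __vm_flags) |= flags;
}
```
]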
>
> > and the question now is how we fix the fork() case:
> > https://lore.kernel.org/all/[email protected]/
> > (if my above explanation makes sense to you)
>
> See above. That patch is nonsensical. Trying to order
> flush_cache_dup_mm() is not about page faults, and is fundamentally
> not doable with threads anyway.
>
> > https://lore.kernel.org/all/[email protected]/
>
> This is the one that makes sense to me.
Ok, I sent you the 3-patch series with the fixes here:
https://lore.kernel.org/all/[email protected]/
Do you want me to disable per-VMA locks by default as well?
>
> Linus