2024-05-31 10:50:09

by Jann Horn

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> c) If there's not enough memory to service a page fault, it's not fatal.
[...]
> @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>
> lru_gen_exit_fault();
>
> + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> + if (vma->vm_flags & VM_DROPPABLE)
> + ret &= ~VM_FAULT_OOM;

Can you remind me how this is supposed to work? If we get an OOM
error, and the error is not fatal, does that mean we'll just keep
hitting the same fault handler over and over again (until we happen to
have memory available again I guess)?

Or is there something in this series that somehow redirects userspace
execution to getrandom() in that case?


> +
> if (flags & FAULT_FLAG_USER) {
> mem_cgroup_exit_user_fault();
> /*


2024-05-31 12:13:29

by Jason A. Donenfeld

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> > c) If there's not enough memory to service a page fault, it's not fatal.
> [...]
> > @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> >
> > lru_gen_exit_fault();
> >
> > + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > + if (vma->vm_flags & VM_DROPPABLE)
> > + ret &= ~VM_FAULT_OOM;
>
> Can you remind me how this is supposed to work? If we get an OOM
> error, and the error is not fatal, does that mean we'll just keep
> hitting the same fault handler over and over again (until we happen to
> have memory available again I guess)?

Right, it'll just keep retrying. I agree this isn't great, which is why
in the 2023 patchset, I had additional code to simply skip the faulting
instruction, and then the userspace code would notice the inconsistency
and fall back to the syscall. This worked pretty well. But it meant
decoding the instruction and in general skipping instructions is weird,
and that made this patchset very very contentious. Since the skipping
behavior isn't actually required by the /security goals/ of this, I
figured I'd just drop that. And maybe we can all revisit it together
sometime down the line. But for now I'm hoping for something a little
easier to swallow.

Jason

2024-05-31 13:01:14

by Jann Horn

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, May 31, 2024 at 2:13 PM Jason A. Donenfeld <[email protected]> wrote:
> On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> > On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> > > c) If there's not enough memory to service a page fault, it's not fatal.
> > [...]
> > > @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > >
> > > lru_gen_exit_fault();
> > >
> > > + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > > + if (vma->vm_flags & VM_DROPPABLE)
> > > + ret &= ~VM_FAULT_OOM;
> >
> > Can you remind me how this is supposed to work? If we get an OOM
> > error, and the error is not fatal, does that mean we'll just keep
> > hitting the same fault handler over and over again (until we happen to
> > have memory available again I guess)?
>
> Right, it'll just keep retrying. I agree this isn't great, which is why
> in the 2023 patchset, I had additional code to simply skip the faulting
> instruction, and then the userspace code would notice the inconsistency
> and fall back to the syscall. This worked pretty well. But it meant
> decoding the instruction and in general skipping instructions is weird,
> and that made this patchset very very contentious. Since the skipping
> behavior isn't actually required by the /security goals/ of this, I
> figured I'd just drop that. And maybe we can all revisit it together
> sometime down the line. But for now I'm hoping for something a little
> easier to swallow.

In that case, since we need to be able to populate this memory to make
forward progress, would it make sense to remove the parts of the patch
that treat the allocation as if it was allowed to silently fail (the
"__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
think that would also simplify this a bit by making this type of
memory a little less special.

2024-06-07 14:35:22

by Jason A. Donenfeld

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, May 31, 2024 at 03:00:26PM +0200, Jann Horn wrote:
> On Fri, May 31, 2024 at 2:13 PM Jason A. Donenfeld <[email protected]> wrote:
> > On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> > > On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> > > > c) If there's not enough memory to service a page fault, it's not fatal.
> > > [...]
> > > > @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > > >
> > > > lru_gen_exit_fault();
> > > >
> > > > + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > > > + if (vma->vm_flags & VM_DROPPABLE)
> > > > + ret &= ~VM_FAULT_OOM;
> > >
> > > Can you remind me how this is supposed to work? If we get an OOM
> > > error, and the error is not fatal, does that mean we'll just keep
> > > hitting the same fault handler over and over again (until we happen to
> > > have memory available again I guess)?
> >
> > Right, it'll just keep retrying. I agree this isn't great, which is why
> > in the 2023 patchset, I had additional code to simply skip the faulting
> > instruction, and then the userspace code would notice the inconsistency
> > and fall back to the syscall. This worked pretty well. But it meant
> > decoding the instruction and in general skipping instructions is weird,
> > and that made this patchset very very contentious. Since the skipping
> > behavior isn't actually required by the /security goals/ of this, I
> > figured I'd just drop that. And maybe we can all revisit it together
> > sometime down the line. But for now I'm hoping for something a little
> > easier to swallow.
>
> In that case, since we need to be able to populate this memory to make
> forward progress, would it make sense to remove the parts of the patch
> that treat the allocation as if it was allowed to silently fail (the
> "__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
> think that would also simplify this a bit by making this type of
> memory a little less special.

The whole point, though, is that it needs to not fail or warn. It's
memory that can be dropped/zeroed at any moment, and the code is
deliberately robust to that.

Jason

2024-06-07 15:13:38

by Jann Horn

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, Jun 7, 2024 at 4:35 PM Jason A. Donenfeld <[email protected]> wrote:
> On Fri, May 31, 2024 at 03:00:26PM +0200, Jann Horn wrote:
> > On Fri, May 31, 2024 at 2:13 PM Jason A. Donenfeld <[email protected]> wrote:
> > > On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> > > > On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> > > > > c) If there's not enough memory to service a page fault, it's not fatal.
> > > > [...]
> > > > > @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > > > >
> > > > > lru_gen_exit_fault();
> > > > >
> > > > > + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > > > > + if (vma->vm_flags & VM_DROPPABLE)
> > > > > + ret &= ~VM_FAULT_OOM;
> > > >
> > > > Can you remind me how this is supposed to work? If we get an OOM
> > > > error, and the error is not fatal, does that mean we'll just keep
> > > > hitting the same fault handler over and over again (until we happen to
> > > > have memory available again I guess)?
> > >
> > > Right, it'll just keep retrying. I agree this isn't great, which is why
> > > in the 2023 patchset, I had additional code to simply skip the faulting
> > > instruction, and then the userspace code would notice the inconsistency
> > > and fall back to the syscall. This worked pretty well. But it meant
> > > decoding the instruction and in general skipping instructions is weird,
> > > and that made this patchset very very contentious. Since the skipping
> > > behavior isn't actually required by the /security goals/ of this, I
> > > figured I'd just drop that. And maybe we can all revisit it together
> > > sometime down the line. But for now I'm hoping for something a little
> > > easier to swallow.
> >
> > In that case, since we need to be able to populate this memory to make
> > forward progress, would it make sense to remove the parts of the patch
> > that treat the allocation as if it was allowed to silently fail (the
> > "__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
> > think that would also simplify this a bit by making this type of
> > memory a little less special.
>
> The whole point, though, is that it needs to not fail or warn. It's
> memory that can be dropped/zeroed at any moment, and the code is
> deliberately robust to that.

Sure - but does it have to be more robust than accessing a newly
allocated piece of memory [which hasn't been populated with anonymous
pages yet] or bringing a swapped-out page back from swap?

I'm not an expert on OOM handling, but my understanding is that the
kernel tries _really_ hard to avoid failing low-order GFP_KERNEL
allocations, with the help of the OOM killer. My understanding is that
those allocations basically can't fail with a NULL return unless the
process has already been killed or it is in a memcg_kmem cgroup that
contains only processes that have been marked as exempt from OOM
killing. (Or if you're using error injection to explicitly tell the
kernel to fail the allocation.)
My understanding is that normal outcomes of an out-of-memory situation
are things like the OOM killer killing processes (including
potentially the calling one) to free up memory, or the OOM killer
panic()ing the whole system as a last resort; but getting a NULL
return from page_alloc(GFP_KERNEL) without getting killed is not one
of those outcomes.

2024-06-07 15:51:21

by Jann Horn

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri, Jun 7, 2024 at 5:12 PM Jann Horn <[email protected]> wrote:
> On Fri, Jun 7, 2024 at 4:35 PM Jason A. Donenfeld <[email protected]> wrote:
> > On Fri, May 31, 2024 at 03:00:26PM +0200, Jann Horn wrote:
> > > On Fri, May 31, 2024 at 2:13 PM Jason A. Donenfeld <[email protected]> wrote:
> > > > On Fri, May 31, 2024 at 12:48:58PM +0200, Jann Horn wrote:
> > > > > On Tue, May 28, 2024 at 2:24 PM Jason A. Donenfeld <[email protected]> wrote:
> > > > > > c) If there's not enough memory to service a page fault, it's not fatal.
> > > > > [...]
> > > > > > @@ -5689,6 +5689,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > > > > >
> > > > > > lru_gen_exit_fault();
> > > > > >
> > > > > > + /* If the mapping is droppable, then errors due to OOM aren't fatal. */
> > > > > > + if (vma->vm_flags & VM_DROPPABLE)
> > > > > > + ret &= ~VM_FAULT_OOM;
> > > > >
> > > > > Can you remind me how this is supposed to work? If we get an OOM
> > > > > error, and the error is not fatal, does that mean we'll just keep
> > > > > hitting the same fault handler over and over again (until we happen to
> > > > > have memory available again I guess)?
> > > >
> > > > Right, it'll just keep retrying. I agree this isn't great, which is why
> > > > in the 2023 patchset, I had additional code to simply skip the faulting
> > > > instruction, and then the userspace code would notice the inconsistency
> > > > and fall back to the syscall. This worked pretty well. But it meant
> > > > decoding the instruction and in general skipping instructions is weird,
> > > > and that made this patchset very very contentious. Since the skipping
> > > > behavior isn't actually required by the /security goals/ of this, I
> > > > figured I'd just drop that. And maybe we can all revisit it together
> > > > sometime down the line. But for now I'm hoping for something a little
> > > > easier to swallow.
> > >
> > > In that case, since we need to be able to populate this memory to make
> > > forward progress, would it make sense to remove the parts of the patch
> > > that treat the allocation as if it was allowed to silently fail (the
> > > "__GFP_NOWARN | __GFP_NORETRY" and the "ret &= ~VM_FAULT_OOM")? I
> > > think that would also simplify this a bit by making this type of
> > > memory a little less special.
> >
> > The whole point, though, is that it needs to not fail or warn. It's
> > memory that can be dropped/zeroed at any moment, and the code is
> > deliberately robust to that.
>
> Sure - but does it have to be more robust than accessing a newly
> allocated piece of memory [which hasn't been populated with anonymous
> pages yet] or bringing a swapped-out page back from swap?
>
> I'm not an expert on OOM handling, but my understanding is that the
> kernel tries _really_ hard to avoid failing low-order GFP_KERNEL
> allocations, with the help of the OOM killer. My understanding is that
> those allocations basically can't fail with a NULL return unless the
> process has already been killed or it is in a memcg_kmem cgroup that
> contains only processes that have been marked as exempt from OOM
> killing. (Or if you're using error injection to explicitly tell the
> kernel to fail the allocation.)
> My understanding is that normal outcomes of an out-of-memory situation
> are things like the OOM killer killing processes (including
> potentially the calling one) to free up memory, or the OOM killer
> panic()ing the whole system as a last resort; but getting a NULL
> return from page_alloc(GFP_KERNEL) without getting killed is not one
> of those outcomes.

Or, from a different angle: You're trying to allocate memory, and you
can't make forward progress until that memory has been allocated
(unless the process is killed). That's what GFP_KERNEL is for. Stuff
like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
that lets you make progress (perhaps in a slightly less efficient way,
or by dropping some incoming data, or something like that), and it
hints to the page allocator that it doesn't have to try hard to
reclaim memory if it can't find free memory quickly.

2024-06-10 12:00:33

by Michal Hocko

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Fri 07-06-24 17:50:34, Jann Horn wrote:
[...]
> Or, from a different angle: You're trying to allocate memory, and you
> can't make forward progress until that memory has been allocated
> (unless the process is killed). That's what GFP_KERNEL is for. Stuff
> like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
> that lets you make progress (perhaps in a slightly less efficient way,
> or by dropping some incoming data, or something like that), and it
> hints to the page allocator that it doesn't have to try hard to
> reclaim memory if it can't find free memory quickly.

Correct. A pseudo-busy wait for allocation to succeed sounds like a very
bad idea to imprint into ABI. Is there really any design requirement
that these mappings never trigger the OOM killer?

Making the content droppable under memory pressure because it is
inherently recoverable is something else (this is essentially an
implicit MADV_FREE semantic) but putting a requirement on the memory
allocation on the fault sounds just wrong to me.

--
Michal Hocko
SUSE Labs

2024-06-14 18:35:55

by Jason A. Donenfeld

Subject: Re: [PATCH v16 1/5] mm: add VM_DROPPABLE for designating always lazily freeable mappings

On Mon, Jun 10, 2024 at 02:00:21PM +0200, Michal Hocko wrote:
> On Fri 07-06-24 17:50:34, Jann Horn wrote:
> [...]
> > Or, from a different angle: You're trying to allocate memory, and you
> > can't make forward progress until that memory has been allocated
> > (unless the process is killed). That's what GFP_KERNEL is for. Stuff
> > like "__GFP_NOWARN | __GFP_NORETRY" is for when you have a backup plan
> > that lets you make progress (perhaps in a slightly less efficient way,
> > or by dropping some incoming data, or something like that), and it
> > hints to the page allocator that it doesn't have to try hard to
> > reclaim memory if it can't find free memory quickly.
>
> Correct. A pseudo-busy wait for allocation to succeed sounds like a very
> bad idea to imprint into ABI. Is there really any design requirement
> that these mappings never trigger the OOM killer?
>
> Making the content droppable under memory pressure because it is
> inherently recoverable is something else (this is essentially an
> implicit MADV_FREE semantic) but putting a requirement on the memory
> allocation on the fault sounds just wrong to me.

The idea is that the getrandom() syscall won't get a process killed, so
neither should vgetrandom().

But there's an argument to be made that the NOWARN|NORETRY logic only
made sense with the now-dropped "skip instruction on fault" patch that
was so controversial before, since in that case, there wouldn't be
infinite retry, but rather skipping and then falling back to the
syscall. I think this is nicer behavior, but the implementation caused a
stir, so I'm not at the moment going that route. Given that, I think
I'll follow your advice and get rid of NOWARN|NORETRY for this too. And
then maybe we'll all revisit that later.

Jason