Received-SPF: pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Tue, 3 Jan 2023 20:05:10 +0100
From:   "Jason A. Donenfeld" <Jason@zx2c4.com>
To:     Andy Lutomirski <luto@kernel.org>
Cc:     Ingo Molnar <mingo@kernel.org>, linux-kernel@vger.kernel.org,
        patches@lists.linux.dev, tglx@linutronix.de,
        linux-crypto@vger.kernel.org, linux-api@vger.kernel.org,
        x86@kernel.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
        Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>,
        Carlos O'Donell <carlos@redhat.com>,
        Florian Weimer <fweimer@redhat.com>,
        Arnd Bergmann <arnd@arndb.de>, Jann Horn <jannh@google.com>,
        Christian Brauner <brauner@kernel.org>, linux-mm@kvack.org,
        Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v14 2/7] mm: add VM_DROPPABLE for designating always
 lazily freeable mappings
Message-ID: <Y7R8Zq6sIKAIprtr@zx2c4.com>
References: <20230101162910.710293-1-Jason@zx2c4.com>
 <20230101162910.710293-3-Jason@zx2c4.com>
 <Y7QIg/hAIk7eZE42@gmail.com>
 <CALCETrWdw5kxrtr4M7AkKYDOJEE1cU1wENWgmgOxn0rEJz4y3w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CALCETrWdw5kxrtr4M7AkKYDOJEE1cU1wENWgmgOxn0rEJz4y3w@mail.gmail.com>
Precedence: bulk

Hi Andy,

Thanks for your constructive suggestions.

On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > >    and no signal is sent. Instead, writes are simply lost.
> 
> This just seems massively overcomplicated to me.  If there isn't
> enough memory to fault in a page of code, we don't have some magic
> instruction emulator in the kernel.  We either OOM or we wait for
> memory to show up.

Before addressing the other parts of your email, I thought I'd touch on
this. Quoting from the email I just wrote Ingo:

| *However* - if your big objection to this patch is that the instruction
| skipping is problematic, we could actually punt that part. The result
| will be that userspace just retries the memory write and the fault
| happens again, and eventually it succeeds. From a perspective of
| vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
| immediately fallback to the syscall under memory pressure -- but you
| could argue that nobody really cares about performance at that point
| anyway, and so just retrying the fault until it succeeds is a less
| complex behavior that would be just fine.
| 
| Let me know if you think that'd be an acceptable compromise, and I'll
| roll it into v15. As a preview, it pretty much amounts to dropping 3/7
| and editing the commit message in this 2/7 patch.

IOW, I think the main ideas of the patch work just fine without "point
c" with the instruction skipping. Instead, waiting/retrying could
potentially work. So, okay, it seems like the two of you both hate the
instruction decoder stuff, so I'll plan on working that part in, in one
way or another, for v15.

> On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > >   * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > >   * Easy: VM_WIPEONFORK.
> 
> I have a rather different suggestion: make a special mapping.  Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm.  Instead,
> have an actual special mapping with callbacks that does the right
> thing.  No fancy VM flags.

Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
I had seen things that I thought looked in this direction with the shmem
API, but when I got into the details, it looked like this was meant for
something else and couldn't address most of what I wanted here.

If you say this is possible, I'll look again to see if I can figure it
out. Though, if you have some API name at the top of your head, you
might save me some code squinting time.

> Memory pressure: have it free and unmap it self.  Gets accessed again?
>  ->fault can handle it.

Right.

> Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
> so that, if it gets evicted, it's not written out anywhere.  And when
> it gets faulted back in it does the right thing -- see above.

Presumably mlock calls are redirected to some function pointer so I can
just return EINTR?

> Zero on fork?  I'm sure that's manageable with a special mapping.  If
> not, you can add a new vm operation or similar to make it work.  (Kind
> of like how we extended special mappings to get mremap right a couple
> years go.)  But maybe you don't want to *zero* it on fork and you want
> to do something more intelligent.  Fine -- you control ->fault!

Doing something more intelligent would be an interesting development, I
guess... But, before I think about that, all mapping have flags;
couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
API you have in mind not work that way? (Side note: I also want
VM_DONTDUMP to work.)

> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > >   page faulting in memory if none is available
> > >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > >   * Uh-oh: VM_NORESERVE means segfaults.
> 
> ->fault can do whatever you want.
> 
> And there is no shortage of user memory that *must* be made available
> on fault in order to resume the faulting process.  ->fault can handle
> this.

I'll look to see how other things handle this...

Anyway, thanks for the suggestion. That seems like a good future
direction for this.

Jason