LinuxLists.cc - [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 5/31/24 02:19, Byungchul Park wrote:
..
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..03683bf66031 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2872,6 +2872,12 @@ static inline void file_end_write(struct file *file)
> if (!S_ISREG(file_inode(file)->i_mode))
> return;
> sb_end_write(file_inode(file)->i_sb);
> +
> + /*
> + * XXX: If needed, can be optimized by avoiding luf_flush() if
> + * the address space of the file has never been involved by luf.
> + */
> + luf_flush();
> }
..
> +void luf_flush(void)
> +{
> + unsigned long flags;
> + unsigned short int ugen;
> +
> + /*
> + * Obtain the latest ugen number.
> + */
> + spin_lock_irqsave(&luf_lock, flags);
> + ugen = luf_gen;
> + spin_unlock_irqrestore(&luf_lock, flags);
> +
> + check_luf_flush(ugen);
> +}

Am I reading this right? There's now an unconditional global spinlock
acquired in the sys_write() path? How can this possibly scale?

So, yeah, I think an optimization is absolutely needed. But, on a more
fundamental level, I just don't believe these patches are being tested.
Even a simple microbenchmark should show a pretty nasty regression on
any decently large system:

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/write1.c

Second, I was just pointing out sys_write() as an example of how the
page cache could change. Couldn't a separate, read/write mmap() of the
file do the same thing and *not* go through sb_end_write()?

So:

fd = open("foo");
ptr1 = mmap(fd, PROT_READ);
ptr2 = mmap(fd, PROT_READ|PROT_WRITE);

foo = *ptr1; // populate the page cache
... page cache page is reclaimed and LUF'd
*ptr2 = bar; // new page cache page is allocated and written to

printk("*ptr1: %d\n", *ptr1);

Doesn't the printk() see stale data?

I think tglx would call all of this "tinkering". The approach to this
series is to "fix" narrow, specific cases that reviewers point out, make
it compile, then send it out again, hoping someone will apply it.

So, for me, until the approach to this series changes: NAK, for x86.
Andrew, please don't take this series. Or, if you do, please drop the
patch enabling it on x86.

I also have the feeling our VFS friends won't take kindly to having
random luf_foo() hooks in their hot paths, optimized or not. I don't
see any of them on cc.

2024-05-31 18:05:11

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

Dave Hansen <[email protected]> wrote:
>
> On 5/31/24 02:19, Byungchul Park wrote:
> ..
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 0283cf366c2a..03683bf66031 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2872,6 +2872,12 @@ static inline void file_end_write(struct file *file)
> > if (!S_ISREG(file_inode(file)->i_mode))
> > return;
> > sb_end_write(file_inode(file)->i_sb);
> > +
> > + /*
> > + * XXX: If needed, can be optimized by avoiding luf_flush() if
> > + * the address space of the file has never been involved by luf.
> > + */
> > + luf_flush();
> > }
> ..
> > +void luf_flush(void)
> > +{
> > + unsigned long flags;
> > + unsigned short int ugen;
> > +
> > + /*
> > + * Obtain the latest ugen number.
> > + */
> > + spin_lock_irqsave(&luf_lock, flags);
> > + ugen = luf_gen;
> > + spin_unlock_irqrestore(&luf_lock, flags);
> > +
> > + check_luf_flush(ugen);
> > +}
>
> Am I reading this right? There's now an unconditional global spinlock

It looked *too much* to split the lock to several locks as rcu does until
version 11. However, this code introduced in v11 looks problematic.

> acquired in the sys_write() path? How can this possibly scale?

I should find a better way.

> So, yeah, I think an optimization is absolutely needed. But, on a more
> fundamental level, I just don't believe these patches are being tested.
> Even a simple microbenchmark should show a pretty nasty regression on
> any decently large system:
>
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/write1.c
>
> Second, I was just pointing out sys_write() as an example of how the
> page cache could change. Couldn't a separate, read/write mmap() of the
> file do the same thing and *not* go through sb_end_write()?
>
> So:
>
> fd = open("foo");
> ptr1 = mmap(fd, PROT_READ);
> ptr2 = mmap(fd, PROT_READ|PROT_WRITE);
>
> foo = *ptr1; // populate the page cache
> ... page cache page is reclaimed and LUF'd
> *ptr2 = bar; // new page cache page is allocated and written to

I think this part would work but I'm not convinced. I will check again.

> printk("*ptr1: %d\n", *ptr1);
>
> Doesn't the printk() see stale data?
>
> I think tglx would call all of this "tinkering". The approach to this
> series is to "fix" narrow, specific cases that reviewers point out, make
> it compile, then send it out again, hoping someone will apply it.

Sorry for not perfect work and bothering you but you know what? I
can see what is happening in this community too. Of course, I bet
you would post better quality mm patches from the 1st version than
me but might not in other subsystems.

> So, for me, until the approach to this series changes: NAK, for x86.

I understand why you got mad and feel sorry but I couldn't expect
the regression you mentioned above. And I admit the patches have
had problems I couldn't find in advance until you, Hildenbrand and
Ying. I will do better.

> Andrew, please don't take this series. Or, if you do, please drop the
> patch enabling it on x86.

I don't want to ask to merge either, if there are still issues.

> I also have the feeling our VFS friends won't take kindly to having

That is also what I thought it was. What should I do then?
I don't believe you do not agree with the concept itself. Thing is
the current version is not good enough. I will do my best by doing
what I can do.

> random luf_foo() hooks in their hot paths, optimized or not. I don't
> see any of them on cc.

Yes. I should've cc'd them. I will.

Byungchul

2024-05-31 21:47:54

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 5/31/24 11:04, Byungchul Park wrote:
...
> I don't believe you do not agree with the concept itself. Thing is
> the current version is not good enough. I will do my best by doing
> what I can do.

More performance is good. I agree with that.

But it has to be weighed against the risk and the complexity. The more
I look at this approach, the more I think this is not a good trade off.
There's a lot of risk and a lot of complexity and we haven't seen the
full complexity picture. The gaps are being fixed by adding complexity
in new subsystems (the VFS in this case).

There are going to be winners and losers, and this version for example
makes file writes lose performance.

Just to be crystal clear: I disagree with the concept of leaving stale
TLB entries in place in an attempt to gain performance.

2024-05-31 22:10:15

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Fri, May 31, 2024 at 02:46:23PM -0700, Dave Hansen wrote:
> On 5/31/24 11:04, Byungchul Park wrote:
> ...
> > I don't believe you do not agree with the concept itself. Thing is
> > the current version is not good enough. I will do my best by doing
> > what I can do.
>
> More performance is good. I agree with that.
>
> But it has to be weighed against the risk and the complexity. The more
> I look at this approach, the more I think this is not a good trade off.
> There's a lot of risk and a lot of complexity and we haven't seen the
> full complexity picture. The gaps are being fixed by adding complexity
> in new subsystems (the VFS in this case).
>
> There are going to be winners and losers, and this version for example
> makes file writes lose performance.
>
> Just to be crystal clear: I disagree with the concept of leaving stale
> TLB entries in place in an attempt to gain performance.

FWIW, I agree with Dave. This feels insanely dangerous and I don't
think you're paranoid enough about things that can go wrong.

2024-06-01 02:20:30

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

Dave Hansen <[email protected]> wrote:
>
> On 5/31/24 11:04, Byungchul Park wrote:
> ...
> > I don't believe you do not agree with the concept itself. Thing is
> > the current version is not good enough. I will do my best by doing
> > what I can do.
>
> More performance is good. I agree with that.
>
> But it has to be weighed against the risk and the complexity. The more
> I look at this approach, the more I think this is not a good trade off.
> There's a lot of risk and a lot of complexity and we haven't seen the

All the complexity comes from the fact that I can't use a new space in
struct page - that can make the design even lockless.

I agree that keeping things simple is the best but I don't think all the
existing fields in struct page are the result of trying to make things
simple that you love. Some of them are more complicated.

I'd like to find a better way together instead of yelling "it's unworthy
cuz it's too complicated and there's too little space in mm world to
accommodate new things".

However, for the issues already discussed, I will think about it more
before the next spin.

Byungchul

> full complexity picture. The gaps are being fixed by adding complexity
> in new subsystems (the VFS in this case).
>
> There are going to be winners and losers, and this version for example
> makes file writes lose performance.
>
> Just to be crystal clear: I disagree with the concept of leaving stale
> TLB entries in place in an attempt to gain performance.
>

2024-06-01 07:22:32

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 31.05.24 23:46, Dave Hansen wrote:
> On 5/31/24 11:04, Byungchul Park wrote:
> ...
>> I don't believe you do not agree with the concept itself. Thing is
>> the current version is not good enough. I will do my best by doing
>> what I can do.
>
> More performance is good. I agree with that.
>
> But it has to be weighed against the risk and the complexity. The more
> I look at this approach, the more I think this is not a good trade off.
> There's a lot of risk and a lot of complexity and we haven't seen the
> full complexity picture. The gaps are being fixed by adding complexity
> in new subsystems (the VFS in this case).
>
> There are going to be winners and losers, and this version for example
> makes file writes lose performance.
>
> Just to be crystal clear: I disagree with the concept of leaving stale
> TLB entries in place in an attempt to gain performance.

There is the inherent problem that a CPU reading from such (unmapped but
not flushed yet) memory will not get a page fault, which I think is the
most controversial part here (besides interaction with other deferred
TLB flushing, and how this glues into the buddy).

What we used to do so far was limiting the timeframe where that could
happen, under well-controlled circumstances. On the common unmap/zap
path, we perform the batched TLB flush before any page faults / VMA
changes would have be possible and munmap() would have returned with
"succeess". Now that time frame could be significantly longer.

So in current code, at the point in time where we would process a page
fault, mmap()/munmap()/... the TLB would have been flushed already.

To "mimic" the old behavior, we'd essentially have to force any page
faults/mmap/whatsoever to perform the deferred flush such that the CPU
will see the "reality" again. Not sure how that could be done in a
*consistent* way (check whenever we take the mmap/vma lock etc ...) and
if there would still be a performance win.

--
Cheers,

David / dhildenb

2024-06-03 09:42:36

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Sat, Jun 01, 2024 at 09:22:17AM +0200, David Hildenbrand wrote:
> On 31.05.24 23:46, Dave Hansen wrote:
> > On 5/31/24 11:04, Byungchul Park wrote:
> > ...
> > > I don't believe you do not agree with the concept itself. Thing is
> > > the current version is not good enough. I will do my best by doing
> > > what I can do.
> >
> > More performance is good. I agree with that.
> >
> > But it has to be weighed against the risk and the complexity. The more
> > I look at this approach, the more I think this is not a good trade off.
> > There's a lot of risk and a lot of complexity and we haven't seen the
> > full complexity picture. The gaps are being fixed by adding complexity
> > in new subsystems (the VFS in this case).
> >
> > There are going to be winners and losers, and this version for example
> > makes file writes lose performance.
> >
> > Just to be crystal clear: I disagree with the concept of leaving stale
> > TLB entries in place in an attempt to gain performance.
>
> There is the inherent problem that a CPU reading from such (unmapped but not
> flushed yet) memory will not get a page fault, which I think is the most
> controversial part here (besides interaction with other deferred TLB
> flushing, and how this glues into the buddy).
>
> What we used to do so far was limiting the timeframe where that could
> happen, under well-controlled circumstances. On the common unmap/zap path,
> we perform the batched TLB flush before any page faults / VMA changes would
> have be possible and munmap() would have returned with "succeess". Now that
> time frame could be significantly longer.
>
> So in current code, at the point in time where we would process a page
> fault, mmap()/munmap()/... the TLB would have been flushed already.
>
> To "mimic" the old behavior, we'd essentially have to force any page
> faults/mmap/whatsoever to perform the deferred flush such that the CPU will
> see the "reality" again. Not sure how that could be done in a *consistent*

In luf's point of view, the points where the deferred flush should be
performed are simply:

1. when changing the vma maps, that might be luf'ed.
2. when updating data of the pages, that might be luf'ed.

All we need to do is to indentify the points:

1. when changing the vma maps, that might be luf'ed.

a) mmap and munmap e.i. fault handler or unmap_region().
b) permission to writable e.i. mprotect or fault handler.
c) what I'm missing.

2. when updating data of the pages, that might be luf'ed.

a) updating files through vfs e.g. file_end_write().
b) updating files through writable maps e.i. 1-a) or 1-b).
c) what I'm missing.

Some of them are already performing necessary tlb flush and the others
are not. luf has to handle the others, that I've been focusing on. Of
course, there might be what I'm missing tho.

Worth noting again, luf is working only on *migration* and *reclaim*
currently. Thing is when to stop the pending initiated from migration
or reclaim by luf.

Byungchul

> way (check whenever we take the mmap/vma lock etc ...) and if there would
> still be a performance win.
>
> --
> Cheers,
>
> David / dhildenb

2024-06-03 13:23:57

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 6/3/24 02:35, Byungchul Park wrote:
...> In luf's point of view, the points where the deferred flush should be
> performed are simply:
>
> 1. when changing the vma maps, that might be luf'ed.
> 2. when updating data of the pages, that might be luf'ed.

It's simple, but the devil is in the details as always.

> All we need to do is to indentify the points:
>
> 1. when changing the vma maps, that might be luf'ed.
>
> a) mmap and munmap e.i. fault handler or unmap_region().
> b) permission to writable e.i. mprotect or fault handler.
> c) what I'm missing.

I'd say it even more generally: anything that installs a PTE which is
inconsistent with the original PTE. That, of course, includes writes.
But it also includes crazy things that we do like uprobes. Take a look
at __replace_page().

I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
there. But it needs some really thorough review.

But the bigger concern is that, if there was a problem, I can't think of
a systematic way to find it.

> 2. when updating data of the pages, that might be luf'ed.
>
> a) updating files through vfs e.g. file_end_write().
> b) updating files through writable maps e.i. 1-a) or 1-b).
> c) what I'm missing.

Filesystems or block devices that change content without a "write" from
the local system. Network filesystems and block devices come to mind.
I honestly don't know what all the rules are around these, but they
could certainly be troublesome.

There appear to be some interactions for NFS between file locking and
page cache flushing.

But, stepping back ...

I'd honestly be a lot more comfortable if there was even a debugging LUF
mode that enforced a rule that said:

1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
2. A LUF'd page's position in the page cache can't be replaced until
after a luf_flush()

or *some* other independent set of rules that can tell us when something
goes wrong. That uprobes code, for instance, seems like it will work.
But I can also imagine writing it ten other ways where it would break
when combined with LUF.

2024-06-03 16:07:27

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 03.06.24 15:23, Dave Hansen wrote:
> On 6/3/24 02:35, Byungchul Park wrote:
> ...> In luf's point of view, the points where the deferred flush should be
>> performed are simply:
>>
>> 1. when changing the vma maps, that might be luf'ed.
>> 2. when updating data of the pages, that might be luf'ed.
>
> It's simple, but the devil is in the details as always.
>
>> All we need to do is to indentify the points:
>>
>> 1. when changing the vma maps, that might be luf'ed.
>>
>> a) mmap and munmap e.i. fault handler or unmap_region().
>> b) permission to writable e.i. mprotect or fault handler.
>> c) what I'm missing.
>
> I'd say it even more generally: anything that installs a PTE which is
> inconsistent with the original PTE. That, of course, includes writes.
> But it also includes crazy things that we do like uprobes. Take a look
> at __replace_page().
>
> I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> there. But it needs some really thorough review.
>
> But the bigger concern is that, if there was a problem, I can't think of
> a systematic way to find it.

Fully agreed!

>
>> 2. when updating data of the pages, that might be luf'ed.
>>
>> a) updating files through vfs e.g. file_end_write().
>> b) updating files through writable maps e.i. 1-a) or 1-b).
>> c) what I'm missing.
>
> Filesystems or block devices that change content without a "write" from
> the local system. Network filesystems and block devices come to mind.
> I honestly don't know what all the rules are around these, but they
> could certainly be troublesome.
>
> There appear to be some interactions for NFS between file locking and
> page cache flushing.
>
> But, stepping back ...
>
> I'd honestly be a lot more comfortable if there was even a debugging LUF
> mode that enforced a rule that said:
>
> 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs

I was playing with the idea of using a PTE marker. Then it's clear for
munmap/mremap/page faults that there is an outstanding flush required.
the alternative might be a VMA flag, but that's harder to actually
enforce an invariant.

> 2. A LUF'd page's position in the page cache can't be replaced until
> after a luf_flush()

That's the most tricky bit. I think these are the VFS concerns like

1) Page migration/reclaim ends up freeing the old page. TLB not flushed.
2) write() to the new page / write from other process to the new page
3) CPU reads stale content from old page

PTE markers can't handle that.

--
Cheers,

David / dhildenb

2024-06-03 16:37:59

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 6/3/24 09:05, David Hildenbrand wrote:
...
>> 2. A LUF'd page's position in the page cache can't be replaced until
>> after a luf_flush()
>
> That's the most tricky bit. I think these are the VFS concerns like
>
> 1) Page migration/reclaim ends up freeing the old page. TLB not flushed.
> 2) write() to the new page / write from other process to the new page
> 3) CPU reads stale content from old page
>
> PTE markers can't handle that.

Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
Presumably some xa_value() that means a reader has to go do a
luf_flush() before going any farther.

That would actually have a chance at fixing two issues: One where a new
page cache insertion is attempted. The other where someone goes to look
in the page cache and takes some action _because_ it is empty (I think
NFS is doing some of this for file locks).

LUF is also pretty fundamentally built on the idea that files can't
change without LUF being aware. That model seems to work decently for
normal old filesystems on normal old local block devices. I'm worried
about NFS, and I don't know how seriously folks take FUSE, but it
obviously can't work well for FUSE.

2024-06-03 17:01:49

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> Presumably some xa_value() that means a reader has to go do a
> luf_flush() before going any farther.

I can allocate one for that. We've got something like 1000 currently
unused values which can't be mistaken for anything else.

> That would actually have a chance at fixing two issues: One where a new
> page cache insertion is attempted. The other where someone goes to look
> in the page cache and takes some action _because_ it is empty (I think
> NFS is doing some of this for file locks).
>
> LUF is also pretty fundamentally built on the idea that files can't
> change without LUF being aware. That model seems to work decently for
> normal old filesystems on normal old local block devices. I'm worried
> about NFS, and I don't know how seriously folks take FUSE, but it
> obviously can't work well for FUSE.

I'm more concerned with:

- page goes back to buddy
- page is allocated to slab
- application reads through stale TLB entry and sees kernel memory

Or did that scenario get resolved?

2024-06-03 18:01:30

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 03.06.24 19:01, Matthew Wilcox wrote:
> On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
>> Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
>> Presumably some xa_value() that means a reader has to go do a
>> luf_flush() before going any farther.
>
> I can allocate one for that. We've got something like 1000 currently
> unused values which can't be mistaken for anything else.

I'm curious when to set that, though.

While migrating/reclaiming, when unmapping the folio from the page
tables, the folio is still valid in the page cache. So at the point in
time of unmapping from one process, we cannot simply replace the folio
in the page cache by some other value -- I think.

Maybe it's all easier than I think.

--
Cheers,

David / dhildenb

2024-06-04 00:35:07

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Mon, Jun 03, 2024 at 06:01:05PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> > Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> > Presumably some xa_value() that means a reader has to go do a
> > luf_flush() before going any farther.
>
> I can allocate one for that. We've got something like 1000 currently
> unused values which can't be mistaken for anything else.
>
> > That would actually have a chance at fixing two issues: One where a new
> > page cache insertion is attempted. The other where someone goes to look
> > in the page cache and takes some action _because_ it is empty (I think
> > NFS is doing some of this for file locks).
> >
> > LUF is also pretty fundamentally built on the idea that files can't
> > change without LUF being aware. That model seems to work decently for
> > normal old filesystems on normal old local block devices. I'm worried
> > about NFS, and I don't know how seriously folks take FUSE, but it
> > obviously can't work well for FUSE.
>
> I'm more concerned with:
>
> - page goes back to buddy
> - page is allocated to slab

At this point, tlb flush needed will be performed in prep_new_page().

> - application reads through stale TLB entry and sees kernel memory

No worry for this case.

Byungchul
>
> Or did that scenario get resolved?

2024-06-04 01:56:36

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

2024-06-04 04:44:07

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Tue, Jun 04, 2024 at 10:53:48AM +0900, Byungchul Park wrote:
> On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
> > On 6/3/24 02:35, Byungchul Park wrote:
> > ...> In luf's point of view, the points where the deferred flush should be
> > > performed are simply:
> > >
> > > 1. when changing the vma maps, that might be luf'ed.
> > > 2. when updating data of the pages, that might be luf'ed.
> >
> > It's simple, but the devil is in the details as always.
>
> Agree with that.
>
> > > All we need to do is to indentify the points:
> > >
> > > 1. when changing the vma maps, that might be luf'ed.
> > >
> > > a) mmap and munmap e.i. fault handler or unmap_region().
> > > b) permission to writable e.i. mprotect or fault handler.
> > > c) what I'm missing.
> >
> > I'd say it even more generally: anything that installs a PTE which is
> > inconsistent with the original PTE. That, of course, includes writes.
> > But it also includes crazy things that we do like uprobes. Take a look
> > at __replace_page().
> >
> > I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> > there. But it needs some really thorough review.
> >
> > But the bigger concern is that, if there was a problem, I can't think of
> > a systematic way to find it.
> >
> > > 2. when updating data of the pages, that might be luf'ed.
> > >
> > > a) updating files through vfs e.g. file_end_write().
> > > b) updating files through writable maps e.i. 1-a) or 1-b).
> > > c) what I'm missing.
> >
> > Filesystems or block devices that change content without a "write" from
> > the local system. Network filesystems and block devices come to mind.
>
> AFAIK, every network filesystem eventully "updates" its connected local
> filesystem. It could be still handled at the point where updating the
> local file system.
>
> > I honestly don't know what all the rules are around these, but they
> > could certainly be troublesome.
> >
> > There appear to be some interactions for NFS between file locking and
> > page cache flushing.
> >
> > But, stepping back ...
> >
> > I'd honestly be a lot more comfortable if there was even a debugging LUF
>
> I'd better provide a method for better debugging. Lemme know whatever
> it is we need.
>
> > mode that enforced a rule that said:

Do you means a debugging mode that can WARN or inform the situation that
we don't want? If yes, sure. Now that I get this, I will re-read all
you guys' talk.

Byungchul

> Why "debugging mode"? The following rules should be enforced always.
>
> > 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
>
> "luf_flush() should be followed when.." is more correct because
> "luf_flush() -> another luf -> the pte gets rewritten" can happen. So
> it should be "the pte gets rewritten -> another luf by any chance ->
> luf_flush()", that is still safe.
>
> > 2. A LUF'd page's position in the page cache can't be replaced until
> > after a luf_flush()
>
> "luf_flush() should be followed when.." is more correct too.
>
> These two rules are exactly same as what I described but more specific.
> I like your way to describe the rules.
>
> Byungchul
>
> > or *some* other independent set of rules that can tell us when something
> > goes wrong. That uprobes code, for instance, seems like it will work.
> > But I can also imagine writing it ten other ways where it would break
> > when combined with LUF.

2024-06-04 08:18:53

by Huang, Ying

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

David Hildenbrand <[email protected]> writes:

> On 03.06.24 19:01, Matthew Wilcox wrote:
>> On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
>>> Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
>>> Presumably some xa_value() that means a reader has to go do a
>>> luf_flush() before going any farther.
>> I can allocate one for that. We've got something like 1000
>> currently
>> unused values which can't be mistaken for anything else.
>
> I'm curious when to set that, though.
>
> While migrating/reclaiming, when unmapping the folio from the page
> tables, the folio is still valid in the page cache. So at the point in
> time of unmapping from one process, we cannot simply replace the folio
> in the page cache by some other value -- I think.
>
> Maybe it's all easier than I think.

IIUC, we need to held folio lock before replacing the folio in the page
cache. In page_cache_delete(), folio_test_locked() is checked. And, we
will lock the folio before writing to it via write syscall. So, it's
safe to defer TLB flushing until we unlock the folio.

--
Best Regards,
Huang, Ying

2024-06-06 08:33:44

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On 04.06.24 06:43, Byungchul Park wrote:
> On Tue, Jun 04, 2024 at 10:53:48AM +0900, Byungchul Park wrote:
>> On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
>>> On 6/3/24 02:35, Byungchul Park wrote:
>>> ...> In luf's point of view, the points where the deferred flush should be
>>>> performed are simply:
>>>>
>>>> 1. when changing the vma maps, that might be luf'ed.
>>>> 2. when updating data of the pages, that might be luf'ed.
>>>
>>> It's simple, but the devil is in the details as always.
>>
>> Agree with that.
>>
>>>> All we need to do is to indentify the points:
>>>>
>>>> 1. when changing the vma maps, that might be luf'ed.
>>>>
>>>> a) mmap and munmap e.i. fault handler or unmap_region().
>>>> b) permission to writable e.i. mprotect or fault handler.
>>>> c) what I'm missing.
>>>
>>> I'd say it even more generally: anything that installs a PTE which is
>>> inconsistent with the original PTE. That, of course, includes writes.
>>> But it also includes crazy things that we do like uprobes. Take a look
>>> at __replace_page().
>>>
>>> I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
>>> there. But it needs some really thorough review.
>>>
>>> But the bigger concern is that, if there was a problem, I can't think of
>>> a systematic way to find it.
>>>
>>>> 2. when updating data of the pages, that might be luf'ed.
>>>>
>>>> a) updating files through vfs e.g. file_end_write().
>>>> b) updating files through writable maps e.i. 1-a) or 1-b).
>>>> c) what I'm missing.
>>>
>>> Filesystems or block devices that change content without a "write" from
>>> the local system. Network filesystems and block devices come to mind.
>>
>> AFAIK, every network filesystem eventully "updates" its connected local
>> filesystem. It could be still handled at the point where updating the
>> local file system.
>>
>>> I honestly don't know what all the rules are around these, but they
>>> could certainly be troublesome.
>>>
>>> There appear to be some interactions for NFS between file locking and
>>> page cache flushing.
>>>
>>> But, stepping back ...
>>>
>>> I'd honestly be a lot more comfortable if there was even a debugging LUF
>>
>> I'd better provide a method for better debugging. Lemme know whatever
>> it is we need.
>>
>>> mode that enforced a rule that said:
>
> Do you means a debugging mode that can WARN or inform the situation that
> we don't want? If yes, sure. Now that I get this, I will re-read all
> you guys' talk.
>

In my opinion either for debugging, or for actually enforcing it at
runtime. Whatever the cost of that would be needs to be determined.

--
Cheers,

David / dhildenb

2024-06-10 13:24:34

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Tue 04-06-24 09:34:48, Byungchul Park wrote:
> On Mon, Jun 03, 2024 at 06:01:05PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> > > Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> > > Presumably some xa_value() that means a reader has to go do a
> > > luf_flush() before going any farther.
> >
> > I can allocate one for that. We've got something like 1000 currently
> > unused values which can't be mistaken for anything else.
> >
> > > That would actually have a chance at fixing two issues: One where a new
> > > page cache insertion is attempted. The other where someone goes to look
> > > in the page cache and takes some action _because_ it is empty (I think
> > > NFS is doing some of this for file locks).
> > >
> > > LUF is also pretty fundamentally built on the idea that files can't
> > > change without LUF being aware. That model seems to work decently for
> > > normal old filesystems on normal old local block devices. I'm worried
> > > about NFS, and I don't know how seriously folks take FUSE, but it
> > > obviously can't work well for FUSE.
> >
> > I'm more concerned with:
> >
> > - page goes back to buddy
> > - page is allocated to slab
>
> At this point, tlb flush needed will be performed in prep_new_page().

But that does mean that an unaware caller would get an additional
overhead of the flushing, right? I think it would be just a matter of
time before somebody can turn that into a side channel attack, not to
mention unexpected latencies introduced.
--
Michal Hocko
SUSE Labs

2024-06-11 01:10:49

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Mon, Jun 10, 2024 at 03:23:49PM +0200, Michal Hocko wrote:
> On Tue 04-06-24 09:34:48, Byungchul Park wrote:
> > On Mon, Jun 03, 2024 at 06:01:05PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> > > > Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> > > > Presumably some xa_value() that means a reader has to go do a
> > > > luf_flush() before going any farther.
> > >
> > > I can allocate one for that. We've got something like 1000 currently
> > > unused values which can't be mistaken for anything else.
> > >
> > > > That would actually have a chance at fixing two issues: One where a new
> > > > page cache insertion is attempted. The other where someone goes to look
> > > > in the page cache and takes some action _because_ it is empty (I think
> > > > NFS is doing some of this for file locks).
> > > >
> > > > LUF is also pretty fundamentally built on the idea that files can't
> > > > change without LUF being aware. That model seems to work decently for
> > > > normal old filesystems on normal old local block devices. I'm worried
> > > > about NFS, and I don't know how seriously folks take FUSE, but it
> > > > obviously can't work well for FUSE.
> > >
> > > I'm more concerned with:
> > >
> > > - page goes back to buddy
> > > - page is allocated to slab
> >
> > At this point, tlb flush needed will be performed in prep_new_page().
>
> But that does mean that an unaware caller would get an additional
> overhead of the flushing, right? I think it would be just a matter of

pcp for locality is already a better source of side channel attack. FYI,
tlb flush gets barely performed only if pending tlb flush exists.

> time before somebody can turn that into a side channel attack, not to
> mention unexpected latencies introduced.

Nope. The pending tlb flush performed in prep_new_page() is the one
that would've done already with the vanilla kernel. It's not additional
tlb flushes but it's subset of all the skipped ones.

It's worth noting all the existing mm reclaim mechaisms have already
introduced worse unexpected latencies.

Byungchul

> --
> Michal Hocko
> SUSE Labs

2024-06-11 09:12:35

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
> On 6/3/24 02:35, Byungchul Park wrote:
> ...> In luf's point of view, the points where the deferred flush should be
> > performed are simply:
> >
> > 1. when changing the vma maps, that might be luf'ed.
> > 2. when updating data of the pages, that might be luf'ed.
>
> It's simple, but the devil is in the details as always.
>
> > All we need to do is to indentify the points:
> >
> > 1. when changing the vma maps, that might be luf'ed.
> >
> > a) mmap and munmap e.i. fault handler or unmap_region().
> > b) permission to writable e.i. mprotect or fault handler.
> > c) what I'm missing.
>
> I'd say it even more generally: anything that installs a PTE which is
> inconsistent with the original PTE. That, of course, includes writes.
> But it also includes crazy things that we do like uprobes. Take a look
> at __replace_page().
>
> I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> there. But it needs some really thorough review.
>
> But the bigger concern is that, if there was a problem, I can't think of
> a systematic way to find it.
>
> > 2. when updating data of the pages, that might be luf'ed.
> >
> > a) updating files through vfs e.g. file_end_write().
> > b) updating files through writable maps e.i. 1-a) or 1-b).
> > c) what I'm missing.
>
> Filesystems or block devices that change content without a "write" from
> the local system. Network filesystems and block devices come to mind.
> I honestly don't know what all the rules are around these, but they
> could certainly be troublesome.
>
> There appear to be some interactions for NFS between file locking and
> page cache flushing.
>
> But, stepping back ...
>
> I'd honestly be a lot more comfortable if there was even a debugging LUF
> mode that enforced a rule that said:
>
> 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
> 2. A LUF'd page's position in the page cache can't be replaced until
> after a luf_flush()

I'm thinking a debug mode doing the following *pseudo* code - check the
logic only since the grammer might be wrong:

0-a) Introduce new fields in page_ext:

#ifdef LUF_DEBUG
struct list_head __percpu luf_node;
#endif

0-b) Introduce new fields in struct address_space:

#ifdef LUF_DEBUG
struct list_head __percpu luf_node;
#endif

0-c) Introduce new fields in struct task_struct:

#ifdef LUF_DEBUG
cpumask_t luf_pending_cpus;
#endif

0-d) Define percpu list_head to link luf'd folios:

#ifdef LUF_DEBUG
DEFINE_PER_CPU(struct list_head, luf_folios);
DEFINE_PER_CPU(struct list_head, luf_address_spaces);
#endif

1) When skipping tlb flush in reclaim or migration for a folio:

#ifdef LUF_DEBUG
ext = get_page_ext_for_luf_debug(folio);
as = folio_mapping(folio);

for_each_cpu(cpu, skip_cpus) {
list_add(per_cpu_ptr(ext->luf_node, cpu),
per_cpu_ptr(luf_folios, cpu));
if (as)
list_add(per_cpu_ptr(as->luf_node, cpu),
per_cpu_ptr(luf_address_spaces, cpu));
}
put_page_ext(ext);
#endif

2) When performing tlb flush in try_to_unmap_flush():
Remind luf only works on unmapping during reclaim and migration.

#ifdef LUF_DEBUG
for_each_cpu(cpu, now_flushing_cpus) {
for_each_node_safe(folio, per_cpu_ptr(luf_folios)) {
ext = get_page_ext_for_luf_debug(folio);
list_del_init(per_cpu_ptr(ext->luf_node, cpu))
put_page_ext(ext);
}

for_each_node_safe(as, per_cpu_ptr(luf_address_spaces))
list_del_init(per_cpu_ptr(as->luf_node, cpu))

cpumask_clear_cpu(cpu, current->luf_pending_cpus);
}
#endif

3) In pte_mkwrite():

#ifdef LUF_DEBUG
ext = get_page_ext_for_luf_debug(folio);

for_each_cpu(cpu, online_cpus)
if (!list_empty(per_cpu_ptr(ext->luf_node, cpu)))
cpumask_set_cpu(cpu, current->luf_pending_cpus);
put_page_ext(ext);
#endif

4) On returning to user:

#ifdef LUF_DEBUG
WARN_ON(!cpumask_empty(current->luf_pending_cpus));
#endif

5) On right after every a_ops->write_end() call:

#ifdef LUF_DEBUG
as = get_address_space_to_write_to();
for_each_cpu(cpu, online_cpus)
if (!list_empty(per_cpu_ptr(as->luf_node, cpu)))
cpumask_set_cpu(cpu, current->luf_pending_cpus);
#endif

luf_flush_or_its_optimized_version();

#ifdef LUF_DEBUG
WARN_ON(!cpumask_empty(current->luf_pending_cpus));
#endif

I will implement the debug mode this way with all serialized. Do you
think it works for what we want?

Byungchul

> or *some* other independent set of rules that can tell us when something
> goes wrong. That uprobes code, for instance, seems like it will work.
> But I can also imagine writing it ten other ways where it would break
> when combined with LUF.

2024-06-11 12:23:49

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Tue 11-06-24 09:55:23, Byungchul Park wrote:
> On Mon, Jun 10, 2024 at 03:23:49PM +0200, Michal Hocko wrote:
> > On Tue 04-06-24 09:34:48, Byungchul Park wrote:
> > > On Mon, Jun 03, 2024 at 06:01:05PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Jun 03, 2024 at 09:37:46AM -0700, Dave Hansen wrote:
> > > > > Yeah, we'd need some equivalent of a PTE marker, but for the page cache.
> > > > > Presumably some xa_value() that means a reader has to go do a
> > > > > luf_flush() before going any farther.
> > > >
> > > > I can allocate one for that. We've got something like 1000 currently
> > > > unused values which can't be mistaken for anything else.
> > > >
> > > > > That would actually have a chance at fixing two issues: One where a new
> > > > > page cache insertion is attempted. The other where someone goes to look
> > > > > in the page cache and takes some action _because_ it is empty (I think
> > > > > NFS is doing some of this for file locks).
> > > > >
> > > > > LUF is also pretty fundamentally built on the idea that files can't
> > > > > change without LUF being aware. That model seems to work decently for
> > > > > normal old filesystems on normal old local block devices. I'm worried
> > > > > about NFS, and I don't know how seriously folks take FUSE, but it
> > > > > obviously can't work well for FUSE.
> > > >
> > > > I'm more concerned with:
> > > >
> > > > - page goes back to buddy
> > > > - page is allocated to slab
> > >
> > > At this point, tlb flush needed will be performed in prep_new_page().
> >
> > But that does mean that an unaware caller would get an additional
> > overhead of the flushing, right? I think it would be just a matter of
>
> pcp for locality is already a better source of side channel attack. FYI,
> tlb flush gets barely performed only if pending tlb flush exists.

Right but rare and hard to predict latencies are much worse than
consistent once.

> > time before somebody can turn that into a side channel attack, not to
> > mention unexpected latencies introduced.
>
> Nope. The pending tlb flush performed in prep_new_page() is the one
> that would've done already with the vanilla kernel. It's not additional
> tlb flushes but it's subset of all the skipped ones.

But those skipped once could have happened in a completely different
context (e.g. a different process or even a diffrent security domain),
right?

> It's worth noting all the existing mm reclaim mechaisms have already
> introduced worse unexpected latencies.

Right, but a reclaim, especially direct reclaim, are expected to be
slow. It is much different to see spike latencies on system with a lot
of memory.
--
Michal Hocko
SUSE Labs

2024-06-14 01:58:11

[permalink] [raw]

Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped

On Tue, Jun 04, 2024 at 10:53:48AM +0900, Byungchul Park wrote:
> On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
> > On 6/3/24 02:35, Byungchul Park wrote:
> > ...> In luf's point of view, the points where the deferred flush should be
> > > performed are simply:
> > >
> > > 1. when changing the vma maps, that might be luf'ed.
> > > 2. when updating data of the pages, that might be luf'ed.
> >
> > It's simple, but the devil is in the details as always.
>
> Agree with that.
>
> > > All we need to do is to indentify the points:
> > >
> > > 1. when changing the vma maps, that might be luf'ed.
> > >
> > > a) mmap and munmap e.i. fault handler or unmap_region().
> > > b) permission to writable e.i. mprotect or fault handler.
> > > c) what I'm missing.
> >
> > I'd say it even more generally: anything that installs a PTE which is
> > inconsistent with the original PTE. That, of course, includes writes.
> > But it also includes crazy things that we do like uprobes. Take a look
> > at __replace_page().
> >
> > I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> > there. But it needs some really thorough review.
> >
> > But the bigger concern is that, if there was a problem, I can't think of
> > a systematic way to find it.
> >
> > > 2. when updating data of the pages, that might be luf'ed.
> > >
> > > a) updating files through vfs e.g. file_end_write().
> > > b) updating files through writable maps e.i. 1-a) or 1-b).
> > > c) what I'm missing.
> >
> > Filesystems or block devices that change content without a "write" from
> > the local system. Network filesystems and block devices come to mind.
>
> AFAIK, every network filesystem eventully "updates" its connected local
> filesystem. It could be still handled at the point where updating the
> local file system.

To cover client of network file systems and any using page cache, struct
address_space_operations's write_end() call sites seem to be the best
place to handle that. At the same time, of course, I should limit the
target of luf to 'folio_mapping(folio) != NULL' for file pages.

Byungchul

> > I honestly don't know what all the rules are around these, but they
> > could certainly be troublesome.
> >
> > There appear to be some interactions for NFS between file locking and
> > page cache flushing.
> >
> > But, stepping back ...
> >
> > I'd honestly be a lot more comfortable if there was even a debugging LUF
>
> I'd better provide a method for better debugging. Lemme know whatever
> it is we need.
>
> > mode that enforced a rule that said:
>
> Why "debugging mode"? The following rules should be enforced always.
>
> > 1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
>
> "luf_flush() should be followed when.." is more correct because
> "luf_flush() -> another luf -> the pte gets rewritten" can happen. So
> it should be "the pte gets rewritten -> another luf by any chance ->
> luf_flush()", that is still safe.
>
> > 2. A LUF'd page's position in the page cache can't be replaced until
> > after a luf_flush()
>
> "luf_flush() should be followed when.." is more correct too.
>
> These two rules are exactly same as what I described but more specific.
> I like your way to describe the rules.
>
> Byungchul
>
> > or *some* other independent set of rules that can tell us when something
> > goes wrong. That uprobes code, for instance, seems like it will work.
> > But I can also imagine writing it ten other ways where it would break
> > when combined with LUF.

2024-06-14 02:45:39