2023-02-21 21:49:41

by Matthew Wilcox

Subject: What size anonymous folios should we allocate?

In a sense this question is premature, because we don't have any code
in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
but let's pretend that code already exists and is just waiting for us
to answer this policy question.

I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
option and 3. a sysfs tunable. It is foolish to expect the distro
packager or the sysadmin to be able to make such a decision. The
correct decision will depend upon the instantaneous workload of the
entire machine and we'll want different answers for different VMAs.

I'm open to applications having some kind of madvise() call they can
use to specify hints, but I would prefer to handle memory efficiently
for applications which do not.

For pagecache memory, we use the per-fd readahead code; if readahead has
been successful in the past we bump up the folio size until it reaches
its maximum. There is no equivalent for anonymous memory.
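
For comparison, the readahead-style ramp-up amounts to something like the
toy loop below.  The step size and the cap are placeholders, not the
actual readahead implementation; it is only here to show the shape of the
feedback we don't have for anonymous memory.

/* Toy sketch of a "grow the order while it keeps paying off" policy.
 * Purely illustrative; the real readahead code lives in mm/readahead.c
 * and differs in detail.
 */
#include <stdio.h>

#define MAX_ORDER_HINT  9       /* e.g. PMD order with 4K base pages */

struct ra_state {
        unsigned int order;     /* folio order to use for the next batch */
};

static void ra_hit(struct ra_state *ra)
{
        /* Readahead paid off: bump the order (here by 2) up to the cap. */
        if (ra->order + 2 <= MAX_ORDER_HINT)
                ra->order += 2;
        else
                ra->order = MAX_ORDER_HINT;
}

static void ra_miss(struct ra_state *ra)
{
        /* Readahead was wasted: drop back to small folios. */
        ra->order = 0;
}

int main(void)
{
        struct ra_state ra = { .order = 0 };

        for (int i = 0; i < 6; i++) {
                ra_hit(&ra);
                printf("after hit %d: order %u\n", i + 1, ra.order);
        }
        ra_miss(&ra);
        printf("after miss:  order %u\n", ra.order);
        return 0;
}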

I'm working my way towards a solution that looks a little like this:

A. We modify khugepaged to quadruple the folio size each time it scans.
At the moment, it always attempts to promote straight from order 0
to PMD size. Instead, if it finds four adjacent order-0 folios,
it will allocate an order-2 folio to replace them. Next time it
scans, it finds four order-2 folios and replaces them with a single
order-4 folio. And so on, up to PMD order.

B. A further modification is that it will require three of the four
folios being combined to be on the active list. If two (or more)
of the four folios are inactive, we should leave them alone; either
they will remain inactive and eventually be evicted, or they will be
activated and eligible for merging in a future pass of khugepaged.

C. We add a new wrinkle to the LRU handling code. When our scan of the
active list examines a folio, we look to see how many of the PTEs
mapping the folio have been accessed. If it is fewer than half, and
those half are all in either the first or last half of the folio, we
split it. The active half stays on the active list and the inactive
half is moved to the inactive list.
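
(Purely to make A-C concrete, here is a compile-anywhere toy model of the
two decisions.  The structure and helper names are invented for
illustration; they don't correspond to real khugepaged or LRU code.)

/* Toy model of the merge (A+B) and split (C) heuristics above. */
#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
        unsigned int order;     /* folio covers 2^order base pages */
        bool active;            /* currently on the active LRU list? */
        unsigned long accessed; /* bit i set: PTE i has been accessed */
};

/* A + B: merge four adjacent equal-order folios into one folio two
 * orders larger, but only if at least three of the four are active. */
static bool should_merge(const struct toy_folio f[4])
{
        int active = 0;

        for (int i = 0; i < 4; i++)
                active += f[i].active;
        return active >= 3;
}

/* C: if fewer than half of the PTEs were accessed, and all of the
 * accessed ones sit in the first or the last half of the folio, split. */
static bool should_split(const struct toy_folio *f)
{
        unsigned int nr = 1u << f->order;
        unsigned int half = nr / 2;
        unsigned int ref = 0, ref_lo = 0;

        for (unsigned int i = 0; i < nr; i++) {
                if (f->accessed & (1ul << i)) {
                        ref++;
                        if (i < half)
                                ref_lo++;
                }
        }
        if (ref >= half)
                return false;
        return ref_lo == ref || ref_lo == 0;
}

int main(void)
{
        struct toy_folio four[4] = {
                { .order = 2, .active = true  },
                { .order = 2, .active = true  },
                { .order = 2, .active = true  },
                { .order = 2, .active = false },
        };
        /* order-4 folio: 3 of 16 PTEs accessed, all in the first half */
        struct toy_folio big = { .order = 4, .accessed = 0x7 };

        printf("merge 4x order-2 into order-4? %d\n", should_merge(four));
        printf("split the order-4 folio?       %d\n", should_split(&big));
        return 0;
}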

I feel that these three changes should allow us to iterate towards a
solution for any given VMA that is close to optimal, and adapts to a
changing workload with no intervention from a sysadmin, or even hint
from a program.

There are three different circumstances where we currently allocate
anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
anonymous mapping.

For the first option, the only hint we have is the size of the VMA.
I'm tempted to suggest our initial guess at the right size folio to
allocate should be scaled to that, although I don't have a clear idea
about what the scale factor should be.
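
As a strawman for that scaling, something like the toy calculation below,
where the divisor is a completely made-up placeholder and only the clamp
to PMD order is a firm bound:

/* Toy: pick an initial anon folio order as a fraction of the VMA size. */
#include <stdio.h>

#define PAGE_SHIFT      12
#define PMD_ORDER       9       /* 2MB PMD with 4K base pages */

static unsigned int initial_anon_order(unsigned long vma_size)
{
        unsigned long target = vma_size / 64;   /* hypothetical scale factor */
        unsigned int order = 0;

        /* Largest order whose folio still fits within the target size. */
        while (((1ul << (order + 1)) << PAGE_SHIFT) <= target &&
               order + 1 <= PMD_ORDER)
                order++;
        return order;
}

int main(void)
{
        unsigned long sizes[] = { 64ul << 10, 1ul << 20, 32ul << 20, 1ul << 30 };

        for (int i = 0; i < 4; i++)
                printf("VMA of %7lu KB -> initial order %u\n",
                       sizes[i] >> 10, initial_anon_order(sizes[i]));
        return 0;
}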

For the second case, I want to strongly suggest that the size of the
folio allocated by the page cache should be of no concern. It is largely
irrelevant to the application's usage pattern what size the page cache
has chosen to cache the file. I might start out very conservatively
here with an order-0 allocation.

For the third case, in contrast, the parent had already established
an appropriate size folio to use for this VMA before calling fork().
Whether it is the parent or the child causing the COW, it should probably
inherit that choice and we should default to the same size folio that
was already found.


I don't stay current with the research literature, so if someone wants
to point me to a well-studied algorithm and let me know that I can stop
thinking about this, that'd be great. And if anyone wants to start
working on implementing this, that'd also be great.

P.S. I didn't want to interrupt the flow of the above description to
note that allocation of any high-order folio can and will fail, so
there will definitely be fallback points to order-0 folios, which will
be no different from today. Except that maybe we'll be able to iterate
towards the correct folio size in the new khugepaged.

P.P.S. I still consider myself a bit of a novice in the handling of
anonymous memory, so don't be shy to let me know what I got wrong.


2023-02-21 23:05:53

by Yang Shi

Subject: Re: What size anonymous folios should we allocate?

On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox <[email protected]> wrote:
>
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable. It is foolish to expect the distro
> packager or the sysadmin to be able to make such a decision. The
> correct decision will depend upon the instantaneous workload of the
> entire machine and we'll want different answers for different VMAs.

Yeah, I agree those 3 options should be avoided. For some
architectures, there are one or more sweet-spot sizes that benefit from
the hardware. For example, ARM64's contiguous PTE support allows up to
16 consecutive 4K pages to form a single 64K TLB entry instead of 16 4K
entries. Some implementations may support intermediate sizes (for
example, 8K, 16K and 32K, but this may make the hardware design
harder), but some may not. AMD's PTE coalescing supports a different
size (128K if I remember correctly). So a multiple of the size
supported by hardware (64K or 128K) seems like the common ground from
the point of view of maximizing hardware benefit. Of course, nothing
prevents the kernel from allocating other orders.

ARM even supports contiguous PMDs, but those would be too big to
allocate from the buddy allocator.
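
For reference, translating those hardware-friendly sizes into folio
orders with a 4K base page (toy arithmetic only; as said, the 128K AMD
figure is from memory and may be off):

/* Size -> order for a 4K base page. */
#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
        unsigned long sizes[] = { 16ul << 10, 32ul << 10, 64ul << 10, 128ul << 10 };

        for (int i = 0; i < 4; i++) {
                unsigned long pages = sizes[i] >> PAGE_SHIFT;
                unsigned int order = 0;

                while ((1ul << order) < pages)
                        order++;
                printf("%3lu KB = %3lu pages = order %u\n",
                       sizes[i] >> 10, pages, order);
        }
        return 0;
}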

>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.
>
> For pagecache memory, we use the per-fd readahead code; if readahead has
> been successful in the past we bump up the folio size until it reaches
> its maximum. There is no equivalent for anonymous memory.

Yes, the kernel can't tell, although userspace may experience fewer
TLB misses. Anyway, it is not an indicator the kernel could use
to make a decision.

>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it scans.
> At the moment, it always attempts to promote straight from order 0
> to PMD size. Instead, if it finds four adjacent order-0 folios,
> it will allocate an order-2 folio to replace them. Next time it
> scans, it finds four order-2 folios and replaces them with a single
> order-4 folio. And so on, up to PMD order.

Actually I was thinking about the reverse: starting from the biggest
possible order, for example, 2M -> 1M -> ... 64K -> ... 4K. And the
page fault path should be able to use the same fallback order. But
excessive fallback attempts may be harmful too.
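
Something like the toy loop below is what I have in mind for the fault
path; try_alloc_folio() is just a stand-in for whatever allocation
primitive would be used, not a real interface:

/* Toy high-to-low fallback: try the biggest order first, step down on
 * failure.  The "fails above order 3" rule just simulates fragmentation. */
#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER 9

static bool try_alloc_folio(unsigned int order)
{
        return order <= 3;      /* pretend higher orders fail right now */
}

static unsigned int alloc_anon_with_fallback(unsigned int max_order)
{
        /* Each failed attempt adds latency, which is the worry about
         * excessive fallback tries. */
        for (unsigned int order = max_order; order > 0; order--)
                if (try_alloc_folio(order))
                        return order;
        return 0;       /* order 0 is the final fallback */
}

int main(void)
{
        printf("allocated order %u\n", alloc_anon_with_fallback(PMD_ORDER));
        return 0;
}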

>
> B. A further modification is that it will require three of the four
> folios being combined to be on the active list. If two (or more)
> of the four folios are inactive, we should leave them alone; either
> they will remain inactive and eventually be evicted, or they will be
> activated and eligible for merging in a future pass of khugepaged.

If we use the fallback policy, we should be able to just leave it to
reclamation time. When checking references we could tell which PTEs have
been accessed, then split if there is significant internal fragmentation.

>
> C. We add a new wrinkle to the LRU handling code. When our scan of the
> active list examines a folio, we look to see how many of the PTEs
> mapping the folio have been accessed. If it is fewer than half, and
> those half are all in either the first or last half of the folio, we
> split it. The active half stays on the active list and the inactive
> half is moved to the inactive list.

With contiguous PTEs, every PTE still maintains its own access bit (but
it is implementation defined; some implementations may set the access
bit on just one PTE in the contiguous region, per the Arm ARM, IIUC). But
anyway this is definitely feasible.

>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.

Yes, I agree.

>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
> COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
> anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear idea
> about what the scale factor should be.
>
> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is largely
> irrelevant to the application's usage pattern what size the page cache
> has chosen to cache the file. I might start out very conservatively
> here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should probably
> inherit that choice and we should default to the same size folio that
> was already found.

Actually this is not what THP does now. The current THP behavior is to
split the PMD and then fall back to an order-0 page fault. For smaller
orders, we may consider allocating a large folio.

>
>
> I don't stay current with the research literature, so if someone wants
> to point me to a well-studied algorithm and let me know that I can stop
> thinking about this, that'd be great. And if anyone wants to start
> working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which will
> be no different from today. Except that maybe we'll be able to iterate
> towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.
>

2023-02-22 03:09:17

by Kent Overstreet

Subject: Re: What size anonymous folios should we allocate?

On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox <[email protected]> wrote:
> >
> > In a sense this question is premature, because we don't have any code
> > in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> > but let's pretend that code already exists and is just waiting for us
> > to answer this policy question.
> >
> > I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> > option and 3. a sysfs tunable. It is foolish to expect the distro
> > packager or the sysadmin to be able to make such a decision. The
> > correct decision will depend upon the instantaneous workload of the
> > entire machine and we'll want different answers for different VMAs.
>
> Yeah, I agree those 3 options should be avoided. For some
> architectures, there are a or multiple sweet size(s) benefiting from
> hardware. For example, ARM64 contiguous PTE supports up to 16
> consecutive 4K pages to form a 64K entry in TLB instead of 16 4K
> entries. Some implementations may support intermediate sizes (for
> example, 8K, 16K and 32K, but this may make the hardware design
> harder), but some may not. AMD's coalesce PTE supports a different
> size (128K if I remember correctly). So the multiple of the size
> supported by hardware (64K or 128K) seems like the common ground from
> maximizing hardware benefit point of view. Of course, nothing prevents
> the kernel from allocating other orders.
>
> ARM even supports contiguous PMD, but that would be too big to
> allocate by buddy allocator.

Every time this discussion comes up it seems like MM people have a major
blind spot, where they're only thinking about PTE lookups and TLB
overhead and forgetting every other codepath in the kernel that deals
with cached data - historically one physical page at a time.

By framing the discussion in terms of what's best for the hardware,
you're screwing over all the pure software codepaths. This stupidity has
gone on for long enough with the ridiculous normalpage/hugepage split,
let's not continue it.

Talk to any filesystem person, you don't want to fragment data
unnecessarily. That's effectively what you're advocating for, by
continuing to talk about hardware page sizes.

You need to get away from designing things around hardware limitations
and think in more general terms. The correct answer is "anonymous pages
should be any power of two size".

2023-02-22 03:53:07

by Matthew Wilcox

Subject: Re: What size anonymous folios should we allocate?

On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox <[email protected]> wrote:
> >
> > In a sense this question is premature, because we don't have any code
> > in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> > but let's pretend that code already exists and is just waiting for us
> > to answer this policy question.
> >
> > I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> > option and 3. a sysfs tunable. It is foolish to expect the distro
> > packager or the sysadmin to be able to make such a decision. The
> > correct decision will depend upon the instantaneous workload of the
> > entire machine and we'll want different answers for different VMAs.
>
> Yeah, I agree those 3 options should be avoided. For some
> architectures, there are a or multiple sweet size(s) benefiting from
> hardware. For example, ARM64 contiguous PTE supports up to 16
> consecutive 4K pages to form a 64K entry in TLB instead of 16 4K
> entries. Some implementations may support intermediate sizes (for
> example, 8K, 16K and 32K, but this may make the hardware design
> harder), but some may not. AMD's coalesce PTE supports a different
> size (128K if I remember correctly). So the multiple of the size
> supported by hardware (64K or 128K) seems like the common ground from
> maximizing hardware benefit point of view. Of course, nothing prevents
> the kernel from allocating other orders.

All of this is true (although I think AMD's intermediate size is actually
32kB, not 128kB), but irrelevant. Software overhead is FAR more important
than hardware overhead. If we swap out the wrong page or have to run
around doing reclaim, that absolutely dwarfs the performance impact of
using small TLB entries.

So we need to strike the right balance between using larger folios
for efficiency and smaller folios for precision of knowing which
pages are still part of the process working set.

> Actually I was thinking about the reverse, starting from the biggest
> possible order, for example, 2M -> 1M -> ... 64K -> ... 4K. And the
> page fault path should be able to use the same fallback order. But
> excessive fallback tries may be harmful either.

What's your reasoning here?

> > B. A further modification is that it will require three of the four
> > folios being combined to be on the active list. If two (or more)
> > of the four folios are inactive, we should leave them alone; either
> > they will remain inactive and eventually be evicted, or they will be
> > activated and eligible for merging in a future pass of khugepaged.
>
> If we use the fallback policy, we should be able to just leave it to
> reclamation time. When checking reference we could tell what PTEs are
> accessed, then split if there is significant internal fragmentation.

I think it's going to lead to excessive memory usage. There was data
presented last LSFMM that we already have far too much memory tied up
in THP for many workloads.

> > C. We add a new wrinkle to the LRU handling code. When our scan of the
> > active list examines a folio, we look to see how many of the PTEs
> > mapping the folio have been accessed. If it is fewer than half, and
> > those half are all in either the first or last half of the folio, we
> > split it. The active half stays on the active list and the inactive
> > half is moved to the inactive list.
>
> With contiguous PTE, every PTE still maintains its own access bit (but
> it is implementation defined, some implementations may just set access
> bit once for one PTE in the contiguous region per arm arm IIUC). But
> anyway this is definitely feasible.

If a CPU doesn't have separate access bits for PTEs, then we should just
not use the contiguous bits. Knowing which parts of the folio are
unused is more important than using the larger TLB entries.

> > For the third case, in contrast, the parent had already established
> > an appropriate size folio to use for this VMA before calling fork().
> > Whether it is the parent or the child causing the COW, it should probably
> > inherit that choice and we should default to the same size folio that
> > was already found.
>
> Actually this is not what THP does now. The current THP behavior is to
> split the PMD then fallback to order-0 page fault. For smaller orders,
> we may consider allocating a large folio.

I know it's not what THP does now. I think that's because the gap
between PMD and PAGE size is too large and we end up wasting too much
memory. We also have very crude mechanisms for determining when to
use THPs. With the adaptive mechanism I described above, I think it's
time to change that.


2023-02-22 22:49:13

by Yang Shi

Subject: Re: What size anonymous folios should we allocate?

On Tue, Feb 21, 2023 at 7:52 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> > On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox <[email protected]> wrote:
> > >
> > > In a sense this question is premature, because we don't have any code
> > > in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> > > but let's pretend that code already exists and is just waiting for us
> > > to answer this policy question.
> > >
> > > I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> > > option and 3. a sysfs tunable. It is foolish to expect the distro
> > > packager or the sysadmin to be able to make such a decision. The
> > > correct decision will depend upon the instantaneous workload of the
> > > entire machine and we'll want different answers for different VMAs.
> >
> > Yeah, I agree those 3 options should be avoided. For some
> > architectures, there are a or multiple sweet size(s) benefiting from
> > hardware. For example, ARM64 contiguous PTE supports up to 16
> > consecutive 4K pages to form a 64K entry in TLB instead of 16 4K
> > entries. Some implementations may support intermediate sizes (for
> > example, 8K, 16K and 32K, but this may make the hardware design
> > harder), but some may not. AMD's coalesce PTE supports a different
> > size (128K if I remember correctly). So the multiple of the size
> > supported by hardware (64K or 128K) seems like the common ground from
> > maximizing hardware benefit point of view. Of course, nothing prevents
> > the kernel from allocating other orders.
>
> All of this is true (although I think AMDs intermediate size is actually
> 32kB, not 128kB), but irrelevant. Software overhead is FAR more important
> than hardware overhead. If we swap out the wrong page or have to run
> around doing reclaim, that absolutely dwarfs the performance impact of
> using small TLB entries.
>
> So we need to strike the right balance between using larger folios
> for efficiency and smaller folios for precision of knowing which
> pages are still part of the process working set.

As long as the large folio is not PMD mapped, we know which subpages
are accessed. Anyway I do agree we need the balance.

>
> > Actually I was thinking about the reverse, starting from the biggest
> > possible order, for example, 2M -> 1M -> ... 64K -> ... 4K. And the
> > page fault path should be able to use the same fallback order. But
> > excessive fallback tries may be harmful either.
>
> What's your reasoning here?

There is no indicator to tell the kernel what order is preferred or
proper for anonymous pages, just as you elaborated. We don't know the
access pattern of the workloads either. But we have to start from some
order. Basically I looked at it from the other angle. The approach
simply starts from an order which could maximize the hardware benefit.
This may incur some memory waste until memory reclamation kicks in.
And, off the top of my head, there should be some feedback mechanism to
hint at a more proper order for future page faults, so that we finally
reach the balance.
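
As a toy illustration of the kind of feedback I mean (all structures,
names and thresholds below are made up):

/* When a large folio is reclaimed or split, report how much of it was
 * actually used and nudge a per-VMA preferred order up or down. */
#include <stdio.h>

struct vma_order_hint {
        unsigned int preferred_order;
        unsigned int max_order;
};

static void folio_retired(struct vma_order_hint *h,
                          unsigned int nr_ptes, unsigned int nr_accessed)
{
        if (nr_accessed == nr_ptes && h->preferred_order < h->max_order)
                h->preferred_order++;   /* fully used: try bigger next time */
        else if (nr_accessed < nr_ptes / 2 && h->preferred_order > 0)
                h->preferred_order--;   /* mostly wasted: try smaller */
}

int main(void)
{
        struct vma_order_hint h = { .preferred_order = 4, .max_order = 9 };

        folio_retired(&h, 16, 16);      /* every PTE of an order-4 folio used */
        printf("after full use: preferred order %u\n", h.preferred_order);
        folio_retired(&h, 32, 5);       /* only 5 of 32 PTEs used */
        printf("after waste:    preferred order %u\n", h.preferred_order);
        return 0;
}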

I can't tell which one (starting from a low order or a high order) is
better for now.

>
> > > B. A further modification is that it will require three of the four
> > > folios being combined to be on the active list. If two (or more)
> > > of the four folios are inactive, we should leave them alone; either
> > > they will remain inactive and eventually be evicted, or they will be
> > > activated and eligible for merging in a future pass of khugepaged.
> >
> > If we use the fallback policy, we should be able to just leave it to
> > reclamation time. When checking reference we could tell what PTEs are
> > accessed, then split if there is significant internal fragmentation.
>
> I think it's going to lead to excessive memory usage. There was data
> presented last LSFMM that we already have far too much memory tied up
> in THP for many workloads.

It is possible. But that data was based on PMD-size THP IIRC. The
lower orders may hopefully make a difference.

>
> > > C. We add a new wrinkle to the LRU handling code. When our scan of the
> > > active list examines a folio, we look to see how many of the PTEs
> > > mapping the folio have been accessed. If it is fewer than half, and
> > > those half are all in either the first or last half of the folio, we
> > > split it. The active half stays on the active list and the inactive
> > > half is moved to the inactive list.
> >
> > With contiguous PTE, every PTE still maintains its own access bit (but
> > it is implementation defined, some implementations may just set access
> > bit once for one PTE in the contiguous region per arm arm IIUC). But
> > anyway this is definitely feasible.
>
> If a CPU doesn't have separate access bits for PTEs, then we should just
> not use the contiguous bits. Knowing which parts of the folio are
> unused is more important than using the larger TLB entries.

Yeah, we are on the same page.

>
> > > For the third case, in contrast, the parent had already established
> > > an appropriate size folio to use for this VMA before calling fork().
> > > Whether it is the parent or the child causing the COW, it should probably
> > > inherit that choice and we should default to the same size folio that
> > > was already found.
> >
> > Actually this is not what THP does now. The current THP behavior is to
> > split the PMD then fallback to order-0 page fault. For smaller orders,
> > we may consider allocating a large folio.
>
> I know it's not what THP does now. I think that's because the gap
> between PMD and PAGE size is too large and we end up wasting too much
> memory. We also have very crude mechanisms for determining when to
> use THPs. With the adaptive mechanism I described above, I think it's
> time to change that.

Yeah, IIRC the major reason was the complaints about long latency for
COW faults. Both THP allocation and the data copy could incur long
latency. So lower orders may hopefully make a difference.

>

2023-03-01 10:25:12

by Ryan Roberts

Subject: Re: What size anonymous folios should we allocate?

I'd like to throw in my 2p here. Quick disclaimer first: I'm new to mm,
so I'm sure I'll say a bunch of dumb stuff - please go easy ;-)

In the few workloads that I'm focused on, I can see a disparity in performance
between a kernel configured for 4K vs 16K pages. My goal is to release the extra
performance we see in the 16K variant into the 4K variant.

My results show that ~half of the uplift is down to SW efficiency savings in
the kernel; 4x fewer data aborts (vast majority for anon memory) and less effort
spent in mm-heavy syscalls (as expected). And the other ~half is down to HW; the
TLB has less pressure, which causes everything to speed up a bit. But this "bit"
is important, given most of the time is spent in user space, which only benefits
from the HW part. See [1] for more details.

Newer Arm CPUs have a uarch feature called Hardware Page Aggregation (HPA). This
allows the TLB to aggregate 2-8 physically- and virtually-contiguous pages into
a single TLB entry to reduce pressure. (Note this is separate from the contig
bit and is invisible from a SW programming perspective).

So my hope is that I can get the equivalent SW efficiencies with a 4K base page
size and large anonymous folios, and also benefit from a lot of the HW
performance due to it all naturally fitting HPA's requirements.


On 21/02/2023 21:49, Matthew Wilcox wrote:
> In a sense this question is premature, because we don't have any code
> in place to handle folios which are any size but PMD_SIZE or PAGE_SIZE,
> but let's pretend that code already exists and is just waiting for us
> to answer this policy question.
>
> I'd like to reject three ideas up front: 1. a CONFIG option, 2. a boot
> option and 3. a sysfs tunable. It is foolish to expect the distro
> packager or the sysadmin to be able to make such a decision. The
> correct decision will depend upon the instantaneous workload of the
> entire machine and we'll want different answers for different VMAs.
>
> I'm open to applications having some kind of madvise() call they can
> use to specify hints, but I would prefer to handle memory efficiently
> for applications which do not.

Firmly agree.

>
> For pagecache memory, we use the per-fd readahead code; if readahead has
> been successful in the past we bump up the folio size until it reaches
> its maximum. There is no equivalent for anonymous memory.
>
> I'm working my way towards a solution that looks a little like this:
>
> A. We modify khugepaged to quadruple the folio size each time it scans.
> At the moment, it always attempts to promote straight from order 0
> to PMD size. Instead, if it finds four adjacent order-0 folios,
> it will allocate an order-2 folio to replace them. Next time it
> scans, it finds four order-2 folios and replaces them with a single
> order-4 folio. And so on, up to PMD order.


From the SW efficiencies perspective, what is the point of doing a replacement
after you have allocated all the order-0 folios? Surely that just adds more
overhead? I think the aim has to be to try to allocate the correct order up
front to cut down the allocation cost; one order-2 allocation is ~4x less
expensive than 4x order-0, right?

I wonder if it is preferable to optimistically allocate a mid-order folio to
begin with, then later choose to split or merge from there? Perhaps these folios
could initially go on a separate list to make them faster to split and reclaim
the unused portions when under memory pressure? (My data/workloads suggest 16K
allocations are the knee, and making them bigger than that doesn't proportionally
improve performance).

>
> B. A further modification is that it will require three of the four
> folios being combined to be on the active list. If two (or more)
> of the four folios are inactive, we should leave them alone; either
> they will remain inactive and eventually be evicted, or they will be
> activated and eligible for merging in a future pass of khugepaged.
>
> C. We add a new wrinkle to the LRU handling code. When our scan of the
> active list examines a folio, we look to see how many of the PTEs
> mapping the folio have been accessed. If it is fewer than half, and
> those half are all in either the first or last half of the folio, we
> split it. The active half stays on the active list and the inactive
> half is moved to the inactive list.
>
> I feel that these three changes should allow us to iterate towards a
> solution for any given VMA that is close to optimal, and adapts to a
> changing workload with no intervention from a sysadmin, or even hint
> from a program.
>
> There are three different circumstances where we currently allocate
> anonymous memory. The first is for mmap(MAP_ANONYMOUS), the second is
> COW on a file-backed MAP_PRIVATE and the third is COW of a post-fork
> anonymous mapping.
>
> For the first option, the only hint we have is the size of the VMA.
> I'm tempted to suggest our initial guess at the right size folio to
> allocate should be scaled to that, although I don't have a clear idea
> about what the scale factor should be.

Ahh - perhaps I misunderstood what you were saying above. My experience has been
that order-2 seems to be the knee in terms of performance gain, so perhaps one
approach would be to start with order-2 allocations, then adjust based on the
observed page fault pattern within the VMA? i.e. if you're getting mostly
in-order faults, increase the VMA's scaling factor, and if it's mostly random,
decrease. (Just a suggestion based on intuition, so feel free to shoot it down.)
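
To show what I mean in code form (names and thresholds entirely made up,
just the intuition written down):

/* Classify each anon fault as "near the previous one" or not, and nudge
 * the VMA's starting order accordingly, beginning at order 2 (16K). */
#include <stdio.h>

#define PAGE_SHIFT 12
#define MAX_ORDER  9

struct vma_fault_hint {
        unsigned long last_addr;
        unsigned int order;
};

static void record_fault(struct vma_fault_hint *h, unsigned long addr)
{
        unsigned long dist = addr > h->last_addr ? addr - h->last_addr
                                                 : h->last_addr - addr;

        /* "Mostly in-order" here: within one folio-size of the last fault. */
        if (dist <= (1ul << (h->order + PAGE_SHIFT))) {
                if (h->order < MAX_ORDER)
                        h->order++;
        } else if (h->order > 0) {
                h->order--;
        }
        h->last_addr = addr;
}

int main(void)
{
        struct vma_fault_hint h = { .last_addr = 0x100000, .order = 2 };
        unsigned long addr = 0x100000;

        for (int i = 0; i < 4; i++) {
                addr += 1ul << (h.order + PAGE_SHIFT);  /* sequential faults */
                record_fault(&h, addr);
        }
        printf("after sequential faults: order %u\n", h.order);
        record_fault(&h, addr + (64ul << 20));          /* one random fault */
        printf("after a random fault:    order %u\n", h.order);
        return 0;
}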

>
> For the second case, I want to strongly suggest that the size of the
> folio allocated by the page cache should be of no concern. It is largely
> irrelevant to the application's usage pattern what size the page cache
> has chosen to cache the file. I might start out very conservatively
> here with an order-0 allocation.
>
> For the third case, in contrast, the parent had already established
> an appropriate size folio to use for this VMA before calling fork().
> Whether it is the parent or the child causing the COW, it should probably
> inherit that choice and we should default to the same size folio that
> was already found.
>
>
> I don't stay current with the research literature, so if someone wants
> to point me to a well-studied algorithm and let me know that I can stop
> thinking about this, that'd be great. And if anyone wants to start
> working on implementing this, that'd also be great.
>
> P.S. I didn't want to interrupt the flow of the above description to
> note that allocation of any high-order folio can and will fail, so
> there will definitely be fallback points to order-0 folios, which will
> be no different from today. Except that maybe we'll be able to iterate
> towards the correct folio size in the new khugepaged.
>
> P.P.S. I still consider myself a bit of a novice in the handling of
> anonymous memory, so don't be shy to let me know what I got wrong.
>
>

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/[email protected]/


2023-03-27 12:50:19

by Vlastimil Babka

Subject: Re: What size anonymous folios should we allocate?

On 2/22/23 04:52, Matthew Wilcox wrote:
> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>
>> > C. We add a new wrinkle to the LRU handling code. When our scan of the
>> > active list examines a folio, we look to see how many of the PTEs
>> > mapping the folio have been accessed. If it is fewer than half, and
>> > those half are all in either the first or last half of the folio, we
>> > split it. The active half stays on the active list and the inactive
>> > half is moved to the inactive list.
>>
>> With contiguous PTE, every PTE still maintains its own access bit (but
>> it is implementation defined, some implementations may just set access
>> bit once for one PTE in the contiguous region per arm arm IIUC). But
>> anyway this is definitely feasible.
>
> If a CPU doesn't have separate access bits for PTEs, then we should just
> not use the contiguous bits. Knowing which parts of the folio are
> unused is more important than using the larger TLB entries.

Hm, but AFAIK the AMD aggregation is transparent; there are no bits. And IIUC
the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
the thread sounds similar. So IIUC there will be a larger TLB entry
transparently, and then I don't expect the CPU to update individual bits, as
that would defeat the purpose. So I'd expect it will either set them all to
active when forming the larger TLB entry, or set them on a single subpage
and leave the rest in whatever state they were. Hm, I wonder if the exact
behavior is defined anywhere.

>> > For the third case, in contrast, the parent had already established
>> > an appropriate size folio to use for this VMA before calling fork().
>> > Whether it is the parent or the child causing the COW, it should probably
>> > inherit that choice and we should default to the same size folio that
>> > was already found.
>>
>> Actually this is not what THP does now. The current THP behavior is to
>> split the PMD then fallback to order-0 page fault. For smaller orders,
>> we may consider allocating a large folio.
>
> I know it's not what THP does now. I think that's because the gap
> between PMD and PAGE size is too large and we end up wasting too much
> memory. We also have very crude mechanisms for determining when to
> use THPs. With the adaptive mechanism I described above, I think it's
> time to change that.
>
>

2023-03-27 15:31:57

by Ryan Roberts

Subject: Re: What size anonymous folios should we allocate?

On 27/03/2023 13:41, Vlastimil Babka wrote:
> On 2/22/23 04:52, Matthew Wilcox wrote:
>> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>>
>>>> C. We add a new wrinkle to the LRU handling code. When our scan of the
>>>> active list examines a folio, we look to see how many of the PTEs
>>>> mapping the folio have been accessed. If it is fewer than half, and
>>>> those half are all in either the first or last half of the folio, we
>>>> split it. The active half stays on the active list and the inactive
>>>> half is moved to the inactive list.
>>>
>>> With contiguous PTE, every PTE still maintains its own access bit (but
>>> it is implementation defined, some implementations may just set access
>>> bit once for one PTE in the contiguous region per arm arm IIUC). But
>>> anyway this is definitely feasible.
>>
>> If a CPU doesn't have separate access bits for PTEs, then we should just
>> not use the contiguous bits. Knowing which parts of the folio are
>> unused is more important than using the larger TLB entries.
>
> Hm but AFAIK the AMD aggregation is transparent, there are no bits. And IIUC
> the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
> the thread, that sounds similar. So I IIUC there will be a larger TLB entry
> transparently, and then I don't expect the CPU to update individual bits as
> that would defeat the purpose. So I'd expect it will either set them all to
> active when forming the larger TLB entry, or set them on a single subpage
> and leave the rest at whatever state they were. Hm I wonder if the exact
> behavior is defined anywhere.

For arm64, at least, there are 2 separate mechanisms:

"The Contiguous Bit" (D8.6.1 in the Arm ARM) is a bit in the translation table
descriptor that SW can set to indicate that a set of adjacent entries are
contiguous and have same attributes and permissions etc. It is architectural.
The order of the contiguous range is fixed and depends on the base page size
that is in use. When in use, HW access and dirty reporting is only done at the
granularity of the contiguous block.

"HPA" is a micro-architectural feature on some Arm CPUs, which aims to do a
similar thing, but is transparent to SW. In this case, the dirty and access bits
remain per-page. But when they differ, this affects the performance of the feature.

Typically HPA can coalesce up to 4 adjacent entries, whereas for a 4KB base page
at least, the contiguous bit applies to 16 adjacent entries.

I'm hearing that there are workloads where being able to use the contiguous bit
really does make a difference, so I would like to explore solutions that can
work when we only have access/dirty at the folio level.

Thanks,
Ryan



>
>>>> For the third case, in contrast, the parent had already established
>>>> an appropriate size folio to use for this VMA before calling fork().
>>>> Whether it is the parent or the child causing the COW, it should probably
>>>> inherit that choice and we should default to the same size folio that
>>>> was already found.
>>>
>>> Actually this is not what THP does now. The current THP behavior is to
>>> split the PMD then fallback to order-0 page fault. For smaller orders,
>>> we may consider allocating a large folio.
>>
>> I know it's not what THP does now. I think that's because the gap
>> between PMD and PAGE size is too large and we end up wasting too much
>> memory. We also have very crude mechanisms for determining when to
>> use THPs. With the adaptive mechanism I described above, I think it's
>> time to change that.
>>
>>
>

2023-03-27 15:51:26

by Vlastimil Babka

Subject: Re: What size anonymous folios should we allocate?

On 3/27/23 17:30, Ryan Roberts wrote:
> On 27/03/2023 13:41, Vlastimil Babka wrote:
>> On 2/22/23 04:52, Matthew Wilcox wrote:
>>> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>>>
>>>>> C. We add a new wrinkle to the LRU handling code. When our scan of the
>>>>> active list examines a folio, we look to see how many of the PTEs
>>>>> mapping the folio have been accessed. If it is fewer than half, and
>>>>> those half are all in either the first or last half of the folio, we
>>>>> split it. The active half stays on the active list and the inactive
>>>>> half is moved to the inactive list.
>>>>
>>>> With contiguous PTE, every PTE still maintains its own access bit (but
>>>> it is implementation defined, some implementations may just set access
>>>> bit once for one PTE in the contiguous region per arm arm IIUC). But
>>>> anyway this is definitely feasible.
>>>
>>> If a CPU doesn't have separate access bits for PTEs, then we should just
>>> not use the contiguous bits. Knowing which parts of the folio are
>>> unused is more important than using the larger TLB entries.
>>
>> Hm but AFAIK the AMD aggregation is transparent, there are no bits. And IIUC
>> the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
>> the thread, that sounds similar. So I IIUC there will be a larger TLB entry
>> transparently, and then I don't expect the CPU to update individual bits as
>> that would defeat the purpose. So I'd expect it will either set them all to
>> active when forming the larger TLB entry, or set them on a single subpage
>> and leave the rest at whatever state they were. Hm I wonder if the exact
>> behavior is defined anywhere.
>
> For arm64, at least, there are 2 separate mechanisms:
>
> "The Contiguous Bit" (D8.6.1 in the Arm ARM) is a bit in the translation table
> descriptor that SW can set to indicate that a set of adjacent entries are
> contiguous and have same attributes and permissions etc. It is architectural.
> The order of the contiguous range is fixed and depends on the base page size
> that is in use. When in use, HW access and dirty reporting is only done at the
> granularity of the contiguous block.
>
> "HPA" is a micro-architectural feature on some Arm CPUs, which aims to do a
> similar thing, but is transparent to SW. In this case, the dirty and access bits
> remain per-page. But when they differ, this affects the performance of the feature.
>
> Typically HPA can coalesce up to 4 adjacent entries, whereas for a 4KB base page
> at least, the contiguous bit applies to 16 adjacent entries.

Hm, if it's 4 entries on arm64 and presumably 8 on AMD, maybe we should only
care about how actively the individual "subpages" are accessed above that
size, to avoid dealing with this uncertainty whether HW tracks them. At such
smallish sizes we shouldn't induce massive overhead?

> I'm hearing that there are workloads where being able to use the contiguous bit
> really does make a difference, so I would like to explore solutions that can
> work when we only have access/dirty at the folio level.

And on the higher orders where we have explicit control via bits, we could
split the explicitly contiguous mappings once in a while to determine if the
sub-folios are still accessed? Although maybe with the 16x4kB page limit it may
still not be worth the trouble?

> Thanks,
> Ryan

2023-03-28 10:18:29

by Ryan Roberts

Subject: Re: What size anonymous folios should we allocate?

On 27/03/2023 16:48, Vlastimil Babka wrote:
> On 3/27/23 17:30, Ryan Roberts wrote:
>> On 27/03/2023 13:41, Vlastimil Babka wrote:
>>> On 2/22/23 04:52, Matthew Wilcox wrote:
>>>> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>>>>
>>>>>> C. We add a new wrinkle to the LRU handling code. When our scan of the
>>>>>> active list examines a folio, we look to see how many of the PTEs
>>>>>> mapping the folio have been accessed. If it is fewer than half, and
>>>>>> those half are all in either the first or last half of the folio, we
>>>>>> split it. The active half stays on the active list and the inactive
>>>>>> half is moved to the inactive list.
>>>>>
>>>>> With contiguous PTE, every PTE still maintains its own access bit (but
>>>>> it is implementation defined, some implementations may just set access
>>>>> bit once for one PTE in the contiguous region per arm arm IIUC). But
>>>>> anyway this is definitely feasible.
>>>>
>>>> If a CPU doesn't have separate access bits for PTEs, then we should just
>>>> not use the contiguous bits. Knowing which parts of the folio are
>>>> unused is more important than using the larger TLB entries.
>>>
>>> Hm but AFAIK the AMD aggregation is transparent, there are no bits. And IIUC
>>> the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
>>> the thread, that sounds similar. So I IIUC there will be a larger TLB entry
>>> transparently, and then I don't expect the CPU to update individual bits as
>>> that would defeat the purpose. So I'd expect it will either set them all to
>>> active when forming the larger TLB entry, or set them on a single subpage
>>> and leave the rest at whatever state they were. Hm I wonder if the exact
>>> behavior is defined anywhere.
>>
>> For arm64, at least, there are 2 separate mechanisms:
>>
>> "The Contiguous Bit" (D8.6.1 in the Arm ARM) is a bit in the translation table
>> descriptor that SW can set to indicate that a set of adjacent entries are
>> contiguous and have same attributes and permissions etc. It is architectural.
>> The order of the contiguous range is fixed and depends on the base page size
>> that is in use. When in use, HW access and dirty reporting is only done at the
>> granularity of the contiguous block.
>>
>> "HPA" is a micro-architectural feature on some Arm CPUs, which aims to do a
>> similar thing, but is transparent to SW. In this case, the dirty and access bits
>> remain per-page. But when they differ, this affects the performance of the feature.
>>
>> Typically HPA can coalesce up to 4 adjacent entries, whereas for a 4KB base page
>> at least, the contiguous bit applies to 16 adjacent entries.
>
> Hm if it's 4 entries on arm64 and presumably 8 on AMD, maybe we can only
> care about how actively accessed are the individual "subpages" above that
> size, to avoid dealing with this uncertainty whether HW tracks them. At such
> smallish sizes we shouldn't induce massive overhead?

I'm not sure I've fully understood this point. For arm64's HPA, there is no
"uncertainty [about] whether HW tracks them"; HW will always track access/dirty
individually for each base page. The problem is the inverse; if SW (or HW) sets
those bits differently in each page, then TLB coalescing performance may
decrease. Or are you actually suggesting that SW should always set the bits the
same for a 4 or 8 page run, and forgo the extra granularity?

>
>> I'm hearing that there are workloads where being able to use the contiguous bit
>> really does make a difference, so I would like to explore solutions that can
>> work when we only have access/dirty at the folio level.
>
> And on the higher orders where we have explicit control via bits, we could
> split the explicitly contiguous mappings once in a while to determine if the
> sub-folios are still accessed? Although maybe with 16x4kB pages limit it may
> still be not worth the trouble?

I have a bigger-picture question; why is it useful to split these large folios?
I think there are 2 potential reasons (but would like to be educated):

1. If a set of sub-pages that were pre-faulted as part of a large folio have
_never_ been accessed and we are under memory pressure, I guess we would like to
split the folio and free those pages?

2. If a set of subpages within a folio are cold (but were written in the past)
and a separate set of subpages within the same folio are hot and we are under
memory pressure, we would like to swap out the cold pages?

If the first reason is important, I guess we would want to initially map
non-contig, then only remap as contig once every subpage has been touched at
least once.

For the second reason, my intuition says that a conceptual single access and
dirty bit per folio should be sufficient, and folios could be split from
time-to-time to see if one half is cold?

Thanks,
Ryan



>
>> Thanks,
>> Ryan
>

2023-03-28 12:18:41

by Vlastimil Babka

Subject: Re: What size anonymous folios should we allocate?

On 3/28/23 12:12, Ryan Roberts wrote:
> On 27/03/2023 16:48, Vlastimil Babka wrote:
>> On 3/27/23 17:30, Ryan Roberts wrote:
>>> On 27/03/2023 13:41, Vlastimil Babka wrote:
>>>> On 2/22/23 04:52, Matthew Wilcox wrote:
>>>>> On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
>>>>>
>>>>>>> C. We add a new wrinkle to the LRU handling code. When our scan of the
>>>>>>> active list examines a folio, we look to see how many of the PTEs
>>>>>>> mapping the folio have been accessed. If it is fewer than half, and
>>>>>>> those half are all in either the first or last half of the folio, we
>>>>>>> split it. The active half stays on the active list and the inactive
>>>>>>> half is moved to the inactive list.
>>>>>>
>>>>>> With contiguous PTE, every PTE still maintains its own access bit (but
>>>>>> it is implementation defined, some implementations may just set access
>>>>>> bit once for one PTE in the contiguous region per arm arm IIUC). But
>>>>>> anyway this is definitely feasible.
>>>>>
>>>>> If a CPU doesn't have separate access bits for PTEs, then we should just
>>>>> not use the contiguous bits. Knowing which parts of the folio are
>>>>> unused is more important than using the larger TLB entries.
>>>>
>>>> Hm but AFAIK the AMD aggregation is transparent, there are no bits. And IIUC
>>>> the "Hardware Page Aggregation (HPA)" Ryan was talking about elsewhere in
>>>> the thread, that sounds similar. So I IIUC there will be a larger TLB entry
>>>> transparently, and then I don't expect the CPU to update individual bits as
>>>> that would defeat the purpose. So I'd expect it will either set them all to
>>>> active when forming the larger TLB entry, or set them on a single subpage
>>>> and leave the rest at whatever state they were. Hm I wonder if the exact
>>>> behavior is defined anywhere.
>>>
>>> For arm64, at least, there are 2 separate mechanisms:
>>>
>>> "The Contiguous Bit" (D8.6.1 in the Arm ARM) is a bit in the translation table
>>> descriptor that SW can set to indicate that a set of adjacent entries are
>>> contiguous and have same attributes and permissions etc. It is architectural.
>>> The order of the contiguous range is fixed and depends on the base page size
>>> that is in use. When in use, HW access and dirty reporting is only done at the
>>> granularity of the contiguous block.
>>>
>>> "HPA" is a micro-architectural feature on some Arm CPUs, which aims to do a
>>> similar thing, but is transparent to SW. In this case, the dirty and access bits
>>> remain per-page. But when they differ, this affects the performance of the feature.

Oh, looks like I got this part right. Wonder if AMD works the same.

>>> Typically HPA can coalesce up to 4 adjacent entries, whereas for a 4KB base page
>>> at least, the contiguous bit applies to 16 adjacent entries.
>>
>> Hm if it's 4 entries on arm64 and presumably 8 on AMD, maybe we can only
>> care about how actively accessed are the individual "subpages" above that
>> size, to avoid dealing with this uncertainty whether HW tracks them. At such
>> smallish sizes we shouldn't induce massive overhead?
>
> I'm not sure I've fully understood this point. For arm64's HPA, there is no
> "uncertainty [about] whether HW tracks them"; HW will always track access/dirty
> individually for each base page. The problem is the inverse; if SW (or HW) sets
> those bits differently in each page, then TLB coalescing performance may
> decrease. Or are you actually suggesting that SW should always set the bits the
> same for a 4 or 8 page run, and forgo the extra granularity?

I guess we'll need some experiments to see what's the optimal way. IIRC what
we do is just clear the access bits and then let HW set them. If we have a
4/8-page folio on the LRU then we should likely clear the whole of it.
Perhaps, if all subpages are indeed hot enough, the HW will eventually set
the accessed bits back and then create the coalesced TLB entry. If we are
about to reclaim or split the folio, we would see whether it was hot enough,
i.e. whether all subpages have the accessed bit set or not, so maybe that
should all work automatically.
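
Roughly this kind of loop, as a toy model (the bitmap below stands in for
the per-PTE young bits; no real kernel interfaces are implied):

/* Clear the accessed bits across the whole folio during aging, let "HW"
 * set them again, then count them at reclaim/split time. */
#include <stdio.h>

struct toy_folio {
        unsigned int nr_pages;
        unsigned long young;    /* bit i set: subpage i accessed since aging */
};

static void folio_clear_young(struct toy_folio *f)
{
        f->young = 0;           /* aging pass clears the whole folio */
}

static unsigned int folio_count_young(const struct toy_folio *f)
{
        unsigned int n = 0;

        for (unsigned int i = 0; i < f->nr_pages; i++)
                n += (f->young >> i) & 1;
        return n;
}

int main(void)
{
        struct toy_folio f = { .nr_pages = 8 };

        folio_clear_young(&f);
        f.young = 0x0f;         /* pretend HW re-set bits on 4 of 8 subpages */
        printf("%u of %u subpages accessed since the last aging pass\n",
               folio_count_young(&f), f.nr_pages);
        return 0;
}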

>>
>>> I'm hearing that there are workloads where being able to use the contiguous bit
>>> really does make a difference, so I would like to explore solutions that can
>>> work when we only have access/dirty at the folio level.
>>
>> And on the higher orders where we have explicit control via bits, we could
>> split the explicitly contiguous mappings once in a while to determine if the
>> sub-folios are still accessed? Although maybe with 16x4kB pages limit it may
>> still be not worth the trouble?
>
> I have a bigger-picture question; why is it useful to split these large folios?
> I think there are 2 potential reasons (but would like to be educated):
>
> 1. If a set of sub-pages that were pre-faulted as part of a large folio have
> _never_ been accessed and we are under memory pressure, I guess we would like to
> split the folio and free those pages?
>
> 2. If a set of subpages within a folio are cold (but were written in the past)
> and a separate set of subpages within the same folio are hot and we are under
> memory pressure, we would like to swap out the cold pages?

These are not fundamentally different; only 1. depends on whether we
optimistically start large (I think the proposal here was not to start (too) large).

> If the first reason is important, I guess we would want to initially map
> non-contig, then only remap as contig once every subpage has been touched at
> least once.

Yeah. But the second reason will always apply anyway; a workload's access
patterns may change over time.

> For the second reason, my intuition says that a conceptual single access and
> dirty bit per folio should be sufficient, and folios could be split from
> time-to-time to see if one half is cold?

Maybe it's not complete folios that need splitting, just their mappings?

> Thanks,
> Ryan
>
>
>
>>
>>> Thanks,
>>> Ryan
>>
>