2023-11-24 13:28:02

by David Hildenbrand

Subject: [PATCH WIP v1 00/20] mm: precise "mapped shared" vs. "mapped exclusively" detection for PTE-mapped THP / partially-mappable folios

Are you interested in some made-up math, new locking primitives and
slightly unpleasant performance numbers on first sight? :)

Great, you've come to the right place! As an added bonus, we'll be messing
with Copy-On-Write and mapcounts, which we all love, of course.

This is the current state of a project to see if we can find a way to
efficiently and precisely tell whether a PTE-mapped THP is _currently_
mapped by a single process in its page tables, or if multiple processes
are involved; similar to how we are able to do it for small folios very
efficiently.

Now that I am a bit more confident that the math and the locking is
correct, and I roughly know what the added cost will be, here is the
current state as WIP. Still, expect to find serious flaws :P

The additional tracking is performed for all THPs, whereby the biggest
benefit is currently for anonymous THP because of added Copy-on-Write
reuse support after fork.

I'm sure some people will dislike involving mapcounts -- please don't eat
me alive because I used the m-word ;). But, we're already using mapcounts
for that purpose, and I don't really see us not relying on the mapcount
for that or even getting rid of them. That being said, one has to be very
careful when working with the mapcount(s), because the semantics can be
weird and misleading.

IMHO, we should further work on removing most manual usage of the mapcounts
and instead hide them in properly documented functions that spell out the
semantics: like folio_mapped_shared() does in this series.

Further, this series prepares for removing the subpage mapcounts. So
long-term less mapcounts is good, right?! :)

Even if the additional "rmap ID" tracking does not hold up to its
promises, there is some interesting stuff in here that might come in handy.

The first next steps should be to implement+include more rmap batching and
the per-folio mapcount, which will be valuable independent of all the other
stuff. In the meantime the other stuff can mature and we can make up our
mind if we want it, and if so, for which folios.

Needless to say, this is WIP, and not a request to merge this now.

1. Overview
===========

Can we find a way to precisely and efficiently tell whether a PTE-mapped
THP -- more generally called a "partially-mappable folio" here --
is _currently_ "mapped exclusively" or "mapped shared" into page tables?

Usually that means: is one process mapping this folio or are multiple
processes involved? "Special" are KSM pages and pagecache folios that are
mapped multiple times into the same process, and we'll happily "simplify"
for such pagecache folios just like we do for small folios. Well, and
s/process/MM/, because processes can share the "struct mm_struct".

It is worth stressing that we only focus on _current_ page table
mappings; a folio might _currently_ be "mapped exclusively" but is not
actually exclusive (e.g., pagecache, swapcache, ...): other processes are
currently not actively using it and maybe never will.

All use cases that need _guarantees_ about "exclusive ownership" instead of
just "currently exclusively mapped" for correctness can use the precise
"mapped shared vs. mapped exclusively" as a building block.

While that "mapped shared vs mapped exclusively" information is easily
available for small folios (mapcount == 1 vs. mapcount != 1),
partially-mappable folios are more problematic. Each individual
(PTE/PMD/...) mapping accounts towards the total mapcount of the folio and
needs to increment the refcount: one mapping, one reference.

For example, a PTE-mapped PMD-sized THP on x86_64 has "refcount == 512"
and "mapcount == 512". That scheme is very simple, scalable and safe; and
is able to handle partial remapping, partial unmapping, multiple page
tables, ...very efficiently.

But that also implies that "mapcount != 1" is not sufficient to conclude
"mapped shared" for these large folios.
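
To illustrate the problem, the small-folio logic essentially boils down to
the following (hypothetical helper, only for this cover letter):

static bool small_folio_mapped_shared(struct folio *folio)
{
        /* For small folios, "mapped shared" simply means more than one mapping. */
        return folio_mapcount(folio) != 1;
}

A PTE-mapped PMD-sized THP that is exclusive to a single MM already has
folio_mapcount() == 512, so the same check cannot distinguish "mapped
shared" from "mapped exclusively".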

2. Motivation
=============

The most important use case for a precise "mapped shared vs. mapped
exclusively" primitive is currently Copy-On-Write (COW) of PTE-mapped THP:
when we get a write fault after fork(), we may reuse any part of the
THP if that THP is completely exclusive to that process. One way of
implementing such a logic is: are all references from page table mappings,
and do all page table mappings belong to this process?

But "mapped shared vs. mapped exclusively" is also valuable to have for
other large folios: Some system calls / machinery don't want to mess with
folios that are currently not exclusive to a process: for example, we don't
want users to swapout, split or migrate folios that are also actively used
by other processes. Similarly, automatic mechanisms like NUMA balancing and
khugepaged want to detect active sharing of folios efficiently.

To be more specific, this work was motivated by the following factors:

(a) Partially-mappable folios are getting more important; especially
with THPs that cannot be mapped by PMDs; here, we
always get PTE-mapped THP. So PTE-mapped THPs are no longer a corner
case.
(b) We want to reuse PTE-mapped anonymous THPs after fork() to improve
performance; for now, we always copy and never reuse. With small-sized
anonymous THP on the horizon, this will become very relevant.
(c) There is currently no universal way to tell whether a PTE-mapped THP
is "mapped exclusively" or "mapped shared"; not even the subpage
mapcounts can help with that.
(d) We want to fully support THPs that are mapped into multiple page tables
in the same process: examples include mremap(), future support for
THP sizes larger than a PMD, and THPs that cross leaf page tables due to
their virtual start addresses [pagecache folios already/at some point?];
we want this problem to be solved once and for all.
(e) In the context of memdesc, there is interest in removing the
subpage mapcount; so whatever we build is ideally independent of
that and further helps towards removing it.
(f) folio_mapcount() gets slower the larger the folio is and we want to
speed that up; more generally, we don't want to scan the memmap of
folios like folio_mapcount() currently does.

There have been a lot of ideas on how to deal with that over the past
year(s) that all didn't really materialize: some are too expensive, turn
imprecise quickly, or make the code more complicated and fragile.

Examples include having a mapcount for the #VMAs a large folio is mapped
into, a mapcount for the #leaf page tables a large folio is mapped into,
scanning page tables, scanning the memmap, walking the rmap ...

3. Basic Idea
=============

The idea is not to make the semantics of the mapcount of large folios
different from the mapcount of small folios.

Instead, the idea is to use the folio mapcount + some additional tracking+math
to detect if a single MM is mapping (parts of) a large folio, or if
multiple MMs are involved.

It achieves that by plugging into the existing rmap page (un)mapping logic,
without any new hooks or page table / rmap scanning.

Imagine that we have a total mapcount (folio->_total_mapcount) and an
"rmap value" (folio->_rmap_val) per large folio, and assigned each MM a
unique rmap ID (mm->mm_rmap_id).

Whenever we would map/unmap parts of the large folio, we would
increment/decrement the total mapcount of the folio:

folio->_total_mapcount +=/-= 1;

and add/remove the MM rmap ID to/from a folio's rmap value:

folio->_rmap_val +=/-= mm->mm_rmap_id;

So if we want to test whether a single given MM currently exclusively
maps the folio, would:

folio_mapcount() * mm->mm_rmap_id == folio->_rmap_val

be sufficient to identify "mapped exclusively"?

No, for two main reasons:
(1) Concurrent (un)mapping might trick us into making a wrong decision.
(2) folio->_rmap_val might be the sum of rmap IDs from other MMs. For
example, if our MM has rmap ID 5 and maps the folio once, while two other
MMs with rmap IDs 3 and 7 each map it once, the rmap value is
15 == 3 * 5, wrongly suggesting "mapped exclusively".

Well, and integer overflows, which kind-of fall into (2).

To sort out (1), we need some locking scheme -- such as a spinlock or a
seqcount -- to detect concurrent updates and be able to read values that
are consistent.

To sort out (2), we need special numbers as rmap IDs that have the
guarantee that:

folio_mapcount() * mm->mm_rmap_id == folio->_rmap_val

is true if and only if the folio is "mapped exclusively"; well, at least as
long as the folio is not "obviously" mapped shared.

For example, if:

folio_mapcount() > folio_nr_pages()

is true, we know that the folio is "obviously mapped shared" and we don't
even have to look at the rmap values.

Turns out that such numbers exist, but they can grow large fairly
quickly. More details in the respective patch.

I have a proof draft that these numbers actually get the job done. I
suspect this has been proven already a million times. Anyhow, I can send
that draft to anybody interested in it as long as it is in draft mode.

How many such numbers do we need? Well, at least as many as we have
possible processes ... so let's go for 16M: 16*1024*1024.

However, with large folios (i.e., a large number of pages), supporting 16M rmap IDs ends up not
fitting into 64bit. To not mess with 128bit/256bit/512bit math, and to be
able to perform atomic updates of the rmap values, we have to use multiple
64bit values instead; we use our simple MM rmap ID to generate multiple
"sub-IDs" that fit into 64bit values, whereby our sub-IDs are generated
as described above.

The number of rmap values we have to store in the folio, depending on the
folio size and when supporting 16M rmap IDs:

* == order-2: 1x 64bit rmap values
* <= order-5: 2x 64bit rmap values
* <= order-9: 3x 64bit rmap values
* <= order-10: 4x 64bit rmap values
* <= order-12: 5x 64bit rmap values
* <= order-15: 6x 64bit rmap values
* <= order-21: 8x 64bit rmap values [not implemented]
* <= order-30: 12x 64bit rmap values [not implemented]

On most architectures, including x86-64, we should currently never
require more than 3 values. On arm64 with 64k pagesize, we'd need 6 values
for handling 512 MiB THP that are order-13 IIRC.
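
For illustration only: the additional per-folio state for an order-9 folio
(3x 64bit rmap values, per the table above) conceptually looks like the
following; the field layout and the seqcount type name are assumptions, not
the exact definitions from this series:

struct rmap_id_state_order9 {
        atomic_t          total_mapcount;       /* patch #2 */
        atomic_seqcount_t rmap_atomic_seqcount; /* patch #6; type name assumed */
        atomic64_t        rmap_val[3];          /* patch #7 */
};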

4. Detailed Approach
====================

Tracking infrastructure:

(1) Add a per-folio mapcount to large folios, so we can read/write it
atomically and efficiently: folio->_total_mapcount. For now, it
coexists with the subpage mapcounts.
(2) Assign each MM a unique rmap ID < 16M: mm->mm_rmap_id.
(3) Derive 1..X magic sub-ID[X] from that rmap ID, X depending on the
folio size.
(4) Define 1..X folio->_rmap_valX, X depending on the folio size.

While (1) is required in any case when we want to make folio_mapcount()
faster and remove the subpage mapcounts, it already helps to improve
detecting "mapped shared" in more cases by converting
folio_estimated_sharers() to folio_mapped_shared().

Tracking on the writer side -- (un)mapping parts of a folio:

(1) Add/remove the per-MM sub-IDs to/from the respective rmap values
when incrementing/decrementing the total mapcount: when (un)mapping
parts of the folio.
(2) Add a locking scheme (atomic seqcount) to allow for concurrent
updates of the mapcount+values (atomic values), while allowing
readers to detect concurrent updates to retry seqcount-style.
(3) Optimize the locking scheme for the common case of having a single
exclusive writer ("no concurrent (un)mapping"), to reduce the number
of atomic RMW operations.

This makes it possible to detect "certainly mapped shared" vs.
"certainly mapped exclusively" to implement a precise
folio_mapped_shared() function and COW reuse after fork().
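
As a minimal sketch of the writer side -- the seqcount write helpers,
folio_nr_rmap_vals(), mm_rmap_subid() and folio_rmap_val() are illustrative
stand-ins, not the exact API of this series:

static void __folio_add_rmap_ids(struct folio *folio, struct mm_struct *mm,
                                 int nr_mappings)
{
        int i;

        /* Readers detect concurrent writers via the seqcount and retry. */
        raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);

        atomic_add(nr_mappings, &folio->_total_mapcount);

        /* Add this MM's sub-IDs, scaled by the number of new mappings. */
        for (i = 0; i < folio_nr_rmap_vals(folio); i++)
                atomic64_add((s64)nr_mappings * mm_rmap_subid(mm, i),
                             folio_rmap_val(folio, i));

        raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount);
}

Unmapping is the mirror image, with negative adjustments.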

Approach on the reader side:

(1) Only look at the mapcount+rmap values if the folio is not obviously
shared; folio_mapcount() > folio_nr_pages() implies "mapped
shared".
(2) For all X, if folio->rmap_val[X] == folio_mapcount() * sub-ID[X] of a
given MM, the folio is "mapped exclusively", otherwise it is "mapped
shared".
(3) Use the atomic seqcount to catch concurrent (un)mapping that
leaves the counters in an intermediate state and (conditionally)
retry seqcount-style.

In some cases, we might simply want to stop early if there is concurrent
(un)mapping happening, and indicate "maybe mapped shared".
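
Condensed into a sketch, closely mirroring the COW-reuse check in patch #10
(but leaving out the refcount check that COW reuse additionally needs):

static bool __folio_mapped_exclusively(struct folio *folio, struct mm_struct *mm)
{
        unsigned long seq;
        int mapcount;

        seq = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount);
        /* Concurrent (un)mapping? Be conservative: "maybe mapped shared". */
        if (seq & ATOMIC_SEQCOUNT_WRITERS_MASK)
                return false;

        mapcount = folio_mapcount(folio);
        /* Obviously mapped shared, or entirely mapped? */
        if (mapcount > folio_nr_pages(folio) || folio_entire_mapcount(folio))
                return false;

        /* Do all rmap values match "mapcount * sub-ID" of this MM? */
        if (!__folio_has_large_matching_rmap_val(folio, mapcount, mm))
                return false;

        /* Did a writer race with us? Then don't trust what we read. */
        return !raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount, seq);
}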

5. Atomic operations ...
========================

In the rmap code, we used to require at most 2 atomic RMW operations when
(un)mapping parts of a THP: updating the subpage mapcount and
(conditionally) updating folio->nr_pages_mapped (add-return-relaxed).

To manage the total mapcount, we need 1 additional atomic RMW operation
(add).

To manage the folio->_rmap_valX, we need:
* 2 atomic RMW operations for the atomic seqcount
* 1..6 atomic RMW operations for the folio->_rmap_valX

So, in total, an additional 4..9 atomic RMW operations. Hm.

There are three things to consider and optimize:

(1) When batching rmap operations (fork, unmap, ...), we'll only have
to perform these operations once per batch -- in contrast to the
subpage mapcount updates. This series batches rmap operations for
PTE-remapping a PMD-mapped THP, which ends up performing better
than the current state even with additional tracking.
(2) In the common case, we don't have concurrent (un)mapping. So if we
are exclusively holding the atomic seqcount, we can avoid atomic
RMW operations for updating the total mapcount and the rmap values.
This series implements that "exclusive atomic seqcount" optimization;
see the sketch after this list.
(3) In the future, we might be able to get rid of the subpage mapcount
and folio->nr_pages_mapped. This series also updates
these values under the atomic seqcount, and makes them similarly use
the "exclusive atomic seqcount" optimization.

Apparently, on some hardware (AMD EPYC) atomic additions seem to be
just as fast as doing ordinary additions. On such hardware, (2) is not
required. More on that below.

Reducing the number of additions/atomics further will require using
128bit/256bit/512bit integer math.

6. Patch Organization
=====================

Patch #1 to #3 add the "total mapcount" to large folios and convert
folio_estimated_sharers() to folio_mapped_shared() and improve it.

Patch #4 to #7 prepare for rmap ID tracking, and implement the
atomic seqcount + rmap ID tracking. rmap ID tracking is implemented with
4/5/6 rmap values, using lookup tables for the sub-IDs. The atomic seqcount
is only used when adjusting the total mapcount and the rmap values.

Patch #8 to #10 further improve folio_mapped_shared() based on rmap IDs
and implement COW reuse of PTE-mapped anon THP.

Patch #11 adds support for 1/2/3 rmap values by pre-calculating the subids
and storing them in the MM struct.

Patch #12 to #14 implement batching of rmap operations when PTE-remapping
a PMD-mapped anon THP.

Patch #15 just adds some sanity checks.

Patch #16 to #17 implement the "exclusive atomic seqcount" optimization, to
optimize for "no concurrent (un)mapping" of the same folio.

Patch #18 implements an interesting optimization for the atomic seqcount.

Patch #19 to #20 use the per-folio atomic seqcount not only to protect the
total mapcount and the rmap values, but also the other mapcounts.

7. Performance
==============

The added tracking is noticeable. This series adds the tracking for all
partially-mappable folios (i.e., anon, pagecache, shmem), but I focus the
benchmarks on anon 2 MiB THPs for now.

I focus on the extreme case of "all PTE-mapped THPs", with no concurrent
(un)mapping of the same folio -- micro benchmarks -- and with heavy
concurrent (un)mapping of the same folio -- vm-scalability benchmarks.

There are certainly many more benchmarks to run at some point; these
should give a good indication of what the problematic areas are and
how the optimizations perform.

x86_64 benchmarks only, so we only trigger the "3/4 rmap values" case.

7.1 Environment
---------------

I run the benchmarks on a 2-socket Intel(R) Xeon(R) Silver 4210R and
a 2-socket AMD EPYC 7313. Turbo and SMT disabled on both systems,
performance governor enabled on the Intel CPU.

Code stages that are tested:

* baseline" : v6.7-rc1
* mapcount" : after patch #3 (total mapcount + folio_mapped_shared())
* rmap-id" : after patch #10 (rmap_id tracking + COW reuse)
* 123 : after patch #11 (support for 1, 2, 3 values)
* batch : after patch #14 (rmap batching in __split_huge_pmd_locked())
* exclusive : after patch #17 (exclusive atomic seqcount)
* nocmpxchg : after patch #18 (use add-return instead of cmpxchg)
* all : after patch #20 (all mapcount mods under rmap seqcount)

7.2 Micro benchmark
-------------------

Allocate 1 GiB of (2 MiB) THP. Then, perform an action and measure the
time it takes to perform that action only.

(1) remap: Trigger PTE-remapping of each PMD-mapped THP using mprotect() on
a single subpage of each THP (see the userspace sketch below).
(2) munmap: Trigger PTE-remapping as in (1). Measure munmap() of the 1 GiB.
(3) fork: Trigger PTE-remapping as in (1). Measure fork().
(4) cow: fork() a child and leave it alive. Measure memset() on the 1 GiB.
(5) cow-byte: fork() a child and leave it alive. Measure writing a single
byte to each page in the whole 1 GiB.
(6) cow-reuse: Same as (4), but don't leave the child alive.
(7) cow-reuse-byte: Same as (5), but don't leave the child alive.
(8) pte-dontneed: Trigger MADV_DONTNEED on each subpage.
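
For reference, the "remap" trigger in (1) boils down to something like the
following userspace sketch (assumed methodology, not the actual benchmark
code; error handling omitted):

#include <string.h>
#include <sys/mman.h>

#define THP_SIZE        (2UL * 1024 * 1024)
#define AREA_SIZE       (1024UL * 1024 * 1024)

int main(void)
{
        char *area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(area, AREA_SIZE, MADV_HUGEPAGE);
        memset(area, 1, AREA_SIZE);             /* populate the THPs */

        /* Changing the protection of one subpage PTE-remaps each THP. */
        for (unsigned long offs = 0; offs < AREA_SIZE; offs += THP_SIZE)
                mprotect(area + offs, 4096, PROT_READ);
        return 0;
}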

The tables only show the % change in runtime compared to the baseline.

_Negative is good_ (less runtime).

Results on Intel(R) Xeon(R) Silver 4210R CPU with 10 cores and 2 sockets
(20 cores):

| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
--------------------------------------------------------------------------
mapcount | 13.4% | 0.3% | 14.2% | 0.5% | 0.2% | -0.2% | -0.4% | 2.8% |
rmap-id |123.7% | 48.8% |105.9% | 3.0% | 3.2% | -30.0% | -64.2% | 14.8% |
123 |108.7% | 42.1% | 93.3% | 1.7% | 2.7% | -30.0% | -64.3% | 12.9% |
batch | -8.1% | 41.4% | 95.5% | 1.5% | 2.2% | -29.7% | -64.2% | 5.7% |
exclusive | -8.8% | 15.1% | 40.5% | 1.1% | 0.9% | -29.6% | -64.0% | 1.1% |
nocmpxchg |-10.4% | 12.6% | 46.4% | 0.9% | 1.4% | -30.0% | -64.3% | 0.8% |
all |-34.8% | 1.0% | 27.8% | 0.7% | 0.4% | -30.0% | -63.9% | -0.5% |

Results on AMD EPYC 7313 16-Core Processor and 2 sockets (32 cores):

| (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
--------------------------------------------------------------------------
mapcount | 24.3% | 1.8% | 25.9% | 1.5% | 2.6% | 1.4% | 1.9% | 4.5% |
rmap-id | 55.4% | 12.5% | 27.9% | 2.6% | 4.3% | -50.4% | -60.9% | 6.3% |
123 | 45.6% | 6.3% | 30.4% | 2.9% | 4.1% | -51.1% | -60.5% | 6.7% |
batch | -9.6% | 7.2% | 31.5% | 2.1% | 3.0% | -51.1% | -60.7% | 4.0% |
exclusive |-10.8% | 14.0% | 48.4% | 0.5% | 2.3% | -51.4% | -61.0% | 5.0% |
nocmpxchg |-11.3% | 25.9% | 55.7% | 2.0% | 3.3% | -50.9% | -60.2% | 3.4% |
all |-21.8% | 16.1% | 43.0% | 3.2% | 4.8% | -50.2% | -60.8% | 5.8% |


Discussion:
* rmap batching can do magic, as shown in (1).
* (2) and (3) perform mostly only refcount+mapcount operations, which is
why the additional tracking negatively affects performance. For both,
we can batch rmap operations.
* For (4) and (5) the additional overhead is smallish; one way to
optimize is COW'ing more on pagefault, to essentially batch rmap
operations.
* The COW-reuse optimization in (6) and (7) is obvious.
* (8) cannot really be optimized by rmap batching: we cannot discard more
than what we are told to, at least when it comes to anonymous folios.
* AMD EPYC doesn't benefit from the "exclusive atomic seqcount"
optimization in most cases, and it can actually be counter-productive. We
should just disable that optimization at runtime when we detect CPUs
that can handle atomic RMW operations that efficiently.
* There are some (reproducible) unexplainable performance differences that
don't make too much sense: for example, why is "nocmpxchg" on Intel
*slower* for fork? Why do the numbers on Intel for (7) differ slightly?
I suspect code size/layout changes are responsible: especially fork()
seems to be sensitive to each added instruction.

In summary, the "really bad" cases -- fork() and unmap() can be improved
by rmap batching. On Intel, only fork() really remains "sub-optimal",
everything else is quite acceptable.

On AMD EPYC, we want to disable the exclusive optimization(s), although
weirdly, in some cases they help, in others they don't. That needs
further investigation. I'd still say that with the "batch" state, the results
are acceptable, and only fork+unmap primarily need some love.

7.3 VM scalability
------------------

Running case-anon-cow-seq and case-anon-cow-rand with one instance on
each core, with an effective 1 GiB area that each instance has to COW.
The benchmark measures memory throughput.

These benchmarks will repeatedly PTE-remap each THP in the child process,
to then COW each single page, triggering repeatedly concurrent (un)mapping
of pages of the same folios from all cores. The sequential case is
probably the worst case we can possibly trigger concurrently and
repeatedly.

The tables only show the % change in throughput compared to the baseline.

_Positive is good_ (more throughput).

Results on Intel(R) Xeon(R) Silver 4210R CPU with 10 cores and 2 sockets
(20 cores):

| seq | rand |
------------------------------
mapcount | -8.2% | -0.1% |
rmap-id | -16.2% | -0.3% |
123 | -12.3% | -0.3% |
batch | -8.3% | -0.4% |
exclusive | -7.1% | 0.4% |
nocmpxchg | -6.6% | -0.3% |
all | -5.6% | -0.3% |


Results on AMD EPYC 7313 16-Core Processor and 2 sockets (32 cores):

| seq | rand |
------------------------------
mapcount | -2.1% | -0.1% |
rmap-id | -10.4% | -0.1% |
123 | -10.8% | -0.7% |
batch | -3.8% | -0.1% |
exclusive | -4.2% | 0.0% |
nocmpxchg | -4.6% | 0.0% |
all | -4.5% | -0.5% |


Discussion:
* In the random access case, the performance change is minimal and
negligible.
* It's interesting how much "harm" the total mapcount alone creates in
the sequential case.
* For AMD EPYC, the sweet spot is once again without any exclusive
optimizations; these optimizations don't really help in any way.
* AMD EPYC can deal with concurrent atomics much more efficiently.
* For Intel, all optimizations also seem to help towards optimizing the
"heavy concurrent (un)mapping" case. So we still seem to hit quite some
"exclusive" cases -- and shared writers waiting for the exclusive writer
do not seem to get really affected by that.
* Again, the rmap batching when PTE-remapping the PMD on the first write
fault is obvious in the sequential case.

In summary, in the more common "less-concurrent" randomized case, the
performance change is negligible. In the highly concurrent case, the numbers
are not as bad as I originally thought, quite the opposite. Note that AMD
EPYC performs much better with 32 cores compared to the Intel CPU with 20
cores.

With higher core count, the impact will surely get worse, but it's
questionable how relevant these benchmarks then still are: if you have
all cores repeatedly (un)mapping parts of the same folio, probably your
application is doing something wrong.

8. TODOs
========

* Batch rmap operations during fork(), unmap, migration, ...
* More testing and benchmarking.
* I'm sure there are reasonable micro-optimizations.
* Maybe remove some rmap value variants (is 5 really required?)
* Atomic seqcount documentation / tests / MAINTAINER.
* Detect whether the "exclusive atomic seqcount holder" optimization is
actually beneficial for the current hardware ... somehow .. and
conditionally enable it. Or rather: conditionally disable it.

9. Ideas
========

* Use rmap ID tracking only for selected folios (e.g., anon), selected
folio sizes, etc.
* We could skip the rmap ID tracking for pages that are essentially
always considered shared (executables, shared libraries, ...) by marking
them as "always assumed mapped shared".
* Once we detect heavy contention -- many concurrent writers -- we could
stop the rmap ID tracking and mark the folio as "always assumed
mapped shared".
* Using 128bit/256bit/512bit atomics might help.
* If a sub-ID is 0, we could just optimize out the atomic RMW operation.
This would primarily help for smallish rmap IDs.
* Does spreading folio->_rmap_valX across more cachelines help to reduce
contention? Some early experiments indicate: no. Maybe squeezing the
data into less cachelines helps [but that's hard]?
* Does pre-calculating the sub-IDs (e.g., either in the MM or in a pcpu
variable) help? Some early experiments indicate: no.
* We could skip the rmap ID tracking for "entire mappings" and never
look at the rmap ID values once we have any "entire" mapping. But that
only applies to PMD-sized THP.
* Finding ways of merging the total mapcount with the seqcount, or
avoiding the seqcount completely.
* Put everything into mm/rmap.c instead of mm/rmap_id.c ?

10. Q/A
=======

Q: But aren't we consuming a lot of memmap space?
A: Not really. Excluding the atomic seqcount, we need 8 bytes for order-2,
16 bytes for order-5 and 32 bytes for order-10. The subpage mapcounts alone
are 16 bytes for order-2, 128 bytes for order-5 and 4096 bytes for
order-10.

Q: Do we really need the atomic seqcount?
A: At least for anonymous THP, it's currently an integral part of
stabilizing the refcount vs. mapcount and precisely detecting "certainly
exclusive". In general, without the atomic seqcount there could be cases
where we could be tricked into wrongly concluding "mapped shared" or
"mapped exclusively", even when re-reading the total mapcount or the
rmap values.

Q: What about 32bit?
A: We would only want to use 32bit atomics. With 64k rmap IDs we could
support order-15 folios with a maximum of 8x 32bit rmap values.

Q: What about RT?
A: If we disable the "exclusive" path for the atomic seqcount, we wouldn't
have to spin on the writer's side. But we'd still have to disable
preemption. For now, THPs are incompatible with RT either way.

Q: What about using a bit spinlock + ordinary seqcount instead of the
atomic seqcount?
A: Bit spinlocks are unfair and frowned upon, and they increase complexity.
Some early experiments indicated that the performance hit during the
vm-scalability benchmarks is much more extreme.

11. Conclusion
==============

It was to be expected that additional tracking comes with a price. That
price can be heavily reduced by (a) batching rmap operations and by (b)
optimizing for "no concurrent (un)mapping". Some of these optimizations
result in a net performance improvement, even with the additional tracking
in place.

This series only batches some rmap operations; fork, unmap/zap and migration
can similarly benefit from rmap batching, independent of this approach.

There are cases where we cannot batch rmap operations, but these are
rather rare and the performance penalty is IMHO acceptable: the assumption
is that the majority of operations will operate on complete folios. COW is
certainly currently one exception worth spelling out, and it can only be
batched by COW'ing more.

HW with expensive atomic RMW operations can heavily benefit from the
optimizations; for HW with extremely fast atomic RMW operations (e.g., AMD
EPYC) -- where an atomic add seems to be just as fast as an ordinary add
-- the "no concurrent (un)mapping" result in a slowdown in most cases,
because likely "more instructions" and "more conditionals" end up resulting
in slower code.

On such HW where atomics don't dominate the costs, we'd want to disable
any "no concurrent (un)mapping" optimizations. However, such HW can handle
the additional tracking already itself much more efficiently, at least
in the "concurrent" case. The "no concurrency" case is harder to optimize.

In general: replacing per-subpage tracking by global per-folio tracking
will inherently result in cacheline contention: we already do have such
contention on the per-folio refcount.

One interesting thing to investigate is how smallish large folios behave
even after batching rmap operations: we cannot batch that much but still
have to manage the seqcount etc. Further, the performance on other
architectures/platforms/more cores will be interesting to evaluate.

[I'll note that the basic approach can be abstracted and reused in other
contexts, whenever multiple entities are involved, whereby each entity can
hold multiple references to an object, and we want to detect if the object
references are from a single entity, or from multiple entities.]

Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>

David Hildenbrand (20):
mm/rmap: factor out adding folio range into __folio_add_rmap_range()
mm: add a total mapcount for large folios
mm: convert folio_estimated_sharers() to folio_mapped_shared() and
improve it
mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and
page_dup_file_rmap()
mm/rmap: abstract total mapcount operations for partially-mappable
folios
atomic_seqcount: new (raw) seqcount variant to support concurrent
writers
mm/rmap_id: track if one or multiple MMs map a partially-mappable
folio
mm: pass MM to folio_mapped_shared()
mm: improve folio_mapped_shared() for partially-mappable folios using
rmap IDs
mm/memory: COW reuse support for PTE-mapped THP with rmap IDs
mm/rmap_id: support for 1, 2 and 3 values by manual calculation
mm/rmap: introduce folio_add_anon_rmap_range()
mm/huge_memory: batch rmap operations in __split_huge_pmd_locked()
mm/huge_memory: avoid folio_refcount() < folio_mapcount() in
__split_huge_pmd_locked()
mm/rmap_id: verify precalculated subids with CONFIG_DEBUG_VM
atomic_seqcount: support a single exclusive writer in the absence of
other writers
mm/rmap_id: reduce atomic RMW operations when we are the exclusive
writer
atomic_seqcount: use atomic add-return instead of atomic cmpxchg on
64bit
mm/rmap: factor out removing folio range into
__folio_remove_rmap_range()
mm/rmap: perform all mapcount operations of large folios under the
rmap seqcount

Documentation/mm/transhuge.rst | 12 +-
include/linux/atomic_seqcount.h | 262 +++++++++++++++
include/linux/mm.h | 118 +++++--
include/linux/mm_types.h | 66 +++-
include/linux/rmap.h | 309 +++++++++++++++++-
kernel/fork.c | 32 ++
lib/Kconfig.debug | 11 +
mm/Kconfig | 21 ++
mm/Makefile | 1 +
mm/debug.c | 4 +-
mm/huge_memory.c | 47 ++-
mm/hugetlb.c | 13 +-
mm/init-mm.c | 4 +
mm/internal.h | 10 +-
mm/madvise.c | 6 +-
mm/memory.c | 97 +++++-
mm/mempolicy.c | 24 +-
mm/migrate.c | 4 +-
mm/page_alloc.c | 13 +
mm/rmap.c | 370 +++++++++++++--------
mm/rmap_id.c | 561 ++++++++++++++++++++++++++++++++
21 files changed, 1745 insertions(+), 240 deletions(-)
create mode 100644 include/linux/atomic_seqcount.h
create mode 100644 mm/rmap_id.c


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
--
2.41.0


2023-11-24 13:28:14

by David Hildenbrand

Subject: [PATCH WIP v1 05/20] mm/rmap: abstract total mapcount operations for partially-mappable folios

Let's prepare for doing additional accounting whenever modifying the total
mapcount of partially-mappable (!hugetlb) folios. Pass the VMA as well.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 41 ++++++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 23 ++++++++++++-----------
2 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6cb497f6feab..9d5c2ed6ced5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -168,6 +168,39 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,

struct anon_vma *folio_get_anon_vma(struct folio *folio);

+static inline void folio_set_large_mapcount(struct folio *folio,
+ int count, struct vm_area_struct *vma)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ /* increment count (starts at -1) */
+ atomic_set(&folio->_total_mapcount, count - 1);
+}
+
+static inline void folio_inc_large_mapcount(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ atomic_inc(&folio->_total_mapcount);
+}
+
+static inline void folio_add_large_mapcount(struct folio *folio,
+ int count, struct vm_area_struct *vma)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ atomic_add(count, &folio->_total_mapcount);
+}
+
+static inline void folio_dec_large_mapcount(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ atomic_dec(&folio->_total_mapcount);
+}
+
/* RMAP flags, currently only relevant for some anon rmap operations. */
typedef int __bitwise rmap_t;

@@ -219,11 +252,17 @@ static inline void __page_dup_rmap(struct page *page,
return;
}

+ if (unlikely(folio_test_hugetlb(folio))) {
+ atomic_inc(&folio->_entire_mapcount);
+ atomic_inc(&folio->_total_mapcount);
+ return;
+ }
+
if (compound)
atomic_inc(&folio->_entire_mapcount);
else
atomic_inc(&page->_mapcount);
- atomic_inc(&folio->_total_mapcount);
+ folio_inc_large_mapcount(folio, dst_vma);
}

static inline void page_dup_file_rmap(struct page *page,
diff --git a/mm/rmap.c b/mm/rmap.c
index 38765796dca8..689ad85cf87e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1105,8 +1105,8 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
}

static unsigned int __folio_add_rmap_range(struct folio *folio,
- struct page *page, unsigned int nr_pages, bool compound,
- int *nr_pmdmapped)
+ struct page *page, unsigned int nr_pages,
+ struct vm_area_struct *vma, bool compound, int *nr_pmdmapped)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
int first, count, nr = 0;
@@ -1130,7 +1130,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
nr++;
}
} while (page++, --count > 0);
- atomic_add(nr_pages, &folio->_total_mapcount);
+ folio_add_large_mapcount(folio, nr_pages, vma);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */

@@ -1148,7 +1148,7 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
nr = 0;
}
}
- atomic_inc(&folio->_total_mapcount);
+ folio_inc_large_mapcount(folio, vma);
} else {
VM_WARN_ON_ONCE_FOLIO(true, folio);
}
@@ -1258,7 +1258,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
unsigned int nr, nr_pmdmapped = 0;
bool compound = flags & RMAP_COMPOUND;

- nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped);
+ nr = __folio_add_rmap_range(folio, page, 1, vma, compound,
+ &nr_pmdmapped);
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped);
if (nr)
@@ -1329,8 +1330,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
}

if (folio_test_large(folio))
- /* increment count (starts at -1) */
- atomic_set(&folio->_total_mapcount, 0);
+ folio_set_large_mapcount(folio, 1, vma);

__lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
__folio_set_anon(folio, vma, address, true);
@@ -1355,7 +1355,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
{
unsigned int nr, nr_pmdmapped = 0;

- nr = __folio_add_rmap_range(folio, page, nr_pages, compound,
+ nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound,
&nr_pmdmapped);
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
@@ -1411,16 +1411,17 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,

VM_BUG_ON_PAGE(compound && !PageHead(page), page);

- if (folio_test_large(folio))
- atomic_dec(&folio->_total_mapcount);
-
/* Hugetlb pages are not counted in NR_*MAPPED */
if (unlikely(folio_test_hugetlb(folio))) {
/* hugetlb pages are always mapped with pmds */
atomic_dec(&folio->_entire_mapcount);
+ atomic_dec(&folio->_total_mapcount);
return;
}

+ if (folio_test_large(folio))
+ folio_dec_large_mapcount(folio, vma);
+
/* Is page being unmapped by PTE? Is this its last map to be removed? */
if (likely(!compound)) {
last = atomic_add_negative(-1, &page->_mapcount);
--
2.41.0

2023-11-24 13:28:17

by David Hildenbrand

Subject: [PATCH WIP v1 03/20] mm: convert folio_estimated_sharers() to folio_mapped_shared() and improve it

Callers of folio_estimated_sharers() only care about "mapped shared vs.
mapped exclusively". Let's rename the function and improve our detection
for partially-mappable folios (i.e., PTE-mapped THPs).

For now we can only implement, based on our guess, "certainly mapped
shared vs. maybe mapped exclusively". Ideally, we'd have something like
"maybe mapped shared vs. certainly mapped exclusive" -- or even better
"certainly mapped shared vs. certainly mapped exclusively" instead. But
these semantics are currently impossible with the guess-based heuristic
we apply for partially-mappable folios.

Naming the function "folio_certainly_mapped_shared" could be possible,
but let's just keep it simple and call it "folio_mapped_shared" and
document the fuzziness that applies for now.

As we can now read the total mapcount of large folios very efficiently,
use that to improve our implementation, falling back to making a guess only
in case the folio is not "obviously mapped shared".

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 68 +++++++++++++++++++++++++++++++++++++++-------
mm/huge_memory.c | 2 +-
mm/madvise.c | 6 ++--
mm/memory.c | 2 +-
mm/mempolicy.c | 14 ++++------
mm/migrate.c | 2 +-
6 files changed, 70 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fe91aaefa3db..17dac913f367 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2114,21 +2114,69 @@ static inline size_t folio_size(struct folio *folio)
}

/**
- * folio_estimated_sharers - Estimate the number of sharers of a folio.
+ * folio_mapped_shared - Report if a folio is certainly mapped by
+ * multiple entities in their page tables
* @folio: The folio.
*
- * folio_estimated_sharers() aims to serve as a function to efficiently
- * estimate the number of processes sharing a folio. This is done by
- * looking at the precise mapcount of the first subpage in the folio, and
- * assuming the other subpages are the same. This may not be true for large
- * folios. If you want exact mapcounts for exact calculations, look at
- * page_mapcount() or folio_mapcount().
+ * This function checks if a folio is certainly *currently* mapped by
+ * multiple entities in their page table ("mapped shared") or if the folio
+ * may be mapped exclusively by a single entity ("mapped exclusively").
*
- * Return: The estimated number of processes sharing a folio.
+ * Usually, we consider a single entity to be a single MM. However, some
+ * folios (KSM, pagecache) can be mapped multiple times into the same MM.
+ *
+ * For KSM folios, each individual page table mapping is considered a
+ * separate entity. So if a KSM folio is mapped multiple times into the
+ * same process, it is considered "mapped shared".
+ *
+ * For pagecache folios that are entirely mapped multiple times into the
+ * same MM (i.e., multiple VMAs in the same MM cover the same
+ * file range), we traditionally (and for simplicity) consider them,
+ * "mapped shared". For partially-mapped folios (e..g, PTE-mapped THP), we
+ * might detect them either as "mapped shared" or "mapped exclusively" --
+ * whatever is simpler.
+ *
+ * For small folios and entirely mapped large folios (e.g., hugetlb,
+ * PMD-mapped PMD-sized THP), the result will be exactly correct.
+ *
+ * For all other (partially-mappable) folios, such as PTE-mapped THP, the
+ * return value is partially fuzzy: true is not fuzzy, because it means
+ * "certainly mapped shared", but false means "maybe mapped exclusively".
+ *
+ * Note that this function only considers *current* page table mappings
+ * tracked via rmap -- that properly adjusts the folio mapcount(s) -- and
+ * does not consider:
+ * (1) any way the folio might get mapped in the (near) future (e.g.,
+ * swapcache, pagecache, temporary unmapping for migration).
+ * (2) any way a folio might be mapped besides using the rmap (PFN mappings).
+ * (3) any form of page table sharing.
+ *
+ * Return: Whether the folio is certainly mapped by multiple entities.
*/
-static inline int folio_estimated_sharers(struct folio *folio)
+static inline bool folio_mapped_shared(struct folio *folio)
{
- return page_mapcount(folio_page(folio, 0));
+ unsigned int total_mapcount;
+
+ if (likely(!folio_test_large(folio)))
+ return atomic_read(&folio->page._mapcount) != 0;
+ total_mapcount = folio_total_mapcount(folio);
+
+ /* A single mapping implies "mapped exclusively". */
+ if (total_mapcount == 1)
+ return false;
+
+ /* If there is an entire mapping, it must be the only mapping. */
+ if (folio_entire_mapcount(folio) || unlikely(folio_test_hugetlb(folio)))
+ return total_mapcount != 1;
+ /*
+ * Partially-mappable folios are tricky ... but some are "obviously
+ * mapped shared": if we have more (PTE) mappings than we have pages
+ * in the folio, some other entity is certainly involved.
+ */
+ if (total_mapcount > folio_nr_pages(folio))
+ return true;
+ /* ... guess based on the mapcount of the first page of the folio. */
+ return atomic_read(&folio->page._mapcount) > 0;
}

#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..874eeeb90e0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1638,7 +1638,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If other processes are mapping this folio, we couldn't discard
* the folio unless they all do MADV_FREE so let's skip the folio.
*/
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
goto out;

if (!folio_trylock(folio))
diff --git a/mm/madvise.c b/mm/madvise.c
index cf4d694280e9..1a82867c8c2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));

/* Do not interfere with other mappings of this folio */
- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
goto huge_unlock;

if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
@@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;

- if (folio_estimated_sharers(folio) != 1)
+ if (folio_mapped_shared(folio))
break;
if (!folio_trylock(folio))
break;
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..6bcfa763a146 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Flag if the folio is shared between multiple address spaces. This
* is later used when determining whether to group tasks together
*/
- if (folio_estimated_sharers(folio) > 1 && (vma->vm_flags & VM_SHARED))
+ if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
flags |= TNF_SHARED;

nid = folio_nid(folio);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..0492113497cc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -605,12 +605,11 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_mapped_shared() on possible imprecision when we cannot
+ * easily detect if a folio is shared.
*/
if ((flags & MPOL_MF_MOVE_ALL) ||
- (folio_estimated_sharers(folio) == 1 && !hugetlb_pmd_shared(pte)))
+ (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
if (!isolate_hugetlb(folio, qp->pagelist))
qp->nr_failed++;
unlock:
@@ -988,11 +987,10 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
* Choosing not to migrate a shared folio is not counted as a failure.
*
- * To check if the folio is shared, ideally we want to make sure
- * every page is mapped to the same process. Doing that is very
- * expensive, so check the estimated sharers of the folio instead.
+ * See folio_mapped_shared() on possible imprecision when we cannot
+ * easily detect if a folio is shared.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || folio_estimated_sharers(folio) == 1) {
+ if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) {
if (folio_isolate_lru(folio)) {
list_add_tail(&folio->lru, foliolist);
node_stat_mod_folio(folio,
diff --git a/mm/migrate.c b/mm/migrate.c
index 35a88334bb3c..fda41bc09903 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
* every page is mapped to the same process. Doing that is very
* expensive, so check the estimated mapcount of the folio instead.
*/
- if (folio_estimated_sharers(folio) != 1 && folio_is_file_lru(folio) &&
+ if (folio_mapped_shared(folio) && folio_is_file_lru(folio) &&
(vma->vm_flags & VM_EXEC))
goto out;

--
2.41.0

2023-11-24 13:28:25

by David Hildenbrand

Subject: [PATCH WIP v1 08/20] mm: pass MM to folio_mapped_shared()

We'll need the MM next to make a better decision regarding
partially-mappable folios (e.g., PTE-mapped THP) using per-MM rmap IDs.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 4 +++-
mm/huge_memory.c | 2 +-
mm/madvise.c | 6 +++---
mm/memory.c | 2 +-
mm/mempolicy.c | 14 +++++++-------
mm/migrate.c | 2 +-
6 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17dac913f367..765e688690f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2117,6 +2117,7 @@ static inline size_t folio_size(struct folio *folio)
* folio_mapped_shared - Report if a folio is certainly mapped by
* multiple entities in their page tables
* @folio: The folio.
+ * @mm: The mm the folio is mapped into.
*
* This function checks if a folio is certainly *currently* mapped by
* multiple entities in their page table ("mapped shared") or if the folio
@@ -2153,7 +2154,8 @@ static inline size_t folio_size(struct folio *folio)
*
* Return: Whether the folio is certainly mapped by multiple entities.
*/
-static inline bool folio_mapped_shared(struct folio *folio)
+static inline bool folio_mapped_shared(struct folio *folio,
+ struct mm_struct *mm)
{
unsigned int total_mapcount;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0228b04c4053..fd7251923557 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1639,7 +1639,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* If other processes are mapping this folio, we couldn't discard
* the folio unless they all do MADV_FREE so let's skip the folio.
*/
- if (folio_mapped_shared(folio))
+ if (folio_mapped_shared(folio, mm))
goto out;

if (!folio_trylock(folio))
diff --git a/mm/madvise.c b/mm/madvise.c
index 1a82867c8c2e..e3e4f3ea5f6d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -365,7 +365,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio = pfn_folio(pmd_pfn(orig_pmd));

/* Do not interfere with other mappings of this folio */
- if (folio_mapped_shared(folio))
+ if (folio_mapped_shared(folio, mm))
goto huge_unlock;

if (pageout_anon_only_filter && !folio_test_anon(folio))
@@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

- if (folio_mapped_shared(folio))
+ if (folio_mapped_shared(folio, mm))
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
@@ -665,7 +665,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (folio_test_large(folio)) {
int err;

- if (folio_mapped_shared(folio))
+ if (folio_mapped_shared(folio, mm))
break;
if (!folio_trylock(folio))
break;
diff --git a/mm/memory.c b/mm/memory.c
index 14416d05e1b6..5048d58d6174 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4848,7 +4848,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Flag if the folio is shared between multiple address spaces. This
* is later used when determining whether to group tasks together
*/
- if (folio_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
+ if (folio_mapped_shared(folio, vma->vm_mm) && (vma->vm_flags & VM_SHARED))
flags |= TNF_SHARED;

nid = folio_nid(folio);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0492113497cc..bd0243da26bf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -418,7 +418,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
};

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
- unsigned long flags);
+ struct mm_struct *mm, unsigned long flags);
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
pgoff_t ilx, int *nid);

@@ -481,7 +481,7 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
return;
if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
!vma_migratable(walk->vma) ||
- !migrate_folio_add(folio, qp->pagelist, qp->flags))
+ !migrate_folio_add(folio, qp->pagelist, walk->mm, qp->flags))
qp->nr_failed++;
}

@@ -561,7 +561,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
}
if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
!vma_migratable(vma) ||
- !migrate_folio_add(folio, qp->pagelist, flags)) {
+ !migrate_folio_add(folio, qp->pagelist, walk->mm, flags)) {
qp->nr_failed++;
if (strictly_unmovable(flags))
break;
@@ -609,7 +609,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
* easily detect if a folio is shared.
*/
if ((flags & MPOL_MF_MOVE_ALL) ||
- (!folio_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
+ (!folio_mapped_shared(folio, walk->mm) && !hugetlb_pmd_shared(pte)))
if (!isolate_hugetlb(folio, qp->pagelist))
qp->nr_failed++;
unlock:
@@ -981,7 +981,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,

#ifdef CONFIG_MIGRATION
static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
- unsigned long flags)
+ struct mm_struct *mm, unsigned long flags)
{
/*
* Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
@@ -990,7 +990,7 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
* See folio_mapped_shared() on possible imprecision when we cannot
* easily detect if a folio is shared.
*/
- if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio)) {
+ if ((flags & MPOL_MF_MOVE_ALL) || !folio_mapped_shared(folio, mm)) {
if (folio_isolate_lru(folio)) {
list_add_tail(&folio->lru, foliolist);
node_stat_mod_folio(folio,
@@ -1195,7 +1195,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
#else

static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
- unsigned long flags)
+ struct mm_struct *mm, unsigned long flags)
{
return false;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 341a84c3e8e4..8a1d75ff2dc6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2559,7 +2559,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
* every page is mapped to the same process. Doing that is very
* expensive, so check the estimated mapcount of the folio instead.
*/
- if (folio_mapped_shared(folio) && folio_is_file_lru(folio) &&
+ if (folio_mapped_shared(folio, vma->vm_mm) && folio_is_file_lru(folio) &&
(vma->vm_flags & VM_EXEC))
goto out;

--
2.41.0

2023-11-24 13:28:49

by David Hildenbrand

Subject: [PATCH WIP v1 10/20] mm/memory: COW reuse support for PTE-mapped THP with rmap IDs

For now, we only end up reusing small folios and PMD-mapped large folios
(i.e., THP) after fork(); PTE-mapped THPs are never reused, except when
only a single page of the folio remains mapped. Instead, we end up copying
each subpage even though the THP might be exclusive to the MM.

The logic we're using for small folios and PMD-mapped THPs is the
following: Is the only reference to the folio from a single page table
mapping? Then:
(a) There are no other references to the folio from other MMs
(e.g., page table mapping, GUP)
(b) There are no other references to the folio from page migration/
swapout/swapcache that might temporarily unmap the folio.

Consequently, the folio is exclusive to that process and can be reused.
In that case, we end up with folio_refcount(folio) == 1 and an implied
folio_mapcount(folio) == 1, while holding the page table lock and the
page lock to protect against possible races.

For PTE-mapped THP, however, we have not one, but multiple references
from page tables, whereby such THPs can be mapped into multiple
page tables in the MM.

Reusing the logic that we use for small folios and PMD-mapped THPs means
that, when reusing a PTE-mapped THP, we want to make sure that:
(1) All folio references are from page table mappings.
(2) All page table mappings belong to the same MM.
(3) We didn't race with (un)mapping of the page related to other page
tables, such that the mapcount and refcount are stable.

For (1), we can check
folio_refcount(folio) == folio_mapcount(folio)
For (2) and (3), we can use our new rmap ID infrastructure.

We won't bother with the swapcache and LRU cache for now. Add some sanity
checks under CONFIG_DEBUG_VM, to identify any obvious problems early.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 5048d58d6174..fb533995ff68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3360,6 +3360,95 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
static bool wp_can_reuse_anon_folio(struct folio *folio,
struct vm_area_struct *vma)
{
+#ifdef CONFIG_RMAP_ID
+ if (folio_test_large(folio)) {
+ bool retried = false;
+ unsigned long start;
+ int mapcount, i;
+
+ /*
+ * The assumption for anonymous folios is that each page can
+ * only get mapped once into a MM. This also holds for
+ * small folios -- except when KSM is involved. KSM does
+ * currently not apply to large folios.
+ *
+ * Further, each taken mapcount must be paired with exactly one
+ * taken reference, whereby references must be incremented
+ * before the mapcount when mapping a page, and references must
+ * be decremented after the mapcount when unmapping a page.
+ *
+ * So if all references to a folio are from mappings, and all
+ * mappings are due to our (MM) page tables, and there was no
+ * concurrent (un)mapping, this folio is certainly exclusive.
+ *
+ * We currently don't optimize for:
+ * (a) folio is mapped into multiple page tables in this
+ * MM (e.g., mremap) and other page tables are
+ * concurrently (un)mapping the folio.
+ * (b) the folio is in the swapcache. Likely the other PTEs
+ * are still swap entries and folio_free_swap() would fail.
+ * (c) the folio is in the LRU cache.
+ */
+retry:
+ start = raw_read_atomic_seqcount(&folio->_rmap_atomic_seqcount);
+ if (start & ATOMIC_SEQCOUNT_WRITERS_MASK)
+ return false;
+ mapcount = folio_mapcount(folio);
+
+ /* Is this folio possibly exclusive ... */
+ if (mapcount > folio_nr_pages(folio) || folio_entire_mapcount(folio))
+ return false;
+
+ /* ... and are all references from mappings ... */
+ if (folio_ref_count(folio) != mapcount)
+ return false;
+
+ /* ... and do all mappings belong to us ... */
+ if (!__folio_has_large_matching_rmap_val(folio, mapcount, vma->vm_mm))
+ return false;
+
+ /* ... and was there no concurrent (un)mapping ? */
+ if (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
+ start))
+ return false;
+
+ /* Safety checks we might want to drop in the future. */
+ if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+ unsigned int mapcount;
+
+ if (WARN_ON_ONCE(folio_test_ksm(folio)))
+ return false;
+ /*
+ * We might have raced against swapout code adding
+ * the folio to the swapcache (which, by itself, is not
+ * problematic). Let's simply check again if we would
+ * properly detect the additional reference now and
+ * properly fail.
+ */
+ if (unlikely(folio_test_swapcache(folio))) {
+ if (WARN_ON_ONCE(retried))
+ return false;
+ retried = true;
+ goto retry;
+ }
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ mapcount = page_mapcount(folio_page(folio, i));
+ if (WARN_ON_ONCE(mapcount > 1))
+ return false;
+ }
+ }
+
+ /*
+ * This folio is exclusive to us. Do we need the page lock?
+ * Likely not, and a trylock would be unfortunate if this
+ * folio is mapped into multiple page tables and we get
+ * concurrent page faults. If there were references from
+ * page migration/swapout/swapcache, we would have detected
+ * an additional reference and never ended up here.
+ */
+ return true;
+ }
+#endif /* CONFIG_RMAP_ID */
/*
* We have to verify under folio lock: these early checks are
* just an optimization to avoid locking the folio and freeing
--
2.41.0

2023-11-24 13:28:54

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 12/20] mm/rmap: introduce folio_add_anon_rmap_range()

There are probably ways to have an even cleaner interface (e.g.,
pass the mapping granularity instead of "compound"). For now, let's
handle it like folio_add_file_rmap_range().

Use separate loops for handling the "SetPageAnonExclusive()" case and
performing debug checks. The latter should get optimized out automatically
without CONFIG_DEBUG_VM.

We'll use this function to batch rmap operations when PTE-remapping a
PMD-mapped THP next.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 3 ++
mm/rmap.c | 69 +++++++++++++++++++++++++++++++++-----------
2 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 39aeab457f4a..76e6fb1dad5c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -393,6 +393,9 @@ typedef int __bitwise rmap_t;
* rmap interfaces called when adding or removing pte of page
*/
void folio_move_anon_rmap(struct folio *, struct vm_area_struct *);
+void folio_add_anon_rmap_range(struct folio *, struct page *,
+ unsigned int nr_pages, struct vm_area_struct *,
+ unsigned long address, rmap_t flags);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/rmap.c b/mm/rmap.c
index 689ad85cf87e..da7fa46a18fc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1240,25 +1240,29 @@ static void __page_check_anon_rmap(struct folio *folio, struct page *page,
}

/**
- * page_add_anon_rmap - add pte mapping to an anonymous page
- * @page: the page to add the mapping to
- * @vma: the vm area in which the mapping is added
- * @address: the user virtual address mapped
- * @flags: the rmap flags
+ * folio_add_anon_rmap_range - add mappings to a page range of an anon folio
+ * @folio: The folio to add the mapping to
+ * @page: The first page to add
+ * @nr_pages: The number of pages which will be mapped
+ * @vma: The vm area in which the mapping is added
+ * @address: The user virtual address of the first page to map
+ * @flags: The rmap flags
+ *
+ * The page range of folio is defined by [first_page, first_page + nr_pages)
*
* The caller needs to hold the pte lock, and the page must be locked in
* the anon_vma case: to serialize mapping,index checking after setting,
- * and to ensure that PageAnon is not being upgraded racily to PageKsm
- * (but PageKsm is never downgraded to PageAnon).
+ * and to ensure that an anon folio is not being upgraded racily to a KSM folio
+ * (but KSM folios are never downgraded).
*/
-void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
+void folio_add_anon_rmap_range(struct folio *folio, struct page *page,
+ unsigned int nr_pages, struct vm_area_struct *vma,
unsigned long address, rmap_t flags)
{
- struct folio *folio = page_folio(page);
- unsigned int nr, nr_pmdmapped = 0;
+ unsigned int i, nr, nr_pmdmapped = 0;
bool compound = flags & RMAP_COMPOUND;

- nr = __folio_add_rmap_range(folio, page, 1, vma, compound,
+ nr = __folio_add_rmap_range(folio, page, nr_pages, vma, compound,
&nr_pmdmapped);
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped);
@@ -1279,12 +1283,20 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
} else if (likely(!folio_test_ksm(folio))) {
__page_check_anon_rmap(folio, page, vma, address);
}
- if (flags & RMAP_EXCLUSIVE)
- SetPageAnonExclusive(page);
- /* While PTE-mapping a THP we have a PMD and a PTE mapping. */
- VM_WARN_ON_FOLIO((atomic_read(&page->_mapcount) > 0 ||
- (folio_test_large(folio) && folio_entire_mapcount(folio) > 1)) &&
- PageAnonExclusive(page), folio);
+
+ if (flags & RMAP_EXCLUSIVE) {
+ for (i = 0; i < nr_pages; i++)
+ SetPageAnonExclusive(page + i);
+ }
+ for (i = 0; i < nr_pages; i++) {
+ struct page *cur_page = page + i;
+
+ /* While PTE-mapping a THP we have a PMD and a PTE mapping. */
+ VM_WARN_ON_FOLIO((atomic_read(&cur_page->_mapcount) > 0 ||
+ (folio_test_large(folio) &&
+ folio_entire_mapcount(folio) > 1)) &&
+ PageAnonExclusive(cur_page), folio);
+ }

/*
* For large folio, only mlock it if it's fully mapped to VMA. It's
@@ -1296,6 +1308,29 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
mlock_vma_folio(folio, vma);
}

+/**
+ * page_add_anon_rmap - add mappings to an anonymous page
+ * @page: The page to add the mapping to
+ * @vma: The vm area in which the mapping is added
+ * @address: The user virtual address of the page to map
+ * @flags: The rmap flags
+ *
+ * See folio_add_anon_rmap_range().
+ */
+void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, rmap_t flags)
+{
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages;
+
+ if (likely(!(flags & RMAP_COMPOUND)))
+ nr_pages = 1;
+ else
+ nr_pages = folio_nr_pages(folio);
+
+ folio_add_anon_rmap_range(folio, page, nr_pages, vma, address, flags);
+}
+
/**
* folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
* @folio: The folio to add the mapping to.
--
2.41.0

2023-11-24 13:28:54

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 01/20] mm/rmap: factor out adding folio range into __folio_add_rmap_range()

Let's factor it out, optimize for small folios, and add some more sanity
checks.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/rmap.c | 119 ++++++++++++++++++++++++------------------------------
1 file changed, 53 insertions(+), 66 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7a27a2b41802..afddf3d82a8f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1127,6 +1127,54 @@ int folio_total_mapcount(struct folio *folio)
return mapcount;
}

+static unsigned int __folio_add_rmap_range(struct folio *folio,
+ struct page *page, unsigned int nr_pages, bool compound,
+ int *nr_pmdmapped)
+{
+ atomic_t *mapped = &folio->_nr_pages_mapped;
+ int first, nr = 0;
+
+ VM_WARN_ON_FOLIO(compound && page != &folio->page, folio);
+ VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
+ VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio);
+
+ if (likely(!folio_test_large(folio)))
+ return atomic_inc_and_test(&page->_mapcount);
+
+ /* Is page being mapped by PTE? Is this its first map to be added? */
+ if (!compound) {
+ do {
+ first = atomic_inc_and_test(&page->_mapcount);
+ if (first) {
+ first = atomic_inc_return_relaxed(mapped);
+ if (first < COMPOUND_MAPPED)
+ nr++;
+ }
+ } while (page++, --nr_pages > 0);
+ } else if (folio_test_pmd_mappable(folio)) {
+ /* That test is redundant: it's for safety or to optimize out */
+
+ first = atomic_inc_and_test(&folio->_entire_mapcount);
+ if (first) {
+ nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
+ if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
+ *nr_pmdmapped = folio_nr_pages(folio);
+ nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ /* Raced ahead of a remove and another add? */
+ if (unlikely(nr < 0))
+ nr = 0;
+ } else {
+ /* Raced ahead of a remove of COMPOUND_MAPPED */
+ nr = 0;
+ }
+ }
+ } else {
+ VM_WARN_ON_ONCE_FOLIO(true, folio);
+ }
+ return nr;
+}
+
/**
* folio_move_anon_rmap - move a folio to our anon_vma
* @folio: The folio to move to our anon_vma
@@ -1227,38 +1275,10 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
unsigned long address, rmap_t flags)
{
struct folio *folio = page_folio(page);
- atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
+ unsigned int nr, nr_pmdmapped = 0;
bool compound = flags & RMAP_COMPOUND;
- bool first;
-
- /* Is page being mapped by PTE? Is this its first map to be added? */
- if (likely(!compound)) {
- first = atomic_inc_and_test(&page->_mapcount);
- nr = first;
- if (first && folio_test_large(folio)) {
- nr = atomic_inc_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
- } else if (folio_test_pmd_mappable(folio)) {
- /* That test is redundant: it's for safety or to optimize out */
-
- first = atomic_inc_and_test(&folio->_entire_mapcount);
- if (first) {
- nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
- if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
- nr_pmdmapped = folio_nr_pages(folio);
- nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
- /* Raced ahead of a remove and another add? */
- if (unlikely(nr < 0))
- nr = 0;
- } else {
- /* Raced ahead of a remove of COMPOUND_MAPPED */
- nr = 0;
- }
- }
- }

+ nr = __folio_add_rmap_range(folio, page, 1, compound, &nr_pmdmapped);
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped);
if (nr)
@@ -1349,43 +1369,10 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
unsigned int nr_pages, struct vm_area_struct *vma,
bool compound)
{
- atomic_t *mapped = &folio->_nr_pages_mapped;
- unsigned int nr_pmdmapped = 0, first;
- int nr = 0;
-
- VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
-
- /* Is page being mapped by PTE? Is this its first map to be added? */
- if (likely(!compound)) {
- do {
- first = atomic_inc_and_test(&page->_mapcount);
- if (first && folio_test_large(folio)) {
- first = atomic_inc_return_relaxed(mapped);
- first = (first < COMPOUND_MAPPED);
- }
-
- if (first)
- nr++;
- } while (page++, --nr_pages > 0);
- } else if (folio_test_pmd_mappable(folio)) {
- /* That test is redundant: it's for safety or to optimize out */
-
- first = atomic_inc_and_test(&folio->_entire_mapcount);
- if (first) {
- nr = atomic_add_return_relaxed(COMPOUND_MAPPED, mapped);
- if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
- nr_pmdmapped = folio_nr_pages(folio);
- nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
- /* Raced ahead of a remove and another add? */
- if (unlikely(nr < 0))
- nr = 0;
- } else {
- /* Raced ahead of a remove of COMPOUND_MAPPED */
- nr = 0;
- }
- }
- }
+ unsigned int nr, nr_pmdmapped = 0;

+ nr = __folio_add_rmap_range(folio, page, nr_pages, compound,
+ &nr_pmdmapped);
if (nr_pmdmapped)
__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr_pmdmapped);
--
2.41.0

2023-11-24 13:28:56

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 13/20] mm/huge_memory: batch rmap operations in __split_huge_pmd_locked()

Let's batch the rmap operations, in preparation for making individual
page_add_anon_rmap() calls more expensive.

While at it, use more folio operations (but only in the code branch we're
touching), use VM_WARN_ON_FOLIO(), and pass RMAP_EXCLUSIVE instead of
manually setting PageAnonExclusive.

We should never see non-anon pages on that branch: otherwise, the
existing page_add_anon_rmap() call would have been flawed already.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd7251923557..f47971d1afbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2100,6 +2100,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long haddr, bool freeze)
{
struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
@@ -2195,16 +2196,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
page = pmd_page(old_pmd);
+ folio = page_folio(page);
if (pmd_dirty(old_pmd)) {
dirty = true;
- SetPageDirty(page);
+ folio_set_dirty(folio);
}
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
uffd_wp = pmd_uffd_wp(old_pmd);

- VM_BUG_ON_PAGE(!page_count(page), page);
+ VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);

/*
* Without "freeze", we'll simply split the PMD, propagating the
@@ -2221,11 +2224,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*
* See page_try_share_anon_rmap(): invalidate PMD first.
*/
- anon_exclusive = PageAnon(page) && PageAnonExclusive(page);
+ anon_exclusive = PageAnonExclusive(page);
if (freeze && anon_exclusive && page_try_share_anon_rmap(page))
freeze = false;
- if (!freeze)
- page_ref_add(page, HPAGE_PMD_NR - 1);
+ if (!freeze) {
+ rmap_t rmap_flags = RMAP_NONE;
+
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (anon_exclusive)
+ rmap_flags = RMAP_EXCLUSIVE;
+ folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR,
+ vma, haddr, rmap_flags);
+ }
}

/*
@@ -2268,8 +2278,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
if (write)
entry = pte_mkwrite(entry, vma);
- if (anon_exclusive)
- SetPageAnonExclusive(page + i);
if (!young)
entry = pte_mkold(entry);
/* NOTE: this may set soft-dirty too on some archs */
@@ -2279,7 +2287,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_mksoft_dirty(entry);
if (uffd_wp)
entry = pte_mkuffd_wp(entry);
- page_add_anon_rmap(page + i, vma, addr, RMAP_NONE);
}
VM_BUG_ON(!pte_none(ptep_get(pte)));
set_pte_at(mm, addr, pte, entry);
--
2.41.0

2023-11-24 13:29:04

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 04/20] mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and page_dup_file_rmap()

We'll need access to the destination MM when modifying the total mapcount
of a partially-mappable folio next. So pass in the destination VMA for
consistency.

While at it, change the parameter order for page_try_dup_anon_rmap() such
that the "bool compound" parameter is last, to match the other rmap
functions.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 21 +++++++++++++--------
mm/huge_memory.c | 2 +-
mm/hugetlb.c | 9 +++++----
mm/memory.c | 6 +++---
mm/migrate.c | 2 +-
5 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 42e2c74d4d6e..6cb497f6feab 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -208,7 +208,8 @@ void hugepage_add_anon_rmap(struct folio *, struct vm_area_struct *,
void hugepage_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address);

-static inline void __page_dup_rmap(struct page *page, bool compound)
+static inline void __page_dup_rmap(struct page *page,
+ struct vm_area_struct *dst_vma, bool compound)
{
struct folio *folio = page_folio(page);

@@ -225,17 +226,19 @@ static inline void __page_dup_rmap(struct page *page, bool compound)
atomic_inc(&folio->_total_mapcount);
}

-static inline void page_dup_file_rmap(struct page *page, bool compound)
+static inline void page_dup_file_rmap(struct page *page,
+ struct vm_area_struct *dst_vma, bool compound)
{
- __page_dup_rmap(page, compound);
+ __page_dup_rmap(page, dst_vma, compound);
}

/**
* page_try_dup_anon_rmap - try duplicating a mapping of an already mapped
* anonymous page
* @page: the page to duplicate the mapping for
+ * @dst_vma: the destination vma
+ * @src_vma: the source vma
* @compound: the page is mapped as compound or as a small page
- * @vma: the source vma
*
* The caller needs to hold the PT lock and the vma->vma_mm->write_protect_seq.
*
@@ -247,8 +250,10 @@ static inline void page_dup_file_rmap(struct page *page, bool compound)
*
* Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise.
*/
-static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
- struct vm_area_struct *vma)
+static inline int page_try_dup_anon_rmap(struct page *page,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma,
+ bool compound)
{
VM_BUG_ON_PAGE(!PageAnon(page), page);

@@ -267,7 +272,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
* future on write faults.
*/
if (likely(!is_device_private_page(page) &&
- unlikely(page_needs_cow_for_dma(vma, page))))
+ unlikely(page_needs_cow_for_dma(src_vma, page))))
return -EBUSY;

ClearPageAnonExclusive(page);
@@ -276,7 +281,7 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound,
* the page R/O into both processes.
*/
dup:
- __page_dup_rmap(page, compound);
+ __page_dup_rmap(page, dst_vma, compound);
return 0;
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 874eeeb90e0b..51a878efca0e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1166,7 +1166,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
VM_BUG_ON_PAGE(!PageHead(src_page), src_page);

get_page(src_page);
- if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) {
+ if (unlikely(page_try_dup_anon_rmap(src_page, dst_vma, src_vma, true))) {
/* Page maybe pinned: split and retry the fault on PTEs. */
put_page(src_page);
pte_free(dst_mm, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cf84784064c7..1ddef4082cad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5401,9 +5401,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* sleep during the process.
*/
if (!folio_test_anon(pte_folio)) {
- page_dup_file_rmap(&pte_folio->page, true);
+ page_dup_file_rmap(&pte_folio->page, dst_vma,
+ true);
} else if (page_try_dup_anon_rmap(&pte_folio->page,
- true, src_vma)) {
+ dst_vma, src_vma, true)) {
pte_t src_pte_old = entry;
struct folio *new_folio;

@@ -6272,7 +6273,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (anon_rmap)
hugepage_add_new_anon_rmap(folio, vma, haddr);
else
- page_dup_file_rmap(&folio->page, true);
+ page_dup_file_rmap(&folio->page, vma, true);
new_pte = make_huge_pte(vma, &folio->page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
/*
@@ -6723,7 +6724,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out_release_unlock;

if (folio_in_pagecache)
- page_dup_file_rmap(&folio->page, true);
+ page_dup_file_rmap(&folio->page, dst_vma, true);
else
hugepage_add_new_anon_rmap(folio, dst_vma, dst_addr);

diff --git a/mm/memory.c b/mm/memory.c
index 6bcfa763a146..14416d05e1b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -836,7 +836,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
get_page(page);
rss[mm_counter(page)]++;
/* Cannot fail as these pages cannot get pinned. */
- BUG_ON(page_try_dup_anon_rmap(page, false, src_vma));
+ BUG_ON(page_try_dup_anon_rmap(page, dst_vma, src_vma, false));

/*
* We do not preserve soft-dirty information, because so
@@ -950,7 +950,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
* future.
*/
folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
+ if (unlikely(page_try_dup_anon_rmap(page, dst_vma, src_vma, false))) {
/* Page may be pinned, we have to copy. */
folio_put(folio);
return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
@@ -959,7 +959,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
rss[MM_ANONPAGES]++;
} else if (page) {
folio_get(folio);
- page_dup_file_rmap(page, false);
+ page_dup_file_rmap(page, dst_vma, false);
rss[mm_counter_file(page)]++;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index fda41bc09903..341a84c3e8e4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -252,7 +252,7 @@ static bool remove_migration_pte(struct folio *folio,
hugepage_add_anon_rmap(folio, vma, pvmw.address,
rmap_flags);
else
- page_dup_file_rmap(new, true);
+ page_dup_file_rmap(new, vma, true);
set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte,
psize);
} else
--
2.41.0

2023-11-24 13:29:06

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 06/20] atomic_seqcount: new (raw) seqcount variant to support concurrent writers

Assume we have a writer side that is fairly simple and only updates some
counters by adding some values:
folio->counter_a += diff_a;
folio->counter_b += diff_b;
folio->counter_c += diff_c;
...

Further, assume that our readers always want to read a consistent
set of counters. That is, they not only want to read each counter
atomically, but also get a consistent/atomic view across *all*
counters, detecting the case where there are concurrent modifications of
the counters.

Traditionally, we'd use a seqcount protected by some locking on the
writer side. The readers can run lockless, detect when there were
concurrent updates, and simply retry to re-read all values.

However, a seqcount requires serializing all writers so that there is only a
single writer at a time. Alternatives might include per-cpu
counters / local atomics, but for the target use cases, both primitives
are not applicable:

We want to store counters (2 to 7 for now, depending on the folio size) in
the "struct folio" of some larger folios (order >=2 ) whereby the counters
get adjusted whenever we (un)map part of a folio.

(a) The reader side must be able to get a consistent view of the
counters and be able to detect concurrent changes (i.e., concurrent
(un)mapping), as described above. In some cases we can simply stop
immediately if we detect any concurrent writer -- any concurrent
(un)map activity.
(b) The writer side updates the counters as described above and should
ideally run completely lockless. In many cases, we always have a
single write at a time. But in some scenarios, we can trigger
a lot of concurrent writers. We want the writer
side to be able to make progress instead of repeatedly spinning,
waiting for possibly many other writers.
(c) Space in the "struct folio" especially for smallish folios is very
limited, and the "struct page" layout imposes various restrictions
on where we can even put new data; growing the size of the
"struct page" is not desired because it can result in serious metadata
overhead and easily has performance implications (cache-line). So we
cannot place ordinary spinlocks in there (especially also because they
change their size based on lockdep and actual implementation), and the
only real alternative is a bit spinlock, which is really undesired.

If we want to allow concurrent writers, we can use atomic RMW operations
when updating the counters:
atomic_add(diff_a, &folio->counter_a);
atomic_add(diff_b, &folio->counter_b);
atomic_add(diff_c, &folio->counter_c);
...

But the existing seqcount, which lets the reader side detect concurrent
updates, is not capable of handling concurrent writers.

So let's add a new atomic seqcount for exactly that purpose. Instead of
using a single LSB in the seqcount to detect a single concurrent writer, it
uses multiple LSBs to detect multiple concurrent writers. As the
seqcount can be modified concurrently, it ends up being an atomic type. In
theory, each CPU can participate, so we have to steal quite some LSBs on
64bit. As that reduces the bits available for the actual sequence quite
drastically especially on 64bit, and there is the concern that 16bit for
the sequence might not be sufficient, just use an atomic_long_t for now.

For the use case discussed, we will place the new atomic seqcount into the
"struct folio"/"struct page", where the limitations as described above
apply. For that use case, the "raw" variant -- raw_atomic_seqcount_t -- is
required, so we only add that.

For the normal seqcount on the writer side, we have the following memory
ordering:

s->sequence++
smp_wmb();
[critical section]
smp_wmb();
s->sequence++

It's important that other CPUs don't observe stores to the sequence
to be reordered with stores in the critical section.

For the atomic seqcount, we could have similarly used:

atomic_long_add(SHARED, &s->sequence);
smp_wmb();
[critical section]
smp_wmb();
atomic_long_add(STEP - SHARED, &s->sequence);

But especially on x86_64, the atomic_long_add() already implies a full
memory barrier. So instead, we can do:

atomic_long_add(SHARED, &s->sequence);
__smp_mb__after_atomic();
[critical section]
__smp_mb__before_atomic();
atomic_long_add(STEP - SHARED, &s->sequence);

Or alternatively:

atomic_long_add_return(SHARED, &s->sequence);
[critical section]
atomic_long_add_return(STEP - SHARED, &s->sequence);

Could we use acquire-release semantics? Like the following:

atomic_long_add_return_acquire(SHARED, &s->sequence)
[critical section]
atomic_long_add_return_release(STEP - SHARED, &s->sequence)

Maybe, but (a) it would make it different from normal seqcounts,
because stores before/after the atomic_long_add_*() could now be reordered;
and (b) memory-barriers.txt might indicate that the sequence counter store
might be reordered: "For compound atomics performing both a load and a
store, ACQUIRE semantics apply only to the load and RELEASE semantics
apply only to the store portion of the operation.".

So let's keep it simple for now.

Effectively, with the atomic seqcount we end up with more atomic RMW
operations in the critical section but get no writer starvation / lock
contention in return.

We'll limit the implementation to !PREEMPT_RT and disallow
readers/writers from interrupt context.
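
For illustration, here is a minimal usage sketch of the new primitives,
using a hypothetical structure with one atomic seqcount and two counters
(not taken from this series):

  struct counters {
          raw_atomic_seqcount_t seqcount;
          atomic_long_t counter_a, counter_b;
  };

  /* Writer: counters are updated using RMW atomics, so writers may run concurrently. */
  static void counters_add(struct counters *c, long diff_a, long diff_b)
  {
          raw_write_atomic_seqcount_begin(&c->seqcount);
          atomic_long_add(diff_a, &c->counter_a);
          atomic_long_add(diff_b, &c->counter_b);
          raw_write_atomic_seqcount_end(&c->seqcount);
  }

  /* Reader: retry until a consistent snapshot of both counters was observed. */
  static void counters_read(struct counters *c, long *a, long *b)
  {
          unsigned long seq;

          do {
                  seq = raw_read_atomic_seqcount_begin(&c->seqcount);
                  *a = atomic_long_read(&c->counter_a);
                  *b = atomic_long_read(&c->counter_b);
          } while (raw_read_atomic_seqcount_retry(&c->seqcount, seq));
  }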

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/atomic_seqcount.h | 170 ++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 11 +++
2 files changed, 181 insertions(+)
create mode 100644 include/linux/atomic_seqcount.h

diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h
new file mode 100644
index 000000000000..109447b663a1
--- /dev/null
+++ b/include/linux/atomic_seqcount.h
@@ -0,0 +1,170 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __LINUX_ATOMIC_SEQCOUNT_H
+#define __LINUX_ATOMIC_SEQCOUNT_H
+
+#include <linux/compiler.h>
+#include <linux/threads.h>
+#include <linux/preempt.h>
+
+/*
+ * raw_atomic_seqcount_t -- a reader-writer consistency mechanism with
+ * lockless readers (read-only retry loops), and lockless writers.
+ * The writers must use atomic RMW operations in the critical section.
+ *
+ * This locking mechanism is applicable when all individual operations
+ * performed by writers can be expressed using atomic RMW operations
+ * (so they can run lockless) and readers only need a way to get an atomic
+ * view over all individual atomic values: like writers atomically updating
+ * multiple counters, and readers wanting to observe a consistent state
+ * across all these counters.
+ *
+ * For now, only the raw variant is implemented, that doesn't perform any
+ * lockdep checks.
+ *
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Author(s): David Hildenbrand <[email protected]>
+ */
+
+typedef struct raw_atomic_seqcount {
+ atomic_long_t sequence;
+} raw_atomic_seqcount_t;
+
+#define raw_seqcount_init(s) atomic_long_set(&((s)->sequence), 0)
+
+#ifdef CONFIG_64BIT
+
+#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x0000000000000001ul
+/* 65536 CPUs */
+#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul
+#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful
+#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful
+/* We have 48bit for the actual sequence. */
+#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul
+
+#else /* CONFIG_64BIT */
+
+#define ATOMIC_SEQCOUNT_SHARED_WRITER 0x00000001ul
+/* 64 CPUs */
+#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul
+#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful
+#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful
+/* We have 25bit for the actual sequence. */
+#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul
+
+#endif /* CONFIG_64BIT */
+
+#if CONFIG_NR_CPUS > ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX
+#error "raw_atomic_seqcount_t does not support such large CONFIG_NR_CPUS"
+#endif
+
+/**
+ * raw_read_atomic_seqcount() - read the raw_atomic_seqcount_t counter value
+ * @s: Pointer to the raw_atomic_seqcount_t
+ *
+ * raw_read_atomic_seqcount() opens a read critical section of the given
+ * raw_atomic_seqcount_t, without checking or masking the sequence counter
+ * LSBs (using ATOMIC_SEQCOUNT_WRITERS_MASK). Calling code is responsible for
+ * handling that.
+ *
+ * Return: count to be passed to raw_read_atomic_seqcount_retry()
+ */
+static inline unsigned long raw_read_atomic_seqcount(raw_atomic_seqcount_t *s)
+{
+ unsigned long seq = atomic_long_read(&s->sequence);
+
+ /* Read the sequence before anything in the critical section */
+ smp_rmb();
+ return seq;
+}
+
+/**
+ * raw_read_atomic_seqcount_begin() - begin a raw_atomic_seqcount_t read section
+ * @s: Pointer to the raw_atomic_seqcount_t
+ *
+ * raw_read_atomic_seqcount_begin() opens a read critical section of the
+ * given raw_atomic_seqcount_t. This function must not be used in interrupt context.
+ *
+ * Return: count to be passed to raw_read_atomic_seqcount_retry()
+ */
+static inline unsigned long raw_read_atomic_seqcount_begin(raw_atomic_seqcount_t *s)
+{
+ unsigned long seq;
+
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT));
+#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
+ DEBUG_LOCKS_WARN_ON(in_interrupt());
+#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+ while ((seq = atomic_long_read(&s->sequence)) &
+ ATOMIC_SEQCOUNT_WRITERS_MASK)
+ cpu_relax();
+
+ /* Load the sequence before any load in the critical section. */
+ smp_rmb();
+ return seq;
+}
+
+/**
+ * raw_read_atomic_seqcount_retry() - end a raw_atomic_seqcount_t read critical section
+ * @s: Pointer to the raw_atomic_seqcount_t
+ * @start: count, for example from raw_read_atomic_seqcount_begin()
+ *
+ * raw_read_atomic_seqcount_retry() closes the read critical section of the
+ * given raw_atomic_seqcount_t. If the critical section was invalid, it must be ignored
+ * (and typically retried).
+ *
+ * Return: true if a read section retry is required, else false
+ */
+static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s,
+ unsigned long start)
+{
+ /* Load the sequence after any load in the critical section. */
+ smp_rmb();
+ return unlikely(atomic_long_read(&s->sequence) != start);
+}
+
+/**
+ * raw_write_atomic_seqcount_begin() - start a raw_atomic_seqcount_t write critical section
+ * @s: Pointer to the raw_atomic_seqcount_t
+ *
+ * raw_write_atomic_seqcount_begin() opens the write critical section of the
+ * given raw_atomic_seqcount_t. This function must not be used in interrupt context.
+ */
+static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s)
+{
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT));
+#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
+ DEBUG_LOCKS_WARN_ON(in_interrupt());
+#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+ preempt_disable();
+ atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence);
+ /* Store the sequence before any store in the critical section. */
+ smp_mb__after_atomic();
+#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
+ DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) &
+ ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) >
+ ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX);
+#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+}
+
+/**
+ * raw_write_atomic_seqcount_end() - end a raw_atomic_seqcount_t write critical section
+ * @s: Pointer to the raw_atomic_seqcount_t
+ *
+ * raw_write_atomic_seqcount_end() closes the write critical section of the
+ * given raw_atomic_seqcount_t.
+ */
+static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s)
+{
+#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
+ DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) &
+ ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK));
+#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+ /* Store the sequence after any store in the critical section. */
+ smp_mb__before_atomic();
+ atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP -
+ ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence);
+ preempt_enable();
+}
+
+#endif /* __LINUX_ATOMIC_SEQCOUNT_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index cc7d53d9dc01..569c2c6ed47f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1298,6 +1298,7 @@ config PROVE_LOCKING
select DEBUG_MUTEXES if !PREEMPT_RT
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_RWSEMS
+ select DEBUG_ATOMIC_SEQCOUNT if !PREEMPT_RT
select DEBUG_WW_MUTEX_SLOWPATH
select DEBUG_LOCK_ALLOC
select PREEMPT_COUNT if !ARCH_NO_PREEMPT
@@ -1425,6 +1426,16 @@ config DEBUG_RWSEMS
This debugging feature allows mismatched rw semaphore locks
and unlocks to be detected and reported.

+config DEBUG_ATOMIC_SEQCOUNT
+ bool "Atomic seqcount debugging: basic checks"
+ depends on DEBUG_KERNEL && !PREEMPT_RT
+ help
+ This feature allows some atomic seqcount semantics violations to be
+ detected and reported.
+
+ The debug checks are only performed when running code that actively
+ uses atomic seqcounts; there are no dedicated test cases yet.
+
config DEBUG_LOCK_ALLOC
bool "Lock debugging: detect incorrect freeing of live locks"
depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT
--
2.41.0

2023-11-24 13:29:14

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 09/20] mm: improve folio_mapped_shared() for partially-mappable folios using rmap IDs

Let's make folio_mapped_shared() precise by using our rmap ID
magic to identify if a single MM is responsible for all mappings.

If there is a lot of concurrent (un)map activity, we could theoretically
spin for quite a while. But we're only looking at the rmap values in case
we didn't already identify the folio as "obviously shared". In most
cases, there should only be one or a handful of page tables involved.

For current THPs with ~512 .. 2048 subpages, we really shouldn't see a
lot of concurrent updates that keep us spinning for a long time. Anyhow,
if this ever becomes a problem, it can be optimized later if there is real demand.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm.h | 21 ++++++++++++---
include/linux/rmap.h | 2 ++
mm/rmap_id.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 765e688690f1..1081a8faa1a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2113,6 +2113,17 @@ static inline size_t folio_size(struct folio *folio)
return PAGE_SIZE << folio_order(folio);
}

+#ifdef CONFIG_RMAP_ID
+bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm);
+#else
+static inline bool __folio_large_mapped_shared(struct folio *folio,
+ struct mm_struct *mm)
+{
+ /* ... guess based on the mapcount of the first page of the folio. */
+ return atomic_read(&folio->page._mapcount) > 0;
+}
+#endif
+
/**
* folio_mapped_shared - Report if a folio is certainly mapped by
* multiple entities in their page tables
@@ -2141,8 +2152,11 @@ static inline size_t folio_size(struct folio *folio)
* PMD-mapped PMD-sized THP), the result will be exactly correct.
*
* For all other (partially-mappable) folios, such as PTE-mapped THP, the
- * return value is partially fuzzy: true is not fuzzy, because it means
- * "certainly mapped shared", but false means "maybe mapped exclusively".
+ * return value is partially fuzzy without CONFIG_RMAP_ID: true is not fuzzy,
+ * because it means "certainly mapped shared", but false means
+ * "maybe mapped exclusively".
+ *
+ * With CONFIG_RMAP_ID, the result will be exactly correct.
*
* Note that this function only considers *current* page table mappings
* tracked via rmap -- that properly adjusts the folio mapcount(s) -- and
@@ -2177,8 +2191,7 @@ static inline bool folio_mapped_shared(struct folio *folio,
*/
if (total_mapcount > folio_nr_pages(folio))
return true;
- /* ... guess based on the mapcount of the first page of the folio. */
- return atomic_read(&folio->page._mapcount) > 0;
+ return __folio_large_mapped_shared(folio, mm);
}

#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 19c9dc3216df..a73e146d82d1 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -253,6 +253,8 @@ void __folio_set_large_rmap_val(struct folio *folio, int count,
struct mm_struct *mm);
void __folio_add_large_rmap_val(struct folio *folio, int count,
struct mm_struct *mm);
+bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm);
#else
static inline void __folio_prep_large_rmap(struct folio *folio)
{
diff --git a/mm/rmap_id.c b/mm/rmap_id.c
index e66b0f5aea2d..85a61c830f19 100644
--- a/mm/rmap_id.c
+++ b/mm/rmap_id.c
@@ -322,6 +322,69 @@ void __folio_add_large_rmap_val(struct folio *folio, int count,
}
}

+bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+ const unsigned int order = folio_order(folio);
+ unsigned long diff = 0;
+
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count);
+ diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count);
+ diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count);
+ diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_6(mm, 3) * count);
+ diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_6(mm, 4) * count);
+ diff |= atomic_long_read(&folio->_rmap_val5) ^ (get_rmap_subid_6(mm, 5) * count);
+ break;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count);
+ diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count);
+ diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count);
+ diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_5(mm, 3) * count);
+ diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count);
+ break;
+#endif
+ default:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count);
+ diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count);
+ diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count);
+ diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count);
+ break;
+ }
+ return !diff;
+}
+
+bool __folio_large_mapped_shared(struct folio *folio, struct mm_struct *mm)
+{
+ unsigned long start;
+ bool exclusive;
+ int mapcount;
+
+ VM_WARN_ON_ONCE(!folio_test_large_rmappable(folio));
+ VM_WARN_ON_ONCE(folio_test_hugetlb(folio));
+
+ /*
+ * Livelocking here is unlikely, as the caller already handles the
+ * "obviously shared" cases. If ever an issue and there is too much
+ * concurrent (un)mapping happening (using different page tables), we
+ * could stop earlier and just return "shared".
+ */
+ do {
+ start = raw_read_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);
+ mapcount = folio_mapcount(folio);
+ if (unlikely(mapcount > folio_nr_pages(folio)))
+ return true;
+ exclusive = __folio_has_large_matching_rmap_val(folio, mapcount, mm);
+ } while (raw_read_atomic_seqcount_retry(&folio->_rmap_atomic_seqcount,
+ start));
+
+ return !exclusive;
+}
+
int alloc_rmap_id(void)
{
int id;
--
2.41.0

2023-11-24 13:29:13

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 07/20] mm/rmap_id: track if one or multiple MMs map a partially-mappable folio

In contrast to small folios and hugetlb folios, for a partially-mappable
folio (i.e., THP), the total mapcount is often not expressive enough to identify
whether such a folio is "mapped shared" or "mapped exclusively". For small
folios and hugetlb folios that are always entirely mapped, the single
mapcount is traditionally used for that purpose: is it 1? Then the folio
is currently mapped exclusively; is it bigger than 1? Then it's mapped
at least twice, and, therefore, considered "mapped shared".

For a partially-mappable folio, each individual PTE/PMD/... mapping
requires exactly one folio reference and one folio mapcount;
folio_mapcount() > 1 does not imply that the folio is "mapped shared".

There are some obvious cases when we can conclude that
partially-mappable folios are "mapped shared" -- see
folio_mapped_shared() -- but it is currently not always possible to
precisely tell whether a folio is "mapped exclusively".

For implementing a precise variant of folio_mapped_shared() and for
COW-reuse support of PTE-mapped anon THP, we need an efficient and precise
way to identify "mapped shared" vs. "mapped exclusively".

So how could we track if more than one MM is currently mapping a folio in
its page tables? Having a list of MMs per folio, or even a counter for
each MM for each folio is clearly not feasible.

... but what if we could play some fun math games to perform this
tracking while requiring a handful of counters per folio, the exact number
of counters depending on the size of the folio?

1. !!! Experimental Feature !!!
===============================

We'll only support CONFIG_64BIT and !CONFIG_PREEMPT_RT (implied by THP
support) for now. As we currently never get partially-mappable folios
without CONFIG_TRANSPARENT_HUGEPAGE, let's limit to that to avoid
unnecessary rmap ID allocations for setups without THP.

32bit support might be possible if there is demand, limiting it to 64k
rmap IDs and reasonable folio sizes (e.g., <= order-15).
Similarly, RT might be possible if there is ever real demand for it.

The feature will be experimental initially, and, therefore, disabled by
default.

Once the involved math is considered solid, the implementation has seen extended
testing, and the performance implications are clear and have either been
optimized (e.g., rmap batching) or mitigated (e.g., do we really have to
perform this tracking for folios that are always assumed shared, like
folios mapping executables or shared libraries? Is some hardware
problematic?), we can consider always enabling it by default.

2. Per-mm rmap IDs
==================

We'll have to assign each MM an rmap ID that is smaller than
16*1024*1024 on 64bit. Note that these are significantly more than the
maximum number of processes we can possibly have in the system. There isn't
really a difference between supporting 16M IDs and 2M/4M IDs.

Due to the ID size limitation, we cannot use the MM pointer value and need
a separate ID allocator. Maybe, we want to cache some rmap IDs per CPU?
Maybe we want to improve the allocation path? We can add such improvements
when deemed necessary.

In the distant future, we might want to allocate rmap IDs for selected
VMAs: for example, imagine a system call that does something like fork
(COW-sharing of pages) within a process for a range of anonymous memory,
ending up with a new VMA that wants a separate rmap ID. For now, per-MM
is simple and sufficient.

3. Tracking Overview
====================

We derive a sequence of special sub-IDs from our MM rmap ID.

Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable
folio to/from a MM, we:

(1) Adjust (increment/decrement) the mapcount of the folio
(2) Adjust (add/remove) the folio rmap values using the MM sub-IDs

So the rmap values are always linked to the folio mapcount. Consequently,
we know that a single rmap value in the folio is the sum of exactly
#folio_mapcount() rmap sub-IDs. To identify whether a single MM is
responsible for all folio_mapcount() mappings of a folio
("mapped exclusively") or whether other MMs are involved ("mapped shared"),
we perform the following checks:

(1) Do we have more mappings than the folio has pages? Then the folio is
certainly shared. That is, when "folio_mapcount() > folio_nr_pages()"
(2) For each rmap value X, does that rmap value folio->_rmap_valX
correspond to "folio_mapcount() * sub-ID[X]" of the MM?
Then the folio is certainly exclusive. Note that we only check that
when "folio_mapcount() <= folio_nr_pages()".

4. Synchronization
==================

We're using an atomic seqcount, stored in the folio, to allow for readers
to detect concurrent (un)mapping, whereby they could obtain a wrong
snapshot of the mapcount+rmap values and make a wrong decision.

Further, the mapcount and all rmap values are updated using RMW atomics,
to allow for concurrent updates.

5. sub-IDs
==========

To achieve (2), we generate sub-IDs that have the following property,
assuming that our folio has P=folio_nr_pages() pages.
"2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
"3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
"4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
...
"P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs

The sub-IDs are generated in generations, whereby
(1) Generation #0 is the number 0
(2) Generation #N takes all numbers from generations #0..#N-1 and adds
(P + 1)^(N - 1), effectively doubling the number of sub-IDs

Consequently, the smallest number S in gen #N is:
S[#N] = (P + 1)^(N - 1)

The largest number L in gen #N is:
L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... (P + 1)^0 + 0.
-> [geometric sum with "P + 1 != 1"]
= (1 - (P + 1)^N) / (1 - (P + 1))
= (1 - (P + 1)^N) / (-P)
= ((P + 1)^N - 1) / P

Example with P=4 (order-2 folio):

Generation #0: 0
------------------------ + (4 + 1)^0 = 1
Generation #1: 1
------------------------ + (4 + 1)^1 = 5
Generation #2: 5
6
------------------------ + (4 + 1)^2 = 25
Generation #3: 25
26
30
31
------------------------ + (4 + 1)^3 = 125
[...]
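
As a quick illustration -- a standalone sketch, not part of this patch --
the construction above can be expressed as:

  /* Fill "ids" with the first "count" sub-IDs for folios with "p" pages. */
  static void generate_subids(unsigned long p, unsigned long *ids,
                              unsigned long count)
  {
          unsigned long base = 1, nr = 1, i;

          ids[0] = 0;                             /* generation #0 */
          while (nr < count) {
                  /* generation #N: previous sub-IDs, offset by (P + 1)^(N - 1) */
                  for (i = 0; i < nr && nr + i < count; i++)
                          ids[nr + i] = ids[i] + base;
                  nr = 2 * nr < count ? 2 * nr : count;
                  base *= p + 1;
          }
  }

With p=4 and count=8, this yields 0, 1, 5, 6, 25, 26, 30, 31, matching the
example above.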

Intuitively, we are working with sub-counters that cannot overflow as
long as we have <= P components. Let's consider the simple case of P=3,
whereby our sub-counters are exactly 2-bit wide.

Subid | Bits | Sub-counters
--------------------------------
0 | 0000 0000 | 0,0,0,0
1 | 0000 0001 | 0,0,0,1
4 | 0000 0100 | 0,0,1,0
5 | 0000 0101 | 0,0,1,1
16 | 0001 0000 | 0,1,0,0
17 | 0001 0001 | 0,1,0,1
20 | 0001 0100 | 0,1,1,0
21 | 0001 0101 | 0,1,1,1
64 | 0100 0000 | 1,0,0,0
65 | 0100 0001 | 1,0,0,1
68 | 0100 0100 | 1,0,1,0
69 | 0100 0101 | 1,0,1,1
80 | 0101 0000 | 1,1,0,0
81 | 0101 0001 | 1,1,0,1
84 | 0101 0100 | 1,1,1,0
85 | 0101 0101 | 1,1,1,1

So if we, say, have:
3 * 17 = 0,3,0,3
how could we possibly get to that number by using 3 other subids? It's
impossible, because the sub-counters won't overflow as long as we stay
<= 3.

Interesting side note that might come in handy at some point: we also
cannot get to 0,3,0,3 by using 1 or 2 other subids. But, we could get to
1 * 17 = 0,1,0,1 by using 2 subids (16 and 1) or similarly to 2 * 17 =
0,2,0,2 by using 4 subids (2x16 and 2x1). Looks like we cannot get to
X * subid using any 1..X other subids.
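
To gain some confidence in that property, here is a trivial userspace
brute-force check for the P=3 sub-IDs listed above (a standalone sketch, not
part of this patch); it verifies that "3 * sub-ID" can only be represented
by adding up that very sub-ID three times:

  #include <stdio.h>

  /* The 16 sub-IDs for P = 3, as listed in the table above. */
  static const int subid[] = {  0,  1,  4,  5, 16, 17, 20, 21,
                               64, 65, 68, 69, 80, 81, 84, 85 };
  #define NR_SUBIDS (int)(sizeof(subid) / sizeof(subid[0]))

  int main(void)
  {
          int s, a, b, c;

          for (s = 0; s < NR_SUBIDS; s++)
                  for (a = 0; a < NR_SUBIDS; a++)
                          for (b = 0; b < NR_SUBIDS; b++)
                                  for (c = 0; c < NR_SUBIDS; c++)
                                          if (subid[a] + subid[b] + subid[c] == 3 * subid[s] &&
                                              !(a == s && b == s && c == s))
                                                  printf("violation for sub-ID %d\n", subid[s]);
          printf("done\n");
          return 0;
  }

It should only print "done".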

Note 1: we'll add the actual detection logic to be used by
folio_mapped_shared() and wp_can_reuse_anon_folio() separately.

Note 2: we might want to use that infrastructure for hugetlb as well in the
future: there is nothing THP-specific about rmap ID handling.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 58 +++++++
include/linux/rmap.h | 126 +++++++++++++-
kernel/fork.c | 26 +++
mm/Kconfig | 21 +++
mm/Makefile | 1 +
mm/huge_memory.c | 16 +-
mm/init-mm.c | 4 +
mm/page_alloc.c | 9 +
mm/rmap_id.c | 351 +++++++++++++++++++++++++++++++++++++++
9 files changed, 604 insertions(+), 8 deletions(-)
create mode 100644 mm/rmap_id.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99b84b4797b9..75305c57ef64 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/atomic_seqcount.h>
#include <linux/percpu_counter.h>

#include <asm/mmu.h>
@@ -273,6 +274,14 @@ typedef struct {
* @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
* @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
* @_deferred_list: Folios to be split under memory pressure.
+ * @_rmap_atomic_seqcount: Seqcount protecting _total_mapcount and _rmapX.
+ * Does not apply to hugetlb.
+ * @_rmap_val0: Do not use outside of rmap code. Does not apply to hugetlb.
+ * @_rmap_val1: Do not use outside of rmap code. Does not apply to hugetlb.
+ * @_rmap_val2: Do not use outside of rmap code. Does not apply to hugetlb.
+ * @_rmap_val3: Do not use outside of rmap code. Does not apply to hugetlb.
+ * @_rmap_val4: Do not use outside of rmap code. Does not apply to hugetlb.
+ * @_rmap_val5: Do not use outside of rmap code. Does not apply to hugetlb.
*
* A folio is a physically, virtually and logically contiguous set
* of bytes. It is a power-of-two in size, and it is aligned to that
@@ -331,6 +340,9 @@ struct folio {
atomic_t _pincount;
#ifdef CONFIG_64BIT
unsigned int _folio_nr_pages;
+#ifdef CONFIG_RMAP_ID
+ raw_atomic_seqcount_t _rmap_atomic_seqcount;
+#endif /* CONFIG_RMAP_ID */
#endif
/* private: the union with struct page is transitional */
};
@@ -356,6 +368,34 @@ struct folio {
};
struct page __page_2;
};
+ union {
+ struct {
+ unsigned long _flags_3;
+ unsigned long _head_3;
+ /* public: */
+#ifdef CONFIG_RMAP_ID
+ atomic_long_t _rmap_val0;
+ atomic_long_t _rmap_val1;
+ atomic_long_t _rmap_val2;
+ atomic_long_t _rmap_val3;
+#endif /* CONFIG_RMAP_ID */
+ /* private: the union with struct page is transitional */
+ };
+ struct page __page_3;
+ };
+ union {
+ struct {
+ unsigned long _flags_4;
+ unsigned long _head_4;
+ /* public: */
+#ifdef CONFIG_RMAP_ID
+ atomic_long_t _rmap_val4;
+ atomic_long_t _rmap_val5;
+#endif /* CONFIG_RMAP_ID */
+ /* private: the union with struct page is transitional */
+ };
+ struct page __page_4;
+ };
};

#define FOLIO_MATCH(pg, fl) \
@@ -392,6 +432,20 @@ FOLIO_MATCH(compound_head, _head_2);
FOLIO_MATCH(flags, _flags_2a);
FOLIO_MATCH(compound_head, _head_2a);
#undef FOLIO_MATCH
+#define FOLIO_MATCH(pg, fl) \
+ static_assert(offsetof(struct folio, fl) == \
+ offsetof(struct page, pg) + 3 * sizeof(struct page))
+FOLIO_MATCH(flags, _flags_3);
+FOLIO_MATCH(compound_head, _head_3);
+#undef FOLIO_MATCH
+#undef FOLIO_MATCH
+#define FOLIO_MATCH(pg, fl) \
+ static_assert(offsetof(struct folio, fl) == \
+ offsetof(struct page, pg) + 4 * sizeof(struct page))
+FOLIO_MATCH(flags, _flags_4);
+FOLIO_MATCH(compound_head, _head_4);
+#undef FOLIO_MATCH
+

/**
* struct ptdesc - Memory descriptor for page tables.
@@ -975,6 +1029,10 @@ struct mm_struct {
#endif
} lru_gen;
#endif /* CONFIG_LRU_GEN */
+
+#ifdef CONFIG_RMAP_ID
+ int mm_rmap_id;
+#endif /* CONFIG_RMAP_ID */
} __randomize_layout;

/*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9d5c2ed6ced5..19c9dc3216df 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -168,6 +168,116 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,

struct anon_vma *folio_get_anon_vma(struct folio *folio);

+#ifdef CONFIG_RMAP_ID
+/*
+ * For init_mm and friends, we don't actually expect to ever rmap pages. So
+ * we use a reserved dummy ID that we'll never hand out the normal way.
+ */
+#define RMAP_ID_DUMMY 0
+#define RMAP_ID_MIN (RMAP_ID_DUMMY + 1)
+#define RMAP_ID_MAX (16 * 1024 * 1024u - 1)
+
+void free_rmap_id(int id);
+int alloc_rmap_id(void);
+
+#define RMAP_SUBID_4_MAX_ORDER 10
+#define RMAP_SUBID_5_MIN_ORDER 11
+#define RMAP_SUBID_5_MAX_ORDER 12
+#define RMAP_SUBID_6_MIN_ORDER 13
+#define RMAP_SUBID_6_MAX_ORDER 15
+
+static inline void __folio_prep_large_rmap(struct folio *folio)
+{
+ const unsigned int order = folio_order(folio);
+
+ raw_seqcount_init(&folio->_rmap_atomic_seqcount);
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val5, 0);
+ fallthrough;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val4, 0);
+ fallthrough;
+#endif
+ default:
+ atomic_long_set(&folio->_rmap_val3, 0);
+ atomic_long_set(&folio->_rmap_val2, 0);
+ atomic_long_set(&folio->_rmap_val1, 0);
+ atomic_long_set(&folio->_rmap_val0, 0);
+ break;
+ }
+}
+
+static inline void __folio_undo_large_rmap(struct folio *folio)
+{
+#ifdef CONFIG_DEBUG_VM
+ const unsigned int order = folio_order(folio);
+
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val5));
+ fallthrough;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4));
+ fallthrough;
+#endif
+ default:
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3));
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2));
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1));
+ VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0));
+ break;
+ }
+#endif
+}
+
+static inline void __folio_write_large_rmap_begin(struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);
+}
+
+static inline void __folio_write_large_rmap_end(struct folio *folio)
+{
+ raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount);
+}
+
+void __folio_set_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm);
+void __folio_add_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm);
+#else
+static inline void __folio_prep_large_rmap(struct folio *folio)
+{
+}
+static inline void __folio_undo_large_rmap(struct folio *folio)
+{
+}
+static inline void __folio_write_large_rmap_begin(struct folio *folio)
+{
+ VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
+ VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+}
+static inline void __folio_write_large_rmap_end(struct folio *folio)
+{
+}
+static inline void __folio_set_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+}
+static inline void __folio_add_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_RMAP_ID */
+
static inline void folio_set_large_mapcount(struct folio *folio,
int count, struct vm_area_struct *vma)
{
@@ -175,30 +285,34 @@ static inline void folio_set_large_mapcount(struct folio *folio,
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
/* increment count (starts at -1) */
atomic_set(&folio->_total_mapcount, count - 1);
+ __folio_set_large_rmap_val(folio, count, vma->vm_mm);
}

static inline void folio_inc_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
- VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ __folio_write_large_rmap_begin(folio);
atomic_inc(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, 1, vma->vm_mm);
+ __folio_write_large_rmap_end(folio);
}

static inline void folio_add_large_mapcount(struct folio *folio,
int count, struct vm_area_struct *vma)
{
- VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
- VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ __folio_write_large_rmap_begin(folio);
atomic_add(count, &folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, count, vma->vm_mm);
+ __folio_write_large_rmap_end(folio);
}

static inline void folio_dec_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
- VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ __folio_write_large_rmap_begin(folio);
atomic_dec(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, -1, vma->vm_mm);
+ __folio_write_large_rmap_end(folio);
}

/* RMAP flags, currently only relevant for some anon rmap operations. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..773c93613ca2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -814,6 +814,26 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
#define mm_free_pgd(mm)
#endif /* CONFIG_MMU */

+#ifdef CONFIG_RMAP_ID
+static inline int mm_alloc_rmap_id(struct mm_struct *mm)
+{
+ int id = alloc_rmap_id();
+
+ if (id < 0)
+ return id;
+ mm->mm_rmap_id = id;
+ return 0;
+}
+
+static inline void mm_free_rmap_id(struct mm_struct *mm)
+{
+ free_rmap_id(mm->mm_rmap_id);
+}
+#else
+#define mm_alloc_rmap_id(mm) (0)
+#define mm_free_rmap_id(mm)
+#endif /* CONFIG_RMAP_ID */
+
static void check_mm(struct mm_struct *mm)
{
int i;
@@ -917,6 +937,7 @@ void __mmdrop(struct mm_struct *mm)

WARN_ON_ONCE(mm == current->active_mm);
mm_free_pgd(mm);
+ mm_free_rmap_id(mm);
destroy_context(mm);
mmu_notifier_subscriptions_destroy(mm);
check_mm(mm);
@@ -1298,6 +1319,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (mm_alloc_pgd(mm))
goto fail_nopgd;

+ if (mm_alloc_rmap_id(mm))
+ goto fail_normapid;
+
if (init_new_context(p, mm))
goto fail_nocontext;

@@ -1317,6 +1341,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
fail_cid:
destroy_context(mm);
fail_nocontext:
+ mm_free_rmap_id(mm);
+fail_normapid:
mm_free_pgd(mm);
fail_nopgd:
free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..bb0b7b885ada 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -861,6 +861,27 @@ choice
benefit.
endchoice

+menuconfig RMAP_ID
+ bool "Rmap ID tracking (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && 64BIT
+ help
+ Use per-MM rmap IDs and the unleashed power of math to track
+ whether partially-mappable hugepages (i.e., THPs for now) are
+ "mapped shared" or "mapped exclusively".
+
+ This tracking allows for efficiently and precisely detecting
+ whether a PTE-mapped THP is mapped by a single process
+ ("mapped exclusively") or mapped by multiple ones ("mapped
+ shared"), with the cost of additional tracking when (un)mapping
+ (parts of) such a THP.
+
+ If this configuration is not enabled, a heuristic is used
+ instead that might result in false "mapped exclusively"
+ detection; some features relying on this information might
+ operate slightly imprecisely (e.g., MADV_PAGEOUT succeeds although
+ it should fail) or might not be available at all (e.g.,
+ Copy-on-Write reuse support).
+
config THP_SWAP
def_bool y
depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP && 64BIT
diff --git a/mm/Makefile b/mm/Makefile
index 33873c8aedb3..b0cf2563f33a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_RMAP_ID) += rmap_id.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51a878efca0e..0228b04c4053 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -614,6 +614,7 @@ void folio_prep_large_rmappable(struct folio *folio)
{
VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
INIT_LIST_HEAD(&folio->_deferred_list);
+ __folio_prep_large_rmap(folio);
folio_set_large_rmappable(folio);
}

@@ -2478,8 +2479,8 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
(1L << PG_dirty) |
LRU_GEN_MASK | LRU_REFS_MASK));

- /* ->mapping in first and second tail page is replaced by other uses */
- VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+ /* ->mapping in some tail page is replaced by other uses */
+ VM_BUG_ON_PAGE(tail > 4 && page_tail->mapping != TAIL_MAPPING,
page_tail);
page_tail->mapping = head->mapping;
page_tail->index = head->index + tail;
@@ -2550,6 +2551,16 @@ static void __split_huge_page(struct page *page, struct list_head *list,

ClearPageHasHWPoisoned(head);

+#ifdef CONFIG_RMAP_ID
+ /*
+ * Make sure folio->_rmap_atomic_seqcount, which overlays
+ * tail->private, is 0. All other folio->_rmap_valX should be 0
+ * after unmapping the folio.
+ */
+ if (likely(nr >= 4))
+ raw_seqcount_init(&folio->_rmap_atomic_seqcount);
+#endif /* CONFIG_RMAP_ID */
+
for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(folio, i, lruvec, list);
/* Some pages can be beyond EOF: drop them from page cache */
@@ -2809,6 +2820,7 @@ void folio_undo_large_rmappable(struct folio *folio)
struct deferred_split *ds_queue;
unsigned long flags;

+ __folio_undo_large_rmap(folio);
/*
* At this point, there is no one trying to add the folio to
* deferred_list. If folio is not in deferred_list, it's safe
diff --git a/mm/init-mm.c b/mm/init-mm.c
index cfd367822cdd..8890271b50c6 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -7,6 +7,7 @@
#include <linux/cpumask.h>
#include <linux/mman.h>
#include <linux/pgtable.h>
+#include <linux/rmap.h>

#include <linux/atomic.h>
#include <linux/user_namespace.h>
@@ -46,6 +47,9 @@ struct mm_struct init_mm = {
.cpu_bitmap = CPU_BITS_NONE,
#ifdef CONFIG_IOMMU_SVA
.pasid = IOMMU_PASID_INVALID,
+#endif
+#ifdef CONFIG_RMAP_ID
+ .mm_rmap_id = RMAP_ID_DUMMY,
#endif
INIT_MM_CONTEXT(init_mm)
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aad45758c0c7..c1dd039801e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,6 +1007,15 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
* deferred_list.next -- ignore value.
*/
break;
+#ifdef CONFIG_RMAP_ID
+ case 3:
+ case 4:
+ /*
+ * the third and fourth tail page: ->mapping may be
+ * used to store RMAP values for RMAP ID tracking.
+ */
+ break;
+#endif /* CONFIG_RMAP_ID */
default:
if (page->mapping != TAIL_MAPPING) {
bad_page(page, "corrupted mapping in tail page");
diff --git a/mm/rmap_id.c b/mm/rmap_id.c
new file mode 100644
index 000000000000..e66b0f5aea2d
--- /dev/null
+++ b/mm/rmap_id.c
@@ -0,0 +1,351 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * rmap ID tracking for precise "mapped shared" vs. "mapped exclusively"
+ * detection of partially-mappable folios (e.g., PTE-mapped THP).
+ *
+ * Copyright Red Hat, Inc. 2023
+ *
+ * Author(s): David Hildenbrand <[email protected]>
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/idr.h>
+
+#include "internal.h"
+
+static DEFINE_SPINLOCK(rmap_id_lock);
+static DEFINE_IDA(rmap_ida);
+
+/* For now we only expect folios from the buddy, not hugetlb folios. */
+#if MAX_ORDER > RMAP_SUBID_6_MAX_ORDER
+#error "rmap ID tracking does not support such large MAX_ORDER"
+#endif
+
+/*
+ * We assign each MM a unique rmap ID and derive from it a sequence of
+ * special sub-IDs. We add/remove these sub-IDs to/from the corresponding
+ * folio rmap values (folio->rmap_valX) whenever (un)mapping (parts of) a
+ * partially mappable folio.
+ *
+ * With 24bit rmap IDs, and a folio size that is compatible with 4
+ * rmap values (more below), we calculate the sub-ID sequence like this:
+ *
+ * rmap ID : | 3 3 3 3 3 3 | 2 2 2 2 2 2 | 1 1 1 1 1 1 | 0 0 0 0 0 0 |
+ * sub-ID IDX : | IDX #3 | IDX #2 | IDX #1 | IDX #0 |
+ *
+ * sub-IDs : [ subid_4(#3), subid_4(#2), subid_4(#1), subid_4(#0) ]
+ * rmap value : [ _rmap_val3, _rmap_val2, _rmap_val1, _rmap_val0 ]
+ *
+ * Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable
+ * folio to/from a MM, we:
+ * (1) Adjust (increment/decrement) the mapcount of the folio
+ * (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs
+ *
+ * So the rmap values are always linked to the folio mapcount.
+ * Consequently, we know that a single rmap value in the folio is the sum
+ * of exactly #folio_mapcount() rmap sub-IDs. As one example, if the folio
+ * is completely unmapped, the rmap values must be 0. As another example,
+ * if the folio is mapped exactly once, the rmap values correspond to the
+ * MM sub-IDs.
+ *
+ * To identify whether a given MM is responsible for all #folio_mapcount()
+ * mappings of a folio ("mapped exclusively") or whether other MMs are
+ * involved ("mapped shared"), we perform the following checks:
+ * (1) Do we have more mappings than the folio has pages? Then the folio
+ * is mapped shared. So when "folio_mapcount() > folio_nr_pages()".
+ * (2) Do the rmap values correspond to "#folio_mapcount() * sub-IDs" of
+ * the MM? Then the folio is mapped exclusive.
+ *
+ * To achieve (2), we generate sub-IDs that have the following property,
+ * assuming that our folio has P=folio_nr_pages() pages.
+ * "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
+ * "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
+ * "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
+ * ...
+ * "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs
+ *
+ * Further, we want "P * sub-ID" (the maximum number we will ever look at)
+ * to not overflow. If we overflow with " > P" mappings, we don't care as
+ * we won't be looking at the numbers until they are fully expressive
+ * again.
+ *
+ * Consequently, to not overflow 64bit values with "P * sub-ID", folios
+ * with large P require more rmap values (we cannot generate that many sub
+ * IDs), whereby folios with smaller P can get away with less rmap values
+ * (we can generate more sub-IDs).
+ *
+ * The sub-IDs are generated in generations, whereby
+ * (1) Generation #0 is the number 0
+ * (2) Generation #N takes all numbers from generations #0..#N-1 and adds
+ * (P + 1)^(N - 1), effectively doubling the number of sub-IDs
+ *
+ * Note: a PMD-sized THP can, for a short time while PTE-mapping it, be
+ * mapped using PTEs and a single PMD, resulting in "P + 1" mappings.
+ * For now, we don't consider this case, as we are ususally not
+ * looking at such folios while they being remapped, because the
+ * involved page tables are locked and stop any page table walkers.
+ */
+
+/*
+ * With 1024 (order-10) possible exclusive mappings per folio, we can have 64
+ * sub-IDs per 64bit value.
+ *
+ * With 4 such 64bit values, we can support 64^4 == 16M IDs.
+ */
+static const unsigned long rmap_subids_4[64] = {
+ 0ul,
+ 1ul,
+ 1025ul,
+ 1026ul,
+ 1050625ul,
+ 1050626ul,
+ 1051650ul,
+ 1051651ul,
+ 1076890625ul,
+ 1076890626ul,
+ 1076891650ul,
+ 1076891651ul,
+ 1077941250ul,
+ 1077941251ul,
+ 1077942275ul,
+ 1077942276ul,
+ 1103812890625ul,
+ 1103812890626ul,
+ 1103812891650ul,
+ 1103812891651ul,
+ 1103813941250ul,
+ 1103813941251ul,
+ 1103813942275ul,
+ 1103813942276ul,
+ 1104889781250ul,
+ 1104889781251ul,
+ 1104889782275ul,
+ 1104889782276ul,
+ 1104890831875ul,
+ 1104890831876ul,
+ 1104890832900ul,
+ 1104890832901ul,
+ 1131408212890625ul,
+ 1131408212890626ul,
+ 1131408212891650ul,
+ 1131408212891651ul,
+ 1131408213941250ul,
+ 1131408213941251ul,
+ 1131408213942275ul,
+ 1131408213942276ul,
+ 1131409289781250ul,
+ 1131409289781251ul,
+ 1131409289782275ul,
+ 1131409289782276ul,
+ 1131409290831875ul,
+ 1131409290831876ul,
+ 1131409290832900ul,
+ 1131409290832901ul,
+ 1132512025781250ul,
+ 1132512025781251ul,
+ 1132512025782275ul,
+ 1132512025782276ul,
+ 1132512026831875ul,
+ 1132512026831876ul,
+ 1132512026832900ul,
+ 1132512026832901ul,
+ 1132513102671875ul,
+ 1132513102671876ul,
+ 1132513102672900ul,
+ 1132513102672901ul,
+ 1132513103722500ul,
+ 1132513103722501ul,
+ 1132513103723525ul,
+ 1132513103723526ul,
+};
+
+static unsigned long get_rmap_subid_4(struct mm_struct *mm, int nr)
+{
+ const unsigned int rmap_id = mm->mm_rmap_id;
+
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 3);
+ return rmap_subids_4[(rmap_id >> (nr * 6)) & 0x3f];
+}
+
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+/*
+ * With 4096 (order-12) possible exclusive mappings per folio, we can have
+ * 32 sub-IDs per 64bit value.
+ *
+ * With 5 such 64bit values, we can support 32^5 > 16M IDs.
+ */
+static const unsigned long rmap_subids_5[32] = {
+ 0ul,
+ 1ul,
+ 4097ul,
+ 4098ul,
+ 16785409ul,
+ 16785410ul,
+ 16789506ul,
+ 16789507ul,
+ 68769820673ul,
+ 68769820674ul,
+ 68769824770ul,
+ 68769824771ul,
+ 68786606082ul,
+ 68786606083ul,
+ 68786610179ul,
+ 68786610180ul,
+ 281749955297281ul,
+ 281749955297282ul,
+ 281749955301378ul,
+ 281749955301379ul,
+ 281749972082690ul,
+ 281749972082691ul,
+ 281749972086787ul,
+ 281749972086788ul,
+ 281818725117954ul,
+ 281818725117955ul,
+ 281818725122051ul,
+ 281818725122052ul,
+ 281818741903363ul,
+ 281818741903364ul,
+ 281818741907460ul,
+ 281818741907461ul,
+};
+
+static unsigned long get_rmap_subid_5(struct mm_struct *mm, int nr)
+{
+ const unsigned int rmap_id = mm->mm_rmap_id;
+
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 4);
+ return rmap_subids_5[(rmap_id >> (nr * 5)) & 0x1f];
+}
+#endif
+
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+/*
+ * With 32768 (order-15) possible exclusive mappings per folio, we can have
+ * 16 sub-IDs per 64bit value.
+ *
+ * With 6 such 64bit values, we can support 16^6 == 16M IDs.
+ */
+static const unsigned long rmap_subids_6[16] = {
+ 0ul,
+ 1ul,
+ 32769ul,
+ 32770ul,
+ 1073807361ul,
+ 1073807362ul,
+ 1073840130ul,
+ 1073840131ul,
+ 35187593412609ul,
+ 35187593412610ul,
+ 35187593445378ul,
+ 35187593445379ul,
+ 35188667219970ul,
+ 35188667219971ul,
+ 35188667252739ul,
+ 35188667252740ul,
+};
+
+static unsigned long get_rmap_subid_6(struct mm_struct *mm, int nr)
+{
+ const unsigned int rmap_id = mm->mm_rmap_id;
+
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 15);
+ return rmap_subids_6[(rmap_id >> (nr * 4)) & 0xf];
+}
+#endif
+
+void __folio_set_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+ const unsigned int order = folio_order(folio);
+
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_6(mm, 0) * count);
+ atomic_long_set(&folio->_rmap_val1, get_rmap_subid_6(mm, 1) * count);
+ atomic_long_set(&folio->_rmap_val2, get_rmap_subid_6(mm, 2) * count);
+ atomic_long_set(&folio->_rmap_val3, get_rmap_subid_6(mm, 3) * count);
+ atomic_long_set(&folio->_rmap_val4, get_rmap_subid_6(mm, 4) * count);
+ atomic_long_set(&folio->_rmap_val5, get_rmap_subid_6(mm, 5) * count);
+ break;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_5(mm, 0) * count);
+ atomic_long_set(&folio->_rmap_val1, get_rmap_subid_5(mm, 1) * count);
+ atomic_long_set(&folio->_rmap_val2, get_rmap_subid_5(mm, 2) * count);
+ atomic_long_set(&folio->_rmap_val3, get_rmap_subid_5(mm, 3) * count);
+ atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count);
+ break;
+#endif
+ default:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count);
+ atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count);
+ atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count);
+ atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count);
+ break;
+ }
+}
+
+void __folio_add_large_rmap_val(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+ const unsigned int order = folio_order(folio);
+
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ atomic_long_add(get_rmap_subid_6(mm, 0) * count, &folio->_rmap_val0);
+ atomic_long_add(get_rmap_subid_6(mm, 1) * count, &folio->_rmap_val1);
+ atomic_long_add(get_rmap_subid_6(mm, 2) * count, &folio->_rmap_val2);
+ atomic_long_add(get_rmap_subid_6(mm, 3) * count, &folio->_rmap_val3);
+ atomic_long_add(get_rmap_subid_6(mm, 4) * count, &folio->_rmap_val4);
+ atomic_long_add(get_rmap_subid_6(mm, 5) * count, &folio->_rmap_val5);
+ break;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ atomic_long_add(get_rmap_subid_5(mm, 0) * count, &folio->_rmap_val0);
+ atomic_long_add(get_rmap_subid_5(mm, 1) * count, &folio->_rmap_val1);
+ atomic_long_add(get_rmap_subid_5(mm, 2) * count, &folio->_rmap_val2);
+ atomic_long_add(get_rmap_subid_5(mm, 3) * count, &folio->_rmap_val3);
+ atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4);
+ break;
+#endif
+ default:
+ atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0);
+ atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1);
+ atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2);
+ atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3);
+ break;
+ }
+}
+
+int alloc_rmap_id(void)
+{
+ int id;
+
+ /*
+ * We cannot use a mutex, because free_rmap_id() might get called
+ * when we are not allowed to sleep.
+ *
+ * TODO: do we need something like idr_preload()?
+ */
+ spin_lock(&rmap_id_lock);
+ id = ida_alloc_range(&rmap_ida, RMAP_ID_MIN, RMAP_ID_MAX, GFP_ATOMIC);
+ spin_unlock(&rmap_id_lock);
+
+ return id;
+}
+
+void free_rmap_id(int id)
+{
+ if (id == RMAP_ID_DUMMY)
+ return;
+ if (WARN_ON_ONCE(id < RMAP_ID_MIN || id > RMAP_ID_MAX))
+ return;
+ spin_lock(&rmap_id_lock);
+ ida_free(&rmap_ida, id);
+ spin_unlock(&rmap_id_lock);
+}
--
2.41.0

2023-11-24 13:29:18

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 11/20] mm/rmap_id: support for 1, 2 and 3 values by manual calculation

For smaller folios, we can use fewer rmap values:
* <= order-2: 1x 64bit value
* <= order-5: 2x 64bit values
* <= order-9: 3x 64bit values

We end up with a lot of subids, so we cannot really use lookup tables.
Pre-calculate the subids per MM.

For order-9 we could think about having a lookup table with 128bit
entries. Further, we could calculate them only when really required.

With 2 MiB THP this now implies only 3 instead of 4 values.
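
As a quick illustration of the construction (a standalone userspace sketch,
not part of the patch): calc_rmap_subid(n, i) simply spreads the bits of i
across base-(n + 1) digits. For order-2 folios (P = 4, base 5), a sum of at
most P such sub-IDs can never carry between base-5 digits, which is what
makes "count * sub-ID" unambiguous.

#include <stdio.h>

/* Mirrors calc_rmap_subid(): interpret the bits of i as 0/1 base-(n + 1) digits. */
static unsigned long calc_subid(unsigned int n, unsigned int i)
{
	unsigned long nr = 0, mult = 1;

	while (i) {
		if (i & 1)
			nr += mult;
		mult *= (n + 1);
		i >>= 1;
	}
	return nr;
}

int main(void)
{
	unsigned int i;

	/* First eight order-2 sub-IDs: 0 1 5 6 25 26 30 31 */
	for (i = 0; i < 8; i++)
		printf("%lu ", calc_subid(4, i));
	printf("\n");
	return 0;
}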

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 3 ++
include/linux/rmap.h | 58 ++++++++++++++++++++++++++++-
kernel/fork.c | 6 +++
mm/rmap_id.c | 79 +++++++++++++++++++++++++++++++++++++---
4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 75305c57ef64..0ca5004e8f4a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,6 +1032,9 @@ struct mm_struct {

#ifdef CONFIG_RMAP_ID
int mm_rmap_id;
+ unsigned long mm_rmap_subid_1;
+ unsigned long mm_rmap_subid_2[2];
+ unsigned long mm_rmap_subid_3[3];
#endif /* CONFIG_RMAP_ID */
} __randomize_layout;

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a73e146d82d1..39aeab457f4a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -180,12 +180,54 @@ struct anon_vma *folio_get_anon_vma(struct folio *folio);
void free_rmap_id(int id);
int alloc_rmap_id(void);

+#define RMAP_SUBID_1_MAX_ORDER 2
+#define RMAP_SUBID_2_MIN_ORDER 3
+#define RMAP_SUBID_2_MAX_ORDER 5
+#define RMAP_SUBID_3_MIN_ORDER 6
+#define RMAP_SUBID_3_MAX_ORDER 9
+#define RMAP_SUBID_4_MIN_ORDER 10
#define RMAP_SUBID_4_MAX_ORDER 10
#define RMAP_SUBID_5_MIN_ORDER 11
#define RMAP_SUBID_5_MAX_ORDER 12
#define RMAP_SUBID_6_MIN_ORDER 13
#define RMAP_SUBID_6_MAX_ORDER 15

+static inline unsigned long calc_rmap_subid(unsigned int n, unsigned int i)
+{
+ unsigned long nr = 0, mult = 1;
+
+ while (i) {
+ if (i & 1)
+ nr += mult;
+ mult *= (n + 1);
+ i >>= 1;
+ }
+ return nr;
+}
+
+static inline unsigned long calc_rmap_subid_1(int rmap_id)
+{
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX);
+
+ return calc_rmap_subid(1u << RMAP_SUBID_1_MAX_ORDER, rmap_id);
+}
+
+static inline unsigned long calc_rmap_subid_2(int rmap_id, int nr)
+{
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 1);
+
+ return calc_rmap_subid(1u << RMAP_SUBID_2_MAX_ORDER,
+ (rmap_id >> (nr * 12)) & 0xfff);
+}
+
+static inline unsigned long calc_rmap_subid_3(int rmap_id, int nr)
+{
+ VM_WARN_ON_ONCE(rmap_id < RMAP_ID_MIN || rmap_id > RMAP_ID_MAX || nr > 2);
+
+ return calc_rmap_subid(1u << RMAP_SUBID_3_MAX_ORDER,
+ (rmap_id >> (nr * 8)) & 0xff);
+}
+
static inline void __folio_prep_large_rmap(struct folio *folio)
{
const unsigned int order = folio_order(folio);
@@ -202,10 +244,16 @@ static inline void __folio_prep_large_rmap(struct folio *folio)
atomic_long_set(&folio->_rmap_val4, 0);
fallthrough;
#endif
- default:
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
atomic_long_set(&folio->_rmap_val3, 0);
+ fallthrough;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
atomic_long_set(&folio->_rmap_val2, 0);
+ fallthrough;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
atomic_long_set(&folio->_rmap_val1, 0);
+ fallthrough;
+ default:
atomic_long_set(&folio->_rmap_val0, 0);
break;
}
@@ -227,10 +275,16 @@ static inline void __folio_undo_large_rmap(struct folio *folio)
VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val4));
fallthrough;
#endif
- default:
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val3));
+ fallthrough;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val2));
+ fallthrough;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val1));
+ fallthrough;
+ default:
VM_WARN_ON_ONCE(atomic_long_read(&folio->_rmap_val0));
break;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 773c93613ca2..1d2f6248c83e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -822,6 +822,12 @@ static inline int mm_alloc_rmap_id(struct mm_struct *mm)
if (id < 0)
return id;
mm->mm_rmap_id = id;
+ mm->mm_rmap_subid_1 = calc_rmap_subid_1(id);
+ mm->mm_rmap_subid_2[0] = calc_rmap_subid_2(id, 0);
+ mm->mm_rmap_subid_2[1] = calc_rmap_subid_2(id, 1);
+ mm->mm_rmap_subid_3[0] = calc_rmap_subid_3(id, 0);
+ mm->mm_rmap_subid_3[1] = calc_rmap_subid_3(id, 1);
+ mm->mm_rmap_subid_3[2] = calc_rmap_subid_3(id, 2);
return 0;
}

diff --git a/mm/rmap_id.c b/mm/rmap_id.c
index 85a61c830f19..6c3187547741 100644
--- a/mm/rmap_id.c
+++ b/mm/rmap_id.c
@@ -87,6 +87,39 @@ static DEFINE_IDA(rmap_ida);
* involved page tables are locked and stop any page table walkers.
*/

+/*
+ * With 4 (order-2) possible exclusive mappings per folio, we can have
+ * 16777216 = 16M sub-IDs per 64bit value.
+ */
+static unsigned long get_rmap_subid_1(struct mm_struct *mm)
+{
+ return mm->mm_rmap_subid_1;
+}
+
+/*
+ * With 32 (order-5) possible exclusive mappings per folio, we can have
+ * 4096 sub-IDs per 64bit value.
+ *
+ * With 2 such 64bit values, we can support 4096^2 == 16M IDs.
+ */
+static unsigned long get_rmap_subid_2(struct mm_struct *mm, int nr)
+{
+ VM_WARN_ON_ONCE(nr > 1);
+ return mm->mm_rmap_subid_2[nr];
+}
+
+/*
+ * With 512 (order-9) possible exclusive mappings per folio, we can have
+ * 128 sub-IDs per 64bit value.
+ *
+ * With 3 such 64bit values, we can support 128^3 == 16M IDs.
+ */
+static unsigned long get_rmap_subid_3(struct mm_struct *mm, int nr)
+{
+ VM_WARN_ON_ONCE(nr > 2);
+ return mm->mm_rmap_subid_3[nr];
+}
+
/*
* With 1024 (order-10) possible exclusive mappings per folio, we can have 64
* sub-IDs per 64bit value.
@@ -279,12 +312,24 @@ void __folio_set_large_rmap_val(struct folio *folio, int count,
atomic_long_set(&folio->_rmap_val4, get_rmap_subid_5(mm, 4) * count);
break;
#endif
- default:
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
atomic_long_set(&folio->_rmap_val0, get_rmap_subid_4(mm, 0) * count);
atomic_long_set(&folio->_rmap_val1, get_rmap_subid_4(mm, 1) * count);
atomic_long_set(&folio->_rmap_val2, get_rmap_subid_4(mm, 2) * count);
atomic_long_set(&folio->_rmap_val3, get_rmap_subid_4(mm, 3) * count);
break;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_3(mm, 0) * count);
+ atomic_long_set(&folio->_rmap_val1, get_rmap_subid_3(mm, 1) * count);
+ atomic_long_set(&folio->_rmap_val2, get_rmap_subid_3(mm, 2) * count);
+ break;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_2(mm, 0) * count);
+ atomic_long_set(&folio->_rmap_val1, get_rmap_subid_2(mm, 1) * count);
+ break;
+ default:
+ atomic_long_set(&folio->_rmap_val0, get_rmap_subid_1(mm) * count);
+ break;
}
}

@@ -313,12 +358,24 @@ void __folio_add_large_rmap_val(struct folio *folio, int count,
atomic_long_add(get_rmap_subid_5(mm, 4) * count, &folio->_rmap_val4);
break;
#endif
- default:
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
atomic_long_add(get_rmap_subid_4(mm, 0) * count, &folio->_rmap_val0);
atomic_long_add(get_rmap_subid_4(mm, 1) * count, &folio->_rmap_val1);
atomic_long_add(get_rmap_subid_4(mm, 2) * count, &folio->_rmap_val2);
atomic_long_add(get_rmap_subid_4(mm, 3) * count, &folio->_rmap_val3);
break;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
+ atomic_long_add(get_rmap_subid_3(mm, 0) * count, &folio->_rmap_val0);
+ atomic_long_add(get_rmap_subid_3(mm, 1) * count, &folio->_rmap_val1);
+ atomic_long_add(get_rmap_subid_3(mm, 2) * count, &folio->_rmap_val2);
+ break;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
+ atomic_long_add(get_rmap_subid_2(mm, 0) * count, &folio->_rmap_val0);
+ atomic_long_add(get_rmap_subid_2(mm, 1) * count, &folio->_rmap_val1);
+ break;
+ default:
+ atomic_long_add(get_rmap_subid_1(mm) * count, &folio->_rmap_val0);
+ break;
}
}

@@ -330,7 +387,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,

switch (order) {
#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
- case RMAP_SUBID_6_MIN_ORDER .. RMAP_SUBID_6_MAX_ORDER:
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_6(mm, 0) * count);
diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_6(mm, 1) * count);
diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_6(mm, 2) * count);
@@ -340,7 +397,7 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
break;
#endif
#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
- case RMAP_SUBID_5_MIN_ORDER .. RMAP_SUBID_5_MAX_ORDER:
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_5(mm, 0) * count);
diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_5(mm, 1) * count);
diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_5(mm, 2) * count);
@@ -348,12 +405,24 @@ bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
diff |= atomic_long_read(&folio->_rmap_val4) ^ (get_rmap_subid_5(mm, 4) * count);
break;
#endif
- default:
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_4(mm, 0) * count);
diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_4(mm, 1) * count);
diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_4(mm, 2) * count);
diff |= atomic_long_read(&folio->_rmap_val3) ^ (get_rmap_subid_4(mm, 3) * count);
break;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_3(mm, 0) * count);
+ diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_3(mm, 1) * count);
+ diff |= atomic_long_read(&folio->_rmap_val2) ^ (get_rmap_subid_3(mm, 2) * count);
+ break;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_2(mm, 0) * count);
+ diff |= atomic_long_read(&folio->_rmap_val1) ^ (get_rmap_subid_2(mm, 1) * count);
+ break;
+ default:
+ diff |= atomic_long_read(&folio->_rmap_val0) ^ (get_rmap_subid_1(mm) * count);
+ break;
}
return !diff;
}
--
2.41.0

2023-11-24 13:29:32

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 15/20] mm/rmap_id: verify precalculated subids with CONFIG_DEBUG_VM

Let's verify the precalculated subids for 4/5/6 values.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/rmap_id.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/mm/rmap_id.c b/mm/rmap_id.c
index 6c3187547741..421d8d2b646c 100644
--- a/mm/rmap_id.c
+++ b/mm/rmap_id.c
@@ -481,3 +481,29 @@ void free_rmap_id(int id)
ida_free(&rmap_ida, id);
spin_unlock(&rmap_id_lock);
}
+
+#ifdef CONFIG_DEBUG_VM
+static int __init rmap_id_init(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(rmap_subids_4); i++)
+ WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_4_MAX_ORDER, i) !=
+ rmap_subids_4[i]);
+
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ for (i = 0; i < ARRAY_SIZE(rmap_subids_5); i++)
+ WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_5_MAX_ORDER, i) !=
+ rmap_subids_5[i]);
+#endif
+
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ for (i = 0; i < ARRAY_SIZE(rmap_subids_6); i++)
+ WARN_ON_ONCE(calc_rmap_subid(1u << RMAP_SUBID_6_MAX_ORDER, i) !=
+ rmap_subids_6[i]);
+#endif
+
+ return 0;
+}
+module_init(rmap_id_init)
+#endif /* CONFIG_DEBUG_VM */
--
2.41.0

2023-11-24 13:29:51

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 17/20] mm/rmap_id: reduce atomic RMW operations when we are the exclusive writer

We can reduce the number of atomic RMW operations when we are the
single exclusive writer -- the common case.

So instead of always requiring

(1) 2 atomic RMW operations for adjusting the atomic seqcount
(2) 1 atomic RMW operation for adjusting the total mapcount
(3) 1 to 6 atomic RMW operations for adjusting the rmap values

We can avoid (2) and (3) if we are the exclusive writer and limit it
to the 2 atomic RMW operations from (1).
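
As a rough worked count (assuming the 3-value layout that an earlier patch
uses for order-9 folios): a shared writer pays 2 + 1 + 3 = 6 atomic RMW
operations per PTE (un)map, while the exclusive fast path pays only the 2
seqcount operations -- the mapcount and rmap value updates become plain
loads and stores.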

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 81 +++++++++++++++++++++++++++++++++-----------
mm/rmap_id.c | 52 ++++++++++++++++++++++++++++
2 files changed, 114 insertions(+), 19 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0758dddc5528..538c23d3c0c9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -291,23 +291,36 @@ static inline void __folio_undo_large_rmap(struct folio *folio)
#endif
}

-static inline void __folio_write_large_rmap_begin(struct folio *folio)
+static inline bool __folio_write_large_rmap_begin(struct folio *folio)
{
+ bool exclusive;
+
VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
- raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount,
- false);
+
+ exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount,
+ true);
+ if (likely(exclusive)) {
+ prefetchw(&folio->_rmap_val0);
+ if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER))
+ prefetchw(&folio->_rmap_val4);
+ }
+ return exclusive;
}

-static inline void __folio_write_large_rmap_end(struct folio *folio)
+static inline void __folio_write_large_rmap_end(struct folio *folio,
+ bool exclusive)
{
- raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false);
+ raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount,
+ exclusive);
}

void __folio_set_large_rmap_val(struct folio *folio, int count,
struct mm_struct *mm);
void __folio_add_large_rmap_val(struct folio *folio, int count,
struct mm_struct *mm);
+void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count,
+ struct mm_struct *mm);
bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
struct mm_struct *mm);
#else
@@ -317,12 +330,14 @@ static inline void __folio_prep_large_rmap(struct folio *folio)
static inline void __folio_undo_large_rmap(struct folio *folio)
{
}
-static inline void __folio_write_large_rmap_begin(struct folio *folio)
+static inline bool __folio_write_large_rmap_begin(struct folio *folio)
{
VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
+ return false;
}
-static inline void __folio_write_large_rmap_end(struct folio *folio)
+static inline void __folio_write_large_rmap_end(struct folio *folio,
+ bool exclusive)
{
}
static inline void __folio_set_large_rmap_val(struct folio *folio, int count,
@@ -333,6 +348,10 @@ static inline void __folio_add_large_rmap_val(struct folio *folio, int count,
struct mm_struct *mm)
{
}
+static inline void __folio_add_large_rmap_val_exclusive(struct folio *folio,
+ int count, struct mm_struct *mm)
+{
+}
#endif /* CONFIG_RMAP_ID */

static inline void folio_set_large_mapcount(struct folio *folio,
@@ -348,28 +367,52 @@ static inline void folio_set_large_mapcount(struct folio *folio,
static inline void folio_inc_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- __folio_write_large_rmap_begin(folio);
- atomic_inc(&folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, 1, vma->vm_mm);
- __folio_write_large_rmap_end(folio);
+ bool exclusive;
+
+ exclusive = __folio_write_large_rmap_begin(folio);
+ if (likely(exclusive)) {
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) + 1);
+ __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm);
+ } else {
+ atomic_inc(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, 1, vma->vm_mm);
+ }
+ __folio_write_large_rmap_end(folio, exclusive);
}

static inline void folio_add_large_mapcount(struct folio *folio,
int count, struct vm_area_struct *vma)
{
- __folio_write_large_rmap_begin(folio);
- atomic_add(count, &folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, count, vma->vm_mm);
- __folio_write_large_rmap_end(folio);
+ bool exclusive;
+
+ exclusive = __folio_write_large_rmap_begin(folio);
+ if (likely(exclusive)) {
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) + count);
+ __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm);
+ } else {
+ atomic_add(count, &folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, count, vma->vm_mm);
+ }
+ __folio_write_large_rmap_end(folio, exclusive);
}

static inline void folio_dec_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- __folio_write_large_rmap_begin(folio);
- atomic_dec(&folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, -1, vma->vm_mm);
- __folio_write_large_rmap_end(folio);
+ bool exclusive;
+
+ exclusive = __folio_write_large_rmap_begin(folio);
+ if (likely(exclusive)) {
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) - 1);
+ __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm);
+ } else {
+ atomic_dec(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, -1, vma->vm_mm);
+ }
+ __folio_write_large_rmap_end(folio, exclusive);
}

/* RMAP flags, currently only relevant for some anon rmap operations. */
diff --git a/mm/rmap_id.c b/mm/rmap_id.c
index 421d8d2b646c..5009c6e43965 100644
--- a/mm/rmap_id.c
+++ b/mm/rmap_id.c
@@ -379,6 +379,58 @@ void __folio_add_large_rmap_val(struct folio *folio, int count,
}
}

+void __folio_add_large_rmap_val_exclusive(struct folio *folio, int count,
+ struct mm_struct *mm)
+{
+ const unsigned int order = folio_order(folio);
+
+ /*
+ * Concurrent rmap value modifications are impossible. We don't care
+ * about store tearing because readers will realize the concurrent
+ * updates using the seqcount and simply retry. So adjust the bare
+ * atomic counter instead.
+ */
+ switch (order) {
+#if MAX_ORDER >= RMAP_SUBID_6_MIN_ORDER
+ case RMAP_SUBID_6_MIN_ORDER ... RMAP_SUBID_6_MAX_ORDER:
+ folio->_rmap_val0.counter += get_rmap_subid_6(mm, 0) * count;
+ folio->_rmap_val1.counter += get_rmap_subid_6(mm, 1) * count;
+ folio->_rmap_val2.counter += get_rmap_subid_6(mm, 2) * count;
+ folio->_rmap_val3.counter += get_rmap_subid_6(mm, 3) * count;
+ folio->_rmap_val4.counter += get_rmap_subid_6(mm, 4) * count;
+ folio->_rmap_val5.counter += get_rmap_subid_6(mm, 5) * count;
+ break;
+#endif
+#if MAX_ORDER >= RMAP_SUBID_5_MIN_ORDER
+ case RMAP_SUBID_5_MIN_ORDER ... RMAP_SUBID_5_MAX_ORDER:
+ folio->_rmap_val0.counter += get_rmap_subid_5(mm, 0) * count;
+ folio->_rmap_val1.counter += get_rmap_subid_5(mm, 1) * count;
+ folio->_rmap_val2.counter += get_rmap_subid_5(mm, 2) * count;
+ folio->_rmap_val3.counter += get_rmap_subid_5(mm, 3) * count;
+ folio->_rmap_val4.counter += get_rmap_subid_5(mm, 4) * count;
+ break;
+#endif
+ case RMAP_SUBID_4_MIN_ORDER ... RMAP_SUBID_4_MAX_ORDER:
+ folio->_rmap_val0.counter += get_rmap_subid_4(mm, 0) * count;
+ folio->_rmap_val1.counter += get_rmap_subid_4(mm, 1) * count;
+ folio->_rmap_val2.counter += get_rmap_subid_4(mm, 2) * count;
+ folio->_rmap_val3.counter += get_rmap_subid_4(mm, 3) * count;
+ break;
+ case RMAP_SUBID_3_MIN_ORDER ... RMAP_SUBID_3_MAX_ORDER:
+ folio->_rmap_val0.counter += get_rmap_subid_3(mm, 0) * count;
+ folio->_rmap_val1.counter += get_rmap_subid_3(mm, 1) * count;
+ folio->_rmap_val2.counter += get_rmap_subid_3(mm, 2) * count;
+ break;
+ case RMAP_SUBID_2_MIN_ORDER ... RMAP_SUBID_2_MAX_ORDER:
+ folio->_rmap_val0.counter += get_rmap_subid_2(mm, 0) * count;
+ folio->_rmap_val1.counter += get_rmap_subid_2(mm, 1) * count;
+ break;
+ default:
+ folio->_rmap_val0.counter += get_rmap_subid_1(mm) * count;
+ break;
+ }
+}
+
bool __folio_has_large_matching_rmap_val(struct folio *folio, int count,
struct mm_struct *mm)
{
--
2.41.0

2023-11-24 13:29:55

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 18/20] atomic_seqcount: use atomic add-return instead of atomic cmpxchg on 64bit

Turns out that it can be beneficial on some HW to use an add-return instead
of an atomic cmpxchg. However, we have to deal with more possible races
now: in the worst case, each and every CPU might try becoming the exclusive
writer at the same time, so we need the same number of bits as for the
shared writer case.

In case we detect that we didn't end up being the exclusive writer,
simply back off and convert to a shared writer.

Only implement this optimization on 64bit, where we can steal more bits
from the actual sequence without sorrow.
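
For reference, the resulting 64bit layout, derived from the masks below (a
sketch, not part of the patch):

* bits  0..15: shared writer count     (ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK)
* bits 16..31: exclusive writer count  (ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)
* bits 32..63: sequence                (ATOMIC_SEQCOUNT_SEQUENCE_STEP == 1ul << 32)

As long as the number of concurrent writers stays within the 16bit field --
the same limit the shared writer count already has -- the unconditional
add-return of ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER cannot spill into the
sequence bits, even if every CPU races for exclusive ownership at once.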

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/atomic_seqcount.h | 43 +++++++++++++++++++++++++++------
1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h
index 00286a9da221..9cd40903863d 100644
--- a/include/linux/atomic_seqcount.h
+++ b/include/linux/atomic_seqcount.h
@@ -42,9 +42,10 @@ typedef struct raw_atomic_seqcount {
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful
#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul
-#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful
-/* We have 48bit for the actual sequence. */
-#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul
+#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000000ffff0000ul
+#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x00000000fffffffful
+/* We have 32bit for the actual sequence. */
+#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000100000000ul

#else /* CONFIG_64BIT */

@@ -53,6 +54,7 @@ typedef struct raw_atomic_seqcount {
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful
#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul
+#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK 0x00000080ul
#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful
/* We have 24bit for the actual sequence. */
#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul
@@ -144,7 +146,7 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s,
static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s,
bool try_exclusive)
{
- unsigned long seqcount, seqcount_new;
+ unsigned long __maybe_unused seqcount, seqcount_new;

BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT));
#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
@@ -160,6 +162,32 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s,
if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK))
goto shared;

+#ifdef CONFIG_64BIT
+ BUILD_BUG_ON(__builtin_popcount(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK) !=
+ __builtin_popcount(ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK));
+
+ /* See comment for atomic_long_try_cmpxchg() below. */
+ seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER,
+ &s->sequence);
+ if (likely((seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK) ==
+ ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER))
+ return true;
+
+ /*
+ * Whoops, we raced with another writer. Back off, converting ourselves
+ * to a shared writer and wait for any exclusive writers.
+ */
+ atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER - ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER,
+ &s->sequence);
+ /*
+ * No need for __smp_mb__after_atomic(): the reader side already
+ * realizes that it has to retry and the memory barrier from
+ * atomic_long_add_return() is sufficient for that.
+ */
+ while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)
+ cpu_relax();
+ return false;
+#else
seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER;
/*
* Store the sequence before any store in the critical section. Further,
@@ -168,6 +196,7 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s,
*/
if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new))
return true;
+#endif
shared:
/*
* Indicate that there is a shared writer, and spin until the exclusive
@@ -185,10 +214,10 @@ static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s,
DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) >
ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX);
#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
- if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)))
+ if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)))
return false;

- while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)
+ while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK)
cpu_relax();
return false;
}
@@ -209,7 +238,7 @@ static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s,
if (likely(exclusive)) {
#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) &
- ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER));
+ ATOMIC_SEQCOUNT_EXCLUSIVE_WRITERS_MASK));
#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER;
} else {
--
2.41.0

2023-11-24 13:30:00

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 14/20] mm/huge_memory: avoid folio_refcount() < folio_mapcount() in __split_huge_pmd_locked()

Currently, there is a short period in time where the refcount is smaller
than the mapcount. Let's just make sure we obey the rules of refcount
vs. mapcount: increment the refcount before incrementing the mapcount
and decrement the refcount after decrementing the mapcount.
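
For the PMD-remap path below, the resulting ordering is roughly (a sketch
with elided arguments, not a verbatim copy of the code):

	folio_ref_add(folio, HPAGE_PMD_NR);	/* take the references first */
	folio_add_anon_rmap_range(...);		/* ... then add the PTE mappings */
	...
	page_remove_rmap(page, vma, true);	/* drop the PMD mapping first */
	put_page(page);				/* ... then drop its reference */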

While this could make code like can_split_folio() fail to detect other
folio references, such code is (currently) racy already and this change
shouldn't actually be considered a real fix but rather an improvement/
cleanup.

The refcount vs. mapcount changes are now well balanced in the code, with
the cost of one additional refcount change, which really shouldn't matter
here that much -- we're usually touching >= 512 subpage mapcounts and
much more after all.

Found while playing with some sanity checks to detect such cases, which
we might add at some later point.

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f47971d1afbf..9639b4edc8a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2230,7 +2230,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (!freeze) {
rmap_t rmap_flags = RMAP_NONE;

- folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ folio_ref_add(folio, HPAGE_PMD_NR);
if (anon_exclusive)
rmap_flags = RMAP_EXCLUSIVE;
folio_add_anon_rmap_range(folio, page, HPAGE_PMD_NR,
@@ -2294,10 +2294,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
pte_unmap(pte - 1);

- if (!pmd_migration)
+ if (!pmd_migration) {
page_remove_rmap(page, vma, true);
- if (freeze)
put_page(page);
+ }

smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
--
2.41.0

2023-11-24 13:30:37

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 19/20] mm/rmap: factor out removing folio range into __folio_remove_rmap_range()

Let's factor it out, optimize for small folios, and compact it a bit.

Well, we're adding the range part, but that will surely come in handy
soon -- and it's now easier to compare it with __folio_add_rmap_range().

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/rmap.c | 90 +++++++++++++++++++++++++++++++++----------------------
1 file changed, 55 insertions(+), 35 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index da7fa46a18fc..80ac53633332 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1155,6 +1155,57 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
return nr;
}

+static unsigned int __folio_remove_rmap_range(struct folio *folio,
+ struct page *page, unsigned int nr_pages,
+ struct vm_area_struct *vma, bool compound, int *nr_pmdmapped)
+{
+ atomic_t *mapped = &folio->_nr_pages_mapped;
+ int last, count, nr = 0;
+
+ VM_WARN_ON_FOLIO(compound && page != &folio->page, folio);
+ VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
+ VM_WARN_ON_FOLIO(compound && nr_pages != folio_nr_pages(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_large(folio) && nr_pages != 1, folio);
+
+ if (likely(!folio_test_large(folio)))
+ return atomic_add_negative(-1, &page->_mapcount);
+
+ /* Is page being unmapped by PTE? Is this its last map to be removed? */
+ if (!compound) {
+ folio_add_large_mapcount(folio, -nr_pages, vma);
+ count = nr_pages;
+ do {
+ last = atomic_add_negative(-1, &page->_mapcount);
+ if (last) {
+ last = atomic_dec_return_relaxed(mapped);
+ if (last < COMPOUND_MAPPED)
+ nr++;
+ }
+ } while (page++, --count > 0);
+ } else if (folio_test_pmd_mappable(folio)) {
+ /* That test is redundant: it's for safety or to optimize out */
+
+ folio_dec_large_mapcount(folio, vma);
+ last = atomic_add_negative(-1, &folio->_entire_mapcount);
+ if (last) {
+ nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
+ if (likely(nr < COMPOUND_MAPPED)) {
+ *nr_pmdmapped = folio_nr_pages(folio);
+ nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ /* Raced ahead of another remove and an add? */
+ if (unlikely(nr < 0))
+ nr = 0;
+ } else {
+ /* An add of COMPOUND_MAPPED raced ahead */
+ nr = 0;
+ }
+ }
+ } else {
+ VM_WARN_ON_ONCE_FOLIO(true, folio);
+ }
+ return nr;
+}
+
/**
* folio_move_anon_rmap - move a folio to our anon_vma
* @folio: The folio to move to our anon_vma
@@ -1439,13 +1490,10 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
bool compound)
{
struct folio *folio = page_folio(page);
- atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
- bool last;
+ unsigned long nr_pages = compound ? folio_nr_pages(folio) : 1;
+ unsigned int nr, nr_pmdmapped = 0;
enum node_stat_item idx;

- VM_BUG_ON_PAGE(compound && !PageHead(page), page);
-
/* Hugetlb pages are not counted in NR_*MAPPED */
if (unlikely(folio_test_hugetlb(folio))) {
/* hugetlb pages are always mapped with pmds */
@@ -1454,36 +1502,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
return;
}

- if (folio_test_large(folio))
- folio_dec_large_mapcount(folio, vma);
-
- /* Is page being unmapped by PTE? Is this its last map to be removed? */
- if (likely(!compound)) {
- last = atomic_add_negative(-1, &page->_mapcount);
- nr = last;
- if (last && folio_test_large(folio)) {
- nr = atomic_dec_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
- } else if (folio_test_pmd_mappable(folio)) {
- /* That test is redundant: it's for safety or to optimize out */
-
- last = atomic_add_negative(-1, &folio->_entire_mapcount);
- if (last) {
- nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
- if (likely(nr < COMPOUND_MAPPED)) {
- nr_pmdmapped = folio_nr_pages(folio);
- nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
- /* Raced ahead of another remove and an add? */
- if (unlikely(nr < 0))
- nr = 0;
- } else {
- /* An add of COMPOUND_MAPPED raced ahead */
- nr = 0;
- }
- }
- }
-
+ nr = __folio_remove_rmap_range(folio, page, nr_pages, vma, compound,
+ &nr_pmdmapped);
if (nr_pmdmapped) {
if (folio_test_anon(folio))
idx = NR_ANON_THPS;
--
2.41.0

2023-11-24 13:30:39

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 16/20] atomic_seqcount: support a single exclusive writer in the absence of other writers

The current atomic seqcount requires that all writers must use atomic
RMW operations in the critical section, which can result in quite some
overhead on some platforms. In the common case, there is only a single
writer, and ideally we'd be able to not use atomic RMW operations in that
case, to reduce the overall number of atomic RMW operations on the
fast path.

So let's add support for a single exclusive writer. If there are no
other writers, a writer can become the single exclusive writer by using
an atomic cmpxchg on the atomic seqcount. However, if there is any
concurrent writer (shared or exclusive), the writers become shared and
only have to wait for a single exclusive writer to finish.

So shared writers might be delayed a bit by the single exclusive writer,
but they don't starve as they are guaranteed to make progress after the
exclusive writer finished (that ideally runs faster than any shared writer
due to no atomic RMW operations in the critical section).

The exclusive path now effectively acts as a lock: if the trylock fails,
we fall back to the shared path. We need acquire-release semantics that are
implied by the full memory barriers that we are enforcing.

Instead of the atomic_long_add_return(), we could keep using an
atomic_long_add() + atomic_long_read(). But I suspect that doesn't
really matter. If it ever matters, it will be easy to optimize.
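
For orientation, the intended usage pattern looks roughly like this (a
sketch; the reader-side begin helper name is assumed from the rest of the
series, only the retry helper is visible here):

	bool exclusive = raw_write_atomic_seqcount_begin(&s, true);

	if (exclusive) {
		/* single writer: plain loads and stores are fine */
	} else {
		/* shared writer: must stick to atomic RMW operations */
	}
	raw_write_atomic_seqcount_end(&s, exclusive);

	/* Reader side (unchanged by this patch): */
	do {
		seq = raw_read_atomic_seqcount(&s);	/* assumed helper */
		/* read the protected values */
	} while (raw_read_atomic_seqcount_retry(&s, seq));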

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/atomic_seqcount.h | 101 ++++++++++++++++++++++++++------
include/linux/rmap.h | 5 +-
2 files changed, 85 insertions(+), 21 deletions(-)

diff --git a/include/linux/atomic_seqcount.h b/include/linux/atomic_seqcount.h
index 109447b663a1..00286a9da221 100644
--- a/include/linux/atomic_seqcount.h
+++ b/include/linux/atomic_seqcount.h
@@ -8,8 +8,11 @@

/*
* raw_atomic_seqcount_t -- a reader-writer consistency mechanism with
- * lockless readers (read-only retry loops), and lockless writers.
- * The writers must use atomic RMW operations in the critical section.
+ * lockless readers (read-only retry loops), and (almost) lockless writers.
+ * Shared writers must use atomic RMW operations in the critical section,
+ * a single exclusive writer can avoid atomic RMW operations in the critical
+ * section. Shared writers will always have to wait for at most one exclusive
+ * writer to finish in order to make progress.
*
* This locking mechanism is applicable when all individual operations
* performed by writers can be expressed using atomic RMW operations
@@ -38,9 +41,10 @@ typedef struct raw_atomic_seqcount {
/* 65536 CPUs */
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x0000000000008000ul
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x000000000000fffful
-#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000000fffful
+#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x0000000000010000ul
+#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000000001fffful
/* We have 48bit for the actual sequence. */
-#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000010000ul
+#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x0000000000020000ul

#else /* CONFIG_64BIT */

@@ -48,9 +52,10 @@ typedef struct raw_atomic_seqcount {
/* 64 CPUs */
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX 0x00000040ul
#define ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK 0x0000007ful
-#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x0000007ful
-/* We have 25bit for the actual sequence. */
-#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000080ul
+#define ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER 0x00000080ul
+#define ATOMIC_SEQCOUNT_WRITERS_MASK 0x000000fful
+/* We have 24bit for the actual sequence. */
+#define ATOMIC_SEQCOUNT_SEQUENCE_STEP 0x00000100ul

#endif /* CONFIG_64BIT */

@@ -126,44 +131,102 @@ static inline bool raw_read_atomic_seqcount_retry(raw_atomic_seqcount_t *s,
/**
* raw_write_seqcount_begin() - start a raw_seqcount_t write critical section
* @s: Pointer to the raw_atomic_seqcount_t
+ * @try_exclusive: Whether to try becoming the exclusive writer.
*
* raw_write_seqcount_begin() opens the write critical section of the
* given raw_seqcount_t. This function must not be used in interrupt context.
+ *
+ * Return: "true" when we are the exclusive writer and can avoid atomic RMW
+ * operations in the critical section. Otherwise, we are a shared
+ * writer and have to use atomic RMW operations in the critical
+ * section. Will always return "false" if @try_exclusive is not "true".
*/
-static inline void raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s)
+static inline bool raw_write_atomic_seqcount_begin(raw_atomic_seqcount_t *s,
+ bool try_exclusive)
{
+ unsigned long seqcount, seqcount_new;
+
BUILD_BUG_ON(IS_ENABLED(CONFIG_PREEMPT_RT));
#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
DEBUG_LOCKS_WARN_ON(in_interrupt());
#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
preempt_disable();
- atomic_long_add(ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence);
- /* Store the sequence before any store in the critical section. */
- smp_mb__after_atomic();
+
+ /* If requested, can we just become the exclusive writer? */
+ if (!try_exclusive)
+ goto shared;
+
+ seqcount = atomic_long_read(&s->sequence);
+ if (unlikely(seqcount & ATOMIC_SEQCOUNT_WRITERS_MASK))
+ goto shared;
+
+ seqcount_new = seqcount | ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER;
+ /*
+ * Store the sequence before any store in the critical section. Further,
+ * this implies an acquire so loads within the critical section are
+ * not reordered to be outside the critical section.
+ */
+ if (atomic_long_try_cmpxchg(&s->sequence, &seqcount, seqcount_new))
+ return true;
+shared:
+ /*
+ * Indicate that there is a shared writer, and spin until the exclusive
+ * writer is done. This avoids writer starvation, because we'll always
+ * have to wait for at most one writer.
+ *
+ * We spin with preemption disabled to not reschedule to a reader that
+ * cannot make any progress either way.
+ *
+ * Store the sequence before any store in the critical section.
+ */
+ seqcount = atomic_long_add_return(ATOMIC_SEQCOUNT_SHARED_WRITER,
+ &s->sequence);
#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
- DEBUG_LOCKS_WARN_ON((atomic_long_read(&s->sequence) &
- ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) >
+ DEBUG_LOCKS_WARN_ON((seqcount & ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK) >
ATOMIC_SEQCOUNT_SHARED_WRITERS_MAX);
#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+ if (likely(!(seqcount & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)))
+ return false;
+
+ while (atomic_long_read(&s->sequence) & ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER)
+ cpu_relax();
+ return false;
}

/**
* raw_write_seqcount_end() - end a raw_seqcount_t write critical section
* @s: Pointer to the raw_atomic_seqcount_t
+ * @exclusive: Return value of raw_write_atomic_seqcount_begin().
*
* raw_write_seqcount_end() closes the write critical section of the
* given raw_seqcount_t.
*/
-static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s)
+static inline void raw_write_atomic_seqcount_end(raw_atomic_seqcount_t *s,
+ bool exclusive)
{
+ unsigned long val = ATOMIC_SEQCOUNT_SEQUENCE_STEP;
+
+ if (likely(exclusive)) {
+#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
+ DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) &
+ ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER));
+#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
+ val -= ATOMIC_SEQCOUNT_EXCLUSIVE_WRITER;
+ } else {
#ifdef CONFIG_DEBUG_ATOMIC_SEQCOUNT
- DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) &
- ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK));
+ DEBUG_LOCKS_WARN_ON(!(atomic_long_read(&s->sequence) &
+ ATOMIC_SEQCOUNT_SHARED_WRITERS_MASK));
#endif /* CONFIG_DEBUG_ATOMIC_SEQCOUNT */
- /* Store the sequence after any store in the critical section. */
+ val -= ATOMIC_SEQCOUNT_SHARED_WRITER;
+ }
+ /*
+ * Store the sequence after any store in the critical section. For
+ * the exclusive path, this further implies a release, so loads
+ * within the critical section are not reordered to be outside the
+ * critical section.
+ */
smp_mb__before_atomic();
- atomic_long_add(ATOMIC_SEQCOUNT_SEQUENCE_STEP -
- ATOMIC_SEQCOUNT_SHARED_WRITER, &s->sequence);
+ atomic_long_add(val, &s->sequence);
preempt_enable();
}

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 76e6fb1dad5c..0758dddc5528 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -295,12 +295,13 @@ static inline void __folio_write_large_rmap_begin(struct folio *folio)
{
VM_WARN_ON_FOLIO(!folio_test_large_rmappable(folio), folio);
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
- raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount);
+ raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount,
+ false);
}

static inline void __folio_write_large_rmap_end(struct folio *folio)
{
- raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount);
+ raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount, false);
}

void __folio_set_large_rmap_val(struct folio *folio, int count,
--
2.41.0

2023-11-24 13:32:02

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH WIP v1 20/20] mm/rmap: perform all mapcount operations of large folios under the rmap seqcount

Let's extend the atomic seqcount to also protect modifications of:
* The subpage mapcounts
* The entire mapcount
* folio->_nr_pages_mapped

This way, we can avoid another 1/2 atomic RMW operations on the fast
path (and significantly more when batching): When we are the exclusive
writer, we only need two atomic RMW operations to manage the atomic
seqcount.

Let's document how the existing atomic seqcount memory barriers keep the
old behavior unmodified: especially, how they make sure that folio
refcount updates cannot be reordered with folio mapcount updates.
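
The large-mapcount helpers consequently lose their own seqcount handling,
and the rmap (un)map paths take the seqcount once around all mapcount
updates, roughly like this (a sketch of the calling convention, not a
verbatim copy of the rmap.c hunks):

	exclusive = __folio_write_large_rmap_begin(folio);
	if (likely(exclusive)) {
		/* plain, non-RMW updates of mapcounts and rmap values */
		folio_inc_large_mapcount_exclusive(folio, vma);
	} else {
		folio_inc_large_mapcount(folio, vma);
	}
	__folio_write_large_rmap_end(folio, exclusive);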

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/rmap.h | 95 ++++++++++++++++++++++++++------------------
mm/rmap.c | 84 +++++++++++++++++++++++++++++++++++++--
2 files changed, 137 insertions(+), 42 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 538c23d3c0c9..3cff4aa71393 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -301,6 +301,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio)
exclusive = raw_write_atomic_seqcount_begin(&folio->_rmap_atomic_seqcount,
true);
if (likely(exclusive)) {
+ /*
+ * Note: raw_write_atomic_seqcount_begin() implies a full
+ * memory barrier like non-exclusive mapcount operations
+ * will. Any refcount updates that happened before this call
+ * are visible before any mapcount updates on other CPUs.
+ */
prefetchw(&folio->_rmap_val0);
if (unlikely(folio_order(folio) > RMAP_SUBID_4_MAX_ORDER))
prefetchw(&folio->_rmap_val4);
@@ -311,6 +317,12 @@ static inline bool __folio_write_large_rmap_begin(struct folio *folio)
static inline void __folio_write_large_rmap_end(struct folio *folio,
bool exclusive)
{
+ /*
+ * Note: raw_write_atomic_seqcount_end() implies a full memory
+ * barrier like non-exclusive mapcount operations will. Any
+ * refcount updates happening after this call are visible after any
+ * mapcount updates on other CPUs.
+ */
raw_write_atomic_seqcount_end(&folio->_rmap_atomic_seqcount,
exclusive);
}
@@ -367,52 +379,46 @@ static inline void folio_set_large_mapcount(struct folio *folio,
static inline void folio_inc_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- bool exclusive;
+ atomic_inc(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, 1, vma->vm_mm);
+}

- exclusive = __folio_write_large_rmap_begin(folio);
- if (likely(exclusive)) {
- atomic_set(&folio->_total_mapcount,
- atomic_read(&folio->_total_mapcount) + 1);
- __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm);
- } else {
- atomic_inc(&folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, 1, vma->vm_mm);
- }
- __folio_write_large_rmap_end(folio, exclusive);
+static inline void folio_inc_large_mapcount_exclusive(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) + 1);
+ __folio_add_large_rmap_val_exclusive(folio, 1, vma->vm_mm);
}

static inline void folio_add_large_mapcount(struct folio *folio,
int count, struct vm_area_struct *vma)
{
- bool exclusive;
+ atomic_add(count, &folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, count, vma->vm_mm);
+}

- exclusive = __folio_write_large_rmap_begin(folio);
- if (likely(exclusive)) {
- atomic_set(&folio->_total_mapcount,
- atomic_read(&folio->_total_mapcount) + count);
- __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm);
- } else {
- atomic_add(count, &folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, count, vma->vm_mm);
- }
- __folio_write_large_rmap_end(folio, exclusive);
+static inline void folio_add_large_mapcount_exclusive(struct folio *folio,
+ int count, struct vm_area_struct *vma)
+{
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) + count);
+ __folio_add_large_rmap_val_exclusive(folio, count, vma->vm_mm);
}

static inline void folio_dec_large_mapcount(struct folio *folio,
struct vm_area_struct *vma)
{
- bool exclusive;
+ atomic_dec(&folio->_total_mapcount);
+ __folio_add_large_rmap_val(folio, -1, vma->vm_mm);
+}

- exclusive = __folio_write_large_rmap_begin(folio);
- if (likely(exclusive)) {
- atomic_set(&folio->_total_mapcount,
- atomic_read(&folio->_total_mapcount) - 1);
- __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm);
- } else {
- atomic_dec(&folio->_total_mapcount);
- __folio_add_large_rmap_val(folio, -1, vma->vm_mm);
- }
- __folio_write_large_rmap_end(folio, exclusive);
+static inline void folio_dec_large_mapcount_exclusive(struct folio *folio,
+ struct vm_area_struct *vma)
+{
+ atomic_set(&folio->_total_mapcount,
+ atomic_read(&folio->_total_mapcount) - 1);
+ __folio_add_large_rmap_val_exclusive(folio, -1, vma->vm_mm);
}

/* RMAP flags, currently only relevant for some anon rmap operations. */
@@ -462,6 +468,7 @@ static inline void __page_dup_rmap(struct page *page,
struct vm_area_struct *dst_vma, bool compound)
{
struct folio *folio = page_folio(page);
+ bool exclusive;

VM_BUG_ON_PAGE(compound && !PageHead(page), page);
if (likely(!folio_test_large(folio))) {
@@ -475,11 +482,23 @@ static inline void __page_dup_rmap(struct page *page,
return;
}

- if (compound)
- atomic_inc(&folio->_entire_mapcount);
- else
- atomic_inc(&page->_mapcount);
- folio_inc_large_mapcount(folio, dst_vma);
+ exclusive = __folio_write_large_rmap_begin(folio);
+ if (likely(exclusive)) {
+ if (compound)
+ atomic_set(&folio->_entire_mapcount,
+ atomic_read(&folio->_entire_mapcount) + 1);
+ else
+ atomic_set(&page->_mapcount,
+ atomic_read(&page->_mapcount) + 1);
+ folio_inc_large_mapcount_exclusive(folio, dst_vma);
+ } else {
+ if (compound)
+ atomic_inc(&folio->_entire_mapcount);
+ else
+ atomic_inc(&page->_mapcount);
+ folio_inc_large_mapcount(folio, dst_vma);
+ }
+ __folio_write_large_rmap_end(folio, exclusive);
}

static inline void page_dup_file_rmap(struct page *page,
diff --git a/mm/rmap.c b/mm/rmap.c
index 80ac53633332..755a62b046e2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1109,7 +1109,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
struct vm_area_struct *vma, bool compound, int *nr_pmdmapped)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
- int first, count, nr = 0;
+ int first, val, count, nr = 0;
+ bool exclusive;

VM_WARN_ON_FOLIO(compound && page != &folio->page, folio);
VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
@@ -1119,8 +1120,23 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
if (likely(!folio_test_large(folio)))
return atomic_inc_and_test(&page->_mapcount);

+ exclusive = __folio_write_large_rmap_begin(folio);
+
/* Is page being mapped by PTE? Is this its first map to be added? */
- if (!compound) {
+ if (likely(exclusive) && !compound) {
+ count = nr_pages;
+ do {
+ val = atomic_read(&page->_mapcount) + 1;
+ atomic_set(&page->_mapcount, val);
+ if (!val) {
+ val = atomic_read(mapped) + 1;
+ atomic_set(mapped, val);
+ if (val < COMPOUND_MAPPED)
+ nr++;
+ }
+ } while (page++, --count > 0);
+ folio_add_large_mapcount_exclusive(folio, nr_pages, vma);
+ } else if (!compound) {
count = nr_pages;
do {
first = atomic_inc_and_test(&page->_mapcount);
@@ -1131,6 +1147,26 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
}
} while (page++, --count > 0);
folio_add_large_mapcount(folio, nr_pages, vma);
+ } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) {
+ /* That test is redundant: it's for safety or to optimize out */
+
+ val = atomic_read(&folio->_entire_mapcount) + 1;
+ atomic_set(&folio->_entire_mapcount, val);
+ if (!val) {
+ nr = atomic_read(mapped) + COMPOUND_MAPPED;
+ atomic_set(mapped, nr);
+ if (likely(nr < COMPOUND_MAPPED + COMPOUND_MAPPED)) {
+ *nr_pmdmapped = folio_nr_pages(folio);
+ nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ /* Raced ahead of a remove and another add? */
+ if (unlikely(nr < 0))
+ nr = 0;
+ } else {
+ /* Raced ahead of a remove of COMPOUND_MAPPED */
+ nr = 0;
+ }
+ }
+ folio_inc_large_mapcount_exclusive(folio, vma);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */

@@ -1152,6 +1188,8 @@ static unsigned int __folio_add_rmap_range(struct folio *folio,
} else {
VM_WARN_ON_ONCE_FOLIO(true, folio);
}
+
+ __folio_write_large_rmap_end(folio, exclusive);
return nr;
}

@@ -1160,7 +1198,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio,
struct vm_area_struct *vma, bool compound, int *nr_pmdmapped)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
- int last, count, nr = 0;
+ int last, val, count, nr = 0;
+ bool exclusive;

VM_WARN_ON_FOLIO(compound && page != &folio->page, folio);
VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
@@ -1170,8 +1209,23 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio,
if (likely(!folio_test_large(folio)))
return atomic_add_negative(-1, &page->_mapcount);

+ exclusive = __folio_write_large_rmap_begin(folio);
+
/* Is page being unmapped by PTE? Is this its last map to be removed? */
- if (!compound) {
+ if (likely(exclusive) && !compound) {
+ folio_add_large_mapcount_exclusive(folio, -nr_pages, vma);
+ count = nr_pages;
+ do {
+ val = atomic_read(&page->_mapcount) - 1;
+ atomic_set(&page->_mapcount, val);
+ if (val < 0) {
+ val = atomic_read(mapped) - 1;
+ atomic_set(mapped, val);
+ if (val < COMPOUND_MAPPED)
+ nr++;
+ }
+ } while (page++, --count > 0);
+ } else if (!compound) {
folio_add_large_mapcount(folio, -nr_pages, vma);
count = nr_pages;
do {
@@ -1182,6 +1236,26 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio,
nr++;
}
} while (page++, --count > 0);
+ } else if (likely(exclusive) && folio_test_pmd_mappable(folio)) {
+ /* That test is redundant: it's for safety or to optimize out */
+
+ folio_dec_large_mapcount_exclusive(folio, vma);
+ val = atomic_read(&folio->_entire_mapcount) - 1;
+ atomic_set(&folio->_entire_mapcount, val);
+ if (val < 0) {
+ nr = atomic_read(mapped) - COMPOUND_MAPPED;
+ atomic_set(mapped, nr);
+ if (likely(nr < COMPOUND_MAPPED)) {
+ *nr_pmdmapped = folio_nr_pages(folio);
+ nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ /* Raced ahead of another remove and an add? */
+ if (unlikely(nr < 0))
+ nr = 0;
+ } else {
+ /* An add of COMPOUND_MAPPED raced ahead */
+ nr = 0;
+ }
+ }
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */

@@ -1203,6 +1277,8 @@ static unsigned int __folio_remove_rmap_range(struct folio *folio,
} else {
VM_WARN_ON_ONCE_FOLIO(true, folio);
}
+
+ __folio_write_large_rmap_end(folio, exclusive);
return nr;
}

--
2.41.0

2023-11-24 20:56:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH WIP v1 00/20] mm: precise "mapped shared" vs. "mapped exclusively" detection for PTE-mapped THP / partially-mappable folios

On Fri, 24 Nov 2023 at 05:26, David Hildenbrand <[email protected]> wrote:
>
> Are you interested in some made-up math, new locking primitives and
> slightly unpleasant performance numbers on first sight? :)

Ugh. I'm not loving the "I have a proof, but it's too big to fit in
the margin" model of VM development.

This does seem to be very subtle.

Also, please benchmark what your rmap changes do to just plain regular
pages - it *looks* like maybe all you did was to add some
VM_WARN_ON_FOLIO() for those cases, but I have this strong memory of
that

if (likely(!compound)) {

case being very critical on all the usual cases (and the cleanups by
Hugh last year were nice).

I get the feeling that you are trying to optimize a particular case
that is special enough that some less complicated model might work.

Just by looking at your benchmarks, I *think* the case you actually
want to optimize is "THP -> fork -> child exit/execve -> parent write
COW reuse" where the THP page was really never in more than two VM's,
and the second VM was an almost accidental temporary thing that is
just about the whole "fork->exec/exit" model.

Which makes me really feel like your rmap_id is very over-engineered.
It seems to be designed to handle all the generic cases, but it seems
like the main cause for it is a very specific case that I _feel_
should be something that could be tracked with *way* less information
(eg just have a "pointer to owner vma, and a simple counter of
non-owners").

I dunno. I was cc'd, I looked at the patches, but I suspect I'm not
really the target audience. If Hugh is ok with this kind of
complexity, I bow to a higher authority. This *does* seem to add a lot
of conceptual complexity to something that is already complicated.

Linus

2023-11-25 17:02:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH WIP v1 00/20] mm: precise "mapped shared" vs. "mapped exclusively" detection for PTE-mapped THP / partially-mappable folios

On 24.11.23 21:55, Linus Torvalds wrote:
> On Fri, 24 Nov 2023 at 05:26, David Hildenbrand <[email protected]> wrote:
>>
>> Are you interested in some made-up math, new locking primitives and
>> slightly unpleasant performance numbers on first sight? :)

Hi Linus,

first of all -- wow -- thanks for that blazing fast feedback! You really
had to work through quite some text+code to understand what's happening.

Thanks for prioritizing that over Black Friday shopping ;)

>
> Ugh. I'm not loving the "I have a proof, but it's too big to fit in
> the margin" model of VM development.
>
> This does seem to be very subtle.

Yes, compared to other kernel subsystems, this level of math in the VM
is really new.

The main reason I excluded the proof from this WIP series is not its
size, though. I wanted to get the implementation out after talking about
it (and optimizing it ...) for way too long, and: (a) proofs involving
infinite sequences in pure ascii are just horrible to read; (b) I think
the proof can be cleaned up / simplified, especially after I came up
with the "intuition" in the patch some days ago and decided to use that
one instead for now.

No questions asked: if this ever gets discussed for actual merging, it
will only happen with a public, reviewed proof available.

[most of the "magic" goes away once one simply uses one rmap value for
each bit in mm->mm_rmap_id; 22 bits -> 22 rmap values per folio. Of
course, 22 values are undesirable.]
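
For illustration, a conceptual sketch of that unoptimized "one counter per
rmap-ID bit" variant (hypothetical names, not the patch's code): every
mapping adds 1 to each counter whose bit is set in the mapping MM's rmap
ID, and the folio is mapped by a single MM iff each counter is either 0 or
equal to the total mapcount, matching that MM's rmap ID bit for bit.

#define RMAP_ID_BITS    22

struct folio_rmap_vals {
        long val[RMAP_ID_BITS];         /* one counter per rmap-ID bit */
};

static void rmap_vals_add(struct folio_rmap_vals *v, unsigned long rmap_id,
                          long count)
{
        unsigned int bit;

        for (bit = 0; bit < RMAP_ID_BITS; bit++)
                if (rmap_id & (1UL << bit))
                        v->val[bit] += count;
}

static bool rmap_vals_exclusive(const struct folio_rmap_vals *v,
                                unsigned long rmap_id, long total_mapcount)
{
        unsigned int bit;

        for (bit = 0; bit < RMAP_ID_BITS; bit++) {
                long expected = (rmap_id & (1UL << bit)) ? total_mapcount : 0;

                if (v->val[bit] != expected)
                        return false;
        }
        return true;
}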

>
> Also, please benchmark what your rmap changes do to just plain regular
> pages - it *looks* like maybe all you did was to add some
> VM_WARN_ON_FOLIO() for those cases, but I have this strong memory of
> that
>
> if (likely(!compound)) {
>
> case being very critical on all the usual cases (and the cleanups by
> Hugh last year were nice).

Yes, indeed. I separated small vs. large folio handling cleanly, such
that we always have a pattern like:

+ if (likely(!folio_test_large(folio)))
+ return atomic_add_negative(-1, &page->_mapcount);

So, the fast default path is "small folio".

As stated, I want to do much more benchmarking to better understand all
performance impacts; especially on top of Ryan's work on THPs that are
always PTE-mapped, where we don't have to "artificially" force a
PTE-mapped THP.

>
> I get the feeling that you are trying to optimize a particular case
> that is special enough that some less complicated model might work.
>
> Just by looking at your benchmarks, I *think* the case you actually
> want to optimize is "THP -> fork -> child exit/execve -> parent write
> COW reuse" where the THP page was really never in more than two VM's,
> and the second VM was an almost accidental temporary thing that is
> just about the whole "fork->exec/exit" model.

That's the most obvious/important case regarding COW reuse, agreed. And
also where I originally started, because it looked like the low-hanging
fruit (below).

For the benchmarks I have so far, I focused mostly on the
performance/harm of individual operations. Conceptually, with rmap IDs,
it makes no performance difference whether you end up reusing a THP in
the parent or in the child, so I didn't add that case manually to the
micro benchmarks.

>
> Which makes me really feel like your rmap_id is very over-engineered.
> It seems to be designed to handle all the generic cases, but it seems
> like the main cause for it is a very specific case that I _feel_
> should be something that could be tracked with *way* less information
> (eg just have a "pointer to owner vma, and a simple counter of
> non-owners").

That's precisely where I originally started [1], but quickly wondered
(already in that mail):

(a) How to cleanly and safely stabilize the refcount vs. the mapcount,
without playing tricks, such that it's just "obvious" that it's correct
and race-free in the COW reuse path.
(b) How to extend it to !anon folios, where we don't have a clean entry
point like folio_add_new_anon_rmap(); primarily to get a sane
replacement for folio_estimated_sharers(), which I just dislike at this
point.
(c) Whether it's possible to easily and cleanly change owners (creators
in my mail) without involving locks.

So I started thinking about a precise and possibly more universal/cleaner
way of handling it that doesn't add too much runtime overhead: a way to
just have what we already have for small folios.

I was surprised to find an approach that gives a precise answer and
simply changes the owner implicitly, primarily just by
adding/subtracting numbers.

[I'll note that having a universal way to stabilize the mapcount vs.
refcount could be quite valuable. But achieving that also for small
folios would require e.g., having shared, hashed atomic seqcounts. And
I'm not too interested in harming small-folio performance at this point :)]

Now, whether we want all that sooner, later, or maybe never is a
different question. This WIP version primarily tries to show what's
possible, at which price, and what the limitations are.

>
> I dunno. I was cc'd, I looked at the patches, but I suspect I'm not
> really the target audience. If Hugh is ok with this kind of

Well, I really value your feedback, and you are always in my CC list
when I'm messing with COW and mapcounts.

Having that said, it's encouraging that you went over the patches
(thanks again!) and nothing immediately jumped out at you (well, besides
the proof, but that will be fixed if this ever gets merged).

> complexity, I bow to a higher authority. This *does* seem to add a lot
> of conceptual complexity to something that is already complicated.

I'll note that while it all sounds complicated, in the end it's "just"
adding/subtracting numbers, and having a clean scheme to detect
concurrent (un)mapping. Further, it's handled without any new rmap hooks.

But yes, there sure is added code and complexity, and personally I
dislike having to go from 3 to 6 rmap values to support arm64 with 512
MiB THP. If we could just squeeze it all into a single rmap value, it
would all look much nicer: one total mapcount, one rmap value.

Before this would get merged, a lot more has to happen, and most of the
rmap batching (possibly including the exclusive atomic seqcount) could be
beneficial even without the rmap ID handling, so it's natural to start
with that independently.

Thanks again!

[1]
https://lore.kernel.org/all/[email protected]/T/#u

--
Cheers,

David / dhildenb

2023-12-17 19:14:19

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH WIP v1 07/20] mm/rmap_id: track if one or multiple MMs map a partially-mappable folio


> On Nov 24, 2023, at 3:26 PM, David Hildenbrand <[email protected]> wrote:
>
> 5. sub-IDs
> ==========
>
> To achieve (2), we generate sub-IDs that have the following property,
> assuming that our folio has P=folio_nr_pages() pages.
> "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
> "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
> "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
> ...
> "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs
>
> The sub-IDs are generated in generations, whereby
> (1) Generation #0 is the number 0
> (2) Generation #N takes all numbers from generations #0..#N-1 and adds
> (P + 1)^(N - 1), effectively doubling the number of sub-IDs
>
> Consequently, the smallest number S in gen #N is:
> S[#N] = (P + 1)^(N - 1)
>
> The largest number L in gen #N is:
> L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... (P + 1)^0 + 0.
> -> [geometric sum with "P + 1 != 1"]
> = (1 - (P + 1)^N) / (1 - (P + 1))
> = (1 - (P + 1)^N) / (-P)
> = ((P + 1)^N - 1) / P

David, as you know it took me a while to understand your impressive work.

I think that part of what made it hard for me is the presentation and the
formulation of sub-IDs in arithmetic means instead of bit manipulations.

Basically, IIUC, you want for order-K pages to add K-1 “0” bits between
each rmap-id bits.

In this case, in x86-64 (with BMI2) there is the PDEP instruction that can
generate these values rather easily with little overhead.

I think that besides the easier generation of sub-ids values in this
manner, discussing the matter without the “generations" also makes it
easier to understand the correctness (at least for me).
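
As a concrete illustration of the bit-spreading described above, a small
userspace sketch using BMI2 PDEP (the helper name and the "stride"
parameter are made up for this example):

#include <immintrin.h>  /* _pdep_u64(), requires BMI2 (compile with -mbmi2) */
#include <stdint.h>

/*
 * Spread the low rmap-ID bits so that "stride - 1" zero bits sit between
 * consecutive rmap-ID bits; e.g., stride == 2 turns 0b1011 into
 * 0b01000101 (== 69, matching the P = 3 table further down the thread).
 */
static inline uint64_t spread_rmap_id_bits(uint64_t rmap_id, unsigned int stride)
{
        uint64_t mask = 0;
        unsigned int bit;

        /* Build a mask with one set bit every "stride" positions. */
        for (bit = 0; bit < 64; bit += stride)
                mask |= 1ULL << bit;

        /* PDEP deposits the low source bits into the set mask bits. */
        return _pdep_u64(rmap_id, mask);
}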


2023-12-18 14:07:46

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH WIP v1 07/20] mm/rmap_id: track if one or multiple MMs map a partially-mappable folio

On 17.12.23 20:13, Nadav Amit wrote:
>
>> On Nov 24, 2023, at 3:26 PM, David Hildenbrand <[email protected]> wrote:
>>
>> 5. sub-IDs
>> ==========
>>
>> To achieve (2), we generate sub-IDs that have the following property,
>> assuming that our folio has P=folio_nr_pages() pages.
>> "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
>> "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
>> "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
>> ...
>> "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs
>>
>> The sub-IDs are generated in generations, whereby
>> (1) Generation #0 is the number 0
>> (2) Generation #N takes all numbers from generations #0..#N-1 and adds
>> (P + 1)^(N - 1), effectively doubling the number of sub-IDs
>>
>> Consequently, the smallest number S in gen #N is:
>> S[#N] = (P + 1)^(N - 1)
>>
>> The largest number L in gen #N is:
>> L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... (P + 1)^0 + 0.
>> -> [geometric sum with "P + 1 != 1"]
>> = (1 - (P + 1)^N) / (1 - (P + 1))
>> = (1 - (P + 1)^N) / (-P)
>> = ((P + 1)^N - 1) / P
>
> David, as you know it took me a while to understand your impressive work.
>

Hi Nadav,

thanks a bunch for digging into the details of the solution and trying
to verify the correctness!

And thanks a lot for highlighting the special case below that I
intuitively used for the "intuition" :)


> I think that part of what made it hard for me is the presentation and the
> formulation of sub-IDs in arithmetic means instead of bit manipulations.

Yes, that needs more work to make it all easier to understand.

>
> Basically, IIUC, you want for order-K pages to add K-1 “0” bits between
> each rmap-id bits.
>
> In this case, in x86-64 (with BMI2) there is the PDEP instruction that can
> generate these values rather easily with little overhead.

Partially, yes.

What you describe here is a special case that is easier to "grasp",
likely a bit faster to calculate, but ends up being less optimal in
space consumption for some cases (especially order-2 and order-9 folios).

You are describing the special case where "P+1" is a power of two.

Let me explain using the example (P = 3) from the "intuition" further
down in this patch description, highlighting the mapping:

RMAP-ID | (binary) | Sub-ID | (binary)
---------------------------------------
0 | 0000 | 0 | 0000 0000
1 | 0001 | 1 | 0000 0001
2 | 0010 | 4 | 0000 0100
3 | 0011 | 5 | 0000 0101
4 | 0100 | 16 | 0001 0000
5 | 0101 | 17 | 0001 0001
6 | 0110 | 20 | 0001 0100
7 | 0111 | 21 | 0001 0101
8 | 1000 | 64 | 0100 0000
9 | 1001 | 65 | 0100 0001
10 | 1010 | 68 | 0100 0100
11 | 1011 | 69 | 0100 0101
12 | 1100 | 80 | 0101 0000
13 | 1101 | 81 | 0101 0001
14 | 1110 | 84 | 0101 0100
15 | 1111 | 85 | 0101 0101

So going from e.g., 11 -> 69 means adding one zero for each bit, just
like you said.

But adding 1 "0" bit is not sufficient for handling order-2 folios (P =
4), only for handling order-1 folios. So what the current approach does
is the following (P = 4):

RMAP-ID | (binary) | Sub-ID | (binary)
---------------------------------------
0 | 0000 | 0 | 0000 0000
1 | 0001 | 1 | 0000 0001
2 | 0010 | 5 | 0000 0101
3 | 0011 | 6 | 0000 0110
4 | 0100 | 25 | 0001 1001
5 | 0101 | 26 | 0001 1010
6 | 0110 | 30 | 0001 1110
7 | 0111 | 31 | 0001 1111
8 | 1000 | 125 | 0111 1101
9 | 1001 | 126 | 0111 1110
10 | 1010 | 130 | 1000 0010
11 | 1011 | 131 | 1000 0011
12 | 1100 | 150 | 1001 0110
13 | 1101 | 151 | 1001 0111
14 | 1110 | 155 | 1001 1011
15 | 1111 | 156 | 1001 1100

Which is not just adding "0"s.
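
For reference, the values in both tables follow directly from the
generation scheme: the sub-ID is the sum of (P + 1)^i over all set bits i
of the RMAP-ID. A small sketch computing it that way (illustration only,
not the kernel implementation; overflow beyond 64 bit ignored):

#include <stdint.h>

static uint64_t rmap_sub_id(uint64_t rmap_id, uint64_t p)
{
        uint64_t sub_id = 0, weight = 1;        /* weight == (P + 1)^i */

        for (; rmap_id; rmap_id >>= 1, weight *= p + 1)
                if (rmap_id & 1)
                        sub_id += weight;
        return sub_id;
}

E.g., rmap_sub_id(11, 3) == 1 + 4 + 64 == 69 and rmap_sub_id(12, 4) ==
25 + 125 == 150, matching the tables above.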

To handle it using your simplification, we'd have to add one more "0" bit
to be able to use it for order-2 folios. So I'll refine your statement to:
for order-K pages, add K "0" bits between consecutive rmap-ID bits.

Then, it's easy to see that, with that simplification, the 24 RMAP-ID
bits for an order-2 page would require 3 * 24 = 72 bits and would no
longer fit into a single 64bit value.

To summarize, with the optimized (currently implemented) scheme, we achieve:
* == order-2: 1x 64bit rmap values
* <= order-5: 2x 64bit rmap values
* <= order-9: 3x 64bit rmap values
* <= order-10: 4x 64bit rmap values
* <= order-12: 5x 64bit rmap values
* <= order-15: 6x 64bit rmap values
* ...

I think, going with the simplification above (to be double-checked),
we'd achieve [essentially, subtracting 1 from all orders]:
* == order-1: 1x 64bit rmap values
* <= order-4: 2x 64bit rmap values
* <= order-8: 3x 64bit rmap values
* <= order-9: 4x 64bit rmap values
* <= order-11: 5x 64bit rmap values
* <= order-14: 6x 64bit rmap values
* ...

So especially for order-9 folios we would require 4 instead of 3 64bit
values.


Now, going with this simplification as a first step makes absolute sense,
because it's much more intuitive and eventually a bit easier to
implement (although really not significantly IMHO).

One can then easily add the optimization on top later.

>
> I think that besides the easier generation of sub-ids values in this
> manner, discussing the matter without the “generations" also makes it
> easier to understand the correctness (at least for me).

Yes, that presentation certainly needs improvement. Generating the magic
numbers recursively makes it easier to prove the correctness using
induction; that's how I started to use that whole "generation" terminology.

--
Cheers,

David / dhildenb


2023-12-18 14:35:15

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH WIP v1 07/20] mm/rmap_id: track if one or multiple MMs map a partially-mappable folio



> On Dec 18, 2023, at 4:04 PM, David Hildenbrand <[email protected]> wrote:
>
> But adding 1 "0" bit is not sufficient for handling order-2 folios (P = 4), only for handling order-1 folios. So what the current approach does is the following (P = 4):
>
> RMAP-ID | | Subid |
> -----------------------------------
> 0 | 0000 | 0 | 0000 0000
> 1 | 0001 | 1 | 0000 0001
> 2 | 0010 | 5 | 0000 0101
> 3 | 0011 | 6 | 0000 0110
> 4 | 0100 | 25 | 0001 1001
> 5 | 0101 | 26 | 0001 1010
> 6 | 0110 | 30 | 0001 1110
> 7 | 0111 | 31 | 0001 1111
> 8 | 1000 | 125 | 0111 1101
> 9 | 1001 | 126 | 0111 1110
> 10 | 1010 | 130 | 1000 0010
> 11 | 1011 | 131 | 1000 0011
> 12 | 1100 | 150 | 1001 0110
> 13 | 1101 | 151 | 1001 0111
> 14 | 1110 | 155 | 1001 1011
> 15 | 1111 | 156 | 1001 1100

Yes, of course. Silly me. You want to take advantage of the counter not
saturating for orders K where K-1 is not a power of 2.

I get your point. Not sure whether it's worth the complexity, though…