This patch follows up on a previous RFC:
[email protected]
When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
entry is cleared. In many cases, there is no difference between swapped-out
shared pages and newly allocated, non-dirty pages in the pagemap interface.
Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
    #define NPAGES (256)
    /* map 1MiB shared memory */
    size_t i, pagesize = getpagesize();
    char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    /* Dirty the new pages. */
    for (i = 0; i < NPAGES; i++)
        p[i * pagesize] = i;
Run the above program in a small cgroup, which causes swapping:
/* Initialise cgroup & run a program */
$ echo 512K > foo/memory.limit_in_bytes
$ echo 60 > foo/memory.swappiness
$ cgexec -g memory:foo ./pagemap-test
Check the pagemap report. Example of the current expected output:
$ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000710 e1 6b 06 00 00 00 00 a1 9e eb 06 00 00 00 00 a1 |.k..............|
00000720 6b ee 06 00 00 00 00 a1 a5 a4 05 00 00 00 00 a1 |k...............|
00000730 5c bf 06 00 00 00 00 a1 90 b6 06 00 00 00 00 a1 |\...............|
The first pagemap entries are reported as zeroes, suggesting the pages have
never been allocated, when in fact they have been swapped out.
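For reference, the non-zero entries above decode as follows with the pagemap
bit layout documented in Documentation/admin-guide/mm/pagemap.rst (a minimal
sketch; the literal value is simply the first non-zero entry of the hexdump):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* bytes "e1 6b 06 00 00 00 00 a1" at offset 0x710, little-endian */
        uint64_t pme = 0xa100000000066be1ULL;

        printf("present:          %d\n", (int)(pme >> 63 & 1));   /* 1 */
        printf("swapped:          %d\n", (int)(pme >> 62 & 1));   /* 0 */
        printf("file/shared-anon: %d\n", (int)(pme >> 61 & 1));   /* 1 */
        printf("pfn:              0x%llx\n",
               (unsigned long long)(pme & ((1ULL << 55) - 1)));   /* 0x66be1 */
        return 0;
    }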
This patch addresses the behaviour by modifying pte_to_pagemap_entry() to
make use of the XArray associated with the virtual memory area struct
passed as an argument. The XArray holds the location of each virtual page:
in the page cache, in the swap cache, or on disk. If the page is in either
of the caches, the original implementation still works; if not, the missing
information is retrieved from the XArray.
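For illustration only, a rough sketch of the idea (not the actual diff; the
helper name and the glue code here are made up for this cover letter, while
xa_load(), xa_is_value(), radix_to_swp_entry() and the PM_SWAP frame encoding
are existing kernel interfaces):

    /* Hedged sketch only: how a none PTE in a shmem-backed VMA could be
     * resolved via the mapping's XArray. Locking and corner cases omitted. */
    static pagemap_entry_t pte_none_to_pagemap_entry(struct vm_area_struct *vma,
                                                     unsigned long addr)
    {
        u64 frame = 0, flags = 0;
        struct address_space *mapping = vma->vm_file ?
                                        vma->vm_file->f_mapping : NULL;

        if (mapping) {
            void *entry = xa_load(&mapping->i_pages,
                                  linear_page_index(vma, addr));

            if (xa_is_value(entry)) {   /* swap entry stored as an XArray value */
                swp_entry_t swp = radix_to_swp_entry(entry);

                frame = swp_type(swp) |
                        (swp_offset(swp) << MAX_SWAPFILES_SHIFT);
                flags |= PM_SWAP;
            }
        }
        return make_pme(frame, flags);
    }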
Performance
============
I measured the performance of the patch on a single socket Xeon E5-2620
machine, with 128GiB of RAM and 128GiB of swap storage. These were the
steps taken:
1. Run example pagemap-test code on a cgroup
a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
b. allocate 16GiB (about 4 million pages);
c. dirty 0%, 50% or 100% of the pages;
d. do this for both private and shared memory.
2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
for each possible configuration above
a. 3 times for warm up;
b. 10 times to measure performance.
Use `time` or another performance measuring tool.
Results (averaged over 10 iterations):
+--------------+--------+------------+------------+
|              | dirty% | pre patch  | post patch |
+--------------+--------+------------+------------+
| private|anon |     0% |      8.15s |      8.40s |
|              |    50% |     11.83s |     12.19s |
|              |   100% |     12.37s |     12.20s |
+--------------+--------+------------+------------+
| shared|anon  |     0% |      8.17s |      8.18s |
|              |    50% | (*) 10.43s |     37.43s |
|              |   100% | (*) 10.20s |     38.59s |
+--------------+--------+------------+------------+
(*): reminder that pre-patch produces incorrect pagemap entries for swapped
out pages.
From run to run the above results are stable (mostly <1% standard error).
The amount of time it takes for a full read of the pagemap depends on the
granularity used by dd to read the pagemap file. Even though the access is
sequential, the script only reads 8 bytes at a time, invoking pagemap_read()
COUNT times (once for each page in the 16GiB area).
To reduce overhead, we can use batching for large amounts of sequential
access. We can make dd read multiple page entries at a time,
allowing the kernel to make optimisations and yield more throughput.
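For anyone who would rather do the batching from their own tool instead of dd,
a minimal sketch of the same idea in C (the helper name, the 512-entry batch
and the present/swapped counters are just for illustration; the bit positions
are the ones documented for pagemap):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BATCH 512   /* pagemap entries per pread() */

    /* Count present and swapped pages in [vaddr, vaddr + npages * pagesize). */
    static int scan_pagemap(pid_t pid, uint64_t vaddr, size_t npages)
    {
        char path[64];
        uint64_t buf[BATCH];
        size_t present = 0, swapped = 0;
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        for (size_t i = 0; i < npages; i += BATCH) {
            size_t n = npages - i < BATCH ? npages - i : BATCH;
            off_t off = (off_t)(vaddr / pagesize + i) * sizeof(uint64_t);

            if (pread(fd, buf, n * sizeof(uint64_t), off) !=
                (ssize_t)(n * sizeof(uint64_t)))
                break;
            for (size_t j = 0; j < n; j++) {
                present += buf[j] >> 63 & 1;   /* bit 63: page present */
                swapped += buf[j] >> 62 & 1;   /* bit 62: page swapped */
            }
        }
        close(fd);
        printf("present: %zu, swapped: %zu\n", present, swapped);
        return 0;
    }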
Performance in real time (seconds) of
`dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
count=$((4194304 / $BATCH))`:
+---------------------------------+ +---------------------------------+
| Shared, Anon, 50% dirty | | Shared, Anon, 100% dirty |
+-------+------------+------------+ +-------+------------+------------+
| Batch | Pre-patch | Post-patch | | Batch | Pre-patch | Post-patch |
+-------+------------+------------+ +-------+------------+------------+
| 1 | (*) 10.43s | 37.43s | | 1 | (*) 10.20s | 38.59s |
| 2 | (*) 5.25s | 18.77s | | 2 | (*) 5.15s | 19.37s |
| 4 | (*) 2.63s | 9.42s | | 4 | (*) 2.63s | 9.74s |
| 8 | (*) 1.38s | 4.80s | | 8 | (*) 1.35s | 4.94s |
| 16 | (*) 0.73s | 2.46s | | 16 | (*) 0.72s | 2.54s |
| 32 | (*) 0.40s | 1.31s | | 32 | (*) 0.41s | 1.34s |
| 64 | (*) 0.25s | 0.72s | | 64 | (*) 0.24s | 0.74s |
| 128 | (*) 0.16s | 0.43s | | 128 | (*) 0.16s | 0.44s |
| 256 | (*) 0.12s | 0.28s | | 256 | (*) 0.12s | 0.29s |
| 512 | (*) 0.10s | 0.21s | | 512 | (*) 0.10s | 0.22s |
| 1024 | (*) 0.10s | 0.20s | | 1024 | (*) 0.10s | 0.21s |
+-------+------------+------------+ +-------+------------+------------+
To conclude, in order to make the most of the underlying mechanisms of
pagemap and the XArray, one should use batching to achieve better
performance.
Future Work
============
Note: there are PTE flags which currently do not survive swap-out when
the page is shmem: SOFT_DIRTY and UFFD_WP.
A solution for saving the state of the UFFD_WP flag has been proposed by
Peter Xu in the patch linked below. The concept and mechanism proposed
could be extended to include the SOFT_DIRTY bit as well:
[email protected]
Our patches are mostly orthogonal.
Kind regards,
Tibi
Tiberiu A Georgescu (1):
pagemap: report swap location for shared pages
fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
1 file changed, 30 insertions(+), 8 deletions(-)
--
2.32.0.380.geb27b338a3
Tiberiu A Georgescu <[email protected]> writes:
> This patch follows up on a previous RFC:
> [email protected]
>
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap
> interface.
What is the point?
You say a shared swapped out page is the same as a clean shared page
and you are exactly correct. What is the point in knowing a shared
page was swapped out? What is the gain?
I tried to understand the point by looking at your numbers below
and everything I could see looked worse post patch.
Eric
> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
> #define NPAGES (256)
> /* map 1MiB shared memory */
> size_t pagesize = getpagesize();
> char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> /* Dirty new pages. */
> for (i = 0; i < PAGES; i++)
> p[i * pagesize] = i;
>
> Run the above program in a small cgroup, which causes swapping:
> /* Initialise cgroup & run a program */
> $ echo 512K > foo/memory.limit_in_bytes
> $ echo 60 > foo/memory.swappiness
> $ cgexec -g memory:foo ./pagemap-test
>
> Check the pagemap report. Example of the current expected output:
> $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000710 e1 6b 06 00 00 00 00 a1 9e eb 06 00 00 00 00 a1 |.k..............|
> 00000720 6b ee 06 00 00 00 00 a1 a5 a4 05 00 00 00 00 a1 |k...............|
> 00000730 5c bf 06 00 00 00 00 a1 90 b6 06 00 00 00 00 a1 |\...............|
>
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out.
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
>
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
>
> 1. Run example pagemap-test code on a cgroup
> a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
> b. allocate 16GiB (about 4 million pages);
> c. dirty 0,50 or 100% of pages;
> d. do this for both private and shared memory.
> 2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
> for each possible configuration above
> a. 3 times for warm up;
> b. 10 times to measure performance.
> Use `time` or another performance measuring tool.
>
> Results (averaged over 10 iterations):
> +--------+------------+------------+
> | dirty% | pre patch | post patch |
> +--------+------------+------------+
> private|anon | 0% | 8.15s | 8.40s |
> | 50% | 11.83s | 12.19s |
> | 100% | 12.37s | 12.20s |
> +--------+------------+------------+
> shared|anon | 0% | 8.17s | 8.18s |
> | 50% | (*) 10.43s | 37.43s |
> | 100% | (*) 10.20s | 38.59s |
> +--------+------------+------------+
>
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
> out pages.
>
> From run to run the above results are stable (mostly <1% stderr).
>
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
>
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
>
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> | Shared, Anon, 50% dirty | | Shared, Anon, 100% dirty |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch | Pre-patch | Post-patch | | Batch | Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> | 1 | (*) 10.43s | 37.43s | | 1 | (*) 10.20s | 38.59s |
> | 2 | (*) 5.25s | 18.77s | | 2 | (*) 5.15s | 19.37s |
> | 4 | (*) 2.63s | 9.42s | | 4 | (*) 2.63s | 9.74s |
> | 8 | (*) 1.38s | 4.80s | | 8 | (*) 1.35s | 4.94s |
> | 16 | (*) 0.73s | 2.46s | | 16 | (*) 0.72s | 2.54s |
> | 32 | (*) 0.40s | 1.31s | | 32 | (*) 0.41s | 1.34s |
> | 64 | (*) 0.25s | 0.72s | | 64 | (*) 0.24s | 0.74s |
> | 128 | (*) 0.16s | 0.43s | | 128 | (*) 0.16s | 0.44s |
> | 256 | (*) 0.12s | 0.28s | | 256 | (*) 0.12s | 0.29s |
> | 512 | (*) 0.10s | 0.21s | | 512 | (*) 0.10s | 0.22s |
> | 1024 | (*) 0.10s | 0.20s | | 1024 | (*) 0.10s | 0.21s |
> +-------+------------+------------+ +-------+------------+------------+
>
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and xarray, one should be using batching to achieve better
> performance.
>
> Future Work
> ============
>
> Note: there are PTE flags which currently do not survive the swap out when
> the page is shmem: SOFT_DIRTY and UFFD_WP.
>
> A solution for saving the state of the UFFD_WP flag has been proposed by
> Peter Xu in the patch linked below. The concept and mechanism proposed
> could be extended to include the SOFT_DIRTY bit as well:
> [email protected]
> Our patches are mostly orthogonal.
>
> Kind regards,
> Tibi
>
> Tiberiu A Georgescu (1):
> pagemap: report swap location for shared pages
>
> fs/proc/task_mmu.c | 38 ++++++++++++++++++++++++++++++--------
> 1 file changed, 30 insertions(+), 8 deletions(-)
> On 30 Jul 2021, at 18:28, Eric W. Biederman <[email protected]> wrote:
>
> Tiberiu A Georgescu <[email protected]> writes:
>
>> This patch follows up on a previous RFC:
>> [email protected]
>>
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap
>> interface.
>
> What is the point?
The reason why this patch is important has been discussed in my RFC
patch and on this thread:
https://lore.kernel.org/lkml/[email protected]/.
The most relevant reply should be Ivan's:
https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/
In short, this swap information helps us enhance live migration in some cases.
>
> You say a shared swapped out page is the same as a clean shared page
> and you are exactly correct. What is the point in knowing a shared
> page was swapped out? What does is the gain?
>
What I meant was that shared swapped-out pages and clean shared pages
have identical PTEs pre-patch. I understand they are somewhat similar
concepts when it comes to file-backed shared pages, where swapping is done
directly on the disk.
Our case focuses on anonymous pages and shared pages with identical
underlying behaviour (like pages allocated using memfd). These pages get
cleared once the process is gone, and the difference between allocated
but uninitialised pages and dirty pages that have been swapped out is
significant, as the latter could still contain usable data.
The post-patch pagemap entry now contains the swap type and offset for
swapped out pages, regardless of whether the page is private or shared.
This, by definition of the pagemap, should be the correct behaviour.
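For completeness, a minimal sketch of how userspace could decode such an
entry, following the documented pagemap layout for swapped pages (bit 62 set,
bits 0-4 swap type, bits 5-54 swap offset); the helper name is just for
illustration:

    #include <stdint.h>
    #include <stdio.h>

    static void print_swap_location(uint64_t pme)
    {
        if (pme >> 62 & 1)    /* PM_SWAP */
            printf("swapped: type=%u offset=%llu\n",
                   (unsigned)(pme & 0x1f),
                   (unsigned long long)((pme & ((1ULL << 55) - 1)) >> 5));
        else
            printf("not swapped\n");
    }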
> I tried to understand the point by looking at your numbers below
> and everything I could see looked worse post patch.
Indeed, the numbers are mostly higher post-patch. It is a tradeoff between
correctness and performance. However, the overhead is acceptable for sparse
single accesses, and it can be reduced significantly by batching.
In future work, the performance can be improved by leveraging a mechanism
proposed by Peter Xu: Special PTEs:
https://lore.kernel.org/lkml/[email protected]/
The main concern with the RFC was that the XArray check would slow down
checking empty pages significantly. Thankfully, we only see a small
overhead when no allocated shared page is dirtied.
>
> Eric
>
Hope I was able to clarify a few things. Now, having laid out the context,
please have another look at my proposed patch.
Thank you,
Tibi
Hi, Tiberiu,
On Fri, Jul 30, 2021 at 04:08:25PM +0000, Tiberiu A Georgescu wrote:
> This patch follows up on a previous RFC:
> [email protected]
>
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>
> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
> #define NPAGES (256)
> /* map 1MiB shared memory */
> size_t pagesize = getpagesize();
> char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
> MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> /* Dirty new pages. */
> for (i = 0; i < PAGES; i++)
> p[i * pagesize] = i;
>
> Run the above program in a small cgroup, which causes swapping:
> /* Initialise cgroup & run a program */
> $ echo 512K > foo/memory.limit_in_bytes
> $ echo 60 > foo/memory.swappiness
> $ cgexec -g memory:foo ./pagemap-test
>
> Check the pagemap report. Example of the current expected output:
> $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000710 e1 6b 06 00 00 00 00 a1 9e eb 06 00 00 00 00 a1 |.k..............|
> 00000720 6b ee 06 00 00 00 00 a1 a5 a4 05 00 00 00 00 a1 |k...............|
> 00000730 5c bf 06 00 00 00 00 a1 90 b6 06 00 00 00 00 a1 |\...............|
>
> The first pagemap entries are reported as zeroes, indicating the pages have
> never been allocated while they have actually been swapped out.
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages in
> the page cache, swap cache or on disk. If they are in either of the caches,
> then the original implementation still works. If not, then the missing
> information will be retrieved from the XArray.
>
> Performance
> ============
> I measured the performance of the patch on a single socket Xeon E5-2620
> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
> steps taken:
>
> 1. Run example pagemap-test code on a cgroup
> a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
> b. allocate 16GiB (about 4 million pages);
> c. dirty 0,50 or 100% of pages;
> d. do this for both private and shared memory.
> 2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
> for each possible configuration above
> a. 3 times for warm up;
> b. 10 times to measure performance.
> Use `time` or another performance measuring tool.
>
> Results (averaged over 10 iterations):
> +--------+------------+------------+
> | dirty% | pre patch | post patch |
> +--------+------------+------------+
> private|anon | 0% | 8.15s | 8.40s |
> | 50% | 11.83s | 12.19s |
> | 100% | 12.37s | 12.20s |
> +--------+------------+------------+
> shared|anon | 0% | 8.17s | 8.18s |
> | 50% | (*) 10.43s | 37.43s |
> | 100% | (*) 10.20s | 38.59s |
> +--------+------------+------------+
>
> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
> out pages.
>
> From run to run the above results are stable (mostly <1% stderr).
>
> The amount of time it takes for a full read of the pagemap depends on the
> granularity used by dd to read the pagemap file. Even though the access is
> sequential, the script only reads 8 bytes at a time, running pagemap_read()
> COUNT times (one time for each page in a 16GiB area).
>
> To reduce overhead, we can use batching for large amounts of sequential
> access. We can make dd read multiple page entries at a time,
> allowing the kernel to make optimisations and yield more throughput.
>
> Performance in real time (seconds) of
> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
> count=$((4194304 / $BATCH))`:
> +---------------------------------+ +---------------------------------+
> | Shared, Anon, 50% dirty | | Shared, Anon, 100% dirty |
> +-------+------------+------------+ +-------+------------+------------+
> | Batch | Pre-patch | Post-patch | | Batch | Pre-patch | Post-patch |
> +-------+------------+------------+ +-------+------------+------------+
> | 1 | (*) 10.43s | 37.43s | | 1 | (*) 10.20s | 38.59s |
> | 2 | (*) 5.25s | 18.77s | | 2 | (*) 5.15s | 19.37s |
> | 4 | (*) 2.63s | 9.42s | | 4 | (*) 2.63s | 9.74s |
> | 8 | (*) 1.38s | 4.80s | | 8 | (*) 1.35s | 4.94s |
> | 16 | (*) 0.73s | 2.46s | | 16 | (*) 0.72s | 2.54s |
> | 32 | (*) 0.40s | 1.31s | | 32 | (*) 0.41s | 1.34s |
> | 64 | (*) 0.25s | 0.72s | | 64 | (*) 0.24s | 0.74s |
> | 128 | (*) 0.16s | 0.43s | | 128 | (*) 0.16s | 0.44s |
> | 256 | (*) 0.12s | 0.28s | | 256 | (*) 0.12s | 0.29s |
> | 512 | (*) 0.10s | 0.21s | | 512 | (*) 0.10s | 0.22s |
> | 1024 | (*) 0.10s | 0.20s | | 1024 | (*) 0.10s | 0.21s |
> +-------+------------+------------+ +-------+------------+------------+
>
> To conclude, in order to make the most of the underlying mechanisms of
> pagemap and xarray, one should be using batching to achieve better
> performance.
So what I'm still a bit worried about is whether it will regress some existing
users. Note that existing users can read the pagemap in their own way; we can't
expect all userspace to change its behaviour due to a kernel change.
Meanwhile, the numbers seem to show a roughly 4x slowdown due to looking up
the page cache, no matter the size of ibs=. IOW, I don't see a good way to
avoid that overhead, so there is no way for userspace to run as fast as before.
Also note that it does not only affect PM_SWAP users; it potentially affects
all /proc/<pid>/pagemap users as long as there is file-backed memory in the
region being read, which is very likely to happen.
That's why I think if we want to persist it, we should still consider starting
from the pte marker idea.
I do plan to move the pte marker idea forward unless it gets NACKed upstream
for some other reason, because that seems to be the only way for uffd-wp to
support file-based memory, whether with a new swap type or with a special swap
pte. I am even thinking about whether I should propose it with PM_SWAP first,
because that seems to be a simpler scenario than uffd-wp (which would pull in
the rest of the uffd-wp patches), so that we can have a shared infrastructure.
But I haven't thought deeper than that.
Thanks,
--
Peter Xu
On 04.08.21 20:33, Peter Xu wrote:
> Hi, Tiberiu,
>
> On Fri, Jul 30, 2021 at 04:08:25PM +0000, Tiberiu A Georgescu wrote:
>> This patch follows up on a previous RFC:
>> [email protected]
>>
>> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
>> entry is cleared. In many cases, there is no difference between swapped-out
>> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>>
>> Example pagemap-test code (Tested on Kernel Version 5.14-rc3):
>> #define NPAGES (256)
>> /* map 1MiB shared memory */
>> size_t pagesize = getpagesize();
>> char *p = mmap(NULL, pagesize * NPAGES, PROT_READ | PROT_WRITE,
>> MAP_ANONYMOUS | MAP_SHARED, -1, 0);
>> /* Dirty new pages. */
>> for (i = 0; i < PAGES; i++)
>> p[i * pagesize] = i;
>>
>> Run the above program in a small cgroup, which causes swapping:
>> /* Initialise cgroup & run a program */
>> $ echo 512K > foo/memory.limit_in_bytes
>> $ echo 60 > foo/memory.swappiness
>> $ cgexec -g memory:foo ./pagemap-test
>>
>> Check the pagemap report. Example of the current expected output:
>> $ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=$COUNT | hexdump -C
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00000710 e1 6b 06 00 00 00 00 a1 9e eb 06 00 00 00 00 a1 |.k..............|
>> 00000720 6b ee 06 00 00 00 00 a1 a5 a4 05 00 00 00 00 a1 |k...............|
>> 00000730 5c bf 06 00 00 00 00 a1 90 b6 06 00 00 00 00 a1 |\...............|
>>
>> The first pagemap entries are reported as zeroes, indicating the pages have
>> never been allocated while they have actually been swapped out.
>>
>> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
>> make use of the XArray associated with the virtual memory area struct
>> passed as an argument. The XArray contains the location of virtual pages in
>> the page cache, swap cache or on disk. If they are in either of the caches,
>> then the original implementation still works. If not, then the missing
>> information will be retrieved from the XArray.
>>
>> Performance
>> ============
>> I measured the performance of the patch on a single socket Xeon E5-2620
>> machine, with 128GiB of RAM and 128GiB of swap storage. These were the
>> steps taken:
>>
>> 1. Run example pagemap-test code on a cgroup
>> a. Set up cgroup with limit_in_bytes=4GiB and swappiness=60;
>> b. allocate 16GiB (about 4 million pages);
>> c. dirty 0,50 or 100% of pages;
>> d. do this for both private and shared memory.
>> 2. Run `dd if=<PAGEMAP> ibs=8 skip=$(($VADDR / $PAGESIZE)) count=4194304`
>> for each possible configuration above
>> a. 3 times for warm up;
>> b. 10 times to measure performance.
>> Use `time` or another performance measuring tool.
>>
>> Results (averaged over 10 iterations):
>> +--------+------------+------------+
>> | dirty% | pre patch | post patch |
>> +--------+------------+------------+
>> private|anon | 0% | 8.15s | 8.40s |
>> | 50% | 11.83s | 12.19s |
>> | 100% | 12.37s | 12.20s |
>> +--------+------------+------------+
>> shared|anon | 0% | 8.17s | 8.18s |
>> | 50% | (*) 10.43s | 37.43s |
>> | 100% | (*) 10.20s | 38.59s |
>> +--------+------------+------------+
>>
>> (*): reminder that pre-patch produces incorrect pagemap entries for swapped
>> out pages.
>>
>> From run to run the above results are stable (mostly <1% stderr).
>>
>> The amount of time it takes for a full read of the pagemap depends on the
>> granularity used by dd to read the pagemap file. Even though the access is
>> sequential, the script only reads 8 bytes at a time, running pagemap_read()
>> COUNT times (one time for each page in a 16GiB area).
>>
>> To reduce overhead, we can use batching for large amounts of sequential
>> access. We can make dd read multiple page entries at a time,
>> allowing the kernel to make optimisations and yield more throughput.
>>
>> Performance in real time (seconds) of
>> `dd if=<PAGEMAP> ibs=8*$BATCH skip=$(($VADDR / $PAGESIZE / $BATCH))
>> count=$((4194304 / $BATCH))`:
>> +---------------------------------+ +---------------------------------+
>> | Shared, Anon, 50% dirty | | Shared, Anon, 100% dirty |
>> +-------+------------+------------+ +-------+------------+------------+
>> | Batch | Pre-patch | Post-patch | | Batch | Pre-patch | Post-patch |
>> +-------+------------+------------+ +-------+------------+------------+
>> | 1 | (*) 10.43s | 37.43s | | 1 | (*) 10.20s | 38.59s |
>> | 2 | (*) 5.25s | 18.77s | | 2 | (*) 5.15s | 19.37s |
>> | 4 | (*) 2.63s | 9.42s | | 4 | (*) 2.63s | 9.74s |
>> | 8 | (*) 1.38s | 4.80s | | 8 | (*) 1.35s | 4.94s |
>> | 16 | (*) 0.73s | 2.46s | | 16 | (*) 0.72s | 2.54s |
>> | 32 | (*) 0.40s | 1.31s | | 32 | (*) 0.41s | 1.34s |
>> | 64 | (*) 0.25s | 0.72s | | 64 | (*) 0.24s | 0.74s |
>> | 128 | (*) 0.16s | 0.43s | | 128 | (*) 0.16s | 0.44s |
>> | 256 | (*) 0.12s | 0.28s | | 256 | (*) 0.12s | 0.29s |
>> | 512 | (*) 0.10s | 0.21s | | 512 | (*) 0.10s | 0.22s |
>> | 1024 | (*) 0.10s | 0.20s | | 1024 | (*) 0.10s | 0.21s |
>> +-------+------------+------------+ +-------+------------+------------+
>>
>> To conclude, in order to make the most of the underlying mechanisms of
>> pagemap and xarray, one should be using batching to achieve better
>> performance.
>
> So what I'm still a bit worried is whether it will regress some existing users.
> Note that existing users can try to read pagemap in their own way; we can't
> expect all the userspaces to change their behavior due to a kernel change.
Then let's provide a way to enable the new behavior for a process if we
don't find another way to extract that information. I would actually
prefer finding a different interface for that, because with such things
the "pagemap" no longer expresses which pages are currently mapped.
Shared memory is weird.
>
> Meanwhile, from the numbers, it seems to show a 4x speed down due to looking up
> the page cache no matter the size of ibs=. IOW I don't see a good way to avoid
> that overhead, so no way to have the userspace run as fast as before.
>
> Also note that it's not only affecting the PM_SWAP users; it potentially
> affects all the /proc/pagemap users as long as there're file-backed memory on
> the read region of pagemap, which is very sane to happen.
>
> That's why I think if we want to persist it, we should still consider starting
> from the pte marker idea.
TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't
store any state information regarding shared memory in per-process page
tables: it just doesn't make too much sense.
And this is similar to the SOFTDIRTY or UFFD_WP bits: this information
actually belongs to the shared file ("did *someone* write to this page",
"is *someone* interested in changes to that page", "is there
something"). I know, that screams for a completely different design with
respect to these features.
I guess we start learning the hard way that shared memory is just
different and requires different interfaces than per-process page table
interfaces we have (pagemap, userfaultfd).
I didn't have time to explore any alternatives yet, but I wonder if
tracking such stuff per an actual fd/memfd and not via process page
tables is actually the right and clean approach. There are certainly
many issues to solve, but conceptually to me it feels more natural to
have these shared memory features not mangled into process page tables.
>
> I do plan to move the pte marker idea forward unless that'll be NACKed upstream
> for some other reason, because that seems to be the only way for uffd-wp to
> support file based memories; no matter with a new swp type or with special swap
> pte. I am even thinking about whether I should propose that with PM_SWAP first
> because that seems to be a simpler scenario than uffd-wp (which will get the
> rest uffd-wp patches involved then), then we can have a shared infrastructure.
> But haven't thought deeper than that.
>
> Thanks,
>
--
Thanks,
David / dhildenb
On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> any state information regarding shared memory in per-process page tables: it
> just doesn't make too much sense.
>
> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> belongs to the shared file ("did *someone* write to this page", "is
> *someone* interested into changes to that page", "is there something"). I
> know, that screams for a completely different design in respect to these
> features.
>
> I guess we start learning the hard way that shared memory is just different
> and requires different interfaces than per-process page table interfaces we
> have (pagemap, userfaultfd).
>
> I didn't have time to explore any alternatives yet, but I wonder if tracking
> such stuff per an actual fd/memfd and not via process page tables is
> actually the right and clean approach. There are certainly many issues to
> solve, but conceptually to me it feels more natural to have these shared
> memory features not mangled into process page tables.
Yes, we can explore all the possibilities, I'm totally fine with it.
I just want to say that I still don't think that, when there's a page cache,
we must put all the page-relevant things into the page cache.
They're shared by processes, but a process can still have its own way to
describe its relationship to that page in the cache. To me it's as simple as
"we allow process A to write to page cache P" while "we don't allow process B
to write to the same page", like the write bit.
Swap information is indeed tricky because it's shared by all the processes, but
so far I still see it as a doable approach as long as it can solve the problem.
I am not against the approach mentioned in this patch either - but I still
share my concerns, as that's not normally what we do with existing interfaces.
Thanks,
--
Peter Xu
On 04.08.21 21:17, Peter Xu wrote:
> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>> any state information regarding shared memory in per-process page tables: it
>> just doesn't make too much sense.
>>
>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>> belongs to the shared file ("did *someone* write to this page", "is
>> *someone* interested into changes to that page", "is there something"). I
>> know, that screams for a completely different design in respect to these
>> features.
>>
>> I guess we start learning the hard way that shared memory is just different
>> and requires different interfaces than per-process page table interfaces we
>> have (pagemap, userfaultfd).
>>
>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>> such stuff per an actual fd/memfd and not via process page tables is
>> actually the right and clean approach. There are certainly many issues to
>> solve, but conceptually to me it feels more natural to have these shared
>> memory features not mangled into process page tables.
>
> Yes, we can explore all the possibilities, I'm totally fine with it.
>
> I just want to say I still don't think when there's page cache then we must put
> all the page-relevant things into the page cache.
[sorry for the late reply]
Right, but for the case of shared, swapped out pages, the information is
already there, in the page cache :)
>
> They're shared by processes, but process can still have its own way to describe
> the relationship to that page in the cache, to me it's as simple as "we allow
> process A to write to page cache P", while "we don't allow process B to write
> to the same page" like the write bit.
The issue I'm having with uffd-wp as it was proposed for shared memory is
that there is hardly a sane use case where we would *want* it to work
that way.
A UFFD-WP flag in a page table for shared memory means "please notify
once this process modifies the shared memory (via page tables, not via
any other fd modification)". Do we have an example application where
these semantics make sense and don't over-complicate the whole
approach? I don't know any, thus I'm asking dumb questions :)
For background snapshots in QEMU the flow would currently be like this,
assuming all processes have the shared guest memory mapped.
1. Background snapshot preparation: QEMU requests all processes
to uffd-wp the range
a) All processes register a uffd handler on guest RAM
b) All processes fault in all guest memory (essentially populating all
memory): with a uffd-WP extension we might be able to get rid of
that; I remember you were working on that.
c) All processes uffd-WP the range to set the bit in their page tables
(a rough sketch of these calls follows after this list)
2. Background snapshot runs:
a) A process either receives a UFFD-WP event and forwards it to QEMU or
QEMU polls all other processes for UFFD events.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU triggers all processes to un-protect the page and wake up any
waiters. All processes clear the uffd-WP bit in their page tables.
3. Background snapshot completes:
a) All processes unregister the uffd handler
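(For reference, a rough sketch of what steps 1a/1c boil down to per process
with the existing userfaultfd API; the helper name is made up, error handling
is omitted, and at the time of this thread UFFDIO_REGISTER_MODE_WP only covers
anonymous memory, which is exactly the limitation being discussed.)

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Register [addr, addr + len) for write-protect tracking and arm it. */
    static int uffd_wp_range(void *addr, size_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)addr, .len = len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        return uffd;   /* this fd is what gets passed around / polled for WP events */
    }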
Now imagine something like this:
1. Background snapshot preparation:
a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
to guest memory.
b) QEMU uffd-wp's the whole file
2. Background snapshot runs:
a) QEMU receives a UFFD-WP event.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU un-protects the page and wakes up any waiters.
3. Background snapshot completes:
a) QEMU unregisters the uffd handler
Wouldn't that be much nicer and much easier to handle? Yes, it is much
harder to implement because such an infrastructure does not exist yet,
and it most probably wouldn't be called uffd anymore, because we are
dealing with file access. But this way, it would actually be super easy
to use the feature across multiple processes and eventually to even
catch other file modifications.
Again, I am not sure if uffd-wp or softdirty make too much sense in
general when applied to shmem. But I'm happy to learn more.
--
Thanks,
David / dhildenb
On 11.08.21 18:15, David Hildenbrand wrote:
> On 04.08.21 21:17, Peter Xu wrote:
>> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>>> any state information regarding shared memory in per-process page tables: it
>>> just doesn't make too much sense.
>>>
>>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>>> belongs to the shared file ("did *someone* write to this page", "is
>>> *someone* interested into changes to that page", "is there something"). I
>>> know, that screams for a completely different design in respect to these
>>> features.
>>>
>>> I guess we start learning the hard way that shared memory is just different
>>> and requires different interfaces than per-process page table interfaces we
>>> have (pagemap, userfaultfd).
>>>
>>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>>> such stuff per an actual fd/memfd and not via process page tables is
>>> actually the right and clean approach. There are certainly many issues to
>>> solve, but conceptually to me it feels more natural to have these shared
>>> memory features not mangled into process page tables.
>>
>> Yes, we can explore all the possibilities, I'm totally fine with it.
>>
>> I just want to say I still don't think when there's page cache then we must put
>> all the page-relevant things into the page cache.
>
> [sorry for the late reply]
>
> Right, but for the case of shared, swapped out pages, the information is
> already there, in the page cache :)
>
>>
>> They're shared by processes, but process can still have its own way to describe
>> the relationship to that page in the cache, to me it's as simple as "we allow
>> process A to write to page cache P", while "we don't allow process B to write
>> to the same page" like the write bit.
>
> The issue I'm having uffd-wp as it was proposed for shared memory is
> that there is hardly a sane use case where we would *want* it to work
> that way.
>
> A UFFD-WP flag in a page table for shared memory means "please notify
> once this process modifies the shared memory (via page tables, not via
> any other fd modification)". Do we have an example application where
> these semantics makes sense and don't over-complicate the whole
> approach? I don't know any, thus I'm asking dumb questions :)
>
>
> For background snapshots in QEMU the flow would currently be like this,
> assuming all processes have the shared guest memory mapped.
>
> 1. Background snapshot preparation: QEMU requests all processes
> to uffd-wp the range
> a) All processes register a uffd handler on guest RAM
> b) All processes fault in all guest memory (essentially populating all
> memory): with a uffd-WP extensions we might be able to get rid of
> that, I remember you were working on that.
> c) All processes uffd-WP the range to set the bit in their page table
>
> 2. Background snapshot runs:
> a) A process either receives a UFFD-WP event and forwards it to QEMU or
> QEMU polls all other processes for UFFD events.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU triggers all processes to un-protect the page and wake up any
> waiters. All processes clear the uffd-WP bit in their page tables.
Oh, and I forgot, whenever we save any page to the migration stream, we
have to trigger all processes to un-protect.
--
Thanks,
David / dhildenb
On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
> On 04.08.21 21:17, Peter Xu wrote:
> > On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> > > TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> > > any state information regarding shared memory in per-process page tables: it
> > > just doesn't make too much sense.
> > >
> > > And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> > > belongs to the shared file ("did *someone* write to this page", "is
> > > *someone* interested into changes to that page", "is there something"). I
> > > know, that screams for a completely different design in respect to these
> > > features.
> > >
> > > I guess we start learning the hard way that shared memory is just different
> > > and requires different interfaces than per-process page table interfaces we
> > > have (pagemap, userfaultfd).
> > >
> > > I didn't have time to explore any alternatives yet, but I wonder if tracking
> > > such stuff per an actual fd/memfd and not via process page tables is
> > > actually the right and clean approach. There are certainly many issues to
> > > solve, but conceptually to me it feels more natural to have these shared
> > > memory features not mangled into process page tables.
> >
> > Yes, we can explore all the possibilities, I'm totally fine with it.
> >
> > I just want to say I still don't think when there's page cache then we must put
> > all the page-relevant things into the page cache.
>
> [sorry for the late reply]
>
> Right, but for the case of shared, swapped out pages, the information is
> already there, in the page cache :)
>
> >
> > They're shared by processes, but process can still have its own way to describe
> > the relationship to that page in the cache, to me it's as simple as "we allow
> > process A to write to page cache P", while "we don't allow process B to write
> > to the same page" like the write bit.
>
> The issue I'm having uffd-wp as it was proposed for shared memory is that
> there is hardly a sane use case where we would *want* it to work that way.
>
> A UFFD-WP flag in a page table for shared memory means "please notify once
> this process modifies the shared memory (via page tables, not via any other
> fd modification)". Do we have an example application where these semantics
> makes sense and don't over-complicate the whole approach? I don't know any,
> thus I'm asking dumb questions :)
>
>
> For background snapshots in QEMU the flow would currently be like this,
> assuming all processes have the shared guest memory mapped.
>
> 1. Background snapshot preparation: QEMU requests all processes
> to uffd-wp the range
> a) All processes register a uffd handler on guest RAM
To be explicit: not a handler; just register with uffd-wp and pass over the fd
to the main process.
> b) All processes fault in all guest memory (essentially populating all
> memory): with a uffd-WP extensions we might be able to get rid of
> that, I remember you were working on that.
> c) All processes uffd-WP the range to set the bit in their page table
>
> 2. Background snapshot runs:
> a) A process either receives a UFFD-WP event and forwards it to QEMU or
> QEMU polls all other processes for UFFD events.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU triggers all processes to un-protect the page and wake up any
> waiters. All processes clear the uffd-WP bit in their page tables.
>
> 3. Background snapshot completes:
> a) All processes unregister the uffd handler
>
>
> Now imagine something like this:
>
> 1. Background snapshot preparation:
> a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
> to guest memory.
> b) QEMU uffd-wp's the whole file
>
> 2. Background snapshot runs:
> a) QEMU receives a UFFD-WP event.
> b) QEMU writes the to-be-changed page to the migration stream.
> c) QEMU un-protect the page and wake up any waiters.
>
> 3. Background snapshot completes:
> a) QEMU unregister the uffd handler
>
>
> Wouldn't that be much nicer and much easier to handle? Yes, it is much
> harder to implement because such an infrastructure does not exist yet, and
> it most probably wouldn't be called uffd anymore, because we are dealing
> with file access. But this way, it would actually be super easy to use the
> feature across multiple processes and eventually to even catch other file
> modifications.
I can totally understand how you see this. We've discussed that before,
haven't we? About ideal worlds. :)
It would be great if this could work out, I hope so. So far I'm not that
ambitious, and as I said, I don't know whether there will be other concerns
when it goes into the page cache layer, and when it's a behaviour of multiple
processes where one of them can rule the others without the others noticing.
Even if we want to go that way, I think we should first come up with some way
to describe the domains that one uffd-wp registered file should behave upon.
It shouldn't be "any process touching this file".
One quick example in my mind: when a malicious process wants to stop another
daemon process, that becomes easier as long as the malicious process can
delete a file that the daemon reads/writes and replace it with a shmem file
that has uffd-wp registered (or maybe just a regular file on a file system,
if your proposal naturally works on them). The problem is, is it really
"legal" to be able to stop the daemon from running like that?
I also don't know the initial concept behind uffd's design and why it was
designed at the PTE level. Avoiding VMA manipulation should have been a major
factor, but I can't say I understand all of the reasons. Not sure whether
Andrea has any input here.
That's why I think the current uffd can still make sense with per-process
concepts, and we should keep it that way. When registering uffd-wp, yes, we
need to do that for multiple processes, but it also means each process is
fully aware that this is happening, so it's kind of verified that this is
wanted behaviour for that process. It'll happen with fewer "surprises", and
smells safer.
I don't think that won't work out. It may require all the processes to
support the uffd-wp APIs and cooperate, but so far that's how it should work
for me, in a safe and self-contained way. Say, every process should be aware
of what's going to happen on blocked page faults.
>
> Again, I am not sure if uffd-wp or softdirty make too much sense in general
> when applied to shmem. But I'm happy to learn more.
Me too; I'm more than glad to find out whether the page cache idea would be
welcomed or whether I'm just wrong about it. Until I understand more around
this, I still think the per-process, fd-based solution of uffd makes sense.
Thanks,
--
Peter Xu
On 11.08.21 20:25, Peter Xu wrote:
> On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
>> On 04.08.21 21:17, Peter Xu wrote:
>>> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>>>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>>>> any state information regarding shared memory in per-process page tables: it
>>>> just doesn't make too much sense.
>>>>
>>>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>>>> belongs to the shared file ("did *someone* write to this page", "is
>>>> *someone* interested into changes to that page", "is there something"). I
>>>> know, that screams for a completely different design in respect to these
>>>> features.
>>>>
>>>> I guess we start learning the hard way that shared memory is just different
>>>> and requires different interfaces than per-process page table interfaces we
>>>> have (pagemap, userfaultfd).
>>>>
>>>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>>>> such stuff per an actual fd/memfd and not via process page tables is
>>>> actually the right and clean approach. There are certainly many issues to
>>>> solve, but conceptually to me it feels more natural to have these shared
>>>> memory features not mangled into process page tables.
>>>
>>> Yes, we can explore all the possibilities, I'm totally fine with it.
>>>
>>> I just want to say I still don't think when there's page cache then we must put
>>> all the page-relevant things into the page cache.
>>
>> [sorry for the late reply]
>>
>> Right, but for the case of shared, swapped out pages, the information is
>> already there, in the page cache :)
>>
>>>
>>> They're shared by processes, but process can still have its own way to describe
>>> the relationship to that page in the cache, to me it's as simple as "we allow
>>> process A to write to page cache P", while "we don't allow process B to write
>>> to the same page" like the write bit.
>>
>> The issue I'm having uffd-wp as it was proposed for shared memory is that
>> there is hardly a sane use case where we would *want* it to work that way.
>>
>> A UFFD-WP flag in a page table for shared memory means "please notify once
>> this process modifies the shared memory (via page tables, not via any other
>> fd modification)". Do we have an example application where these semantics
>> makes sense and don't over-complicate the whole approach? I don't know any,
>> thus I'm asking dumb questions :)
>>
>>
>> For background snapshots in QEMU the flow would currently be like this,
>> assuming all processes have the shared guest memory mapped.
>>
>> 1. Background snapshot preparation: QEMU requests all processes
>> to uffd-wp the range
>> a) All processes register a uffd handler on guest RAM
>
> To be explicit: not a handler; just register with uffd-wp and pass over the fd
> to the main process.
Good point.
>
>> b) All processes fault in all guest memory (essentially populating all
>> memory): with a uffd-WP extensions we might be able to get rid of
>> that, I remember you were working on that.
>> c) All processes uffd-WP the range to set the bit in their page table
>>
>> 2. Background snapshot runs:
>> a) A process either receives a UFFD-WP event and forwards it to QEMU or
>> QEMU polls all other processes for UFFD events.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU triggers all processes to un-protect the page and wake up any
>> waiters. All processes clear the uffd-WP bit in their page tables.
>>
>> 3. Background snapshot completes:
>> a) All processes unregister the uffd handler
>>
>>
>> Now imagine something like this:
>>
>> 1. Background snapshot preparation:
>> a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
>> to guest memory.
>> b) QEMU uffd-wp's the whole file
>>
>> 2. Background snapshot runs:
>> a) QEMU receives a UFFD-WP event.
>> b) QEMU writes the to-be-changed page to the migration stream.
>> c) QEMU un-protect the page and wake up any waiters.
>>
>> 3. Background snapshot completes:
>> a) QEMU unregister the uffd handler
>>
>>
>> Wouldn't that be much nicer and much easier to handle? Yes, it is much
>> harder to implement because such an infrastructure does not exist yet, and
>> it most probably wouldn't be called uffd anymore, because we are dealing
>> with file access. But this way, it would actually be super easy to use the
>> feature across multiple processes and eventually to even catch other file
>> modifications.
>
> I can totally understand how you see this. We've discussed about that, isn't
> it? About the ideal worlds. :)
Well, let's dream big :)
>
> It would be great if this can work out, I hope so. So far I'm not that
> ambicious, and as I said, I don't know whether there will be other concerns
> when it goes into the page cache layer, and when it's a behavior of multiple
> processes where one of them can rule others without others being notice of it.
>
> Even if we want to go that way, I think we should first come up with some way
> to describe the domains that one uffd-wp registered file should behave upon.
> It shouldn't be "any process touching this file".
>
> One quick example in my mind is when a malicious process wants to stop another
> daemon process, it'll be easier as long as the malicious process can delete a
> file that the daemon used to read/write, replace it with a shmem with uffd-wp
> registered (or maybe just a regular file on file systems, if your proposal will
> naturally work on them). The problem is, is it really "legal" to be able to
> stop the daemon running like that?
Good question. I'd imagine, e.g., file sealing could forbid uffd (or
however it is called) registration on a file, and there would have to be
a way to reject files that have uffd registered. But it's certainly a
valid concern - and it raises the question of *what* we actually want to
apply such a concept to. Random files? Random memfds? Most probably not.
Special memfds created with an ALLOW_UFFD flag? Sounds like a good idea.
>
> I also don't know the initial concept when uffd is designed and why it's
> designed at pte level. Avoid vma manipulation should be a major factor, but I
> can't say I understand all of them. Not sure whether Andrea has any input here.
AFAIU originally a) avoiding signal handler madness, b) avoiding VMA
modifications and c) avoiding taking the mmap lock in write mode (well, that
didn't work out completely for uffd-wp for now IIRC).
>
> That's why I think current uffd can still make sense with per-process concepts
> and keep it that way. When register uffd-wp yes we need to do that for
> multiple processes, but it also means each process is fully aware that this is
> happening so it's kind of verified that this is wanted behavior for that
> process. It'll happen with less "surprises", and smells safer.
>
> I don't think that will not work out. It may require all the process to
> support uffd-wp apis and cooperate, but that's so far how it should work for me
> in a safe and self-contained way. Say, every process should be aware of what's
> going to happen on blocked page faults.
That's a valid concern, although I wonder if it can just be handled via
specially marked memfds ("this memfd might get a uffd handler registered
later").
>>
>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>> when applied to shmem. But I'm happy to learn more.
>
> Me too, I'm more than glad to know whether the page cache idea could be
> welcomed or am I just wrong about it. Before I understand more things around
> this, so far I still think the per-process based and fd-based solution of uffd
> still makes sense.
I'd be curious about applications where the per-process approach would
actually solve something a per-fd approach couldn't solve. Maybe there
are some that I just can't envision.
(using shmem for a single process only isn't a use case I consider
important :) )
--
Thanks,
David / dhildenb
On Wed, Aug 11, 2021 at 08:41:32PM +0200, David Hildenbrand wrote:
> On 11.08.21 20:25, Peter Xu wrote:
> > On Wed, Aug 11, 2021 at 06:15:37PM +0200, David Hildenbrand wrote:
> > > On 04.08.21 21:17, Peter Xu wrote:
> > > > On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
> > > > > TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
> > > > > any state information regarding shared memory in per-process page tables: it
> > > > > just doesn't make too much sense.
> > > > >
> > > > > And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
> > > > > belongs to the shared file ("did *someone* write to this page", "is
> > > > > *someone* interested into changes to that page", "is there something"). I
> > > > > know, that screams for a completely different design in respect to these
> > > > > features.
> > > > >
> > > > > I guess we start learning the hard way that shared memory is just different
> > > > > and requires different interfaces than per-process page table interfaces we
> > > > > have (pagemap, userfaultfd).
> > > > >
> > > > > I didn't have time to explore any alternatives yet, but I wonder if tracking
> > > > > such stuff per an actual fd/memfd and not via process page tables is
> > > > > actually the right and clean approach. There are certainly many issues to
> > > > > solve, but conceptually to me it feels more natural to have these shared
> > > > > memory features not mangled into process page tables.
> > > >
> > > > Yes, we can explore all the possibilities, I'm totally fine with it.
> > > >
> > > > I just want to say I still don't think when there's page cache then we must put
> > > > all the page-relevant things into the page cache.
> > >
> > > [sorry for the late reply]
> > >
> > > Right, but for the case of shared, swapped out pages, the information is
> > > already there, in the page cache :)
> > >
> > > >
> > > > They're shared by processes, but process can still have its own way to describe
> > > > the relationship to that page in the cache, to me it's as simple as "we allow
> > > > process A to write to page cache P", while "we don't allow process B to write
> > > > to the same page" like the write bit.
> > >
> > > The issue I'm having uffd-wp as it was proposed for shared memory is that
> > > there is hardly a sane use case where we would *want* it to work that way.
> > >
> > > A UFFD-WP flag in a page table for shared memory means "please notify once
> > > this process modifies the shared memory (via page tables, not via any other
> > > fd modification)". Do we have an example application where these semantics
> > > makes sense and don't over-complicate the whole approach? I don't know any,
> > > thus I'm asking dumb questions :)
> > >
> > >
> > > For background snapshots in QEMU the flow would currently be like this,
> > > assuming all processes have the shared guest memory mapped.
> > >
> > > 1. Background snapshot preparation: QEMU requests all processes
> > > to uffd-wp the range
> > > a) All processes register a uffd handler on guest RAM
> >
> > To be explicit: not a handler; just register with uffd-wp and pass over the fd
> > to the main process.
>
> Good point.
>
> >
> > > b) All processes fault in all guest memory (essentially populating all
> > > memory): with a uffd-WP extensions we might be able to get rid of
> > > that, I remember you were working on that.
> > > c) All processes uffd-WP the range to set the bit in their page table
> > >
> > > 2. Background snapshot runs:
> > > a) A process either receives a UFFD-WP event and forwards it to QEMU or
> > > QEMU polls all other processes for UFFD events.
> > > b) QEMU writes the to-be-changed page to the migration stream.
> > > c) QEMU triggers all processes to un-protect the page and wake up any
> > > waiters. All processes clear the uffd-WP bit in their page tables.
> > >
> > > 3. Background snapshot completes:
> > > a) All processes unregister the uffd handler
> > >
> > >
> > > Now imagine something like this:
> > >
> > > 1. Background snapshot preparation:
> > > a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
> > > to guest memory.
> > > b) QEMU uffd-wp's the whole file
> > >
> > > 2. Background snapshot runs:
> > > a) QEMU receives a UFFD-WP event.
> > > b) QEMU writes the to-be-changed page to the migration stream.
> > >    c) QEMU un-protects the page and wakes up any waiters.
> > >
> > > 3. Background snapshot completes:
> > >    a) QEMU unregisters the uffd handler
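> > >
> > > Purely as a strawman for the above, the fd-based flow might read roughly
> > > like this -- apart from memfd_create(), every filewp_*() call and type
> > > below is invented for illustration and does not exist:
> > >
> > > int memfd = memfd_create("guest-ram", MFD_CLOEXEC);        /* real API */
> > >
> > > /* 1a + 1b: register on the *file* and write-protect the whole range. */
> > > int wpfd = filewp_register(memfd);                          /* invented */
> > > filewp_protect(wpfd, 0, guest_ram_size);                    /* invented */
> > >
> > > for (;;) {
> > >     struct filewp_event ev;                                 /* invented */
> > >     read(wpfd, &ev, sizeof(ev));           /* 2a: wait for a write access */
> > >     send_page_to_stream(ev.offset);        /* 2b: QEMU-side helper, invented */
> > >     filewp_unprotect(wpfd, ev.offset, pagesize);  /* 2c: unprotect + wake */
> > > }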
> > >
> > >
> > > Wouldn't that be much nicer and much easier to handle? Yes, it is much
> > > harder to implement because such an infrastructure does not exist yet, and
> > > it most probably wouldn't be called uffd anymore, because we are dealing
> > > with file access. But this way, it would actually be super easy to use the
> > > feature across multiple processes and eventually to even catch other file
> > > modifications.
> >
> > I can totally understand how you see this. We've discussed that before,
> > haven't we? About ideal worlds. :)
>
> Well, let's dream big :)
>
> >
> > It would be great if this could work out, I hope so. So far I'm not that
> > ambitious, and as I said, I don't know whether there will be other concerns
> > when it goes into the page cache layer, and when it becomes a behavior of
> > multiple processes where one of them can rule the others without the others
> > being notified of it.
> >
> > Even if we want to go that way, I think we should first come up with some way
> > to describe the domain that one uffd-wp-registered file should apply to.
> > It shouldn't be "any process touching this file".
> >
> > One quick example in my mind: when a malicious process wants to stall another
> > daemon process, it becomes easier as long as the malicious process can delete
> > a file that the daemon reads/writes and replace it with a uffd-wp-registered
> > shmem file (or maybe just a regular file on a file system, if your proposal
> > will naturally work on them). The problem is, is it really "legal" to be able
> > to stall the daemon like that?
>
> Good question, I'd imagine e.g., file sealing could forbid uffd (or however
> it is called) registration on a file, and there would have to be a way to
> reject files that have uffd registered. But it's certainly a valid concern -
> and it raises the question of *what* we actually want to apply such a
> concept to. Random files? Random memfds? Most probably not. Special memfds
> created with an ALLOW_UFFD flag? Sounds like a good idea.
Note that when daemons open files, they may not be aware of what's underneath
and will just read the file directly. The attacker could still create the file
with uffd-wp enabled, no matter what flag we introduce.
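
For context on the sealing idea above, memfd sealing today looks like the
sketch below; a seal that forbids uffd registration does not exist and is
purely an assumption here (spelled F_SEAL_UFFD just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Creator: an memfd that still allows seals to be added later. */
	int fd = memfd_create("shared-buf", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0 || ftruncate(fd, 4096))
		return 1;
	/* Forbid shrinking/growing the file and adding further seals. */
	fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL);

	/* Consumer (e.g. a daemon that received the fd): inspect the seals
	 * before trusting the file.  A hypothetical F_SEAL_UFFD could one day
	 * promise "nobody can register userfaultfd on this file". */
	printf("seals: %#x\n", fcntl(fd, F_GET_SEALS));
	return 0;
}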
>
> >
> > I also don't know the initial concept behind uffd's design and why it was
> > designed at the pte level. Avoiding vma manipulation should have been a major
> > factor, but I can't say I understand all of the reasons. Not sure whether
> > Andrea has any input here.
>
> AFAIU originally a) avoid signal handler madness and b) avoid VMA
> modifications and c) avoid taking the mmap lock in write (well, that didn't
> work out completely for uffd-wp for now IIRC).
Nadav fixed that; it takes the read lock now, just like when it was introduced.
Please see mwriteprotect_range() and commit 6ce64428d62026a10c.
I won't say the message queue (vs. signal handling) exists because uffd is
pte-based; that seems to be an orthogonal design decision. But yeah, I agree
it's better than using signal handlers.
>
> >
> > That's why I think the current uffd can still make sense with per-process
> > concepts, and we can keep it that way. When registering uffd-wp, yes, we need
> > to do that for multiple processes, but it also means each process is fully
> > aware that this is happening, so it's kind of verified that this is wanted
> > behavior for that process. It'll happen with fewer "surprises", and feels safer.
> >
> > I don't think that won't work out. It may require all the processes to
> > support the uffd-wp APIs and cooperate, but so far that's how it should work
> > for me in a safe and self-contained way. Say, every process should be aware
> > of what's going to happen on blocked page faults.
>
> That's a valid concern, although I wonder if it can just be handled via
> specially marked memfds ("this memfd might get a uffd handler registered
> later").
Yes, please see my above concern. So I think we have at least reached consensus
on: (1) that idea is no longer userfaultfd but something else; what that is
remains to be defined. And (2) it definitely needs further thought and context
to support its validity and safety. uffd already got people worried about
safety; that's why the uffd selinux work and the unprivileged_userfaultfd
sysctl went into mainline. We wish the new concept good luck!
OTOH, the whole uffd idea is already in mainline. It has the limitation of
requiring all processes to be reworked to support uffd-wp, but actually the
same already happened for MISSING messages and our QE is testing those: that's
what we do with e.g. postcopy-migrating vhost-user-enabled OVS-DPDK - we pass
over a uffd registered in missing mode and let QEMU handle the page faults. So
it's a bit complicated, but it should work. And I hope you can also agree we
don't need to block uffd while that idea settles.
The pte markers idea needs comments; that's about implementation, and it'll be
great to have comments there, or even a NACK (better with a better suggestion,
though :). But the original idea of uffd being pte-based has never changed.
>
> > >
> > > Again, I am not sure if uffd-wp or softdirty make too much sense in general
> > > when applied to shmem. But I'm happy to learn more.
> >
> > Me too, I'm more than glad to know whether the page cache idea would be
> > welcomed or whether I'm just wrong about it. Until I understand more around
> > this, I still think the per-process, fd-based solution of uffd makes sense.
>
> I'd be curious about applications where the per-process approach would
> actually solve something a per-fd approach couldn't solve. Maybe there are
> some that I just can't envision.
Right, that's a good point.
Actually it could be something like virtio-mem, where some process shouldn't
have write privilege but we still allow some other process to write to the
shmem. Something like that.
>
> (using shmem for a single process only isn't a use case I consider important
> :) )
Do you still remember the discussion about "having qemu start to use memfd and
shmem by default"? :)
shmem is hard, but it's indeed useful in many cases, even when used by a single
process. For example, shmem-based VMs can do a local binary update without
migrating guest RAM (because memory is shared between the old and new
binaries!). To me it's always a valid request to enable both shmem and write
protection.
[It seems Andrea is not in the loop; do that for real]
Thanks,
--
Peter Xu
>>
>> Good question, I'd imagine e.g., file sealing could forbid uffd (or however
>> it is called) registration on a file, and there would have to be a way to
>> reject files that have uffd registered. But it's certainly a valid concern -
>> and it raises the question of *what* we actually want to apply such a
>> concept to. Random files? Random memfds? Most probably not. Special memfds
>> created with an ALLOW_UFFD flag? Sounds like a good idea.
>
> Note that when daemons open files, they may not be aware of what's underneath
> and will just read the file directly. The attacker could still create the file
> with uffd-wp enabled, no matter what flag we introduce.
Right, but we could, for example, use a prctl to make a process opt in to
opening possibly-uffd-wp-protected files at all. I guess securing that aspect
shouldn't be a hard nut to crack. At least that's my thinking.
>
>>
>>>
>>> I also don't know the initial concept behind uffd's design and why it was
>>> designed at the pte level. Avoiding vma manipulation should have been a major
>>> factor, but I can't say I understand all of the reasons. Not sure whether
>>> Andrea has any input here.
>>
>> AFAIU originally a) avoid signal handler madness and b) avoid VMA
>> modifications and c) avoid taking the mmap lock in write (well, that didn't
>> work out completely for uffd-wp for now IIRC).
>
> Nadav fixed that; it takes the read lock now, just like when it was introduced.
> Please see mwriteprotect_range() and commit 6ce64428d62026a10c.
Oh, rings a bell, thanks!
>>
>>>
>>> That's why I think the current uffd can still make sense with per-process
>>> concepts, and we can keep it that way. When registering uffd-wp, yes, we need
>>> to do that for multiple processes, but it also means each process is fully
>>> aware that this is happening, so it's kind of verified that this is wanted
>>> behavior for that process. It'll happen with fewer "surprises", and feels safer.
>>>
>>> I don't think that won't work out. It may require all the processes to
>>> support the uffd-wp APIs and cooperate, but so far that's how it should work
>>> for me in a safe and self-contained way. Say, every process should be aware
>>> of what's going to happen on blocked page faults.
>>
>> That's a valid concern, although I wonder if it can just be handled via
>> specially marked memfds ("this memfd might get a uffd handler registered
>> later").
>
> Yes, please see my above concern. So I think we have at least reached consensus
> on: (1) that idea is no longer userfaultfd but something else; what that is
> remains to be defined. And (2) it definitely needs further thought and context
> to support its validity and safety. uffd already got people worried about
> safety; that's why the uffd selinux work and the unprivileged_userfaultfd
> sysctl went into mainline. We wish the new concept good luck!
Sure, whenever you introduce random everlasting delays, we have to be very
careful about what we support, even if that means not supporting some ioctls
for such a special memfd (hello secretmem :)).
>
> OTOH, the whole uffd idea is already in mainline. It has the limitation of
> requiring all processes to be reworked to support uffd-wp, but actually the
> same already happened for MISSING messages and our QE is testing those: that's
> what we do with e.g. postcopy-migrating vhost-user-enabled OVS-DPDK - we pass
> over a uffd registered in missing mode and let QEMU handle the page faults. So
> it's a bit complicated, but it should work. And I hope you can also agree we
> don't need to block uffd while that idea settles.
Let's phrase it this way: instead of extending something that just doesn't fit
cleanly and feels kind of hackish (see my approach to teaching QEMU background
snapshots above), I'd much rather see something clean and actually performant
for the use cases I am aware of.
That doesn't mean that your current uffd-wp approach on shmem is all bad IMHO
(well, I make no decisions either way :) ); I'd just like us to look into
eventually finding an approach to handle this cleanly, instead of trying to
solve problems (pte markers) that we might not have to solve after all when
things are done differently.
>
> The pte markers idea needs comments; that's about implementation, and it'll be
> great to have comments there, or even a NACK (better with a better suggestion,
> though :). But the original idea of uffd being pte-based has never changed.
Right, I hope some other people can comment. If we want to go down that
path for uffd-wp, pte markers make sense. I'm not convinced we want them
to handle swapped shared pages, but that discussion is better off in
your posting.
>>
>>>>
>>>> Again, I am not sure if uffd-wp or softdirty make too much sense in general
>>>> when applied to shmem. But I'm happy to learn more.
>>>
>>> Me too, I'm more than glad to know whether the page cache idea would be
>>> welcomed or whether I'm just wrong about it. Until I understand more around
>>> this, I still think the per-process, fd-based solution of uffd makes sense.
>>
>> I'd be curious about applications where the per-process approach would
>> actually solve something a per-fd approach couldn't solve. Maybe there are
>> some that I just can't envision.
>
> Right, that's a good point.
>
> Actually it could be something like virtio-mem, where some process shouldn't
> have write privilege but we still allow some other process to write to the
> shmem. Something like that.
With virtio-mem, you most probably wouldn't want anybody writing to it,
at least from what I can tell. But I understand the rough idea -- just
that you cannot enforce something on another process that doesn't play
along (at least with the current uffd-wp approach! you could with an
fd-based approach).
>
>>
>> (using shmem for a single process only isn't a use case I consider important
>> :) )
>
> Do you still remember the discussion about "having qemu start to use memfd and
> shmem by default"? :)
Oh yes :)
>
> shmem is hard, but it's indeed useful in many cases, even when used by a single
> process. For example, shmem-based VMs can do a local binary update without
> migrating guest RAM (because memory is shared between the old and new
> binaries!). To me it's always a valid request to enable both shmem and write
> protection.
Right, but it would also just work with an fd-based approach. (well,
unless we're dealing with shared anonymous RAM, but that is just some
weird stuff for really exotic use cases)
--
Thanks,
David / dhildenb