2022-04-12 12:15:31

by Khalid Aziz

Subject: [PATCH v1 00/14] Add support for shared PTEs across processes

Page tables in the kernel consume some memory, and as long as the
number of mappings being maintained is small enough, the space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number of
times pages are shared goes up, the amount of memory consumed by page
tables starts to become significant.

Some field deployments commonly see memory pages shared across
thousands of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page size.
When 2000 processes map the same page in their address space, each one
of them requires 8 bytes for its PTE and together that adds up to 16K
of memory just to hold the PTEs for one 4K page. On a database server
with a 300GB SGA, a system crash with an out-of-memory condition was
seen when 1500+ clients tried to share this SGA even though the system
had 512GB of memory. On this server, the worst case scenario of all
1500 processes mapping every page from the SGA would have required
878GB+ for just the PTEs. If these PTEs could be shared, the amount of
memory saved would be very significant.

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds two new system
calls - (1) mshare(), which can be used by a process to create a
region (we will call it an mshare'd region) which can be used by other
processes to map the same pages using shared PTEs, (2) mshare_unlink(),
which is used to detach from the mshare'd region. Once an mshare'd
region is created, other process(es), assuming they have the right
permissions, can make the mshare() system call to map the shared pages
into their address space using the shared PTEs. When a process is done
using this mshare'd region, it makes an mshare_unlink() system call to
end its access. When the last process accessing an mshare'd region
calls mshare_unlink(), the mshare'd region is torn down and the memory
used by it is freed.


API
===

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing, mshare'd
region that will be shared at the PTE level. "name" refers to the name
of the shared object under /sys/fs/mshare. "addr" is the starting
address of this shared memory area and "length" is the size of this
area. oflags can be a combination of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at the pgdir level and hence this imposes the
following requirements on the address and size given to mshare():

- Starting address must be aligned to pgdir size (512GB on x86_64).
This alignment value can be looked up in /proc/sys/vm/mshare_size
(see the sketch following this list)
- Size must be a multiple of pgdir size
- Any mappings created in this address range at any time become
shared automatically
- Shared address range can have unmapped addresses in it. Any access
to an unmapped address will result in SIGBUS
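
For illustration, here is a minimal userspace sketch of checking these
requirements by reading the alignment value exported by this series;
the helper name is made up and it only encodes the rules listed above:

-----------------
#include <stdio.h>

/* Hypothetical helper: returns 1 if [addr, addr + len) satisfies the
 * mshare() alignment and size rules described above, 0 otherwise. */
static int mshare_range_ok(unsigned long addr, unsigned long len)
{
	unsigned long align;
	FILE *f = fopen("/proc/sys/vm/mshare_size", "r");

	if (!f)
		return 0;	/* kernel without this series */
	if (fscanf(f, "%lu", &align) != 1) {
		fclose(f);
		return 0;
	}
	fclose(f);

	/* Both start and size must be multiples of the pgdir size */
	return align && (addr % align == 0) && (len % align == 0);
}
-----------------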

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given to mshare() becomes a shared range in their address
space. Anonymous mappings will be shared and not COWed.
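
For example, the declaring process can keep populating the region
after sharing it. Below is a hedged sketch of a donor mapping a file
into an already mshare'd range; TB()/GB(), the 2TB start address and
"testregion" follow the example code further below, and the file path
is made up:

-----------------
	/* Donor: after the range has been mshare'd, map a file into it.
	 * The file path is hypothetical. */
	int dfd = open("/var/tmp/shared_data", O_RDWR);
	if (dfd < 0)
		perror("open data file failed");

	/* Because this lands inside the mshare'd range, the mapping and
	 * its PTEs become visible to every process attached to the region */
	void *obj = mmap((void *)TB(2), GB(1), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_FIXED, dfd, 0);
	if (obj == MAP_FAILED)
		perror("mmap into mshare'd range failed");
-----------------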

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============

A snippet of the code that a donor process would run looks like this:

-----------------
	/* Map a 512GB anonymous region starting at the 2TB mark */
	addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		perror("ERROR: mmap failed");

	/* Share the region and create /sys/fs/mshare/testregion */
	err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
			GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
	if (err < 0) {
		perror("mshare() syscall failed");
		exit(1);
	}

	strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute looks like this:

-----------------
	struct mshare_info minfo;

	/* The file lives under /sys/fs/mshare */
	fd = open("testregion", O_RDONLY);
	if (fd < 0) {
		perror("open failed");
		exit(1);
	}

	/* Read the start address and size of the mshare'd region */
	if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
		printf("INFO: %ld bytes shared at addr 0x%lx \n",
				minfo.size, minfo.start);
	else
		perror("read failed");

	close(fd);

	/* Attach at the address the donor established for the region */
	addr = (void *)minfo.start;
	err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
			O_RDWR, 0600);
	if (err < 0) {
		perror("mshare() syscall failed");
		exit(1);
	}

	printf("Guest mmap at %p:\n", addr);
	printf("%s\n", addr);
	printf("\nDone\n");

	err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
	if (err < 0) {
		perror("mshare_unlink() failed");
		exit(1);
	}
-----------------


Patch series
============

This series of patches is an initial implementation of these two
system calls. The code implements basic working functionality.

The prototypes for the two syscalls are:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)
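
These syscalls have no glibc wrappers, which is why the example code
above invokes them through syscall(2). A minimal sketch of wrappers,
assuming the x86_64 syscall numbers (451 and 452) assigned by patch 1
of this series; these numbers are not in mainline and may change:

-----------------
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

/* Syscall numbers as wired up by patch 1 of this series (x86_64 only) */
#ifndef __NR_mshare
#define __NR_mshare		451
#define __NR_mshare_unlink	452
#endif

static inline int sys_mshare(const char *name, void *addr, size_t len,
			     int oflag, mode_t mode)
{
	return syscall(__NR_mshare, name, addr, len, oflag, mode);
}

static inline int sys_mshare_unlink(const char *name)
{
	return syscall(__NR_mshare_unlink, name);
}
-----------------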

In order to facilitate page table sharing, this implementation adds a
new in-memory filesystem - msharefs, which is mounted at
/sys/fs/mshare. When a new mshare'd region is created, a file with
the name given by the initial mshare() call is created under this
filesystem. Permissions for this file are given by the "mode"
parameter to mshare(). The only operation supported on this file is
read. A read from this file returns a structure containing
information about the mshare'd region - (1) the starting virtual
address of the region, and (2) the size of the mshare'd region.
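
A sketch of that structure, matching the two fields the consumer
example above reads (the authoritative definition lives in the
patches):

-----------------
/* Record returned by read() on an msharefs file */
struct mshare_info {
	unsigned long start;	/* starting virtual address of the region */
	unsigned long size;	/* size of the mshare'd region in bytes */
};
-----------------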

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with the O_CREAT oflag.
The mshare() syscall then creates a new mm_struct which will host the
page tables for the mshare'd region. vma->vm_private_data for the
vmas covering the address range of this region is updated to point to
a structure containing a pointer to this new mm_struct. Existing page
tables are copied over to the new mm_struct.

A consumer process that wants to map the mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
the mshare'd region. It then calls mshare() with these values to map
the entire region into its address space. The consumer process calls
mshare_unlink() to terminate its access.


Since RFC
=========

Since the original RFC, this patch series has gained better error
handling, more robust locking, and an improved implementation of
mshare. It also incorporates feedback on the original RFC. The
alignment and size requirement is PGDIR_SIZE, the same as in the RFC,
and this is open to change based upon further feedback. More review
of this patch series is needed and much appreciated.



Khalid Aziz (14):
mm: Add new system calls mshare, mshare_unlink
mm: Add msharefs filesystem
mm: Add read for msharefs
mm: implement mshare_unlink syscall
mm: Add locking to msharefs syscalls
mm/msharefs: Check for mounted filesystem
mm: Add vm flag for shared PTE
mm/mshare: Add basic page table sharing using mshare
mm: Do not free PTEs for mshare'd PTEs
mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
range
mm/mshare: unmap vmas in mshare_unlink
mm/mshare: Add a proc file with mshare alignment/size information
mm/mshare: Enforce mshare'd region permissions
mm/mshare: Copy PTEs to host mm

Documentation/filesystems/msharefs.rst | 19 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/linux/mm.h | 11 +
include/trace/events/mmflags.h | 3 +-
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/magic.h | 1 +
include/uapi/linux/mman.h | 5 +
kernel/sysctl.c | 7 +
mm/Makefile | 2 +-
mm/internal.h | 7 +
mm/memory.c | 105 ++++-
mm/mshare.c | 587 +++++++++++++++++++++++++
12 files changed, 750 insertions(+), 6 deletions(-)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c

--
2.32.0


2022-04-12 12:48:31

by Eric W. Biederman

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

Khalid Aziz <[email protected]> writes:

> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.
>
> Some of the field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> bytes long which is very small compared to the 4K page size. When 2000
> processes map the same page in their address space, each one of them
> requires 8 bytes for its PTE and together that adds up to 16K of memory
> just to hold the PTEs for one 4K page. On a database server with 300GB
> SGA, a system crash was seen with out-of-memory condition when 1500+
> clients tried to share this SGA even though the system had 512GB of
> memory. On this server, in the worst case scenario of all 1500
> processes mapping every page from SGA would have required 878GB+ for
> just the PTEs. If these PTEs could be shared, amount of memory saved
> is very significant.
>
> This patch series implements a mechanism in kernel to allow userspace
> processes to opt into sharing PTEs. It adds two new system calls - (1)
> mshare(), which can be used by a process to create a region (we will
> call it mshare'd region) which can be used by other processes to map
> same pages using shared PTEs, (2) mshare_unlink() which is used to
> detach from the mshare'd region. Once an mshare'd region is created,
> other process(es), assuming they have the right permissions, can make
> the mshare() system call to map the shared pages into their address
> space using the shared PTEs. When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>

>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
> exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>

Please don't add system calls that take names.

Please just open objects on the filesystem and allow multiple instances
of the filesystem. Otherwise someone is going to have to come along
later and implement namespace magic to deal with your new system calls.
You already have a filesystem; all that is needed to avoid having to
introduce namespace magic is to simply allow multiple instances of the
filesystem to be mounted.
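
For illustration, a sketch of what multiple instances could look like
from userspace; the mount points are made up, and the series as posted
uses get_tree_single, so both mounts would currently refer to the same
instance:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Two independent msharefs instances, e.g. one per container */
	if (mount("none", "/mnt/mshare-a", "msharefs", 0, NULL) == -1)
		perror("mount msharefs instance A");
	if (mount("none", "/mnt/mshare-b", "msharefs", 0, NULL) == -1)
		perror("mount msharefs instance B");
	return 0;
}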

On that note. Since you have a filesystem, introduce a well-known
name for a directory and in that directory place all of the information
and possibly control files for your filesystem. No need for proc
files and the like, and if at some point you have mount options that
allow the information to be changed you can have different mounts
with different values present.


This is just me. But I find it weird that you don't use mmap
to place the shared area from the mshare fd into your address space.

I think I would do:
// Establish the mshare region
addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_MSHARE, msharefd, 0);

// Remove the mshare region
addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
MAP_FIXED | MAP_MUNSHARE, msharefd, 0);

I could see a point of using separate system calls instead of
adding MAP_MSHARE and MAP_MUNSHARE flags.

What are the locking implications of taking a page fault in the shared
region?

Is it a noop or is it going to make some of the nasty locking we have in
the kernel for things like truncates worse?

Eric

2022-04-12 13:33:51

by Khalid Aziz

Subject: [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls

Lock the root inode for msharefs when creating a new file or
deleting an existing one to avoid races. mshare syscalls are low
frequency operations, so locking the root inode is reasonable.

Signed-off-by: Khalid Aziz <[email protected]>
---
mm/mshare.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mshare.c b/mm/mshare.c
index b9d7836f9bd1..85ccb7f02f33 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -195,11 +195,12 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
if (err)
goto err_out;
+ inode_lock(d_inode(msharefs_sb->s_root));
dentry = d_lookup(msharefs_sb->s_root, &namestr);
if (dentry && (oflag & (O_EXCL|O_CREAT))) {
err = -EEXIST;
dput(dentry);
- goto err_out;
+ goto err_unlock_inode;
}

if (dentry) {
@@ -232,6 +233,7 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
goto err_relinfo;
}

+ inode_unlock(d_inode(msharefs_sb->s_root));
putname(fname);
return 0;

@@ -239,6 +241,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
kfree(info);
err_relmm:
mmput(mm);
+err_unlock_inode:
+ inode_unlock(d_inode(msharefs_sb->s_root));
err_out:
putname(fname);
return err;
@@ -264,10 +268,11 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
err = msharefs_d_hash(msharefs_sb->s_root, &namestr);
if (err)
goto err_out;
+ inode_lock(d_inode(msharefs_sb->s_root));
dentry = d_lookup(msharefs_sb->s_root, &namestr);
if (dentry == NULL) {
err = -EINVAL;
- goto err_out;
+ goto err_unlock_inode;
}

inode = d_inode(dentry);
@@ -290,11 +295,14 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
dput(dentry);
}

+ inode_unlock(d_inode(msharefs_sb->s_root));
putname(fname);
return 0;

err_dput:
dput(dentry);
+err_unlock_inode:
+ inode_unlock(d_inode(msharefs_sb->s_root));
err_out:
putname(fname);
return err;
--
2.32.0

2022-04-12 14:32:56

by Khalid Aziz

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 4/11/22 14:10, Eric W. Biederman wrote:
> Khalid Aziz <[email protected]> writes:
>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 16K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system crash was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mshare() system call to map the shared pages into their address
>> space using the shared PTEs. When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>> exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>
> Please don't add system calls that take names.
>
> Please just open objects on the filesystem and allow multi-instances of
> the filesystem. Otherwise someone is going to have to come along later
> and implement namespace magic to deal with your new system calls. You
> already have a filesystem all that is needed to avoid having to
> introduce namespace magic is to simply allow multiple instances of the
> filesystem to be mounted.

Hi Eric,

Thanks for taking the time to provide feedback.

That sounds reasonable. Is something like this more in line with what you are thinking:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
mshare(fd, start, end, O_RDWR);

This sequence creates the mshare'd region and assigns it an address range (which may or may not be empty). Then a client
process would do something like:

fd = open("/sys/fs/mshare/testregion", O_RDONLY);
mshare(fd, start, end, O_RDWR);

which maps the mshare'd region into client process's address space.

>
> On that note. Since you have a filesystem, introduce a well known
> name for a directory and in that directory place all of the information
> and possibly control files for your filesystem. No need to for proc
> files and the like, and if at somepoint you have mount options that
> allow the information to be changed you can have different mounts
> with different values present.

So have the kernel mount msharefs at /sys/fs/mshare and create a file /sys/fs/mshare/info. A read from
/sys/fs/mshare/info returns what /proc/sys/vm/mshare_size in patch 12 returns. Did I understand that correctly?

>
>
> This is must me. But I find it weird that you don't use mmap
> to place the shared area from the mshare fd into your address space.
>
> I think I would do:
> // Establish the mshare region
> addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
> MAP_SHARED | MAP_MSHARE, msharefd, 0);
>
> // Remove the mshare region
> addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
> MAP_FIXED | MAP_MUNSHARE, msharefd, 0);
>
> I could see a point of using separate system calls instead of
> adding MAP_MSHARE and MAP_MUNSHARE flags.

In that case, we can just do away with syscalls completely. For instance, to create mshare'd region, one would:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
addr = mmap(NULL, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
or
addr = mmap(myaddr, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

The first mmap using this fd sets the address range, which stays the same throughout the life of the region. To populate data
in this address range, the process could use mmap/mremap and other mechanisms subsequently.

To map this mshare'd region into its address space, a process would:

struct mshare_info {
	unsigned long start;
	unsigned long size;
} minfo;
fd = open("/sys/fs/mshare/testregion", O_RDWR);
read(fd, &minfo, sizeof(struct mshare_info));
addr = mmap(NULL, minfo.size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

When done with the mshare'd region, the process would call unlink("/sys/fs/mshare/testregion").

This changes the API significantly, but if it results in a cleaner and more maintainable API, I will make the changes.

>
> What are the locking implications of taking a page fault in the shared
> region?
>
> Is it a noop or is it going to make some of the nasty locking we have in
> the kernel for things like truncates worse?
>
> Eric

Handling a page fault would require locking the mm of the faulting process as well as the host mm, which would synchronize
with any other process taking a page fault in the shared region at the same time.

Thanks,
Khalid

2022-04-12 20:38:44

by Khalid Aziz

Subject: [PATCH v1 06/14] mm/mshare: Check for mounted filesystem

Check if msharefs is mounted before performing any msharefs
operations to avoid inadvertent NULL pointer dereferences.

Signed-off-by: Khalid Aziz <[email protected]>
---
mm/mshare.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/mshare.c b/mm/mshare.c
index 85ccb7f02f33..cd2f7ad24d9d 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -176,6 +176,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
struct qstr namestr;
int err = PTR_ERR(fname);

+ /*
+ * Is msharefs mounted? TODO: If not mounted, return error
+ * or automount?
+ */
+ if (msharefs_sb == NULL)
+ return -ENOENT;
+
/*
* Address range being shared must be aligned to pgdir
* boundary and its size must be a multiple of pgdir size
@@ -260,6 +267,9 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
struct mshare_data *info;
struct qstr namestr;

+ if (msharefs_sb == NULL)
+ return -ENOENT;
+
if (IS_ERR(fname))
goto err_out;

--
2.32.0

2022-04-12 21:21:29

by Khalid Aziz

Subject: [PATCH v1 02/14] mm/mshare: Add msharefs filesystem

Add a ram-based filesystem that contains the files created for each
shared address range. This adds just the filesystem and creation of
files. Page table entries for these shared ranges created by mshare
syscall are still not shared.

Signed-off-by: Khalid Aziz <[email protected]>
---
Documentation/filesystems/msharefs.rst | 19 +++
include/uapi/linux/magic.h | 1 +
mm/mshare.c | 191 +++++++++++++++++++++++--
3 files changed, 197 insertions(+), 14 deletions(-)
create mode 100644 Documentation/filesystems/msharefs.rst

diff --git a/Documentation/filesystems/msharefs.rst b/Documentation/filesystems/msharefs.rst
new file mode 100644
index 000000000000..fd161f67045d
--- /dev/null
+++ b/Documentation/filesystems/msharefs.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+msharefs - a filesystem to support shared page tables
+=====================================================
+
+msharefs is a ram-based filesystem that allows multiple processes to
+share page table entries for shared pages.
+
+msharefs is typically mounted like this::
+
+ mount -t msharefs none /sys/fs/mshare
+
+When a process calls mshare syscall with a name for the shared address
+range, a file with the same name is created under msharefs with that
+name. This file can be opened by another process, if permissions
+allow, to query the addresses shared under this range. These files are
+removed by mshare_unlink syscall and can not be deleted directly.
+Hence these files are created as immutable files.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f724129c0425..2a57a6ec6f3e 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -105,5 +105,6 @@
#define Z3FOLD_MAGIC 0x33
#define PPC_CMM_MAGIC 0xc7571590
#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
+#define MSHARE_MAGIC 0x4d534852 /* "MSHR" */

#endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/mshare.c b/mm/mshare.c
index 436195c0e74e..ad695288d4bb 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -10,20 +10,117 @@
* Matthew Wilcox <[email protected]>
*/

-#include <linux/anon_inodes.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <linux/pseudo_fs.h>
+#include <linux/fileattr.h>
+#include <uapi/linux/magic.h>
+#include <uapi/linux/limits.h>

-static const struct file_operations mshare_fops = {
+static struct super_block *msharefs_sb;
+
+static const struct file_operations msharefs_file_operations = {
+ .open = simple_open,
+ .llseek = no_llseek,
};

+static int
+msharefs_d_hash(const struct dentry *dentry, struct qstr *qstr)
+{
+ unsigned long hash = init_name_hash(dentry);
+ const unsigned char *s = qstr->name;
+ unsigned int len = qstr->len;
+
+ while (len--)
+ hash = partial_name_hash(*s++, hash);
+ qstr->hash = end_name_hash(hash);
+ return 0;
+}
+
+static struct dentry
+*msharefs_alloc_dentry(struct dentry *parent, const char *name)
+{
+ struct dentry *d;
+ struct qstr q;
+ int err;
+
+ q.name = name;
+ q.len = strlen(name);
+
+ err = msharefs_d_hash(parent, &q);
+ if (err)
+ return ERR_PTR(err);
+
+ d = d_alloc(parent, &q);
+ if (d)
+ return d;
+
+ return ERR_PTR(-ENOMEM);
+}
+
+static struct inode
+*msharefs_get_inode(struct super_block *sb, int mode)
+{
+ struct inode *inode = new_inode(sb);
+
+ if (inode) {
+ inode->i_ino = get_next_ino();
+ inode->i_mode = mode;
+
+ /*
+ * msharefs are not meant to be manipulated from userspace.
+ * Reading from the file is the only allowed operation
+ */
+ inode->i_flags = S_IMMUTABLE;
+
+ inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+ inode->i_fop = &msharefs_file_operations;
+ inode->i_size = 0;
+ inode->i_uid = current_fsuid();
+ inode->i_gid = current_fsgid();
+ }
+
+ return inode;
+}
+
+static int
+mshare_file_create(const char *name, unsigned long flags)
+{
+ struct inode *inode;
+ struct dentry *root, *dentry;
+ int err = 0;
+
+ root = msharefs_sb->s_root;
+
+ inode = msharefs_get_inode(msharefs_sb, S_IFREG | 0400);
+ if (IS_ERR(inode))
+ return PTR_ERR(inode);
+
+ dentry = msharefs_alloc_dentry(root, name);
+ if (IS_ERR(dentry)) {
+ err = PTR_ERR(dentry);
+ goto fail_inode;
+ }
+
+ d_add(dentry, inode);
+
+ return err;
+
+fail_inode:
+ iput(inode);
+ return err;
+}
+
/*
- * mshare syscall. Returns a file descriptor
+ * mshare syscall
*/
-SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
unsigned long, len, int, oflag, mode_t, mode)
{
- int fd;
+ char mshare_name[NAME_MAX];
+ int err;

/*
* Address range being shared must be aligned to pgdir
@@ -32,15 +129,14 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
if ((addr | len) & (PGDIR_SIZE - 1))
return -EINVAL;

- /*
- * Allocate a file descriptor to return
- *
- * TODO: This code ignores the object name completely. Add
- * support for that
- */
- fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+ err = copy_from_user(mshare_name, name, NAME_MAX);
+ if (err)
+ goto err_out;

- return fd;
+ err = mshare_file_create(mshare_name, oflag);
+
+err_out:
+ return err;
}

/*
@@ -48,7 +144,8 @@ SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
*/
SYSCALL_DEFINE1(mshare_unlink, const char *, name)
{
- int fd;
+ char mshare_name[NAME_MAX];
+ int err;

/*
* Delete the named object
@@ -56,5 +153,71 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
* TODO: Mark mshare'd range for deletion
*
*/
+ err = copy_from_user(mshare_name, name, NAME_MAX);
+ if (err)
+ goto err_out;
+ return 0;
+
+err_out:
+ return err;
+}
+
+static const struct dentry_operations msharefs_d_ops = {
+ .d_hash = msharefs_d_hash,
+};
+
+static int
+msharefs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+ static const struct tree_descr empty_descr = {""};
+ int err;
+
+ sb->s_d_op = &msharefs_d_ops;
+ err = simple_fill_super(sb, MSHARE_MAGIC, &empty_descr);
+ if (err)
+ return err;
+
+ msharefs_sb = sb;
+ return 0;
+}
+
+static int
+msharefs_get_tree(struct fs_context *fc)
+{
+ return get_tree_single(fc, msharefs_fill_super);
+}
+
+static const struct fs_context_operations msharefs_context_ops = {
+ .get_tree = msharefs_get_tree,
+};
+
+static int
+mshare_init_fs_context(struct fs_context *fc)
+{
+ fc->ops = &msharefs_context_ops;
return 0;
}
+
+static struct file_system_type mshare_fs = {
+ .name = "msharefs",
+ .init_fs_context = mshare_init_fs_context,
+ .kill_sb = kill_litter_super,
+};
+
+static int
+mshare_init(void)
+{
+ int ret = 0;
+
+ ret = sysfs_create_mount_point(fs_kobj, "mshare");
+ if (ret)
+ return ret;
+
+ ret = register_filesystem(&mshare_fs);
+ if (ret)
+ sysfs_remove_mount_point(fs_kobj, "mshare");
+
+ return ret;
+}
+
+fs_initcall(mshare_init);
--
2.32.0

2022-04-12 21:24:35

by Khalid Aziz

Subject: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

mshare'd PTEs should not be removed when a task exits. These PTEs
are removed when the last task sharing the PTEs exits. Add a check
for shared PTEs and skip them.

Signed-off-by: Khalid Aziz <[email protected]>
---
mm/memory.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c77c0d643ea8..e7c5bc6f8836 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
} else {
/*
* Optimization: gather nearby vmas into one call down
+ * as long as they all belong to the same mm (that
+ * may not be the case if a vma is part of mshare'd
+ * range
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
- && !is_vm_hugetlb_page(next)) {
+ && !is_vm_hugetlb_page(next)
+ && vma->vm_mm == tlb->mm) {
vma = next;
next = vma->vm_next;
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
- free_pgd_range(tlb, addr, vma->vm_end,
- floor, next ? next->vm_start : ceiling);
+ /*
+ * Free pgd only if pgd is not allocated for an
+ * mshare'd range
+ */
+ if (vma->vm_mm == tlb->mm)
+ free_pgd_range(tlb, addr, vma->vm_end,
+ floor, next ? next->vm_start : ceiling);
}
vma = next;
}
@@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;

+ /*
+ * If this is an mshare'd page, do not unmap it since it might
+ * still be in use.
+ */
+ if (vma->vm_mm != tlb->mm)
+ return;
+
BUG_ON(addr >= end);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
--
2.32.0

2022-04-12 21:39:39

by Khalid Aziz

Subject: [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information

mshare regions are subject to size and alignment requirements. This
alignment boundary can be different on different architectures and
userspace needs a way to know what the requirement is. Add a file
/proc/sys/vm/mshare_size that can be read by userspace to get
the alignment and size requirement.

Signed-off-by: Khalid Aziz <[email protected]>
---
v1:
- Provide a way for userspace to determine alignment and
size restriction (based upon feedback from Dave Hansen)

include/linux/mm.h | 1 +
kernel/sysctl.c | 7 +++++++
mm/mshare.c | 3 +++
3 files changed, 11 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 821ed7ee7b41..d9456d424202 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3339,6 +3339,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
#endif

extern int sysctl_nr_trim_pages;
+extern ulong sysctl_mshare_size;

#ifdef CONFIG_PRINTK
void mem_dump_obj(void *object);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 730ab56d9e92..66697ba5da88 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2810,6 +2810,13 @@ static struct ctl_table vm_table[] = {
.extra2 = SYSCTL_ONE,
},
#endif
+ {
+ .procname = "mshare_size",
+ .data = &sysctl_mshare_size,
+ .maxlen = sizeof(sysctl_mshare_size),
+ .mode = 0444,
+ .proc_handler = proc_doulongvec_minmax,
+ },
{ }
};

diff --git a/mm/mshare.c b/mm/mshare.c
index ec23d1db79b2..88c7cefc933d 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -28,6 +28,8 @@ struct mshare_data {
refcount_t refcnt;
};

+ulong sysctl_mshare_size;
+
static struct super_block *msharefs_sb;

/* Returns holding the host mm's lock for read. Caller must release. */
@@ -573,6 +575,7 @@ mshare_init(void)
if (ret)
sysfs_remove_mount_point(fs_kobj, "mshare");

+ sysctl_mshare_size = PGDIR_SIZE;
return ret;
}

--
2.32.0

2022-04-12 21:50:18

by Matthew Wilcox

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.

All of this is true. However, I've found a lot of people don't see this
as compelling. I've had more success explaining this from a different
direction:

--- 8< ---

Linux supports processes which share all of their address space (threads)
and processes that share none of their address space (tasks). We propose
a useful intermediate model where two or more cooperating processes
can choose to share portions of their address space with each other.
The shared portion is referred to by a file descriptor which processes
can choose to attach to their own address space.

Modifications to the shared region affect all processes sharing
that region, just as changes by one thread affect all threads in a
multithreaded program. This implies a certain level of trust between
the different processes (ie malicious processes should not be allowed
access to the mshared region).

--- 8< ---

Another argument that MM developers find compelling is that we can reduce
some of the complexity in hugetlbfs where it has the ability to share
page tables between processes.

One objection that was raised is that the mechanism for starting the
shared region is a bit clunky. Did you investigate the proposed approach
of creating an empty address space, attaching to it and using an fd-based
mmap to modify its contents?

> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.

Can you explain why this is a syscall instead of being a library
function which does

int dirfd = open("/sys/fs/mshare", O_RDONLY);
err = unlinkat(dirfd, name, 0);
close(dirfd);
return err;

Does msharefs support creating directories, so that we can use file
permissions to limit who can see the sharable files? Or is it strictly
a single-level-deep hierarchy?

2022-04-12 22:28:31

by Khalid Aziz

Subject: [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink

Add two new system calls to support PTE sharing across processes through
explicit declarations of shared address space. There is almost no
implementation in this patch; it only wires up the system calls for
x86_64. mshare() returns a file descriptor which does not support
any operations yet.

Signed-off-by: Khalid Aziz <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
include/uapi/asm-generic/unistd.h | 7 ++-
mm/Makefile | 2 +-
mm/mshare.c | 60 ++++++++++++++++++++++++++
4 files changed, 69 insertions(+), 2 deletions(-)
create mode 100644 mm/mshare.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..e6e53b85fea6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,8 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common mshare sys_mshare
+452 common mshare_unlink sys_mshare_unlink

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c48b0ae3ba3..d546086d0661 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,13 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)

+#define __NR_mshare 451
+__SYSCALL(__NR_mshare, sys_mshare)
+#define __NR_mshare_unlink 452
+__SYSCALL(__NR_mshare_unlink, sys_mshare_unlink)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 453

/*
* 32 bit systems traditionally used different
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce3..70a470b5ebe3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,7 +37,7 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)

mmu-y := nommu.o
-mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o mshare.o \
mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
msync.o page_vma_mapped.o pagewalk.o \
pgtable-generic.o rmap.o vmalloc.o
diff --git a/mm/mshare.c b/mm/mshare.c
new file mode 100644
index 000000000000..436195c0e74e
--- /dev/null
+++ b/mm/mshare.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * mm/mshare.c
+ *
+ * Page table sharing code
+ *
+ *
+ * Copyright (C) 2021 Oracle Corp. All rights reserved.
+ * Authors: Khalid Aziz <[email protected]>
+ * Matthew Wilcox <[email protected]>
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/syscalls.h>
+
+static const struct file_operations mshare_fops = {
+};
+
+/*
+ * mshare syscall. Returns a file descriptor
+ */
+SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
+ unsigned long, len, int, oflag, mode_t, mode)
+{
+ int fd;
+
+ /*
+ * Address range being shared must be aligned to pgdir
+ * boundary and its size must be a multiple of pgdir size
+ */
+ if ((addr | len) & (PGDIR_SIZE - 1))
+ return -EINVAL;
+
+ /*
+ * Allocate a file descriptor to return
+ *
+ * TODO: This code ignores the object name completely. Add
+ * support for that
+ */
+ fd = anon_inode_getfd("mshare", &mshare_fops, NULL, O_RDWR);
+
+ return fd;
+}
+
+/*
+ * mshare_unlink syscall. Close and remove the named mshare'd object
+ */
+SYSCALL_DEFINE1(mshare_unlink, const char *, name)
+{
+ int fd;
+
+ /*
+ * Delete the named object
+ *
+ * TODO: Mark mshare'd range for deletion
+ *
+ */
+ return 0;
+}
--
2.32.0

2022-04-12 22:53:46

by Dave Hansen

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 4/11/22 10:37, Matthew Wilcox wrote:
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

When could this complexity reduction actually happen in practice? Can
this mshare thingy be somehow dropped in underneath the existing
hugetlbfs implementation? Or would userspace need to change?

2022-04-12 23:15:56

by Matthew Wilcox

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On Mon, Apr 11, 2022 at 11:51:46AM -0700, Dave Hansen wrote:
> On 4/11/22 10:37, Matthew Wilcox wrote:
> > Another argument that MM developers find compelling is that we can reduce
> > some of the complexity in hugetlbfs where it has the ability to share
> > page tables between processes.
>
> When could this complexity reduction actually happen in practice? Can
> this mshare thingy be somehow dropped in underneath the existing
> hugetlbfs implementation? Or would userspace need to change?

Userspace needs to opt in to mshare, so there's going to be a transition
period where we still need hugetlbfs to support it, but I have
the impression that the users that need page table sharing are pretty
specialised and we'll be able to find them all before disabling it.

I don't think we can make it transparent to userspace, but I'll noodle
on that a bit.

2022-04-12 23:29:23

by Khalid Aziz

Subject: [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE

Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
a function to determine if a vma shares PTEs by checking this flag.
This is to be used to find the shared page table entries on page fault
for vmas sharing PTEs.

Signed-off-by: Khalid Aziz <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/mm.h | 8 ++++++++
include/trace/events/mmflags.h | 3 ++-
mm/internal.h | 5 +++++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5744a3fc4716..821ed7ee7b41 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -308,11 +308,13 @@ extern unsigned int kobjsize(const void *objp);
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -354,6 +356,12 @@ extern unsigned int kobjsize(const void *objp);
# define VM_MTE_ALLOWED VM_NONE
#endif

+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_SHARED_PT VM_HIGH_ARCH_5
+#else
+#define VM_SHARED_PT 0
+#endif
+
#ifndef VM_GROWSUP
# define VM_GROWSUP VM_NONE
#endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 116ed4d5d0f8..002dbf2711c5 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -184,7 +184,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \
{VM_MIXEDMAP, "mixedmap" }, \
{VM_HUGEPAGE, "hugepage" }, \
{VM_NOHUGEPAGE, "nohugepage" }, \
- {VM_MERGEABLE, "mergeable" } \
+ {VM_MERGEABLE, "mergeable" }, \
+ {VM_SHARED_PT, "sharedpt" } \

#define show_vma_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..cf50a471384e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,4 +718,9 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags);

+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_SHARED_PT;
+}
+
#endif /* __MM_INTERNAL_H */
--
2.32.0

2022-04-12 23:39:22

by Khalid Aziz

Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 4/11/22 11:37, Matthew Wilcox wrote:
> On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>
> All of this is true. However, I've found a lot of people don't see this
> as compelling. I've had more success explaining this from a different
> direction:
>
> --- 8< ---
>
> Linux supports processes which share all of their address space (threads)
> and processes that share none of their address space (tasks). We propose
> a useful intermediate model where two or more cooperating processes
> can choose to share portions of their address space with each other.
> The shared portion is referred to by a file descriptor which processes
> can choose to attach to their own address space.
>
> Modifications to the shared region affect all processes sharing
> that region, just as changes by one thread affect all threads in a
> multithreaded program. This implies a certain level of trust between
> the different processes (ie malicious processes should not be allowed
> access to the mshared region).
>
> --- 8< ---
>
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

This all sounds reasonable.

>
> One objection that was raised is that the mechanism for starting the
> shared region is a bit clunky. Did you investigate the proposed approach
> of creating an empty address space, attaching to it and using an fd-based
> mmap to modify its contents?

I want to make sure I understand this correctly. In the example I gave, the process creating the mshare'd region maps in the
address space first, possibly using mmap(). It then calls mshare() to share this already-mapped region. Are you
suggesting that the process be able to call mshare() before mapping in the address range and then map things into that
address range later? If yes, it is my intent to support that after the initial implementation, as an expansion of the original
concept.

>
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>
> Can you explain why this is a syscall instead of being a library
> function which does
>
> int dirfd = open("/sys/fs/mshare");
> err = unlinkat(dirfd, name, 0);
> close(dirfd);
> return err;

mshare_unlink can be a simple unlink on the file in msharefs. The API will be asymmetrical in that creating an mshare'd region
is a syscall while tearing it down is a file op. I don't mind saving a syscall slot. Would you prefer it that way?

>
> Does msharefs support creating directories, so that we can use file
> permissions to limit who can see the sharable files? Or is it strictly
> a single-level-deep hierarchy?
>

For now msharefs is a single-level-deep hierarchy. It can be expanded to support directories to limit visibility of filenames.
Would you prefer to see it support directories from the beginning, or can that be a future expansion of this feature?

Thanks,
Khalid

2022-04-12 23:41:31

by Khalid Aziz

Subject: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare

This patch adds basic page table sharing across tasks via the
mshare syscall. It does this by creating a new mm_struct which
hosts the shared vmas and page tables. This mm_struct is
maintained as long as there is at least one task using the mshare'd
range. It is cleaned up by the last mshare_unlink syscall.

Signed-off-by: Khalid Aziz <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/internal.h | 2 +
mm/memory.c | 35 ++++++++++
mm/mshare.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 214 insertions(+), 13 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cf50a471384e..68f82f0f8b66 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags);

+extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
+ unsigned long *addrp);
static inline bool vma_is_shared(const struct vm_area_struct *vma)
{
return vma->vm_flags & VM_SHARED_PT;
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..c77c0d643ea8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, struct pt_regs *regs)
{
vm_fault_t ret;
+ bool shared = false;

__set_current_state(TASK_RUNNING);

@@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);

+ if (unlikely(vma_is_shared(vma))) {
+ ret = find_shared_vma(&vma, &address);
+ if (ret)
+ return ret;
+ if (!vma)
+ return VM_FAULT_SIGSEGV;
+ shared = true;
+ }
+
if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE))
@@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
else
ret = __handle_mm_fault(vma, address, flags);

+ /*
+ * Release the read lock on shared VMA's parent mm unless
+ * __handle_mm_fault released the lock already.
+ * __handle_mm_fault sets VM_FAULT_RETRY in return value if
+ * it released mmap lock. If lock was released, that implies
+ * the lock would have been released on task's original mm if
+ * this were not a shared PTE vma. To keep lock state consistent,
+ * make sure to release the lock on task's original mm
+ */
+ if (shared) {
+ int release_mmlock = 1;
+
+ if (!(ret & VM_FAULT_RETRY)) {
+ mmap_read_unlock(vma->vm_mm);
+ release_mmlock = 0;
+ } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+ (flags & FAULT_FLAG_RETRY_NOWAIT)) {
+ mmap_read_unlock(vma->vm_mm);
+ release_mmlock = 0;
+ }
+
+ if (release_mmlock)
+ mmap_read_unlock(current->mm);
+ }
+
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
/*
diff --git a/mm/mshare.c b/mm/mshare.c
index cd2f7ad24d9d..d1896adcb00f 100644
--- a/mm/mshare.c
+++ b/mm/mshare.c
@@ -17,18 +17,49 @@
#include <linux/pseudo_fs.h>
#include <linux/fileattr.h>
#include <linux/refcount.h>
+#include <linux/mman.h>
#include <linux/sched/mm.h>
#include <uapi/linux/magic.h>
#include <uapi/linux/limits.h>

struct mshare_data {
- struct mm_struct *mm;
+ struct mm_struct *mm, *host_mm;
mode_t mode;
refcount_t refcnt;
};

static struct super_block *msharefs_sb;

+/* Returns holding the host mm's lock for read. Caller must release. */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
+{
+ struct vm_area_struct *vma, *guest = *vmap;
+ struct mshare_data *info = guest->vm_private_data;
+ struct mm_struct *host_mm = info->mm;
+ unsigned long host_addr;
+ pgd_t *pgd, *guest_pgd;
+
+ host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
+ pgd = pgd_offset(host_mm, host_addr);
+ guest_pgd = pgd_offset(current->mm, *addrp);
+ if (!pgd_same(*guest_pgd, *pgd)) {
+ set_pgd(guest_pgd, *pgd);
+ return VM_FAULT_NOPAGE;
+ }
+
+ *addrp = host_addr;
+ mmap_read_lock(host_mm);
+ vma = find_vma(host_mm, host_addr);
+
+ /* XXX: expand stack? */
+ if (vma && vma->vm_start > host_addr)
+ vma = NULL;
+
+ *vmap = vma;
+ return 0;
+}
+
static void
msharefs_evict_inode(struct inode *inode)
{
@@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
unsigned long, len, int, oflag, mode_t, mode)
{
struct mshare_data *info;
- struct mm_struct *mm;
struct filename *fname = getname(name);
struct dentry *dentry;
struct inode *inode;
struct qstr namestr;
+ struct vm_area_struct *vma, *next, *new_vma;
+ struct mm_struct *new_mm;
+ unsigned long end;
int err = PTR_ERR(fname);

/*
@@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
if (IS_ERR(fname))
goto err_out;

+ end = addr + len;
+
/*
* Does this mshare entry exist already? If it does, calling
* mshare with O_EXCL|O_CREAT is an error
@@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
inode_lock(d_inode(msharefs_sb->s_root));
dentry = d_lookup(msharefs_sb->s_root, &namestr);
if (dentry && (oflag & (O_EXCL|O_CREAT))) {
+ inode = d_inode(dentry);
err = -EEXIST;
dput(dentry);
goto err_unlock_inode;
}

if (dentry) {
+ unsigned long mapaddr, prot = PROT_NONE;
+
inode = d_inode(dentry);
if (inode == NULL) {
+ mmap_write_unlock(current->mm);
err = -EINVAL;
goto err_out;
}
info = inode->i_private;
- refcount_inc(&info->refcnt);
dput(dentry);
+
+ /*
+ * Map in the address range as anonymous mappings
+ */
+ oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
+ if (oflag & O_RDONLY)
+ prot |= PROT_READ;
+ else if (oflag & O_WRONLY)
+ prot |= PROT_WRITE;
+ else if (oflag & O_RDWR)
+ prot |= (PROT_READ | PROT_WRITE);
+ mapaddr = vm_mmap(NULL, addr, len, prot,
+ MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
+ if (IS_ERR((void *)mapaddr)) {
+ err = -EINVAL;
+ goto err_out;
+ }
+
+ refcount_inc(&info->refcnt);
+
+ /*
+ * Now that we have mmap'd the mshare'd range, update vma
+ * flags and vm_mm pointer for this mshare'd range.
+ */
+ mmap_write_lock(current->mm);
+ vma = find_vma(current->mm, addr);
+ if (vma && vma->vm_start < addr) {
+ mmap_write_unlock(current->mm);
+ err = -EINVAL;
+ goto err_out;
+ }
+
+ while (vma && vma->vm_start < (addr + len)) {
+ vma->vm_private_data = info;
+ vma->vm_mm = info->mm;
+ vma->vm_flags |= VM_SHARED_PT;
+ next = vma->vm_next;
+ vma = next;
+ }
} else {
- mm = mm_alloc();
- if (!mm)
+ unsigned long myaddr;
+ struct mm_struct *old_mm;
+
+ old_mm = current->mm;
+ new_mm = mm_alloc();
+ if (!new_mm)
return -ENOMEM;
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) {
err = -ENOMEM;
goto err_relmm;
}
- mm->mmap_base = addr;
- mm->task_size = addr + len;
- if (!mm->task_size)
- mm->task_size--;
- info->mm = mm;
+ new_mm->mmap_base = addr;
+ new_mm->task_size = addr + len;
+ if (!new_mm->task_size)
+ new_mm->task_size--;
+ info->mm = new_mm;
+ info->host_mm = old_mm;
info->mode = mode;
refcount_set(&info->refcnt, 1);
+
+ /*
+ * VMAs for this address range may or may not exist.
+ * If VMAs exist, they should be marked as shared at
+ * this point and page table info should be copied
+ * over to newly created mm_struct. TODO: If VMAs do not
+ * exist, create them and mark them as shared.
+ */
+ mmap_write_lock(old_mm);
+ vma = find_vma_intersection(old_mm, addr, end);
+ if (!vma) {
+ err = -EINVAL;
+ goto unlock;
+ }
+ /*
+ * TODO: If the currently allocated VMA goes beyond the
+ * mshare'd range, this VMA needs to be split.
+ *
+ * Double check that source VMAs do not extend outside
+ * the range
+ */
+ vma = find_vma(old_mm, addr + len);
+ if (vma && vma->vm_start < (addr + len)) {
+ err = -EINVAL;
+ goto unlock;
+ }
+
+ vma = find_vma(old_mm, addr);
+ if (vma && vma->vm_start < addr) {
+ err = -EINVAL;
+ goto unlock;
+ }
+
+ mmap_write_lock(new_mm);
+ while (vma && vma->vm_start < (addr + len)) {
+ /*
+ * Copy this vma over to host mm
+ */
+ vma->vm_private_data = info;
+ vma->vm_mm = new_mm;
+ vma->vm_flags |= VM_SHARED_PT;
+ new_vma = vm_area_dup(vma);
+ if (!new_vma) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+ err = insert_vm_struct(new_mm, new_vma);
+ if (err)
+ goto unlock;
+
+ vma = vma->vm_next;
+ }
+ mmap_write_unlock(new_mm);
+
err = mshare_file_create(fname, oflag, info);
if (err)
- goto err_relinfo;
+ goto unlock;
+
+ /*
+ * Copy over current PTEs
+ */
+ myaddr = addr;
+ while (myaddr < new_mm->task_size) {
+ *pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
+ myaddr += PGDIR_SIZE;
+ }
+ /*
+ * TODO: Free the corresponding page table in calling
+ * process
+ */
}

+ mmap_write_unlock(current->mm);
inode_unlock(d_inode(msharefs_sb->s_root));
putname(fname);
return 0;

-err_relinfo:
+unlock:
+ mmap_write_unlock(current->mm);
kfree(info);
err_relmm:
- mmput(mm);
+ mmput(new_mm);
err_unlock_inode:
inode_unlock(d_inode(msharefs_sb->s_root));
err_out:
@@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)

/*
* Is this the last reference?
+ * TODO: permission checks are needed before proceeding
*/
if (refcount_dec_and_test(&info->refcnt)) {
simple_unlink(d_inode(msharefs_sb->s_root), dentry);
d_drop(dentry);
d_delete(dentry);
+ /*
+ * TODO: Release all physical pages allocated for this
+ * mshare range and release associated page table. If
+ * the final unlink happens from the process that created
+ * mshare'd range, do we return page tables and pages to
+ * that process so the creating process can continue using
+ * the address range it had chosen to mshare at some
+ * point?
+ *
+ * TODO: unmap shared vmas from every task that is using
+ * this mshare'd range.
+ */
mmput(info->mm);
kfree(info);
} else {
--
2.32.0

2022-04-12 23:53:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 4/11/22 09:05, Khalid Aziz wrote:
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64).
> This alignment value can be looked up in /proc/sys/vm//mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
> shared automatically
> - Shared address range can have unmapped addresses in it. Any access
> to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.

What does this mean in practice?

2022-05-30 13:26:00

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>
> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.
>
> Some of the field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> bytes long which is very small compared to the 4K page size. When 2000
> processes map the same page in their address space, each one of them
> requires 8 bytes for its PTE and together that adds up to 8K of memory
> just to hold the PTEs for one 4K page. On a database server with 300GB
> SGA, a system carsh was seen with out-of-memory condition when 1500+
> clients tried to share this SGA even though the system had 512GB of
> memory. On this server, in the worst case scenario of all 1500
> processes mapping every page from SGA would have required 878GB+ for
> just the PTEs. If these PTEs could be shared, amount of memory saved
> is very significant.
>
> This patch series implements a mechanism in kernel to allow userspace
> processes to opt into sharing PTEs. It adds two new system calls - (1)
> mshare(), which can be used by a process to create a region (we will
> call it mshare'd region) which can be used by other processes to map
> same pages using shared PTEs, (2) mshare_unlink() which is used to
> detach from the mshare'd region. Once an mshare'd region is created,
> other process(es), assuming they have the right permissions, can make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs. When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>
>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
> exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64).
> This alignment value can be looked up in /proc/sys/vm//mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
> shared automatically
> - Shared address range can have unmapped addresses in it. Any access
> to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> if (addr == MAP_FAILED)
> perror("ERROR: mmap failed");
>
> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> if (err < 0) {
> perror("mshare() syscall failed");
> exit(1);
> }
>
> strncpy(addr, "Some random shared text",
> sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
> struct mshare_info minfo;
>
> fd = open("testregion", O_RDONLY);
> if (fd < 0) {
> perror("open failed");
> exit(1);
> }
>
> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
> printf("INFO: %ld bytes shared at addr 0x%lx \n",
> minfo.size, minfo.start);
> else
> perror("read failed");
>
> close(fd);
>
> addr = (void *)minfo.start;
> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
> O_RDWR, 600);
> if (err < 0) {
> perror("mshare() syscall failed");
> exit(1);
> }
>
> printf("Guest mmap at %px:\n", addr);
> printf("%s\n", addr);
> printf("\nDone\n");
>
> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> if (err < 0) {
> perror("mshare_unlink() failed");
> exit(1);
> }
> -----------------


Does that mean those shared pages will get page_mapcount()=1 ?

A big pain for a memory-limited system like a desktop/embedded system is
that reverse mapping will take tons of CPU in the memory reclamation path,
especially for those pages mapped by multiple processes. Sometimes we see
100% CPU utilization on the LRU, scanning to find out if a page has been
accessed by reading the PTE young bit.

If this patchset leaves us with only one PTE per shared page, it means we
get a significant performance improvement in the kernel LRU, particularly
when free memory approaches the watermarks.
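
To put rough numbers on it: with 1000 processes mapping the same page, one
pass of the reclaim scanner today has to visit up to 1000 separate PTEs just
to test the accessed bit for that single page; with a shared PTE table the
same information comes from a single PTE, so the rmap work per page drops
from one read per sharer to one read in total.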

But I don't see how a new system call like mshare() can be used by those
systems, as they might need a more automatic PTE-sharing mechanism.

BTW, I suppose we are actually able to share PTEs as long as the address
is 2MB aligned?

>
>
> Patch series
> ============
>
> This series of patches is an initial implementation of these two
> system calls. This code implements working basic functionality.
>
> Prototype for the two syscalls is:
>
> SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
> unsigned long, len, int, oflag, mode_t, mode)
>
> SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
> In order to facilitate page table sharing, this implementation adds a
> new in-memory filesystem - msharefs which will be mounted at
> /sys/fs/mshare. When a new mshare'd region is created, a file with
> the name given by initial mshare() call is created under this
> filesystem. Permissions for this file are given by the "mode"
> parameter to mshare(). The only operation supported on this file is
> read. A read from this file returns a structure containing
> information about mshare'd region - (1) starting virtual address for
> the region, and (2) size of mshare'd region.
>
> A donor process that wants to create an mshare'd region from a
> section of its mapped addresses calls mshare() with O_CREAT oflag.
> mshare() syscall then creates a new mm_struct which will host the
> page tables for the mshare'd region. vma->vm_private_data for the
> vmas covering address range for this region are updated to point to
> a structure containing pointer to this new mm_struct. Existing page
> tables are copied over to new mm struct.
>
> A consumer process that wants to map mshare'd region opens
> /sys/fs/mshare/<filename> and reads the starting address and size of
> mshare'd region. It then calls mshare() with these values to map the
> entire region in its address space. Consumer process calls
> mshare_unlink() to terminate its access.
>
>
> Since RFC
> =========
>
> This patch series includes better error handling and more robust
> locking besides improved implementation of mshare since the original
> RFC. It also incorporates feedback from original RFC. Alignment and
> size requirements are PGDIR_SIZE, same as the RFC, and this is open to
> change based upon further feedback. More review is needed for this
> patch series and is much appreciated.
>
>
>
> Khalid Aziz (14):
> mm: Add new system calls mshare, mshare_unlink
> mm: Add msharefs filesystem
> mm: Add read for msharefs
> mm: implement mshare_unlink syscall
> mm: Add locking to msharefs syscalls
> mm/msharefs: Check for mounted filesystem
> mm: Add vm flag for shared PTE
> mm/mshare: Add basic page table sharing using mshare
> mm: Do not free PTEs for mshare'd PTEs
> mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
> range
> mm/mshare: unmap vmas in mshare_unlink
> mm/mshare: Add a proc file with mshare alignment/size information
> mm/mshare: Enforce mshare'd region permissions
> mm/mshare: Copy PTEs to host mm
>
> Documentation/filesystems/msharefs.rst | 19 +
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> include/linux/mm.h | 11 +
> include/trace/events/mmflags.h | 3 +-
> include/uapi/asm-generic/unistd.h | 7 +-
> include/uapi/linux/magic.h | 1 +
> include/uapi/linux/mman.h | 5 +
> kernel/sysctl.c | 7 +
> mm/Makefile | 2 +-
> mm/internal.h | 7 +
> mm/memory.c | 105 ++++-
> mm/mshare.c | 587 +++++++++++++++++++++++++
> 12 files changed, 750 insertions(+), 6 deletions(-)
> create mode 100644 Documentation/filesystems/msharefs.rst
> create mode 100644 mm/mshare.c
>
> --
> 2.32.0
>

Thanks
Barry

2022-05-30 15:48:52

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 30.05.22 12:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs. When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>> exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>> This alignment value can be looked up in /proc/sys/vm//mshare_size
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>> shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>> to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>> if (addr == MAP_FAILED)
>> perror("ERROR: mmap failed");
>>
>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>> if (err < 0) {
>> perror("mshare() syscall failed");
>> exit(1);
>> }
>>
>> strncpy(addr, "Some random shared text",
>> sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>> struct mshare_info minfo;
>>
>> fd = open("testregion", O_RDONLY);
>> if (fd < 0) {
>> perror("open failed");
>> exit(1);
>> }
>>
>> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>> printf("INFO: %ld bytes shared at addr 0x%lx \n",
>> minfo.size, minfo.start);
>> else
>> perror("read failed");
>>
>> close(fd);
>>
>> addr = (void *)minfo.start;
>> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>> O_RDWR, 600);
>> if (err < 0) {
>> perror("mshare() syscall failed");
>> exit(1);
>> }
>>
>> printf("Guest mmap at %px:\n", addr);
>> printf("%s\n", addr);
>> printf("\nDone\n");
>>
>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>> if (err < 0) {
>> perror("mshare_unlink() failed");
>> exit(1);
>> }
>> -----------------
>
>
> Does that mean those shared pages will get page_mapcount()=1 ?

AFAIU, for mshare() that is the case.

>
> A big pain for a memory-limited system like a desktop/embedded system is
> that reverse mapping will take tons of CPU in the memory reclamation path,
> especially for those pages mapped by multiple processes. Sometimes we see
> 100% CPU utilization on the LRU, scanning to find out if a page has been
> accessed by reading the PTE young bit.

Regarding PTE-table sharing:

Even if we'd account each logical mapping (independent of page table
sharing) in the page_mapcount(), we would benefit from page table
sharing. Simply when we unmap the page from the shared page table, we'd
have to adjust the mapcount accordingly. So unmapping from a single
(shared) pagetable could directly result in the mapcount dropping to zero.

What I am trying to say is: how the mapcount is handled might be an
implementation detail for PTE-sharing. Not sure how hugetlb handles that
with its PMD-table sharing.

We'd have to clarify what the mapcount actually expresses. Having the
mapcount express "is this page mapped by multiple processes or at
multiple VMAs" might be helpful in some cases. Right now it mostly
expresses exactly that.

>
> If this patchset leaves us with only one PTE per shared page, it means we
> get a significant performance improvement in the kernel LRU, particularly
> when free memory approaches the watermarks.
>
> But I don't see how a new system call like mshare() can be used by those
> systems, as they might need a more automatic PTE-sharing mechanism.

IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
mappings, similar to what hugetlb provides for PMD table sharing, which
leaves semantics unchanged for existing user space. Maybe there is a way
to factor that out and reuse it for PTE-table sharing.

I can understand that there are use cases for explicit sharing with new
(e.g., mprotect) semantics.

>
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?

2MB is x86 specific, but yes. As long as we're aligned to PMDs and
* the VMA spans the complete PMD
* the VMA permissions match the page table
* no process-specific page table features are used (uffd, softdirty
tracking)

... probably there are more corner cases. (e.g., referenced and dirty
bit don't reflect what the mapping process did, which might or might not
be relevant for current/future features)
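
A rough sketch of what such an eligibility test could look like (illustrative
only; the helper name below is made up, and hugetlb applies a broadly similar
test before sharing PMD tables):

-----------------
#include <linux/mm.h>
#include <linux/userfaultfd_k.h>

/*
 * Hypothetical helper, not from this series: could this VMA share the
 * PTE table covering the PMD-sized range around @addr?
 */
static bool pte_table_shareable(struct vm_area_struct *vma,
				unsigned long addr)
{
	unsigned long start = addr & PMD_MASK;
	unsigned long end = start + PMD_SIZE;

	/* The VMA must span the whole PMD-sized range... */
	if (vma->vm_start > start || vma->vm_end < end)
		return false;

	/* ...be a shared mapping, so permissions match for every sharer... */
	if (!(vma->vm_flags & VM_MAYSHARE))
		return false;

	/* ...and not rely on per-process page table state. */
	if (userfaultfd_armed(vma) || (vma->vm_flags & VM_SOFTDIRTY))
		return false;

	return true;
}
-----------------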

--
Thanks,

David / dhildenb


2022-05-30 16:24:57

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On Mon, May 30, 2022 at 11:18 PM David Hildenbrand <[email protected]> wrote:
>
> On 30.05.22 12:48, Barry Song wrote:
> > On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
> >>
> >> Page tables in kernel consume some of the memory and as long as number
> >> of mappings being maintained is small enough, this space consumed by
> >> page tables is not objectionable. When very few memory pages are
> >> shared between processes, the number of page table entries (PTEs) to
> >> maintain is mostly constrained by the number of pages of memory on the
> >> system. As the number of shared pages and the number of times pages
> >> are shared goes up, amount of memory consumed by page tables starts to
> >> become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared across
> >> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> >> bytes long which is very small compared to the 4K page size. When 2000
> >> processes map the same page in their address space, each one of them
> >> requires 8 bytes for its PTE and together that adds up to 8K of memory
> >> just to hold the PTEs for one 4K page. On a database server with 300GB
> >> SGA, a system carsh was seen with out-of-memory condition when 1500+
> >> clients tried to share this SGA even though the system had 512GB of
> >> memory. On this server, in the worst case scenario of all 1500
> >> processes mapping every page from SGA would have required 878GB+ for
> >> just the PTEs. If these PTEs could be shared, amount of memory saved
> >> is very significant.
> >>
> >> This patch series implements a mechanism in kernel to allow userspace
> >> processes to opt into sharing PTEs. It adds two new system calls - (1)
> >> mshare(), which can be used by a process to create a region (we will
> >> call it mshare'd region) which can be used by other processes to map
> >> same pages using shared PTEs, (2) mshare_unlink() which is used to
> >> detach from the mshare'd region. Once an mshare'd region is created,
> >> other process(es), assuming they have the right permissions, can make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs. When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API
> >> ===
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >> exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >>
> >> PTEs are shared at pgdir level and hence it imposes following
> >> requirements on the address and size given to the mshare():
> >>
> >> - Starting address must be aligned to pgdir size (512GB on x86_64).
> >> This alignment value can be looked up in /proc/sys/vm//mshare_size
> >> - Size must be a multiple of pgdir size
> >> - Any mappings created in this address range at any time become
> >> shared automatically
> >> - Shared address range can have unmapped addresses in it. Any access
> >> to unmapped address will result in SIGBUS
> >>
> >> Mappings within this address range behave as if they were shared
> >> between threads, so a write to a MAP_PRIVATE mapping will create a
> >> page which is shared between all the sharers. The first process that
> >> declares an address range mshare'd can continue to map objects in
> >> the shared area. All other processes that want mshare'd access to
> >> this memory area can do so by calling mshare(). After this call, the
> >> address range given by mshare becomes a shared range in its address
> >> space. Anonymous mappings will be shared and not COWed.
> >>
> >> A file under /sys/fs/mshare can be opened and read from. A read from
> >> this file returns two long values - (1) starting address, and (2)
> >> size of the mshare'd region.
> >>
> >> --
> >> int mshare_unlink(char *name)
> >>
> >> A shared address range created by mshare() can be destroyed using
> >> mshare_unlink() which removes the shared named object. Once all
> >> processes have unmapped the shared object, the shared address range
> >> references are de-allocated and destroyed.
> >>
> >> mshare_unlink() returns 0 on success or -1 on error.
> >>
> >>
> >> Example Code
> >> ============
> >>
> >> Snippet of the code that a donor process would run looks like below:
> >>
> >> -----------------
> >> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >> if (addr == MAP_FAILED)
> >> perror("ERROR: mmap failed");
> >>
> >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
> >> if (err < 0) {
> >> perror("mshare() syscall failed");
> >> exit(1);
> >> }
> >>
> >> strncpy(addr, "Some random shared text",
> >> sizeof("Some random shared text"));
> >> -----------------
> >>
> >> Snippet of code that a consumer process would execute looks like:
> >>
> >> -----------------
> >> struct mshare_info minfo;
> >>
> >> fd = open("testregion", O_RDONLY);
> >> if (fd < 0) {
> >> perror("open failed");
> >> exit(1);
> >> }
> >>
> >> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
> >> printf("INFO: %ld bytes shared at addr 0x%lx \n",
> >> minfo.size, minfo.start);
> >> else
> >> perror("read failed");
> >>
> >> close(fd);
> >>
> >> addr = (void *)minfo.start;
> >> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
> >> O_RDWR, 600);
> >> if (err < 0) {
> >> perror("mshare() syscall failed");
> >> exit(1);
> >> }
> >>
> >> printf("Guest mmap at %px:\n", addr);
> >> printf("%s\n", addr);
> >> printf("\nDone\n");
> >>
> >> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >> if (err < 0) {
> >> perror("mshare_unlink() failed");
> >> exit(1);
> >> }
> >> -----------------
> >
> >
> > Does that mean those shared pages will get page_mapcount()=1 ?
>
> AFAIU, for mshare() that is the case.
>
> >
> > A big pain for a memory-limited system like a desktop/embedded system is
> > that reverse mapping will take tons of CPU in the memory reclamation path,
> > especially for those pages mapped by multiple processes. Sometimes we see
> > 100% CPU utilization on the LRU, scanning to find out if a page has been
> > accessed by reading the PTE young bit.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, I guess the mapcount as a number in itself isn't so important. The
only important thing is that we only need to read one PTE rather
than 1000 PTEs to figure out if one page is young.

So has this patchset already made it possible to do the reverse mapping only
once for a page shared by 1000 processes, rather than reading
1000 PTEs?

I mean, I actually haven't found that you touched any files in
mm/vmscan.c etc.

>
> >
> > If this patchset leaves us with only one PTE per shared page, it means we
> > get a significant performance improvement in the kernel LRU, particularly
> > when free memory approaches the watermarks.
> >
> > But I don't see how a new system call like mshare() can be used by those
> > systems, as they might need a more automatic PTE-sharing mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.
>
> >
> > BTW, I suppose we are actually able to share PTEs as long as the address
> > is 2MB aligned?
>
> 2MB is x86 specific, but yes. As long as we're aligned to PMDs and
> * the VMA spans the complete PMD
> * the VMA permissions match the page table
> * no process-specific page table features are used (uffd, softdirty
> tracking)
>
> ... probably there are more corner cases. (e.g., referenced and dirty
> bit don't reflect what the mapping process did, which might or might not
> be relevant for current/future features)
>
> --
> Thanks,
>
> David / dhildenb
>

Thanks
Barry

2022-05-31 17:05:48

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>
> This patch adds basic page table sharing across tasks by making
> mshare syscall. It does this by creating a new mm_struct which
> hosts the shared vmas and page tables. This mm_struct is
> maintained as long as there is at least one task using the mshare'd
> range. It is cleaned up by the last mshare_unlink syscall.
>
> Signed-off-by: Khalid Aziz <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> mm/internal.h | 2 +
> mm/memory.c | 35 ++++++++++
> mm/mshare.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 214 insertions(+), 13 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index cf50a471384e..68f82f0f8b66 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> unsigned long addr, int page_nid, int *flags);
>
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> + unsigned long *addrp);
> static inline bool vma_is_shared(const struct vm_area_struct *vma)
> {
> return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..c77c0d643ea8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> unsigned int flags, struct pt_regs *regs)
> {
> vm_fault_t ret;
> + bool shared = false;
>
> __set_current_state(TASK_RUNNING);
>
> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> /* do counter updates before entering really critical section. */
> check_sync_rss_stat(current);
>
> + if (unlikely(vma_is_shared(vma))) {
> + ret = find_shared_vma(&vma, &address);
> + if (ret)
> + return ret;
> + if (!vma)
> + return VM_FAULT_SIGSEGV;
> + shared = true;
> + }
> +
> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> flags & FAULT_FLAG_INSTRUCTION,
> flags & FAULT_FLAG_REMOTE))
> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> else
> ret = __handle_mm_fault(vma, address, flags);
>
> + /*
> + * Release the read lock on shared VMA's parent mm unless
> + * __handle_mm_fault released the lock already.
> + * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> + * it released mmap lock. If lock was released, that implies
> + * the lock would have been released on task's original mm if
> + * this were not a shared PTE vma. To keep lock state consistent,
> + * make sure to release the lock on task's original mm
> + */
> + if (shared) {
> + int release_mmlock = 1;
> +
> + if (!(ret & VM_FAULT_RETRY)) {
> + mmap_read_unlock(vma->vm_mm);
> + release_mmlock = 0;
> + } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> + (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> + mmap_read_unlock(vma->vm_mm);
> + release_mmlock = 0;
> + }
> +
> + if (release_mmlock)
> + mmap_read_unlock(current->mm);
> + }
> +
> if (flags & FAULT_FLAG_USER) {
> mem_cgroup_exit_user_fault();
> /*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index cd2f7ad24d9d..d1896adcb00f 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -17,18 +17,49 @@
> #include <linux/pseudo_fs.h>
> #include <linux/fileattr.h>
> #include <linux/refcount.h>
> +#include <linux/mman.h>
> #include <linux/sched/mm.h>
> #include <uapi/linux/magic.h>
> #include <uapi/linux/limits.h>
>
> struct mshare_data {
> - struct mm_struct *mm;
> + struct mm_struct *mm, *host_mm;
> mode_t mode;
> refcount_t refcnt;
> };
>
> static struct super_block *msharefs_sb;
>
> +/* Returns holding the host mm's lock for read. Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> + struct vm_area_struct *vma, *guest = *vmap;
> + struct mshare_data *info = guest->vm_private_data;
> + struct mm_struct *host_mm = info->mm;
> + unsigned long host_addr;
> + pgd_t *pgd, *guest_pgd;
> +
> + host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> + pgd = pgd_offset(host_mm, host_addr);
> + guest_pgd = pgd_offset(current->mm, *addrp);
> + if (!pgd_same(*guest_pgd, *pgd)) {
> + set_pgd(guest_pgd, *pgd);
> + return VM_FAULT_NOPAGE;
> + }
> +
> + *addrp = host_addr;
> + mmap_read_lock(host_mm);
> + vma = find_vma(host_mm, host_addr);
> +
> + /* XXX: expand stack? */
> + if (vma && vma->vm_start > host_addr)
> + vma = NULL;
> +
> + *vmap = vma;
> + return 0;
> +}
> +
> static void
> msharefs_evict_inode(struct inode *inode)
> {
> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> unsigned long, len, int, oflag, mode_t, mode)
> {
> struct mshare_data *info;
> - struct mm_struct *mm;
> struct filename *fname = getname(name);
> struct dentry *dentry;
> struct inode *inode;
> struct qstr namestr;
> + struct vm_area_struct *vma, *next, *new_vma;
> + struct mm_struct *new_mm;
> + unsigned long end;
> int err = PTR_ERR(fname);
>
> /*
> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> if (IS_ERR(fname))
> goto err_out;
>
> + end = addr + len;
> +
> /*
> * Does this mshare entry exist already? If it does, calling
> * mshare with O_EXCL|O_CREAT is an error
> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> inode_lock(d_inode(msharefs_sb->s_root));
> dentry = d_lookup(msharefs_sb->s_root, &namestr);
> if (dentry && (oflag & (O_EXCL|O_CREAT))) {
> + inode = d_inode(dentry);
> err = -EEXIST;
> dput(dentry);
> goto err_unlock_inode;
> }
>
> if (dentry) {
> + unsigned long mapaddr, prot = PROT_NONE;
> +
> inode = d_inode(dentry);
> if (inode == NULL) {
> + mmap_write_unlock(current->mm);
> err = -EINVAL;
> goto err_out;
> }
> info = inode->i_private;
> - refcount_inc(&info->refcnt);
> dput(dentry);
> +
> + /*
> + * Map in the address range as anonymous mappings
> + */
> + oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
> + if (oflag & O_RDONLY)
> + prot |= PROT_READ;
> + else if (oflag & O_WRONLY)
> + prot |= PROT_WRITE;
> + else if (oflag & O_RDWR)
> + prot |= (PROT_READ | PROT_WRITE);
> + mapaddr = vm_mmap(NULL, addr, len, prot,
> + MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
> + if (IS_ERR((void *)mapaddr)) {
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + refcount_inc(&info->refcnt);
> +
> + /*
> + * Now that we have mmap'd the mshare'd range, update vma
> + * flags and vm_mm pointer for this mshare'd range.
> + */
> + mmap_write_lock(current->mm);
> + vma = find_vma(current->mm, addr);
> + if (vma && vma->vm_start < addr) {

And I don't understand why "vma->vm_start < addr" can happen.
Does it mean we have given mshare() an address which is not
aligned? Then we should have returned -EINVAL earlier?



> + mmap_write_unlock(current->mm);
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + while (vma && vma->vm_start < (addr + len)) {
> + vma->vm_private_data = info;
> + vma->vm_mm = info->mm;
> + vma->vm_flags |= VM_SHARED_PT;
> + next = vma->vm_next;
> + vma = next;
> + }
> } else {
> - mm = mm_alloc();
> - if (!mm)
> + unsigned long myaddr;
> + struct mm_struct *old_mm;
> +
> + old_mm = current->mm;
> + new_mm = mm_alloc();
> + if (!new_mm)
> return -ENOMEM;
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info) {
> err = -ENOMEM;
> goto err_relmm;
> }
> - mm->mmap_base = addr;
> - mm->task_size = addr + len;
> - if (!mm->task_size)
> - mm->task_size--;
> - info->mm = mm;
> + new_mm->mmap_base = addr;
> + new_mm->task_size = addr + len;
> + if (!new_mm->task_size)
> + new_mm->task_size--;
> + info->mm = new_mm;
> + info->host_mm = old_mm;
> info->mode = mode;
> refcount_set(&info->refcnt, 1);
> +
> + /*
> + * VMAs for this address range may or may not exist.
> + * If VMAs exist, they should be marked as shared at
> + * this point and page table info should be copied
> + * over to newly created mm_struct. TODO: If VMAs do not
> + * exist, create them and mark them as shared.
> + */
> + mmap_write_lock(old_mm);
> + vma = find_vma_intersection(old_mm, addr, end);
> + if (!vma) {
> + err = -EINVAL;
> + goto unlock;
> + }
> + /*
> + * TODO: If the currently allocated VMA goes beyond the
> + * mshare'd range, this VMA needs to be split.
> + *
> + * Double check that source VMAs do not extend outside
> + * the range
> + */
> + vma = find_vma(old_mm, addr + len);
> + if (vma && vma->vm_start < (addr + len)) {
> + err = -EINVAL;
> + goto unlock;
> + }
> +
> + vma = find_vma(old_mm, addr);
> + if (vma && vma->vm_start < addr) {
> + err = -EINVAL;
> + goto unlock;
> + }
> +
> + mmap_write_lock(new_mm);
> + while (vma && vma->vm_start < (addr + len)) {
> + /*
> + * Copy this vma over to host mm
> + */
> + vma->vm_private_data = info;
> + vma->vm_mm = new_mm;
> + vma->vm_flags |= VM_SHARED_PT;
> + new_vma = vm_area_dup(vma);
> + if (!new_vma) {
> + err = -ENOMEM;
> + goto unlock;
> + }
> + err = insert_vm_struct(new_mm, new_vma);
> + if (err)
> + goto unlock;
> +
> + vma = vma->vm_next;
> + }
> + mmap_write_unlock(new_mm);
> +
> err = mshare_file_create(fname, oflag, info);
> if (err)
> - goto err_relinfo;
> + goto unlock;
> +
> + /*
> + * Copy over current PTEs
> + */
> + myaddr = addr;
> + while (myaddr < new_mm->task_size) {
> + *pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
> + myaddr += PGDIR_SIZE;
> + }
> + /*
> + * TODO: Free the corresponding page table in calling
> + * process
> + */
> }
>
> + mmap_write_unlock(current->mm);
> inode_unlock(d_inode(msharefs_sb->s_root));
> putname(fname);
> return 0;
>
> -err_relinfo:
> +unlock:
> + mmap_write_unlock(current->mm);
> kfree(info);
> err_relmm:
> - mmput(mm);
> + mmput(new_mm);
> err_unlock_inode:
> inode_unlock(d_inode(msharefs_sb->s_root));
> err_out:
> @@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
> /*
> * Is this the last reference?
> + * TODO: permission checks are needed before proceeding
> */
> if (refcount_dec_and_test(&info->refcnt)) {
> simple_unlink(d_inode(msharefs_sb->s_root), dentry);
> d_drop(dentry);
> d_delete(dentry);
> + /*
> + * TODO: Release all physical pages allocated for this
> + * mshare range and release associated page table. If
> + * the final unlink happens from the process that created
> + * mshare'd range, do we return page tables and pages to
> + * that process so the creating process can continue using
> + * the address range it had chosen to mshare at some
> + * point?
> + *
> + * TODO: unmap shared vmas from every task that is using
> + * this mshare'd range.
> + */
> mmput(info->mm);
> kfree(info);
> } else {
> --
> 2.32.0
>

Thanks
Barry

2022-06-01 07:07:53

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>
> mshare'd PTEs should not be removed when a task exits. These PTEs
> are removed when the last task sharing the PTEs exits. Add a check
> for shared PTEs and skip them.
>
> Signed-off-by: Khalid Aziz <[email protected]>
> ---
> mm/memory.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c77c0d643ea8..e7c5bc6f8836 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> } else {
> /*
> * Optimization: gather nearby vmas into one call down
> + * as long as they all belong to the same mm (that
> + * may not be the case if a vma is part of mshare'd
> + * range
> */
> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> - && !is_vm_hugetlb_page(next)) {
> + && !is_vm_hugetlb_page(next)
> + && vma->vm_mm == tlb->mm) {
> vma = next;
> next = vma->vm_next;
> unlink_anon_vmas(vma);
> unlink_file_vma(vma);
> }
> - free_pgd_range(tlb, addr, vma->vm_end,
> - floor, next ? next->vm_start : ceiling);
> + /*
> + * Free pgd only if pgd is not allocated for an
> + * mshare'd range
> + */
> + if (vma->vm_mm == tlb->mm)
> + free_pgd_range(tlb, addr, vma->vm_end,
> + floor, next ? next->vm_start : ceiling);
> }
> vma = next;
> }
> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
> pgd_t *pgd;
> unsigned long next;
>
> + /*
> + * If this is an mshare'd page, do not unmap it since it might
> + * still be in use.
> + */
> + if (vma->vm_mm != tlb->mm)
> + return;
> +

Except for unmap, have you ever tested reverse mapping in vmscan, especially
folio_referenced()? Are all the VMAs in the processes sharing the page table
still in the rmap of the shared page?
Without shared PTEs, if 1000 processes share one page, we are reading 1000
PTEs; with them, are we reading just one, or are we reading the same PTE
1000 times? Have you tested it?

> BUG_ON(addr >= end);
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> --
> 2.32.0
>

Thanks
Barry

2022-06-01 21:20:02

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare

On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>
> This patch adds basic page table sharing across tasks by making
> mshare syscall. It does this by creating a new mm_struct which
> hosts the shared vmas and page tables. This mm_struct is
> maintained as long as there is at least one task using the mshare'd
> range. It is cleaned up by the last mshare_unlink syscall.
>
> Signed-off-by: Khalid Aziz <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> mm/internal.h | 2 +
> mm/memory.c | 35 ++++++++++
> mm/mshare.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 214 insertions(+), 13 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index cf50a471384e..68f82f0f8b66 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> unsigned long addr, int page_nid, int *flags);
>
> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
> + unsigned long *addrp);
> static inline bool vma_is_shared(const struct vm_area_struct *vma)
> {
> return vma->vm_flags & VM_SHARED_PT;
> diff --git a/mm/memory.c b/mm/memory.c
> index c125c4969913..c77c0d643ea8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> unsigned int flags, struct pt_regs *regs)
> {
> vm_fault_t ret;
> + bool shared = false;
>
> __set_current_state(TASK_RUNNING);
>
> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> /* do counter updates before entering really critical section. */
> check_sync_rss_stat(current);
>
> + if (unlikely(vma_is_shared(vma))) {
> + ret = find_shared_vma(&vma, &address);
> + if (ret)
> + return ret;
> + if (!vma)
> + return VM_FAULT_SIGSEGV;
> + shared = true;
> + }
> +
> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> flags & FAULT_FLAG_INSTRUCTION,
> flags & FAULT_FLAG_REMOTE))
> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> else
> ret = __handle_mm_fault(vma, address, flags);
>
> + /*
> + * Release the read lock on shared VMA's parent mm unless
> + * __handle_mm_fault released the lock already.
> + * __handle_mm_fault sets VM_FAULT_RETRY in return value if
> + * it released mmap lock. If lock was released, that implies
> + * the lock would have been released on task's original mm if
> + * this were not a shared PTE vma. To keep lock state consistent,
> + * make sure to release the lock on task's original mm
> + */
> + if (shared) {
> + int release_mmlock = 1;
> +
> + if (!(ret & VM_FAULT_RETRY)) {
> + mmap_read_unlock(vma->vm_mm);
> + release_mmlock = 0;
> + } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
> + (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> + mmap_read_unlock(vma->vm_mm);
> + release_mmlock = 0;
> + }
> +
> + if (release_mmlock)
> + mmap_read_unlock(current->mm);
> + }
> +
> if (flags & FAULT_FLAG_USER) {
> mem_cgroup_exit_user_fault();
> /*
> diff --git a/mm/mshare.c b/mm/mshare.c
> index cd2f7ad24d9d..d1896adcb00f 100644
> --- a/mm/mshare.c
> +++ b/mm/mshare.c
> @@ -17,18 +17,49 @@
> #include <linux/pseudo_fs.h>
> #include <linux/fileattr.h>
> #include <linux/refcount.h>
> +#include <linux/mman.h>
> #include <linux/sched/mm.h>
> #include <uapi/linux/magic.h>
> #include <uapi/linux/limits.h>
>
> struct mshare_data {
> - struct mm_struct *mm;
> + struct mm_struct *mm, *host_mm;
> mode_t mode;
> refcount_t refcnt;
> };
>
> static struct super_block *msharefs_sb;
>
> +/* Returns holding the host mm's lock for read. Caller must release. */
> +vm_fault_t
> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
> +{
> + struct vm_area_struct *vma, *guest = *vmap;
> + struct mshare_data *info = guest->vm_private_data;
> + struct mm_struct *host_mm = info->mm;
> + unsigned long host_addr;
> + pgd_t *pgd, *guest_pgd;
> +
> + host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
> + pgd = pgd_offset(host_mm, host_addr);
> + guest_pgd = pgd_offset(current->mm, *addrp);
> + if (!pgd_same(*guest_pgd, *pgd)) {
> + set_pgd(guest_pgd, *pgd);
> + return VM_FAULT_NOPAGE;
> + }
> +
> + *addrp = host_addr;
> + mmap_read_lock(host_mm);
> + vma = find_vma(host_mm, host_addr);
> +
> + /* XXX: expand stack? */
> + if (vma && vma->vm_start > host_addr)
> + vma = NULL;
> +
> + *vmap = vma;
> + return 0;
> +}
> +
> static void
> msharefs_evict_inode(struct inode *inode)
> {
> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> unsigned long, len, int, oflag, mode_t, mode)
> {
> struct mshare_data *info;
> - struct mm_struct *mm;
> struct filename *fname = getname(name);
> struct dentry *dentry;
> struct inode *inode;
> struct qstr namestr;
> + struct vm_area_struct *vma, *next, *new_vma;
> + struct mm_struct *new_mm;
> + unsigned long end;
> int err = PTR_ERR(fname);
>
> /*
> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> if (IS_ERR(fname))
> goto err_out;
>
> + end = addr + len;
> +
> /*
> * Does this mshare entry exist already? If it does, calling
> * mshare with O_EXCL|O_CREAT is an error
> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
> inode_lock(d_inode(msharefs_sb->s_root));
> dentry = d_lookup(msharefs_sb->s_root, &namestr);
> if (dentry && (oflag & (O_EXCL|O_CREAT))) {
> + inode = d_inode(dentry);
> err = -EEXIST;
> dput(dentry);
> goto err_unlock_inode;
> }
>
> if (dentry) {
> + unsigned long mapaddr, prot = PROT_NONE;
> +
> inode = d_inode(dentry);
> if (inode == NULL) {
> + mmap_write_unlock(current->mm);
> err = -EINVAL;
> goto err_out;
> }
> info = inode->i_private;
> - refcount_inc(&info->refcnt);
> dput(dentry);
> +
> + /*
> + * Map in the address range as anonymous mappings
> + */
> + oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
> + if (oflag & O_RDONLY)
> + prot |= PROT_READ;
> + else if (oflag & O_WRONLY)
> + prot |= PROT_WRITE;
> + else if (oflag & O_RDWR)
> + prot |= (PROT_READ | PROT_WRITE);
> + mapaddr = vm_mmap(NULL, addr, len, prot,
> + MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);

From the perspective of the hardware, do we have to use MAP_FIXED to make
sure the processes sharing PTEs use the same virtual address for the
shared area? Or do we not necessarily need it, as long as the
upper-level page table entries point to the same lower-level page table?


> + if (IS_ERR((void *)mapaddr)) {
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + refcount_inc(&info->refcnt);
> +
> + /*
> + * Now that we have mmap'd the mshare'd range, update vma
> + * flags and vm_mm pointer for this mshare'd range.
> + */
> + mmap_write_lock(current->mm);
> + vma = find_vma(current->mm, addr);
> + if (vma && vma->vm_start < addr) {
> + mmap_write_unlock(current->mm);
> + err = -EINVAL;
> + goto err_out;
> + }
> +
> + while (vma && vma->vm_start < (addr + len)) {
> + vma->vm_private_data = info;
> + vma->vm_mm = info->mm;
> + vma->vm_flags |= VM_SHARED_PT;
> + next = vma->vm_next;
> + vma = next;
> + }
> } else {
> - mm = mm_alloc();
> - if (!mm)
> + unsigned long myaddr;
> + struct mm_struct *old_mm;
> +
> + old_mm = current->mm;
> + new_mm = mm_alloc();
> + if (!new_mm)
> return -ENOMEM;
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info) {
> err = -ENOMEM;
> goto err_relmm;
> }
> - mm->mmap_base = addr;
> - mm->task_size = addr + len;
> - if (!mm->task_size)
> - mm->task_size--;
> - info->mm = mm;
> + new_mm->mmap_base = addr;
> + new_mm->task_size = addr + len;
> + if (!new_mm->task_size)
> + new_mm->task_size--;
> + info->mm = new_mm;
> + info->host_mm = old_mm;
> info->mode = mode;
> refcount_set(&info->refcnt, 1);
> +
> + /*
> + * VMAs for this address range may or may not exist.
> + * If VMAs exist, they should be marked as shared at
> + * this point and page table info should be copied
> + * over to newly created mm_struct. TODO: If VMAs do not
> + * exist, create them and mark them as shared.
> + */
> + mmap_write_lock(old_mm);
> + vma = find_vma_intersection(old_mm, addr, end);
> + if (!vma) {
> + err = -EINVAL;
> + goto unlock;
> + }
> + /*
> + * TODO: If the currently allocated VMA goes beyond the
> + * mshare'd range, this VMA needs to be split.
> + *
> + * Double check that source VMAs do not extend outside
> + * the range
> + */
> + vma = find_vma(old_mm, addr + len);
> + if (vma && vma->vm_start < (addr + len)) {
> + err = -EINVAL;
> + goto unlock;
> + }
> +
> + vma = find_vma(old_mm, addr);
> + if (vma && vma->vm_start < addr) {
> + err = -EINVAL;
> + goto unlock;
> + }
> +
> + mmap_write_lock(new_mm);
> + while (vma && vma->vm_start < (addr + len)) {
> + /*
> + * Copy this vma over to host mm
> + */
> + vma->vm_private_data = info;
> + vma->vm_mm = new_mm;
> + vma->vm_flags |= VM_SHARED_PT;
> + new_vma = vm_area_dup(vma);
> + if (!new_vma) {
> + err = -ENOMEM;
> + goto unlock;
> + }
> + err = insert_vm_struct(new_mm, new_vma);
> + if (err)
> + goto unlock;
> +
> + vma = vma->vm_next;
> + }
> + mmap_write_unlock(new_mm);
> +
> err = mshare_file_create(fname, oflag, info);
> if (err)
> - goto err_relinfo;
> + goto unlock;
> +
> + /*
> + * Copy over current PTEs
> + */
> + myaddr = addr;
> + while (myaddr < new_mm->task_size) {
> + *pgd_offset(new_mm, myaddr) = *pgd_offset(old_mm, myaddr);
> + myaddr += PGDIR_SIZE;
> + }
> + /*
> + * TODO: Free the corresponding page table in calling
> + * process
> + */
> }
>
> + mmap_write_unlock(current->mm);
> inode_unlock(d_inode(msharefs_sb->s_root));
> putname(fname);
> return 0;
>
> -err_relinfo:
> +unlock:
> + mmap_write_unlock(current->mm);
> kfree(info);
> err_relmm:
> - mmput(mm);
> + mmput(new_mm);
> err_unlock_inode:
> inode_unlock(d_inode(msharefs_sb->s_root));
> err_out:
> @@ -294,11 +445,24 @@ SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
> /*
> * Is this the last reference?
> + * TODO: permission checks are needed before proceeding
> */
> if (refcount_dec_and_test(&info->refcnt)) {
> simple_unlink(d_inode(msharefs_sb->s_root), dentry);
> d_drop(dentry);
> d_delete(dentry);
> + /*
> + * TODO: Release all physical pages allocated for this
> + * mshare range and release associated page table. If
> + * the final unlink happens from the process that created
> + * mshare'd range, do we return page tables and pages to
> + * that process so the creating process can continue using
> + * the address range it had chosen to mshare at some
> + * point?
> + *
> + * TODO: unmap shared vmas from every task that is using
> + * this mshare'd range.
> + */
> mmput(info->mm);
> kfree(info);
> } else {
> --
> 2.32.0
>

Thanks
Barry

2022-06-28 20:18:16

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare

On 5/30/22 05:11, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>
>>
>> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>> if (IS_ERR(fname))
>> goto err_out;
>>
>> + end = addr + len;
>> +
>> /*
>> * Does this mshare entry exist already? If it does, calling
>> * mshare with O_EXCL|O_CREAT is an error
>> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>> inode_lock(d_inode(msharefs_sb->s_root));
>> dentry = d_lookup(msharefs_sb->s_root, &namestr);
>> if (dentry && (oflag & (O_EXCL|O_CREAT))) {
>> + inode = d_inode(dentry);
>> err = -EEXIST;
>> dput(dentry);
>> goto err_unlock_inode;
>> }
>>
>> if (dentry) {
>> + unsigned long mapaddr, prot = PROT_NONE;
>> +
>> inode = d_inode(dentry);
>> if (inode == NULL) {
>> + mmap_write_unlock(current->mm);
>> err = -EINVAL;
>> goto err_out;
>> }
>> info = inode->i_private;
>> - refcount_inc(&info->refcnt);
>> dput(dentry);
>> +
>> + /*
>> + * Map in the address range as anonymous mappings
>> + */
>> + oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
>> + if (oflag & O_RDONLY)
>> + prot |= PROT_READ;
>> + else if (oflag & O_WRONLY)
>> + prot |= PROT_WRITE;
>> + else if (oflag & O_RDWR)
>> + prot |= (PROT_READ | PROT_WRITE);
>> + mapaddr = vm_mmap(NULL, addr, len, prot,
>> + MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
>
> From the perspective of hardware, do we have to use MAP_FIXED to make
> sure those processes sharing PTE
> use the same virtual address for the shared area? or actually we don't
> necessarily need it? as long as the
> upper level pgtable entries point to the same lower level pgtable?

Hi Barry,

Sorry, I didn't mean to ignore this. I was out of commission for the last few weeks.

All processes sharing an mshare region must use the same virtual address, otherwise the page table entries for those
processes won't be identical and hence cannot be shared. The upper bits of the virtual address provide the index into
each level of the page directory. It may be possible to manipulate the various page directories to allow different
virtual addresses across processes and still get the hardware page table walk to work correctly, but that would be
complex and potentially error prone.
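
To illustrate, here is a rough userspace sketch (not kernel code and not part of the patch) that prints the page
directory indices an x86_64 4-level (48-bit) page table walk derives from a virtual address. Two processes mapping the
region at different virtual addresses would select different directory slots, so the lower level tables could not
simply be reused:

-----------------
/*
 * Illustrative only: x86_64 4-level paging, 9 bits of index per level,
 * 4K pages. A different virtual address selects different directory
 * slots, which is why all sharers must agree on the address.
 */
#include <stdio.h>
#include <stdint.h>

static void print_indices(uint64_t va)
{
        printf("va 0x%012lx -> pgd %3lu  pud %3lu  pmd %3lu  pte %3lu\n",
               (unsigned long)va,
               (unsigned long)((va >> 39) & 511),
               (unsigned long)((va >> 30) & 511),
               (unsigned long)((va >> 21) & 511),
               (unsigned long)((va >> 12) & 511));
}

int main(void)
{
        print_indices(2UL << 40);               /* 2TB, pgdir aligned */
        print_indices((2UL << 40) + 4096);      /* same region, +4K   */
        print_indices(3UL << 40);               /* different address  */
        return 0;
}
-----------------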

Thanks,
Khalid

2022-06-28 20:53:54

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare

On 5/30/22 21:46, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>
>> This patch adds basic page table sharing across tasks by making
>> mshare syscall. It does this by creating a new mm_struct which
>> hosts the shared vmas and page tables. This mm_struct is
>> maintained as long as there is at least one task using the mshare'd
>> range. It is cleaned up by the last mshare_unlink syscall.
>>
>> Signed-off-by: Khalid Aziz <[email protected]>
>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>> ---
>> mm/internal.h | 2 +
>> mm/memory.c | 35 ++++++++++
>> mm/mshare.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++----
>> 3 files changed, 214 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cf50a471384e..68f82f0f8b66 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -718,6 +718,8 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>> int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>> unsigned long addr, int page_nid, int *flags);
>>
>> +extern vm_fault_t find_shared_vma(struct vm_area_struct **vma,
>> + unsigned long *addrp);
>> static inline bool vma_is_shared(const struct vm_area_struct *vma)
>> {
>> return vma->vm_flags & VM_SHARED_PT;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c125c4969913..c77c0d643ea8 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4776,6 +4776,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>> unsigned int flags, struct pt_regs *regs)
>> {
>> vm_fault_t ret;
>> + bool shared = false;
>>
>> __set_current_state(TASK_RUNNING);
>>
>> @@ -4785,6 +4786,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>> /* do counter updates before entering really critical section. */
>> check_sync_rss_stat(current);
>>
>> + if (unlikely(vma_is_shared(vma))) {
>> + ret = find_shared_vma(&vma, &address);
>> + if (ret)
>> + return ret;
>> + if (!vma)
>> + return VM_FAULT_SIGSEGV;
>> + shared = true;
>> + }
>> +
>> if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> flags & FAULT_FLAG_INSTRUCTION,
>> flags & FAULT_FLAG_REMOTE))
>> @@ -4802,6 +4812,31 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>> else
>> ret = __handle_mm_fault(vma, address, flags);
>>
>> + /*
>> + * Release the read lock on shared VMA's parent mm unless
>> + * __handle_mm_fault released the lock already.
>> + * __handle_mm_fault sets VM_FAULT_RETRY in return value if
>> + * it released mmap lock. If lock was released, that implies
>> + * the lock would have been released on task's original mm if
>> + * this were not a shared PTE vma. To keep lock state consistent,
>> + * make sure to release the lock on task's original mm
>> + */
>> + if (shared) {
>> + int release_mmlock = 1;
>> +
>> + if (!(ret & VM_FAULT_RETRY)) {
>> + mmap_read_unlock(vma->vm_mm);
>> + release_mmlock = 0;
>> + } else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
>> + (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>> + mmap_read_unlock(vma->vm_mm);
>> + release_mmlock = 0;
>> + }
>> +
>> + if (release_mmlock)
>> + mmap_read_unlock(current->mm);
>> + }
>> +
>> if (flags & FAULT_FLAG_USER) {
>> mem_cgroup_exit_user_fault();
>> /*
>> diff --git a/mm/mshare.c b/mm/mshare.c
>> index cd2f7ad24d9d..d1896adcb00f 100644
>> --- a/mm/mshare.c
>> +++ b/mm/mshare.c
>> @@ -17,18 +17,49 @@
>> #include <linux/pseudo_fs.h>
>> #include <linux/fileattr.h>
>> #include <linux/refcount.h>
>> +#include <linux/mman.h>
>> #include <linux/sched/mm.h>
>> #include <uapi/linux/magic.h>
>> #include <uapi/linux/limits.h>
>>
>> struct mshare_data {
>> - struct mm_struct *mm;
>> + struct mm_struct *mm, *host_mm;
>> mode_t mode;
>> refcount_t refcnt;
>> };
>>
>> static struct super_block *msharefs_sb;
>>
>> +/* Returns holding the host mm's lock for read. Caller must release. */
>> +vm_fault_t
>> +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp)
>> +{
>> + struct vm_area_struct *vma, *guest = *vmap;
>> + struct mshare_data *info = guest->vm_private_data;
>> + struct mm_struct *host_mm = info->mm;
>> + unsigned long host_addr;
>> + pgd_t *pgd, *guest_pgd;
>> +
>> + host_addr = *addrp - guest->vm_start + host_mm->mmap_base;
>> + pgd = pgd_offset(host_mm, host_addr);
>> + guest_pgd = pgd_offset(current->mm, *addrp);
>> + if (!pgd_same(*guest_pgd, *pgd)) {
>> + set_pgd(guest_pgd, *pgd);
>> + return VM_FAULT_NOPAGE;
>> + }
>> +
>> + *addrp = host_addr;
>> + mmap_read_lock(host_mm);
>> + vma = find_vma(host_mm, host_addr);
>> +
>> + /* XXX: expand stack? */
>> + if (vma && vma->vm_start > host_addr)
>> + vma = NULL;
>> +
>> + *vmap = vma;
>> + return 0;
>> +}
>> +
>> static void
>> msharefs_evict_inode(struct inode *inode)
>> {
>> @@ -169,11 +200,13 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>> unsigned long, len, int, oflag, mode_t, mode)
>> {
>> struct mshare_data *info;
>> - struct mm_struct *mm;
>> struct filename *fname = getname(name);
>> struct dentry *dentry;
>> struct inode *inode;
>> struct qstr namestr;
>> + struct vm_area_struct *vma, *next, *new_vma;
>> + struct mm_struct *new_mm;
>> + unsigned long end;
>> int err = PTR_ERR(fname);
>>
>> /*
>> @@ -193,6 +226,8 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>> if (IS_ERR(fname))
>> goto err_out;
>>
>> + end = addr + len;
>> +
>> /*
>> * Does this mshare entry exist already? If it does, calling
>> * mshare with O_EXCL|O_CREAT is an error
>> @@ -205,49 +240,165 @@ SYSCALL_DEFINE5(mshare, const char __user *, name, unsigned long, addr,
>> inode_lock(d_inode(msharefs_sb->s_root));
>> dentry = d_lookup(msharefs_sb->s_root, &namestr);
>> if (dentry && (oflag & (O_EXCL|O_CREAT))) {
>> + inode = d_inode(dentry);
>> err = -EEXIST;
>> dput(dentry);
>> goto err_unlock_inode;
>> }
>>
>> if (dentry) {
>> + unsigned long mapaddr, prot = PROT_NONE;
>> +
>> inode = d_inode(dentry);
>> if (inode == NULL) {
>> + mmap_write_unlock(current->mm);
>> err = -EINVAL;
>> goto err_out;
>> }
>> info = inode->i_private;
>> - refcount_inc(&info->refcnt);
>> dput(dentry);
>> +
>> + /*
>> + * Map in the address range as anonymous mappings
>> + */
>> + oflag &= (O_RDONLY | O_WRONLY | O_RDWR);
>> + if (oflag & O_RDONLY)
>> + prot |= PROT_READ;
>> + else if (oflag & O_WRONLY)
>> + prot |= PROT_WRITE;
>> + else if (oflag & O_RDWR)
>> + prot |= (PROT_READ | PROT_WRITE);
>> + mapaddr = vm_mmap(NULL, addr, len, prot,
>> + MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, 0);
>> + if (IS_ERR((void *)mapaddr)) {
>> + err = -EINVAL;
>> + goto err_out;
>> + }
>> +
>> + refcount_inc(&info->refcnt);
>> +
>> + /*
>> + * Now that we have mmap'd the mshare'd range, update vma
>> + * flags and vm_mm pointer for this mshare'd range.
>> + */
>> + mmap_write_lock(current->mm);
>> + vma = find_vma(current->mm, addr);
>> + if (vma && vma->vm_start < addr) {
>
> and I don't understand why "vma->vm_start < addr" can happen.
> does it mean we have given mshare() an address which is not
> aligned? then we should have return -EINVAL earlier?
>
Yes, this could potentially be caught earlier. I have rewritten the code to a new API based entirely on mmap(), without
adding new system calls. Much of the above code is no longer needed since mmap() already takes care of most of these
checks. I am running some final tests on the v2 patch series and will send it out shortly.
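
For reference, the kind of up-front validation I have in mind looks roughly like the sketch below (illustrative only,
written against the v1 mshare() entry point; it is not the actual v2 code):

-----------------
/*
 * Illustrative sketch: reject misaligned or overflowing ranges before
 * doing any mapping work, so later checks such as
 * "vma->vm_start < addr" become unnecessary.
 */
static int mshare_check_range(unsigned long addr, unsigned long len)
{
        if (!len || (addr & (PGDIR_SIZE - 1)) || (len & (PGDIR_SIZE - 1)))
                return -EINVAL;         /* must be pgdir aligned */
        if (addr + len < addr)
                return -EINVAL;         /* range wraps around */
        return 0;
}
-----------------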

Thanks,
Khalid

2022-06-29 17:46:55

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 5/30/22 04:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs. When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>> exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>> This alignment value can be looked up in /proc/sys/vm//mshare_size
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>> shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>> to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>> if (addr == MAP_FAILED)
>> perror("ERROR: mmap failed");
>>
>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>> if (err < 0) {
>> perror("mshare() syscall failed");
>> exit(1);
>> }
>>
>> strncpy(addr, "Some random shared text",
>> sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>> struct mshare_info minfo;
>>
>> fd = open("testregion", O_RDONLY);
>> if (fd < 0) {
>> perror("open failed");
>> exit(1);
>> }
>>
>> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>> printf("INFO: %ld bytes shared at addr 0x%lx \n",
>> minfo.size, minfo.start);
>> else
>> perror("read failed");
>>
>> close(fd);
>>
>> addr = (void *)minfo.start;
>> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>> O_RDWR, 600);
>> if (err < 0) {
>> perror("mshare() syscall failed");
>> exit(1);
>> }
>>
>> printf("Guest mmap at %px:\n", addr);
>> printf("%s\n", addr);
>> printf("\nDone\n");
>>
>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>> if (err < 0) {
>> perror("mshare_unlink() failed");
>> exit(1);
>> }
>> -----------------
>
>
> Does that mean those shared pages will get page_mapcount()=1 ?
>
> A big pain for a memory limited system like a desktop/embedded system is
> that reverse mapping will take tons of cpu in memory reclamation path
> especially for those pages mapped by multiple processes. sometimes,
> 100% cpu utilization on LRU to scan and find out if a page is accessed
> by reading PTE young.
>
> if we result in one PTE only by this patchset, it means we are getting
> significant
> performance improvement in kernel LRU particularly when free memory
> approaches the watermarks.
>
> But I don't see how a new system call like mshare(), can be used
> by those systems as they might need some more automatic PTEs sharing
> mechanism.
>
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?
>

The anonymous pages allocated to back the virtual pages in the vmas maintained in the mshare region get added to rmap
only once. The mshare() system call is going away and is replaced by an fd based mmap() in the next version of this
patch series.
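
As a rough illustration of what a consumer might then do (this assumes the v2 series keeps the file under
/sys/fs/mshare, still publishes start/size through struct mshare_info as in v1, and allows mmap() directly on that fd;
none of those details are final):

-----------------
struct mshare_info minfo;

fd = open("/sys/fs/mshare/testregion", O_RDWR);
if (fd < 0) {
        perror("open failed");
        exit(1);
}

if (read(fd, &minfo, sizeof(minfo)) != sizeof(minfo)) {
        perror("read failed");
        exit(1);
}

/* Map the shared range at the address the donor picked */
addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FIXED, fd, 0);
if (addr == MAP_FAILED) {
        perror("mmap failed");
        exit(1);
}

printf("%s\n", (char *)addr);
-----------------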

Thanks,
Khalid

2022-06-29 18:04:22

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes

On 5/30/22 05:18, David Hildenbrand wrote:
> On 30.05.22 12:48, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>>
>>> Page tables in kernel consume some of the memory and as long as number
>>> of mappings being maintained is small enough, this space consumed by
>>> page tables is not objectionable. When very few memory pages are
>>> shared between processes, the number of page table entries (PTEs) to
>>> maintain is mostly constrained by the number of pages of memory on the
>>> system. As the number of shared pages and the number of times pages
>>> are shared goes up, amount of memory consumed by page tables starts to
>>> become significant.
>>>
>>> Some of the field deployments commonly see memory pages shared across
>>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>>> bytes long which is very small compared to the 4K page size. When 2000
>>> processes map the same page in their address space, each one of them
>>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>>> just to hold the PTEs for one 4K page. On a database server with 300GB
>>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>>> clients tried to share this SGA even though the system had 512GB of
>>> memory. On this server, in the worst case scenario of all 1500
>>> processes mapping every page from SGA would have required 878GB+ for
>>> just the PTEs. If these PTEs could be shared, amount of memory saved
>>> is very significant.
>>>
>>> This patch series implements a mechanism in kernel to allow userspace
>>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>>> mshare(), which can be used by a process to create a region (we will
>>> call it mshare'd region) which can be used by other processes to map
>>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>>> detach from the mshare'd region. Once an mshare'd region is created,
>>> other process(es), assuming they have the right permissions, can make
>>> the mashare() system call to map the shared pages into their address
>>> space using the shared PTEs. When a process is done using this
>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>> access. When the last process accessing mshare'd region calls
>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>> it is freed.
>>>
>>>
>>> API
>>> ===
>>>
>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>
>>> --
>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>
>>> mshare() creates and opens a new, or opens an existing mshare'd
>>> region that will be shared at PTE level. "name" refers to shared object
>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>> of this shared memory area and length is the size of this area.
>>> oflags can be one of:
>>>
>>> - O_RDONLY opens shared memory area for read only access by everyone
>>> - O_RDWR opens shared memory area for read and write access
>>> - O_CREAT creates the named shared memory area if it does not exist
>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>> exists with that name, return an error.
>>>
>>> mode represents the creation mode for the shared object under
>>> /sys/fs/mshare.
>>>
>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> PTEs are shared at pgdir level and hence it imposes following
>>> requirements on the address and size given to the mshare():
>>>
>>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>> This alignment value can be looked up in /proc/sys/vm//mshare_size
>>> - Size must be a multiple of pgdir size
>>> - Any mappings created in this address range at any time become
>>> shared automatically
>>> - Shared address range can have unmapped addresses in it. Any access
>>> to unmapped address will result in SIGBUS
>>>
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in
>>> the shared area. All other processes that want mshare'd access to
>>> this memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>> this file returns two long values - (1) starting address, and (2)
>>> size of the mshare'd region.
>>>
>>> --
>>> int mshare_unlink(char *name)
>>>
>>> A shared address range created by mshare() can be destroyed using
>>> mshare_unlink() which removes the shared named object. Once all
>>> processes have unmapped the shared object, the shared address range
>>> references are de-allocated and destroyed.
>>>
>>> mshare_unlink() returns 0 on success or -1 on error.
>>>
>>>
>>> Example Code
>>> ============
>>>
>>> Snippet of the code that a donor process would run looks like below:
>>>
>>> -----------------
>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>> if (addr == MAP_FAILED)
>>> perror("ERROR: mmap failed");
>>>
>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>> if (err < 0) {
>>> perror("mshare() syscall failed");
>>> exit(1);
>>> }
>>>
>>> strncpy(addr, "Some random shared text",
>>> sizeof("Some random shared text"));
>>> -----------------
>>>
>>> Snippet of code that a consumer process would execute looks like:
>>>
>>> -----------------
>>> struct mshare_info minfo;
>>>
>>> fd = open("testregion", O_RDONLY);
>>> if (fd < 0) {
>>> perror("open failed");
>>> exit(1);
>>> }
>>>
>>> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>>> printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>> minfo.size, minfo.start);
>>> else
>>> perror("read failed");
>>>
>>> close(fd);
>>>
>>> addr = (void *)minfo.start;
>>> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>> O_RDWR, 600);
>>> if (err < 0) {
>>> perror("mshare() syscall failed");
>>> exit(1);
>>> }
>>>
>>> printf("Guest mmap at %px:\n", addr);
>>> printf("%s\n", addr);
>>> printf("\nDone\n");
>>>
>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>> if (err < 0) {
>>> perror("mshare_unlink() failed");
>>> exit(1);
>>> }
>>> -----------------
>>
>>
>> Does that mean those shared pages will get page_mapcount()=1 ?
>
> AFAIU, for mshare() that is the case.
>
>>
>> A big pain for a memory limited system like a desktop/embedded system is
>> that reverse mapping will take tons of cpu in memory reclamation path
>> especially for those pages mapped by multiple processes. sometimes,
>> 100% cpu utilization on LRU to scan and find out if a page is accessed
>> by reading PTE young.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, that is the question - what does mapcount represent? I am interpreting it as the number of PTEs that map the
page. Since mshare uses one PTE for each shared page irrespective of how many processes share the page, a mapcount of 1
sounds reasonable to me.

>
>>
>> if we result in one PTE only by this patchset, it means we are getting
>> significant
>> performance improvement in kernel LRU particularly when free memory
>> approaches the watermarks.
>>
>> But I don't see how a new system call like mshare(), can be used
>> by those systems as they might need some more automatic PTEs sharing
>> mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.

It is tempting to make this sharing automatic, and mshare may evolve toward that. Since mshare assumes significant
trust between the processes sharing pages (shared pages share attributes and possibly protection keys), it seems
dangerous to make that assumption automatically without the processes explicitly declaring that level of trust.

Thanks,
Khalid

2022-06-29 18:04:33

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

On 5/30/22 22:24, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>
>> mshare'd PTEs should not be removed when a task exits. These PTEs
>> are removed when the last task sharing the PTEs exits. Add a check
>> for shared PTEs and skip them.
>>
>> Signed-off-by: Khalid Aziz <[email protected]>
>> ---
>> mm/memory.c | 22 +++++++++++++++++++---
>> 1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c77c0d643ea8..e7c5bc6f8836 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> } else {
>> /*
>> * Optimization: gather nearby vmas into one call down
>> + * as long as they all belong to the same mm (that
>> + * may not be the case if a vma is part of mshare'd
>> + * range
>> */
>> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>> - && !is_vm_hugetlb_page(next)) {
>> + && !is_vm_hugetlb_page(next)
>> + && vma->vm_mm == tlb->mm) {
>> vma = next;
>> next = vma->vm_next;
>> unlink_anon_vmas(vma);
>> unlink_file_vma(vma);
>> }
>> - free_pgd_range(tlb, addr, vma->vm_end,
>> - floor, next ? next->vm_start : ceiling);
>> + /*
>> + * Free pgd only if pgd is not allocated for an
>> + * mshare'd range
>> + */
>> + if (vma->vm_mm == tlb->mm)
>> + free_pgd_range(tlb, addr, vma->vm_end,
>> + floor, next ? next->vm_start : ceiling);
>> }
>> vma = next;
>> }
>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>> pgd_t *pgd;
>> unsigned long next;
>>
>> + /*
>> + * If this is an mshare'd page, do not unmap it since it might
>> + * still be in use.
>> + */
>> + if (vma->vm_mm != tlb->mm)
>> + return;
>> +
>
> expect unmap, have you ever tested reverse mapping in vmscan, especially
> folio_referenced()? are all vmas in those processes sharing page table still
> in the rmap of the shared page?
> without shared PTE, if 1000 processes share one page, we are reading 1000
> PTEs, with it, are we reading just one? or are we reading the same PTE
> 1000 times? Have you tested it?
>

We are treating the mshare'd region the same as threads sharing an address space. There is one PTE used by all
processes, and the VMA maintained in the separate mshare mm struct, which also holds the shared PTE, is the one that
gets added to rmap. mshare introduces a different model in that it adds an mm struct separate from the mm structs of
the processes that refer to the vma and PTE in the mshare mm struct. Do you see issues with rmap in this model?

Thanks,
Khalid

2022-07-03 20:59:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

On 6/29/22 10:38, Khalid Aziz wrote:
> On 5/30/22 22:24, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]>
>> wrote:
>>>
>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>> are removed when the last task sharing the PTEs exits. Add a check
>>> for shared PTEs and skip them.
>>>
>>> Signed-off-by: Khalid Aziz <[email protected]>
>>> ---
>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb,
>>> struct vm_area_struct *vma,
>>>                  } else {
>>>                          /*
>>>                           * Optimization: gather nearby vmas into one
>>> call down
>>> +                        * as long as they all belong to the same mm
>>> (that
>>> +                        * may not be the case if a vma is part of
>>> mshare'd
>>> +                        * range
>>>                           */
>>>                          while (next && next->vm_start <= vma->vm_end
>>> + PMD_SIZE
>>> -                              && !is_vm_hugetlb_page(next)) {
>>> +                              && !is_vm_hugetlb_page(next)
>>> +                              && vma->vm_mm == tlb->mm) {
>>>                                  vma = next;
>>>                                  next = vma->vm_next;
>>>                                  unlink_anon_vmas(vma);
>>>                                  unlink_file_vma(vma);
>>>                          }
>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>> -                               floor, next ? next->vm_start : ceiling);
>>> +                       /*
>>> +                        * Free pgd only if pgd is not allocated for an
>>> +                        * mshare'd range
>>> +                        */
>>> +                       if (vma->vm_mm == tlb->mm)
>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>> +                                       floor, next ? next->vm_start
>>> : ceiling);
>>>                  }
>>>                  vma = next;
>>>          }
>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>          pgd_t *pgd;
>>>          unsigned long next;
>>>
>>> +       /*
>>> +        * If this is an mshare'd page, do not unmap it since it might
>>> +        * still be in use.
>>> +        */
>>> +       if (vma->vm_mm != tlb->mm)
>>> +               return;
>>> +
>>
>> expect unmap, have you ever tested reverse mapping in vmscan, especially
>> folio_referenced()? are all vmas in those processes sharing page table
>> still
>> in the rmap of the shared page?
>> without shared PTE, if 1000 processes share one page, we are reading 1000
>> PTEs, with it, are we reading just one? or are we reading the same PTE
>> 1000 times? Have you tested it?
>>
>
> We are treating mshared region same as threads sharing address space.
> There is one PTE that is being used by all processes and the VMA
> maintained in the separate mshare mm struct that also holds the shared
> PTE is the one that gets added to rmap. This is a different model with
> mshare in that it adds an mm struct that is separate from the mm structs
> of the processes that refer to the vma and pte in mshare mm struct. Do
> you see issues with rmap in this model?

I think this patch is actually the most interesting bit of the series by
far. Most of the rest is defining an API (which is important!) and
figuring out semantics. This patch changes something rather fundamental
about how user address spaces work: what vmas live in them. So let's
figure out its effects.

I admit I'm rather puzzled about what vm_mm is for in the first place.
In current kernels (without your patch), I think it's a pretty hard
requirement for vm_mm to equal the mm for all vmas in an mm. After a
quick and incomplete survey, vm_mm seems to be mostly used as a somewhat
lazy way to find the mm. Let's see:

file_operations->mmap doesn't receive an mm_struct. Instead it infers
the mm from vm_mm. (Why? I don't know.)

Some walk_page_range users seem to dig the mm out of vm_mm instead of
mm_walk.

Some manual address space walkers start with an mm, don't bother passing
it around, and dig it back out of vm_mm. For example, unuse_vma()
and all its helpers.

The only real exception I've found so far is rmap: AFAICS (on quick
inspection -- I could be wrong), rmap can map from a folio to a bunch of
vmas, and the vmas' mms are not stored separately but instead determined
by vm_mm.



Your patch makes me quite nervous. You're potentially breaking any
kernel code path that assumes that mms only contain vmas that have vm_mm
== mm. And you're potentially causing rmap to be quite confused. I
think that if you're going to take this approach, you need to clearly
define the new semantics of vm_mm and audit or clean up every user of
vm_mm in the tree. This may be nontrivial (especially rmap), although a
cleanup of everything else to stop using vm_mm might be valuable.

But I'm wondering if it would be better to attack this from a different
direction. Right now, there's a hardcoded assumption that an mm owns
every page table it references. That's really the thing you're
changing. To me, it seems that a magical vma that shares page tables
should still be a vma that belongs to its mm_struct -- munmap() and
potentially other m***() operations should all work on it, existing
find_vma() users should work, etc.

So maybe instead there should be new behavior (by a VM_ flag or
otherwise) that indicates that a vma owns its PTEs. It could even be a
vm_operation, although if anyone ever wants regular file mappings to
share PTEs, then a vm_operation doesn't really make sense.

2022-07-06 20:53:15

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

On 7/3/22 14:54, Andy Lutomirski wrote:
> On 6/29/22 10:38, Khalid Aziz wrote:
>> On 5/30/22 22:24, Barry Song wrote:
>>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <[email protected]> wrote:
>>>>
>>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>>> are removed when the last task sharing the PTEs exits. Add a check
>>>> for shared PTEs and skip them.
>>>>
>>>> Signed-off-by: Khalid Aziz <[email protected]>
>>>> ---
>>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>                  } else {
>>>>                          /*
>>>>                           * Optimization: gather nearby vmas into one call down
>>>> +                        * as long as they all belong to the same mm (that
>>>> +                        * may not be the case if a vma is part of mshare'd
>>>> +                        * range
>>>>                           */
>>>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>>> -                              && !is_vm_hugetlb_page(next)) {
>>>> +                              && !is_vm_hugetlb_page(next)
>>>> +                              && vma->vm_mm == tlb->mm) {
>>>>                                  vma = next;
>>>>                                  next = vma->vm_next;
>>>>                                  unlink_anon_vmas(vma);
>>>>                                  unlink_file_vma(vma);
>>>>                          }
>>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>>> -                               floor, next ? next->vm_start : ceiling);
>>>> +                       /*
>>>> +                        * Free pgd only if pgd is not allocated for an
>>>> +                        * mshare'd range
>>>> +                        */
>>>> +                       if (vma->vm_mm == tlb->mm)
>>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>>> +                                       floor, next ? next->vm_start : ceiling);
>>>>                  }
>>>>                  vma = next;
>>>>          }
>>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>>          pgd_t *pgd;
>>>>          unsigned long next;
>>>>
>>>> +       /*
>>>> +        * If this is an mshare'd page, do not unmap it since it might
>>>> +        * still be in use.
>>>> +        */
>>>> +       if (vma->vm_mm != tlb->mm)
>>>> +               return;
>>>> +
>>>
>>> expect unmap, have you ever tested reverse mapping in vmscan, especially
>>> folio_referenced()? are all vmas in those processes sharing page table still
>>> in the rmap of the shared page?
>>> without shared PTE, if 1000 processes share one page, we are reading 1000
>>> PTEs, with it, are we reading just one? or are we reading the same PTE
>>> 1000 times? Have you tested it?
>>>
>>
>> We are treating mshared region same as threads sharing address space. There is one PTE that is being used by all
>> processes and the VMA maintained in the separate mshare mm struct that also holds the shared PTE is the one that gets
>> added to rmap. This is a different model with mshare in that it adds an mm struct that is separate from the mm structs
>> of the processes that refer to the vma and pte in mshare mm struct. Do you see issues with rmap in this model?
>
> I think this patch is actually the most interesting bit of the series by far.  Most of the rest is defining an API
> (which is important!) and figuring out semantics.  This patch changes something rather fundamental about how user
> address spaces work: what vmas live in them.  So let's figure out its effects.
>
> I admit I'm rather puzzled about what vm_mm is for in the first place. In current kernels (without your patch), I think
> it's a pretty hard requirement for vm_mm to equal the mm for all vmas in an mm.  After a quick and incomplete survey,
> vm_mm seems to be mostly used as a somewhat lazy way to find the mm.  Let's see:
>
> file_operations->mmap doesn't receive an mm_struct.  Instead it infers the mm from vm_mm.  (Why?  I don't know.)
>
> Some walk_page_range users seem to dig the mm out of vm_mm instead of mm_walk.
>
> Some manual address space walkers start with an mm, don't bother passing it around, and dig it back out of vm_mm.
> For example, unuse_vma() and all its helpers.
>
> The only real exception I've found so far is rmap: AFAICS (on quick inspection -- I could be wrong), rmap can map from a
> folio to a bunch of vmas, and the vmas' mms are not stored separately but instead determined by vm_mm.
>
>
>
> Your patch makes me quite nervous.  You're potentially breaking any kernel code path that assumes that mms only contain
> vmas that have vm_mm == mm.  And you're potentially causing rmap to be quite confused.  I think that if you're going to
> take this approach, you need to clearly define the new semantics of vm_mm and audit or clean up every user of vm_mm in
> the tree.  This may be nontrivial (especially rmap), although a cleanup of everything else to stop using vm_mm might be
> valuable.
>
> But I'm wondering if it would be better to attack this from a different direction.  Right now, there's a hardcoded
> assumption that an mm owns every page table it references.  That's really the thing you're changing.  To me, it seems
> that a magical vma that shares page tables should still be a vma that belongs to its mm_struct -- munmap() and
> potentialy other m***() operations should all work on it, existing find_vma() users should work, etc.
>
> So maybe instead there should be new behavior (by a VM_ flag or otherwise) that indicates that a vma owns its PTEs.  It
> could even be a vm_operation, although if anyone ever wants regular file mappings to share PTEs, then a vm_operation
> doesn't really make sense.
>

Hi Andy,

You are absolutely right. Dragons lie on the path to changing the sense of vm_mm. Dave H pointed out potential issues
with the rb_tree as well. As I have looked at more code, it seems that breaking the assumption that vm_mm always points
to the containing mm struct opens up the possibility of code breaking in strange ways in odd places. As a result, I
have changed the code in the v2 patches to not break this assumption about vm_mm; instead, I rewrote the code to use
the vm flag VM_SHARED_PT and vm_private_data everywhere it needs to find the mshare mm struct. All the vmas belonging
to the new mm struct for the mshare region also have their vm_mm pointing to the mshare mm_struct, which keeps all vma
operations working normally.
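
In other words, the v2 lookup goes through vm_private_data rather than a repurposed vm_mm. Roughly (field names follow
the v1 mshare_data structure; the helper itself is only illustrative):

-----------------
/*
 * Illustrative helper: vm_mm is left pointing at the owning mm, and
 * the mm that hosts the shared page tables is reached through
 * vm_private_data on vmas flagged VM_SHARED_PT.
 */
static inline struct mm_struct *mshare_host_mm(struct vm_area_struct *vma)
{
        struct mshare_data *info;

        if (!(vma->vm_flags & VM_SHARED_PT))
                return vma->vm_mm;              /* ordinary vma */

        info = vma->vm_private_data;
        return info->mm;        /* mm that holds the shared page tables */
}
-----------------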

Thanks,
Khalid