2014-06-13 10:45:18

by David Herrmann

Subject: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
a longer introduction at gmane:
http://thread.gmane.org/gmane.comp.video.dri.devel/102241
An LWN article about memfd+sealing is available, too:
https://lwn.net/Articles/593918/
v2 with some more discussions can be found here:
http://thread.gmane.org/gmane.linux.kernel.mm/115713

This series introduces two new APIs:
memfd_create(): Think of this syscall as malloc(), but it returns a
file-descriptor instead of a pointer. That file-descriptor is
backed by anon-memory and can be memory-mapped for access.
sealing: The sealing API can be used to prevent a specific set of operations
on a file-descriptor. You 'seal' the file and thus gain the
guarantee that it cannot be modified in those specific ways.

A short high-level introduction is also available here:
http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/


Changed in v3:
- fcntl() now returns EINVAL if the FD does not support sealing. We used to
return EBADF like pipe_fcntl() does, but that is really weird and I don't
like repeating that.
- seals are now saved as "unsigned int" instead of "u32".
- i_mmap_writable is now an atomic so we can deny writable mappings just like
i_writecount does.
- SHMEM_ALLOW_SEALING is dropped. We initialize all objects with F_SEAL_SEAL
and only unset it for memfds that shall support sealing.
- memfd_create() no longer has a size argument. It was redundant; use
ftruncate() or fallocate().
- memfd_create() flags are "unsigned int" now, instead of "u64".
- NAME_MAX off-by-one fix
- several cosmetic changes
- Added AIO/Direct-IO page-pinning protection

The last point is the most important change in this version: We now bail out if
any page-refcount is elevated while setting SEAL_WRITE. This prevents parallel
GUP users from writing to sealed files _after_ they were sealed. There is also a
new FUSE-based test-case to trigger such situations.

The last 2 patches try to improve the page-pinning handling. I included both in
this series, but obviously only one of them is needed (or we could stack them):
- 6/7: This waits for up to 150ms for pages to be unpinned
- 7/7: This isolates pinned pages and replaces them with a fresh copy

Hugh, patch 6 is basically your code. In case that gets merged, can I put your
Signed-off-by on it?

I hope I didn't miss anything. Further comments welcome!

Thanks
David

David Herrmann (7):
mm: allow drivers to prevent new writable mappings
shm: add sealing API
shm: add memfd_create() syscall
selftests: add memfd_create() + sealing tests
selftests: add memfd/sealing page-pinning tests
shm: wait for pins to be released when sealing
shm: isolate pinned pages when sealing files

arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/fcntl.c | 5 +
fs/inode.c | 1 +
include/linux/fs.h | 29 +-
include/linux/shmem_fs.h | 17 +
include/linux/syscalls.h | 1 +
include/uapi/linux/fcntl.h | 15 +
include/uapi/linux/memfd.h | 8 +
kernel/fork.c | 2 +-
kernel/sys_ni.c | 1 +
mm/mmap.c | 24 +-
mm/shmem.c | 320 ++++++++-
mm/swap_state.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/memfd/.gitignore | 4 +
tools/testing/selftests/memfd/Makefile | 40 ++
tools/testing/selftests/memfd/fuse_mnt.c | 110 +++
tools/testing/selftests/memfd/fuse_test.c | 311 +++++++++
tools/testing/selftests/memfd/memfd_test.c | 913 +++++++++++++++++++++++++
tools/testing/selftests/memfd/run_fuse_test.sh | 14 +
21 files changed, 1807 insertions(+), 12 deletions(-)
create mode 100644 include/uapi/linux/memfd.h
create mode 100644 tools/testing/selftests/memfd/.gitignore
create mode 100644 tools/testing/selftests/memfd/Makefile
create mode 100755 tools/testing/selftests/memfd/fuse_mnt.c
create mode 100644 tools/testing/selftests/memfd/fuse_test.c
create mode 100644 tools/testing/selftests/memfd/memfd_test.c
create mode 100755 tools/testing/selftests/memfd/run_fuse_test.sh

--
2.0.0


2014-06-13 10:45:27

by David Herrmann

Subject: [PATCH v3 2/7] shm: add sealing API

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
- one side cannot overwrite data while the other reads it
- one side cannot shrink the buffer while the other accesses it
- one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (e.g.,
for general-purpose IPC), sharing memory-regions is highly fragile and
often not possible without local copies. Consider the following two
use-cases:
1) A graphics client wants to share its rendering-buffer with a
graphics-server. The memory-region is allocated by the client for
read/write access and a second FD is passed to the server. While
scanning out from the memory region, the server has no guarantee that
the client doesn't shrink the buffer at any time, requiring rather
cumbersome SIGBUS handling.
2) A process wants to perform an RPC on another process. To avoid huge
bandwidth consumption, zero-copy is preferred. After a message is
assembled in-memory and a FD is passed to the remote side, both sides
want to be sure that neither modifies this shared copy anymore. The
source may have put sensitive data into the message without a separate
copy and the target may want to parse the message inline, to avoid a
local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first is disproportionately ugly to use
in libraries, and the latter two are broken/racy or even disabled due to
denial-of-service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever.
Unlike locks, seals can only be set, never removed. Hence, once you have
verified that a specific set of seals is set, you're guaranteed that no-one
can perform the blocked operations on this file anymore.

An initial set of seals is introduced by this patch:
- SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
in size. This affects ftruncate() and open(O_TRUNC).
- GROW: If SEAL_GROW is set, the file in question cannot be increased
in size. This affects ftruncate(), fallocate() and write().
- WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
are possible. This affects fallocate(PUNCH_HOLE), mmap() and
write().
- SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
This basically prevents the F_ADD_SEAL operation on a file and
can be set to prevent others from adding further seals that you
don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
1) The graphics server can verify that a passed file-descriptor has
SEAL_SHRINK set. This allows safe scanout, while the client is
allowed to increase buffer size for window-resizing on-the-fly.
Concurrent writes are explicitly allowed.
2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
process can modify the data while the other side parses it.
Furthermore, it guarantees that even with writable FDs passed to the
peer, it cannot increase the size to hit memory-limits of the source
process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
F_GET_SEALS: Return a bitset describing the seals on the file. This
can be called on any FD if the underlying file supports
sealing.
F_ADD_SEALS: Change the seals of a given file. This requires WRITE
access to the file and F_SEAL_SEAL may not already be set.
Furthermore, the underlying file must support sealing and,
for SEAL_WRITE, there may not be any existing writable
shared mapping of that file. Otherwise, EINVAL/EPERM/EBUSY
is returned.
The given seals are _added_ to the existing set of seals
on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
other files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <[email protected]>
---
fs/fcntl.c | 5 ++
include/linux/shmem_fs.h | 17 ++++
include/uapi/linux/fcntl.h | 15 ++++
mm/shmem.c | 189 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 223 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 72c82f6..22d1c3d 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
#include <linux/rcupdate.h>
#include <linux/pid_namespace.h>
#include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>

#include <asm/poll.h>
#include <asm/siginfo.h>
@@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_GETPIPE_SZ:
err = pipe_fcntl(filp, cmd, arg);
break;
+ case F_ADD_SEALS:
+ case F_GET_SEALS:
+ err = shmem_fcntl(filp, cmd, arg);
+ break;
default:
break;
}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c..50777b5 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
#ifndef __SHMEM_FS_H
#define __SHMEM_FS_H

+#include <linux/file.h>
#include <linux/swap.h>
#include <linux/mempolicy.h>
#include <linux/pagemap.h>
@@ -11,6 +12,7 @@

struct shmem_inode_info {
spinlock_t lock;
+ unsigned int seals; /* shmem seals */
unsigned long flags;
unsigned long alloced; /* data pages alloced to file */
union {
@@ -65,4 +67,19 @@ static inline struct page *shmem_read_mapping_page(
mapping_gfp_mask(mapping));
}

+#ifdef CONFIG_TMPFS
+
+extern int shmem_add_seals(struct file *file, unsigned int seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
+{
+ return -EINVAL;
+}
+
+#endif
+
#endif
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..beed138 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,21 @@
#define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)

/*
+ * Set/Get seals
+ */
+#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
+#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
+#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
+#define F_SEAL_GROW 0x0004 /* prevent file from growing */
+#define F_SEAL_WRITE 0x0008 /* prevent writes */
+/* (1U << 31) is reserved for signed error codes */
+
+/*
* Types of directory notifications that may be requested.
*/
#define DN_ACCESS 0x00000001 /* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..1438b3e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/fcntl.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ loff_t oldsize = inode->i_size;
+ loff_t newsize = attr->ia_size;
int error;

error = inode_change_ok(inode, attr);
if (error)
return error;

- if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
- loff_t oldsize = inode->i_size;
- loff_t newsize = attr->ia_size;
+ /* protected by i_mutex */
+ if (attr->ia_valid & ATTR_SIZE) {
+ if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
+ (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+ return -EPERM;
+ }

+ if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
if (newsize != oldsize) {
i_size_write(inode, newsize);
inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1315,6 +1323,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
+ info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->swaplist);
simple_xattrs_init(&info->xattrs);
@@ -1374,7 +1383,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
{
int ret;
struct inode *inode = mapping->host;
+ struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+ /* i_mutex is held by caller */
+ if (info->seals & F_SEAL_WRITE)
+ return -EPERM;
+ if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
+ return -EPERM;
+
ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
if (ret == 0 && *pagep)
init_page_accessed(*pagep);
@@ -1715,11 +1732,166 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
return offset;
}

+/*
+ * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
+ * via get_user_pages(), drivers might have some pending I/O without any active
+ * user-space mappings (e.g., direct-IO, AIO). Therefore, we look at all pages
+ * and see whether any has an elevated ref-count. If so, we abort.
+ * The caller must guarantee that no new user will acquire writable references
+ * to those pages to avoid races.
+ */
+static int shmem_test_for_pins(struct address_space *mapping)
+{
+ struct radix_tree_iter iter;
+ void **slot;
+ pgoff_t start;
+ struct page *page;
+ int error;
+
+ /* flush additional refs in lru_add early */
+ lru_add_drain_all();
+
+ error = 0;
+ start = 0;
+ rcu_read_lock();
+
+restart:
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ page = radix_tree_deref_slot(slot);
+ if (!page || radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page))
+ goto restart;
+ } else if (page_count(page) - page_mapcount(page) > 1) {
+ error = -EBUSY;
+ break;
+ }
+
+ if (need_resched()) {
+ cond_resched_rcu();
+ start = iter.index + 1;
+ goto restart;
+ }
+ }
+ rcu_read_unlock();
+
+ return error;
+}
+
+#define F_ALL_SEALS (F_SEAL_SEAL | \
+ F_SEAL_SHRINK | \
+ F_SEAL_GROW | \
+ F_SEAL_WRITE)
+
+int shmem_add_seals(struct file *file, unsigned int seals)
+{
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ int error;
+
+ /*
+ * SEALING
+ * Sealing allows multiple parties to share a shmem-file but restrict
+ * access to a specific subset of file operations. Seals can only be
+ * added, but never removed. This way, mutually untrusted parties can
+ * share common memory regions with a well-defined policy. A malicious
+ * peer can thus never perform unwanted operations on a shared object.
+ *
+ * Seals are only supported on special shmem-files and always affect
+ * the whole underlying inode. Once a seal is set, it may prevent some
+ * kinds of access to the file. Currently, the following seals are
+ * defined:
+ * SEAL_SEAL: Prevent further seals from being set on this file
+ * SEAL_SHRINK: Prevent the file from shrinking
+ * SEAL_GROW: Prevent the file from growing
+ * SEAL_WRITE: Prevent write access to the file
+ *
+ * As we don't require any trust relationship between two parties, we
+ * must prevent seals from being removed. Therefore, sealing a file
+ * only adds a given set of seals to the file, it never touches
+ * existing seals. Furthermore, the "setting seals"-operation can be
+ * sealed itself, which basically prevents any further seal from being
+ * added.
+ *
+ * Semantics of sealing are only defined on volatile files. Only
+ * anonymous shmem files support sealing. More importantly, seals are
+ * never written to disk. Therefore, there's no plan to support it on
+ * other file types.
+ */
+
+ if (file->f_op != &shmem_file_operations)
+ return -EINVAL;
+ if (!(file->f_mode & FMODE_WRITE))
+ return -EINVAL;
+ if (seals & ~(unsigned int)F_ALL_SEALS)
+ return -EINVAL;
+
+ mutex_lock(&inode->i_mutex);
+
+ if (info->seals & F_SEAL_SEAL) {
+ error = -EPERM;
+ goto unlock;
+ }
+
+ if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) {
+ error = mapping_deny_writable(file->f_mapping);
+ if (error)
+ goto unlock;
+
+ error = shmem_test_for_pins(file->f_mapping);
+ if (error) {
+ mapping_allow_writable(file->f_mapping);
+ goto unlock;
+ }
+ }
+
+ info->seals |= seals;
+ error = 0;
+
+unlock:
+ mutex_unlock(&inode->i_mutex);
+ return error;
+}
+EXPORT_SYMBOL_GPL(shmem_add_seals);
+
+int shmem_get_seals(struct file *file)
+{
+ if (file->f_op != &shmem_file_operations)
+ return -EINVAL;
+
+ return SHMEM_I(file_inode(file))->seals & F_ALL_SEALS;
+}
+EXPORT_SYMBOL_GPL(shmem_get_seals);
+
+long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ long error;
+
+ switch (cmd) {
+ case F_ADD_SEALS:
+ /* disallow upper 32bit */
+ if (arg >> 32)
+ return -EINVAL;
+
+ error = shmem_add_seals(file, arg);
+ break;
+ case F_GET_SEALS:
+ error = shmem_get_seals(file);
+ break;
+ default:
+ error = -EINVAL;
+ break;
+ }
+
+ return error;
+}
+
static long shmem_fallocate(struct file *file, int mode, loff_t offset,
loff_t len)
{
struct inode *inode = file_inode(file);
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_falloc shmem_falloc;
pgoff_t start, index, end;
int error;
@@ -1731,6 +1903,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
loff_t unmap_start = round_up(offset, PAGE_SIZE);
loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;

+ /* protected by i_mutex */
+ if (info->seals & F_SEAL_WRITE) {
+ error = -EPERM;
+ goto out;
+ }
+
if ((u64)unmap_end > (u64)unmap_start)
unmap_mapping_range(mapping, unmap_start,
1 + unmap_end - unmap_start, 0);
@@ -1745,6 +1923,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
if (error)
goto out;

+ if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
+ error = -EPERM;
+ goto out;
+ }
+
start = offset >> PAGE_CACHE_SHIFT;
end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
/* Try to avoid a swapstorm if len is impossible to satisfy */
--
2.0.0

2014-06-13 10:45:32

by David Herrmann

Subject: [PATCH v3 5/7] selftests: add memfd/sealing page-pinning tests

Setting SEAL_WRITE is not possible if there are pending GUP users. This
commit adds selftests for memfd+sealing that use FUSE to create pending
page-references. FUSE is very helpful here in that it allows us to delay
direct-IO operations for an arbitrary amount of time. This way, we can
force the kernel to pin pages and then run our normal selftests.

Signed-off-by: David Herrmann <[email protected]>
---
tools/testing/selftests/memfd/.gitignore | 2 +
tools/testing/selftests/memfd/Makefile | 13 +-
tools/testing/selftests/memfd/fuse_mnt.c | 110 +++++++++
tools/testing/selftests/memfd/fuse_test.c | 311 +++++++++++++++++++++++++
tools/testing/selftests/memfd/run_fuse_test.sh | 14 ++
5 files changed, 449 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/memfd/fuse_mnt.c
create mode 100644 tools/testing/selftests/memfd/fuse_test.c
create mode 100755 tools/testing/selftests/memfd/run_fuse_test.sh

diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
index bcc8ee2..afe87c4 100644
--- a/tools/testing/selftests/memfd/.gitignore
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -1,2 +1,4 @@
+fuse_mnt
+fuse_test
memfd_test
memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
index 36653b9..77816620 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -25,5 +25,16 @@ ifeq ($(ARCH),X86)
endif
@./memfd_test || echo "memfd_test: [FAIL]"

+build_fuse:
+ifeq ($(ARCH),X86)
+ gcc $(CFLAGS) fuse_mnt.c `pkg-config fuse --cflags --libs` -o fuse_mnt
+ gcc $(CFLAGS) fuse_test.c -o fuse_test
+else
+ echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_fuse: build_fuse
+ @./run_fuse_test.sh || echo "fuse_test: [FAIL]"
+
clean:
- $(RM) memfd_test
+ $(RM) memfd_test fuse_test
diff --git a/tools/testing/selftests/memfd/fuse_mnt.c b/tools/testing/selftests/memfd/fuse_mnt.c
new file mode 100755
index 0000000..feacf12
--- /dev/null
+++ b/tools/testing/selftests/memfd/fuse_mnt.c
@@ -0,0 +1,110 @@
+/*
+ * memfd test file-system
+ * This file uses FUSE to create a dummy file-system with only one file /memfd.
+ * This file is read-only and takes 1s per read.
+ *
+ * This file-system is used by the memfd test-cases to force the kernel to pin
+ * pages during reads(). Due to the 1s delay of this file-system, this is a
+ * nice way to test race-conditions against get_user_pages() in the kernel.
+ *
+ * We use direct_io==1 to force the kernel to use direct-IO for this
+ * file-system.
+ */
+
+#define FUSE_USE_VERSION 26
+
+#include <fuse.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+static const char memfd_content[] = "memfd-example-content";
+static const char memfd_path[] = "/memfd";
+
+static int memfd_getattr(const char *path, struct stat *st)
+{
+ memset(st, 0, sizeof(*st));
+
+ if (!strcmp(path, "/")) {
+ st->st_mode = S_IFDIR | 0755;
+ st->st_nlink = 2;
+ } else if (!strcmp(path, memfd_path)) {
+ st->st_mode = S_IFREG | 0444;
+ st->st_nlink = 1;
+ st->st_size = strlen(memfd_content);
+ } else {
+ return -ENOENT;
+ }
+
+ return 0;
+}
+
+static int memfd_readdir(const char *path,
+ void *buf,
+ fuse_fill_dir_t filler,
+ off_t offset,
+ struct fuse_file_info *fi)
+{
+ if (strcmp(path, "/"))
+ return -ENOENT;
+
+ filler(buf, ".", NULL, 0);
+ filler(buf, "..", NULL, 0);
+ filler(buf, memfd_path + 1, NULL, 0);
+
+ return 0;
+}
+
+static int memfd_open(const char *path, struct fuse_file_info *fi)
+{
+ if (strcmp(path, memfd_path))
+ return -ENOENT;
+
+ if ((fi->flags & 3) != O_RDONLY)
+ return -EACCES;
+
+ /* force direct-IO */
+ fi->direct_io = 1;
+
+ return 0;
+}
+
+static int memfd_read(const char *path,
+ char *buf,
+ size_t size,
+ off_t offset,
+ struct fuse_file_info *fi)
+{
+ size_t len;
+
+ if (strcmp(path, memfd_path) != 0)
+ return -ENOENT;
+
+ sleep(1);
+
+ len = strlen(memfd_content);
+ if (offset < len) {
+ if (offset + size > len)
+ size = len - offset;
+
+ memcpy(buf, memfd_content + offset, size);
+ } else {
+ size = 0;
+ }
+
+ return size;
+}
+
+static struct fuse_operations memfd_ops = {
+ .getattr = memfd_getattr,
+ .readdir = memfd_readdir,
+ .open = memfd_open,
+ .read = memfd_read,
+};
+
+int main(int argc, char *argv[])
+{
+ return fuse_main(argc, argv, &memfd_ops, NULL);
+}
diff --git a/tools/testing/selftests/memfd/fuse_test.c b/tools/testing/selftests/memfd/fuse_test.c
new file mode 100644
index 0000000..67908b1
--- /dev/null
+++ b/tools/testing/selftests/memfd/fuse_test.c
@@ -0,0 +1,311 @@
+/*
+ * memfd GUP test-case
+ * This tests memfd interactions with get_user_pages(). We require the
+ * fuse_mnt.c program to provide a fake direct-IO FUSE mount-point for us. This
+ * file-system delays _all_ reads by 1s and forces direct-IO. This means, any
+ * read() on files in that file-system will pin the receive-buffer pages for at
+ * least 1s via get_user_pages().
+ *
+ * We use this trick to race ADD_SEALS against a write on a memfd object. The
+ * ADD_SEALS must fail if the memfd pages are still pinned. Note that we use
+ * the read() syscall with our memory-mapped memfd object as receive buffer to
+ * force the kernel to write into our memfd object.
+ */
+
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65535
+
+static int sys_memfd_create(const char *name,
+ unsigned int flags)
+{
+ return syscall(__NR_memfd_create, name, flags);
+}
+
+static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
+{
+ int r, fd;
+
+ fd = sys_memfd_create(name, flags);
+ if (fd < 0) {
+ printf("memfd_create(\"%s\", %u) failed: %m\n",
+ name, flags);
+ abort();
+ }
+
+ r = ftruncate(fd, sz);
+ if (r < 0) {
+ printf("ftruncate(%llu) failed: %m\n", (unsigned long long)sz);
+ abort();
+ }
+
+ return fd;
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+ long r;
+
+ r = fcntl(fd, F_GET_SEALS);
+ if (r < 0) {
+ printf("GET_SEALS(%d) failed: %m\n", fd);
+ abort();
+ }
+
+ return r;
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+ __u64 s;
+
+ s = mfd_assert_get_seals(fd);
+ if (s != seals) {
+ printf("%llu != %llu = GET_SEALS(%d)\n",
+ (unsigned long long)seals, (unsigned long long)s, fd);
+ abort();
+ }
+}
+
+static void mfd_assert_add_seals(int fd, __u64 seals)
+{
+ long r;
+ __u64 s;
+
+ s = mfd_assert_get_seals(fd);
+ r = fcntl(fd, F_ADD_SEALS, seals);
+ if (r < 0) {
+ printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
+ fd, (unsigned long long)s, (unsigned long long)seals);
+ abort();
+ }
+}
+
+static int mfd_busy_add_seals(int fd, __u64 seals)
+{
+ long r;
+ __u64 s;
+
+ r = fcntl(fd, F_GET_SEALS);
+ if (r < 0)
+ s = 0;
+ else
+ s = r;
+
+ r = fcntl(fd, F_ADD_SEALS, seals);
+ if (r < 0 && errno != EBUSY) {
+ printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected with EBUSY: %m\n",
+ fd, (unsigned long long)s, (unsigned long long)seals);
+ abort();
+ }
+
+ return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+ void *p;
+
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+ void *p;
+
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ return p;
+}
+
+static int global_mfd = -1;
+static void *global_p = NULL;
+
+static int sealing_thread_fn(void *arg)
+{
+ int sig, r;
+
+ /*
+ * This thread first waits 200ms so any pending operation in the parent
+ * is correctly started. After that, it tries to seal @global_mfd as
+ * SEAL_WRITE. This _must_ fail as the parent thread has a read() into
+ * that memory-mapped object still ongoing.
+ * We then wait one more second and try sealing again. This time it
+ * must succeed as there shouldn't be anyone else pinning the pages.
+ */
+
+ /* wait 200ms for FUSE-request to be active */
+ usleep(200000);
+
+ /* unmap the mapping before sealing to avoid i_mmap_writable failures */
+ munmap(global_p, MFD_DEF_SIZE);
+
+ /* Try sealing the global file; expect EBUSY or success. Current
+ * kernels will never succeed, but in the future, kernels might
+ * implement page-replacements or other fancy ways to avoid racing
+ * writes. */
+ r = mfd_busy_add_seals(global_mfd, F_SEAL_WRITE);
+ if (r >= 0) {
+ printf("HURRAY! This kernel fixed GUP races!\n");
+ } else {
+ /* wait 1s more so the FUSE-request is done */
+ sleep(1);
+
+ /* try sealing the global file again */
+ mfd_assert_add_seals(global_mfd, F_SEAL_WRITE);
+ }
+
+ return 0;
+}
+
+static pid_t spawn_sealing_thread(void)
+{
+ uint8_t *stack;
+ pid_t pid;
+
+ stack = malloc(STACK_SIZE);
+ if (!stack) {
+ printf("malloc(STACK_SIZE) failed: %m\n");
+ abort();
+ }
+
+ pid = clone(sealing_thread_fn,
+ stack + STACK_SIZE,
+ SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM,
+ NULL);
+ if (pid < 0) {
+ printf("clone() failed: %m\n");
+ abort();
+ }
+
+ return pid;
+}
+
+static void join_sealing_thread(pid_t pid)
+{
+ waitpid(pid, NULL, 0);
+}
+
+int main(int argc, char **argv)
+{
+ static const char zero[MFD_DEF_SIZE];
+ int fd, mfd, r;
+ void *p;
+ int was_sealed;
+ pid_t pid;
+
+ if (argc < 2) {
+ printf("error: please pass path to file in fuse_mnt mount-point\n");
+ abort();
+ }
+
+ /* open FUSE memfd file for GUP testing */
+ printf("opening: %s\n", argv[1]);
+ fd = open(argv[1], O_RDONLY | O_CLOEXEC);
+ if (fd < 0) {
+ printf("cannot open(\"%s\"): %m\n", argv[1]);
+ abort();
+ }
+
+ /* create new memfd-object */
+ mfd = mfd_assert_new("kern_memfd_fuse",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ /* mmap memfd-object for writing */
+ p = mfd_assert_mmap_shared(mfd);
+
+ /* pass mfd+mapping to a separate sealing-thread which tries to seal
+ * the memfd objects with SEAL_WRITE while we write into it */
+ global_mfd = mfd;
+ global_p = p;
+ pid = spawn_sealing_thread();
+
+ /* Use read() on the FUSE file to read into our memory-mapped memfd
+ * object. This races the other thread which tries to seal the
+ * memfd-object.
+ * If @fd is on the memfd-fake-FUSE-FS, the read() is delayed by 1s.
+ * This guarantees that the receive-buffer is pinned for 1s until the
+ * data is written into it. The racing ADD_SEALS should thus fail as
+ * the pages are still pinned. */
+ r = read(fd, p, MFD_DEF_SIZE);
+ if (r < 0) {
+ printf("read() failed: %m\n");
+ abort();
+ } else if (!r) {
+ printf("unexpected EOF on read()\n");
+ abort();
+ }
+
+ was_sealed = mfd_assert_get_seals(mfd) & F_SEAL_WRITE;
+
+ /* Wait for sealing-thread to finish and verify that it
+ * successfully sealed the file after the second try. */
+ join_sealing_thread(pid);
+ mfd_assert_has_seals(mfd, F_SEAL_WRITE);
+
+ /* *IF* the memfd-object was sealed at the time our read() returned,
+ * then the kernel did a page-replacement or canceled the read() (or
+ * whatever magic it did..). In that case, the memfd object is still
+ * all zero.
+ * In case the memfd-object was *not* sealed, the read() was successful
+ * and the memfd object must *not* be all zero.
+ * Note that in real scenarios, there might be a mixture of both, but
+ * in this test-case, we have explicit 200ms delays which should be
+ * enough to avoid any in-flight writes. */
+
+ p = mfd_assert_mmap_private(mfd);
+ if (was_sealed && memcmp(p, zero, MFD_DEF_SIZE)) {
+ printf("memfd sealed during read() but data not discarded\n");
+ abort();
+ } else if (!was_sealed && !memcmp(p, zero, MFD_DEF_SIZE)) {
+ printf("memfd sealed after read() but data discarded\n");
+ abort();
+ }
+
+ close(mfd);
+ close(fd);
+
+ printf("fuse: DONE\n");
+
+ return 0;
+}
diff --git a/tools/testing/selftests/memfd/run_fuse_test.sh b/tools/testing/selftests/memfd/run_fuse_test.sh
new file mode 100755
index 0000000..69b930e
--- /dev/null
+++ b/tools/testing/selftests/memfd/run_fuse_test.sh
@@ -0,0 +1,14 @@
+#!/bin/sh
+
+if test -d "./mnt" ; then
+ fusermount -u ./mnt
+ rmdir ./mnt
+fi
+
+set -e
+
+mkdir mnt
+./fuse_mnt ./mnt
+./fuse_test ./mnt/memfd
+fusermount -u ./mnt
+rmdir ./mnt
--
2.0.0

2014-06-13 10:45:37

by David Herrmann

Subject: [RFC v3 7/7] shm: isolate pinned pages when sealing files

When setting SEAL_WRITE, we must make sure nobody has a writable reference
to the pages (via GUP or similar). We currently check references and wait
some time for them to be dropped. This, however, might fail for several
reasons, including:
- the page is pinned for longer than we wait
- while we wait, someone takes an already pinned page for read-access

Therefore, this patch introduces page-isolation. When sealing a file with
SEAL_WRITE, we copy all pages that have an elevated ref-count. The new page
is put in place atomically, the old page is detached and left alone. It
will get reclaimed once the last external user drops its reference.

Signed-off-by: David Herrmann <[email protected]>
---
mm/shmem.c | 218 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 105 insertions(+), 113 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ddc3998..34b14fb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1237,6 +1237,110 @@ unlock:
return error;
}

+static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ struct page *newpage;
+ int error;
+
+ if (oldpage->mapping != mapping)
+ return 0;
+ if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
+ return 0;
+
+ if (page_mapped(oldpage))
+ unmap_mapping_range(mapping,
+ (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+
+ VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
+ VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
+
+ newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
+ oldpage->index);
+ if (!newpage)
+ return -ENOMEM;
+
+ __set_page_locked(newpage);
+ copy_highpage(newpage, oldpage);
+ flush_dcache_page(newpage);
+
+ page_cache_get(newpage);
+ SetPageUptodate(newpage);
+ SetPageSwapBacked(newpage);
+ newpage->mapping = mapping;
+ newpage->index = oldpage->index;
+
+ cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
+
+ spin_lock_irq(&mapping->tree_lock);
+ error = shmem_radix_tree_replace(mapping, oldpage->index,
+ oldpage, newpage);
+ if (!error) {
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+
+ if (error) {
+ newpage->mapping = NULL;
+ unlock_page(newpage);
+ page_cache_release(newpage);
+ page_cache_release(newpage);
+ return error;
+ }
+
+ mem_cgroup_replace_page_cache(oldpage, newpage);
+ lru_cache_add_anon(newpage);
+
+ oldpage->mapping = NULL;
+ page_cache_release(oldpage);
+ unlock_page(newpage);
+ page_cache_release(newpage);
+
+ return 1;
+}
+
+static int shmem_isolate_pins(struct inode *inode)
+{
+ struct address_space *mapping = inode->i_mapping;
+ struct pagevec pvec;
+ pgoff_t indices[PAGEVEC_SIZE];
+ pgoff_t index;
+ int i, ret, error;
+
+ pagevec_init(&pvec, 0);
+ index = 0;
+ error = 0;
+ while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
+ pvec.pages, indices))) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+
+ index = indices[i];
+ if (radix_tree_exceptional_entry(page))
+ continue;
+ if (page->mapping != mapping)
+ continue;
+ if (page_count(page) - page_mapcount(page) <= 2)
+ continue;
+
+ lock_page(page);
+ ret = shmem_isolate_page(inode, page);
+ if (ret < 0)
+ error = ret;
+ unlock_page(page);
+ }
+ pagevec_remove_exceptionals(&pvec);
+ pagevec_release(&pvec);
+ cond_resched();
+ index++;
+ }
+
+ return error;
+}
+
static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
@@ -1734,118 +1838,6 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
return offset;
}

-/*
- * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
- * so reuse a tag which we firmly believe is never set or cleared on shmem.
- */
-#define SHMEM_TAG_PINNED PAGECACHE_TAG_TOWRITE
-#define LAST_SCAN 4 /* about 150ms max */
-
-static void shmem_tag_pins(struct address_space *mapping)
-{
- struct radix_tree_iter iter;
- void **slot;
- pgoff_t start;
- struct page *page;
-
- start = 0;
- rcu_read_lock();
-
-restart:
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- page = radix_tree_deref_slot(slot);
- if (!page || radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page))
- goto restart;
- } else if (page_count(page) - page_mapcount(page) > 1) {
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_tag_set(&mapping->page_tree, iter.index,
- SHMEM_TAG_PINNED);
- spin_unlock_irq(&mapping->tree_lock);
- }
-
- if (need_resched()) {
- cond_resched_rcu();
- start = iter.index + 1;
- goto restart;
- }
- }
- rcu_read_unlock();
-}
-
-/*
- * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
- * via get_user_pages(), drivers might have some pending I/O without any active
- * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
- * and see whether it has an elevated ref-count. If so, we tag them and wait for
- * them to be dropped.
- * The caller must guarantee that no new user will acquire writable references
- * to those pages to avoid races.
- */
-static int shmem_wait_for_pins(struct address_space *mapping)
-{
- struct radix_tree_iter iter;
- void **slot;
- pgoff_t start;
- struct page *page;
- int error, scan;
-
- shmem_tag_pins(mapping);
-
- error = 0;
- for (scan = 0; scan <= LAST_SCAN; scan++) {
- if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
- break;
-
- if (!scan)
- lru_add_drain_all();
- else if (schedule_timeout_killable((HZ << scan) / 200))
- scan = LAST_SCAN;
-
- start = 0;
- rcu_read_lock();
-restart:
- radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
- start, SHMEM_TAG_PINNED) {
-
- page = radix_tree_deref_slot(slot);
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page))
- goto restart;
-
- page = NULL;
- }
-
- if (page &&
- page_count(page) - page_mapcount(page) != 1) {
- if (scan < LAST_SCAN)
- goto continue_resched;
-
- /*
- * On the last scan, we clean up all those tags
- * we inserted; but make a note that we still
- * found pages pinned.
- */
- error = -EBUSY;
- }
-
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_tag_clear(&mapping->page_tree,
- iter.index, SHMEM_TAG_PINNED);
- spin_unlock_irq(&mapping->tree_lock);
-continue_resched:
- if (need_resched()) {
- cond_resched_rcu();
- start = iter.index + 1;
- goto restart;
- }
- }
- rcu_read_unlock();
- }
-
- return error;
-}
-
#define F_ALL_SEALS (F_SEAL_SEAL | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
@@ -1907,7 +1899,7 @@ int shmem_add_seals(struct file *file, unsigned int seals)
if (error)
goto unlock;

- error = shmem_wait_for_pins(file->f_mapping);
+ error = shmem_isolate_pins(inode);
if (error) {
mapping_allow_writable(file->f_mapping);
goto unlock;
--
2.0.0

2014-06-13 10:46:07

by David Herrmann

Subject: [RFC v3 6/7] shm: wait for pins to be released when sealing

We currently fail to set SEAL_WRITE in case there are pending page
references. This patch extends the pin-tests to wait up to 150ms for all
references to be dropped. This is still not perfect in that it doesn't
account for harmless read-only pins, but it's much better than a hard
failure.

Signed-off-by: David Herrmann <[email protected]>
---
mm/shmem.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 82 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e7c5fe1..ddc3998 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1735,25 +1735,19 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
}

/*
- * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
- * via get_user_pages(), drivers might have some pending I/O without any active
- * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
- * and see whether it has an elevated ref-count. If so, we abort.
- * The caller must guarantee that no new user will acquire writable references
- * to those pages to avoid races.
+ * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
+ * so reuse a tag which we firmly believe is never set or cleared on shmem.
*/
-static int shmem_test_for_pins(struct address_space *mapping)
+#define SHMEM_TAG_PINNED PAGECACHE_TAG_TOWRITE
+#define LAST_SCAN 4 /* about 150ms max */
+
+static void shmem_tag_pins(struct address_space *mapping)
{
struct radix_tree_iter iter;
void **slot;
pgoff_t start;
struct page *page;
- int error;
-
- /* flush additional refs in lru_add early */
- lru_add_drain_all();

- error = 0;
start = 0;
rcu_read_lock();

@@ -1764,8 +1758,10 @@ restart:
if (radix_tree_deref_retry(page))
goto restart;
} else if (page_count(page) - page_mapcount(page) > 1) {
- error = -EBUSY;
- break;
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_tag_set(&mapping->page_tree, iter.index,
+ SHMEM_TAG_PINNED);
+ spin_unlock_irq(&mapping->tree_lock);
}

if (need_resched()) {
@@ -1775,6 +1771,77 @@ restart:
}
}
rcu_read_unlock();
+}
+
+/*
+ * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
+ * via get_user_pages(), drivers might have some pending I/O without any active
+ * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
+ * and see whether it has an elevated ref-count. If so, we tag them and wait for
+ * them to be dropped.
+ * The caller must guarantee that no new user will acquire writable references
+ * to those pages to avoid races.
+ */
+static int shmem_wait_for_pins(struct address_space *mapping)
+{
+ struct radix_tree_iter iter;
+ void **slot;
+ pgoff_t start;
+ struct page *page;
+ int error, scan;
+
+ shmem_tag_pins(mapping);
+
+ error = 0;
+ for (scan = 0; scan <= LAST_SCAN; scan++) {
+ if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
+ break;
+
+ if (!scan)
+ lru_add_drain_all();
+ else if (schedule_timeout_killable((HZ << scan) / 200))
+ scan = LAST_SCAN;
+
+ start = 0;
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
+ start, SHMEM_TAG_PINNED) {
+
+ page = radix_tree_deref_slot(slot);
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page))
+ goto restart;
+
+ page = NULL;
+ }
+
+ if (page &&
+ page_count(page) - page_mapcount(page) != 1) {
+ if (scan < LAST_SCAN)
+ goto continue_resched;
+
+ /*
+ * On the last scan, we clean up all those tags
+ * we inserted; but make a note that we still
+ * found pages pinned.
+ */
+ error = -EBUSY;
+ }
+
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_tag_clear(&mapping->page_tree,
+ iter.index, SHMEM_TAG_PINNED);
+ spin_unlock_irq(&mapping->tree_lock);
+continue_resched:
+ if (need_resched()) {
+ cond_resched_rcu();
+ start = iter.index + 1;
+ goto restart;
+ }
+ }
+ rcu_read_unlock();
+ }

return error;
}
@@ -1840,7 +1907,7 @@ int shmem_add_seals(struct file *file, unsigned int seals)
if (error)
goto unlock;

- error = shmem_test_for_pins(file->f_mapping);
+ error = shmem_wait_for_pins(file->f_mapping);
if (error) {
mapping_allow_writable(file->f_mapping);
goto unlock;
--
2.0.0

2014-06-13 10:46:46

by David Herrmann

Subject: [PATCH v3 4/7] selftests: add memfd_create() + sealing tests

Some basic tests to verify sealing on memfds works as expected and
guarantees the advertised semantics.

Signed-off-by: David Herrmann <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/memfd/.gitignore | 2 +
tools/testing/selftests/memfd/Makefile | 29 +
tools/testing/selftests/memfd/memfd_test.c | 913 +++++++++++++++++++++++++++++
4 files changed, 945 insertions(+)
create mode 100644 tools/testing/selftests/memfd/.gitignore
create mode 100644 tools/testing/selftests/memfd/Makefile
create mode 100644 tools/testing/selftests/memfd/memfd_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index e66e710..5ef80cb 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
TARGETS += cpu-hotplug
TARGETS += efivarfs
TARGETS += kcmp
+TARGETS += memfd
TARGETS += memory-hotplug
TARGETS += mqueue
TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+ ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+ ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+ gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+ echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+ gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+ @./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+ $(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..3634c90
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,913 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65535
+
+static int sys_memfd_create(const char *name,
+ unsigned int flags)
+{
+ return syscall(__NR_memfd_create, name, flags);
+}
+
+static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
+{
+ int r, fd;
+
+ fd = sys_memfd_create(name, flags);
+ if (fd < 0) {
+ printf("memfd_create(\"%s\", %u) failed: %m\n",
+ name, flags);
+ abort();
+ }
+
+ r = ftruncate(fd, sz);
+ if (r < 0) {
+ printf("ftruncate(%llu) failed: %m\n", (unsigned long long)sz);
+ abort();
+ }
+
+ return fd;
+}
+
+static void mfd_fail_new(const char *name, unsigned int flags)
+{
+ int r;
+
+ r = sys_memfd_create(name, flags);
+ if (r >= 0) {
+ printf("memfd_create(\"%s\", %u) succeeded, but failure expected\n",
+ name, flags);
+ close(r);
+ abort();
+ }
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+ long r;
+
+ r = fcntl(fd, F_GET_SEALS);
+ if (r < 0) {
+ printf("GET_SEALS(%d) failed: %m\n", fd);
+ abort();
+ }
+
+ return r;
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+ __u64 s;
+
+ s = mfd_assert_get_seals(fd);
+ if (s != seals) {
+ printf("%llu != %llu = GET_SEALS(%d)\n",
+ (unsigned long long)seals, (unsigned long long)s, fd);
+ abort();
+ }
+}
+
+static void mfd_assert_add_seals(int fd, __u64 seals)
+{
+ long r;
+ __u64 s;
+
+ s = mfd_assert_get_seals(fd);
+ r = fcntl(fd, F_ADD_SEALS, seals);
+ if (r < 0) {
+ printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
+ fd, (unsigned long long)s, (unsigned long long)seals);
+ abort();
+ }
+}
+
+static void mfd_fail_add_seals(int fd, __u64 seals)
+{
+ long r;
+ __u64 s;
+
+ r = fcntl(fd, F_GET_SEALS);
+ if (r < 0)
+ s = 0;
+ else
+ s = r;
+
+ r = fcntl(fd, F_ADD_SEALS, seals);
+ if (r >= 0) {
+ printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+ fd, (unsigned long long)s, (unsigned long long)seals);
+ abort();
+ }
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+ struct stat st;
+ int r;
+
+ r = fstat(fd, &st);
+ if (r < 0) {
+ printf("fstat(%d) failed: %m\n", fd);
+ abort();
+ } else if (st.st_size != size) {
+ printf("wrong file size %lld, but expected %lld\n",
+ (long long)st.st_size, (long long)size);
+ abort();
+ }
+}
+
+static int mfd_assert_dup(int fd)
+{
+ int r;
+
+ r = dup(fd);
+ if (r < 0) {
+ printf("dup(%d) failed: %m\n", fd);
+ abort();
+ }
+
+ return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+ void *p;
+
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+ void *p;
+
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ,
+ MAP_PRIVATE,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+ char buf[512];
+ int r;
+
+ sprintf(buf, "/proc/self/fd/%d", fd);
+ r = open(buf, flags, mode);
+ if (r < 0) {
+ printf("open(%s) failed: %m\n", buf);
+ abort();
+ }
+
+ return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+ char buf[512];
+ int r;
+
+ sprintf(buf, "/proc/self/fd/%d", fd);
+ r = open(buf, flags, mode);
+ if (r >= 0) {
+ printf("open(%s) didn't fail as expected\n", buf);
+ abort();
+ }
+}
+
+static void mfd_assert_read(int fd)
+{
+ char buf[16];
+ void *p;
+ ssize_t l;
+
+ l = read(fd, buf, sizeof(buf));
+ if (l != sizeof(buf)) {
+ printf("read() failed: %m\n");
+ abort();
+ }
+
+ /* verify PROT_READ *is* allowed */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ,
+ MAP_PRIVATE,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+ munmap(p, MFD_DEF_SIZE);
+
+ /* verify MAP_PRIVATE is *always* allowed (even writable) */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+ munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+ ssize_t l;
+ void *p;
+ int r;
+
+ /* verify write() succeeds */
+ l = write(fd, "\0\0\0\0", 4);
+ if (l != 4) {
+ printf("write() failed: %m\n");
+ abort();
+ }
+
+ /* verify PROT_READ | PROT_WRITE is allowed */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+ *(char *)p = 0;
+ munmap(p, MFD_DEF_SIZE);
+
+ /* verify PROT_WRITE is allowed */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+ *(char *)p = 0;
+ munmap(p, MFD_DEF_SIZE);
+
+ /* verify PROT_READ with MAP_SHARED is allowed and a following
+ * mprotect(PROT_WRITE) allows writing */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p == MAP_FAILED) {
+ printf("mmap() failed: %m\n");
+ abort();
+ }
+
+ r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+ if (r < 0) {
+ printf("mprotect() failed: %m\n");
+ abort();
+ }
+
+ *(char *)p = 0;
+ munmap(p, MFD_DEF_SIZE);
+
+ /* verify PUNCH_HOLE works */
+ r = fallocate(fd,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ 0,
+ MFD_DEF_SIZE);
+ if (r < 0) {
+ printf("fallocate(PUNCH_HOLE) failed: %m\n");
+ abort();
+ }
+}
+
+static void mfd_fail_write(int fd)
+{
+ ssize_t l;
+ void *p;
+ int r;
+
+ /* verify write() fails */
+ l = write(fd, "data", 4);
+ if (l != -EPERM) {
+ printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+ abort();
+ }
+
+ /* verify PROT_READ | PROT_WRITE is not allowed */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p != MAP_FAILED) {
+ printf("mmap() didn't fail as expected\n");
+ abort();
+ }
+
+ /* verify PROT_WRITE is not allowed */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_WRITE,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p != MAP_FAILED) {
+ printf("mmap() didn't fail as expected\n");
+ abort();
+ }
+
+ /* Verify PROT_READ with MAP_SHARED with a following mprotect is not
+ * allowed. Note that for r/w the kernel already prevents the mmap. */
+ p = mmap(NULL,
+ MFD_DEF_SIZE,
+ PROT_READ,
+ MAP_SHARED,
+ fd,
+ 0);
+ if (p != MAP_FAILED) {
+ r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+ if (r >= 0) {
+ printf("mmap()+mprotect() didn't fail as expected\n");
+ abort();
+ }
+ }
+
+ /* verify PUNCH_HOLE fails */
+ r = fallocate(fd,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ 0,
+ MFD_DEF_SIZE);
+ if (r >= 0) {
+ printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+ abort();
+ }
+}
+
+static void mfd_assert_shrink(int fd)
+{
+ int r, fd2;
+
+ r = ftruncate(fd, MFD_DEF_SIZE / 2);
+ if (r < 0) {
+ printf("ftruncate(SHRINK) failed: %m\n");
+ abort();
+ }
+
+ mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+ fd2 = mfd_assert_open(fd,
+ O_RDWR | O_CREAT | O_TRUNC,
+ S_IRUSR | S_IWUSR);
+ close(fd2);
+
+ mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+ int r;
+
+ r = ftruncate(fd, MFD_DEF_SIZE / 2);
+ if (r >= 0) {
+ printf("ftruncate(SHRINK) didn't fail as expected\n");
+ abort();
+ }
+
+ mfd_fail_open(fd,
+ O_RDWR | O_CREAT | O_TRUNC,
+ S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+ int r;
+
+ r = ftruncate(fd, MFD_DEF_SIZE * 2);
+ if (r < 0) {
+ printf("ftruncate(GROW) failed: %m\n");
+ abort();
+ }
+
+ mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+ r = fallocate(fd,
+ 0,
+ 0,
+ MFD_DEF_SIZE * 4);
+ if (r < 0) {
+ printf("fallocate(ALLOC) failed: %m\n");
+ abort();
+ }
+
+ mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+ int r;
+
+ r = ftruncate(fd, MFD_DEF_SIZE * 2);
+ if (r >= 0) {
+ printf("ftruncate(GROW) didn't fail as expected\n");
+ abort();
+ }
+
+ r = fallocate(fd,
+ 0,
+ 0,
+ MFD_DEF_SIZE * 4);
+ if (r >= 0) {
+ printf("fallocate(ALLOC) didn't fail as expected\n");
+ abort();
+ }
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+ static char buf[MFD_DEF_SIZE * 8];
+ ssize_t l;
+
+ l = pwrite(fd, buf, sizeof(buf), 0);
+ if (l != sizeof(buf)) {
+ printf("pwrite() failed: %m\n");
+ abort();
+ }
+
+ mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+ static char buf[MFD_DEF_SIZE * 8];
+ ssize_t l;
+
+ l = pwrite(fd, buf, sizeof(buf), 0);
+ if (l == sizeof(buf)) {
+ printf("pwrite() didn't fail as expected\n");
+ abort();
+ }
+}
+
+static int idle_thread_fn(void *arg)
+{
+ sigset_t set;
+ int sig;
+
+ /* dummy waiter; SIGTERM terminates us anyway */
+ sigemptyset(&set);
+ sigaddset(&set, SIGTERM);
+ sigwait(&set, &sig);
+
+ return 0;
+}
+
+static pid_t spawn_idle_thread(unsigned int flags)
+{
+ uint8_t *stack;
+ pid_t pid;
+
+ stack = malloc(STACK_SIZE);
+ if (!stack) {
+ printf("malloc(STACK_SIZE) failed: %m\n");
+ abort();
+ }
+
+ pid = clone(idle_thread_fn,
+ stack + STACK_SIZE,
+ SIGCHLD | flags,
+ NULL);
+ if (pid < 0) {
+ printf("clone() failed: %m\n");
+ abort();
+ }
+
+ return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+ kill(pid, SIGTERM);
+ waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+ char buf[2048];
+ int fd;
+
+ /* test NULL name */
+ mfd_fail_new(NULL, 0);
+
+ /* test over-long name (not zero-terminated) */
+ memset(buf, 0xff, sizeof(buf));
+ mfd_fail_new(buf, 0);
+
+ /* test over-long zero-terminated name */
+ memset(buf, 0xff, sizeof(buf));
+ buf[sizeof(buf) - 1] = 0;
+ mfd_fail_new(buf, 0);
+
+ /* verify "" is a valid name */
+ fd = mfd_assert_new("", 0, 0);
+ close(fd);
+
+ /* verify invalid O_* open flags */
+ mfd_fail_new("", 0x0100);
+ mfd_fail_new("", ~MFD_CLOEXEC);
+ mfd_fail_new("", ~MFD_ALLOW_SEALING);
+ mfd_fail_new("", ~0);
+ mfd_fail_new("", 0x80000000U);
+
+ /* verify MFD_CLOEXEC is allowed */
+ fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+ close(fd);
+
+ /* verify MFD_ALLOW_SEALING is allowed */
+ fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
+ close(fd);
+
+ /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
+ fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
+ close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+ int fd;
+
+ fd = mfd_assert_new("kern_memfd_basic",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ /* add basic seals */
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_WRITE);
+
+ /* add them again */
+ mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_WRITE);
+
+ /* add more seals and seal against sealing */
+ mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_GROW |
+ F_SEAL_WRITE |
+ F_SEAL_SEAL);
+
+ /* verify that sealing no longer works */
+ mfd_fail_add_seals(fd, F_SEAL_GROW);
+ mfd_fail_add_seals(fd, 0);
+
+ close(fd);
+
+ /* verify sealing does not work without MFD_ALLOW_SEALING */
+ fd = mfd_assert_new("kern_memfd_basic",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+ mfd_fail_add_seals(fd, F_SEAL_SHRINK |
+ F_SEAL_GROW |
+ F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+ close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+ int fd;
+
+ fd = mfd_assert_new("kern_memfd_seal_write",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE);
+
+ mfd_assert_read(fd);
+ mfd_fail_write(fd);
+ mfd_assert_shrink(fd);
+ mfd_assert_grow(fd);
+ mfd_fail_grow_write(fd);
+
+ close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+ int fd;
+
+ fd = mfd_assert_new("kern_memfd_seal_shrink",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+
+ mfd_assert_read(fd);
+ mfd_assert_write(fd);
+ mfd_fail_shrink(fd);
+ mfd_assert_grow(fd);
+ mfd_assert_grow_write(fd);
+
+ close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+ int fd;
+
+ fd = mfd_assert_new("kern_memfd_seal_grow",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_GROW);
+ mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+ mfd_assert_read(fd);
+ mfd_assert_write(fd);
+ mfd_assert_shrink(fd);
+ mfd_fail_grow(fd);
+ mfd_fail_grow_write(fd);
+
+ close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+ int fd;
+
+ fd = mfd_assert_new("kern_memfd_seal_resize",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+
+ mfd_assert_read(fd);
+ mfd_assert_write(fd);
+ mfd_fail_shrink(fd);
+ mfd_fail_grow(fd);
+ mfd_fail_grow_write(fd);
+
+ close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test that seals are shared between dupped FDs and they're all equal.
+ */
+static void test_share_dup(void)
+{
+ int fd, fd2;
+
+ fd = mfd_assert_new("kern_memfd_share_dup",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+
+ fd2 = mfd_assert_dup(fd);
+ mfd_assert_has_seals(fd2, 0);
+
+ mfd_assert_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+ mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+ mfd_assert_add_seals(fd, F_SEAL_SEAL);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+ mfd_fail_add_seals(fd, F_SEAL_GROW);
+ mfd_fail_add_seals(fd2, F_SEAL_GROW);
+ mfd_fail_add_seals(fd, F_SEAL_SEAL);
+ mfd_fail_add_seals(fd2, F_SEAL_SEAL);
+
+ close(fd2);
+
+ mfd_fail_add_seals(fd, F_SEAL_GROW);
+ close(fd);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Modifying seals is only allowed if no other mmap() refs exist.
+ */
+static void test_share_mmap(void)
+{
+ int fd;
+ void *p;
+
+ fd = mfd_assert_new("kern_memfd_share_mmap",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+
+ /* shared/writable ref prevents sealing WRITE, but allows others */
+ p = mfd_assert_mmap_shared(fd);
+ mfd_fail_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+ munmap(p, MFD_DEF_SIZE);
+
+ /* readable ref allows sealing */
+ p = mfd_assert_mmap_private(fd);
+ mfd_assert_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+ munmap(p, MFD_DEF_SIZE);
+
+ close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+ int fd, fd2;
+
+ fd = mfd_assert_new("kern_memfd_share_open",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+
+ fd2 = mfd_assert_open(fd, O_RDWR, 0);
+ mfd_assert_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+ mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+ close(fd);
+ fd = mfd_assert_open(fd2, O_RDONLY, 0);
+
+ mfd_fail_add_seals(fd, F_SEAL_SEAL);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+ close(fd2);
+ fd2 = mfd_assert_open(fd, O_RDWR, 0);
+
+ mfd_assert_add_seals(fd2, F_SEAL_SEAL);
+ mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+ mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+ close(fd2);
+ close(fd);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications work as expected with forked children.
+ */
+static void test_share_fork(void)
+{
+ int fd;
+ pid_t pid;
+
+ fd = mfd_assert_new("kern_memfd_share_fork",
+ MFD_DEF_SIZE,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ mfd_assert_has_seals(fd, 0);
+
+ pid = spawn_idle_thread(0);
+ mfd_assert_add_seals(fd, F_SEAL_SEAL);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+ mfd_fail_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+ join_idle_thread(pid);
+
+ mfd_fail_add_seals(fd, F_SEAL_WRITE);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+ close(fd);
+}
+
+int main(int argc, char **argv)
+{
+ pid_t pid;
+
+ printf("memfd: CREATE\n");
+ test_create();
+ printf("memfd: BASIC\n");
+ test_basic();
+
+ printf("memfd: SEAL-WRITE\n");
+ test_seal_write();
+ printf("memfd: SEAL-SHRINK\n");
+ test_seal_shrink();
+ printf("memfd: SEAL-GROW\n");
+ test_seal_grow();
+ printf("memfd: SEAL-RESIZE\n");
+ test_seal_resize();
+
+ printf("memfd: SHARE-DUP\n");
+ test_share_dup();
+ printf("memfd: SHARE-MMAP\n");
+ test_share_mmap();
+ printf("memfd: SHARE-OPEN\n");
+ test_share_open();
+ printf("memfd: SHARE-FORK\n");
+ test_share_fork();
+
+ /* Run test-suite in a multi-threaded environment with a shared
+ * file-table. */
+ pid = spawn_idle_thread(CLONE_FILES | CLONE_FS | CLONE_VM);
+ printf("memfd: SHARE-DUP (shared file-table)\n");
+ test_share_dup();
+ printf("memfd: SHARE-MMAP (shared file-table)\n");
+ test_share_mmap();
+ printf("memfd: SHARE-OPEN (shared file-table)\n");
+ test_share_open();
+ printf("memfd: SHARE-FORK (shared file-table)\n");
+ test_share_fork();
+ join_idle_thread(pid);
+
+ printf("memfd: DONE\n");
+
+ return 0;
+}
--
2.0.0

2014-06-13 10:47:14

by David Herrmann

Subject: [PATCH v3 3/7] shm: add memfd_create() syscall

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It can support sealing and avoids any
connection to user-visible mount-points. Thus, it is not subject to quotas
on mounted file-systems, yet can be used like malloc()'ed memory that comes
with a file-descriptor.

memfd_create() returns the raw shmem file, so calls like ftruncate() can
be used to modify the underlying inode. Calls like fstat() will also
return proper information and report the file as a regular file. If you
want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
supported (as on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and alike. It is still properly accounted to memcg
limits, though.

Signed-off-by: David Herrmann <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/linux/memfd.h | 8 +++++
kernel/sys_ni.c | 1 +
mm/shmem.c | 72 ++++++++++++++++++++++++++++++++++++++++
6 files changed, 84 insertions(+)
create mode 100644 include/uapi/linux/memfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..e7495b4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 memfd_create sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..28be0e1 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common memfd_create sys_memfd_create

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..0be5d4d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char *uname_ptr, unsigned int flags);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..534e364
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,8 @@
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+/* flags for memfd_create(2) (unsigned int) */
+#define MFD_CLOEXEC 0x0001U
+#define MFD_ALLOW_SEALING 0x0002U
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b5..489a4e6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);

/* performance counters: */
cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 1438b3e..e7c5fe1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
#include <linux/highmem.h>
#include <linux/seq_file.h>
#include <linux/magic.h>
+#include <linux/syscalls.h>
#include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -2662,6 +2664,76 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
+
+#define MFD_NAME_PREFIX "memfd:"
+#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
+#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
+
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
+
+SYSCALL_DEFINE2(memfd_create,
+ const char*, uname,
+ unsigned int, flags)
+{
+ struct shmem_inode_info *info;
+ struct file *file;
+ int fd, error;
+ char *name;
+ long len;
+
+ if (flags & ~(unsigned int)MFD_ALL_FLAGS)
+ return -EINVAL;
+
+ /* length includes terminating zero */
+ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
+ if (len <= 0)
+ return -EFAULT;
+ if (len > MFD_NAME_MAX_LEN + 1)
+ return -EINVAL;
+
+ name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY);
+ if (!name)
+ return -ENOMEM;
+
+ strcpy(name, MFD_NAME_PREFIX);
+ if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
+ error = -EFAULT;
+ goto err_name;
+ }
+
+ /* terminating-zero may have changed after strnlen_user() returned */
+ if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
+ error = -EFAULT;
+ goto err_name;
+ }
+
+ fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+ if (fd < 0) {
+ error = fd;
+ goto err_name;
+ }
+
+ file = shmem_file_setup(name, 0, VM_NORESERVE);
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto err_fd;
+ }
+ info = SHMEM_I(file_inode(file));
+ file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+ if (flags & MFD_ALLOW_SEALING)
+ info->seals &= ~F_SEAL_SEAL;
+
+ fd_install(fd, file);
+ kfree(name);
+ return fd;
+
+err_fd:
+ put_unused_fd(fd);
+err_name:
+ kfree(name);
+ return error;
+}
+
#endif /* CONFIG_TMPFS */

static void shmem_put_super(struct super_block *sb)
--
2.0.0

2014-06-13 10:47:35

by David Herrmann

[permalink] [raw]
Subject: [PATCH v3 1/7] mm: allow drivers to prevent new writable mappings

The i_mmap_writable field counts existing writable mappings of an
address_space. To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set. In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that, iff you already own a writable
mapping, you can increase the counter without using the helpers. This is
the same as we do for i_writecount.

Signed-off-by: David Herrmann <[email protected]>
---
fs/inode.c | 1 +
include/linux/fs.h | 29 +++++++++++++++++++++++++++--
kernel/fork.c | 2 +-
mm/mmap.c | 24 ++++++++++++++++++------
mm/swap_state.c | 1 +
5 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6eecb7f..9945b11 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -165,6 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
+ atomic_set(&mapping->i_mmap_writable, 0);
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338e6f7..71d17c9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -387,7 +387,7 @@ struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
spinlock_t tree_lock; /* and lock protecting it */
- unsigned int i_mmap_writable;/* count VM_SHARED mappings */
+ atomic_t i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
struct mutex i_mmap_mutex; /* protect tree, count, list */
@@ -470,10 +470,35 @@ static inline int mapping_mapped(struct address_space *mapping)
* Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
* marks vma as VM_SHARED if it is shared, and the file was opened for
* writing i.e. vma may be mprotected writable even if now readonly.
+ *
+ * If i_mmap_writable is negative, no new writable mappings are allowed. You
+ * can only deny writable mappings, if none exists right now.
*/
static inline int mapping_writably_mapped(struct address_space *mapping)
{
- return mapping->i_mmap_writable != 0;
+ return atomic_read(&mapping->i_mmap_writable) > 0;
+}
+
+static inline int mapping_map_writable(struct address_space *mapping)
+{
+ return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
+ 0 : -EPERM;
+}
+
+static inline void mapping_unmap_writable(struct address_space *mapping)
+{
+ atomic_dec(&mapping->i_mmap_writable);
+}
+
+static inline int mapping_deny_writable(struct address_space *mapping)
+{
+ return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
+ 0 : -EBUSY;
+}
+
+static inline void mapping_allow_writable(struct address_space *mapping)
+{
+ atomic_inc(&mapping->i_mmap_writable);
}

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1..f1f127e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -421,7 +421,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
atomic_dec(&inode->i_writecount);
mutex_lock(&mapping->i_mmap_mutex);
if (tmp->vm_flags & VM_SHARED)
- mapping->i_mmap_writable++;
+ atomic_inc(&mapping->i_mmap_writable);
flush_dcache_mmap_lock(mapping);
/* insert tmp into the share list, just after mpnt */
if (unlikely(tmp->vm_flags & VM_NONLINEAR))
diff --git a/mm/mmap.c b/mm/mmap.c
index 129b847..19b6562 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -216,7 +216,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&file_inode(file)->i_writecount);
if (vma->vm_flags & VM_SHARED)
- mapping->i_mmap_writable--;
+ mapping_unmap_writable(mapping);

flush_dcache_mmap_lock(mapping);
if (unlikely(vma->vm_flags & VM_NONLINEAR))
@@ -617,7 +617,7 @@ static void __vma_link_file(struct vm_area_struct *vma)
if (vma->vm_flags & VM_DENYWRITE)
atomic_dec(&file_inode(file)->i_writecount);
if (vma->vm_flags & VM_SHARED)
- mapping->i_mmap_writable++;
+ atomic_inc(&mapping->i_mmap_writable);

flush_dcache_mmap_lock(mapping);
if (unlikely(vma->vm_flags & VM_NONLINEAR))
@@ -1572,6 +1572,11 @@ munmap_back:
if (error)
goto free_vma;
}
+ if (vm_flags & VM_SHARED) {
+ error = mapping_map_writable(file->f_mapping);
+ if (error)
+ goto allow_write_and_free_vma;
+ }
vma->vm_file = get_file(file);
error = file->f_op->mmap(file, vma);
if (error)
@@ -1611,8 +1616,12 @@ munmap_back:

vma_link(mm, vma, prev, rb_link, rb_parent);
/* Once vma denies write, undo our temporary denial count */
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
+ if (file) {
+ if (vm_flags & VM_SHARED)
+ mapping_unmap_writable(file->f_mapping);
+ if (vm_flags & VM_DENYWRITE)
+ allow_write_access(file);
+ }
file = vma->vm_file;
out:
perf_event_mmap(vma);
@@ -1641,14 +1650,17 @@ out:
return addr;

unmap_and_free_vma:
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
vma->vm_file = NULL;
fput(file);

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
+ if (vm_flags & VM_SHARED)
+ mapping_unmap_writable(file->f_mapping);
+allow_write_and_free_vma:
+ if (vm_flags & VM_DENYWRITE)
+ allow_write_access(file);
free_vma:
kmem_cache_free(vm_area_cachep, vma);
unacct_error:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2972eee..31321fa 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -39,6 +39,7 @@ static struct backing_dev_info swap_backing_dev_info = {
struct address_space swapper_spaces[MAX_SWAPFILES] = {
[0 ... MAX_SWAPFILES - 1] = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
+ .i_mmap_writable = ATOMIC_INIT(0),
.a_ops = &swap_aops,
.backing_dev_info = &swap_backing_dev_info,
}
--
2.0.0

Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

Hi David,

On Fri, Jun 13, 2014 at 12:36 PM, David Herrmann <[email protected]> wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It can support sealing and avoids any
> connection to user-visible mount-points. Thus, it's not subject to quotas
> on mounted file-systems, but can be used like malloc()'ed memory, but
> with a file-descriptor to it.
>
> memfd_create() returns the raw shmem file, so calls like ftruncate() can
> be used to modify the underlying inode. Also calls like fstat()
> will return proper information and mark the file as regular file. If you
> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
> supported (like on all other regular files).
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike. It is still properly accounted to memcg
> limits, though.

Where do I find / is there detailed documentation (ideally, a man
page) for this new system call?

Cheers,

Michael


>
> Signed-off-by: David Herrmann <[email protected]>
> ---
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/linux/memfd.h | 8 +++++
> kernel/sys_ni.c | 1 +
> mm/shmem.c | 72 ++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 84 insertions(+)
> create mode 100644 include/uapi/linux/memfd.h
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..e7495b4 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
> 351 i386 sched_setattr sys_sched_setattr
> 352 i386 sched_getattr sys_sched_getattr
> 353 i386 renameat2 sys_renameat2
> +354 i386 memfd_create sys_memfd_create
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..28be0e1 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> 316 common renameat2 sys_renameat2
> +317 common memfd_create sys_memfd_create
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..0be5d4d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
> asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
> asmlinkage long sys_eventfd(unsigned int count);
> asmlinkage long sys_eventfd2(unsigned int count, int flags);
> +asmlinkage long sys_memfd_create(const char *uname_ptr, unsigned int flags);
> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
> asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> new file mode 100644
> index 0000000..534e364
> --- /dev/null
> +++ b/include/uapi/linux/memfd.h
> @@ -0,0 +1,8 @@
> +#ifndef _UAPI_LINUX_MEMFD_H
> +#define _UAPI_LINUX_MEMFD_H
> +
> +/* flags for memfd_create(2) (unsigned int) */
> +#define MFD_CLOEXEC 0x0001U
> +#define MFD_ALLOW_SEALING 0x0002U
> +
> +#endif /* _UAPI_LINUX_MEMFD_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..489a4e6 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime);
> cond_syscall(compat_sys_timerfd_gettime);
> cond_syscall(sys_eventfd);
> cond_syscall(sys_eventfd2);
> +cond_syscall(sys_memfd_create);
>
> /* performance counters: */
> cond_syscall(sys_perf_event_open);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1438b3e..e7c5fe1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
> #include <linux/highmem.h>
> #include <linux/seq_file.h>
> #include <linux/magic.h>
> +#include <linux/syscalls.h>
> #include <linux/fcntl.h>
> +#include <uapi/linux/memfd.h>
>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> @@ -2662,6 +2664,76 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
> shmem_show_mpol(seq, sbinfo->mpol);
> return 0;
> }
> +
> +#define MFD_NAME_PREFIX "memfd:"
> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> +
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
> +
> +SYSCALL_DEFINE2(memfd_create,
> + const char*, uname,
> + unsigned int, flags)
> +{
> + struct shmem_inode_info *info;
> + struct file *file;
> + int fd, error;
> + char *name;
> + long len;
> +
> + if (flags & ~(unsigned int)MFD_ALL_FLAGS)
> + return -EINVAL;
> +
> + /* length includes terminating zero */
> + len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> + if (len <= 0)
> + return -EFAULT;
> + if (len > MFD_NAME_MAX_LEN + 1)
> + return -EINVAL;
> +
> + name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY);
> + if (!name)
> + return -ENOMEM;
> +
> + strcpy(name, MFD_NAME_PREFIX);
> + if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
> + error = -EFAULT;
> + goto err_name;
> + }
> +
> + /* terminating-zero may have changed after strnlen_user() returned */
> + if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
> + error = -EFAULT;
> + goto err_name;
> + }
> +
> + fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
> + if (fd < 0) {
> + error = fd;
> + goto err_name;
> + }
> +
> + file = shmem_file_setup(name, 0, VM_NORESERVE);
> + if (IS_ERR(file)) {
> + error = PTR_ERR(file);
> + goto err_fd;
> + }
> + info = SHMEM_I(file_inode(file));
> + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> + if (flags & MFD_ALLOW_SEALING)
> + info->seals &= ~F_SEAL_SEAL;
> +
> + fd_install(fd, file);
> + kfree(name);
> + return fd;
> +
> +err_fd:
> + put_unused_fd(fd);
> +err_name:
> + kfree(name);
> + return error;
> +}
> +
> #endif /* CONFIG_TMPFS */
>
> static void shmem_put_super(struct super_block *sb)
> --
> 2.0.0
>



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-13 12:41:26

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

Hi

On Fri, Jun 13, 2014 at 2:27 PM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> Hi David,
>
> On Fri, Jun 13, 2014 at 12:36 PM, David Herrmann <[email protected]> wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It can support sealing and avoids any
>> connection to user-visible mount-points. Thus, it's not subject to quotas
>> on mounted file-systems, but can be used like malloc()'ed memory, but
>> with a file-descriptor to it.
>>
>> memfd_create() returns the raw shmem file, so calls like ftruncate() can
>> be used to modify the underlying inode. Also calls like fstat()
>> will return proper information and mark the file as regular file. If you
>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>> supported (like on all other regular files).
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike. It is still properly accounted to memcg
>> limits, though.
>
> Where do I find / is there detailed documentation (ideally, a man
> page) for this new system call?

I wrote a man-page proposal for memfd_create() and a patch for fcntl()
for v1; however, the API changed several times, so I didn't keep them
up to date (the man-page patches are on LKML). I also wrote a short
introduction to memfd+sealing v3, which I recommend reading first:
http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

This explains the idea behind the new API and describes almost all
aspects of it. It's up-to-date with v3, and I will use it to write the
final man-pages once Hugh and Andrew have ACKed the patches. Let me know
if anything is unclear.

Thanks for looking at it!
David

Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

Hi David,

On Fri, Jun 13, 2014 at 2:41 PM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 2:27 PM, Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>> Hi David,
>>
>> On Fri, Jun 13, 2014 at 12:36 PM, David Herrmann <[email protected]> wrote:
>>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>>> that you can pass to mmap(). It can support sealing and avoids any
>>> connection to user-visible mount-points. Thus, it's not subject to quotas
>>> on mounted file-systems, but can be used like malloc()'ed memory, but
>>> with a file-descriptor to it.
>>>
>>> memfd_create() returns the raw shmem file, so calls like ftruncate() can
>>> be used to modify the underlying inode. Also calls like fstat()
>>> will return proper information and mark the file as regular file. If you
>>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>>> supported (like on all other regular files).
>>>
>>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>>> subject to quotas and alike. It is still properly accounted to memcg
>>> limits, though.
>>
>> Where do I find / is there detailed documentation (ideally, a man
>> page) for this new system call?
>
> I did write a man-page proposal for memfd_create() and a patch for
> fcntl() for v1,

Ahh -- that's why I had a recollection of such a page ;-).

> however, the API changed several times so I didn't
> keep them up to date (the man-page patches are on LKML). However, I
> wrote a short introduction to memfd+sealing v3, that I recommend
> reading first:
> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Yes, I saw it already. (It's good, but I want more.)

> This explains the idea behind the new API and describes almost all
> aspects of it. It's up-to-date to v3 and I will use it to write the
> final man-pages once Hugh and Andrew ACKed the patches. Let me know if
> anything is unclear.

The general notion these days is that a (comprehensive) manual page
_should_ come *with* the system call, rather than after the fact. And
there's a lot of value in that. I've found no end of bugs and design
errors while writing (comprehensive) man pages after the fact (by
which time it's too late to fix the design errors), and also found
quite a few of those issues when I've managed to work with folk at the
same time as they write the syscall. Bottom line: you really should
write formal documentation now, as part of the process of code
submission. It improves the chance of finding implementation and
design bugs, and may well widen your circle of reviewers.

Cheers,

Michael

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-13 15:06:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
> When setting SEAL_WRITE, we must make sure nobody has a writable reference
> to the pages (via GUP or similar). We currently check references and wait
> some time for them to be dropped. This, however, might fail for several
> reasons, including:
> - the page is pinned for longer than we wait
> - while we wait, someone takes an already pinned page for read-access
>
> Therefore, this patch introduces page-isolation. When sealing a file with
> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
> is put in place atomically, the old page is detached and left alone. It
> will get reclaimed once the last external user dropped it.
>
> Signed-off-by: David Herrmann <[email protected]>

Won't this have unexpected effects?

Thread 1: start read into mapping backed by fd

Thread 2: SEAL_WRITE

Thread 1: read finishes. now the page doesn't match the sealed page

Is this okay? Or am I missing something?

Are there really things that keep unnecessary writable pins around?

--Andy

2014-06-13 15:11:22

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
> a longer introduction at gmane:
> http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
> https://lwn.net/Articles/593918/
> v2 with some more discussions can be found here:
> http://thread.gmane.org/gmane.linux.kernel.mm/115713
>
> This series introduces two new APIs:
> memfd_create(): Think of this syscall as malloc() but it returns a
> file-descriptor instead of a pointer. That file-descriptor is
> backed by anon-memory and can be memory-mapped for access.
> sealing: The sealing API can be used to prevent a specific set of operations
> on a file-descriptor. You 'seal' the file and give thus the
> guarantee, that it cannot be modified in the specific ways.
>
> A short high-level introduction is also available here:
> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Potentially silly question: is it guaranteed that mmapping and reading
a SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should
this be documented? (The particular issue here would be reading
holes. It should work by using the zero page, but, if so, we should
probably make it a real documented guarantee.)

--Andy

2014-06-13 15:15:44

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Fri, Jun 13, 2014 at 5:10 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
>> [...]
>
> Potentially silly question: is it guaranteed that mmapping and reading
> a SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should
> this be documented? (The particular issue here would be reading
> holes. It should work by using the zero page, but, if so, we should
> probably make it a real documented guarantee.)

No, this is not guaranteed. See the previous discussion in v2 on Patch
2/4 between Hugh and me.

Summary is: If you want mmap-reads to not fail, use mlock(). There are
many situations where a fault might fail (think: OOM) and sealing is
not meant to protect against that. Btw., holes are automatically
filled with fresh pages by shmem. So a read only fails in OOM
situations (or memcg limits, etc.).

Thanks
David

2014-06-13 15:18:17

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Fri, Jun 13, 2014 at 8:15 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:10 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
>>> Hi
>>>
>>> This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
>>> a longer introduction at gmane:
>>> http://thread.gmane.org/gmane.comp.video.dri.devel/102241
>>> An LWN article about memfd+sealing is available, too:
>>> https://lwn.net/Articles/593918/
>>> v2 with some more discussions can be found here:
>>> http://thread.gmane.org/gmane.linux.kernel.mm/115713
>>>
>>> This series introduces two new APIs:
>>> memfd_create(): Think of this syscall as malloc() but it returns a
>>> file-descriptor instead of a pointer. That file-descriptor is
>>> backed by anon-memory and can be memory-mapped for access.
>>> sealing: The sealing API can be used to prevent a specific set of operations
>>> on a file-descriptor. You 'seal' the file and give thus the
>>> guarantee, that it cannot be modified in the specific ways.
>>>
>>> A short high-level introduction is also available here:
>>> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/
>>
>> Potentially silly question: is it guaranteed that mmapping and reading
>> a SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should
>> this be documented? (The particular issue here would be reading
>> holes. It should work by using the zero page, but, if so, we should
>> probably make it a real documented guarantee.)
>
> No, this is not guaranteed. See the previous discussion in v2 on Patch
> 2/4 between Hugh and me.
>
> Summary is: If you want mmap-reads to not fail, use mlock(). There are
> many situations where a fault might fail (think: OOM) and sealing is
> not meant to protect against that. Btw., holes are automatically
> filled with fresh pages by shmem. So a read only fails in OOM
> situations (or memcg limits, etc.).
>

Isn't the point of SEAL_SHRINK to allow servers to mmap and read
safely without worrying about SIGBUS?

--Andy

> Thanks
> David



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-13 15:27:17

by David Herrmann

[permalink] [raw]
Subject: Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

Hi

On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>> to the pages (via GUP or similar). We currently check references and wait
>> some time for them to be dropped. This, however, might fail for several
>> reasons, including:
>> - the page is pinned for longer than we wait
>> - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with
>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>> is put in place atomically, the old page is detached and left alone. It
>> will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> Won't this have unexpected effects?
>
> Thread 1: start read into mapping backed by fd
>
> Thread 2: SEAL_WRITE
>
> Thread 1: read finishes. now the page doesn't match the sealed page

Just to be clear: you're talking about read() calls that write into
the memfd? (like my FUSE example does) Your language might be
ambiguous to others as "read into" actually implies a write.

No, this does not have unexpected effects. But yes, your conclusion is
right. To be clear, this behavior would be part of the API. Any
asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
buffer before the write finishes. But you actually have to extend your
example:

Thread 1: p = mmap(memfd, SIZE);
Thread 1: h = async_read(some_fd, p, SIZE);
Thread 1: munmap(p, SIZE);
Thread 2: SEAL_WRITE
Thread 1: async_wait(h);

If you don't do the unmap(), then SEAL_WRITE will fail due to an
elevated i_mmap_writable. I think this is fine. In fact, I remember
reading that async-IO is not required to resolve user-space addresses
at the time of the syscall, but might delay it to the time of the
actual write. But you're right, it would be misleading for the AIO
operation to return success. This would be part of the memfd-API,
though. And if you mess with your address space while running an
async-IO operation on it, you're screwed anyway.

Btw., your sealing use-case is really odd. No-one guarantees that the
SEAL_WRITE happens _after_ you schedule your async-read. In case you
have some synchronization there, you just have to move it after
waiting for your async-io to finish.

Does that clear things up?
Thanks
David

2014-06-13 15:33:59

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Fri, Jun 13, 2014 at 5:17 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 13, 2014 at 8:15 AM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> On Fri, Jun 13, 2014 at 5:10 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
>>>> Hi
>>>>
>>>> This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
>>>> a longer introduction at gmane:
>>>> http://thread.gmane.org/gmane.comp.video.dri.devel/102241
>>>> An LWN article about memfd+sealing is available, too:
>>>> https://lwn.net/Articles/593918/
>>>> v2 with some more discussions can be found here:
>>>> http://thread.gmane.org/gmane.linux.kernel.mm/115713
>>>>
>>>> This series introduces two new APIs:
>>>> memfd_create(): Think of this syscall as malloc() but it returns a
>>>> file-descriptor instead of a pointer. That file-descriptor is
>>>> backed by anon-memory and can be memory-mapped for access.
>>>> sealing: The sealing API can be used to prevent a specific set of operations
>>>> on a file-descriptor. You 'seal' the file and give thus the
>>>> guarantee, that it cannot be modified in the specific ways.
>>>>
>>>> A short high-level introduction is also available here:
>>>> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/
>>>
>>> Potentially silly question: is it guaranteed that mmapping and reading
>>> a SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should
>>> this be documented? (The particular issue here would be reading
>>> holes. It should work by using the zero page, but, if so, we should
>>> probably make it a real documented guarantee.)
>>
>> No, this is not guaranteed. See the previous discussion in v2 on Patch
>> 2/4 between Hugh and me.
>>
>> Summary is: If you want mmap-reads to not fail, use mlock(). There are
>> many situations where a fault might fail (think: OOM) and sealing is
>> not meant to protect against that. Btw., holes are automatically
>> filled with fresh pages by shmem. So a read only fails in OOM
>> situations (or memcg limits, etc.).
>>
>
> Isn't the point of SEAL_SHRINK to allow servers to mmap and read
> safely without worrying about SIGBUS?

No, I don't think so.
The point of SEAL_SHRINK is to prevent a file from shrinking. SIGBUS
is an effect, not a cause. It's only a coincidence that "OOM during
reads" and "reading beyond file-boundaries" have the same effect:
SIGBUS.
We only protect against reading beyond file-boundaries due to
shrinking. Therefore, OOM-SIGBUS is unrelated to SEAL_SHRINK.

Anyone dealing with mmap() _has_ to use mlock() to protect against
OOM-SIGBUS. Making SEAL_SHRINK protect against OOM-SIGBUS would be
redundant, because you can achieve the same with SEAL_SHRINK+mlock().

Thanks
David

2014-06-13 16:20:25

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

On Fri, Jun 13, 2014 at 7:20 AM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> The general notion these days is that a (comprehensive) manual page
> _should_ come *with* the system call, rather than after the fact. And
> there's a lot of value in that. I've found no end of bugs and design
> errors while writing (comprehensive) man pages after the fact (by
> which time it's too late to fix the design errors), and also found
> quite a few of those issues when I've managed to work with folk at the
> same time as they write the syscall. Bottom line: you really should
> write formal documentation now, as part of the process of code
> submission. It improves the chance of finding implementation and
> design bugs, and may well widen your circle of reviewers.

I very much agree here. One practical issue I've noticed is that
having separate targets for both the code changes and the manpages can
be an extra barrier for folks getting changes correctly documented as
the change is being submitted. Reviewers may say "be sure to send
updates to the man pages" but it's not always easy to remember to
follow up and make sure the submitter sent the changes (matching the
merged patches) to you as well.

I've been thinking it might be nice to have the kernel syscall man
pages included in the kernel source tree, then have them
copied/imported over to the man-pages project (similar to how glibc
imports uapi kernel headers). They could even be kept in the
include/uapi directory, and checkpatch could ensure that changes that
touch include/uapi also have modifications to something in the
manpages directory. This way folks would be able to include the man
page change with the code change, making it easier for developers to
do the right thing, making it easier for reviewers to ensure it's
correct, and making it easier for maintainers to ensure man page
documentation is properly in sync.

Or is this something that has been hashed over already? I do admit
this would disrupt your process a bit.

thanks
-john

2014-06-13 17:23:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

On Fri, Jun 13, 2014 at 8:27 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann <[email protected]> wrote:
>>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>>> to the pages (via GUP or similar). We currently check references and wait
>>> some time for them to be dropped. This, however, might fail for several
>>> reasons, including:
>>> - the page is pinned for longer than we wait
>>> - while we wait, someone takes an already pinned page for read-access
>>>
>>> Therefore, this patch introduces page-isolation. When sealing a file with
>>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>>> is put in place atomically, the old page is detached and left alone. It
>>> will get reclaimed once the last external user dropped it.
>>>
>>> Signed-off-by: David Herrmann <[email protected]>
>>
>> Won't this have unexpected effects?
>>
>> Thread 1: start read into mapping backed by fd
>>
>> Thread 2: SEAL_WRITE
>>
>> Thread 1: read finishes. now the page doesn't match the sealed page
>
> Just to be clear: you're talking about read() calls that write into
> the memfd? (like my FUSE example does) Your language might be
> ambiguous to others as "read into" actually implies a write.
>
> No, this does not have unexpected effects. But yes, your conclusion is
> right. To be clear, this behavior would be part of the API. Any
> asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
> buffer before the write finishes. But you actually have to extend your
> example:
>
> Thread 1: p = mmap(memfd, SIZE);
> Thread 1: h = async_read(some_fd, p, SIZE);
> Thread 1: munmap(p, SIZE);
> Thread 2: SEAL_WRITE
> Thread 1: async_wait(h);
>
> If you don't do the unmap(), then SEAL_WRITE will fail due to an
> elevated i_mmap_writable. I think this is fine. In fact, I remember
> reading that async-IO is not required to resolve user-space addresses
> at the time of the syscall, but might delay it to the time of the
> actual write. But you're right, it would be misleading that the AIO
> operation returns success. This would be part of the memfd-API,
> though. And if you mess with your address space while running an
> async-IO operation on it, you're screwed anyway.

Ok, I missed the part where you had to munmap to trigger the oddity.
That seems fine to me.

>
> Btw., your sealing use-case is really odd. No-one guarantees that the
> SEAL_WRITE happens _after_ you schedule your async-read. In case you
> have some synchronization there, you just have to move it after
> waiting for your async-io to finish.
>
> Does that clear things up?

I think so.

--Andy

2014-06-15 10:50:22

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

On Fri, Jun 13, 2014 at 12:36:55PM +0200, David Herrmann wrote:
> +asmlinkage long sys_memfd_create(const char *uname_ptr, unsigned int flags);
[...]
> +SYSCALL_DEFINE2(memfd_create,
> + const char*, uname,
> + unsigned int, flags)
> +{

Shouldn't uname_ptr and uname be "const char __user *" instead of "const char *"?
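
For reference, the annotated form Jann is asking for would look roughly like the following (a sketch matching how the declaration eventually landed, not a verbatim quote of the v3 patch):

```c
/* With the __user annotation, sparse can check the pointer domain. */
asmlinkage long sys_memfd_create(const char __user *uname_ptr,
				 unsigned int flags);

SYSCALL_DEFINE2(memfd_create,
		const char __user *, uname,
		unsigned int, flags)
{
	/* ... */
}
```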


Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

On 06/13/2014 06:20 PM, John Stultz wrote:
> On Fri, Jun 13, 2014 at 7:20 AM, Michael Kerrisk (man-pages)
> <[email protected]> wrote:
>>
>> The general notion these days is that a (comprehensive) manual page
>> _should_ come *with* the system call, rather than after the fact. And
>> there's a lot of value in that. I've found no end of bugs and design
>> errors while writing (comprehensive) man pages after the fact (by
>> which time it's too late to fix the design errors), and also found
>> quite a few of those issues when I've managed to work with folk at the
>> same time as they write the syscall. Bottom line: you really should
>> write formal documentation now, as part of the process of code
>> submission. It improves the chance of finding implementation and
>> design bugs, and may well widen your circle of reviewers.
>
> I very much agree here. One practical issue I've noticed is that
> having separate targets for both the code changes and the manpages can
> be an extra barrier for folks getting changes correctly documented as
> the change is being submitted. Reviewers may say "be sure to send
> updates to the man pages" but its not always easy to remember to
> follow up and make sure the submitter got the changes (which match the
> merged patches) to you as well.
>
> I've been thinking it might be nice to have the kernel syscall man
> pages included in the kernel source tree, then have them
> copied/imported over to the man-pages project (similar to how glibc
> imports uapi kernel headers). They could even be kept in the
> include/uapi directory, and checkpatch could ensure that changes that
> touch include/uapi also have modifications to something in the
> manpages directory. This way folks would be able to include the man
> page change with the code change, making it easier for developers to
> do the right thing, making it easier for reviewers to ensure its
> correct, and making it easier for maintainers to ensure man page
> documentation is properly in sync.
>
> Or is this something that has been hashed over already? I do admit
> this would disrupt your process a bit.

It's more or less a FAQ from my point of view, so I wrote this:
https://www.kernel.org/doc/man-pages/todo.html#migrate_to_kernel_source

In short, I agree that the current process is not optimal, but lacking
(a lot) more time, it'd be hard to make any change to the current
process. In any case, I think there's room for a lot of improvement
even without changing the current process. (For example, while I
agree that having man pages in a separate location from the kernel
source does create some barriers, I don't think it's the reason
most developers don't update the man pages. One just has to
look at the patchy state of Documentation/filesystems/proc.txt as one
example to support that viewpoint.)

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-17 09:54:38

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On 06/13/2014 05:33 PM, David Herrmann wrote:
> On Fri, Jun 13, 2014 at 5:17 PM, Andy Lutomirski <[email protected]> wrote:
>> Isn't the point of SEAL_SHRINK to allow servers to mmap and read
>> safely without worrying about SIGBUS?
>
> No, I don't think so.
> The point of SEAL_SHRINK is to prevent a file from shrinking. SIGBUS
> is an effect, not a cause. It's only a coincidence that "OOM during
> reads" and "reading beyond file-boundaries" has the same effect:
> SIGBUS.
> We only protect against reading beyond file-boundaries due to
> shrinking. Therefore, OOM-SIGBUS is unrelated to SEAL_SHRINK.
>
> Anyone dealing with mmap() _has_ to use mlock() to protect against
> OOM-SIGBUS. Making SEAL_SHRINK protect against OOM-SIGBUS would be
> redundant, because you can achieve the same with SEAL_SHRINK+mlock().

I don't think this is what potential users expect because mlock requires
capabilities which are not available to them.

A couple of weeks ago, sealing was to be applied to anonymous shared
memory. Has this changed? Why should *reading* it trigger OOM?

--
Florian Weimer / Red Hat Product Security Team

2014-06-17 10:01:58

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Tue, Jun 17, 2014 at 11:54 AM, Florian Weimer <[email protected]> wrote:
> On 06/13/2014 05:33 PM, David Herrmann wrote:
>>
>> On Fri, Jun 13, 2014 at 5:17 PM, Andy Lutomirski <[email protected]>
>> wrote:
>>>
>>> Isn't the point of SEAL_SHRINK to allow servers to mmap and read
>>> safely without worrying about SIGBUS?
>>
>>
>> No, I don't think so.
>> The point of SEAL_SHRINK is to prevent a file from shrinking. SIGBUS
>> is an effect, not a cause. It's only a coincidence that "OOM during
>> reads" and "reading beyond file-boundaries" has the same effect:
>> SIGBUS.
>> We only protect against reading beyond file-boundaries due to
>> shrinking. Therefore, OOM-SIGBUS is unrelated to SEAL_SHRINK.
>>
>> Anyone dealing with mmap() _has_ to use mlock() to protect against
>> OOM-SIGBUS. Making SEAL_SHRINK protect against OOM-SIGBUS would be
>> redundant, because you can achieve the same with SEAL_SHRINK+mlock().
>
>
> I don't think this is what potential users expect because mlock requires
> capabilities which are not available to them.
>
> A couple of weeks ago, sealing was to be applied to anonymous shared memory.
> Has this changed? Why should *reading* it trigger OOM?

The file might have holes, therefore, you'd have to allocate backing
pages. This might hit a soft-limit and fail. To avoid this, use
fallocate() to allocate pages prior to mmap() or mlock() to make the
kernel lock them in memory.

Thanks
David

2014-06-17 10:05:17

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On 06/17/2014 12:01 PM, David Herrmann wrote:

>> I don't think this is what potential users expect because mlock requires
>> capabilities which are not available to them.
>>
>> A couple of weeks ago, sealing was to be applied to anonymous shared memory.
>> Has this changed? Why should *reading* it trigger OOM?
>
> The file might have holes, therefore, you'd have to allocate backing
> pages. This might hit a soft-limit and fail. To avoid this, use
> fallocate() to allocate pages prior to mmap()

This does not work because the consuming side does not know how the
descriptor was set up if sealing does not imply that.

> or mlock() to make the kernel lock them in memory.

See above for why that does not work.

I think you should eliminate the holes on sealing and report ENOMEM
there if necessary.

--
Florian Weimer / Red Hat Product Security Team

2014-06-17 10:10:23

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Tue, Jun 17, 2014 at 12:04 PM, Florian Weimer <[email protected]> wrote:
> On 06/17/2014 12:01 PM, David Herrmann wrote:
>
>>> I don't think this is what potential users expect because mlock requires
>>> capabilities which are not available to them.
>>>
>>> A couple of weeks ago, sealing was to be applied to anonymous shared
>>> memory.
>>> Has this changed? Why should *reading* it trigger OOM?
>>
>> The file might have holes, therefore, you'd have to allocate backing
>> pages. This might hit a soft-limit and fail. To avoid this, use
>> fallocate() to allocate pages prior to mmap()
>
> This does not work because the consuming side does not know how the
> descriptor was set up if sealing does not imply that.

The consuming side has to verify the seals via F_GET_SEALS. After
that, it shall do a simple fallocate() on the whole file if it wants
to make sure that all pages are allocated. Why shouldn't that be
possible? Please elaborate.

Thanks
David

2014-06-17 12:13:59

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On 06/17/2014 12:10 PM, David Herrmann wrote:

>>> The file might have holes, therefore, you'd have to allocate backing
>>> pages. This might hit a soft-limit and fail. To avoid this, use
>>> fallocate() to allocate pages prior to mmap()
>>
>> This does not work because the consuming side does not know how the
>> descriptor was set up if sealing does not imply that.
>
> The consuming side has to very seals via F_GET_SEALS. After that, it
> shall do a simple fallocate() on the whole file if it wants to go sure
> that all pages are allocated. Why shouldn't that be possible? Please
> elaborate.

Hmm. You permit general fallocate even for WRITE seals. That's really
unexpected.

The inode_newsize_ok check in shmem_fallocate can result in SIGXFSZ,
which doesn't seem to be what's intended here.

Will the new pages be attributed to the process calling fallocate, or
to the process calling memfd_create?

--
Florian Weimer / Red Hat Product Security Team

2014-06-17 13:26:26

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Tue, Jun 17, 2014 at 2:13 PM, Florian Weimer <[email protected]> wrote:
> On 06/17/2014 12:10 PM, David Herrmann wrote:
>
>>>> The file might have holes, therefore, you'd have to allocate backing
>>>> pages. This might hit a soft-limit and fail. To avoid this, use
>>>> fallocate() to allocate pages prior to mmap()
>>>
>>>
>>> This does not work because the consuming side does not know how the
>>> descriptor was set up if sealing does not imply that.
>>
>>
>> The consuming side has to very seals via F_GET_SEALS. After that, it
>> shall do a simple fallocate() on the whole file if it wants to go sure
>> that all pages are allocated. Why shouldn't that be possible? Please
>> elaborate.
>
>
> Hmm. You permit general fallocate even for WRITE seals. That's really
> unexpected.

SEAL_WRITE prevents modifications of file-content. fallocate() does
not modify file-contents, so I think it's not unexpected that
fallocate() is still allowed.

> The inode_newsize_ok check in shmem_fallocate can result in SIGXFSZ, which
> doesn't seem to be what's intended here.

It can only result in SIGXFSZ if you _increase_ the file-size with
fallocate(). You shouldn't do that if you only verify that holes are
allocated. Hence, a simple fallocate(st.st_size) cannot result in
SIGXFSZ. Obviously, this requires SEAL_SHRINK to prevent the remote
side from shrinking the file while you call fallocate(). But SEAL_WRITE
usually goes together with SEAL_SHRINK for obvious reasons.

> Will the new pages attributed to the process calling fallocate, or to the
> process calling memfd_create?

Pages are always allocated by the caller and charged to current->mm
(the current process).

Thanks
David

2014-06-17 16:36:59

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
> Can you summarize why holes can't be reliably backed by the zero page?

To answer this, I will quote Hugh from "PATCH v2 1/3":

> We do already use the ZERO_PAGE instead of allocating when it's a
> simple read; and on the face of it, we could extend that to mmap
> once the file is sealed. But I am rather afraid to do so - for
> many years there was an mmap /dev/zero case which did that, but
> it was an easily forgotten case which caught us out at least
> once, so I'm reluctant to reintroduce it now for sealing.
>
> Anyway, I don't expect you to resolve the issue of sealed holes:
> that's very much my territory, to give you support on.

Holes can be avoided with a simple fallocate(). I don't understand why
I should make SEAL_WRITE do the fallocate for the caller. During the
discussion of memfd_create() I was told to drop the "size" parameter,
because it is redundant. I don't see why this implicit fallocate()
doesn't fall into the same category.

Thanks
David

2014-06-17 16:41:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Tue, Jun 17, 2014 at 9:36 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
>> Can you summarize why holes can't be reliably backed by the zero page?
>
> To answer this, I will quote Hugh from "PATCH v2 1/3":
>
>> We do already use the ZERO_PAGE instead of allocating when it's a
>> simple read; and on the face of it, we could extend that to mmap
>> once the file is sealed. But I am rather afraid to do so - for
>> many years there was an mmap /dev/zero case which did that, but
>> it was an easily forgotten case which caught us out at least
>> once, so I'm reluctant to reintroduce it now for sealing.
>>
>> Anyway, I don't expect you to resolve the issue of sealed holes:
>> that's very much my territory, to give you support on.
>
> Holes can be avoided with a simple fallocate(). I don't understand why
> I should make SEAL_WRITE do the fallocate for the caller. During the
> discussion of memfd_create() I was told to drop the "size" parameter,
> because it is redundant. I don't see how this implicit fallocate()
> does not fall into the same category?
>

I'm really confused now.

If I SEAL_WRITE a file, and then I mmap it PROT_READ, and then I read
it, is that a "simple read"? If so, doesn't that mean that there's no
problem?

--Andy

2014-06-17 16:51:23

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Tue, Jun 17, 2014 at 6:41 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Jun 17, 2014 at 9:36 AM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
>>> Can you summarize why holes can't be reliably backed by the zero page?
>>
>> To answer this, I will quote Hugh from "PATCH v2 1/3":
>>
>>> We do already use the ZERO_PAGE instead of allocating when it's a
>>> simple read; and on the face of it, we could extend that to mmap
>>> once the file is sealed. But I am rather afraid to do so - for
>>> many years there was an mmap /dev/zero case which did that, but
>>> it was an easily forgotten case which caught us out at least
>>> once, so I'm reluctant to reintroduce it now for sealing.
>>>
>>> Anyway, I don't expect you to resolve the issue of sealed holes:
>>> that's very much my territory, to give you support on.
>>
>> Holes can be avoided with a simple fallocate(). I don't understand why
>> I should make SEAL_WRITE do the fallocate for the caller. During the
>> discussion of memfd_create() I was told to drop the "size" parameter,
>> because it is redundant. I don't see how this implicit fallocate()
>> does not fall into the same category?
>>
>
> I'm really confused now.
>
> If I SEAL_WRITE a file, and then I mmap it PROT_READ, and then I read
> it, is that a "simple read"? If so, doesn't that mean that there's no
> problem?

I assumed Hugh was talking about read(). So no, this is not about
memory-reads on mmap()ed regions.

Looking at shmem_file_read_iter() I can see a ZERO_PAGE(0) call in
case shmem_getpage_gfp(SGP_READ) tells us there's a hole. I cannot see
anything like that in the mmap_region() and shmem_fault() paths.

Thanks
David

2014-06-17 17:01:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Tue, Jun 17, 2014 at 9:51 AM, David Herrmann <[email protected]> wrote:
> Hi
>
> On Tue, Jun 17, 2014 at 6:41 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Jun 17, 2014 at 9:36 AM, David Herrmann <[email protected]> wrote:
>>> Hi
>>>
>>> On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
>>>> Can you summarize why holes can't be reliably backed by the zero page?
>>>
>>> To answer this, I will quote Hugh from "PATCH v2 1/3":
>>>
>>>> We do already use the ZERO_PAGE instead of allocating when it's a
>>>> simple read; and on the face of it, we could extend that to mmap
>>>> once the file is sealed. But I am rather afraid to do so - for
>>>> many years there was an mmap /dev/zero case which did that, but
>>>> it was an easily forgotten case which caught us out at least
>>>> once, so I'm reluctant to reintroduce it now for sealing.
>>>>
>>>> Anyway, I don't expect you to resolve the issue of sealed holes:
>>>> that's very much my territory, to give you support on.
>>>
>>> Holes can be avoided with a simple fallocate(). I don't understand why
>>> I should make SEAL_WRITE do the fallocate for the caller. During the
>>> discussion of memfd_create() I was told to drop the "size" parameter,
>>> because it is redundant. I don't see how this implicit fallocate()
>>> does not fall into the same category?
>>>
>>
>> I'm really confused now.
>>
>> If I SEAL_WRITE a file, and then I mmap it PROT_READ, and then I read
>> it, is that a "simple read"? If so, doesn't that mean that there's no
>> problem?
>
> I assumed Hugh was talking about read(). So no, this is not about
> memory-reads on mmap()ed regions.
>
> Looking at shmem_file_read_iter() I can see a ZERO_PAGE(0) call in
> case shmem_getpage_gfp(SGP_READ) tells us there's a hole. I cannot see
> anything like that in the mmap_region() and shmem_fault() paths.

Would it be easy to fix this just for SEAL_WRITE files? Hugh?

This would make the interface much nicer, IMO.

--Andy

>
> Thanks
> David



--
Andy Lutomirski
AMA Capital Management, LLC

2014-06-17 20:32:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Tue, 17 Jun 2014, Andy Lutomirski wrote:
> On Tue, Jun 17, 2014 at 9:51 AM, David Herrmann <[email protected]> wrote:
> > On Tue, Jun 17, 2014 at 6:41 PM, Andy Lutomirski <[email protected]> wrote:
> >> On Tue, Jun 17, 2014 at 9:36 AM, David Herrmann <[email protected]> wrote:
> >>> On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
> >>>> Can you summarize why holes can't be reliably backed by the zero page?
> >>>
> >>> To answer this, I will quote Hugh from "PATCH v2 1/3":
> >>>
> >>>> We do already use the ZERO_PAGE instead of allocating when it's a
> >>>> simple read; and on the face of it, we could extend that to mmap
> >>>> once the file is sealed. But I am rather afraid to do so - for
> >>>> many years there was an mmap /dev/zero case which did that, but
> >>>> it was an easily forgotten case which caught us out at least
> >>>> once, so I'm reluctant to reintroduce it now for sealing.
> >>>>
> >>>> Anyway, I don't expect you to resolve the issue of sealed holes:
> >>>> that's very much my territory, to give you support on.
> >>>
> >>> Holes can be avoided with a simple fallocate(). I don't understand why
> >>> I should make SEAL_WRITE do the fallocate for the caller. During the
> >>> discussion of memfd_create() I was told to drop the "size" parameter,
> >>> because it is redundant. I don't see how this implicit fallocate()
> >>> does not fall into the same category?
> >>>
> >>
> >> I'm really confused now.
> >>
> >> If I SEAL_WRITE a file, and then I mmap it PROT_READ, and then I read
> >> it, is that a "simple read"? If so, doesn't that mean that there's no
> >> problem?
> >
> > I assumed Hugh was talking about read(). So no, this is not about
> > memory-reads on mmap()ed regions.
> >
> > Looking at shmem_file_read_iter() I can see a ZERO_PAGE(0) call in
> > case shmem_getpage_gfp(SGP_READ) tells us there's a hole. I cannot see
> > anything like that in the mmap_region() and shmem_fault() paths.
>
> Would it be easy to fix this just for SEAL_WRITE files? Hugh?
>
> This would make the interface much nicer, IMO.

I do agree with you, Andy.

I agree with David that a fallocate (of the fill-in-holes variety)
does not have to be prohibited on a sealed file, that detection of
holes is not an issue with respect to sealing, and that fallocate
by the recipient could be used to "post-seal" the object to safety.

But it doesn't feel right, and we shall be re-explaining and apologizing
for it for months to come, until we just fix it. I suspect David didn't
want to add a dependency upon me to fix it, and I didn't want to be
rushed into fixing it (nor is it a job I'd be comfortable to delegate).

I'll give it more thought. The problem is that there may be a variety
of codepaths, in mm/shmem.c but more seriously outside it, which expect
an appropriate page->mapping and page->index on any page of a shared
mapping, and will be buggily surprised to find a ZERO_PAGE instead.
I'll have to go through carefully. Splice may be more difficult to
audit than fault, I don't very often have to think about it.

And though I'd prefer to do the same for non-sealed as for sealed, it
may make more sense in the short term just to address the sealed case,
as you suggest. In the unsealed case, first write to a page entails
locating all the places where the ZERO_PAGE had previously been mapped,
and replacing it there by the newly allocated page; might depend on
VM_NONLINEAR removal, and might entail page_mkwrite(). Doing just
the sealed is easier, though the half-complete job will annoy me.

I did refresh my memory of the /dev/zero case that had particularly
worried me: it was stranger than I'd thought, that reading from
/dev/zero could insert ZERO_PAGEs into mappings of other files.
Nick put an end to that in 2.6.24, but perhaps its prior existence
helps give assurance that ZERO_PAGE in surprising places is less
trouble than I fear (it did force XIP into having its own zero_page,
but I don't remember other complications).

Hugh

2014-06-17 21:25:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Tue, Jun 17, 2014 at 1:31 PM, Hugh Dickins <[email protected]> wrote:
> On Tue, 17 Jun 2014, Andy Lutomirski wrote:
>> On Tue, Jun 17, 2014 at 9:51 AM, David Herrmann <[email protected]> wrote:
>> > On Tue, Jun 17, 2014 at 6:41 PM, Andy Lutomirski <[email protected]> wrote:
>> >> On Tue, Jun 17, 2014 at 9:36 AM, David Herrmann <[email protected]> wrote:
>> >>> On Tue, Jun 17, 2014 at 6:20 PM, Andy Lutomirski <[email protected]> wrote:
>> >>>> Can you summarize why holes can't be reliably backed by the zero page?
>> >>>
>> >>> To answer this, I will quote Hugh from "PATCH v2 1/3":
>> >>>
>> >>>> We do already use the ZERO_PAGE instead of allocating when it's a
>> >>>> simple read; and on the face of it, we could extend that to mmap
>> >>>> once the file is sealed. But I am rather afraid to do so - for
>> >>>> many years there was an mmap /dev/zero case which did that, but
>> >>>> it was an easily forgotten case which caught us out at least
>> >>>> once, so I'm reluctant to reintroduce it now for sealing.
>> >>>>
>> >>>> Anyway, I don't expect you to resolve the issue of sealed holes:
>> >>>> that's very much my territory, to give you support on.
>> >>>
>> >>> Holes can be avoided with a simple fallocate(). I don't understand why
>> >>> I should make SEAL_WRITE do the fallocate for the caller. During the
>> >>> discussion of memfd_create() I was told to drop the "size" parameter,
>> >>> because it is redundant. I don't see how this implicit fallocate()
>> >>> does not fall into the same category?
>> >>>
>> >>
>> >> I'm really confused now.
>> >>
>> >> If I SEAL_WRITE a file, and then I mmap it PROT_READ, and then I read
>> >> it, is that a "simple read"? If so, doesn't that mean that there's no
>> >> problem?
>> >
>> > I assumed Hugh was talking about read(). So no, this is not about
>> > memory-reads on mmap()ed regions.
>> >
>> > Looking at shmem_file_read_iter() I can see a ZERO_PAGE(0) call in
>> > case shmem_getpage_gfp(SGP_READ) tells us there's a hole. I cannot see
>> > anything like that in the mmap_region() and shmem_fault() paths.
>>
>> Would it be easy to fix this just for SEAL_WRITE files? Hugh?
>>
>> This would make the interface much nicer, IMO.
>
> I do agree with you, Andy.
>
> I agree with David that a fallocate (of the fill-in-holes variety)
> does not have to be prohibited on a sealed file, that detection of
> holes is not an issue with respect to sealing, and that fallocate
> by the recipient could be used to "post-seal" the object to safety.
>
> But it doesn't feel right, and we shall be re-explaining and apologizing
> for it for months to come, until we just fix it. I suspect David didn't
> want to add a dependency upon me to fix it, and I didn't want to be
> rushed into fixing it (nor is it a job I'd be comfortable to delegate).

I suppose it would be possible to merge memfd_create as is, and then
to fix the zero page thing and make fallocate on a SEAL_WRITEd file be
a no-op. It would be silly for code to fallocate actual
sealed-with-holes files and allocate fresh pages that are guaranteed
to only ever contain zeros.

--Andy
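For reference, the sender-side workaround discussed above (fill the holes with a plain fallocate() before setting SEAL_WRITE) is only a few lines. This is a hedged sketch, not code from this series: it uses the libc fallocate() wrapper, and the fallback constant values are taken from this series' uapi header in case your fcntl.h predates them.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_ADD_SEALS
#define F_ADD_SEALS (1024 + 9)   /* F_LINUX_SPECIFIC_BASE + 9 */
#define F_SEAL_WRITE 0x0008
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif

/* Fill any holes with real zeroed pages, then seal against writes,
 * so that no reader of the sealed file can ever fault on a hole. */
static int seal_write_without_holes(int fd, off_t size)
{
    if (fallocate(fd, 0, 0, size) < 0)   /* plain mode: allocates holes */
        return -1;
    return fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);
}
```

Whether the kernel should do this implicitly on SEAL_WRITE, or back sealed holes by the zero page, is exactly the open question in this subthread.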

2014-07-08 16:55:00

by David Herrmann

Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

Hi

On Fri, Jun 13, 2014 at 12:36 PM, David Herrmann <[email protected]> wrote:
> Hi
>
> This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
> a longer introduction at gmane:
> http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
> https://lwn.net/Articles/593918/
> v2 with some more discussions can be found here:
> http://thread.gmane.org/gmane.linux.kernel.mm/115713
>
> This series introduces two new APIs:
> memfd_create(): Think of this syscall as malloc() but it returns a
> file-descriptor instead of a pointer. That file-descriptor is
> backed by anon-memory and can be memory-mapped for access.
> sealing: The sealing API can be used to prevent a specific set of operations
> on a file-descriptor. You 'seal' the file, thus guaranteeing
> that it cannot be modified in those specific ways.
>
> A short high-level introduction is also available here:
> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/
>
>
> Changed in v3:
> - fcntl() now returns EINVAL if the FD does not support sealing. We used to
> return EBADF like pipe_fcntl() does, but that is really weird and I don't
> like repeating that.
> - seals are now saved as "unsigned int" instead of "u32".
> - i_mmap_writable is now an atomic so we can deny writable mappings just like
> i_writecount does.
> - SHMEM_ALLOW_SEALING is dropped. We initialize all objects with F_SEAL_SEAL
> and only unset it for memfds that shall support sealing.
> - memfd_create() no longer has a size argument. It was redundant, use
> ftruncate() or fallocate().
> - memfd_create() flags are "unsigned int" now, instead of "u64".
> - NAME_MAX off-by-one fix
> - several cosmetic changes
> - Added AIO/Direct-IO page-pinning protection
>
> The last point is the most important change in this version: We now bail out if
> any page-refcount is elevated while setting SEAL_WRITE. This prevents parallel
> GUP users from writing to sealed files _after_ they were sealed. There is also a
> new FUSE-based test-case to trigger such situations.
>
> The last 2 patches try to improve the page-pinning handling. I included both in
> this series, but obviously only one of them is needed (or we could stack them):
> - 6/7: This waits for up to 150ms for pages to be unpinned
> - 7/7: This isolates pinned pages and replaces them with a fresh copy
>
> Hugh, patch 6 is basically your code. In case that gets merged, can I put your
> Signed-off-by on it?

Hugh, any comments on patch 5, 6 and 7? Those are the last outstanding
issues with memfd+sealing. Patch 7 (isolating pages) is still my
favorite and has been running just fine on my machine for the last
months. I think it'd be nice if we could give it a try in -next. We
can always fall back to Patch 5 or Patch 5+6. Those will detect any
racing AIO and just fail or wait for the IO to finish for a short
period.

Are there any other blockers for this?

Thanks
David

2014-07-08 18:39:50

by David Herrmann

Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

Hi

On Fri, Jun 13, 2014 at 4:20 PM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> Hi David,
>
> On Fri, Jun 13, 2014 at 2:41 PM, David Herrmann <[email protected]> wrote:
>> Hi
>>
>> On Fri, Jun 13, 2014 at 2:27 PM, Michael Kerrisk (man-pages)
>> <[email protected]> wrote:
>>> Hi David,
>>>
>>> On Fri, Jun 13, 2014 at 12:36 PM, David Herrmann <[email protected]> wrote:
>>>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>>>> that you can pass to mmap(). It can support sealing and avoids any
>>>> connection to user-visible mount-points. Thus, it's not subject to quotas
>>>> on mounted file-systems, and can be used like malloc()'ed memory, but
>>>> with a file-descriptor attached to it.
>>>>
>>>> memfd_create() returns the raw shmem file, so calls like ftruncate() can
>>>> be used to modify the underlying inode. Also calls like fstat()
>>>> will return proper information and mark the file as a regular file. If you
>>>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>>>> supported (like on all other regular files).
>>>>
>>>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>>>> subject to quotas and the like. It is still properly accounted to memcg
>>>> limits, though.
>>>
>>> Where do I find / is there detailed documentation (ideally, a man
>>> page) for this new system call?
>>
>> I did write a man-page proposal for memfd_create() and a patch for
>> fcntl() for v1,
>
> Ahh -- that's why I had a recollection of such a page ;-).
>
>> however, the API changed several times so I didn't
>> keep them up to date (the man-page patches are on LKML). However, I
>> wrote a short introduction to memfd+sealing v3, that I recommend
>> reading first:
>> http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/
>
> Yes, I saw it already. (It's good, but I want more.)

Sorry, totally forgot about that one. I now pushed the man-pages out.
They're available here:

http://cgit.freedesktop.org/~dvdhrm/man-pages/log/?h=memfd

Thanks!
David
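To complement the man-pages, the basic usage pattern from the quoted commit message (memfd_create(), then ftruncate() since the size argument was dropped in v3, then mmap()) looks roughly like this. It is a sketch: the raw syscall() form reflects that no libc wrapper existed at the time, and it assumes your headers already carry SYS_memfd_create and the MFD_ALLOW_SEALING value from this series.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif

/* malloc()-like anonymous memory, but with a file-descriptor to it:
 * create the memfd, size it explicitly, then map it. */
static void *alloc_shared(size_t size, int *fd_out)
{
    int fd = syscall(SYS_memfd_create, "example", MFD_ALLOW_SEALING);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, size) < 0) {   /* v3 dropped the size argument */
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}
```

The returned fd can then be passed over a unix socket, sealed, or handed to fstat()/fallocate() like any regular shmem file.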

2014-07-09 08:55:21

by Hugh Dickins

Subject: Re: [PATCH v3 0/7] File Sealing & memfd_create()

On Tue, 8 Jul 2014, David Herrmann wrote:
>
> Hugh, any comments on patch 5, 6 and 7? Those are the last outstanding
> issues with memfd+sealing. Patch 7 (isolating pages) is still my
> favorite and has been running just fine on my machine for the last
> months. I think it'd be nice if we could give it a try in -next. We
> can always fall back to Patch 5 or Patch 5+6. Those will detect any
> racing AIO and just fail or wait for the IO to finish for a short
> period.

It's distressing for both of us how slow I am to review these, sorry.
We have just too many bugs in mm (and yes, some of them mine) for me
to set aside time to get deep enough into new features.

I've been trying for days and weeks to get there, made some progress
today, and hope to continue tomorrow. I'll send my comments on 1/7
(thumb up) and 7/7 (thumb down) in a moment: 2-6 not tonight.

>
> Are there any other blockers for this?

Trivia only, I haven't noticed any blocker; though I'm still not quite
convinced by memfd_create() - but happy enough with it if others are.

Hugh

2014-07-09 08:56:46

by Hugh Dickins

Subject: Re: [PATCH v3 1/7] mm: allow drivers to prevent new writable mappings

On Fri, 13 Jun 2014, David Herrmann wrote:

> The i_mmap_writable field counts existing writable mappings of an
> address_space. To allow drivers to prevent new writable mappings, make
> this counter signed and prevent new writable mappings if it is negative.
> This is modelled after i_writecount and DENYWRITE.
>
> This will be required by the shmem-sealing infrastructure to prevent any
> new writable mappings after the WRITE seal has been set. In case there
> exists a writable mapping, this operation will fail with EBUSY.
>
> Note that we rely on the fact that iff you already own a writable mapping,
> you can increase the counter without using the helpers. This is the same
> as we do for i_writecount.
>
> Signed-off-by: David Herrmann <[email protected]>

I'm very pleased with the way you chose to do this in the end, following
the way VM_DENYWRITE does it: it works out nicer than I had imagined.

I do have some comments below, but the only one where I end up asking
for a change is to remove the void "return "; but please read through
in case any of my wonderings lead you to change your mind on something.

I am very tempted to suggest that you add VM_WRITE into those VM_SHARED
tests, and put the then necessary additional tests into mprotect.c: so
that you would no longer have the odd case that a VM_SHARED !VM_WRITE
area has to be unmapped before sealing the fd.

But that would involve an audit of flush_dcache_page() architectures,
and perhaps some mods to keep them safe across the mprotect(): let's
put it down as a desirable future enhancement, just to remove an odd
restriction.

With the "return " gone,
Acked-by: Hugh Dickins <[email protected]>

> ---
> fs/inode.c | 1 +
> include/linux/fs.h | 29 +++++++++++++++++++++++++++--
> kernel/fork.c | 2 +-
> mm/mmap.c | 24 ++++++++++++++++++------
> mm/swap_state.c | 1 +
> 5 files changed, 48 insertions(+), 9 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 6eecb7f..9945b11 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -165,6 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> mapping->a_ops = &empty_aops;
> mapping->host = inode;
> mapping->flags = 0;
> + atomic_set(&mapping->i_mmap_writable, 0);
> mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> mapping->private_data = NULL;
> mapping->backing_dev_info = &default_backing_dev_info;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 338e6f7..71d17c9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -387,7 +387,7 @@ struct address_space {
> struct inode *host; /* owner: inode, block_device */
> struct radix_tree_root page_tree; /* radix tree of all pages */
> spinlock_t tree_lock; /* and lock protecting it */
> - unsigned int i_mmap_writable;/* count VM_SHARED mappings */
> + atomic_t i_mmap_writable;/* count VM_SHARED mappings */
> struct rb_root i_mmap; /* tree of private and shared mappings */
> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
> struct mutex i_mmap_mutex; /* protect tree, count, list */
> @@ -470,10 +470,35 @@ static inline int mapping_mapped(struct address_space *mapping)
> * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
> * marks vma as VM_SHARED if it is shared, and the file was opened for
> * writing i.e. vma may be mprotected writable even if now readonly.
> + *
> + * If i_mmap_writable is negative, no new writable mappings are allowed. You
> + * can only deny writable mappings, if none exists right now.
> */
> static inline int mapping_writably_mapped(struct address_space *mapping)
> {
> - return mapping->i_mmap_writable != 0;
> + return atomic_read(&mapping->i_mmap_writable) > 0;

I have a slight anxiety that you're halving the writable range here:
but I think that if we can overflow with this change, we could overflow
before; and there are int counts elsewhere which would overflow easier.

> +}
> +
> +static inline int mapping_map_writable(struct address_space *mapping)
> +{
> + return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
> + 0 : -EPERM;
> +}
> +
> +static inline void mapping_unmap_writable(struct address_space *mapping)
> +{
> + return atomic_dec(&mapping->i_mmap_writable);

Better just

atomic_dec(&mapping->i_mmap_writable);

I didn't realize before that the compiler allows you to return a void
function from a void function without warning, but better not rely on that.

> +}
> +
> +static inline int mapping_deny_writable(struct address_space *mapping)
> +{
> + return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
> + 0 : -EBUSY;
> +}
> +
> +static inline void mapping_allow_writable(struct address_space *mapping)
> +{
> + atomic_inc(&mapping->i_mmap_writable);
> }
>
> /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d2799d1..f1f127e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -421,7 +421,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> atomic_dec(&inode->i_writecount);
> mutex_lock(&mapping->i_mmap_mutex);
> if (tmp->vm_flags & VM_SHARED)
> - mapping->i_mmap_writable++;
> + atomic_inc(&mapping->i_mmap_writable);
> flush_dcache_mmap_lock(mapping);
> /* insert tmp into the share list, just after mpnt */
> if (unlikely(tmp->vm_flags & VM_NONLINEAR))
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 129b847..19b6562 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -216,7 +216,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> if (vma->vm_flags & VM_DENYWRITE)
> atomic_inc(&file_inode(file)->i_writecount);
> if (vma->vm_flags & VM_SHARED)
> - mapping->i_mmap_writable--;
> + mapping_unmap_writable(mapping);

Okay. I waste ridiculous time trying to decide whether it's better to say

mapping_unmap_writable(mapping)
or
atomic_dec(&mapping->i_mmap_writable);

here: there are consistency arguments both ways. I usually solve this
problem by not giving wrapper names to such operations, but in this case
I think the answer is just to accept whatever you decided was best.

>
> flush_dcache_mmap_lock(mapping);
> if (unlikely(vma->vm_flags & VM_NONLINEAR))
> @@ -617,7 +617,7 @@ static void __vma_link_file(struct vm_area_struct *vma)
> if (vma->vm_flags & VM_DENYWRITE)
> atomic_dec(&file_inode(file)->i_writecount);
> if (vma->vm_flags & VM_SHARED)
> - mapping->i_mmap_writable++;
> + atomic_inc(&mapping->i_mmap_writable);
>
> flush_dcache_mmap_lock(mapping);
> if (unlikely(vma->vm_flags & VM_NONLINEAR))
> @@ -1572,6 +1572,11 @@ munmap_back:
> if (error)
> goto free_vma;
> }
> + if (vm_flags & VM_SHARED) {
> + error = mapping_map_writable(file->f_mapping);
> + if (error)
> + goto allow_write_and_free_vma;
> + }
> vma->vm_file = get_file(file);
> error = file->f_op->mmap(file, vma);
> if (error)
> @@ -1611,8 +1616,12 @@ munmap_back:
>
> vma_link(mm, vma, prev, rb_link, rb_parent);
> /* Once vma denies write, undo our temporary denial count */
> - if (vm_flags & VM_DENYWRITE)
> - allow_write_access(file);
> + if (file) {
> + if (vm_flags & VM_SHARED)
> + mapping_unmap_writable(file->f_mapping);
> + if (vm_flags & VM_DENYWRITE)
> + allow_write_access(file);
> + }
> file = vma->vm_file;

Very good. A bit subtle, and for a while I was preparing to warn you
of the odd shmem_zero_setup() case (which materializes a file when
none came in), and the danger of the count going wrong on that.

But you have it handled, with the "if (file)" block immediately before
the new assignment of file, which should force anyone to think about it.
And the ordering here, with respect to file and error exits, has always
been tricky - I don't think you are making it any worse.

Oh, more subtle than I realized above: I've just gone through a last
minute "hey, it's wrong; oh, no, it's okay" waver here, remembering
how mmap /dev/zero (and some others, I suppose) gives you a new file.

That is okay: we don't even have to assume that sealing is limited
to your memfd_create(): it's safe to assume that any device which
forks you a new file, is safe to assert mapping_map_writable() upon
temporarily; and the new file given could never come already sealed.

But maybe a brief comment to reassure us in future?

> out:
> perf_event_mmap(vma);
> @@ -1641,14 +1650,17 @@ out:
> return addr;
>
> unmap_and_free_vma:
> - if (vm_flags & VM_DENYWRITE)
> - allow_write_access(file);
> vma->vm_file = NULL;
> fput(file);
>
> /* Undo any partial mapping done by a device driver. */
> unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
> charged = 0;
> + if (vm_flags & VM_SHARED)
> + mapping_unmap_writable(file->f_mapping);
> +allow_write_and_free_vma:
> + if (vm_flags & VM_DENYWRITE)
> + allow_write_access(file);

Okay. I don't enjoy that rearrangement, such that those two uses of
file come after the fput(file); but I believe there are good reasons
why it can never be the final fput(file), so I think you're okay.

> free_vma:
> kmem_cache_free(vm_area_cachep, vma);
> unacct_error:
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 2972eee..31321fa 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -39,6 +39,7 @@ static struct backing_dev_info swap_backing_dev_info = {
> struct address_space swapper_spaces[MAX_SWAPFILES] = {
> [0 ... MAX_SWAPFILES - 1] = {
> .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
> + .i_mmap_writable = ATOMIC_INIT(0),
> .a_ops = &swap_aops,
> .backing_dev_info = &swap_backing_dev_info,
> }
> --
> 2.0.0
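The userspace-visible behaviour this i_mmap_writable change enables can be sketched as follows: while any shared mapping of the file exists, adding SEAL_WRITE fails with EBUSY, and once sealed, new writable shared mappings fail with EPERM. This assumes the memfd/sealing API in the form it was later merged (glibc's memfd_create() wrapper and the F_SEAL_* constants), not this exact patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <errno.h>
#include <unistd.h>

#ifndef F_ADD_SEALS
#define F_ADD_SEALS (1024 + 9)
#define F_SEAL_WRITE 0x0008
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif

/* Try to seal while a shared mapping exists (expected to fail with
 * EBUSY, because i_mmap_writable is still elevated), then unmap and
 * retry (expected to succeed, since the count dropped back to zero). */
static int seal_write_after_unmap(int fd, size_t size, int *err_while_mapped)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == 0)
        *err_while_mapped = 0;      /* would mean sealing raced the mapping */
    else
        *err_while_mapped = errno;  /* expected: EBUSY */

    munmap(p, size);
    return fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);
}
```

Note that in this patch even a read-only VM_SHARED mapping blocks sealing, the "odd restriction" discussed above.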

2014-07-09 08:59:30

by Hugh Dickins

Subject: Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

On Fri, 13 Jun 2014, David Herrmann wrote:

> When setting SEAL_WRITE, we must make sure nobody has a writable reference
> to the pages (via GUP or similar). We currently check references and wait
> some time for them to be dropped. This, however, might fail for several
> reasons, including:
> - the page is pinned for longer than we wait
> - while we wait, someone takes an already pinned page for read-access
>
> Therefore, this patch introduces page-isolation. When sealing a file with
> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
> is put in place atomically, the old page is detached and left alone. It
> will get reclaimed once the last external user dropped it.
>
> Signed-off-by: David Herrmann <[email protected]>

I've not checked it line by line, but this seems to be very good work;
and I'm glad you have posted it, where we can refer back to it in future.

However, I'm NAKing this patch, at least for now.

The reason is simple and twofold.

I absolutely do not want to be maintaining an alternative form of
page migration in mm/shmem.c. Shmem has its own peculiar problems
(mostly because of swap): adding a new dimension of very rarely
exercised complication, and dependence on the rest of mm, is not wise.

And sealing just does not need this. It is clearly technically
superior to, and more satisfying than, the "wait-a-while-then-give-up"
technique which it would replace. But in practice, the wait-a-while
technique is quite good enough (and much better self-contained than this).

I've heard no requirement to support sealing of objects pinned for I/O,
and the sealer just would not have given the object out for that; the
requirement is to give the recipient of a sealed object confidence
that it cannot be susceptible to modification in that way.

I doubt there will ever be an actual need for sealing to use this
migration technique; but I can imagine us referring back to your work in
future, when/if implementing revoke on regular files. And integrating
this into mm/migrate.c's unmap_and_move() as an extra-force mode
(proceed even when the page count is raised).

I think the concerns I had, when Tony first proposed this migration copy
technique, were in fact unfounded - I was worried by the new inverse COW.
On reflection, I don't think this introduces any new risks, which are
not already present in page migration, truncation and orphaned pages.

I didn't begin to test it at all, but the only defects that stood out
in your code were in the areas of memcg and mlock. I think that if we
go down the road of duplicating pinned pages, then we do have to make
a memcg charge on the new page in addition to the old page. And if
any pages happen to be mlock'ed into an address space, then we ought
to map in the replacement pages afterwards (as page migration does,
whether mlock'ed or not).

(You were perfectly correct to use unmap_mapping_range(), rather than
try_to_unmap() as page migration does: because unmap_mapping_range()
manages the VM_NONLINEAR case. But our intention, under way, is to
scrap all VM_NONLINEAR code, and just emulate it with multiple vmas,
in which case try_to_unmap() should do.)

Hugh

> ---
> mm/shmem.c | 218 +++++++++++++++++++++++++++++--------------------------------
> 1 file changed, 105 insertions(+), 113 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ddc3998..34b14fb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1237,6 +1237,110 @@ unlock:
> return error;
> }
>
> +static int shmem_isolate_page(struct inode *inode, struct page *oldpage)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + struct page *newpage;
> + int error;
> +
> + if (oldpage->mapping != mapping)
> + return 0;
> + if (page_count(oldpage) - page_mapcount(oldpage) <= 2)
> + return 0;
> +
> + if (page_mapped(oldpage))
> + unmap_mapping_range(mapping,
> + (loff_t)oldpage->index << PAGE_CACHE_SHIFT,
> + PAGE_CACHE_SIZE, 0);
> +
> + VM_BUG_ON_PAGE(PageWriteback(oldpage), oldpage);
> + VM_BUG_ON_PAGE(page_has_private(oldpage), oldpage);
> +
> + newpage = shmem_alloc_page(mapping_gfp_mask(mapping), info,
> + oldpage->index);
> + if (!newpage)
> + return -ENOMEM;
> +
> + __set_page_locked(newpage);
> + copy_highpage(newpage, oldpage);
> + flush_dcache_page(newpage);
> +
> + page_cache_get(newpage);
> + SetPageUptodate(newpage);
> + SetPageSwapBacked(newpage);
> + newpage->mapping = mapping;
> + newpage->index = oldpage->index;
> +
> + cancel_dirty_page(oldpage, PAGE_CACHE_SIZE);
> +
> + spin_lock_irq(&mapping->tree_lock);
> + error = shmem_radix_tree_replace(mapping, oldpage->index,
> + oldpage, newpage);
> + if (!error) {
> + __inc_zone_page_state(newpage, NR_FILE_PAGES);
> + __dec_zone_page_state(oldpage, NR_FILE_PAGES);
> + }
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + if (error) {
> + newpage->mapping = NULL;
> + unlock_page(newpage);
> + page_cache_release(newpage);
> + page_cache_release(newpage);
> + return error;
> + }
> +
> + mem_cgroup_replace_page_cache(oldpage, newpage);
> + lru_cache_add_anon(newpage);
> +
> + oldpage->mapping = NULL;
> + page_cache_release(oldpage);
> + unlock_page(newpage);
> + page_cache_release(newpage);
> +
> + return 1;
> +}
> +
> +static int shmem_isolate_pins(struct inode *inode)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct pagevec pvec;
> + pgoff_t indices[PAGEVEC_SIZE];
> + pgoff_t index;
> + int i, ret, error;
> +
> + pagevec_init(&pvec, 0);
> + index = 0;
> + error = 0;
> + while ((pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
> + pvec.pages, indices))) {
> + for (i = 0; i < pagevec_count(&pvec); i++) {
> + struct page *page = pvec.pages[i];
> +
> + index = indices[i];
> + if (radix_tree_exceptional_entry(page))
> + continue;
> + if (page->mapping != mapping)
> + continue;
> + if (page_count(page) - page_mapcount(page) <= 2)
> + continue;
> +
> + lock_page(page);
> + ret = shmem_isolate_page(inode, page);
> + if (ret < 0)
> + error = ret;
> + unlock_page(page);
> + }
> + pagevec_remove_exceptionals(&pvec);
> + pagevec_release(&pvec);
> + cond_resched();
> + index++;
> + }
> +
> + return error;
> +}
> +
> static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> struct inode *inode = file_inode(vma->vm_file);
> @@ -1734,118 +1838,6 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
> return offset;
> }
>
> -/*
> - * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
> - * so reuse a tag which we firmly believe is never set or cleared on shmem.
> - */
> -#define SHMEM_TAG_PINNED PAGECACHE_TAG_TOWRITE
> -#define LAST_SCAN 4 /* about 150ms max */
> -
> -static void shmem_tag_pins(struct address_space *mapping)
> -{
> - struct radix_tree_iter iter;
> - void **slot;
> - pgoff_t start;
> - struct page *page;
> -
> - start = 0;
> - rcu_read_lock();
> -
> -restart:
> - radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> - page = radix_tree_deref_slot(slot);
> - if (!page || radix_tree_exception(page)) {
> - if (radix_tree_deref_retry(page))
> - goto restart;
> - } else if (page_count(page) - page_mapcount(page) > 1) {
> - spin_lock_irq(&mapping->tree_lock);
> - radix_tree_tag_set(&mapping->page_tree, iter.index,
> - SHMEM_TAG_PINNED);
> - spin_unlock_irq(&mapping->tree_lock);
> - }
> -
> - if (need_resched()) {
> - cond_resched_rcu();
> - start = iter.index + 1;
> - goto restart;
> - }
> - }
> - rcu_read_unlock();
> -}
> -
> -/*
> - * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
> - * via get_user_pages(), drivers might have some pending I/O without any active
> - * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
> - * and see whether it has an elevated ref-count. If so, we tag them and wait for
> - * them to be dropped.
> - * The caller must guarantee that no new user will acquire writable references
> - * to those pages to avoid races.
> - */
> -static int shmem_wait_for_pins(struct address_space *mapping)
> -{
> - struct radix_tree_iter iter;
> - void **slot;
> - pgoff_t start;
> - struct page *page;
> - int error, scan;
> -
> - shmem_tag_pins(mapping);
> -
> - error = 0;
> - for (scan = 0; scan <= LAST_SCAN; scan++) {
> - if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
> - break;
> -
> - if (!scan)
> - lru_add_drain_all();
> - else if (schedule_timeout_killable((HZ << scan) / 200))
> - scan = LAST_SCAN;
> -
> - start = 0;
> - rcu_read_lock();
> -restart:
> - radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
> - start, SHMEM_TAG_PINNED) {
> -
> - page = radix_tree_deref_slot(slot);
> - if (radix_tree_exception(page)) {
> - if (radix_tree_deref_retry(page))
> - goto restart;
> -
> - page = NULL;
> - }
> -
> - if (page &&
> - page_count(page) - page_mapcount(page) != 1) {
> - if (scan < LAST_SCAN)
> - goto continue_resched;
> -
> - /*
> - * On the last scan, we clean up all those tags
> - * we inserted; but make a note that we still
> - * found pages pinned.
> - */
> - error = -EBUSY;
> - }
> -
> - spin_lock_irq(&mapping->tree_lock);
> - radix_tree_tag_clear(&mapping->page_tree,
> - iter.index, SHMEM_TAG_PINNED);
> - spin_unlock_irq(&mapping->tree_lock);
> -continue_resched:
> - if (need_resched()) {
> - cond_resched_rcu();
> - start = iter.index + 1;
> - goto restart;
> - }
> - }
> - rcu_read_unlock();
> - }
> -
> - return error;
> -}
> -
> #define F_ALL_SEALS (F_SEAL_SEAL | \
> F_SEAL_SHRINK | \
> F_SEAL_GROW | \
> @@ -1907,7 +1899,7 @@ int shmem_add_seals(struct file *file, unsigned int seals)
> if (error)
> goto unlock;
>
> - error = shmem_wait_for_pins(file->f_mapping);
> + error = shmem_isolate_pins(inode);
> if (error) {
> mapping_allow_writable(file->f_mapping);
> goto unlock;
> --
> 2.0.0

2014-07-16 10:08:15

by Hugh Dickins

Subject: Re: [PATCH v3 2/7] shm: add sealing API

On Fri, 13 Jun 2014, David Herrmann wrote:

> If two processes share a common memory region, they usually want some
> guarantees to allow safe access. This often includes:
> - one side cannot overwrite data while the other reads it
> - one side cannot shrink the buffer while the other accesses it
> - one side cannot grow the buffer beyond previously set boundaries
>
> If there is a trust-relationship between both parties, there is no need
> for policy enforcement. However, if there's no trust relationship (e.g.,
> for general-purpose IPC) sharing memory-regions is highly fragile and
> often not possible without local copies. Look at the following two
> use-cases:
> 1) A graphics client wants to share its rendering-buffer with a
> graphics-server. The memory-region is allocated by the client for
> read/write access and a second FD is passed to the server. While
> scanning out from the memory region, the server has no guarantee that
> the client doesn't shrink the buffer at any time, requiring rather
> cumbersome SIGBUS handling.
> 2) A process wants to perform an RPC on another process. To avoid huge
> bandwidth consumption, zero-copy is preferred. After a message is
> assembled in-memory and a FD is passed to the remote side, both sides
> want to be sure that neither modifies this shared copy, anymore. The
> source may have put sensitive data into the message without a separate
> copy and the target may want to parse the message inline, to avoid a
> local copy.
>
> While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
> ways to achieve most of this, the first one is disproportionately ugly to
> use in libraries and the latter two are broken/racy or even disabled due
> to denial of service attacks.
>
> This patch introduces the concept of SEALING. If you seal a file, a
> specific set of operations is blocked on that file forever.
> Unlike locks, seals can only be set, never removed. Hence, once you
> verified a specific set of seals is set, you're guaranteed that no-one can
> perform the blocked operations on this file, anymore.
>
> An initial set of SEALS is introduced by this patch:
> - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
> in size. This affects ftruncate() and open(O_TRUNC).
> - GROW: If SEAL_GROW is set, the file in question cannot be increased
> in size. This affects ftruncate(), fallocate() and write().
> - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
> are possible. This affects fallocate(PUNCH_HOLE), mmap() and
> write().
> - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
> This basically prevents the F_ADD_SEAL operation on a file and
> can be set to prevent others from adding further seals that you
> don't want.
>
> The described use-cases can easily use these seals to provide safe use
> without any trust-relationship:
> 1) The graphics server can verify that a passed file-descriptor has
> SEAL_SHRINK set. This allows safe scanout, while the client is
> allowed to increase buffer size for window-resizing on-the-fly.
> Concurrent writes are explicitly allowed.
> 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
> SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
> process can modify the data while the other side parses it.
> Furthermore, it guarantees that even with writable FDs passed to the
> peer, it cannot increase the size to hit memory-limits of the source
> process (in case the file-storage is accounted to the source).
>
> The new API is an extension to fcntl(), adding two new commands:
> F_GET_SEALS: Return a bitset describing the seals on the file. This
> can be called on any FD if the underlying file supports
> sealing.
> F_ADD_SEALS: Change the seals of a given file. This requires WRITE
> access to the file and F_SEAL_SEAL may not already be set.
> Furthermore, the underlying file must support sealing and
> there may not be any existing shared mapping of that file.
> Otherwise, EBADF/EPERM is returned.
> The given seals are _added_ to the existing set of seals
> on the file. You cannot remove seals again.
>
> The fcntl() handler is currently specific to shmem and disabled on all
> files. A file needs to explicitly support sealing for this interface to
> work. A separate syscall is added in a follow-up, which creates files that
> support sealing. There is no intention to support this on other
> file-systems. Semantics are unclear for non-volatile files and we lack any
> use-case right now. Therefore, the implementation is specific to shmem.
>
> Signed-off-by: David Herrmann <[email protected]>

Looks pretty good to me, minor comments below.
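For the recipient side of use-case 2 in the quoted message, the F_GET_SEALS verification amounts to a mask check. A hedged sketch; the fallback constant values are those from this patch's uapi header, in case your fcntl.h predates them.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

#ifndef F_GET_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_GET_SEALS   (1024 + 10)
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif

/* Trust a received buffer only if it can neither shrink, grow, nor be
 * written; F_GET_SEALS failing (EINVAL) means no sealing support. */
static int buffer_is_immutable(int fd)
{
    int seals = fcntl(fd, F_GET_SEALS);
    if (seals < 0)
        return 0;
    int need = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
    return (seals & need) == need;
}
```

Because seals can only ever be added, a successful check holds for the lifetime of the file, which is the whole point of the API.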

> ---
> fs/fcntl.c | 5 ++
> include/linux/shmem_fs.h | 17 ++++
> include/uapi/linux/fcntl.h | 15 ++++
> mm/shmem.c | 189 ++++++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 223 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 72c82f6..22d1c3d 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -21,6 +21,7 @@
> #include <linux/rcupdate.h>
> #include <linux/pid_namespace.h>
> #include <linux/user_namespace.h>
> +#include <linux/shmem_fs.h>
>
> #include <asm/poll.h>
> #include <asm/siginfo.h>
> @@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> case F_GETPIPE_SZ:
> err = pipe_fcntl(filp, cmd, arg);
> break;
> + case F_ADD_SEALS:
> + case F_GET_SEALS:
> + err = shmem_fcntl(filp, cmd, arg);
> + break;
> default:
> break;
> }
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 4d1771c..50777b5 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -1,6 +1,7 @@
> #ifndef __SHMEM_FS_H
> #define __SHMEM_FS_H
>
> +#include <linux/file.h>
> #include <linux/swap.h>
> #include <linux/mempolicy.h>
> #include <linux/pagemap.h>
> @@ -11,6 +12,7 @@
>
> struct shmem_inode_info {
> spinlock_t lock;
> + unsigned int seals; /* shmem seals */
> unsigned long flags;
> unsigned long alloced; /* data pages alloced to file */
> union {
> @@ -65,4 +67,19 @@ static inline struct page *shmem_read_mapping_page(
> mapping_gfp_mask(mapping));
> }
>
> +#ifdef CONFIG_TMPFS
> +
> +extern int shmem_add_seals(struct file *file, unsigned int seals);
> +extern int shmem_get_seals(struct file *file);
> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> +
> +#else
> +
> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
> +{
> + return -EINVAL;
> +}
> +
> +#endif
> +
> #endif
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 074b886..beed138 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -28,6 +28,21 @@
> #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
>
> /*
> + * Set/Get seals
> + */
> +#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
> +#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
> +
> +/*
> + * Types of seals
> + */
> +#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
> +#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
> +#define F_SEAL_GROW 0x0004 /* prevent file from growing */
> +#define F_SEAL_WRITE 0x0008 /* prevent writes */
> +/* (1U << 31) is reserved for signed error codes */
> +
> +/*
> * Types of directory notifications that may be requested.
> */
> #define DN_ACCESS 0x00000001 /* File accessed */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f484c27..1438b3e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
> #include <linux/highmem.h>
> #include <linux/seq_file.h>
> #include <linux/magic.h>
> +#include <linux/fcntl.h>
>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
> static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
> {
> struct inode *inode = dentry->d_inode;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> + loff_t oldsize = inode->i_size;
> + loff_t newsize = attr->ia_size;
> int error;
>
> error = inode_change_ok(inode, attr);
> if (error)
> return error;
>
> - if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> - loff_t oldsize = inode->i_size;
> - loff_t newsize = attr->ia_size;
> + /* protected by i_mutex */
> + if (attr->ia_valid & ATTR_SIZE) {
> + if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
> + (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> + return -EPERM;
> + }

Not important but...
I'd have thought all that was better inside the S_ISREG(inode->i_mode) &&
(attr->ia_valid & ATTR_SIZE) block. Less unnecessary change above, and
more efficient for non-size attrs. You cannot seal anything but a regular
file anyway, right?

>
> + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> if (newsize != oldsize) {
> i_size_write(inode, newsize);
> inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> @@ -1315,6 +1323,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
> info = SHMEM_I(inode);
> memset(info, 0, (char *)inode - (char *)info);
> spin_lock_init(&info->lock);
> + info->seals = F_SEAL_SEAL;
> info->flags = flags & VM_NORESERVE;
> INIT_LIST_HEAD(&info->swaplist);
> simple_xattrs_init(&info->xattrs);
> @@ -1374,7 +1383,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> {
> int ret;
> struct inode *inode = mapping->host;
> + struct shmem_inode_info *info = SHMEM_I(inode);
> pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> +
> + /* i_mutex is held by caller */
> + if (info->seals & F_SEAL_WRITE)
> + return -EPERM;
> + if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
> + return -EPERM;

I think this is your only addition which comes in a hot path.
Mel has been shaving nanoseconds off this path recently: you're not
introducing any atomic ops here, good, but I wonder if it would make any
measurable difference to include this pair of tests inside a single
"if (unlikely(info->seals)) {". Maybe not, but it wouldn't hurt.

> +
> ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
> if (ret == 0 && *pagep)
> init_page_accessed(*pagep);
> @@ -1715,11 +1732,166 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
> return offset;
> }
>
> +/*
> + * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
> + * via get_user_pages(), drivers might have some pending I/O without any active
> + * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
> + * and see whether it has an elevated ref-count. If so, we abort.
> + * The caller must guarantee that no new user will acquire writable references
> + * to those pages to avoid races.
> + */
> +static int shmem_test_for_pins(struct address_space *mapping)
> +{
> + struct radix_tree_iter iter;
> + void **slot;
> + pgoff_t start;
> + struct page *page;
> + int error;
> +
> + /* flush additional refs in lru_add early */
> + lru_add_drain_all();
> +
> + error = 0;
> + start = 0;
> + rcu_read_lock();
> +
> +restart:
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> + page = radix_tree_deref_slot(slot);
> + if (!page || radix_tree_exception(page)) {
> + if (radix_tree_deref_retry(page))
> + goto restart;
> + } else if (page_count(page) - page_mapcount(page) > 1) {
> + error = -EBUSY;
> + break;
> + }
> +
> + if (need_resched()) {
> + cond_resched_rcu();
> + start = iter.index + 1;
> + goto restart;
> + }
> + }
> + rcu_read_unlock();
> +
> + return error;
> +}

Please leave shmem_test_for_pins() (and the comment above it)
out of this particular patch.

The implementation here satisfies none of us, makes it harder to
follow the later patch that improves it, and distracts from the basic
sealing interface and functionality that you introduce in this patch.

A brief comment on the issue instead - "But what if a page of the
object is pinned for pending I/O? See later patch" - maybe, but on the
whole I think it's better to raise and settle the issue in a later patch.

> +
> +#define F_ALL_SEALS (F_SEAL_SEAL | \
> + F_SEAL_SHRINK | \
> + F_SEAL_GROW | \
> + F_SEAL_WRITE)
> +
> +int shmem_add_seals(struct file *file, unsigned int seals)
> +{
> + struct dentry *dentry = file->f_path.dentry;
> + struct inode *inode = dentry->d_inode;

struct inode *inode = file_inode(file), and forget about dentry?

> + struct shmem_inode_info *info = SHMEM_I(inode);
> + int error;
> +
> + /*
> + * SEALING
> + * Sealing allows multiple parties to share a shmem-file but restrict
> + * access to a specific subset of file operations. Seals can only be
> + * added, but never removed. This way, mutually untrusted parties can
> + * share common memory regions with a well-defined policy. A malicious
> + * peer can thus never perform unwanted operations on a shared object.
> + *
> + * Seals are only supported on special shmem-files and always affect
> + * the whole underlying inode. Once a seal is set, it may prevent some
> + * kinds of access to the file. Currently, the following seals are
> + * defined:
> + * SEAL_SEAL: Prevent further seals from being set on this file
> + * SEAL_SHRINK: Prevent the file from shrinking
> + * SEAL_GROW: Prevent the file from growing
> + * SEAL_WRITE: Prevent write access to the file
> + *
> + * As we don't require any trust relationship between two parties, we
> + * must prevent seals from being removed. Therefore, sealing a file
> + * only adds a given set of seals to the file, it never touches
> + * existing seals. Furthermore, the "setting seals"-operation can be
> + * sealed itself, which basically prevents any further seal from being
> + * added.
> + *
> + * Semantics of sealing are only defined on volatile files. Only
> + * anonymous shmem files support sealing. More importantly, seals are
> + * never written to disk. Therefore, there's no plan to support it on
> + * other file types.
> + */
> +
> + if (file->f_op != &shmem_file_operations)
> + return -EINVAL;
> + if (!(file->f_mode & FMODE_WRITE))
> + return -EINVAL;

I would expect -EBADF there (like when you write to a read-only fd).
Though I was okay with the -EPERM you had in the previous version.

> + if (seals & ~(unsigned int)F_ALL_SEALS)
> + return -EINVAL;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + if (info->seals & F_SEAL_SEAL) {

I notice this is inconsistent with F_SEAL_WRITE just below:
we're allowed to SEAL_WRITE what's already SEAL_WRITEd,
but not to SEAL_SEAL what's already SEAL_SEALed.
Oh, never mind, I can see that makes some sense.

> + error = -EPERM;
> + goto unlock;
> + }
> +
> + if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) {
> + error = mapping_deny_writable(file->f_mapping);
> + if (error)

Which would be -EBUSY, yes, that seems okay.

And with your atomic i_mmap_writable changes in 1/7, and the i_mutex
here, the locking is now solid, and accomplished simply: nice.

> + goto unlock;
> +
> + error = shmem_test_for_pins(file->f_mapping);
> + if (error) {
> + mapping_allow_writable(file->f_mapping);
> + goto unlock;
> + }

Right, although I ask you to remove shmem_test_for_pins() from this
patch, I can see that you might want to include a "return 0" stub for
shmem_wait_for_pins() in this patch, just so that this can appear here
now, and we consider the non-atomicity of it. Yes, I agree this is
how it should proceed: first deny, then re-allow if waiting fails.

> + }
> +
> + info->seals |= seals;
> + error = 0;
> +
> +unlock:
> + mutex_unlock(&inode->i_mutex);
> + return error;
> +}
> +EXPORT_SYMBOL_GPL(shmem_add_seals);
> +
> +int shmem_get_seals(struct file *file)
> +{
> + if (file->f_op != &shmem_file_operations)
> + return -EINVAL;

That's fine, though it is worth considering whether return 0
might be preferable. No, I suppose this is easier, fits with
shmem_fcntl() just returning -EINVAL when !TMPFS or !SHMEM.

> +
> + return SHMEM_I(file_inode(file))->seals & F_ALL_SEALS;

& F_ALL_SEALS? Okay, that may be some kind of future proofing that you
have in mind; but it may just be a leftover from when you were using bit
31 for internal use.

> +}
> +EXPORT_SYMBOL_GPL(shmem_get_seals);
> +
> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + long error;
> +
> + switch (cmd) {
> + case F_ADD_SEALS:
> + /* disallow upper 32bit */
> + if (arg >> 32)
> + return -EINVAL;

That is worth checking, but gives
mm/shmem.c:1948:3: warning: right shift count >= width of type
on a 32-bit build. I expect there's an accepted way to do it;
I've used "arg > UINT_MAX" myself in some places.
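The UINT_MAX comparison Hugh mentions might be sketched as follows (illustrative only; on a 32-bit build the comparison is simply compile-time false, rather than an illegal shift by the full word width):

```c
#include <errno.h>
#include <limits.h>

/* Reject seal arguments with bits above the low 32, without shifting
 * an unsigned long by 32 (undefined on 32-bit builds). */
static long check_seal_arg(unsigned long arg)
{
	if (arg > UINT_MAX)	/* always false when sizeof(long) == 4 */
		return -EINVAL;
	return 0;
}
```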

> +
> + error = shmem_add_seals(file, arg);
> + break;
> + case F_GET_SEALS:
> + error = shmem_get_seals(file);
> + break;
> + default:
> + error = -EINVAL;
> + break;
> + }
> +
> + return error;
> +}
> +
> static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> loff_t len)
> {
> struct inode *inode = file_inode(file);
> struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> + struct shmem_inode_info *info = SHMEM_I(inode);
> struct shmem_falloc shmem_falloc;
> pgoff_t start, index, end;
> int error;
> @@ -1731,6 +1903,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> loff_t unmap_start = round_up(offset, PAGE_SIZE);
> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>
> + /* protected by i_mutex */
> + if (info->seals & F_SEAL_WRITE) {
> + error = -EPERM;
> + goto out;
> + }
> +
> if ((u64)unmap_end > (u64)unmap_start)
> unmap_mapping_range(mapping, unmap_start,
> 1 + unmap_end - unmap_start, 0);
> @@ -1745,6 +1923,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> if (error)
> goto out;
>
> + if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
> + error = -EPERM;
> + goto out;
> + }
> +
> start = offset >> PAGE_CACHE_SHIFT;
> end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> /* Try to avoid a swapstorm if len is impossible to satisfy */
> --
> 2.0.0

2014-07-16 10:08:43

by Hugh Dickins

Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

On Fri, 13 Jun 2014, David Herrmann wrote:

> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It can support sealing and avoids any
> connection to user-visible mount-points. Thus, it's not subject to quotas
> on mounted file-systems, but can be used like malloc()'ed memory, but
> with a file-descriptor to it.
>
> memfd_create() returns the raw shmem file, so calls like ftruncate() can
> be used to modify the underlying inode. Also calls like fstat()
> will return proper information and mark the file as regular file. If you
> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
> supported (like on all other regular files).
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike. It is still properly accounted to memcg
> limits, though.

It's an important point, but it's not quite clear what "quotas and alike" means.
There's never been any quota support in shmem/tmpfs, but filesystem size
can be limited. Maybe say "and is not subject to a filesystem size limit.
It is still properly accounted to memcg limits, though, and to the same
overcommit or no-overcommit accounting as all user memory."

>
> Signed-off-by: David Herrmann <[email protected]>

A comment or two below, but this is okay by me. I'm not wildly excited
to be getting a new system call in mm/shmem.c. I do like it much better
now that you've dropped the size arg, thank you, but I still find it an
odd system call: if it were not for the name, that you want so much for
debugging, I think we would just implement this with a /dev/sealable
alongside /dev/zero, which gave you your own object on opening (in the
way that /dev/zero gives you your own object on mmap'ing).

I haven't checked the manpage, I hope it's made very clear that
there's no uniqueness imposed on the name, that it's merely a tag
attached to the object.

But from a shmem point of view this seems fine: if everyone else
is happy with memfd_create(), it's fine by me.
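For reference, end-to-end use of the proposed syscall from userspace might look like this (a sketch using a raw syscall(2) invocation, since no libc wrapper exists yet; the syscall number and constants are taken from this series, with local fallback definitions):

```c
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC       0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS   (1024 + 9)
#define F_SEAL_SHRINK 0x0002
#define F_SEAL_GROW   0x0004
#define F_SEAL_WRITE  0x0008
#endif

/* Create a fully sealed memory object containing @data; the returned
 * fd can be passed to an untrusting peer (e.g. over SCM_RIGHTS). */
static int make_sealed_buf(const void *data, size_t size)
{
	int fd = syscall(__NR_memfd_create, "example-buf", /* name is a debug tag */
			 MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0 ||
	    write(fd, data, size) != (ssize_t)size ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Note the ftruncate() call: with the size argument dropped from memfd_create() in v3, the caller sizes the object explicitly before sealing.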

> ---
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/linux/memfd.h | 8 +++++
> kernel/sys_ni.c | 1 +
> mm/shmem.c | 72 ++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 84 insertions(+)
> create mode 100644 include/uapi/linux/memfd.h
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..e7495b4 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
> 351 i386 sched_setattr sys_sched_setattr
> 352 i386 sched_getattr sys_sched_getattr
> 353 i386 renameat2 sys_renameat2
> +354 i386 memfd_create sys_memfd_create
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..28be0e1 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> 316 common renameat2 sys_renameat2
> +317 common memfd_create sys_memfd_create
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..0be5d4d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
> asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
> asmlinkage long sys_eventfd(unsigned int count);
> asmlinkage long sys_eventfd2(unsigned int count, int flags);
> +asmlinkage long sys_memfd_create(const char *uname_ptr, unsigned int flags);
> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
> asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> new file mode 100644
> index 0000000..534e364
> --- /dev/null
> +++ b/include/uapi/linux/memfd.h
> @@ -0,0 +1,8 @@
> +#ifndef _UAPI_LINUX_MEMFD_H
> +#define _UAPI_LINUX_MEMFD_H
> +
> +/* flags for memfd_create(2) (unsigned int) */
> +#define MFD_CLOEXEC 0x0001U
> +#define MFD_ALLOW_SEALING 0x0002U
> +
> +#endif /* _UAPI_LINUX_MEMFD_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..489a4e6 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime);
> cond_syscall(compat_sys_timerfd_gettime);
> cond_syscall(sys_eventfd);
> cond_syscall(sys_eventfd2);
> +cond_syscall(sys_memfd_create);
>
> /* performance counters: */
> cond_syscall(sys_perf_event_open);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1438b3e..e7c5fe1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
> #include <linux/highmem.h>
> #include <linux/seq_file.h>
> #include <linux/magic.h>
> +#include <linux/syscalls.h>
> #include <linux/fcntl.h>
> +#include <uapi/linux/memfd.h>
>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> @@ -2662,6 +2664,76 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
> shmem_show_mpol(seq, sbinfo->mpol);
> return 0;
> }
> +
> +#define MFD_NAME_PREFIX "memfd:"
> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> +
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
> +
> +SYSCALL_DEFINE2(memfd_create,
> + const char*, uname,

Jann Horn suggested "const char __user *" rather than "const char *",
here and in syscalls.h; I think that's right (for sparse: compare
with sys_open, for example).

> + unsigned int, flags)
> +{
> + struct shmem_inode_info *info;
> + struct file *file;
> + int fd, error;
> + char *name;
> + long len;
> +
> + if (flags & ~(unsigned int)MFD_ALL_FLAGS)
> + return -EINVAL;
> +
> + /* length includes terminating zero */
> + len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> + if (len <= 0)
> + return -EFAULT;
> + if (len > MFD_NAME_MAX_LEN + 1)
> + return -EINVAL;
> +
> + name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY);
> + if (!name)
> + return -ENOMEM;
> +
> + strcpy(name, MFD_NAME_PREFIX);
> + if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
> + error = -EFAULT;
> + goto err_name;
> + }
> +
> + /* terminating-zero may have changed after strnlen_user() returned */
> + if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
> + error = -EFAULT;
> + goto err_name;
> + }
> +
> + fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);

Perhaps we should throw O_LARGEFILE in there too? So 32-bit is not
surprised when it accesses beyond MAX_NON_LFS. I guess it's almost
a non-issue, since the file is in memory, so not expected to be very
large; but I seem to recall being caught out by a missing O_LARGEFILE
in the past, and a new interface like this might do better to force it.

But I'm not very sure of my ground here: please ask around, an fsdevel
person will have a better idea than me, whether it's best included.

> + if (fd < 0) {
> + error = fd;
> + goto err_name;
> + }
> +
> + file = shmem_file_setup(name, 0, VM_NORESERVE);
> + if (IS_ERR(file)) {
> + error = PTR_ERR(file);
> + goto err_fd;
> + }
> + info = SHMEM_I(file_inode(file));
> + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> + if (flags & MFD_ALLOW_SEALING)
> + info->seals &= ~F_SEAL_SEAL;
> +
> + fd_install(fd, file);
> + kfree(name);
> + return fd;
> +
> +err_fd:
> + put_unused_fd(fd);
> +err_name:
> + kfree(name);
> + return error;
> +}
> +
> #endif /* CONFIG_TMPFS */
>
> static void shmem_put_super(struct super_block *sb)
> --
> 2.0.0

2014-07-16 10:09:21

by Hugh Dickins

Subject: Re: [PATCH v3 4/7] selftests: add memfd_create() + sealing tests

On Fri, 13 Jun 2014, David Herrmann wrote:

> Some basic tests to verify sealing on memfds works as expected and
> guarantees the advertised semantics.
>
> Signed-off-by: David Herrmann <[email protected]>

Thanks for whatever the fix was, I didn't hit any problem running
this version repeatedly, on 64-bit and on 32-bit.

Hugh

2014-07-16 10:10:12

by Hugh Dickins

Subject: Re: [PATCH v3 5/7] selftests: add memfd/sealing page-pinning tests

On Fri, 13 Jun 2014, David Herrmann wrote:

> Setting SEAL_WRITE is not possible if there're pending GUP users. This
> commit adds selftests for memfd+sealing that use FUSE to create pending
> page-references. FUSE is very helpful here in that it allows us to delay
> direct-IO operations for an arbitrary amount of time. This way, we can
> force the kernel to pin pages and then run our normal selftests.
>
> Signed-off-by: David Herrmann <[email protected]>

I had a number of problems in getting this working (on openSUSE 13.1):
rpm told me I had fuse installed, yet I had to download and install
the tarball to get header files needed; then "make fuse_mnt" told me
to add -D_FILE_OFFSET_BITS=64 to the compile flags; after which I
got "undefined reference to `fuse_main_real'"; but then I tried
"make run_fuse" as root, and it seemed to sort these issues out
for itself, aside from "./run_fuse_test.sh: Permission denied" -
which was within my bounds of comprehension unlike the rest!

No complaint, thanks for providing the test (though I didn't check
the source to convince myself that "DONE" has done what's claimed):
some rainy day someone can get the Makefile working more smoothly,
no need to delay the patchset for this.

Hugh

2014-07-16 10:11:40

by Hugh Dickins

Subject: Re: [RFC v3 6/7] shm: wait for pins to be released when sealing

On Fri, 13 Jun 2014, David Herrmann wrote:

> We currently fail setting SEAL_WRITE in case there're pending page
> references. This patch extends the pin-tests to wait up to 150ms for all
> references to be dropped. This is still not perfect in that it doesn't
> account for harmless read-only pins, but it's much better than a hard
> failure.
>
> Signed-off-by: David Herrmann <[email protected]>

Right, I didn't look through the patch itself, just compared the result
with what I sent. Okay, you prefer to separate out shmem_tag_pins().

Yes, it looks fine. There's just one change I'd like at this stage,
something I realized shortly after sending the code fragment: please
add a call to lru_add_drain() at the head of shmem_tag_pins(). The
reason being that lru_add_drain() is local to the cpu, so cheap, and
in many cases will bring down all the raised refcounts right then.

Whereas lru_add_drain_all() in the first scan of shmem_wait_for_pins()
is much more expensive, involving inter-processor interrupts to do
that on all cpus: it is appropriate to call it at that point, but we
really ought to try the cheaper lru_add_drain() at the earlier stage.

I would also like never to embark on this scan of the radix_tree
and wait for pins, if the pages were never given out in a VM_SHARED
mapping - or is that unrealistic, because every memfd is read-write,
and typical initialization expected to be by mmap() rather than write()?
But anyway, you're quite right not to get into that at this stage:
it's best left as an optimization once the basics are safely in.

Hugh

2014-07-19 16:12:06

by David Herrmann

Subject: Re: [PATCH v3 1/7] mm: allow drivers to prevent new writable mappings

Hi Hugh

On Wed, Jul 9, 2014 at 10:55 AM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> The i_mmap_writable field counts existing writable mappings of an
>> address_space. To allow drivers to prevent new writable mappings, make
>> this counter signed and prevent new writable mappings if it is negative.
>> This is modelled after i_writecount and DENYWRITE.
>>
>> This will be required by the shmem-sealing infrastructure to prevent any
>> new writable mappings after the WRITE seal has been set. In case there
>> exists a writable mapping, this operation will fail with EBUSY.
>>
>> Note that we rely on the fact that iff you already own a writable mapping,
>> you can increase the counter without using the helpers. This is the same
>> that we do for i_writecount.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> I'm very pleased with the way you chose to do this in the end, following
> the way VM_DENYWRITE does it: it works out nicer than I had imagined.
>
> I do have some comments below, but the only one where I end up asking
> for a change is to remove the void "return "; but please read through
> in case any of my wonderings lead you to change your mind on something.
>
> I am very tempted to suggest that you add VM_WRITE into those VM_SHARED
> tests, and put the then necessary additional tests into mprotect.c: so
> that you would no longer have the odd case that a VM_SHARED !VM_WRITE
> area has to be unmapped before sealing the fd.
>
> But that would involve an audit of flush_dcache_page() architectures,
> and perhaps some mods to keep them safe across the mprotect(): let's
> put it down as a desirable future enhancement, just to remove an odd
> restriction.
>
> With the "return " gone,
> Acked-by: Hugh Dickins <[email protected]>

Thanks for the review! Yes, allowing VM_SHARED without VM_WRITE would
be nice, but I think it would make this patchset more complex than it
already is. I agree that it's better done separately.

>> ---
>> fs/inode.c | 1 +
>> include/linux/fs.h | 29 +++++++++++++++++++++++++++--
>> kernel/fork.c | 2 +-
>> mm/mmap.c | 24 ++++++++++++++++++------
>> mm/swap_state.c | 1 +
>> 5 files changed, 48 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 6eecb7f..9945b11 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -165,6 +165,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>> mapping->a_ops = &empty_aops;
>> mapping->host = inode;
>> mapping->flags = 0;
>> + atomic_set(&mapping->i_mmap_writable, 0);
>> mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>> mapping->private_data = NULL;
>> mapping->backing_dev_info = &default_backing_dev_info;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 338e6f7..71d17c9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -387,7 +387,7 @@ struct address_space {
>> struct inode *host; /* owner: inode, block_device */
>> struct radix_tree_root page_tree; /* radix tree of all pages */
>> spinlock_t tree_lock; /* and lock protecting it */
>> - unsigned int i_mmap_writable;/* count VM_SHARED mappings */
>> + atomic_t i_mmap_writable;/* count VM_SHARED mappings */
>> struct rb_root i_mmap; /* tree of private and shared mappings */
>> struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
>> struct mutex i_mmap_mutex; /* protect tree, count, list */
>> @@ -470,10 +470,35 @@ static inline int mapping_mapped(struct address_space *mapping)
>> * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff
>> * marks vma as VM_SHARED if it is shared, and the file was opened for
>> * writing i.e. vma may be mprotected writable even if now readonly.
>> + *
>> + * If i_mmap_writable is negative, no new writable mappings are allowed. You
>> + * can only deny writable mappings, if none exists right now.
>> */
>> static inline int mapping_writably_mapped(struct address_space *mapping)
>> {
>> - return mapping->i_mmap_writable != 0;
>> + return atomic_read(&mapping->i_mmap_writable) > 0;
>
> I have a slight anxiety that you're halving the writable range here:
> but I think that if we can overflow with this change, we could overflow
> before; and there are int counts elsewhere which would overflow easier.

Currently you need a new vm_area_struct to increase that counter.
vm_area_struct is about 144 bytes, times INT_MAX is something around
300GB. This does indeed look like a limit that common machines could
reach in the not-too-distant future. I somehow doubt it's the only atomic that
might overflow soon, though. And there's always RLIMIT_AS you can set
to avoid such attacks.

I wouldn't oppose to using atomic_long_t, though. Many VM counters
already use atomic_long_t instead of atomic_t.
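David's estimate is easy to sanity-check (assuming the ~144-byte vm_area_struct he quotes):

```c
#include <limits.h>

/* Bytes of vm_area_structs needed to overflow a 32-bit mapping counter,
 * assuming ~144 bytes per vma as quoted in the discussion above. */
static unsigned long long vma_overflow_bytes(void)
{
	return (unsigned long long)INT_MAX * 144ULL; /* roughly 288 GiB */
}
```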

>> +}
>> +
>> +static inline int mapping_map_writable(struct address_space *mapping)
>> +{
>> + return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
>> + 0 : -EPERM;
>> +}
>> +
>> +static inline void mapping_unmap_writable(struct address_space *mapping)
>> +{
>> + return atomic_dec(&mapping->i_mmap_writable);
>
> Better just
>
> atomic_dec(&mapping->i_mmap_writable);
>
> I didn't realize before that the compiler allows you to return a void
> function from a void function without warning, but better not rely on that.

Fixed, thanks.

>> +}
>> +
>> +static inline int mapping_deny_writable(struct address_space *mapping)
>> +{
>> + return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
>> + 0 : -EBUSY;
>> +}
>> +
>> +static inline void mapping_allow_writable(struct address_space *mapping)
>> +{
>> + atomic_inc(&mapping->i_mmap_writable);
>> }
>>
>> /*
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index d2799d1..f1f127e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -421,7 +421,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>> atomic_dec(&inode->i_writecount);
>> mutex_lock(&mapping->i_mmap_mutex);
>> if (tmp->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable++;
>> + atomic_inc(&mapping->i_mmap_writable);
>> flush_dcache_mmap_lock(mapping);
>> /* insert tmp into the share list, just after mpnt */
>> if (unlikely(tmp->vm_flags & VM_NONLINEAR))
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 129b847..19b6562 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -216,7 +216,7 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
>> if (vma->vm_flags & VM_DENYWRITE)
>> atomic_inc(&file_inode(file)->i_writecount);
>> if (vma->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable--;
>> + mapping_unmap_writable(mapping);
>
> Okay. I wasted ridiculous time trying to decide whether it's better to say
>
> mapping_unmap_writable(mapping)
> or
> atomic_dec(&mapping->i_mmap_writable);
>
> here: there are consistency arguments both ways. I usually solve this
> problem by not giving wrapper names to such operations, but in this case
> I think the answer is just to accept whatever you decided was best.
>
>>
>> flush_dcache_mmap_lock(mapping);
>> if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -617,7 +617,7 @@ static void __vma_link_file(struct vm_area_struct *vma)
>> if (vma->vm_flags & VM_DENYWRITE)
>> atomic_dec(&file_inode(file)->i_writecount);
>> if (vma->vm_flags & VM_SHARED)
>> - mapping->i_mmap_writable++;
>> + atomic_inc(&mapping->i_mmap_writable);
>>
>> flush_dcache_mmap_lock(mapping);
>> if (unlikely(vma->vm_flags & VM_NONLINEAR))
>> @@ -1572,6 +1572,11 @@ munmap_back:
>> if (error)
>> goto free_vma;
>> }
>> + if (vm_flags & VM_SHARED) {
>> + error = mapping_map_writable(file->f_mapping);
>> + if (error)
>> + goto allow_write_and_free_vma;
>> + }
>> vma->vm_file = get_file(file);
>> error = file->f_op->mmap(file, vma);
>> if (error)
>> @@ -1611,8 +1616,12 @@ munmap_back:
>>
>> vma_link(mm, vma, prev, rb_link, rb_parent);
>> /* Once vma denies write, undo our temporary denial count */
>> - if (vm_flags & VM_DENYWRITE)
>> - allow_write_access(file);
>> + if (file) {
>> + if (vm_flags & VM_SHARED)
>> + mapping_unmap_writable(file->f_mapping);
>> + if (vm_flags & VM_DENYWRITE)
>> + allow_write_access(file);
>> + }
>> file = vma->vm_file;
>
> Very good. A bit subtle, and for a while I was preparing to warn you
> of the odd shmem_zero_setup() case (which materializes a file when
> none came in), and the danger of the count going wrong on that.
>
> But you have it handled, with the "if (file)" block immediately before
> the new assignment of file, which should force anyone to think about it.
> And the ordering here, with respect to file and error exits, has always
> been tricky - I don't think you are making it any worse.

I tried to keep it as similar to VM_DENYWRITE as possible. It's not
really obvious and I had to read it several times, too. But it all
looks fine to me now.

> Oh, more subtle than I realized above: I've just gone through a last
> minute "hey, it's wrong; oh, no, it's okay" waver here, remembering
> how mmap /dev/zero (and some others, I suppose) gives you a new file.
>
> That is okay: we don't even have to assume that sealing is limited
> to your memfd_create(): it's safe to assume that any device which
> forks you a new file, is safe to assert mapping_map_writable() upon
> temporarily; and the new file given could never come already sealed.
>
> But maybe a brief comment to reassure us in future?

I added a comment explaining that ->mmap has to make sure vma_link()
succeeds in raising these counters in case it modified vma->vm_file.
All users I found return files that haven't been exposed to user-space
at that point, so they're safe. They also don't do initial sealing or
VM_DENYWRITE, so they are safe in both regards.
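To make the counting scheme easier to reason about, here is a minimal
user-space model of the i_mmap_writable semantics from patch 1/7. The
helper names mirror the kernel wrappers, but this is an illustrative
sketch, not the kernel code: a positive count means writable shared
mappings exist, a negative count means writes are denied, and each side
can only push the counter further from zero while the other side isn't
holding it.

```c
#include <stdatomic.h>

static int map_writable(atomic_int *c)          /* inc unless negative */
{
        int v = atomic_load(c);

        while (v >= 0)
                if (atomic_compare_exchange_weak(c, &v, v + 1))
                        return 0;
        return -1;                              /* writes are denied */
}

static void unmap_writable(atomic_int *c)
{
        atomic_fetch_sub(c, 1);
}

static int deny_writable(atomic_int *c)         /* dec unless positive */
{
        int v = atomic_load(c);

        while (v <= 0)
                if (atomic_compare_exchange_weak(c, &v, v - 1))
                        return 0;
        return -1;                              /* writable mapping exists */
}

static void allow_writable(atomic_int *c)
{
        atomic_fetch_add(c, 1);
}
```

In this model, deny_writable() is what SEAL_WRITE needs: it fails as
long as any writable mapping is live (-EBUSY in the kernel), and once it
succeeds, map_writable() fails until allow_writable() reverts it.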

>> out:
>> perf_event_mmap(vma);
>> @@ -1641,14 +1650,17 @@ out:
>> return addr;
>>
>> unmap_and_free_vma:
>> - if (vm_flags & VM_DENYWRITE)
>> - allow_write_access(file);
>> vma->vm_file = NULL;
>> fput(file);
>>
>> /* Undo any partial mapping done by a device driver. */
>> unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
>> charged = 0;
>> + if (vm_flags & VM_SHARED)
>> + mapping_unmap_writable(file->f_mapping);
>> +allow_write_and_free_vma:
>> + if (vm_flags & VM_DENYWRITE)
>> + allow_write_access(file);
>
> Okay. I don't enjoy that rearrangement, such that those two uses of
> file come after the fput(file); but I believe there are good reasons
> why it can never be the final fput(file), so I think you're okay.

Actually, the fput(file) should rather be "fput(vma->vm_file)". The
reference we drop here is not the reference we got passed by the
syscall context, but the reference we took for vma->vm_file.
Therefore, I think it's fine to access "file" afterwards, as this is
an independent reference. The vma->vm_file reference is also taken
_after_ doing the VM_DENYWRITE and VM_SHARED accounting. Therefore, I
put the cleanup of vma->vm_file before reverting the counters, for
symmetry's sake.

Note that we cannot turn fput(file) into fput(vma->vm_file), though.
The vma->vm_file field might be left set to something else after
->mmap() failed.

Thanks
David

>> free_vma:
>> kmem_cache_free(vm_area_cachep, vma);
>> unacct_error:
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 2972eee..31321fa 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -39,6 +39,7 @@ static struct backing_dev_info swap_backing_dev_info = {
>> struct address_space swapper_spaces[MAX_SWAPFILES] = {
>> [0 ... MAX_SWAPFILES - 1] = {
>> .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
>> + .i_mmap_writable = ATOMIC_INIT(0),
>> .a_ops = &swap_aops,
>> .backing_dev_info = &swap_backing_dev_info,
>> }
>> --
>> 2.0.0

2014-07-19 16:17:36

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 2/7] shm: add sealing API

Hi Hugh

On Wed, Jul 16, 2014 at 12:06 PM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> If two processes share a common memory region, they usually want some
>> guarantees to allow safe access. This often includes:
>> - one side cannot overwrite data while the other reads it
>> - one side cannot shrink the buffer while the other accesses it
>> - one side cannot grow the buffer beyond previously set boundaries
>>
>> If there is a trust-relationship between both parties, there is no need
>> for policy enforcement. However, if there's no trust relationship (eg.,
>> for general-purpose IPC) sharing memory-regions is highly fragile and
>> often not possible without local copies. Look at the following two
>> use-cases:
>> 1) A graphics client wants to share its rendering-buffer with a
>> graphics-server. The memory-region is allocated by the client for
>> read/write access and a second FD is passed to the server. While
>> scanning out from the memory region, the server has no guarantee that
>> the client doesn't shrink the buffer at any time, requiring rather
>> cumbersome SIGBUS handling.
>> 2) A process wants to perform an RPC on another process. To avoid huge
>> bandwidth consumption, zero-copy is preferred. After a message is
>> assembled in-memory and a FD is passed to the remote side, both sides
>> want to be sure that neither modifies this shared copy, anymore. The
>> source may have put sensible data into the message without a separate
>> copy and the target may want to parse the message inline, to avoid a
>> local copy.
>>
>> While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
>> ways to achieve most of this, the first one is disproportionately ugly to
>> use in libraries and the latter two are broken/racy or even disabled due
>> to denial of service attacks.
>>
>> This patch introduces the concept of SEALING. If you seal a file, a
>> specific set of operations is blocked on that file forever.
>> Unlike locks, seals can only be set, never removed. Hence, once you
>> verified a specific set of seals is set, you're guaranteed that no-one can
>> perform the blocked operations on this file, anymore.
>>
>> An initial set of SEALS is introduced by this patch:
>> - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
>> in size. This affects ftruncate() and open(O_TRUNC).
>> - GROW: If SEAL_GROW is set, the file in question cannot be increased
>> in size. This affects ftruncate(), fallocate() and write().
>> - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
>> are possible. This affects fallocate(PUNCH_HOLE), mmap() and
>> write().
>> - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
>> This basically prevents the F_ADD_SEAL operation on a file and
>> can be set to prevent others from adding further seals that you
>> don't want.
>>
>> The described use-cases can easily use these seals to provide safe use
>> without any trust-relationship:
>> 1) The graphics server can verify that a passed file-descriptor has
>> SEAL_SHRINK set. This allows safe scanout, while the client is
>> allowed to increase buffer size for window-resizing on-the-fly.
>> Concurrent writes are explicitly allowed.
>> 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
>> SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
>> process can modify the data while the other side parses it.
>> Furthermore, it guarantees that even with writable FDs passed to the
>> peer, it cannot increase the size to hit memory-limits of the source
>> process (in case the file-storage is accounted to the source).
>>
>> The new API is an extension to fcntl(), adding two new commands:
>> F_GET_SEALS: Return a bitset describing the seals on the file. This
>> can be called on any FD if the underlying file supports
>> sealing.
>> F_ADD_SEALS: Change the seals of a given file. This requires WRITE
>> access to the file and F_SEAL_SEAL may not already be set.
>> Furthermore, the underlying file must support sealing and
>> there may not be any existing shared mapping of that file.
>> Otherwise, EBADF/EPERM is returned.
>> The given seals are _added_ to the existing set of seals
>> on the file. You cannot remove seals again.
>>
>> The fcntl() handler is currently specific to shmem and disabled on all
>> files. A file needs to explicitly support sealing for this interface to
>> work. A separate syscall is added in a follow-up, which creates files that
>> support sealing. There is no intention to support this on other
>> file-systems. Semantics are unclear for non-volatile files and we lack any
>> use-case right now. Therefore, the implementation is specific to shmem.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> Looks pretty good to me, minor comments below.
>
>> ---
>> fs/fcntl.c | 5 ++
>> include/linux/shmem_fs.h | 17 ++++
>> include/uapi/linux/fcntl.h | 15 ++++
>> mm/shmem.c | 189 ++++++++++++++++++++++++++++++++++++++++++++-
>> 4 files changed, 223 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fcntl.c b/fs/fcntl.c
>> index 72c82f6..22d1c3d 100644
>> --- a/fs/fcntl.c
>> +++ b/fs/fcntl.c
>> @@ -21,6 +21,7 @@
>> #include <linux/rcupdate.h>
>> #include <linux/pid_namespace.h>
>> #include <linux/user_namespace.h>
>> +#include <linux/shmem_fs.h>
>>
>> #include <asm/poll.h>
>> #include <asm/siginfo.h>
>> @@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>> case F_GETPIPE_SZ:
>> err = pipe_fcntl(filp, cmd, arg);
>> break;
>> + case F_ADD_SEALS:
>> + case F_GET_SEALS:
>> + err = shmem_fcntl(filp, cmd, arg);
>> + break;
>> default:
>> break;
>> }
>> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
>> index 4d1771c..50777b5 100644
>> --- a/include/linux/shmem_fs.h
>> +++ b/include/linux/shmem_fs.h
>> @@ -1,6 +1,7 @@
>> #ifndef __SHMEM_FS_H
>> #define __SHMEM_FS_H
>>
>> +#include <linux/file.h>
>> #include <linux/swap.h>
>> #include <linux/mempolicy.h>
>> #include <linux/pagemap.h>
>> @@ -11,6 +12,7 @@
>>
>> struct shmem_inode_info {
>> spinlock_t lock;
>> + unsigned int seals; /* shmem seals */
>> unsigned long flags;
>> unsigned long alloced; /* data pages alloced to file */
>> union {
>> @@ -65,4 +67,19 @@ static inline struct page *shmem_read_mapping_page(
>> mapping_gfp_mask(mapping));
>> }
>>
>> +#ifdef CONFIG_TMPFS
>> +
>> +extern int shmem_add_seals(struct file *file, unsigned int seals);
>> +extern int shmem_get_seals(struct file *file);
>> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
>> +
>> +#else
>> +
>> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
>> +{
>> + return -EINVAL;
>> +}
>> +
>> +#endif
>> +
>> #endif
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 074b886..beed138 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -28,6 +28,21 @@
>> #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
>>
>> /*
>> + * Set/Get seals
>> + */
>> +#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
>> +#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
>> +
>> +/*
>> + * Types of seals
>> + */
>> +#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
>> +#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
>> +#define F_SEAL_GROW 0x0004 /* prevent file from growing */
>> +#define F_SEAL_WRITE 0x0008 /* prevent writes */
>> +/* (1U << 31) is reserved for signed error codes */
>> +
>> +/*
>> * Types of directory notifications that may be requested.
>> */
>> #define DN_ACCESS 0x00000001 /* File accessed */
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index f484c27..1438b3e 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
>> #include <linux/highmem.h>
>> #include <linux/seq_file.h>
>> #include <linux/magic.h>
>> +#include <linux/fcntl.h>
>>
>> #include <asm/uaccess.h>
>> #include <asm/pgtable.h>
>> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>> static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>> {
>> struct inode *inode = dentry->d_inode;
>> + struct shmem_inode_info *info = SHMEM_I(inode);
>> + loff_t oldsize = inode->i_size;
>> + loff_t newsize = attr->ia_size;
>> int error;
>>
>> error = inode_change_ok(inode, attr);
>> if (error)
>> return error;
>>
>> - if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>> - loff_t oldsize = inode->i_size;
>> - loff_t newsize = attr->ia_size;
>> + /* protected by i_mutex */
>> + if (attr->ia_valid & ATTR_SIZE) {
>> + if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
>> + (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>> + return -EPERM;
>> + }
>
> Not important but...
> I'd have thought all that was better inside the S_ISREG(inode->i_mode) &&
> (attr->ia_valid & ATTR_SIZE) block. Less unnecessary change above, and
> more efficient for non-size attrs. You cannot seal anything but a regular
> file anyway, right?

You're right. Fixed.

>>
>> + if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>> if (newsize != oldsize) {
>> i_size_write(inode, newsize);
>> inode->i_ctime = inode->i_mtime = CURRENT_TIME;
>> @@ -1315,6 +1323,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>> info = SHMEM_I(inode);
>> memset(info, 0, (char *)inode - (char *)info);
>> spin_lock_init(&info->lock);
>> + info->seals = F_SEAL_SEAL;
>> info->flags = flags & VM_NORESERVE;
>> INIT_LIST_HEAD(&info->swaplist);
>> simple_xattrs_init(&info->xattrs);
>> @@ -1374,7 +1383,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>> {
>> int ret;
>> struct inode *inode = mapping->host;
>> + struct shmem_inode_info *info = SHMEM_I(inode);
>> pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>> +
>> + /* i_mutex is held by caller */
>> + if (info->seals & F_SEAL_WRITE)
>> + return -EPERM;
>> + if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
>> + return -EPERM;
>
> I think this is your only addition which comes in a hot path.
> Mel has been shaving nanoseconds off this path recently: you're not
> introducing any atomic ops here, good, but I wonder if it would make any
> measurable difference to include this pair of tests inside a single
> "if (unlikely(info->seals)) {". Maybe not, but it wouldn't hurt.

It doesn't hurt adding the unlikely() protection, either. Fixed.
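The combined check then looks roughly like this. This is a compilable
sketch of the logic with the seal bits open-coded as plain constants;
the real code uses the F_SEAL_* definitions and the inode/info fields
directly:

```c
#define SEAL_GROW  0x0004
#define SEAL_WRITE 0x0008

/* Sketch of the shmem_write_begin() hot-path check: one cheap test
 * on the seal mask (annotated unlikely() in the kernel) guards the
 * two individual seal checks, so unsealed files pay almost nothing. */
static int write_begin_check(unsigned int seals,
                             long long pos, long long len,
                             long long i_size)
{
        if (seals) {                    /* if (unlikely(info->seals)) */
                if (seals & SEAL_WRITE)
                        return -1;      /* -EPERM */
                if ((seals & SEAL_GROW) && pos + len > i_size)
                        return -1;      /* -EPERM */
        }
        return 0;
}
```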

>> +
>> ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
>> if (ret == 0 && *pagep)
>> init_page_accessed(*pagep);
>> @@ -1715,11 +1732,166 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
>> return offset;
>> }
>>
>> +/*
>> + * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
>> + * via get_user_pages(), drivers might have some pending I/O without any active
>> + * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
>> + * and see whether it has an elevated ref-count. If so, we abort.
>> + * The caller must guarantee that no new user will acquire writable references
>> + * to those pages to avoid races.
>> + */
>> +static int shmem_test_for_pins(struct address_space *mapping)
>> +{
>> + struct radix_tree_iter iter;
>> + void **slot;
>> + pgoff_t start;
>> + struct page *page;
>> + int error;
>> +
>> + /* flush additional refs in lru_add early */
>> + lru_add_drain_all();
>> +
>> + error = 0;
>> + start = 0;
>> + rcu_read_lock();
>> +
>> +restart:
>> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
>> + page = radix_tree_deref_slot(slot);
>> + if (!page || radix_tree_exception(page)) {
>> + if (radix_tree_deref_retry(page))
>> + goto restart;
>> + } else if (page_count(page) - page_mapcount(page) > 1) {
>> + error = -EBUSY;
>> + break;
>> + }
>> +
>> + if (need_resched()) {
>> + cond_resched_rcu();
>> + start = iter.index + 1;
>> + goto restart;
>> + }
>> + }
>> + rcu_read_unlock();
>> +
>> + return error;
>> +}
>
> Please leave shmem_test_for_pins() (and the comment above it)
> out of this particular patch.
>
> The implementation here satisfies none of us, make it harder to
> figure out the patch improving it later, and distracts from the basic
> sealing interface and functionality that you introduce in this patch.
>
> A brief comment on the issue instead - "But what if a page of the
> object is pinned for pending I/O? See later patch" - maybe, but on the
> whole I think it's better to raise and settle the issue in later patch.

Removed and replaced by a dummy:

int shmem_wait_for_pins(struct address_space *mapping)
{
return 0;
}

>> +
>> +#define F_ALL_SEALS (F_SEAL_SEAL | \
>> + F_SEAL_SHRINK | \
>> + F_SEAL_GROW | \
>> + F_SEAL_WRITE)
>> +
>> +int shmem_add_seals(struct file *file, unsigned int seals)
>> +{
>> + struct dentry *dentry = file->f_path.dentry;
>> + struct inode *inode = dentry->d_inode;
>
> struct inode *inode = file_inode(file), and forget about dentry?

Fixed.

>> + struct shmem_inode_info *info = SHMEM_I(inode);
>> + int error;
>> +
>> + /*
>> + * SEALING
>> + * Sealing allows multiple parties to share a shmem-file but restrict
>> + * access to a specific subset of file operations. Seals can only be
>> + * added, but never removed. This way, mutually untrusted parties can
>> + * share common memory regions with a well-defined policy. A malicious
>> + * peer can thus never perform unwanted operations on a shared object.
>> + *
>> + * Seals are only supported on special shmem-files and always affect
>> + * the whole underlying inode. Once a seal is set, it may prevent some
>> + * kinds of access to the file. Currently, the following seals are
>> + * defined:
>> + * SEAL_SEAL: Prevent further seals from being set on this file
>> + * SEAL_SHRINK: Prevent the file from shrinking
>> + * SEAL_GROW: Prevent the file from growing
>> + * SEAL_WRITE: Prevent write access to the file
>> + *
>> + * As we don't require any trust relationship between two parties, we
>> + * must prevent seals from being removed. Therefore, sealing a file
>> + * only adds a given set of seals to the file, it never touches
>> + * existing seals. Furthermore, the "setting seals"-operation can be
>> + * sealed itself, which basically prevents any further seal from being
>> + * added.
>> + *
>> + * Semantics of sealing are only defined on volatile files. Only
>> + * anonymous shmem files support sealing. More importantly, seals are
>> + * never written to disk. Therefore, there's no plan to support it on
>> + * other file types.
>> + */
>> +
>> + if (file->f_op != &shmem_file_operations)
>> + return -EINVAL;
>> + if (!(file->f_mode & FMODE_WRITE))
>> + return -EINVAL;
>
> I would expect -EBADF there (like when you write to read-only fd).
> Though I was okay with the -EPERM you had the previous version.

Nice catch. Replaced by EPERM again.

>> + if (seals & ~(unsigned int)F_ALL_SEALS)
>> + return -EINVAL;
>> +
>> + mutex_lock(&inode->i_mutex);
>> +
>> + if (info->seals & F_SEAL_SEAL) {
>
> I notice this is inconsistent with F_SEAL_WRITE just below:
> we're allowed to SEAL_WRITE what's already SEAL_WRITEd,
> but not to SEAL_SEAL what's already SEAL_SEALed.
> Oh, never mind, I can see that makes some sense.
>
>> + error = -EPERM;
>> + goto unlock;
>> + }
>> +
>> + if ((seals & F_SEAL_WRITE) && !(info->seals & F_SEAL_WRITE)) {
>> + error = mapping_deny_writable(file->f_mapping);
>> + if (error)
>
> Which would be -EBUSY, yes, that seems okay.
>
> And with your atomic i_mmap_writable changes in 1/7, and the i_mutex
> here, the locking is now solid, and accomplished simply: nice.
>
>> + goto unlock;
>> +
>> + error = shmem_test_for_pins(file->f_mapping);
>> + if (error) {
>> + mapping_allow_writable(file->f_mapping);
>> + goto unlock;
>> + }
>
> Right, although I ask you to remove shmem_test_for_pins() from this
> patch, I can see that you might want to include a "return 0" stub for
> shmem_wait_for_pins() in this patch, just so that this can appear here
> now, and we consider the non-atomicity of it. Yes, I agree this is
> how it should proceed: first deny, then re-allow if waiting fails.

I replaced shmem_test_for_pins() with shmem_wait_for_pins() and kept
the code as is.

>> + }
>> +
>> + info->seals |= seals;
>> + error = 0;
>> +
>> +unlock:
>> + mutex_unlock(&inode->i_mutex);
>> + return error;
>> +}
>> +EXPORT_SYMBOL_GPL(shmem_add_seals);
>> +
>> +int shmem_get_seals(struct file *file)
>> +{
>> + if (file->f_op != &shmem_file_operations)
>> + return -EINVAL;
>
> That's fine, though it is worth considering whether return 0
> might be preferable. No, I suppose this is easier, fits with
> shmem_fcntl() just returning -EINVAL when !TMPFS or !SHMEM.

Agreed.
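For completeness, the fcntl() extension discussed here can be exercised
from user-space roughly as follows. This is a hedged sketch: it assumes
the interface as finally merged, i.e. a glibc new enough (2.27+) to
expose memfd_create() and the F_ADD_SEALS/F_GET_SEALS/F_SEAL_*
definitions:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a sealable memfd, size it, then seal it against resizing.
 * Returns 1 if the seals are reported back by F_GET_SEALS and a
 * later grow attempt fails with EPERM, 0 otherwise. */
static int sealing_blocks_resize(void)
{
        int fd, seals, grow_blocked;

        fd = memfd_create("sealdemo", MFD_ALLOW_SEALING);
        if (fd < 0)
                return 0;
        if (ftruncate(fd, 4096) < 0 ||
            fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0) {
                close(fd);
                return 0;
        }
        seals = fcntl(fd, F_GET_SEALS);
        grow_blocked = ftruncate(fd, 8192) < 0 && errno == EPERM;
        close(fd);
        return (seals & F_SEAL_SHRINK) && (seals & F_SEAL_GROW) &&
               grow_blocked;
}
```

Note that F_ADD_SEALS succeeds here precisely because no shared writable
mapping of the file exists at sealing time.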

>> +
>> + return SHMEM_I(file_inode(file))->seals & F_ALL_SEALS;
>
> & F_ALL_SEALS? Okay, that may be some kind of future proofing that you
> have in mind; but it may just be a leftover from when you were using bit
> 31 for internal use.

Nice catch. It's indeed a left-over. I removed it.

>> +}
>> +EXPORT_SYMBOL_GPL(shmem_get_seals);
>> +
>> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + long error;
>> +
>> + switch (cmd) {
>> + case F_ADD_SEALS:
>> + /* disallow upper 32bit */
>> + if (arg >> 32)
>> + return -EINVAL;
>
> That is worth checking, but gives
> mm/shmem.c:1948:3: warning: right shift count >= width of type
> on a 32-bit build. I expect there's an accepted way to do it;
> I've used "arg > UINT_MAX" myself in some places.

The zero-day test bot already reported this and I fixed it with a u64
cast. Your UINT_MAX test is definitely nicer, so I changed it again.
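The difference can be shown with a small compilable sketch.
check_seal_arg() is a hypothetical stand-in for the kernel check; the
point is that "arg > UINT_MAX" is well-defined on both 32-bit and 64-bit
(and folds away when unsigned long is 32 bits wide), whereas "arg >> 32"
is a shift by the full type width on 32-bit:

```c
#include <limits.h>

/* Hypothetical validator mirroring the range check discussed above. */
static int check_seal_arg(unsigned long arg)
{
        if (arg > UINT_MAX)
                return -1;      /* would be -EINVAL in the kernel */
        return 0;
}
```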

Thanks!
David

>> +
>> + error = shmem_add_seals(file, arg);
>> + break;
>> + case F_GET_SEALS:
>> + error = shmem_get_seals(file);
>> + break;
>> + default:
>> + error = -EINVAL;
>> + break;
>> + }
>> +
>> + return error;
>> +}
>> +
>> static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>> loff_t len)
>> {
>> struct inode *inode = file_inode(file);
>> struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> + struct shmem_inode_info *info = SHMEM_I(inode);
>> struct shmem_falloc shmem_falloc;
>> pgoff_t start, index, end;
>> int error;
>> @@ -1731,6 +1903,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>> loff_t unmap_start = round_up(offset, PAGE_SIZE);
>> loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>>
>> + /* protected by i_mutex */
>> + if (info->seals & F_SEAL_WRITE) {
>> + error = -EPERM;
>> + goto out;
>> + }
>> +
>> if ((u64)unmap_end > (u64)unmap_start)
>> unmap_mapping_range(mapping, unmap_start,
>> 1 + unmap_end - unmap_start, 0);
>> @@ -1745,6 +1923,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>> if (error)
>> goto out;
>>
>> + if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
>> + error = -EPERM;
>> + goto out;
>> + }
>> +
>> start = offset >> PAGE_CACHE_SHIFT;
>> end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> /* Try to avoid a swapstorm if len is impossible to satisfy */
>> --
>> 2.0.0

2014-07-19 16:29:34

by David Herrmann

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] shm: add memfd_create() syscall

Hi

On Wed, Jul 16, 2014 at 12:07 PM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It can support sealing and avoids any
>> connection to user-visible mount-points. Thus, it's not subject to quotas
>> on mounted file-systems, but can be used like malloc()'ed memory, but
>> with a file-descriptor to it.
>>
>> memfd_create() returns the raw shmem file, so calls like ftruncate() can
>> be used to modify the underlying inode. Also calls like fstat()
>> will return proper information and mark the file as regular file. If you
>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>> supported (like on all other regular files).
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike. It is still properly accounted to memcg
>> limits, though.
>
> It's an important point, but unclear quite what "quotas and alike" means.
> There's never been any quota support in shmem/tmpfs, but filesystem size
> can be limited. Maybe say "and is not subject to a filesystem size limit.
> It is still properly accounted to memcg limits, though, and to the same
> overcommit or no-overcommit accounting as all user memory."

Yes, makes sense. Fixed.

>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> A comment or two below, but this is okay by me. I'm not wildly excited
> to be getting a new system call in mm/shmem.c. I do like it much better
> now that you've dropped the size arg, thank you, but I still find it an
> odd system call: if it were not for the name, that you want so much for
> debugging, I think we would just implement this with a /dev/sealable
> alongside /dev/zero, which gave you your own object on opening (in the
> way that /dev/zero gives you your own object on mmap'ing).

mmap() supports replacing the mapped file with a new one, which is why
/dev/zero works just fine. open() doesn't allow that, and it looks
non-trivial to make it work. "non-trivial" is not really a
counter-argument, but in my opinion the object name alone is worth a new
syscall. It's a really nice feature for debugging complex systems.

> I haven't checked the manpage, I hope it's made very clear that
> there's no uniqueness imposed on the name, that it's merely a tag
> attached to the object.

Yes, the man-page clearly states that names are for debugging purposes
only and are exposed via the /proc/self/fd/ symlink targets. They're
not subject to conflict checks, nor do two memfds with the same name
behave any differently.
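This is easy to demonstrate from user-space. A sketch, assuming a glibc
with a memfd_create() wrapper (2.27+); on older systems the raw syscall
would be needed:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* The name passed to memfd_create() is purely a debugging tag: it
 * shows up, prefixed with "memfd:", as the /proc/self/fd symlink
 * target. Returns 1 if both the prefix and the name are visible. */
static int memfd_name_visible(const char *name)
{
        char path[64], target[256];
        ssize_t n;
        int fd = memfd_create(name, 0);

        if (fd < 0)
                return 0;
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        n = readlink(path, target, sizeof(target) - 1);
        close(fd);
        if (n < 0)
                return 0;
        target[n] = '\0';
        return strstr(target, "memfd:") != NULL &&
               strstr(target, name) != NULL;
}
```

Creating a second memfd with the same name succeeds just as well; the
name is a tag, not a namespace entry.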

> But from a shmem point of view this seems fine: if everyone else
> is happy with memfd_create(), it's fine by me.
>
>> ---
>> arch/x86/syscalls/syscall_32.tbl | 1 +
>> arch/x86/syscalls/syscall_64.tbl | 1 +
>> include/linux/syscalls.h | 1 +
>> include/uapi/linux/memfd.h | 8 +++++
>> kernel/sys_ni.c | 1 +
>> mm/shmem.c | 72 ++++++++++++++++++++++++++++++++++++++++
>> 6 files changed, 84 insertions(+)
>> create mode 100644 include/uapi/linux/memfd.h
>>
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index d6b8679..e7495b4 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -360,3 +360,4 @@
>> 351 i386 sched_setattr sys_sched_setattr
>> 352 i386 sched_getattr sys_sched_getattr
>> 353 i386 renameat2 sys_renameat2
>> +354 i386 memfd_create sys_memfd_create
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index ec255a1..28be0e1 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -323,6 +323,7 @@
>> 314 common sched_setattr sys_sched_setattr
>> 315 common sched_getattr sys_sched_getattr
>> 316 common renameat2 sys_renameat2
>> +317 common memfd_create sys_memfd_create
>>
>> #
>> # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b0881a0..0be5d4d 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>> asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>> asmlinkage long sys_eventfd(unsigned int count);
>> asmlinkage long sys_eventfd2(unsigned int count, int flags);
>> +asmlinkage long sys_memfd_create(const char *uname_ptr, unsigned int flags);
>> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>> asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>> asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
>> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
>> new file mode 100644
>> index 0000000..534e364
>> --- /dev/null
>> +++ b/include/uapi/linux/memfd.h
>> @@ -0,0 +1,8 @@
>> +#ifndef _UAPI_LINUX_MEMFD_H
>> +#define _UAPI_LINUX_MEMFD_H
>> +
>> +/* flags for memfd_create(2) (unsigned int) */
>> +#define MFD_CLOEXEC 0x0001U
>> +#define MFD_ALLOW_SEALING 0x0002U
>> +
>> +#endif /* _UAPI_LINUX_MEMFD_H */
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 36441b5..489a4e6 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime);
>> cond_syscall(compat_sys_timerfd_gettime);
>> cond_syscall(sys_eventfd);
>> cond_syscall(sys_eventfd2);
>> +cond_syscall(sys_memfd_create);
>>
>> /* performance counters: */
>> cond_syscall(sys_perf_event_open);
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 1438b3e..e7c5fe1 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
>> #include <linux/highmem.h>
>> #include <linux/seq_file.h>
>> #include <linux/magic.h>
>> +#include <linux/syscalls.h>
>> #include <linux/fcntl.h>
>> +#include <uapi/linux/memfd.h>
>>
>> #include <asm/uaccess.h>
>> #include <asm/pgtable.h>
>> @@ -2662,6 +2664,76 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>> shmem_show_mpol(seq, sbinfo->mpol);
>> return 0;
>> }
>> +
>> +#define MFD_NAME_PREFIX "memfd:"
>> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>> +
>> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
>> +
>> +SYSCALL_DEFINE2(memfd_create,
>> + const char*, uname,
>
> Jann Horn suggested "const char __user *" rather than "const char *",
> here and in syscalls.h, I think that's right (for sparse: compare
> with sys_open, for example).

Both fixed already. Sorry, I forgot to reply to Jann Horn. Thanks to
both of you!

>> + unsigned int, flags)
>> +{
>> + struct shmem_inode_info *info;
>> + struct file *file;
>> + int fd, error;
>> + char *name;
>> + long len;
>> +
>> + if (flags & ~(unsigned int)MFD_ALL_FLAGS)
>> + return -EINVAL;
>> +
>> + /* length includes terminating zero */
>> + len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>> + if (len <= 0)
>> + return -EFAULT;
>> + if (len > MFD_NAME_MAX_LEN + 1)
>> + return -EINVAL;
>> +
>> + name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY);
>> + if (!name)
>> + return -ENOMEM;
>> +
>> + strcpy(name, MFD_NAME_PREFIX);
>> + if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
>> + error = -EFAULT;
>> + goto err_name;
>> + }
>> +
>> + /* terminating-zero may have changed after strnlen_user() returned */
>> + if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
>> + error = -EFAULT;
>> + goto err_name;
>> + }
>> +
>> + fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
>
> Perhaps we should throw O_LARGEFILE in there too? So 32-bit is not
> surprised when it accesses beyond MAX_NON_LFS. I guess it's almost
> a non-issue, since the file is in memory, so not expected to be very
> large; but I seem to recall being caught out by a missing O_LARGEFILE
> in the past, and a new interface like this might do better to force it.
>
> But I'm not very sure of my ground here: please ask around, an fsdevel
> person will have a better idea than me, whether it's best included.

get_unused_fd_flags() doesn't accept any flags other than O_CLOEXEC, so
we need to set O_LARGEFILE on the file directly, like we already do for
f_mode.

On 64bit O_LARGEFILE is already forced for many syscalls. I added it
now as it makes perfect sense. It's part of the memfd ABI now.
man-page is fixed, too.

Thanks
David

>> + if (fd < 0) {
>> + error = fd;
>> + goto err_name;
>> + }
>> +
>> + file = shmem_file_setup(name, 0, VM_NORESERVE);
>> + if (IS_ERR(file)) {
>> + error = PTR_ERR(file);
>> + goto err_fd;
>> + }
>> + info = SHMEM_I(file_inode(file));
>> + file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>> + if (flags & MFD_ALLOW_SEALING)
>> + info->seals &= ~F_SEAL_SEAL;
>> +
>> + fd_install(fd, file);
>> + kfree(name);
>> + return fd;
>> +
>> +err_fd:
>> + put_unused_fd(fd);
>> +err_name:
>> + kfree(name);
>> + return error;
>> +}
>> +
>> #endif /* CONFIG_TMPFS */
>>
>> static void shmem_put_super(struct super_block *sb)
>> --
>> 2.0.0

2014-07-19 16:31:29

by David Herrmann

Subject: Re: [PATCH v3 4/7] selftests: add memfd_create() + sealing tests

Hi

On Wed, Jul 16, 2014 at 12:07 PM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> Some basic tests to verify sealing on memfds works as expected and
>> guarantees the advertised semantics.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> Thanks for whatever the fix was, I didn't hit any problem running
> this version repeatedly, on 64-bit and on 32-bit.

glibc caches the PID, so getpid() can skip the syscall once it has been
called. fork() and clone() have to update that cache, though. Therefore,
you shouldn't mix fork() and raw clone() in the same process. I replaced
my fork() call with a simple clone() and the bug was gone.

Thanks
David

2014-07-19 16:32:57

by David Herrmann

Subject: Re: [PATCH v3 5/7] selftests: add memfd/sealing page-pinning tests

Hi

On Wed, Jul 16, 2014 at 12:08 PM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> Setting SEAL_WRITE is not possible if there're pending GUP users. This
>> commit adds selftests for memfd+sealing that use FUSE to create pending
>> page-references. FUSE is very helpful here in that it allows us to delay
>> direct-IO operations for an arbitrary amount of time. This way, we can
>> force the kernel to pin pages and then run our normal selftests.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> I had a number of problems in getting this working (on openSUSE 13.1):
> rpm told me I had fuse installed, yet I had to download and install
> the tarball to get header files needed; then "make fuse_mnt" told me
> to add -D_FILE_OFFSET_BITS=64 to the compile flags; after which I
> got "undefined reference to `fuse_main_real'"; but then I tried
> "make run_fuse" as root, and it seemed to sort these issues out
> for itself, aside from "./run_fuse_test.sh: Permission denied" -
> which was within my bounds of comprehension unlike the rest!
>
> No complaint, thanks for providing the test (though I didn't check
> the source to convince myself that "DONE" has done what's claimed):
> some rainy day someone can get the Makefile working more smoothly,
> no need to delay the patchset for this.

-D_FILE_OFFSET_BITS=64 makes sense. I added it. The "undefined reference"
issue doesn't make sense to me and I cannot reproduce it. I will see
what I can do.

The "Permission denied" obviously just requires access to /dev/fuse,
as you figured out yourself.

Thanks
David

2014-07-19 16:36:37

by David Herrmann

Subject: Re: [RFC v3 6/7] shm: wait for pins to be released when sealing

Hi

On Wed, Jul 16, 2014 at 12:09 PM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> We currently fail setting SEAL_WRITE in case there're pending page
>> references. This patch extends the pin-tests to wait up to 150ms for all
>> references to be dropped. This is still not perfect in that it doesn't
>> account for harmless read-only pins, but it's much better than a hard
>> failure.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> Right, I didn't look through the patch itself, just compared the result
> with what I sent. Okay, you prefer to separate out shmem_tag_pins().

The main reason I split the two is to avoid the goto labels "restart" and
"restart2".

> Yes, it looks fine. There's just one change I'd like at this stage,
> something I realized shortly after sending the code fragment: please
> add a call to lru_add_drain() at the head of shmem_tag_pins(). The
> reason being that lru_add_drain() is local to the cpu, so cheap, and
> in many cases will bring down all the raised refcounts right then.
>
> Whereas lru_add_drain_all() in the first scan of shmem_wait_for_pins()
> is much more expensive, involving inter-processor interrupts to do
> that on all cpus: it is appropriate to call it at that point, but we
> really ought to try the cheaper lru_add_drain() at the earlier stage.

I added an lru_add_drain_all() to my shmem_test_pins() function in
Patch 2/7. This patch dropped it again, since your shmem_wait_for_pins()
already includes it and it's quite expensive. But yes, the local
lru_add_drain() makes perfect sense. Fixed!
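
So the flow ends up roughly like this (simplified pseudocode, paraphrasing
the patch and your suggestion; the refcount test and the exact scan/timeout
logic are condensed):

```
shmem_tag_pins(mapping):
    lru_add_drain()              # cheap, local-CPU: often drops the
                                 # transient refs right here
    for each page in mapping's radix tree:
        if page has extra refs beyond its mappings:
            tag it as pinned

shmem_wait_for_pins(mapping):
    for scan in 1 .. LAST_SCAN:  # bounded, ~150ms in total
        if no pages are tagged:
            return 0
        if scan == 1:
            lru_add_drain_all()  # expensive: IPIs every CPU
        else:
            sleep briefly
        for each tagged page:
            clear the tag if its extra refs are gone
    return -EBUSY                # some pin never went away
```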

Thanks
David

> I would also like never to embark on this scan of the radix_tree
> and wait for pins, if the pages were never given out in a VM_SHARED
> mapping - or is that unrealistic, because every memfd is read-write,
> and typical initialization expected to be by mmap() rather than write()?
> But anyway, you're quite right not to get into that at this stage:
> it's best left as an optimization once the basics are safely in.

2014-07-19 16:40:50

by David Herrmann

Subject: Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

Hi

On Wed, Jul 9, 2014 at 10:57 AM, Hugh Dickins <[email protected]> wrote:
> On Fri, 13 Jun 2014, David Herrmann wrote:
>
>> When setting SEAL_WRITE, we must make sure nobody has a writable reference
>> to the pages (via GUP or similar). We currently check references and wait
>> some time for them to be dropped. This, however, might fail for several
>> reasons, including:
>> - the page is pinned for longer than we wait
>> - while we wait, someone takes an already pinned page for read-access
>>
>> Therefore, this patch introduces page-isolation. When sealing a file with
>> SEAL_WRITE, we copy all pages that have an elevated ref-count. The newpage
>> is put in place atomically, the old page is detached and left alone. It
>> will get reclaimed once the last external user dropped it.
>>
>> Signed-off-by: David Herrmann <[email protected]>
>
> I've not checked it line by line, but this seems to be very good work;
> and I'm glad you have posted it, where we can refer back to it in future.
>
> However, I'm NAKing this patch, at least for now.
>
> The reason is simple and twofold.
>
> I absolutely do not want to be maintaining an alternative form of
> page migration in mm/shmem.c. Shmem has its own peculiar problems
> (mostly because of swap): adding a new dimension of very rarely
> exercised complication, and dependence on the rest mm, is not wise.
>
> And sealing just does not need this. It is clearly technically
> superior to, and more satisfying than, the "wait-a-while-then-give-up"
> technique which it would replace. But in practice, the wait-a-while
> technique is quite good enough (and much better self-contained than this).
>
> I've heard no requirement to support sealing of objects pinned for I/O,
> and the sealer just would not have given the object out for that; the
> requirement is to give the recipient of a sealed object confidence
> that it cannot be susceptible to modification in that way.
>
> I doubt there will ever be an actual need for sealing to use this
> migration technique; but I can imagine us referring back to your work in
> future, when/if implementing revoke on regular files. And integrating
> this into mm/migrate.c's unmap_and_move() as an extra-force mode
> (proceed even when the page count is raised).
>
> I think the concerns I had, when Tony first proposed this migration copy
> technique, were in fact unfounded - I was worried by the new inverse COW.
> On reflection, I don't think this introduces any new risks, which are
> not already present in page migration, truncation and orphaned pages.
>
> I didn't begin to test it at all, but the only defects that stood out
> in your code were in the areas of memcg and mlock. I think that if we
> go down the road of duplicating pinned pages, then we do have to make
> a memcg charge on the new page in addition to the old page. And if
> any pages happen to be mlock'ed into an address space, then we ought
> to map in the replacement pages afterwards (as page migration does,
> whether mlock'ed or not).
>
> (You were perfectly correct to use unmap_mapping_range(), rather than
> try_to_unmap() as page migration does: because unmap_mapping_range()
> manages the VM_NONLINEAR case. But our intention, under way, is to
> scrap all VM_NONLINEAR code, and just emulate it with multiple vmas,
> in which case try_to_unmap() should do.)

Dropping VM_NONLINEAR would make a lot of stuff so much easier.

I'm fine with dropping this patch again. The mlock and memcg issues
you raised are valid and should get fixed. And indeed, my testing
never triggered any real elevated page-refs except when I pinned pages
via FUSE. The wait-for-pins approach should therefore be sufficient.

Thanks for the reviews! I will send v4 shortly.
David