Hi all,
I'm a user-space developer working on Wayland compositors.
Back in 2014, David Herrmann has posted a patchset [1] to introduce memfd
and file sealing. The patchset reads:
> 1) Graphics Compositors
> If a graphics client creates a memory-backed render-buffer and passes a
> file-decsriptor to it to the graphics server for display, the server
> _has_ to setup SIGBUS handlers whenever mapping the given file. Otherwise,
> the client might run ftruncate() or O_TRUNC on the on file in parallel,
> thus crashing the server.
> With sealing, a compositor can reject any incoming file-descriptor that
> does _not_ have SEAL_SHRINK set. This way, any memory-mappings are
> guaranteed to stay accessible. Furthermore, we still allow clients to
> increase the buffer-size in case they want to resize the render-buffer for
> the next frame. We also allow parallel writes so the client can render new
> frames into the same buffer (client is responsible of never rendering into
> a front-buffer if you want to avoid artifacts).
>
> Real use-case: Wayland wl_shm buffers can be transparently converted
Fast-forward to 7 years later, and notice that there doesn't exist a
single Wayland compositor that enforces file sealing for its clients.
The reason is that there will always exist clients which are either old
(and predate file sealing) or refuse to use Linux-only APIs (they don't
use memfd and file sealing, instead they use e.g. shm_open). Requiring
sealed memfds in compositors would break these clients.
I don't believe the situation is about to change.
Rather than requiring changes in all compositors *and* clients, can we
maybe only require changes in compositors? For instance, OpenBSD has a
__MAP_NOFAULT flag. When passed to mmap, it means that out-of-bound
accesses will read as zeroes instead of triggering SIGBUS. Such a flag
would be very helpful to unblock the annoying SIGBUS situation.
Would something among these lines be welcome in the Linux kernel?
Thanks,
Simon
[1]: https://lwn.net/Articles/591108/
On Tue, Apr 27, 2021 at 1:25 AM Simon Ser <[email protected]> wrote:
>
> Rather than requiring changes in all compositors *and* clients, can we
> maybe only require changes in compositors? For instance, OpenBSD has a
> __MAP_NOFAULT flag. When passed to mmap, it means that out-of-bound
> accesses will read as zeroes instead of triggering SIGBUS. Such a flag
> would be very helpful to unblock the annoying SIGBUS situation.
>
> Would something among these lines be welcome in the Linux kernel?
Hmm. It doesn't look too hard to do. The biggest problem is actually
that we've run out of flags in the vma (on 32-bit architectures), but
you could try this UNTESTED patch that just does the MAP_NOFAULT thing
unconditionally.
NOTE! Not only is it untested, not only is this a "for your testing
only" (because it does it unconditionally rather than only for
__MAP_NOFAULT), but it might be bogus for other reasons. In
particular, this patch depends on "vmf->address" not being changed by
the ->fault() infrastructure, so that we can just re-use the vmf for
the anonymous case if we get a SIGBUS.
I think that's all ok these days, because Kirill and Peter Xu cleaned
up those paths, but I didn't actually check. So I'm cc'ing Kirill,
Peter and Will, who have been working in this area for other reasons
fairly recently.
Side note: this will only ever work for non-shared mappings. That's
fundamental. We won't add an anonymous page to a shared mapping, and
do_anonymous_page() does verify that. So a MAP_SHARED mappign will
still return SIGBUS even with this patch (although it's not obvious
from the patch - the VM_FAULT_SIGBUS will just be re-created by
do_anonymous_page()).
So if you want a _shared_ mapping to honor __MAP_NOFAULT and insert
random anonymous pages into it, I think the answer is "no, that's not
going to be viable".
So _if_ this works for you, and if it's ok that only MAP_PRIVATE can
have __MAP_NOFAULT, and if Kirill/Peter/Will don't say "Oh, Linus,
you're completely off your rocker and clearly need to be taking your
meds", something like this - if we figure out the conditional bit -
might be doable.
That's a fair number of "ifs".
Ok, back to the merge window for me, I'll be throwing away this crazy
untested patch immediately after hitting "send". This is very much a
"throw the idea over to other people" patch, in other words.
Linus
On Tue, Apr 27, 2021 at 09:51:58AM -0700, Linus Torvalds wrote:
> On Tue, Apr 27, 2021 at 1:25 AM Simon Ser <[email protected]> wrote:
> >
> > Rather than requiring changes in all compositors *and* clients, can we
> > maybe only require changes in compositors? For instance, OpenBSD has a
> > __MAP_NOFAULT flag. When passed to mmap, it means that out-of-bound
> > accesses will read as zeroes instead of triggering SIGBUS. Such a flag
> > would be very helpful to unblock the annoying SIGBUS situation.
> >
> > Would something among these lines be welcome in the Linux kernel?
>
> Hmm. It doesn't look too hard to do. The biggest problem is actually
> that we've run out of flags in the vma (on 32-bit architectures), but
> you could try this UNTESTED patch that just does the MAP_NOFAULT thing
> unconditionally.
>
> NOTE! Not only is it untested, not only is this a "for your testing
> only" (because it does it unconditionally rather than only for
> __MAP_NOFAULT), but it might be bogus for other reasons. In
> particular, this patch depends on "vmf->address" not being changed by
> the ->fault() infrastructure, so that we can just re-use the vmf for
> the anonymous case if we get a SIGBUS.
>
> I think that's all ok these days, because Kirill and Peter Xu cleaned
> up those paths, but I didn't actually check. So I'm cc'ing Kirill,
> Peter and Will, who have been working in this area for other reasons
> fairly recently.
>
> Side note: this will only ever work for non-shared mappings.
I think it's show-stopper for the use-case, no? IIUC, the mappings is used
for communication between a compositor and a client and has to be shared.
> That's fundamental. We won't add an anonymous page to a shared mapping,
> and do_anonymous_page() does verify that. So a MAP_SHARED mappign will
> still return SIGBUS even with this patch (although it's not obvious from
> the patch - the VM_FAULT_SIGBUS will just be re-created by
> do_anonymous_page()).
>
> So if you want a _shared_ mapping to honor __MAP_NOFAULT and insert
> random anonymous pages into it, I think the answer is "no, that's not
> going to be viable".
+ Matthew, Dan.
DAX uses zero pages in page cache to avoid allocating backing storage read
accesses to holes. Maybe we can generalize it beyond DAX to any page cache
and add a (per-inode?) flag to do the same for accesses beyond i_size?
> So _if_ this works for you, and if it's ok that only MAP_PRIVATE can
> have __MAP_NOFAULT, and if Kirill/Peter/Will don't say "Oh, Linus,
> you're completely off your rocker and clearly need to be taking your
> meds", something like this - if we figure out the conditional bit -
> might be doable.
>
> That's a fair number of "ifs".
>
> Ok, back to the merge window for me, I'll be throwing away this crazy
> untested patch immediately after hitting "send". This is very much a
> "throw the idea over to other people" patch, in other words.
>
> Linus
--
Kirill A. Shutemov
On Thu, Apr 29, 2021 at 06:48:07PM +0300, Kirill A. Shutemov wrote:
> > Side note: this will only ever work for non-shared mappings.
>
> I think it's show-stopper for the use-case, no? IIUC, the mappings is used
> for communication between a compositor and a client and has to be shared.
Yes I had the same doubt.. Besides, we probably don't want to convert all
VM_FAULT_SIGBUS to fallback to anonymous pages, as I see that vmf_error()
converts mostly everything besides -ENOMEM to VM_FAULT_SIGBUS. Thanks,
--
Peter Xu
On Tuesday, April 27th, 2021 at 6:51 PM, Linus Torvalds <[email protected]> wrote:
> Hmm. It doesn't look too hard to do. The biggest problem is actually
> that we've run out of flags in the vma (on 32-bit architectures), but
> you could try this UNTESTED patch that just does the MAP_NOFAULT thing
> unconditionally.
Oh, thanks for the patch! Will test.
> Side note: this will only ever work for non-shared mappings. That's
> fundamental. We won't add an anonymous page to a shared mapping, and
> do_anonymous_page() does verify that. So a MAP_SHARED mappign will
> still return SIGBUS even with this patch (although it's not obvious
> from the patch - the VM_FAULT_SIGBUS will just be re-created by
> do_anonymous_page()).
>
> So if you want a _shared_ mapping to honor __MAP_NOFAULT and insert
> random anonymous pages into it, I think the answer is "no, that's not
> going to be viable".
>
> So _if_ this works for you, and if it's ok that only MAP_PRIVATE can
> have __MAP_NOFAULT, and if Kirill/Peter/Will don't say "Oh, Linus,
> you're completely off your rocker and clearly need to be taking your
> meds", something like this - if we figure out the conditional bit -
> might be doable.
Hm, that's unfortunate. For the use-case of a Wayland compositor this
doesn't seem like a complete show-stopper: in 90% of cases the compositor
only needs a read-only mapping. Wayland clients submit buffers they're
rendered pixels to, compositors only need to read them. So the compositor
could map with MAP_PRIVATE and still get up-to-date pages from a client
process I think.
The remaining 10% is when the compositor needs a writable mapping for
things like screen capture. It doesn't seem like a SIGBUS handler can
be avoided in this case then… Oh well.
> That's a fair number of "ifs".
>
> Ok, back to the merge window for me, I'll be throwing away this crazy
> untested patch immediately after hitting "send". This is very much a
> "throw the idea over to other people" patch, in other words.
Got it. I'll take over the patch if this is a good way forward.
On Tue, May 4, 2021 at 2:29 AM Simon Ser <[email protected]> wrote:
>
> The remaining 10% is when the compositor needs a writable mapping for
> things like screen capture. It doesn't seem like a SIGBUS handler can
> be avoided in this case then… Oh well.
So as Peter Xu mentioned, if we made it a "per inode" thing, we
probably could make such an inode do the zero page fill on its own,
and it might be ok for certain cases even for shared mappings.
However, realistically I think it's a horrible idea for the generic
situation, because I think that basically requires the filesystem
itself to buy into it. And we have something like 60+ different
filesystems.
Is there some very specific and targeted pattern for that "shared
mapping" case? For example, if it's always a shared anonymous mapping
with no filesystem backing, then that would possibly be a simpler case
than the "random arbitrary shared file descriptor".
But maybe that simpler (if untested) VM patch is fine if 90% of the
time it's a plain normal non-shared mapping, and you have to have the
SIGBUS case for backwards compatibility anyway, but at least some
_benign_ cases are now handled without pain...
Linus
On Tuesday, May 4th, 2021 at 6:08 PM, Linus Torvalds <[email protected]> wrote:
> On Tue, May 4, 2021 at 2:29 AM Simon Ser [email protected] wrote:
>
> > The remaining 10% is when the compositor needs a writable mapping for
> > things like screen capture. It doesn't seem like a SIGBUS handler can
> > be avoided in this case then… Oh well.
>
> So as Peter Xu mentioned, if we made it a "per inode" thing, we
> probably could make such an inode do the zero page fill on its own,
> and it might be ok for certain cases even for shared mappings.
> However, realistically I think it's a horrible idea for the generic
> situation, because I think that basically requires the filesystem
> itself to buy into it. And we have something like 60+ different
> filesystems.
>
> Is there some very specific and targeted pattern for that "shared
> mapping" case? For example, if it's always a shared anonymous mapping
> with no filesystem backing, then that would possibly be a simpler case
> than the "random arbitrary shared file descriptor".
Yes. I don't know of any Wayland client using buffers with real
filesystem backing. I think the main cases are:
- shm_open(3) immediately followed by shm_unlink(3). On Linux, this is
implemented with /dev/shm which is a tmpfs.
- Abusing /tmp or /run's tmpfs by creating a file there and unlinking
it immediately afterwards. Kind of similar to the first case.
- memfd_create(2) on Linux.
Is this enough to make it work on shared memory mappings? Is it
important that the mapping is anonymous?
Thanks,
Simon
On Wed, May 5, 2021 at 3:21 AM Simon Ser <[email protected]> wrote:
> >
> > Is there some very specific and targeted pattern for that "shared
> > mapping" case? For example, if it's always a shared anonymous mapping
> > with no filesystem backing, then that would possibly be a simpler case
> > than the "random arbitrary shared file descriptor".
>
> Yes. I don't know of any Wayland client using buffers with real
> filesystem backing. I think the main cases are:
>
> - shm_open(3) immediately followed by shm_unlink(3). On Linux, this is
> implemented with /dev/shm which is a tmpfs.
> - Abusing /tmp or /run's tmpfs by creating a file there and unlinking
> it immediately afterwards. Kind of similar to the first case.
> - memfd_create(2) on Linux.
>
> Is this enough to make it work on shared memory mappings? Is it
> important that the mapping is anonymous?
All of those should be anonymous in the sense that the backing store
is all the kernel's notion of anonymous pages, and there is no actual
file backing. The mappings may then be shared, of course.
So that does make Peter's idea to have some inode flag for "don't
SIGBUS on fault" be more reasonable, because there isn't some random
actual filesystem involved, only the core VM layer.
I'm not going to write the patch, though, but maybe you can convince
somebody else to try it..
Linus
On 5/5/2021 11:42 AM, Linus Torvalds wrote:
> On Wed, May 5, 2021 at 3:21 AM Simon Ser <[email protected]> wrote:
>>>
>>> Is there some very specific and targeted pattern for that "shared
>>> mapping" case? For example, if it's always a shared anonymous mapping
>>> with no filesystem backing, then that would possibly be a simpler case
>>> than the "random arbitrary shared file descriptor".
>>
>> Yes. I don't know of any Wayland client using buffers with real
>> filesystem backing. I think the main cases are:
>>
>> - shm_open(3) immediately followed by shm_unlink(3). On Linux, this is
>> implemented with /dev/shm which is a tmpfs.
>> - Abusing /tmp or /run's tmpfs by creating a file there and unlinking
>> it immediately afterwards. Kind of similar to the first case.
>> - memfd_create(2) on Linux.
>>
>> Is this enough to make it work on shared memory mappings? Is it
>> important that the mapping is anonymous?
>
> All of those should be anonymous in the sense that the backing store
> is all the kernel's notion of anonymous pages, and there is no actual
> file backing. The mappings may then be shared, of course.
>
> So that does make Peter's idea to have some inode flag for "don't
> SIGBUS on fault" be more reasonable, because there isn't some random
> actual filesystem involved, only the core VM layer.
>
> I'm not going to write the patch, though, but maybe you can convince
> somebody else to try it..
Does something like following draft patch on the right track?
1. Application set S_NOFAULT flag on shm mmap fd
#define S_NOFAULT (1 << 17)
fd = shm_open(shmpath, O_RDONLY, S_IRUSR | S_IWUSR);
ioctl(fd, FS_IOC_GETFLAGS, &flags);
flags |= S_NOFAULT;
ioctl(fd, FS_IOC_SETFLAGS, &flags)
2. Don't SIGBUS on read beyond i_size if S_NOFAULT flag set in inode.
Use zero page instead.
---
[RFC DRAFT PATCH] shm: no SIGBUS fault on out-of-band mmap read
---
include/linux/fs.h | 2 ++
mm/shmem.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..a9be7cd71b94 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2202,6 +2202,7 @@ struct super_operations {
#define S_ENCRYPTED (1 << 14) /* Encrypted file (using fs/crypto/) */
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
+#define S_NOFAULT (1 << 17) /* No SIGBUS fault on out-of-band mmap read */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2244,6 +2245,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_NOFAULT(inode) ((inode)->i_flags & S_NOFAULT)
#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
diff --git a/mm/shmem.c b/mm/shmem.c
index 5d46611cba8d..856d2d8d4cdf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -38,8 +38,11 @@
#include <linux/hugetlb.h>
#include <linux/frontswap.h>
#include <linux/fs_parser.h>
+#include <linux/fs.h>
+#include <linux/fileattr.h>
#include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
+#include <asm/pgalloc.h>
static struct vfsmount *shm_mnt;
@@ -1812,7 +1815,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
repeat:
if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
- return -EINVAL;
+ unsigned long dst_addr = vmf->address;
+ pte_t _dst_pte, *dst_pte;
+ spinlock_t *ptl;
+ int ret;
+
+ if (!IS_NOFAULT(inode))
+ return -EINVAL;
+
+ _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+ vma->vm_page_prot));
+ dst_pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, dst_addr, &ptl);
+ ret = -EEXIST;
+ if (!pte_none(*dst_pte))
+ goto out_unlock;
+ set_pte_at(vma->vm_mm, dst_addr, dst_pte, _dst_pte);
+ update_mmu_cache(vma, dst_addr, dst_pte);
+ *fault_type = VM_FAULT_NOPAGE;
+ ret = 0;
+out_unlock:
+ pte_unmap_unlock(dst_pte, ptl);
+ return ret;
}
sbinfo = SHMEM_SB(inode->i_sb);
@@ -3819,6 +3842,23 @@ const struct address_space_operations shmem_aops = {
};
EXPORT_SYMBOL(shmem_aops);
+static int shmem_fileattr_get(struct dentry *dentry, struct fileattr *fa)
+{
+ struct inode *inode = d_inode(dentry);
+
+ fileattr_fill_flags(fa, inode->i_flags);
+
+ return 0;
+}
+
+static int shmem_fileattr_set(struct user_namespace *mnt_userns,
+ struct dentry *dentry, struct fileattr *fa)
+{
+ struct inode *inode = d_inode(dentry);
+ inode->i_flags = fa->flags;
+ return 0;
+}
+
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
.get_unmapped_area = shmem_get_unmapped_area,
@@ -3836,6 +3876,8 @@ static const struct file_operations shmem_file_operations = {
static const struct inode_operations shmem_inode_operations = {
.getattr = shmem_getattr,
.setattr = shmem_setattr,
+ .fileattr_get = shmem_fileattr_get,
+ .fileattr_set = shmem_fileattr_set,
#ifdef CONFIG_TMPFS_XATTR
.listxattr = shmem_listxattr,
.set_acl = simple_set_acl,
On Fri, May 28, 2021 at 7:07 AM Lin, Ming <[email protected]> wrote:
>
> Does something like following draft patch on the right track?
No, I don't think this can work:
> + _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
> + vma->vm_page_prot));
You can't just blindly insert the zero pfn - for a shared write
mapping, that would actually allow writes to the zeropage. That would
be horrible.
So it would have to do all the same things that it does for a page
that is inside the inode size.
I do also dislike how it's a per-inode flag - so it would affect other
mappings of the same shared memory segment too. But considering that
the page would have to be part of the page cache for that shmem inode,
that may be inevitable. But it sure does smell a bit.
Oh, and if we make this kind of magic shmem extension, Hugh Dickins
should be part of the conversation too. Hugh, you probably saw the
original on linux-mm, but I'm adding you explicitly to the
participants here.
.. and if you didn't see the background, here it is
https://lore.kernel.org/linux-mm/vs1Us2sm4qmfvLOqNat0-r16GyfmWzqUzQ4KHbXJwEcjhzeoQ4sBTxx7QXDG9B6zk5AeT7FsNb3CSr94LaKy6Novh1fbbw8D_BBxYsbPLms=@emersion.fr/
for your edification..
Linus
On 5/28/2021 6:03 PM, Linus Torvalds wrote:
> On Fri, May 28, 2021 at 7:07 AM Lin, Ming <[email protected]> wrote:
>>
>> Does something like following draft patch on the right track?
>
> No, I don't think this can work:
>
>> + _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
>> + vma->vm_page_prot));
>
> You can't just blindly insert the zero pfn - for a shared write
> mapping, that would actually allow writes to the zeropage. That would
> be horrible.
I should check the vma is not writable.
diff --git a/mm/shmem.c b/mm/shmem.c
index 856d2d8d4cdf..fa23e38bc692 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1820,7 +1820,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
spinlock_t *ptl;
int ret;
- if (!IS_NOFAULT(inode))
+ if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
return -EINVAL;
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr)
On Fri, May 28, 2021 at 9:31 PM Lin, Ming <[email protected]> wrote:
>
> I should check the vma is not writable.
>
> - if (!IS_NOFAULT(inode))
> + if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
> return -EINVAL;
That might be enough, yes.
But if this is sufficient for the compositor needs, and the rule is
that this only works for read-only mappings, then I think the flag in
the inode becomes the wrong thing to do.
Because if it's a read-only mapping, and we only ever care about
inserting zero pages into the page tables - and they never become part
of the shared memory region itself, then it really is purely about
that mmap, not about the shm inode.
So then it really does become purely about one particular mmap, and it
really should be a "madvise()" issue, not a "mark inode as no-fault".
I'd almost be inclined to just add a new "flags" field to the vma.
We've been running out of vma flags for a long time, to the point that
some of them are only available on 64-bit architectures.
I get the feeling that we should just bite the bullet and make
"vm_flags" be an u64. Or possibly make it two explicitly 32-bit
entities (vm_flags and vm_extra). Get rid of the silly 64-bit-only
"high" flags, and get rid of our artificial "we don't have enough
bits".
Because we already in practice *do* have enough bits, we've just
artificially limited ourselves to "on 32-bit architectures we only
have 32 bits in that field".
But all of this is very much dependent on that "this NOFAULT case
really only works for reads, not for writes".
(Alternatively, we could allow the *mapping* itself to be writable,
but always fault on writes, and only insert a read-only zero page)
Linus
On Sat, 29 May 2021, Linus Torvalds wrote:
> On Fri, May 28, 2021 at 9:31 PM Lin, Ming <[email protected]> wrote:
> >
> > I should check the vma is not writable.
> >
> > - if (!IS_NOFAULT(inode))
> > + if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
> > return -EINVAL;
>
> That might be enough, yes.
>
> But if this is sufficient for the compositor needs, and the rule is
> that this only works for read-only mappings, then I think the flag in
> the inode becomes the wrong thing to do.
>
> Because if it's a read-only mapping, and we only ever care about
> inserting zero pages into the page tables - and they never become part
> of the shared memory region itself, then it really is purely about
> that mmap, not about the shm inode.
>
> So then it really does become purely about one particular mmap, and it
> really should be a "madvise()" issue, not a "mark inode as no-fault".
Yes, madvise or mmap flag: the recipient of this fd ought not to be
(even capable of) interfering with other maps of the shared object.
And IIUC it would have to be the recipient (Wayland compositor) doing
the NOFAULT business, because (going back to the original mail) we are
only considering this so that Wayland might satisfy clients who predate
or refuse Linux-only APIs. So, an ioctl (or fcntl, as sealing chose)
at the client end cannot be expected; and could not be relied on anyway.
>
> I'd almost be inclined to just add a new "flags" field to the vma.
> We've been running out of vma flags for a long time, to the point that
> some of them are only available on 64-bit architectures.
>
> I get the feeling that we should just bite the bullet and make
> "vm_flags" be an u64. Or possibly make it two explicitly 32-bit
> entities (vm_flags and vm_extra). Get rid of the silly 64-bit-only
> "high" flags, and get rid of our artificial "we don't have enough
> bits".
u64 saves messing around in the vma_merge() area, which has to
consider whether adjacent vm_flags are identical.
>
> Because we already in practice *do* have enough bits, we've just
> artificially limited ourselves to "on 32-bit architectures we only
> have 32 bits in that field".
Yes, that artificial limitation to 32-bit has been silly all along.
>
> But all of this is very much dependent on that "this NOFAULT case
> really only works for reads, not for writes".
>
> (Alternatively, we could allow the *mapping* itself to be writable,
> but always fault on writes, and only insert a read-only zero page)
NOFAULT? Does BSD use "fault" differently, and in Linux terms we
would say NOSIGBUS to mean the same?
Can someone point to a specification of BSD's __MAP_NOFAULT?
Searching just found me references to bugs.
What mainly worries me about the suggestion is: what happens to the
zero page inserted into NOFAULT mappings, when later a page for that
offset is created and added to page cache?
Treating it as an opaque blob of zeroes, that stays there ever after,
hiding the subsequent data: easy to implement, but a hack that we would
probably regret. (And I notice that even the quote from David Herrmann
in the original post allows for the possibility that client may want to
expand the object.)
I believe the correct behaviour would be to unmap the nofault page
then, allowing the proper page to be faulted in after. That is
certainly doable (the old mm/filemap_xip.c used to do so), but might
get into some awkward race territory, with filesystem dependence
(reminiscent of hole punch, in reverse). shmem could operate that
way, and be the better for it: but I wouldn't want to add that,
without also cleaning away all the shmem_recalc_inode() stuff.
Hugh
On 5/29/2021 1:15 PM, Hugh Dickins wrote:
>
> NOFAULT? Does BSD use "fault" differently, and in Linux terms we
> would say NOSIGBUS to mean the same?
>
> Can someone point to a specification of BSD's __MAP_NOFAULT?
> Searching just found me references to bugs.
Checked freebsd and openbsd, their MAP_NOFAULT seems quite different
than NOSIGBUS.
freebsd: https://github.com/freebsd/freebsd-src
MAP_NOFAULT: The mapping should not generate page faults
openbsd: https://github.com/openbsd/src
__MAP_NOFAULT only makes sense with a backing object
>
> What mainly worries me about the suggestion is: what happens to the
> zero page inserted into NOFAULT mappings, when later a page for that
> offset is created and added to page cache?
>
> Treating it as an opaque blob of zeroes, that stays there ever after,
> hiding the subsequent data: easy to implement, but a hack that we would
> probably regret. (And I notice that even the quote from David Herrmann
> in the original post allows for the possibility that client may want to
> expand the object.)
Yes, that's problem ...
>
> I believe the correct behaviour would be to unmap the nofault page
> then, allowing the proper page to be faulted in after. That is
> certainly doable (the old mm/filemap_xip.c used to do so), but might
> get into some awkward race territory, with filesystem dependence
> (reminiscent of hole punch, in reverse). shmem could operate that
> way, and be the better for it: but I wouldn't want to add that,
> without also cleaning away all the shmem_recalc_inode() stuff.
After we treat it as zero page, then no page fault for later read.
What is the timing to unmap the nofault page?
I'm reading filemap_xip.c to learn how to do it.
https://elixir.bootlin.com/linux/v3.19.8/source/mm/filemap_xip.c
On 5/29/2021 4:36 PM, Ming Lin wrote:
> On 5/29/2021 1:15 PM, Hugh Dickins wrote:
>
>>
>> I believe the correct behaviour would be to unmap the nofault page
>> then, allowing the proper page to be faulted in after. That is
>> certainly doable (the old mm/filemap_xip.c used to do so), but might
>> get into some awkward race territory, with filesystem dependence
>> (reminiscent of hole punch, in reverse). shmem could operate that
>> way, and be the better for it: but I wouldn't want to add that,
>> without also cleaning away all the shmem_recalc_inode() stuff.
OK, I borrowed code from filemap_xip.c and implemented this behavior.
Simon,
Before I send out the patches for review, would you mind have a quick test?
https://github.com/minggr/linux, branch shmem_no_sigbus
In Wayland compositors, you only need to pass in MAP_NOSIGBUS in mmap().
For example,
//fd should be received from Wayland compositors client
#define MAP_NOSIGBUS 0x200000
addr = mmap(NULL, size, PROT_READ, MAP_SHARED|MAP_NOSIGBUS, fd, offset)
Thanks,
Ming
On Mon, May 31, 2021 at 11:13 AM Ming Lin <[email protected]> wrote:
>
> OK, I borrowed code from filemap_xip.c and implemented this behavior.
I think that "unmap page" is too complicated and fragile.
The only page that could possibly validly be unmapped is a stale zero
page, but that code in shmem_unmap_nofault_page() seems to try to
handle other cases too (ie that whole page_remove_rmap() - afaik a
zero page has no rmap).
I get the feeling that the simpler thing to do is to just say "if you
use MAP_NOSIGBUS, and you access pages that don't have a backing
store, you will get zero pages, and they will NOT BE SYNCHRONIZED with
the backing store possibly later being updated".
IOW, just document the fact that a MAP_NOSIGBUS mapping isn't coherent
wrt shmem contents that are expanded and filled in later.
Don't try to "fix" it - because any user that uses MAP_NOSIGBUS had
better just accept that it's not compatible with expanding the shmem
backing store later.
Keep it simple and stupid. Hmm?
Linus
On 5/31/2021 11:24 PM, Linus Torvalds wrote:
> On Mon, May 31, 2021 at 11:13 AM Ming Lin <[email protected]> wrote:
>>
>> OK, I borrowed code from filemap_xip.c and implemented this behavior.
>
> I think that "unmap page" is too complicated and fragile.
>
> The only page that could possibly validly be unmapped is a stale zero
> page, but that code in shmem_unmap_nofault_page() seems to try to
> handle other cases too (ie that whole page_remove_rmap() - afaik a
> zero page has no rmap).
>
> I get the feeling that the simpler thing to do is to just say "if you
> use MAP_NOSIGBUS, and you access pages that don't have a backing
> store, you will get zero pages, and they will NOT BE SYNCHRONIZED with
> the backing store possibly later being updated".
>
> IOW, just document the fact that a MAP_NOSIGBUS mapping isn't coherent
> wrt shmem contents that are expanded and filled in later.
>
> Don't try to "fix" it - because any user that uses MAP_NOSIGBUS had
> better just accept that it's not compatible with expanding the shmem
> backing store later.
>
> Keep it simple and stupid. Hmm?
Simon,
Is this "simple" solution good enough for Wayland compositor use case?
On Tuesday, June 1st, 2021 at 9:08 AM, Ming Lin <[email protected]> wrote:
> On 5/31/2021 11:24 PM, Linus Torvalds wrote:
> > On Mon, May 31, 2021 at 11:13 AM Ming Lin <[email protected]> wrote:
> >>
> >> OK, I borrowed code from filemap_xip.c and implemented this behavior.
> >
> > I think that "unmap page" is too complicated and fragile.
> >
> > The only page that could possibly validly be unmapped is a stale zero
> > page, but that code in shmem_unmap_nofault_page() seems to try to
> > handle other cases too (ie that whole page_remove_rmap() - afaik a
> > zero page has no rmap).
> >
> > I get the feeling that the simpler thing to do is to just say "if you
> > use MAP_NOSIGBUS, and you access pages that don't have a backing
> > store, you will get zero pages, and they will NOT BE SYNCHRONIZED with
> > the backing store possibly later being updated".
> >
> > IOW, just document the fact that a MAP_NOSIGBUS mapping isn't coherent
> > wrt shmem contents that are expanded and filled in later.
> >
> > Don't try to "fix" it - because any user that uses MAP_NOSIGBUS had
> > better just accept that it's not compatible with expanding the shmem
> > backing store later.
> >
> > Keep it simple and stupid. Hmm?
>
> Simon,
>
> Is this "simple" solution good enough for Wayland compositor use case?
I've tried your patch "mm: adds MAP_NOSIGBUS extension for shmem read" with a
libwayland hack [1] and a Wayland client that shrinks a shm file after the
compositor has mmap'ed it [2]. It seems to work nicely, thanks!
Regarding the requirements for Wayland:
- The baseline requirement is being able to avoid SIGBUS for read-only mappings
of shm files.
- Wayland clients can expand their shm files. However the compositor doesn't
need to immediately access the new expanded region. The client will tell the
compositor what the new shm file size is, and the compositor will re-map it.
- Ideally, MAP_NOSIGBUS would work on PROT_WRITE + MAP_SHARED mappings (of
course, the no-SIGBUS behavior would be restricted to that mapping). The
use-case is writing back to client buffers e.g. for screen capture. From the
earlier discussions it seems like this would be complicated to implement.
This means we'll need to come up with a new libwayland API to allow
compositors to opt-in to the read-only mappings. This is sub-optimal but
seems doable.
- Ideally, MAP_SIGBUS wouldn't be restricted to shm. There are use-cases for
using it on ordinary files too, e.g. for sharing ICC profiles. But from the
earlier replies it seems very unlikely that this will become possible, and
making it work only on shm files would already be fantastic.
Thanks again for working on this! Let me know if the above is unclear
or if some info is missing.
Simon
[1]: https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/145
[2]: https://github.com/emersion/wleird/blob/master/sigbus.c
On Saturday, May 29th, 2021 at 10:15 PM, Hugh Dickins <[email protected]> wrote:
> And IIUC it would have to be the recipient (Wayland compositor) doing
> the NOFAULT business, because (going back to the original mail) we are
> only considering this so that Wayland might satisfy clients who predate
> or refuse Linux-only APIs. So, an ioctl (or fcntl, as sealing chose)
> at the client end cannot be expected; and could not be relied on anyway.
Yes, that is correct.
> NOFAULT? Does BSD use "fault" differently, and in Linux terms we
> would say NOSIGBUS to mean the same?
>
> Can someone point to a specification of BSD's __MAP_NOFAULT?
> Searching just found me references to bugs.
__MAP_NOFAULT isn't documented, sadly. The commit that introduces the
flag [1] is the best we're going to get, I think.
> What mainly worries me about the suggestion is: what happens to the
> zero page inserted into NOFAULT mappings, when later a page for that
> offset is created and added to page cache?
Not 100% sure exactly this means what I think it means, but from my PoV,
it's fine if the contents of an expanded shm file aren't visible from the
process that has mapped it with MAP_NOFAULT/MAP_NOSIGBUS. In other words,
it's fine if:
- The client sets up a 1KiB shm file and sends it to the compositor.
- The compositor maps it with MAP_NOFAULT/MAP_NOSIGBUS.
- The client expands the file to 2KiB and writes interesting data in it.
- The compositor still sees zeros past the 1KiB mark. The compositor needs
to unmap and re-map the file to see the data past the 1KiB mark.
If the MAP_NOFAULT/MAP_NOSIGBUS flag only affects the mapping itself and
nothing else, this should be fine?
> Treating it as an opaque blob of zeroes, that stays there ever after,
> hiding the subsequent data: easy to implement, but a hack that we would
> probably regret. (And I notice that even the quote from David Herrmann
> in the original post allows for the possibility that client may want to
> expand the object.)
>
> I believe the correct behaviour would be to unmap the nofault page
> then, allowing the proper page to be faulted in after. That is
> certainly doable (the old mm/filemap_xip.c used to do so), but might
> get into some awkward race territory, with filesystem dependence
> (reminiscent of hole punch, in reverse). shmem could operate that
> way, and be the better for it: but I wouldn't want to add that,
> without also cleaning away all the shmem_recalc_inode() stuff.
[1]: https://github.com/openbsd/src/commit/37f480c7e4870332b7ffb802fa6578f547c8a19f
On Thu, Jun 03, 2021 at 01:14:47PM +0000, Simon Ser wrote:
> On Saturday, May 29th, 2021 at 10:15 PM, Hugh Dickins <[email protected]> wrote:
>
> > And IIUC it would have to be the recipient (Wayland compositor) doing
> > the NOFAULT business, because (going back to the original mail) we are
> > only considering this so that Wayland might satisfy clients who predate
> > or refuse Linux-only APIs. So, an ioctl (or fcntl, as sealing chose)
> > at the client end cannot be expected; and could not be relied on anyway.
>
> Yes, that is correct.
>
> > NOFAULT? Does BSD use "fault" differently, and in Linux terms we
> > would say NOSIGBUS to mean the same?
> >
> > Can someone point to a specification of BSD's __MAP_NOFAULT?
> > Searching just found me references to bugs.
>
> __MAP_NOFAULT isn't documented, sadly. The commit that introduces the
> flag [1] is the best we're going to get, I think.
>
> > What mainly worries me about the suggestion is: what happens to the
> > zero page inserted into NOFAULT mappings, when later a page for that
> > offset is created and added to page cache?
>
> Not 100% sure exactly this means what I think it means, but from my PoV,
> it's fine if the contents of an expanded shm file aren't visible from the
> process that has mapped it with MAP_NOFAULT/MAP_NOSIGBUS. In other words,
> it's fine if:
>
> - The client sets up a 1KiB shm file and sends it to the compositor.
> - The compositor maps it with MAP_NOFAULT/MAP_NOSIGBUS.
> - The client expands the file to 2KiB and writes interesting data in it.
> - The compositor still sees zeros past the 1KiB mark. The compositor needs
> to unmap and re-map the file to see the data past the 1KiB mark.
>
> If the MAP_NOFAULT/MAP_NOSIGBUS flag only affects the mapping itself and
> nothing else, this should be fine?
This is going to operate at a page boundary, so the example you gave
will work. How about this:
- The client sets up a 1KiB shm file and sends it to the compositor.
- The client expands the file to 5KiB
- The compositor sees the new data up to 4KiB but zeroes past the 4KiB
mark.
Does that still make userspace happy?
On Thursday, June 3rd, 2021 at 3:57 PM, Matthew Wilcox <[email protected]> wrote:
> How about this:
>
> - The client sets up a 1KiB shm file and sends it to the compositor.
> - The client expands the file to 5KiB
> - The compositor sees the new data up to 4KiB but zeroes past the 4KiB
> mark.
>
> Does that still make userspace happy?
As long as the new data in the expanded region is visible after a remapping the
file with the new size, it should be fine. It doesn't matter that it's not
visible without a munmap+mmap cycle.
On 6/3/2021 6:01 AM, Simon Ser wrote:
>
> Regarding the requirements for Wayland:
>
> - The baseline requirement is being able to avoid SIGBUS for read-only mappings
> of shm files.
> - Wayland clients can expand their shm files. However the compositor doesn't
> need to immediately access the new expanded region. The client will tell the
> compositor what the new shm file size is, and the compositor will re-map it.
> - Ideally, MAP_NOSIGBUS would work on PROT_WRITE + MAP_SHARED mappings (of
> course, the no-SIGBUS behavior would be restricted to that mapping). The
> use-case is writing back to client buffers e.g. for screen capture. From the
> earlier discussions it seems like this would be complicated to implement.
> This means we'll need to come up with a new libwayland API to allow
> compositors to opt-in to the read-only mappings. This is sub-optimal but
> seems doable.
> - Ideally, MAP_SIGBUS wouldn't be restricted to shm. There are use-cases for
> using it on ordinary files too, e.g. for sharing ICC profiles. But from the
> earlier replies it seems very unlikely that this will become possible, and
> making it work only on shm files would already be fantastic.
In the new version of the patches, MAP_NOSIGBUS is not restricted to shmem.
It can be used on ordinary files.
On Thursday, June 3rd, 2021 at 10:07 PM, Ming Lin <[email protected]> wrote:
> In the new version of the patches, MAP_NOSIGBUS is not restricted to shmem.
> It can be used on ordinary files.
Oh, cool!
FWIW, if I had to choose between "MAP_NOSIGBUS not restricted to shm"
and "MAP_NOSIGBUS not restricted to MAP_PRIVATE", I'd probably choose
the latter. But of course the former is better than nothing at all.