2021-06-04 07:44:49

by Ming Lin

Subject: [PATCH v2 0/2] mm: support NOSIGBUS on fault of mmap

These 2 patches are based on the discussion of "Sealed memfd & no-fault mmap"
at https://bit.ly/3pdwOGR
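
For illustration, a minimal userspace sketch of the intended use
(this assumes a kernel with this series applied; MAP_NOSIGBUS is
defined by hand below since it is not yet in released uapi headers):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MAP_NOSIGBUS
	#define MAP_NOSIGBUS 0x200000	/* value proposed in patch 2/2 */
	#endif

	int main(void)
	{
		int fd = memfd_create("buf", 0);

		ftruncate(fd, 4096);	/* file is one page long */

		/* Map two pages; the second page lies beyond EOF. */
		char *p = mmap(NULL, 2 * 4096, PROT_READ,
			       MAP_PRIVATE | MAP_NOSIGBUS, fd, 0);

		/*
		 * Without MAP_NOSIGBUS this access would raise SIGBUS;
		 * with the flag, the fault is satisfied with the zero page.
		 */
		return p == MAP_FAILED ? 1 : p[4096];
	}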

v2:
- make MAP_NOSIGBUS generic instead of being restricted to shmem
- use do_anonymous_page() to insert zero page
- fix build warnings/errors reported by LKP test robot

v1:
https://lkml.org/lkml/2021/6/1/1076

Ming Lin (2):
mm: make "vm_flags" a u64
mm: add NOSIGBUS extension to mmap()

arch/arm64/Kconfig | 1 -
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/Kconfig | 1 -
arch/x86/Kconfig | 1 -
drivers/android/binder.c | 6 +-
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 2 +-
drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 4 +-
fs/exec.c | 2 +-
fs/userfaultfd.c | 6 +-
include/linux/huge_mm.h | 4 +-
include/linux/ksm.h | 4 +-
include/linux/mm.h | 108 +++++++++++++--------------
include/linux/mm_types.h | 6 +-
include/linux/mman.h | 5 +-
include/uapi/asm-generic/mman-common.h | 1 +
mm/Kconfig | 2 -
mm/debug.c | 4 +-
mm/khugepaged.c | 2 +-
mm/ksm.c | 2 +-
mm/madvise.c | 2 +-
mm/memory.c | 15 +++-
mm/mmap.c | 14 ++--
mm/mprotect.c | 4 +-
mm/mremap.c | 2 +-
tools/include/uapi/asm-generic/mman-common.h | 1 +
28 files changed, 108 insertions(+), 98 deletions(-)

--
1.8.3.1


2021-06-04 07:45:59

by Ming Lin

Subject: [PATCH v2 1/2] mm: make "vm_flags" a u64

Make "vm_flags" a u64 so that we have enough flag bits on 32-bit
architectures too.

Use vm_flags_t instead of "unsigned long".
Also fix the resulting printk format warnings in code that prints
vm_flags.
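
As a quick illustration of the width problem (explanatory snippet
only, not part of the patch):

	#define VM_FLAGS_BIT(N)	(1ULL << (N))	/* as introduced below */

	/* On a 32-bit architecture: */
	unsigned long ul_flags = VM_FLAGS_BIT(38);	/* truncated to 0 */
	vm_flags_t    vm_flags = VM_FLAGS_BIT(38);	/* bit 38 is kept */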

Signed-off-by: Ming Lin <[email protected]>
---
arch/arm64/Kconfig | 1 -
arch/powerpc/Kconfig | 1 -
arch/x86/Kconfig | 1 -
drivers/android/binder.c | 6 +-
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 2 +-
drivers/infiniband/hw/hfi1/file_ops.c | 2 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 4 +-
fs/exec.c | 2 +-
fs/userfaultfd.c | 6 +-
include/linux/huge_mm.h | 4 +-
include/linux/ksm.h | 4 +-
include/linux/mm.h | 106 ++++++++++++++----------------
include/linux/mm_types.h | 6 +-
include/linux/mman.h | 4 +-
mm/Kconfig | 2 -
mm/debug.c | 4 +-
mm/khugepaged.c | 2 +-
mm/ksm.c | 2 +-
mm/madvise.c | 2 +-
mm/memory.c | 4 +-
mm/mmap.c | 10 +--
mm/mprotect.c | 4 +-
mm/mremap.c | 2 +-
25 files changed, 87 insertions(+), 98 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d856..c6960ea 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1658,7 +1658,6 @@ config ARM64_MTE
depends on AS_HAS_LSE_ATOMICS
# Required for tag checking in the uaccess routines
depends on ARM64_PAN
- select ARCH_USES_HIGH_VMA_FLAGS
help
Memory Tagging (part of the ARMv8.5 Extensions) provides
architectural support for run-time, always-on detection of
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2a..5c1b49e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -940,7 +940,6 @@ config PPC_MEM_KEYS
prompt "PowerPC Memory Protection Keys"
def_bool y
depends on PPC_BOOK3S_64
- select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
help
Memory Protection Keys provides a mechanism for enforcing
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0045e1b..a885336 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1874,7 +1874,6 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
def_bool y
# Note: only available in 64-bit mode
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
- select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
help
Memory Protection Keys provides a mechanism for enforcing
diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index bcec598..2a56b8b 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -4947,7 +4947,7 @@ static void binder_vma_open(struct vm_area_struct *vma)
struct binder_proc *proc = vma->vm_private_data;

binder_debug(BINDER_DEBUG_OPEN_CLOSE,
- "%d open vm area %lx-%lx (%ld K) vma %lx pagep %lx\n",
+ "%d open vm area %lx-%lx (%ld K) vma %llx pagep %lx\n",
proc->pid, vma->vm_start, vma->vm_end,
(vma->vm_end - vma->vm_start) / SZ_1K, vma->vm_flags,
(unsigned long)pgprot_val(vma->vm_page_prot));
@@ -4958,7 +4958,7 @@ static void binder_vma_close(struct vm_area_struct *vma)
struct binder_proc *proc = vma->vm_private_data;

binder_debug(BINDER_DEBUG_OPEN_CLOSE,
- "%d close vm area %lx-%lx (%ld K) vma %lx pagep %lx\n",
+ "%d close vm area %lx-%lx (%ld K) vma %llx pagep %lx\n",
proc->pid, vma->vm_start, vma->vm_end,
(vma->vm_end - vma->vm_start) / SZ_1K, vma->vm_flags,
(unsigned long)pgprot_val(vma->vm_page_prot));
@@ -4984,7 +4984,7 @@ static int binder_mmap(struct file *filp, struct vm_area_struct *vma)
return -EINVAL;

binder_debug(BINDER_DEBUG_OPEN_CLOSE,
- "%s: %d %lx-%lx (%ld K) vma %lx pagep %lx\n",
+ "%s: %d %lx-%lx (%ld K) vma %llx pagep %lx\n",
__func__, proc->pid, vma->vm_start, vma->vm_end,
(vma->vm_end - vma->vm_start) / SZ_1K, vma->vm_flags,
(unsigned long)pgprot_val(vma->vm_page_prot));
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 43de260..3a1726b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1957,7 +1957,7 @@ static int kfd_mmio_mmap(struct kfd_dev *dev, struct kfd_process *process,
pr_debug("pasid 0x%x mapping mmio page\n"
" target user address == 0x%08llX\n"
" physical address == 0x%08llX\n"
- " vm_flags == 0x%04lX\n"
+ " vm_flags == 0x%08llX\n"
" size == 0x%04lX\n",
process->pasid, (unsigned long long) vma->vm_start,
address, vma->vm_flags, PAGE_SIZE);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index 768d153..002462b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -150,7 +150,7 @@ int kfd_doorbell_mmap(struct kfd_dev *dev, struct kfd_process *process,
pr_debug("Mapping doorbell page\n"
" target user address == 0x%08llX\n"
" physical address == 0x%08llX\n"
- " vm_flags == 0x%04lX\n"
+ " vm_flags == 0x%08llX\n"
" size == 0x%04lX\n",
(unsigned long long) vma->vm_start, address, vma->vm_flags,
kfd_doorbell_process_slice(dev));
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index ba2c2ce..e25ff04 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -808,7 +808,7 @@ int kfd_event_mmap(struct kfd_process *p, struct vm_area_struct *vma)
pr_debug(" start user address == 0x%08lx\n", vma->vm_start);
pr_debug(" end user address == 0x%08lx\n", vma->vm_end);
pr_debug(" pfn == 0x%016lX\n", pfn);
- pr_debug(" vm_flags == 0x%08lX\n", vma->vm_flags);
+ pr_debug(" vm_flags == 0x%08llX\n", vma->vm_flags);
pr_debug(" size == 0x%08lX\n",
vma->vm_end - vma->vm_start);

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 3b7bbc7..a40410f 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -569,7 +569,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)

vma->vm_flags = flags;
hfi1_cdbg(PROC,
- "%u:%u type:%u io/vf:%d/%d, addr:0x%llx, len:%lu(%lu), flags:0x%lx\n",
+ "%u:%u type:%u io/vf:%d/%d, addr:0x%llx, len:%lu(%lu), flags:0x%llx\n",
ctxt, subctxt, type, mapio, vmf, memaddr, memlen,
vma->vm_end - vma->vm_start, vma->vm_flags);
if (vmf) {
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index c60e79d..9bd34e6 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -846,7 +846,7 @@ static int mmap_rcvegrbufs(struct vm_area_struct *vma,

if (vma->vm_flags & VM_WRITE) {
qib_devinfo(dd->pcidev,
- "Can't map eager buffers as writable (flags=%lx)\n",
+ "Can't map eager buffers as writable (flags=%llx)\n",
vma->vm_flags);
ret = -EPERM;
goto bail;
@@ -935,7 +935,7 @@ static int mmap_kvaddr(struct vm_area_struct *vma, u64 pgaddr,
/* rcvegrbufs are read-only on the slave */
if (vma->vm_flags & VM_WRITE) {
qib_devinfo(dd->pcidev,
- "Can't map eager buffers as writable (flags=%lx)\n",
+ "Can't map eager buffers as writable (flags=%llx)\n",
vma->vm_flags);
ret = -EPERM;
goto bail;
diff --git a/fs/exec.c b/fs/exec.c
index 18594f1..8dcf8a5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -748,7 +748,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = bprm->vma;
struct vm_area_struct *prev = NULL;
- unsigned long vm_flags;
+ vm_flags_t vm_flags;
unsigned long stack_base;
unsigned long stack_size;
unsigned long stack_expand;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 14f9228..b958055 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -846,7 +846,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
struct vm_area_struct *vma, *prev;
/* len == 0 means wake all */
struct userfaultfd_wake_range range = { .len = 0, };
- unsigned long new_flags;
+ vm_flags_t new_flags;

WRITE_ONCE(ctx->released, true);

@@ -1284,7 +1284,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
int ret;
struct uffdio_register uffdio_register;
struct uffdio_register __user *user_uffdio_register;
- unsigned long vm_flags, new_flags;
+ vm_flags_t vm_flags, new_flags;
bool found;
bool basic_ioctls;
unsigned long start, end, vma_end;
@@ -1510,7 +1510,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
struct vm_area_struct *vma, *prev, *cur;
int ret;
struct uffdio_range uffdio_unregister;
- unsigned long new_flags;
+ vm_flags_t new_flags;
bool found;
unsigned long start, end, vma_end;
const void __user *buf = (void __user *)arg;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9626fda..2f524f0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -215,7 +215,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
__split_huge_pud(__vma, __pud, __address); \
} while (0)

-int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
+int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
int advice);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
unsigned long end, long adjust_next);
@@ -403,7 +403,7 @@ static inline void split_huge_pmd_address(struct vm_area_struct *vma,
do { } while (0)

static inline int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
+ vm_flags_t *vm_flags, int advice)
{
BUG();
return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e816..9f57409 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -20,7 +20,7 @@

#ifdef CONFIG_KSM
int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags);
+ unsigned long end, int advice, vm_flags_t *vm_flags);
int __ksm_enter(struct mm_struct *mm);
void __ksm_exit(struct mm_struct *mm);

@@ -67,7 +67,7 @@ static inline void ksm_exit(struct mm_struct *mm)

#ifdef CONFIG_MMU
static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags)
+ unsigned long end, int advice, vm_flags_t *vm_flags)
{
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c274f75..9e86ca1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -264,73 +264,68 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
extern unsigned int kobjsize(const void *objp);
#endif

+#define VM_FLAGS_BIT(N) (1ULL << (N))
+
/*
* vm_flags in vm_area_struct, see mm_types.h.
* When changing, update also include/trace/events/mmflags.h
*/
#define VM_NONE 0x00000000

-#define VM_READ 0x00000001 /* currently active flags */
-#define VM_WRITE 0x00000002
-#define VM_EXEC 0x00000004
-#define VM_SHARED 0x00000008
+#define VM_READ VM_FLAGS_BIT(0) /* currently active flags */
+#define VM_WRITE VM_FLAGS_BIT(1)
+#define VM_EXEC VM_FLAGS_BIT(2)
+#define VM_SHARED VM_FLAGS_BIT(3)

/* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */
-#define VM_MAYREAD 0x00000010 /* limits for mprotect() etc */
-#define VM_MAYWRITE 0x00000020
-#define VM_MAYEXEC 0x00000040
-#define VM_MAYSHARE 0x00000080
-
-#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
-#define VM_UFFD_MISSING 0x00000200 /* missing pages tracking */
-#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
-#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
-#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
-
-#define VM_LOCKED 0x00002000
-#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
-
- /* Used by sys_madvise() */
-#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
-#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */
-
-#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
-#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
-#define VM_LOCKONFAULT 0x00080000 /* Lock the pages covered when they are faulted in */
-#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
-#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
-#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
-#define VM_SYNC 0x00800000 /* Synchronous page faults */
-#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
-#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
-#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
+#define VM_MAYREAD VM_FLAGS_BIT(4) /* limits for mprotect() etc */
+#define VM_MAYWRITE VM_FLAGS_BIT(5)
+#define VM_MAYEXEC VM_FLAGS_BIT(6)
+#define VM_MAYSHARE VM_FLAGS_BIT(7)
+
+#define VM_GROWSDOWN VM_FLAGS_BIT(8) /* general info on the segment */
+#define VM_UFFD_MISSING VM_FLAGS_BIT(9) /* missing pages tracking */
+#define VM_PFNMAP VM_FLAGS_BIT(10) /* Page-ranges managed without "struct page", just pure PFN */
+#define VM_DENYWRITE VM_FLAGS_BIT(11) /* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP VM_FLAGS_BIT(12) /* wrprotect pages tracking */
+
+#define VM_LOCKED VM_FLAGS_BIT(13)
+#define VM_IO VM_FLAGS_BIT(14) /* Memory mapped I/O or similar */
+
+ /* Used by sys_madvise() */
+#define VM_SEQ_READ VM_FLAGS_BIT(15) /* App will access data sequentially */
+#define VM_RAND_READ VM_FLAGS_BIT(16) /* App will not benefit from clustered reads */
+
+#define VM_DONTCOPY VM_FLAGS_BIT(17) /* Do not copy this vma on fork */
+#define VM_DONTEXPAND VM_FLAGS_BIT(18) /* Cannot expand with mremap() */
+#define VM_LOCKONFAULT VM_FLAGS_BIT(19) /* Lock the pages covered when they are faulted in */
+#define VM_ACCOUNT VM_FLAGS_BIT(20) /* Is a VM accounted object */
+#define VM_NORESERVE VM_FLAGS_BIT(21) /* should the VM suppress accounting */
+#define VM_HUGETLB VM_FLAGS_BIT(22) /* Huge TLB Page VM */
+#define VM_SYNC VM_FLAGS_BIT(23) /* Synchronous page faults */
+#define VM_ARCH_1 VM_FLAGS_BIT(24) /* Architecture-specific flag */
+#define VM_WIPEONFORK VM_FLAGS_BIT(25) /* Wipe VMA contents in child. */
+#define VM_DONTDUMP VM_FLAGS_BIT(26) /* Do not include in the core dump */

#ifdef CONFIG_MEM_SOFT_DIRTY
-# define VM_SOFTDIRTY 0x08000000 /* Not soft dirty clean area */
+# define VM_SOFTDIRTY VM_FLAGS_BIT(27) /* Not soft dirty clean area */
#else
# define VM_SOFTDIRTY 0
#endif

-#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
-#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
-#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
-#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
-
-#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
-#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
-#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
-#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
-#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
-#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
+#define VM_MIXEDMAP VM_FLAGS_BIT(28) /* Can contain "struct page" and pure PFN pages */
+#define VM_HUGEPAGE VM_FLAGS_BIT(29) /* MADV_HUGEPAGE marked this vma */
+#define VM_NOHUGEPAGE VM_FLAGS_BIT(30) /* MADV_NOHUGEPAGE marked this vma */
+#define VM_MERGEABLE VM_FLAGS_BIT(31) /* KSM may merge identical pages */
+
+#define VM_HIGH_ARCH_0 VM_FLAGS_BIT(32)
+#define VM_HIGH_ARCH_1 VM_FLAGS_BIT(33)
+#define VM_HIGH_ARCH_2 VM_FLAGS_BIT(34)
+#define VM_HIGH_ARCH_3 VM_FLAGS_BIT(35)
+#define VM_HIGH_ARCH_4 VM_FLAGS_BIT(36)

#ifdef CONFIG_ARCH_HAS_PKEYS
-# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
+# define VM_PKEY_SHIFT 32
# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
# define VM_PKEY_BIT1 VM_HIGH_ARCH_1 /* on x86 and 5-bit value on ppc64 */
# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
@@ -373,8 +368,7 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-# define VM_UFFD_MINOR_BIT 37
-# define VM_UFFD_MINOR BIT(VM_UFFD_MINOR_BIT) /* UFFD minor faults */
+# define VM_UFFD_MINOR VM_FLAGS_BIT(37) /* UFFD minor faults */
#else /* !CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
# define VM_UFFD_MINOR VM_NONE
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
@@ -1894,7 +1888,7 @@ extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long
unsigned long cp_flags);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
- unsigned long end, unsigned long newflags);
+ unsigned long end, vm_flags_t newflags);

/*
* doesn't attempt to fault and will return short.
@@ -2545,7 +2539,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
}
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
- unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
+ vm_flags_t vm_flags, struct anon_vma *, struct file *, pgoff_t,
struct mempolicy *, struct vm_userfaultfd_ctx);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
@@ -2626,7 +2620,7 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {}

/* These take the mm semaphore themselves */
extern int __must_check vm_brk(unsigned long, unsigned long);
-extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
+extern int __must_check vm_brk_flags(unsigned long, unsigned long, vm_flags_t);
extern int vm_munmap(unsigned long, size_t);
extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
unsigned long, unsigned long,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cb612d0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -264,7 +264,7 @@ struct page_frag_cache {
bool pfmemalloc;
};

-typedef unsigned long vm_flags_t;
+typedef u64 vm_flags_t;

/*
* A region containing a mapping of a non-memory backed file under NOMMU
@@ -330,7 +330,7 @@ struct vm_area_struct {
* See vmf_insert_mixed_prot() for discussion.
*/
pgprot_t vm_page_prot;
- unsigned long vm_flags; /* Flags, see mm.h. */
+ vm_flags_t vm_flags; /* Flags, see mm.h. */

/*
* For areas with an address space and backing store,
@@ -478,7 +478,7 @@ struct mm_struct {
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
- unsigned long def_flags;
+ vm_flags_t def_flags;

spinlock_t arg_lock; /* protect the below fields */
unsigned long start_code, end_code, start_data, end_data;
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 629cefc..b2cbae9 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -135,7 +135,7 @@ static inline bool arch_validate_flags(unsigned long flags)
/*
* Combine the mmap "prot" argument into "vm_flags" used internally.
*/
-static inline unsigned long
+static inline vm_flags_t
calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
{
return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
@@ -147,7 +147,7 @@ static inline bool arch_validate_flags(unsigned long flags)
/*
* Combine the mmap "flags" argument into "vm_flags" used internally.
*/
-static inline unsigned long
+static inline vm_flags_t
calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3..aa8efba 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -830,8 +830,6 @@ config DEVICE_PRIVATE
config VMAP_PFN
bool

-config ARCH_USES_HIGH_VMA_FLAGS
- bool
config ARCH_HAS_PKEYS
bool

diff --git a/mm/debug.c b/mm/debug.c
index 0bdda84..6165b5f 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -202,7 +202,7 @@ void dump_vma(const struct vm_area_struct *vma)
"next %px prev %px mm %px\n"
"prot %lx anon_vma %px vm_ops %px\n"
"pgoff %lx file %px private_data %px\n"
- "flags: %#lx(%pGv)\n",
+ "flags: %#llx(%pGv)\n",
vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next,
vma->vm_prev, vma->vm_mm,
(unsigned long)pgprot_val(vma->vm_page_prot),
@@ -240,7 +240,7 @@ void dump_mm(const struct mm_struct *mm)
"numa_next_scan %lu numa_scan_offset %lu numa_scan_seq %d\n"
#endif
"tlb_flush_pending %d\n"
- "def_flags: %#lx(%pGv)\n",
+ "def_flags: %#llx(%pGv)\n",

mm, mm->mmap, (long long) mm->vmacache_seqnum, mm->task_size,
#ifdef CONFIG_MMU
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6c0185f..ad76bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -345,7 +345,7 @@ struct attribute_group khugepaged_attr_group = {
#endif /* CONFIG_SYSFS */

int hugepage_madvise(struct vm_area_struct *vma,
- unsigned long *vm_flags, int advice)
+ vm_flags_t *vm_flags, int advice)
{
switch (advice) {
case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index 2f3aaeb..257147c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2431,7 +2431,7 @@ static int ksm_scan_thread(void *nothing)
}

int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, int advice, unsigned long *vm_flags)
+ unsigned long end, int advice, vm_flags_t *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index 63e489e..5105393 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,7 +71,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
- unsigned long new_flags = vma->vm_flags;
+ vm_flags_t new_flags = vma->vm_flags;

switch (behavior) {
case MADV_NORMAL:
diff --git a/mm/memory.c b/mm/memory.c
index 730daa0..8d5e583 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -550,7 +550,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
(long long)pte_val(pte), (long long)pmd_val(*pmd));
if (page)
dump_page(page, "bad pte");
- pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
+ pr_alert("addr:%px vm_flags:%08llx anon_vma:%px mapping:%px index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
pr_alert("file:%pD fault:%ps mmap:%ps readpage:%ps\n",
vma->vm_file,
@@ -1241,7 +1241,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;

page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
+ if (unlikely(details) && page && !(vma->vm_flags & VM_NOSIGBUS)) {
/*
* unmap_shared_mapping_pages() wants to
* invalidate cache without truncating:
diff --git a/mm/mmap.c b/mm/mmap.c
index 0584e54..8bed547 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -191,7 +191,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
return next;
}

-static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
+static int do_brk_flags(unsigned long addr, unsigned long request, vm_flags_t flags,
struct list_head *uf);
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
@@ -1160,7 +1160,7 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
*/
struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
- unsigned long end, unsigned long vm_flags,
+ unsigned long end, vm_flags_t vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
@@ -1353,7 +1353,7 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
}

static inline int mlock_future_check(struct mm_struct *mm,
- unsigned long flags,
+ vm_flags_t flags,
unsigned long len)
{
unsigned long locked, lock_limit;
@@ -3050,7 +3050,7 @@ int vm_munmap(unsigned long start, size_t len)
* anonymous maps. eventually we may be able to do some
* brk-specific accounting here.
*/
-static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)
+static int do_brk_flags(unsigned long addr, unsigned long len, vm_flags_t flags, struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
@@ -3118,7 +3118,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
return 0;
}

-int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
+int vm_brk_flags(unsigned long addr, unsigned long request, vm_flags_t flags)
{
struct mm_struct *mm = current->mm;
unsigned long len;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7a4431..0433db7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -397,10 +397,10 @@ static int prot_none_test(unsigned long addr, unsigned long next,

int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
- unsigned long start, unsigned long end, unsigned long newflags)
+ unsigned long start, unsigned long end, vm_flags_t newflags)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long oldflags = vma->vm_flags;
+ vm_flags_t oldflags = vma->vm_flags;
long nrpages = (end - start) >> PAGE_SHIFT;
unsigned long charged = 0;
pgoff_t pgoff;
diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b..bf9a661 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -489,7 +489,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma;
- unsigned long vm_flags = vma->vm_flags;
+ vm_flags_t vm_flags = vma->vm_flags;
unsigned long new_pgoff;
unsigned long moved_len;
unsigned long excess = 0;
--
1.8.3.1

2021-06-04 07:47:31

by Ming Lin

Subject: [PATCH v2 2/2] mm: add NOSIGBUS extension to mmap()

Add a new mmap() flag, MAP_NOSIGBUS, to request "don't SIGBUS on
fault" behavior. For now, this flag is only allowed for private
mappings.

For a MAP_NOSIGBUS mapping, map in the zero page on a read fault,
or fill a freshly allocated page with zeroes on a write fault.
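
A sketch of the resulting userspace-visible semantics ("p" and "off"
are placeholders for a MAP_PRIVATE|MAP_NOSIGBUS mapping and an offset
past EOF):

	char c = p[off];	/* read fault: zero page mapped in, c == 0 */

	p[off] = 1;		/* write fault: freshly allocated zeroed page,
				 * private to this mapping and never written
				 * back to the file */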

Signed-off-by: Ming Lin <[email protected]>
---
arch/parisc/include/uapi/asm/mman.h | 1 +
include/linux/mm.h | 2 ++
include/linux/mman.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/memory.c | 11 +++++++++++
mm/mmap.c | 4 ++++
tools/include/uapi/asm-generic/mman-common.h | 1 +
7 files changed, 21 insertions(+)

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba..eecf9af 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -25,6 +25,7 @@
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */
#define MAP_UNINITIALIZED 0 /* uninitialized anonymous mmap */

#define MS_SYNC 1 /* synchronous memory sync */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9e86ca1..100d122 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -373,6 +373,8 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
# define VM_UFFD_MINOR VM_NONE
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */

+#define VM_NOSIGBUS VM_FLAGS_BIT(38) /* Do not SIGBUS on fault */
+
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index b2cbae9..c966b08 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -154,6 +154,7 @@ static inline bool arch_validate_flags(unsigned long flags)
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
+ _calc_vm_trans(flags, MAP_NOSIGBUS, VM_NOSIGBUS ) |
arch_calc_vm_flag_bits(flags);
}

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d..a2a5333 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
#define MAP_HUGETLB 0x040000 /* create a huge page mapping */
#define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */

#define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/mm/memory.c b/mm/memory.c
index 8d5e583..6b5a897 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3676,6 +3676,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
}

ret = vma->vm_ops->fault(vmf);
+ if (unlikely(ret & VM_FAULT_SIGBUS) && (vma->vm_flags & VM_NOSIGBUS)) {
+ /*
+ * For MAP_NOSIGBUS mapping, map in the zero page on read fault
+ * or fill a freshly allocated page with zeroes on write fault
+ */
+ ret = do_anonymous_page(vmf);
+ if (!ret)
+ ret = VM_FAULT_NOPAGE;
+ return ret;
+ }
+
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
VM_FAULT_DONE_COW)))
return ret;
diff --git a/mm/mmap.c b/mm/mmap.c
index 8bed547..d5c9fb5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1419,6 +1419,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!len)
return -EINVAL;

+ /* Restrict MAP_NOSIGBUS to MAP_PRIVATE mapping */
+ if ((flags & MAP_NOSIGBUS) && !(flags & MAP_PRIVATE))
+ return -EINVAL;
+
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index f94f65d..a2a5333 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
#define MAP_HUGETLB 0x040000 /* create a huge page mapping */
#define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */

#define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
--
1.8.3.1

2021-06-04 15:27:33

by Kirill A. Shutemov

Subject: Re: [PATCH v2 2/2] mm: add NOSIGBUS extension to mmap()

On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> Add a new mmap() flag, MAP_NOSIGBUS, to request "don't SIGBUS on
> fault" behavior. For now, this flag is only allowed for private
> mappings.

That's not what your use case asks for.

SIGBUS can be generated for a number of reasons, not only on a fault
beyond end-of-file. vmf_error() converts any errno except -ENOMEM to
VM_FAULT_SIGBUS.

Do you want to ignore -EIO or -ENOSPC? I don't think so.
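
For reference, vmf_error() currently reads (include/linux/mm.h), so
any filesystem errno other than -ENOMEM surfaces as VM_FAULT_SIGBUS:

	static inline vm_fault_t vmf_error(int err)
	{
		if (err == -ENOMEM)
			return VM_FAULT_OOM;
		return VM_FAULT_SIGBUS;
	}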

> For a MAP_NOSIGBUS mapping, map in the zero page on a read fault,
> or fill a freshly allocated page with zeroes on a write fault.

I don't like the resulting semantics: if you take a read fault beyond EOF
and get the zero page, you will still see the zero page even if the file
grows later. Yes, POSIX allows MAP_PRIVATE to get out of sync with the
file, but it's not what users are used to.

It might be enough for the use case, but I would rather avoid one-user
features.

--
Kirill A. Shutemov

2021-06-04 16:25:15

by Ming Lin

Subject: Re: [PATCH v2 2/2] mm: add NOSIGBUS extension to mmap()

On Fri, Jun 04, 2021 at 06:24:07PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> > Add a new mmap() flag, MAP_NOSIGBUS, to request "don't SIGBUS on
> > fault" behavior. For now, this flag is only allowed for private
> > mappings.
>
> That's not what your use case asks for.

Simon explained the use case here: https://bit.ly/3wR85Lc

FYI, I've copied it here too.

------begin-------------------------------------------------------------------
Regarding the requirements for Wayland:

- The baseline requirement is being able to avoid SIGBUS for read-only mappings
of shm files.
- Wayland clients can expand their shm files. However, the compositor doesn't
need to immediately access the new expanded region. The client will tell the
compositor what the new shm file size is, and the compositor will re-map it.
- Ideally, MAP_NOSIGBUS would work on PROT_WRITE + MAP_SHARED mappings (of
course, the no-SIGBUS behavior would be restricted to that mapping). The
use case is writing back to client buffers, e.g. for screen capture. From the
earlier discussions it seems like this would be complicated to implement.
This means we'll need to come up with a new libwayland API to allow
compositors to opt in to the read-only mappings. This is sub-optimal but
seems doable.
- Ideally, MAP_NOSIGBUS wouldn't be restricted to shm. There are use cases for
using it on ordinary files too, e.g. for sharing ICC profiles. But from the
earlier replies it seems very unlikely that this will become possible, and
making it work only on shm files would already be fantastic.
------end-------------------------------------------------------------------
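
To illustrate the baseline requirement above, a hedged compositor-side
sketch (pool_fd and pool_size are hypothetical names; note that older
kernels silently ignore unknown mmap flag bits, so availability would
have to be probed separately):

	/* Map a client's shm pool read-only; a client-side ftruncate()
	 * can then no longer SIGBUS the compositor. */
	void *data = mmap(NULL, pool_size, PROT_READ,
			  MAP_PRIVATE | MAP_NOSIGBUS, pool_fd, 0);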

>
> SIGBUS can be generated for a number of reasons, not only on a fault
> beyond end-of-file. vmf_error() converts any errno except -ENOMEM to
> VM_FAULT_SIGBUS.
>
> Do you want to ignore -EIO or -ENOSPC? I don't think so.
>
> > For a MAP_NOSIGBUS mapping, map in the zero page on a read fault,
> > or fill a freshly allocated page with zeroes on a write fault.
>
> I don't like the resulting semantics: if you take a read fault beyond EOF
> and get the zero page, you will still see the zero page even if the file
> grows later. Yes, POSIX allows MAP_PRIVATE to get out of sync with the
> file, but it's not what users are used to.

Actually, an older version did support the file growing.
https://github.com/minggr/linux/commit/77f3722b94ff33cafe0a72c1bf1b8fa374adb29f

We can support this if there is a real use case.

>
> It might be enough for the use case, but I would rather avoid one-user
> features.

2021-06-28 18:19:42

by Vivek Goyal

Subject: Re: [PATCH v2 2/2] mm: add NOSIGBUS extension to mmap()

On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> Add a new mmap() flag, MAP_NOSIGBUS, to request "don't SIGBUS on
> fault" behavior. For now, this flag is only allowed for private
> mappings.
>
> For a MAP_NOSIGBUS mapping, map in the zero page on a read fault,
> or fill a freshly allocated page with zeroes on a write fault.

I am wondering if this could be of limited use for me if MAP_NOSIGBUS
were to be supported for shared mappings as well.

When virtiofs runs with DAX enabled, it is possible that a file
shared between two guests is truncated by one guest while the second
guest does a load/store on it. Given the current KVM architecture,
there is no mechanism to propagate SIGBUS to the guest process;
instead, KVM retries the page fault infinitely and the guest
cpu/process hangs.

Ideally we want this error to propagate all the way back into the
guest and to the guest process but that solution is not in place yet.

https://lore.kernel.org/kvm/[email protected]/

In the absence of a proper solution, one could think of mapping the
shared file on the host with MAP_NOSIGBUS, and hopefully that means
KVM will be able to resolve the fault to a zero-filled page and the
guest will not hang. But this means that data sharing between the two
processes is now broken: writes by process A will no longer be
visible to process B in the other guest once this situation happens,
IIUC.

So if we were to use MAP_NOSIGBUS, the guest would not hang, but
failures resulting from ftruncate would be silent and only noticed
some time later. I guess that's not exactly a very pleasant
scenario...

Thanks
Vivek



>
> Signed-off-by: Ming Lin <[email protected]>
> ---
> arch/parisc/include/uapi/asm/mman.h | 1 +
> include/linux/mm.h | 2 ++
> include/linux/mman.h | 1 +
> include/uapi/asm-generic/mman-common.h | 1 +
> mm/memory.c | 11 +++++++++++
> mm/mmap.c | 4 ++++
> tools/include/uapi/asm-generic/mman-common.h | 1 +
> 7 files changed, 21 insertions(+)
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba..eecf9af 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -25,6 +25,7 @@
> #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */
> #define MAP_UNINITIALIZED 0 /* uninitialized anonymous mmap */
>
> #define MS_SYNC 1 /* synchronous memory sync */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9e86ca1..100d122 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -373,6 +373,8 @@ int __add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> # define VM_UFFD_MINOR VM_NONE
> #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
>
> +#define VM_NOSIGBUS VM_FLAGS_BIT(38) /* Do not SIGBUS on fault */
> +
> /* Bits set in the VMA until the stack is in its final location */
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
>
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index b2cbae9..c966b08 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -154,6 +154,7 @@ static inline bool arch_validate_flags(unsigned long flags)
> _calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
> + _calc_vm_trans(flags, MAP_NOSIGBUS, VM_NOSIGBUS ) |
> arch_calc_vm_flag_bits(flags);
> }
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d..a2a5333 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -29,6 +29,7 @@
> #define MAP_HUGETLB 0x040000 /* create a huge page mapping */
> #define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
> #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */
>
> #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
> * uninitialized */
> diff --git a/mm/memory.c b/mm/memory.c
> index 8d5e583..6b5a897 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3676,6 +3676,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> }
>
> ret = vma->vm_ops->fault(vmf);
> + if (unlikely(ret & VM_FAULT_SIGBUS) && (vma->vm_flags & VM_NOSIGBUS)) {
> + /*
> + * For MAP_NOSIGBUS mapping, map in the zero page on read fault
> + * or fill a freshly allocated page with zeroes on write fault
> + */
> + ret = do_anonymous_page(vmf);
> + if (!ret)
> + ret = VM_FAULT_NOPAGE;
> + return ret;
> + }
> +
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> VM_FAULT_DONE_COW)))
> return ret;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8bed547..d5c9fb5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1419,6 +1419,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> if (!len)
> return -EINVAL;
>
> + /* Restrict MAP_NOSIGBUS to MAP_PRIVATE mapping */
> + if ((flags & MAP_NOSIGBUS) && !(flags & MAP_PRIVATE))
> + return -EINVAL;
> +
> /*
> * Does the application expect PROT_READ to imply PROT_EXEC?
> *
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index f94f65d..a2a5333 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -29,6 +29,7 @@
> #define MAP_HUGETLB 0x040000 /* create a huge page mapping */
> #define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */
> #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
> +#define MAP_NOSIGBUS 0x200000 /* do not SIGBUS on fault */
>
> #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
> * uninitialized */
> --
> 1.8.3.1
>

2021-06-30 16:38:36

by Ming Lin

Subject: Re: [PATCH v2 2/2] mm: add NOSIGBUS extension to mmap()

On Mon, Jun 28, 2021 at 10:27:23AM -0400, Vivek Goyal wrote:
> On Fri, Jun 04, 2021 at 12:43:22AM -0700, Ming Lin wrote:
> > Add a new mmap() flag, MAP_NOSIGBUS, to request "don't SIGBUS on
> > fault" behavior. For now, this flag is only allowed for private
> > mappings.
> >
> > For a MAP_NOSIGBUS mapping, map in the zero page on a read fault,
> > or fill a freshly allocated page with zeroes on a write fault.
>
> I am wondering if this could be of limited use for me if MAP_NOSIGBUS
> were to be supported for shared mappings as well.

V1 did support shared mappings.
https://lkml.org/lkml/2021/6/1/1078

And V0 even supported unmapping the zero page for later write.
https://github.com/minggr/linux/commit/77f3722b94ff33cafe0a72c1bf1b8fa374adb29f

We may support shared mappings if there is a real use case.
As Hugh mentioned:
> And by restricting to MAP_PRIVATE, you would allow for adding a
> proper MAP_SHARED implementation later, if it's thought useful
> (that being the implementation which can subsequently unmap a
> zero page to let new page cache be mapped).

See https://lkml.org/lkml/2021/6/1/1258

Ming

>
> When virtiofs runs with DAX enabled, it is possible that a file
> shared between two guests is truncated by one guest while the second
> guest does a load/store on it. Given the current KVM architecture,
> there is no mechanism to propagate SIGBUS to the guest process;
> instead, KVM retries the page fault infinitely and the guest
> cpu/process hangs.
>
> Ideally we want this error to propagate all the way back into the
> guest and to the guest process but that solution is not in place yet.
>
> https://lore.kernel.org/kvm/[email protected]/
>
> In the absence of a proper solution, one could think of mapping the
> shared file on the host with MAP_NOSIGBUS, and hopefully that means
> KVM will be able to resolve the fault to a zero-filled page and the
> guest will not hang. But this means that data sharing between the two
> processes is now broken: writes by process A will no longer be
> visible to process B in the other guest once this situation happens,
> IIUC.
>
> So if we were to use MAP_NOSIGBUS, the guest would not hang, but
> failures resulting from ftruncate would be silent and only noticed
> some time later. I guess that's not exactly a very pleasant
> scenario...
>
> Thanks
> Vivek