This is v1 of the per-VMA locks patchset, originally posted as an RFC at
[1] and described in an LWN article at [2]. The per-VMA locks idea was
discussed during the SPF [3] discussion at LSF/MM this year [4], which
concluded with the suggestion that "a reader/writer semaphore could be put
into the VMA itself; that would have the effect of using the VMA as a
sort of range lock. There would still be contention at the VMA level, but
it would be an improvement." This patchset implements that suggested
approach.
When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
One notable way the implementation deviates from the proposal is the way
VMAs are write-locked. During some mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc.).
Tracking all the locked VMAs, avoiding recursive locks and figuring out
when it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.
Write-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. A VMA is considered write-locked
when these sequence numbers are equal. To write-lock a VMA we set its
sequence number in vm_area_struct to be equal to the sequence number in
mm_struct. To unlock all VMAs we increment mm_struct's seq number. This
allows for an efficient way to track locked VMAs and to drop the locks on
all VMAs at the end of the update.
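In code, the scheme boils down to roughly the following. This is a
simplified sketch using the helper and field names introduced by the
patches below; the real vma_write_lock() additionally excludes any
concurrent readers before marking the VMA, and vma_write_unlock_mm() is
called from mmap_write_unlock():

/* Write-lock a VMA; the caller holds mmap_lock in write mode. */
static inline void vma_write_lock(struct vm_area_struct *vma)
{
	int mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

	if (vma->vm_lock_seq == mm_lock_seq)
		return;	/* already locked during this update cycle */
	/* ... wait for existing readers to drain ... */
	vma->vm_lock_seq = mm_lock_seq;
}

/* Unlock all write-locked VMAs of this mm in bulk. */
static inline void vma_write_unlock_mm(struct mm_struct *mm)
{
	/* Every vma->vm_lock_seq instantly becomes stale. */
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}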
The patchset implements per-VMA locking only for anonymous pages which
are not in swap, and it avoids userfaultfds as their handling is more
complex. Additional support for file-backed page faults, swapped pages
and userfault pages can be added incrementally.
Performance benchmarks show similar, although slightly smaller, benefits
than with the SPF patchset (~75% of SPF's benefits). Still, with its lower
complexity this approach might be more desirable.
Since the RFC was posted in September 2022, two separate Google teams
outside of Android evaluated the patchset and confirmed positive results.
Here are the known use cases where per-VMA locks show benefits:
Android:
Apps with a high number of threads (~100) see launch times improve by up
to 20%. Each thread mmaps several areas upon startup (stack,
thread-local storage (TLS), thread signal stack, indirect ref table),
which requires taking mmap_lock in write mode. Page faults take mmap_lock
in read mode. During app launch, both thread creation and the page faults
establishing the active working set happen in parallel, and that causes
lock contention between mm writers and readers even if the updates and
page faults happen in different VMAs. Per-VMA locks prevent this
contention by providing a more granular lock.
Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrives, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section on its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that can it be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup. The per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation was no longer a bottleneck.
TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.
In today's implementation, a process with N threads, where N - 1 threads
are performing zerocopy receives and 1 thread is performing madvise() with
the write lock taken (e.g. it needs to change vm_flags), will see all
N - 1 receive threads block until the madvise is done. Conversely, on a
busy process receiving a lot of data, an madvise operation that does need
to take the mmap lock in write mode will need to wait for all of the
receives to be done - a lose:lose proposition. Per-VMA locking _removes_
this source of contention entirely, by definition.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the per-vma
lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial
patch written to take advantage of it in TCP, is about 0.4% of cycles
(down from 0.7% above)
2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1%
cycles and is out of the TCP read hotpath entirely (down from 0.5% before,
the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention, it
also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.
The patchset structure is:
0001-0007: Enable maple-tree RCU mode
0008-0038: Main per-vma locks patchset
0039-0040: Performance optimizations
0041: Memory overhead optimization
Branch for testing is posted at:
https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock
The patchset applies cleanly over Linus' tree at:
commit b7bfaa761d76 ("Linux 6.2-rc3")
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lwn.net/Articles/906852/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lwn.net/Articles/893906/
Laurent Dufour (1):
powerpc/mm: try VMA lock-based page fault handling first
Liam Howlett (4):
maple_tree: Be more cautious about dead nodes
maple_tree: Detect dead nodes in mas_start()
maple_tree: Fix freeing of nodes in rcu mode
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
Liam R. Howlett (3):
maple_tree: Fix write memory barrier of nodes once dead for RCU mode
maple_tree: Add smp_rmb() to dead node detection
mm: Enable maple tree RCU mode by default.
Michel Lespinasse (1):
mm: rcu safe VMA freeing
Suren Baghdasaryan (32):
mm: introduce CONFIG_PER_VMA_LOCK
mm: move mmap_lock assert function definitions
mm: export dump_mm()
mm: add per-VMA lock and helper functions to control it
mm: introduce vma->vm_flags modifier functions
mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
mm: replace vma->vm_flags direct modifications with modifier calls
mm: replace vma->vm_flags indirect modification in ksm_madvise
mm/mmap: move VMA locking before anon_vma_lock_write call
mm/khugepaged: write-lock VMA while collapsing a huge page
mm/mmap: write-lock VMAs before merging, splitting or expanding them
mm/mmap: write-lock VMAs in vma_adjust
mm/mmap: write-lock VMAs affected by VMA expansion
mm/mremap: write-lock VMA while remapping it to a new address range
mm: write-lock VMAs before removing them from VMA tree
mm: conditionally write-lock VMA in free_pgtables
mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
kernel/fork: assert no VMA readers during its destruction
mm/mmap: prevent pagefault handler from racing with mmu_notifier
registration
mm: introduce lock_vma_under_rcu to be used from arch-specific code
mm: fall back to mmap_lock if vma->anon_vma is not yet set
mm: add FAULT_FLAG_VMA_LOCK flag
mm: prevent do_swap_page from handling page faults under VMA lock
mm: prevent userfaults to be handled under per-vma lock
mm: introduce per-VMA lock statistics
x86/mm: try VMA lock-based page fault handling first
arm64/mm: try VMA lock-based page fault handling first
mm: introduce mod_vm_flags_nolock
mm: avoid assertion in untrack_pfn
kernel/fork: throttle call_rcu() calls in vm_area_free
mm: separate vma->lock from vm_area_struct
mm: replace rw_semaphore with atomic_t in vma_lock
arch/arm/kernel/process.c | 2 +-
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++
arch/ia64/mm/init.c | 8 +-
arch/loongarch/include/asm/tlb.h | 2 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 5 +-
arch/powerpc/kvm/book3s_xive_native.c | 2 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 2 +-
arch/powerpc/mm/fault.c | 41 +++++
arch/powerpc/platforms/book3s/vas-api.c | 2 +-
arch/powerpc/platforms/cell/spufs/file.c | 14 +-
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
arch/s390/mm/gmap.c | 8 +-
arch/x86/Kconfig | 1 +
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/kernel/cpu/sgx/driver.c | 2 +-
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
arch/x86/mm/fault.c | 36 ++++
arch/x86/mm/pat/memtype.c | 14 +-
arch/x86/um/mem_32.c | 2 +-
drivers/acpi/pfr_telemetry.c | 2 +-
drivers/android/binder.c | 3 +-
drivers/char/mspec.c | 2 +-
drivers/crypto/hisilicon/qm.c | 2 +-
drivers/dax/device.c | 2 +-
drivers/dma/idxd/cdev.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 +-
drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 4 +-
drivers/gpu/drm/amd/amdkfd/kfd_events.c | 4 +-
drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 +-
drivers/gpu/drm/drm_gem.c | 2 +-
drivers/gpu/drm/drm_gem_dma_helper.c | 3 +-
drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
drivers/gpu/drm/drm_vm.c | 8 +-
drivers/gpu/drm/etnaviv/etnaviv_gem.c | 2 +-
drivers/gpu/drm/exynos/exynos_drm_gem.c | 4 +-
drivers/gpu/drm/gma500/framebuffer.c | 2 +-
drivers/gpu/drm/i810/i810_dma.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_mman.c | 4 +-
drivers/gpu/drm/mediatek/mtk_drm_gem.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 2 +-
drivers/gpu/drm/omapdrm/omap_gem.c | 3 +-
drivers/gpu/drm/rockchip/rockchip_drm_gem.c | 3 +-
drivers/gpu/drm/tegra/gem.c | 5 +-
drivers/gpu/drm/ttm/ttm_bo_vm.c | 3 +-
drivers/gpu/drm/virtio/virtgpu_vram.c | 2 +-
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c | 2 +-
drivers/gpu/drm/xen/xen_drm_front_gem.c | 3 +-
drivers/hsi/clients/cmt_speech.c | 2 +-
drivers/hwtracing/intel_th/msu.c | 2 +-
drivers/hwtracing/stm/core.c | 2 +-
drivers/infiniband/hw/hfi1/file_ops.c | 4 +-
drivers/infiniband/hw/mlx5/main.c | 4 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 13 +-
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 2 +-
.../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c | 2 +-
.../common/videobuf2/videobuf2-dma-contig.c | 2 +-
.../common/videobuf2/videobuf2-vmalloc.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 4 +-
drivers/media/v4l2-core/videobuf-vmalloc.c | 2 +-
drivers/misc/cxl/context.c | 2 +-
drivers/misc/habanalabs/common/memory.c | 2 +-
drivers/misc/habanalabs/gaudi/gaudi.c | 4 +-
drivers/misc/habanalabs/gaudi2/gaudi2.c | 8 +-
drivers/misc/habanalabs/goya/goya.c | 4 +-
drivers/misc/ocxl/context.c | 4 +-
drivers/misc/ocxl/sysfs.c | 2 +-
drivers/misc/open-dice.c | 6 +-
drivers/misc/sgi-gru/grufile.c | 4 +-
drivers/misc/uacce/uacce.c | 2 +-
drivers/sbus/char/oradax.c | 2 +-
drivers/scsi/cxlflash/ocxl_hw.c | 2 +-
drivers/scsi/sg.c | 2 +-
.../staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
drivers/staging/media/deprecated/meye/meye.c | 4 +-
.../media/deprecated/stkwebcam/stk-webcam.c | 2 +-
drivers/target/target_core_user.c | 2 +-
drivers/uio/uio.c | 2 +-
drivers/usb/core/devio.c | 3 +-
drivers/usb/mon/mon_bin.c | 3 +-
drivers/vdpa/vdpa_user/iova_domain.c | 2 +-
drivers/vfio/pci/vfio_pci_core.c | 2 +-
drivers/vhost/vdpa.c | 2 +-
drivers/video/fbdev/68328fb.c | 2 +-
drivers/video/fbdev/core/fb_defio.c | 4 +-
drivers/xen/gntalloc.c | 2 +-
drivers/xen/gntdev.c | 4 +-
drivers/xen/privcmd-buf.c | 2 +-
drivers/xen/privcmd.c | 4 +-
fs/aio.c | 2 +-
fs/cramfs/inode.c | 2 +-
fs/erofs/data.c | 2 +-
fs/exec.c | 4 +-
fs/ext4/file.c | 2 +-
fs/fuse/dax.c | 2 +-
fs/hugetlbfs/inode.c | 4 +-
fs/orangefs/file.c | 3 +-
fs/proc/task_mmu.c | 2 +-
fs/proc/vmcore.c | 3 +-
fs/userfaultfd.c | 12 +-
fs/xfs/xfs_file.c | 2 +-
include/linux/mm.h | 159 +++++++++++++++++-
include/linux/mm_types.h | 62 ++++++-
include/linux/mmap_lock.h | 37 ++--
include/linux/pgtable.h | 5 +-
include/linux/vm_event_item.h | 6 +
include/linux/vmstat.h | 6 +
kernel/bpf/ringbuf.c | 4 +-
kernel/bpf/syscall.c | 4 +-
kernel/events/core.c | 2 +-
kernel/fork.c | 148 +++++++++++++---
kernel/kcov.c | 2 +-
kernel/relay.c | 2 +-
lib/maple_tree.c | 146 +++++++++++++---
mm/Kconfig | 13 ++
mm/Kconfig.debug | 8 +
mm/debug.c | 1 +
mm/hugetlb.c | 4 +-
mm/init-mm.c | 8 +
mm/internal.h | 2 +-
mm/khugepaged.c | 7 +
mm/ksm.c | 2 +
mm/madvise.c | 2 +-
mm/memory.c | 94 +++++++++--
mm/memremap.c | 4 +-
mm/mlock.c | 12 +-
mm/mmap.c | 99 ++++++++---
mm/mprotect.c | 2 +-
mm/mremap.c | 9 +-
mm/nommu.c | 16 +-
mm/secretmem.c | 2 +-
mm/shmem.c | 2 +-
mm/vmalloc.c | 2 +-
mm/vmstat.c | 6 +
net/ipv4/tcp.c | 4 +-
security/selinux/selinuxfs.c | 6 +-
sound/core/oss/pcm_oss.c | 2 +-
sound/core/pcm_native.c | 9 +-
sound/soc/pxa/mmp-sspa.c | 2 +-
sound/usb/usx2y/us122l.c | 4 +-
sound/usb/usx2y/usX2Yhwdep.c | 2 +-
sound/usb/usx2y/usx2yhwdeppcm.c | 2 +-
tools/testing/radix-tree/maple.c | 16 ++
146 files changed, 1047 insertions(+), 316 deletions(-)
--
2.39.0
Introduce the lock_vma_under_rcu function to look up and lock a VMA during
page fault handling. When the VMA is not found, cannot be locked or
changes after being locked, the function returns NULL. The lookup is
performed under RCU protection to prevent the found VMA from being
destroyed before the VMA lock is acquired. VMA lock statistics are updated
according to the results.
For now only anonymous VMAs can be searched this way. In all other cases
the function returns NULL.
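For reference, the intended calling pattern mirrors the arch fault handler
patches later in this series (error paths trimmed; arch_access_error() is
a stand-in for the arch-specific vm_flags/permission check):

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;		/* fall back to the mmap_lock path */

	if (arch_access_error(vma)) {	/* hypothetical permission check */
		vma_read_unlock(vma);
		goto lock_mmap;
	}

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_read_unlock(vma);

	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;		/* handled without taking mmap_lock */
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);
	/* VM_FAULT_RETRY: retry the fault under mmap_lock as before. */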
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 3 +++
mm/memory.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c464fc8a514c..d0fddf6a1de9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
vma);
}
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address);
+
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 9ece18548db1..a658e26d965d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ MA_STATE(mas, &mm->mm_mt, address, address);
+ struct vm_area_struct *vma, *validate;
+
+ rcu_read_lock();
+ vma = mas_walk(&mas);
+retry:
+ if (!vma)
+ goto inval;
+
+ /* Only anonymous vmas are supported for now */
+ if (!vma_is_anonymous(vma))
+ goto inval;
+
+ if (!vma_read_trylock(vma))
+ goto inval;
+
+ /* Check since vm_start/vm_end might change before we lock the VMA */
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ vma_read_unlock(vma);
+ goto inval;
+ }
+
+ /* Check if the VMA got isolated after we found it */
+ mas.index = address;
+ validate = mas_walk(&mas);
+ if (validate != vma) {
+ vma_read_unlock(vma);
+ count_vm_vma_lock_event(VMA_LOCK_MISS);
+ /* The area was replaced with another one. */
+ vma = validate;
+ goto retry;
+ }
+
+ rcu_read_unlock();
+ return vma;
+inval:
+ rcu_read_unlock();
+ count_vm_vma_lock_event(VMA_LOCK_ABORT);
+ return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
#ifndef __PAGETABLE_P4D_FOLDED
/*
* Allocate p4d page table.
--
2.39.0
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 2560524ad7f4..20806bc8b4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3707,6 +3707,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
--
2.39.0
vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f6ca4a87f9e2..1e2154137631 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -614,6 +614,12 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
* The following helper function should be used when such adjustments
* are necessary. The "insert" vma (if any) is to be inserted
* before we drop the necessary locks.
+ * 'expand' vma is always locked before it's passed to __vma_adjust()
+ * from vma_merge() because vma should not change from the moment
+ * can_vma_merge_{before|after} decision is made.
+ * 'insert' vma is used only by __split_vma() and it's always a brand
+ * new vma which is not yet added into mm's vma tree, therefore no need
+ * to lock it.
*/
int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
@@ -633,6 +639,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
MA_STATE(mas, &mm->mm_mt, 0, 0);
struct vm_area_struct *exporter = NULL, *importer = NULL;
+ vma_write_lock(vma);
+ if (next)
+ vma_write_lock(next);
+
if (next && !insert) {
if (end >= next->vm_end) {
/*
@@ -676,8 +686,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* If next doesn't have anon_vma, import from vma after
* next, if the vma overlaps with it.
*/
- if (remove_next == 2 && !next->anon_vma)
+ if (remove_next == 2 && !next->anon_vma) {
exporter = next_next;
+ if (exporter)
+ vma_write_lock(exporter);
+ }
} else if (end > next->vm_start) {
/*
@@ -850,6 +863,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (remove_next == 2) {
remove_next = 1;
next = next_next;
+ if (next)
+ vma_write_lock(next);
goto again;
}
}
--
2.39.0
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed..829fa6d14a36 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 596f46dabe4e..833fa8bab291 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+ struct vm_area_struct *vma;
+#endif
if (kprobe_page_fault(regs, esr))
return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, addr);
+ if (!vma)
+ goto lock_mmap;
+
+ if (!(vma->vm_flags & vm_flags)) {
+ vma_read_unlock(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, addr & PAGE_MASK,
+ mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_read_unlock(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
+ return 0;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
}
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
/*
* Handle the "normal" (no error) case first.
*/
--
2.39.0
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about handling page faults under VMA lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/vm_event_item.h | 6 ++++++
include/linux/vmstat.h | 6 ++++++
mm/Kconfig.debug | 8 ++++++++
mm/vmstat.c | 6 ++++++
4 files changed, 26 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ VMA_LOCK_SUCCESS,
+ VMA_LOCK_ABORT,
+ VMA_LOCK_RETRY,
+ VMA_LOCK_MISS,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
#define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..32a93b064590 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
kernel.
If in doubt, say N.
+
+
+config PER_VMA_LOCK_STATS
+ bool "Statistics for per-vma locks"
+ depends on PER_VMA_LOCK
+ default y
+ help
+ Statistics for per-vma locks.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ "vma_lock_success",
+ "vma_lock_abort",
+ "vma_lock_retry",
+ "vma_lock_miss",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.39.0
From: Liam Howlett <[email protected]>
The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types. This is necessary for the
RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 73 ++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 62 insertions(+), 11 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a748938ad2e9..a11eea943f8d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -897,6 +897,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
meta->end = end;
}
+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+ enum maple_type mt)
+{
+ struct maple_metadata *meta;
+ unsigned long *pivots;
+ void __rcu **slots;
+ void *next;
+
+ switch (mt) {
+ case maple_range_64:
+ pivots = mn->mr64.pivot;
+ if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+ slots = mn->mr64.slot;
+ next = mas_slot_locked(mas, slots,
+ MAPLE_RANGE64_SLOTS - 1);
+ if (unlikely((mte_to_node(next) && mte_node_type(next))))
+ return; /* The last slot is a node, no metadata */
+ }
+ fallthrough;
+ case maple_arange_64:
+ meta = ma_meta(mn, mt);
+ break;
+ default:
+ return;
+ }
+
+ meta->gap = 0;
+ meta->end = 0;
+}
+
/*
* ma_meta_end() - Get the data end of a node from the metadata
* @mn: The maple node
@@ -5448,20 +5486,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
* mas_dead_leaves() - Mark all leaves of a node as dead.
* @mas: The maple state
* @slots: Pointer to the slot array
+ * @type: The maple node type
*
* Must hold the write lock.
*
* Return: The number of leaves marked as dead.
*/
static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+ enum maple_type mt)
{
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;
- for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+ for (offset = 0; offset < mt_slots[mt]; offset++) {
entry = mas_slot_locked(mas, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
@@ -5480,14 +5520,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
{
- struct maple_node *node, *next;
+ struct maple_node *next;
void __rcu **slots = NULL;
next = mas_mn(mas);
do {
- mas->node = ma_enode_ptr(next);
- node = mas_mn(mas);
- slots = ma_slots(node, node->type);
+ mas->node = mt_mk_node(next, next->type);
+ slots = ma_slots(next, next->type);
next = mas_slot_locked(mas, slots, offset);
offset = 0;
} while (!ma_is_leaf(next->type));
@@ -5551,11 +5590,14 @@ static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
node = mas_mn(mas);
slots = ma_slots(node, mte_node_type(mas->node));
next = mas_slot_locked(mas, slots, 0);
- if ((mte_dead_node(next)))
+ if ((mte_dead_node(next))) {
+ mte_to_node(next)->type = mte_node_type(next);
next = mas_slot_locked(mas, slots, 1);
+ }
mte_set_node_dead(mas->node);
node->type = mte_node_type(mas->node);
+ mas_clear_meta(mas, node, node->type);
node->piv_parent = prev;
node->parent_slot = offset;
offset = 0;
@@ -5575,13 +5617,18 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
MA_STATE(mas, &mt, 0, 0);
- if (mte_is_leaf(enode))
+ mas.node = enode;
+ if (mte_is_leaf(enode)) {
+ node->type = mte_node_type(enode);
goto free_leaf;
+ }
+ ma_flags &= ~MT_FLAGS_LOCK_MASK;
mt_init_flags(&mt, ma_flags);
mas_lock(&mas);
- mas.node = start = enode;
+ mte_to_node(enode)->ma_flags = ma_flags;
+ start = enode;
slots = mas_destroy_descend(&mas, start, 0);
node = mas_mn(&mas);
do {
@@ -5589,7 +5636,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
unsigned char offset;
struct maple_enode *parent, *tmp;
- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
@@ -5613,7 +5661,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
} while (start != mas.node);
node = mas_mn(&mas);
- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);
@@ -5623,6 +5672,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
free_leaf:
if (free)
mt_free_rcu(&node->rcu);
+ else
+ mas_clear_meta(&mas, node, node->type);
}
/*
--
2.39.0
To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with the VM_LOCKED_MASK bitmask and convert all users.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 4 ++--
kernel/fork.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/mlock.c | 6 +++---
mm/mmap.c | 6 +++---
mm/mremap.c | 2 +-
6 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35cf0a6cbcc2..2b16d45b75a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,8 +416,8 @@ extern unsigned int kobjsize(const void *objp);
/* This mask defines which mm->def_flags a process can inherit its parent */
#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
/* Arch-specific flags to clear when updating VM flags on protection change */
#ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index c026d75108b3..1591dd8a0745 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,7 +674,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->anon_vma = NULL;
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
- tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+ clear_vm_flags(tmp, VM_LOCKED_MASK);
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db895230ee7e..24861cbfa2b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6950,8 +6950,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
/* Allow segments to share if only one is marked locked */
- unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
- unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+ unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+ unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
/*
* match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6dd0ce1..06aa9e204fac 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -490,7 +490,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
prev = mas_prev(&mas, 0);
for (nstart = start ; ; ) {
- vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+ vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= flags;
@@ -662,7 +662,7 @@ static int apply_mlockall_flags(int flags)
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
- current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+ current->mm->def_flags &= ~VM_LOCKED_MASK;
if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
@@ -682,7 +682,7 @@ static int apply_mlockall_flags(int flags)
mas_for_each(&mas, vma, ULONG_MAX) {
vm_flags_t newflags;
- newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+ newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9db37adfc00a..5c4b608edde9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2721,7 +2721,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ clear_vm_flags(vma, VM_LOCKED_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -3392,8 +3392,8 @@ static struct vm_area_struct *__install_special_mapping(
vma->vm_start = addr;
vma->vm_end = addr + len;
- vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ init_vm_flags(vma, (vm_flags | mm->def_flags |
+ VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5d6591..5f6f9931bff1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -686,7 +686,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
/* We always clear VM_LOCKED[ONFAULT] on the old vma */
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ clear_vm_flags(vma, VM_LOCKED_MASK);
/*
* anon_vma links of the old vma is no longer needed after its page
--
2.39.0
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes the read lock, it increments the atomic unless the
value is negative. If that fails, read-locking is aborted and mmap_lock
is used instead.
When a writer takes the write lock, it resets the atomic value to -1 if
the current value is 0 (no readers). Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. If there
are readers, the writer will place itself into a wait queue using the new
mm_struct.vma_writer_wait waitqueue head. The last reader to release
the vma_lock will signal the writer to wake up.
vm_lock_seq is also moved into vma_lock, and together with the atomic_t
they are nicely packed into 8 bytes, bringing the per-VMA locking
overhead down from 44 to 16 bytes:
slabinfo before the changes:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
vm_area_struct ... 152 53 2 : ...
slabinfo with vma_lock:
<name> ... <objsize> <objperslab> <pagesperslab> : ...
rw_semaphore ... 8 512 1 : ...
vm_area_struct ... 160 51 2 : ...
Assuming 40000 vm_area_structs, memory consumption would be:
baseline: 6040kB
vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
Total increase: 556kB
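(For reference, these figures follow from the slabinfo above, assuming
4kB pages:
 baseline:  ceil(40000 / 53 objs per slab) = 755 slabs * 8kB = 6040kB
 vma_lock:  ceil(40000 / 51) = 785 slabs * 8kB = 6280kB for vm_area_structs,
            plus ceil(40000 / 512) = 79 slabs * 4kB = 316kB for vma_locks.)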
atomic_t might overflow if there are many competing readers, therefore
vma_read_trylock() implements an overflow check and, if that occurs, it
restores the previous value and exits with a failure to lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 37 +++++++++++++++++++++++++------------
include/linux/mm_types.h | 10 ++++++++--
kernel/fork.c | 6 +++---
mm/init-mm.c | 2 ++
4 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d40bf8a5e19e..294dd44b2198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
* mm->mm_lock_seq can't be concurrently modified.
*/
mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
- if (vma->vm_lock_seq == mm_lock_seq)
+ if (vma->vm_lock->lock_seq == mm_lock_seq)
return;
- down_write(&vma->vm_lock->lock);
- vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->vm_lock->lock);
+ if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
+ wait_event(vma->vm_mm->vma_writer_wait,
+ atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
+ vma->vm_lock->lock_seq = mm_lock_seq;
+ /* Write barrier to ensure lock_seq change is visible before count */
+ smp_wmb();
+ atomic_set(&vma->vm_lock->count, 0);
}
/*
@@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
/* Check before locking. A race might cause false locked result. */
- if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+ if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
- if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+ if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
return false;
+ /* If atomic_t overflows, restore and fail to lock. */
+ if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
+ if (atomic_dec_and_test(&vma->vm_lock->count))
+ wake_up(&vma->vm_mm->vma_writer_wait);
+ return false;
+ }
+
/*
* Overflow might produce false locked result.
* False unlocked result is impossible because we modify and check
* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*/
- if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->vm_lock->lock);
+ if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+ if (atomic_dec_and_test(&vma->vm_lock->count))
+ wake_up(&vma->vm_mm->vma_writer_wait);
return false;
}
return true;
@@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
static inline void vma_read_unlock(struct vm_area_struct *vma)
{
- up_read(&vma->vm_lock->lock);
+ if (atomic_dec_and_test(&vma->vm_lock->count))
+ wake_up(&vma->vm_mm->vma_writer_wait);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -674,13 +687,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
* current task is holding mmap_write_lock, both vma->vm_lock_seq and
* mm->mm_lock_seq can't be concurrently modified.
*/
- VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+ VM_BUG_ON_VMA(vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
}
static inline void vma_assert_no_reader(struct vm_area_struct *vma)
{
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
- vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+ VM_BUG_ON_VMA(atomic_read(&vma->vm_lock->count) > 0 &&
+ vma->vm_lock->lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
vma);
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faa61b400f9b..a6050c38ca2e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -527,7 +527,13 @@ struct anon_vma_name {
};
struct vma_lock {
- struct rw_semaphore lock;
+ /*
+ * count > 0 ==> read-locked with 'count' number of readers
+ * count < 0 ==> write-locked
+ * count = 0 ==> unlocked
+ */
+ atomic_t count;
+ int lock_seq;
};
/*
@@ -566,7 +572,6 @@ struct vm_area_struct {
unsigned long vm_flags;
#ifdef CONFIG_PER_VMA_LOCK
- int vm_lock_seq;
struct vma_lock *vm_lock;
#endif
@@ -706,6 +711,7 @@ struct mm_struct {
* by mmlist_lock
*/
#ifdef CONFIG_PER_VMA_LOCK
+ struct wait_queue_head vma_writer_wait;
int mm_lock_seq;
struct {
struct list_head head;
diff --git a/kernel/fork.c b/kernel/fork.c
index 95db6a521cf1..b221ad182d98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -461,9 +461,8 @@ static bool vma_init_lock(struct vm_area_struct *vma)
vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
if (!vma->vm_lock)
return false;
-
- init_rwsem(&vma->vm_lock->lock);
- vma->vm_lock_seq = -1;
+ atomic_set(&vma->vm_lock->count, 0);
+ vma->vm_lock->lock_seq = -1;
return true;
}
@@ -1229,6 +1228,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
#ifdef CONFIG_PER_VMA_LOCK
+ init_waitqueue_head(&mm->vma_writer_wait);
WRITE_ONCE(mm->mm_lock_seq, 0);
INIT_LIST_HEAD(&mm->vma_free_list.head);
spin_lock_init(&mm->vma_free_list.lock);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index b53d23c2d7a3..0088e31e5f7e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -38,6 +38,8 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
+ .vma_writer_wait =
+ __WAIT_QUEUE_HEAD_INITIALIZER(init_mm.vma_writer_wait),
.mm_lock_seq = 0,
.vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
.vma_free_list.lock = __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
--
2.39.0
From: Laurent Dufour <[email protected]>
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/powerpc/mm/fault.c | 41 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
3 files changed, 43 insertions(+)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..f92f8956d5f2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_pkey_error(is_write, is_exec,
+ (error_code & DSISR_KEYFAULT), vma))) {
+ int rc = bad_access_pkey(regs, address, vma);
+
+ vma_read_unlock(vma);
+ return rc;
+ }
+
+ if (unlikely(access_error(is_write, is_exec, vma))) {
+ int rc = bad_access(regs, address);
+
+ vma_read_unlock(vma);
+ return rc;
+ }
+
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_read_unlock(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ if (fault_signal_pending(fault, regs))
+ return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
mmap_read_unlock(current->mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..e036a04ff1ca 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config PARAVIRT
--
2.39.0
From: Liam Howlett <[email protected]>
When initially starting a search, the root node may already be in the
process of being replaced in RCU mode. Detect and restart the walk if
this is the case. This is necessary for RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ff9f04e0150d..a748938ad2e9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1359,11 +1359,15 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
mas->depth = 0;
mas->offset = 0;
+retry:
root = mas_root(mas);
/* Tree with nodes */
if (likely(xa_is_node(root))) {
mas->depth = 1;
mas->node = mte_safe_root(root);
+ if (mte_dead_node(mas->node))
+ goto retry;
+
return NULL;
}
--
2.39.0
Move the VMA flag modification (which now implies VMA locking) before
anon_vma_lock_write to match the locking order of the page fault handler.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index fa994ae903d9..53d885e70a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2953,13 +2953,13 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto unacct_fail;
+ set_vm_flags(vma, VM_SOFTDIRTY);
vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
if (vma->anon_vma) {
anon_vma_lock_write(vma->anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}
vma->vm_end = addr + len;
- set_vm_flags(vma, VM_SOFTDIRTY);
mas_store_prealloc(mas, vma);
if (vma->anon_vma) {
--
2.39.0
Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 2 ++
mm/nommu.c | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index da1908730828..be289e0b693b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -448,6 +448,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
*/
void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
{
+ vma_write_lock(vma);
trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
@@ -2300,6 +2301,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
static inline int munmap_sidetree(struct vm_area_struct *vma,
struct ma_state *mas_detach)
{
+ vma_write_lock(vma);
mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index b3154357ced5..7ae91337ef14 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -552,6 +552,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
{
+ vma_write_lock(vma);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
mas_store_prealloc(mas, NULL);
@@ -1551,6 +1552,10 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
cleanup_vma_from_mm(vma);
+ /*
+ * No need to lock VMA because this is the only mm user and no
+ * page fault handlers can race with it.
+ */
delete_vma(mm, vma);
cond_resched();
}
--
2.39.0
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with fault handlers that would hold VMA locks.
VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 30c7d1c5206e..a256deca0bc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
* of mm/rmap.c:
* - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
* hugetlb mapping);
+ * - all vmas marked locked
* - all i_mmap_rwsem locks;
* - all anon_vma->rwseml
*
@@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
+ vma_write_lock(vma);
if (vma->vm_file && vma->vm_file->f_mapping &&
is_vm_hugetlb_page(vma))
vm_lock_mapping(mm, vma->vm_file->f_mapping);
@@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+ vma_write_unlock_mm(mm);
mutex_unlock(&mm_all_locks_mutex);
}
--
2.39.0
From: "Liam R. Howlett" <[email protected]>
During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes. To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.
There were two places where the old method of marking the node as dead
without smp_wmb() was being used, which resulted in RCU readers seeing
the wrong pivot value before seeing that the node was dead. Fix this race
condition by using mte_set_node_dead(), which has the smp_wmb() call to
ensure the race is closed.
Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being
freed are marked as dead to ensure there are no other call paths besides
the two updated paths.
This is necessary for the RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 7 +++++--
tools/testing/radix-tree/maple.c | 16 ++++++++++++++++
2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d85291b19f86..8066fb1e8ec9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -179,7 +179,7 @@ static void mt_free_rcu(struct rcu_head *head)
*/
static void ma_free_rcu(struct maple_node *node)
{
- node->parent = ma_parent_ptr(node);
+ WARN_ON(node->parent != ma_parent_ptr(node));
call_rcu(&node->rcu, mt_free_rcu);
}
@@ -1775,8 +1775,10 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
rcu_assign_pointer(slots[offset], mas->node);
}
- if (!advanced)
+ if (!advanced) {
+ mte_set_node_dead(old_enode);
mas_free(mas, old_enode);
+ }
}
/*
@@ -4217,6 +4219,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
done:
mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
if (in_rcu) {
+ mte_set_node_dead(mas->node);
mas->node = mt_mk_node(newnode, wr_mas->type);
mas_replace(mas, false);
} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 81fa7ec2e66a..2539ad6c4777 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mn->slot[1] != NULL);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas.node = MAS_START;
mas_nomem(&mas, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i);
MT_BUG_ON(mt, !mn);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
MT_BUG_ON(mt, !mn);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas2); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
@@ -334,10 +341,12 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -375,6 +384,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);
@@ -382,10 +392,13 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);
}
@@ -35369,6 +35382,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
mas_destroy(&mas);
@@ -35386,6 +35400,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
@@ -35756,6 +35771,7 @@ void farmer_tests(void)
tree.ma_root = mt_mk_node(node, maple_leaf_64);
mt_dump(&tree);
+ node->parent = ma_parent_ptr(node);
ma_free_rcu(node);
/* Check things that will make lockdep angry */
--
2.39.0
From: Michel Lespinasse <[email protected]>
This prepares for page fault handling under the VMA lock, looking up VMAs
under the protection of an RCU read lock instead of the usual mmap read lock.
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 13 ++++++++++---
kernel/fork.c | 13 +++++++++++++
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b6bce73fbb4..d5cdec1314fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -535,9 +535,16 @@ struct anon_vma_name {
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
+ union {
+ struct {
+ /* VMA covers [vm_start; vm_end) addresses within mm */
+ unsigned long vm_start;
+ unsigned long vm_end;
+ };
+#ifdef CONFIG_PER_VMA_LOCK
+ struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+ };
struct mm_struct *vm_mm; /* The address space we belong to. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 58aab6c889a4..5986817f393c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,10 +479,23 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}
+#ifdef CONFIG_PER_VMA_LOCK
+static void __vm_area_free(struct rcu_head *head)
+{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+ kmem_cache_free(vm_area_cachep, vma);
+}
+#endif
+
void vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
+#ifdef CONFIG_PER_VMA_LOCK
+ call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
kmem_cache_free(vm_area_cachep, vma);
+#endif
}
static void account_kernel_stack(struct task_struct *tsk, int account)
--
2.39.0
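As a minimal sketch of what deferred freeing enables (not part of the patch
itself): once vm_area_free() defers the actual kmem_cache_free() past an RCU
grace period, a lookup done under rcu_read_lock() can keep dereferencing the
VMA even if a concurrent unmap has already freed it. find_vma_under_rcu() is
a hypothetical placeholder here; the real RCU-safe lookup helper is
introduced later in the series.

static unsigned long peek_vma_start(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	unsigned long start = 0;

	rcu_read_lock();
	vma = find_vma_under_rcu(mm, addr);	/* hypothetical RCU-safe lookup */
	if (vma)
		start = vma->vm_start;	/* safe: freeing is deferred past the grace period */
	rcu_read_unlock();

	return start;
}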
untrack_pfn can be called after the VMA has been isolated and mmap_lock
downgraded. An attempt to lock the affected VMA in that situation would
trigger an assertion, therefore use mod_vm_flags_nolock in such cases.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/mm/pat/memtype.c | 10 +++++++---
include/linux/mm.h | 2 +-
include/linux/pgtable.h | 5 +++--
mm/memory.c | 15 ++++++++-------
mm/memremap.c | 4 ++--
mm/mmap.c | 4 ++--
6 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 9e490a372896..f71c8381430b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1045,7 +1045,7 @@ void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
* can be for the entire vma (in which case pfn, size are zero).
*/
void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
- unsigned long size)
+ unsigned long size, bool lock_vma)
{
resource_size_t paddr;
unsigned long prot;
@@ -1064,8 +1064,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
size = vma->vm_end - vma->vm_start;
}
free_pfn_range(paddr, size);
- if (vma)
- clear_vm_flags(vma, VM_PAT);
+ if (vma) {
+ if (lock_vma)
+ clear_vm_flags(vma, VM_PAT);
+ else
+ mod_vm_flags_nolock(vma, 0, VM_PAT);
+ }
}
/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d436a5027cc..3158f33e268c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2135,7 +2135,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details);
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
- unsigned long end);
+ unsigned long end, bool lock_vma);
struct mmu_notifier_range;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..eaa831bd675d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1214,7 +1214,8 @@ static inline int track_pfn_copy(struct vm_area_struct *vma)
* can be for the entire vma (in which case pfn, size are zero).
*/
static inline void untrack_pfn(struct vm_area_struct *vma,
- unsigned long pfn, unsigned long size)
+ unsigned long pfn, unsigned long size,
+ bool lock_vma)
{
}
@@ -1232,7 +1233,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
pfn_t pfn);
extern int track_pfn_copy(struct vm_area_struct *vma);
extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
- unsigned long size);
+ unsigned long size, bool lock_vma);
extern void untrack_pfn_moved(struct vm_area_struct *vma);
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 12508f4d845a..5c7d5eaa60d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1610,7 +1610,7 @@ void unmap_page_range(struct mmu_gather *tlb,
static void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr,
- struct zap_details *details)
+ struct zap_details *details, bool lock_vma)
{
unsigned long start = max(vma->vm_start, start_addr);
unsigned long end;
@@ -1625,7 +1625,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
uprobe_munmap(vma, start, end);
if (unlikely(vma->vm_flags & VM_PFNMAP))
- untrack_pfn(vma, 0, 0);
+ untrack_pfn(vma, 0, 0, lock_vma);
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
@@ -1672,7 +1672,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
*/
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr)
+ unsigned long end_addr, bool lock_vma)
{
struct mmu_notifier_range range;
struct zap_details details = {
@@ -1686,7 +1686,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
do {
- unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+ lock_vma);
} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
}
@@ -1715,7 +1716,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
do {
- unmap_single_vma(&tlb, vma, start, range.end, NULL);
+ unmap_single_vma(&tlb, vma, start, range.end, NULL, false);
} while ((vma = mas_find(&mas, end - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
@@ -1750,7 +1751,7 @@ void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
* unmap 'address-end' not 'range.start-range.end' as range
* could have been expanded for hugetlb pmd sharing.
*/
- unmap_single_vma(&tlb, vma, address, end, details);
+ unmap_single_vma(&tlb, vma, address, end, details, false);
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
@@ -2519,7 +2520,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
if (err)
- untrack_pfn(vma, pfn, PAGE_ALIGN(size));
+ untrack_pfn(vma, pfn, PAGE_ALIGN(size), true);
return err;
}
EXPORT_SYMBOL(remap_pfn_range);
diff --git a/mm/memremap.c b/mm/memremap.c
index 08cbf54fe037..2f88f43d4a01 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -129,7 +129,7 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
}
mem_hotplug_done();
- untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+ untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
pgmap_array_delete(range);
}
@@ -276,7 +276,7 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
if (!is_private)
kasan_remove_zero_shadow(__va(range->start), range_len(range));
err_kasan:
- untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+ untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range), true);
err_pfn_remap:
pgmap_array_delete(range);
return error;
diff --git a/mm/mmap.c b/mm/mmap.c
index a256deca0bc0..332af383f7cd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2209,7 +2209,7 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mt, vma, start, end);
+ unmap_vmas(&tlb, mt, vma, start, end, lock_vma);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
next ? next->vm_start : USER_PGTABLES_CEILING,
lock_vma);
@@ -3127,7 +3127,7 @@ void exit_mmap(struct mm_struct *mm)
tlb_gather_mmu_fullmm(&tlb, mm);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
- unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
+ unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
mmap_read_unlock(mm);
/*
--
2.39.0
Introduce a per-VMA rw_semaphore to be used during page fault handling
instead of mmap_lock. Because there are cases when multiple VMAs need
to be exclusively locked during VMA tree modifications, instead of the
usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA
lock exclusively and setting vma->vm_lock_seq to the current
mm->mm_lock_seq. When the mmap_write_lock holder is done with all
modifications and drops mmap_lock, it will increment mm->mm_lock_seq,
effectively unlocking all VMAs marked as locked.
The VMA lock is placed on a cache line boundary so that its 'count' field
falls into the first cache line while the rest of the fields fall into
the second cache line. This lets the 'count' field be cached together with
other frequently accessed fields and used quickly in the uncontended case,
while 'owner' and the other fields used in the contended case will not
invalidate the first cache line while waiting on the lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 80 +++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 8 ++++
include/linux/mmap_lock.h | 13 +++++++
kernel/fork.c | 4 ++
mm/init-mm.c | 3 ++
5 files changed, 108 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..ec2c4c227d51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -612,6 +612,85 @@ struct vm_operations_struct {
unsigned long addr);
};
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->lock);
+ vma->vm_lock_seq = -1;
+}
+
+static inline void vma_write_lock(struct vm_area_struct *vma)
+{
+ int mm_lock_seq;
+
+ mmap_assert_write_locked(vma->vm_mm);
+
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+ if (vma->vm_lock_seq == mm_lock_seq)
+ return;
+
+ down_write(&vma->lock);
+ vma->vm_lock_seq = mm_lock_seq;
+ up_write(&vma->lock);
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+{
+ /* Check before locking. A race might cause false locked result. */
+ if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+ return false;
+
+ if (unlikely(down_read_trylock(&vma->lock) == 0))
+ return false;
+
+ /*
+ * Overflow might produce false locked result.
+ * False unlocked result is impossible because we modify and check
+ * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * modification invalidates all existing locks.
+ */
+ if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+ up_read(&vma->lock);
+ return false;
+ }
+ return true;
+}
+
+static inline void vma_read_unlock(struct vm_area_struct *vma)
+{
+ up_read(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_write_locked(vma->vm_mm);
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline void vma_write_lock(struct vm_area_struct *vma) {}
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+ { return false; }
+static inline void vma_read_unlock(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_init_lock(vma);
}
static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d5cdec1314fe..5f7c5ca89931 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -555,6 +555,11 @@ struct vm_area_struct {
pgprot_t vm_page_prot;
unsigned long vm_flags; /* Flags, see mm.h. */
+#ifdef CONFIG_PER_VMA_LOCK
+ int vm_lock_seq;
+ struct rw_semaphore lock;
+#endif
+
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
@@ -680,6 +685,9 @@ struct mm_struct {
* init_mm.mmlist, and are protected
* by mmlist_lock
*/
+#ifdef CONFIG_PER_VMA_LOCK
+ int mm_lock_seq;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e49ba91bb1f0..40facd4c398b 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_write_unlock_mm(struct mm_struct *mm)
+{
+ mmap_assert_write_locked(mm);
+ /* No races during update due to exclusive mmap_lock being held */
+ WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+}
+#else
+static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
+#endif
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
static inline void mmap_write_unlock(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, true);
+ vma_write_unlock_mm(mm);
up_write(&mm->mmap_lock);
}
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
+ vma_write_unlock_mm(mm);
downgrade_write(&mm->mmap_lock);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 5986817f393c..c026d75108b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
*/
*new = data_race(*orig);
INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_init_lock(new);
dup_anon_vma_name(orig, new);
}
return new;
@@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+ WRITE_ONCE(mm->mm_lock_seq, 0);
+#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c9327abb771c..33269314e060 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -37,6 +37,9 @@ struct mm_struct init_mm = {
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_PER_VMA_LOCK
+ .mm_lock_seq = 0,
+#endif
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
#ifdef CONFIG_IOMMU_SVA
--
2.39.0
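A usage sketch of the pattern described above (illustrative only;
update_two_vmas() is a made-up stand-in for a real mm update): a writer
marks every VMA it is about to modify, and all of those marks are dropped
in bulk when mmap_write_unlock() increments mm->mm_lock_seq.

static void update_two_vmas(struct mm_struct *mm, struct vm_area_struct *vma,
			    struct vm_area_struct *next)
{
	mmap_write_lock(mm);

	vma_write_lock(vma);	/* mark the VMAs before modifying them */
	vma_write_lock(next);

	/* ... modify vma and next under the exclusive mmap_lock ... */

	mmap_write_unlock(mm);	/* bumps mm_lock_seq: both VMAs become unlocked */
}

While the VMAs are marked, vma_read_trylock() on them fails because
vm_lock_seq matches mm_lock_seq; after mmap_write_unlock() bumps the
sequence number, read-locking succeeds again.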
Assert there are no holders of VMA lock for reading when it is about to be
destroyed.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 8 ++++++++
kernel/fork.c | 2 ++
2 files changed, 10 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 594e835bad9c..c464fc8a514c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma)
+{
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+ vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+ vma);
+}
+
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -688,6 +695,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
{ return false; }
static inline void vma_read_unlock(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
#endif /* CONFIG_PER_VMA_LOCK */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1591dd8a0745..6d9f14e55ecf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -485,6 +485,8 @@ static void __vm_area_free(struct rcu_head *head)
{
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
+ /* The vma should either have no lock holders or be write-locked. */
+ vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
}
#endif
--
2.39.0
mmap_assert_write_locked() will be used in the next patch to ensure that
the vma write lock is taken only while mmap_lock is held in write mode.
Because mmap_assert_write_locked() uses dump_mm() and there are cases when
the vma write lock is taken from inside a module, it is necessary to export
the dump_mm() function.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/debug.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..b6e9e53469d1 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
mm->def_flags, &mm->def_flags
);
}
+EXPORT_SYMBOL(dump_mm);
static bool page_init_poisoning __read_mostly = true;
--
2.39.0
In cases when VMA flags are modified after the VMA has been isolated and
mmap_lock has been downgraded, the flag modifications do not require
per-VMA locking, and an attempt to lock the VMA would trigger an assertion
because the mmap write lock is not held.
Introduce mod_vm_flags_nolock to be used in such situations.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e3be1d45371..7d436a5027cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -743,6 +743,14 @@ void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
vma->vm_flags &= ~flags;
}
+static inline
+void mod_vm_flags_nolock(struct vm_area_struct *vma,
+ unsigned long set, unsigned long clear)
+{
+ vma->vm_flags |= set;
+ vma->vm_flags &= ~clear;
+}
+
static inline
void mod_vm_flags(struct vm_area_struct *vma,
unsigned long set, unsigned long clear)
--
2.39.0
call_rcu() can take a long time when callback offloading is enabled.
Its use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 19 +++++++++--
kernel/fork.c | 68 +++++++++++++++++++++++++++++++++++-----
mm/init-mm.c | 3 ++
mm/mmap.c | 1 +
5 files changed, 82 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@ struct vm_area_struct {
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units */
- struct file * vm_file; /* File we map to (can be NULL). */
- void * vm_private_data; /* was vm_pte (shared mem) */
+ union {
+ struct {
+ /* File we map to (can be NULL). */
+ struct file *vm_file;
+
+ /* was vm_pte (shared mem) */
+ void *vm_private_data;
+ };
+#ifdef CONFIG_PER_VMA_LOCK
+ struct list_head vm_free_list;
+#endif
+ };
#ifdef CONFIG_ANON_VMA_NAME
/*
@@ -693,6 +703,11 @@ struct mm_struct {
*/
#ifdef CONFIG_PER_VMA_LOCK
int mm_lock_seq;
+ struct {
+ struct list_head head;
+ spinlock_t lock;
+ int size;
+ } vma_free_list;
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
}
#ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
{
- struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
/* The vma should either have no lock holders or be write-locked. */
vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
}
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+ struct vm_area_struct *first_vma;
+ struct vm_area_struct *vma, *vma2;
+
+ first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+ list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+ __vm_area_free(vma);
+ __vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+ struct vm_area_struct *first_vma;
+ LIST_HEAD(to_destroy);
+
+ spin_lock(&mm->vma_free_list.lock);
+ list_splice_init(&mm->vma_free_list.head, &to_destroy);
+ mm->vma_free_list.size = 0;
+ spin_unlock(&mm->vma_free_list.lock);
+
+ if (list_empty(&to_destroy))
+ return;
+
+ first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+ /* Remove the head which is allocated on the stack */
+ list_del(&to_destroy);
+
+ call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX 32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ bool drain;
+
+ free_anon_vma_name(vma);
+
+ spin_lock(&mm->vma_free_list.lock);
+ list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+ mm->vma_free_list.size++;
+ drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+ spin_unlock(&mm->vma_free_list.lock);
+
+ if (drain)
+ drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
void vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
- call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
kmem_cache_free(vm_area_cachep, vma);
-#endif
}
+#endif /* CONFIG_PER_VMA_LOCK */
+
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1150,6 +1199,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
INIT_LIST_HEAD(&mm->mmlist);
#ifdef CONFIG_PER_VMA_LOCK
WRITE_ONCE(mm->mm_lock_seq, 0);
+ INIT_LIST_HEAD(&mm->vma_free_list.head);
+ spin_lock_init(&mm->vma_free_list.lock);
+ mm->vma_free_list.size = 0;
#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 33269314e060..b53d23c2d7a3 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -39,6 +39,9 @@ struct mm_struct init_mm = {
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
.mm_lock_seq = 0,
+ .vma_free_list.head = LIST_HEAD_INIT(init_mm.vma_free_list.head),
+ .vma_free_list.lock = __SPIN_LOCK_UNLOCKED(init_mm.vma_free_list.lock),
+ .vma_free_list.size = 0,
#endif
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
diff --git a/mm/mmap.c b/mm/mmap.c
index 332af383f7cd..a0d5d3af1d95 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
trace_exit_mmap(mm);
__mt_destroy(&mm->mm_mt);
mmap_write_unlock(mm);
+ drain_free_vmas(mm);
vm_unacct_memory(nr_accounted);
}
--
2.39.0
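A rough illustration of the batching effect, assuming the frees arrive
back-to-back on a single mm: with VM_AREA_FREE_LIST_MAX set to 32, a drain
fires once the list exceeds 32 entries, so tearing down N VMAs issues on
the order of N/33 call_rcu() invocations instead of N, plus the final
drain_free_vmas() in exit_mmap() for any remainder.

  N = 1000 VMAs freed in a row:
    before the patch: 1000 call_rcu() calls
    after the patch:  floor(1000 / 33) = 30 batched call_rcu() calls
                      (+ one final drain_free_vmas() at exit)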
When vma->anon_vma is not set, the page fault handler will set it either by
reusing the anon_vma of an adjacent VMA if the VMAs are compatible or by
allocating a new one. find_mergeable_anon_vma() walks the VMA tree to find
a compatible adjacent VMA, and that requires not only the faulting VMA to
be stable but also the tree structure and the other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index a658e26d965d..2560524ad7f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5264,6 +5264,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_is_anonymous(vma))
goto inval;
+ /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+ if (!vma->anon_vma)
+ goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
--
2.39.0
Protect VMA from concurrent page fault handler while collapsing a huge
page. Page fault handler needs a stable PMD to use PTL and relies on
per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/khugepaged.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5376246a3052..d8d0647f0c2c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1032,6 +1032,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
+ vma_write_lock(vma);
anon_vma_lock_write(vma->anon_vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1503,6 +1504,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /* Lock the vma before taking i_mmap and page table locks */
+ vma_write_lock(vma);
+
/*
* We need to lock the mapping so that from here on, only GUP-fast and
* hardware page walks can access the parts of the page tables that
@@ -1690,6 +1694,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+ vma_write_lock(vma);
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
--
2.39.0
Write-lock the source VMA before copying it and write-lock the new VMA
when copy_vma produces one.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/mmap.c | 1 +
mm/mremap.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index ff02cb51e7e7..da1908730828 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3261,6 +3261,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+ vma_write_lock(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2ccdd1561f5b..d24a79bcb1a1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -622,6 +622,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;
}
+ vma_write_lock(vma);
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
&need_rmap_locks);
--
2.39.0
Decisions about whether VMAs can be merged, split or expanded must be
made while the VMAs are protected from changes which can affect that
decision. For example, vma_merge uses vma->anon_vma in its decision
whether the VMA can be merged, while the page fault handler changes
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before deciding how such operations should be performed.
Not sure if expansion really needs this, just being paranoid. Otherwise
mmap_region and vm_brk_flags might end up not locking the VMAs they expand.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 53d885e70a54..f6ca4a87f9e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -254,8 +254,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
*/
mas_set(&mas, oldbrk);
next = mas_find(&mas, newbrk - 1 + PAGE_SIZE + stack_guard_gap);
- if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
- goto out;
+ if (next) {
+ vma_write_lock(next);
+ if (newbrk + PAGE_SIZE > vm_start_gap(next))
+ goto out;
+ }
brkvma = mas_prev(&mas, mm->start_brk);
/* Ok, looks good - let it rip. */
@@ -1017,10 +1020,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;
+ if (prev)
+ vma_write_lock(prev);
next = find_vma(mm, prev ? prev->vm_end : 0);
mid = next;
- if (next && next->vm_end == end) /* cases 6, 7, 8 */
+ if (next)
+ vma_write_lock(next);
+ if (next && next->vm_end == end) { /* cases 6, 7, 8 */
next = find_vma(mm, next->vm_end);
+ if (next)
+ vma_write_lock(next);
+ }
/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2198,6 +2208,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
int err;
validate_mm_mt(mm);
+ vma_write_lock(vma);
if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
if (err)
@@ -2564,6 +2575,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Attempt to expand an old mapping */
/* Check next */
+ if (next)
+ vma_write_lock(next);
if (next && next->vm_start == end && !vma_policy(next) &&
can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
NULL_VM_UFFD_CTX, NULL)) {
@@ -2573,6 +2586,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}
/* Check prev */
+ if (prev)
+ vma_write_lock(prev);
if (prev && prev->vm_end == addr && !vma_policy(prev) &&
(vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
pgoff, vma->vm_userfaultfd_ctx, NULL) :
@@ -2942,6 +2957,8 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
+ if (vma)
+ vma_write_lock(vma);
/*
* Expand the existing vma if possible; Note that singular lists do not
* occur after forking, so the expand will only happen on new VMAs.
--
2.39.0
vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 1e2154137631..ff02cb51e7e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -544,6 +544,7 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto nomem;
+ vma_write_lock(vma);
vma_adjust_trans_huge(vma, start, end, 0);
if (file) {
@@ -590,6 +591,7 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
}
if (remove_next) {
+ vma_write_lock(next);
if (file) {
uprobe_munmap(next, next->vm_start, next->vm_end);
fput(file);
--
2.39.0
From: "Liam R. Howlett" <[email protected]>
Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.
This is necessary for RCU mode.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8066fb1e8ec9..80ca28b656d3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -535,9 +535,11 @@ static inline struct maple_node *mte_parent(const struct maple_enode *enode)
*/
static inline bool ma_dead_node(const struct maple_node *node)
{
- struct maple_node *parent = (void *)((unsigned long)
- node->parent & ~MAPLE_NODE_MASK);
+ struct maple_node *parent;
+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
+ parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
return (parent == node);
}
@@ -552,6 +554,8 @@ static inline bool mte_dead_node(const struct maple_enode *enode)
struct maple_node *parent, *node;
node = mte_to_node(enode);
+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
parent = mte_parent(enode);
return (parent == node);
}
--
2.39.0
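A generic sketch of the ordering this relies on, using made-up names rather
than the maple tree internals: the reader-side smp_rmb() pairs with a
writer-side smp_wmb() so that data read from a node before the dead-node
check can be trusted whenever the check finds the node still alive.

struct example_node {
	int data;
	bool dead;
};

/* Writer side: retire the node before recycling its memory. */
static void retire_node(struct example_node *n)
{
	WRITE_ONCE(n->dead, true);
	smp_wmb();		/* make the marker visible before any reuse */
	WRITE_ONCE(n->data, 0);	/* reuse may now overwrite the contents */
}

/* Reader side: data read before the check is trusted only if still alive. */
static bool read_node(struct example_node *n, int *out)
{
	int val = READ_ONCE(n->data);

	smp_rmb();		/* order the data read before the dead check */
	if (READ_ONCE(n->dead))
		return false;	/* raced with retire_node(): caller retries */

	*out = val;
	return true;
}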
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Peter Xu <[email protected]>
---
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 20806bc8b4eb..12508f4d845a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma->anon_vma)
goto inval;
+ /*
+ * Due to the possibility of userfault handler dropping mmap_lock, avoid
+ * it for now and fall back to page fault handling under mmap_lock.
+ */
+ if (userfaultfd_armed(vma))
+ goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
--
2.39.0
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3647f7bdb110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..983266e7c49b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/mm.h> /* find_and_lock_vma() */
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1354,6 +1355,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_error(error_code, vma))) {
+ vma_read_unlock(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_read_unlock(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ kernelmode_fixup_or_oops(regs, error_code, address,
+ SIGBUS, BUS_ADRERR,
+ ARCH_DEFAULT_PKEY);
+ return;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1454,6 +1487,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;
--
2.39.0
Normally free_pgtables needs to lock the affected VMAs, except in the case
when the VMAs were already isolated under the VMA write-lock. munmap()
does just that, isolating them while holding the appropriate locks and then
downgrading mmap_lock and dropping the per-VMA locks before freeing the
page tables.
Add a parameter to free_pgtables and unmap_region for such a scenario.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/internal.h | 2 +-
mm/memory.c | 6 +++++-
mm/mmap.c | 18 ++++++++++++------
3 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..5ea4ff1a70e7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -87,7 +87,7 @@ void folio_activate(struct folio *folio);
void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long floor,
- unsigned long ceiling);
+ unsigned long ceiling, bool lock_vma);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index 2fabf89b2be9..9ece18548db1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,
void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long floor,
- unsigned long ceiling)
+ unsigned long ceiling, bool lock_vma)
{
MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
+ if (lock_vma)
+ vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(&mas, ceiling - 1);
+ if (lock_vma)
+ vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index be289e0b693b..0d767ce043af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -78,7 +78,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next, unsigned long start,
- unsigned long end);
+ unsigned long end, bool lock_vma);
static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
@@ -2202,7 +2202,7 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end, bool lock_vma)
{
struct mmu_gather tlb;
@@ -2211,7 +2211,8 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
update_hiwater_rss(mm);
unmap_vmas(&tlb, mt, vma, start, end);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ lock_vma);
tlb_finish_mmu(&tlb);
}
@@ -2468,7 +2469,11 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
mmap_write_downgrade(mm);
}
- unmap_region(mm, &mt_detach, vma, prev, next, start, end);
+ /*
+ * We can free page tables without locking the vmas because they were
+ * isolated before we downgraded mmap_lock and dropped per-vma locks.
+ */
+ unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
/* Statistics and freeing VMAs */
mas_set(&mas_detach, start);
remove_mt(mm, &mas_detach);
@@ -2785,7 +2790,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_file = NULL;
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);
+ unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end,
+ true);
if (file && (vm_flags & VM_SHARED))
mapping_unmap_writable(file->f_mapping);
free_vma:
@@ -3130,7 +3136,7 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
- USER_PGTABLES_CEILING);
+ USER_PGTABLES_CEILING, true);
tlb_finish_mmu(&tlb);
/*
--
2.39.0
vma->lock being part of vm_area_struct causes a performance regression
during page faults because, under contention, its count and owner fields
are constantly updated, and having other parts of vm_area_struct that are
used during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce a performance regression, especially on NUMA
machines. The smallest regression was achieved when the lock was placed
in the fourth cache line, but that bloats vm_area_struct to 256 bytes.
Considering the performance and memory impact, a separate lock looks like
the best option. It increases the memory footprint of each VMA, but that
will be addressed in the next patch.
Note that after this change vma_init() does not allocate or initialize
vma->lock anymore. A number of drivers allocate a pseudo VMA on the stack,
but they never use the VMA's lock, therefore it does not need to be
allocated. Future drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate it.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 25 ++++++------
include/linux/mm_types.h | 6 ++-
kernel/fork.c | 82 ++++++++++++++++++++++++++++------------
3 files changed, 74 insertions(+), 39 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50c7a6dd9c7a..d40bf8a5e19e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,11 +615,6 @@ struct vm_operations_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
- init_rwsem(&vma->lock);
- vma->vm_lock_seq = -1;
-}
static inline void vma_write_lock(struct vm_area_struct *vma)
{
@@ -635,9 +630,9 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
if (vma->vm_lock_seq == mm_lock_seq)
return;
- down_write(&vma->lock);
+ down_write(&vma->vm_lock->lock);
vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->lock);
+ up_write(&vma->vm_lock->lock);
}
/*
@@ -651,17 +646,17 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
- if (unlikely(down_read_trylock(&vma->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
return false;
/*
* Overflow might produce false locked result.
* False unlocked result is impossible because we modify and check
- * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*/
if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
return false;
}
return true;
@@ -669,7 +664,7 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
static inline void vma_read_unlock(struct vm_area_struct *vma)
{
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -684,7 +679,7 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_assert_no_reader(struct vm_area_struct *vma)
{
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
vma);
}
@@ -694,7 +689,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
#else /* CONFIG_PER_VMA_LOCK */
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
static inline void vma_write_lock(struct vm_area_struct *vma) {}
static inline bool vma_read_trylock(struct vm_area_struct *vma)
{ return false; }
@@ -704,6 +698,10 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
#endif /* CONFIG_PER_VMA_LOCK */
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -712,7 +710,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
- vma_init_lock(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c0e6c8e4700b..faa61b400f9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@ struct anon_vma_name {
char name[];
};
+struct vma_lock {
+ struct rw_semaphore lock;
+};
+
/*
* This struct describes a virtual memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -563,7 +567,7 @@ struct vm_area_struct {
#ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
- struct rw_semaphore lock;
+ struct vma_lock *vm_lock;
#endif
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 97f2b751f88d..95db6a521cf1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,40 +451,28 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
-struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
-{
- struct vm_area_struct *vma;
+#ifdef CONFIG_PER_VMA_LOCK
- vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (vma)
- vma_init(vma, mm);
- return vma;
-}
+/* SLAB cache for vm_area_struct.lock */
+static struct kmem_cache *vma_lock_cachep;
-struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+static bool vma_init_lock(struct vm_area_struct *vma)
{
- struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+ if (!vma->vm_lock)
+ return false;
- if (new) {
- ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
- ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- /*
- * orig->shared.rb may be modified concurrently, but the clone
- * will be reinitialized.
- */
- *new = data_race(*orig);
- INIT_LIST_HEAD(&new->anon_vma_chain);
- vma_init_lock(new);
- dup_anon_vma_name(orig, new);
- }
- return new;
+ init_rwsem(&vma->vm_lock->lock);
+ vma->vm_lock_seq = -1;
+
+ return true;
}
-#ifdef CONFIG_PER_VMA_LOCK
static inline void __vm_area_free(struct vm_area_struct *vma)
{
/* The vma should either have no lock holders or be write-locked. */
vma_assert_no_reader(vma);
+ kmem_cache_free(vma_lock_cachep, vma->vm_lock);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -540,6 +528,7 @@ void vm_area_free(struct vm_area_struct *vma)
#else /* CONFIG_PER_VMA_LOCK */
+static bool vma_init_lock(struct vm_area_struct *vma) { return true; }
void drain_free_vmas(struct mm_struct *mm) {}
void vm_area_free(struct vm_area_struct *vma)
@@ -550,6 +539,48 @@ void vm_area_free(struct vm_area_struct *vma)
#endif /* CONFIG_PER_VMA_LOCK */
+struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+
+ vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+ if (!vma_init_lock(vma)) {
+ kmem_cache_free(vm_area_cachep, vma);
+ return NULL;
+ }
+
+ return vma;
+}
+
+struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
+{
+ struct vm_area_struct *new;
+
+ new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ /*
+ * orig->shared.rb may be modified concurrently, but the clone
+ * will be reinitialized.
+ */
+ *new = data_race(*orig);
+ if (!vma_init_lock(new)) {
+ kmem_cache_free(vm_area_cachep, new);
+ return NULL;
+ }
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ dup_anon_vma_name(orig, new);
+
+ return new;
+}
+
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -3138,6 +3169,9 @@ void __init proc_caches_init(void)
NULL);
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+ vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
mmap_init();
nsproxy_cache_init();
}
--
2.39.0
On 1/9/23 21:53, Suren Baghdasaryan wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
> 1. Readers never wait. They try to take the vma_lock and fall back to
> mmap_lock if that fails.
> 2. Only one writer at a time will ever try to write-lock a vma_lock
> because writers first take mmap_lock in write mode.
> Because of these requirements, full rw_semaphore functionality is not
> needed and we can replace rw_semaphore with an atomic variable.
> When a reader takes read lock, it increments the atomic unless the
> value is negative. If that fails read-locking is aborted and mmap_lock
> is used instead.
> When writer takes write lock, it resets atomic value to -1 if the
> current value is 0 (no readers). Since all writers take mmap_lock in
> write mode first, there can be only one writer at a time. If there
> are readers, writer will place itself into a wait queue using new
> mm_struct.vma_writer_wait waitqueue head. The last reader to release
> the vma_lock will signal the writer to wake up.
> vm_lock_seq is also moved into vma_lock and along with atomic_t they
> are nicely packed and consume 8 bytes, bringing the overhead from
> vma_lock from 44 to 16 bytes:
>
> slabinfo before the changes:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> vm_area_struct ... 152 53 2 : ...
>
> slabinfo with vma_lock:
> <name> ... <objsize> <objperslab> <pagesperslab> : ...
> rw_semaphore ... 8 512 1 : ...
I guess the cache is called vma_lock, not rw_semaphore?
> vm_area_struct ... 160 51 2 : ...
>
> Assuming 40000 vm_area_structs, memory consumption would be:
> baseline: 6040kB
> vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> Total increase: 556kB
>
> atomic_t might overflow if there are many competing readers, therefore
> vma_read_trylock() implements an overflow check and if that occurs it
> restores the previous value and exits with a failure to lock.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
This patch is indeed an interesting addition, but I can't help but
think it obsoletes the previous one :) We allocate an extra 8 bytes slab
object for the lock, and the pointer to it is also 8 bytes, and requires an
indirection. The vma_lock cache is not cacheline aligned (otherwise it would
be a major waste), so we have potential false sharing with up to 7 other
vma_lock's.
I'd expect if the vma_lock was placed with the relatively cold fields of
vm_area_struct, it shouldn't cause much cache ping pong when working with
that vma. Even if we don't cache align the vma to save memory (would be 192
bytes instead of 160 when aligned) and place the vma_lock and the cold
fields at the end of the vma, it may be false sharing the cacheline with the
next vma in the slab. But that's a single vma, not up to 7, so it shouldn't
be worse?
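For reference, the memory figures quoted above follow directly from the
slab packing shown in the slabinfo snippets, assuming 4kB pages and whole
slabs being charged (2 pages per vm_area_struct slab, 1 page per vma_lock
slab):

  baseline:   ceil(40000 / 53)  = 755 slabs * 8kB = 6040kB
  with lock:  ceil(40000 / 51)  = 785 slabs * 8kB = 6280kB
  vma_lock:   ceil(40000 / 512) =  79 slabs * 4kB =  316kB
  total:      6280kB + 316kB = 6596kB, an increase of 556kB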
On Tue, Jan 10, 2023 at 12:04 AM Vlastimil Babka <[email protected]> wrote:
>
> On 1/9/23 21:53, Suren Baghdasaryan wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> > 1. Readers never wait. They try to take the vma_lock and fall back to
> > mmap_lock if that fails.
> > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > because writers first take mmap_lock in write mode.
> > Because of these requirements, full rw_semaphore functionality is not
> > needed and we can replace rw_semaphore with an atomic variable.
> > When a reader takes read lock, it increments the atomic unless the
> > value is negative. If that fails read-locking is aborted and mmap_lock
> > is used instead.
> > When writer takes write lock, it resets atomic value to -1 if the
> > current value is 0 (no readers). Since all writers take mmap_lock in
> > write mode first, there can be only one writer at a time. If there
> > are readers, writer will place itself into a wait queue using new
> > mm_struct.vma_writer_wait waitqueue head. The last reader to release
> > the vma_lock will signal the writer to wake up.
> > vm_lock_seq is also moved into vma_lock and along with atomic_t they
> > are nicely packed and consume 8 bytes, bringing the overhead from
> > vma_lock from 44 to 16 bytes:
> >
> > slabinfo before the changes:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > vm_area_struct ... 152 53 2 : ...
> >
> > slabinfo with vma_lock:
> > <name> ... <objsize> <objperslab> <pagesperslab> : ...
> > rw_semaphore ... 8 512 1 : ...
>
> I guess the cache is called vma_lock, not rw_semaphore?
Yes, sorry. Copy/paste error when combining the results. The numbers
though look correct, so I did not screw up that part :)
>
> > vm_area_struct ... 160 51 2 : ...
> >
> > Assuming 40000 vm_area_structs, memory consumption would be:
> > baseline: 6040kB
> > vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
> > Total increase: 556kB
> >
> > atomic_t might overflow if there are many competing readers, therefore
> > vma_read_trylock() implements an overflow check and if that occurs it
> > restores the previous value and exits with a failure to lock.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> This patch is indeed an interesting addition, but I can't help but
> think it obsoletes the previous one :) We allocate an extra 8 bytes slab
> object for the lock, and the pointer to it is also 8 bytes, and requires an
> indirection. The vma_lock cache is not cacheline aligned (otherwise it would
> be a major waste), so we have potential false sharing with up to 7 other
> vma_lock's.
True, I thought long and hard about combining the last two patches but
decided to keep them separate to document the intent. The previous
patch splits the lock for performance reasons and this one is focused
on memory consumption. I'm open to changing this if it's confusing.
> I'd expect if the vma_lock was placed with the relatively cold fields of
> vm_area_struct, it shouldn't cause much cache ping pong when working with
> that vma. Even if we don't cache align the vma to save memory (would be 192
> bytes instead of 160 when aligned) and place the vma_lock and the cold
> fields at the end of the vma, it may be false sharing the cacheline with the
> next vma in the slab.
I would love to combine the vma_lock with vm_area_struct and I spent
several days trying different combinations to achieve decent
performance. My best achieved result was when I placed the vm_lock
into the third cache line at offset 192 and allocated vm_area_structs
from cache-aligned slab (horrible memory waste with each vma consuming
256 bytes). Even then I see regression in pft-threads test on a NUMA
machine (where cache-bouncing problem is most pronounced):
This is the result with split vma locks (current version). The higher
the number the better:
BASE PVL
Hmean faults/sec-1 469201.7282 ( 0.00%) 464453.3976 * -1.01%*
Hmean faults/sec-4 1754465.6221 ( 0.00%) 1660688.0452 * -5.35%*
Hmean faults/sec-7 2808141.6711 ( 0.00%) 2688910.6458 * -4.25%*
Hmean faults/sec-12 3750307.7553 ( 0.00%) 3863490.2057 * 3.02%*
Hmean faults/sec-21 4145672.4677 ( 0.00%) 3904532.7241 * -5.82%*
Hmean faults/sec-30 3775722.5726 ( 0.00%) 3923225.3734 * 3.91%*
Hmean faults/sec-48 4152563.5864 ( 0.00%) 4783720.6811 * 15.20%*
Hmean faults/sec-56 4163868.7111 ( 0.00%) 4851473.7241 * 16.51%*
Here are results with the vma locks integrated into cache-aligned
vm_area_struct:
BASE PVM_MERGED
Hmean faults/sec-1 469201.7282 ( 0.00%) 465268.1274 * -0.84%*
Hmean faults/sec-4 1754465.6221 ( 0.00%) 1658538.0217 * -5.47%*
Hmean faults/sec-7 2808141.6711 ( 0.00%) 2645016.1598 * -5.81%*
Hmean faults/sec-12 3750307.7553 ( 0.00%) 3664676.6956 * -2.28%*
Hmean faults/sec-21 4145672.4677 ( 0.00%) 3722203.7950 * -10.21%*
Hmean faults/sec-30 3775722.5726 ( 0.00%) 3821025.6963 * 1.20%*
Hmean faults/sec-48 4152563.5864 ( 0.00%) 4561016.1604 * 9.84%*
Hmean faults/sec-56 4163868.7111 ( 0.00%) 4528123.3737 * 8.75%*
These two compare with the same baseline test results, I just
separated the result into two to have readable email formatting.
It's also hard to find 56 bytes worth of fields in vm_area_struct
which are not used during page faults. So, in the end I decided to
keep vma_locks separate to preserve performance. If you have an idea
on how we can combine vm_area_struct fields in a better way, I would
love to try it out.
> But that's a single vma, not up to 7, so it shouldn't be worse?
Yes, I expected that too but mmtests show only a very small improvement
when I cache-align the vma_locks slab. My spf_test does show about 10%
regression due to vma_lock cache-line bouncing, however considering
that it also shows 90% improvement over baseline, losing 10% of that
improvement to save 56 bytes per vma sounds like a good deal.
I think the lack of considerable regression here is due to vma_locks
being used only 2 times in the pagefault path - when we take it and
when we release it, while vm_area_struct fields are used much more
heavily. So, invalidating the vma_lock cache line does not hit us as hard
as invalidating a part of vm_area_struct.
Looking forward to suggestions and thanks for the review, Vlastimil!
On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d40bf8a5e19e..294dd44b2198 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> * mm->mm_lock_seq can't be concurrently modified.
> */
> mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> - if (vma->vm_lock_seq == mm_lock_seq)
> + if (vma->vm_lock->lock_seq == mm_lock_seq)
> return;
>
> - down_write(&vma->vm_lock->lock);
> - vma->vm_lock_seq = mm_lock_seq;
> - up_write(&vma->vm_lock->lock);
> + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> + wait_event(vma->vm_mm->vma_writer_wait,
> + atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> + vma->vm_lock->lock_seq = mm_lock_seq;
> + /* Write barrier to ensure lock_seq change is visible before count */
> + smp_wmb();
> + atomic_set(&vma->vm_lock->count, 0);
> }
>
> /*
> @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> static inline bool vma_read_trylock(struct vm_area_struct *vma)
> {
> /* Check before locking. A race might cause false locked result. */
> - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> return false;
>
> - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> return false;
>
> + /* If atomic_t overflows, restore and fail to lock. */
> + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> + if (atomic_dec_and_test(&vma->vm_lock->count))
> + wake_up(&vma->vm_mm->vma_writer_wait);
> + return false;
> + }
> +
> /*
> * Overflow might produce false locked result.
> * False unlocked result is impossible because we modify and check
> * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> */
> - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> - up_read(&vma->vm_lock->lock);
> + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> + if (atomic_dec_and_test(&vma->vm_lock->count))
> + wake_up(&vma->vm_mm->vma_writer_wait);
> return false;
> }
With this change readers can cause writers to starve.
What about checking waitqueue_active() before or after increasing
vma->vm_lock->count?
--
Thanks,
Hyeonggon
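For illustration, the check being suggested might look roughly like this
(a sketch on top of the patch above, not a tested change; it reuses the
vma_writer_wait waitqueue and vm_lock fields introduced there):

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	/* Check before locking. A race might cause false locked result. */
	if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	/*
	 * A writer is already parked on the waitqueue: back off so readers
	 * cannot keep vm_lock->count elevated forever and starve it.
	 */
	if (waitqueue_active(&vma->vm_mm->vma_writer_wait))
		return false;

	if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
		return false;

	/* ... overflow and lock_seq re-checks as in the patch above ... */
	return true;
}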
On Mon, Jan 16, 2023 at 3:15 AM Hyeonggon Yoo <[email protected]> wrote:
>
> On Mon, Jan 09, 2023 at 12:53:36PM -0800, Suren Baghdasaryan wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index d40bf8a5e19e..294dd44b2198 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > * mm->mm_lock_seq can't be concurrently modified.
> > */
> > mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > - if (vma->vm_lock_seq == mm_lock_seq)
> > + if (vma->vm_lock->lock_seq == mm_lock_seq)
> > return;
> >
> > - down_write(&vma->vm_lock->lock);
> > - vma->vm_lock_seq = mm_lock_seq;
> > - up_write(&vma->vm_lock->lock);
> > + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > + wait_event(vma->vm_mm->vma_writer_wait,
> > + atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > + vma->vm_lock->lock_seq = mm_lock_seq;
> > + /* Write barrier to ensure lock_seq change is visible before count */
> > + smp_wmb();
> > + atomic_set(&vma->vm_lock->count, 0);
> > }
> >
> > /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> > /* Check before locking. A race might cause false locked result. */
> > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > return false;
> >
> > + /* If atomic_t overflows, restore and fail to lock. */
> > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > + return false;
> > + }
> > +
> > /*
> > * Overflow might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > */
> > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock->lock);
> > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > return false;
> > }
>
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?
The readers are in the page fault path, which is the fast path, while
writers performing updates are in the slow path. Therefore I *think*
starving writers should not be a big issue. So far I haven't seen issues
with that in benchmarks, but maybe there is such a case?
>
> --
> Thanks,
> Hyeonggon
On Mon, Jan 16, 2023 at 3:08 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <[email protected]> wrote:
> >
> > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <[email protected]>
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > * mm->mm_lock_seq can't be concurrently modified.
> > > */
> > > mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > > - if (vma->vm_lock_seq == mm_lock_seq)
> > > + if (vma->vm_lock->lock_seq == mm_lock_seq)
> > > return;
> >
> > lock acquire for write to info lockdep.
>
> Thanks for the review Hillf!
>
> Good idea. Will add in the next version.
>
> > >
> > > - down_write(&vma->vm_lock->lock);
> > > - vma->vm_lock_seq = mm_lock_seq;
> > > - up_write(&vma->vm_lock->lock);
> > > + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > > + wait_event(vma->vm_mm->vma_writer_wait,
> > > + atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > > + vma->vm_lock->lock_seq = mm_lock_seq;
> > > + /* Write barrier to ensure lock_seq change is visible before count */
> > > + smp_wmb();
> > > + atomic_set(&vma->vm_lock->count, 0);
> > > }
> > >
> > > /*
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > {
> > > /* Check before locking. A race might cause false locked result. */
> > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > return false;
> >
> > Add mb to pair with the above wmb like
>
> The wmb above is to ensure the ordering between updates of lock_seq
> and vm_lock->count (lock_seq is updated first and vm_lock->count only
> after that). The first access to vm_lock->count in this function is
> atomic_inc_unless_negative() and it's an atomic RMW operation with a
> return value. According to documentation such functions are fully
> ordered, therefore I think we already have an implicit full memory
> barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> >
> > if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> > smp_acquire__after_ctrl_dep();
> > return false;
> > }
> > >
> > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > return false;
> > >
> > > + /* If atomic_t overflows, restore and fail to lock. */
> > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > + return false;
> > > + }
> > > +
> > > /*
> > > * Overflow might produce false locked result.
> > > * False unlocked result is impossible because we modify and check
> > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > * modification invalidates all existing locks.
> > > */
> > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > - up_read(&vma->vm_lock->lock);
> > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > return false;
> > > }
> >
> > Simpler way to detect write lock owner and count overflow like
> >
> > int count = atomic_read(&vma->vm_lock->count);
> > for (;;) {
> > int new = count + 1;
> >
> > if (count < 0 || new < 0)
> > return false;
> >
> > new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > if (new == count)
> > break;
> > count = new;
> > cpu_relax();
> > }
> >
> > (wake up waiting readers after taking the lock;
> > but the write lock owner before this read trylock should be
> > responsible for waking waiters up.)
> >
> > lock acquire for read.
>
> This schema might cause readers to wait, which is not an exact
> replacement for down_read_trylock(). The requirement to wake up
> waiting readers also complicates things and since we can always fall
> back to mmap_lock, that complication is unnecessary IMHO. I could use
> part of your suggestion like this:
>
> int new = count + 1;
>
> if (count < 0 || new < 0)
> return false;
>
> new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> if (new == count)
> return false;
Made a mistake above. It should have been:
if (new != count)
return false;
>
> Compared to doing atomic_inc_unless_negative() first, like I did
> originally, this schema opens a bit wider window for the writer to get
> in the middle and cause the reader to fail locking but I don't think
> it would result in any visible regression.
>
> >
> > > return true;
> > > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > >
> > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > {
> > lock release for read.
>
> Ack.
>
> >
> > > - up_read(&vma->vm_lock->lock);
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > }
> >
On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <[email protected]> wrote:
>
> On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <[email protected]>
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > * mm->mm_lock_seq can't be concurrently modified.
> > */
> > mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > - if (vma->vm_lock_seq == mm_lock_seq)
> > + if (vma->vm_lock->lock_seq == mm_lock_seq)
> > return;
>
> lock acquire for write to info lockdep.
Thanks for the review Hillf!
Good idea. Will add in the next version.
> >
> > - down_write(&vma->vm_lock->lock);
> > - vma->vm_lock_seq = mm_lock_seq;
> > - up_write(&vma->vm_lock->lock);
> > + if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
> > + wait_event(vma->vm_mm->vma_writer_wait,
> > + atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
> > + vma->vm_lock->lock_seq = mm_lock_seq;
> > + /* Write barrier to ensure lock_seq change is visible before count */
> > + smp_wmb();
> > + atomic_set(&vma->vm_lock->count, 0);
> > }
> >
> > /*
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> > /* Check before locking. A race might cause false locked result. */
> > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > return false;
>
> Add mb to pair with the above wmb like
The wmb above is to ensure the ordering between updates of lock_seq
and vm_lock->count (lock_seq is updated first and vm_lock->count only
after that). The first access to vm_lock->count in this function is
atomic_inc_unless_negative() and it's an atomic RMW operation with a
return value. According to documentation such functions are fully
ordered, therefore I think we already have an implicit full memory
barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> if (READ_ONCE(vma->vm_lock->lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
> smp_acquire__after_ctrl_dep();
> return false;
> }
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > return false;
> >
> > + /* If atomic_t overflows, restore and fail to lock. */
> > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > + return false;
> > + }
> > +
> > /*
> > * Overflow might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > */
> > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock->lock);
> > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > return false;
> > }
>
> Simpler way to detect write lock owner and count overflow like
>
> int count = atomic_read(&vma->vm_lock->count);
> for (;;) {
> int new = count + 1;
>
> if (count < 0 || new < 0)
> return false;
>
> new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> if (new == count)
> break;
> count = new;
> cpu_relax();
> }
>
> (wake up waiting readers after taking the lock;
> but the write lock owner before this read trylock should be
> responsible for waking waiters up.)
>
> lock acquire for read.
This scheme might cause readers to wait, which is not an exact
replacement for down_read_trylock(). The requirement to wake up
waiting readers also complicates things, and since we can always fall
back to mmap_lock, that complication is unnecessary IMHO. I could use
part of your suggestion like this:

	int new = count + 1;

	if (count < 0 || new < 0)
		return false;

	new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
	if (new == count)
		return false;

Compared to doing atomic_inc_unless_negative() first, like I did
originally, this scheme opens a slightly wider window for the writer to
get in the middle and cause the reader to fail locking, but I don't think
it would result in any visible regression.
>
> > return true;
> > @@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct *vma)
> >
> > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > {
> lock release for read.
Ack.
>
> > - up_read(&vma->vm_lock->lock);
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > }
>
On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> > /* Check before locking. A race might cause false locked result. */
> > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > return false;
> >
> > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > return false;
> >
> > + /* If atomic_t overflows, restore and fail to lock. */
> > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > + return false;
> > + }
> > +
> > /*
> > * Overflow might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > */
> > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > - up_read(&vma->vm_lock->lock);
> > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > return false;
> > }
>
> With this change readers can cause writers to starve.
> What about checking waitqueue_active() before or after increasing
> vma->vm_lock->count?
I don't understand how readers can starve a writer. Readers do
atomic_inc_unless_negative() so a writer can always force readers
to fail.
On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > {
> > > /* Check before locking. A race might cause false locked result. */
> > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > return false;
> > >
> > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > return false;
> > >
> > > + /* If atomic_t overflows, restore and fail to lock. */
> > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > + return false;
> > > + }
> > > +
> > > /*
> > > * Overflow might produce false locked result.
> > > * False unlocked result is impossible because we modify and check
> > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > * modification invalidates all existing locks.
> > > */
> > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > - up_read(&vma->vm_lock->lock);
> > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > return false;
> > > }
> >
> > With this change readers can cause writers to starve.
> > What about checking waitqueue_active() before or after increasing
> > vma->vm_lock->count?
>
> I don't understand how readers can starve a writer. Readers do
> atomic_inc_unless_negative() so a writer can always force readers
> to fail.
I think the point here was that if page faults keep occurring and they
prevent vm_lock->count from reaching 0, then a writer will be blocked
and there is no reader throttling mechanism (no max time that the writer
will be waiting).
On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <[email protected]> wrote:
>
> On Mon, 16 Jan 2023 15:08:48 -0800 Suren Baghdasaryan <[email protected]>
> > On Mon, Jan 16, 2023 at 6:07 AM Hillf Danton <[email protected]> wrote:
> > > On Mon, 9 Jan 2023 12:53:36 -0800 Suren Baghdasaryan <[email protected]>
> > > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > {
> > > > /* Check before locking. A race might cause false locked result. */
> > > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > return false;
> > >
> > > Add mb to pair with the above wmb like
> >
> > The wmb above is to ensure the ordering between updates of lock_seq
> > and vm_lock->count (lock_seq is updated first and vm_lock->count only
> > after that). The first access to vm_lock->count in this function is
> > atomic_inc_unless_negative() and it's an atomic RMW operation with a
> > return value. According to documentation such functions are fully
> > ordered, therefore I think we already have an implicit full memory
> > barrier between reads of lock_seq and vm_lock->count here. Am I wrong?
>
> No you are not.
I'm not wrong or the other way around? Please expand a bit.
> Revisit it in case of full mb not ensured.
>
> #ifndef arch_atomic_inc_unless_negative
> static __always_inline bool
> arch_atomic_inc_unless_negative(atomic_t *v)
> {
> int c = arch_atomic_read(v);
>
> do {
> if (unlikely(c < 0))
> return false;
> } while (!arch_atomic_try_cmpxchg(v, &c, c + 1));
>
> return true;
> }
> #define arch_atomic_inc_unless_negative arch_atomic_inc_unless_negative
> #endif
I think your point is that the full mb is not ensured here, specifically if c<0?
>
> > > int count = atomic_read(&vma->vm_lock->count);
> > > for (;;) {
> > > int new = count + 1;
> > >
> > > if (count < 0 || new < 0)
> > > return false;
> > >
> > > new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > > if (new == count)
> > > break;
> > > count = new;
> > > cpu_relax();
> > > }
> > >
> > > (wake up waiting readers after taking the lock;
> > > but the write lock owner before this read trylock should be
> > > responsible for waking waiters up.)
> > >
> > > lock acquire for read.
> >
> > This schema might cause readers to wait, which is not an exact
>
> Could you specify a bit more on wait?
Oh, I misunderstood your intent with that for() loop. Indeed, if a
writer took the lock, the count will be negative and it terminates
with a failure. Yeah, I think that would work.
>
> > replacement for down_read_trylock(). The requirement to wake up
> > waiting readers also complicates things
>
> If the writer lock owner is preempted by a reader while releasing lock,
>
> set count to zero
> <-- preempt
> wake up waiters
>
> then lock is owned by reader but with read waiters.
>
> That is buggy if write waiter starvation is allowed in this patchset.
I don't quite understand your point here. Readers don't wait, so there
can't be "read waiters". Could you please expand with a race diagram
maybe?
>
> > and since we can always fall
> > back to mmap_lock, that complication is unnecessary IMHO. I could use
> > part of your suggestion like this:
> >
> > int new = count + 1;
> >
> > if (count < 0 || new < 0)
> > return false;
> >
> > new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > if (new == count)
> > return false;
> >
> > Compared to doing atomic_inc_unless_negative() first, like I did
> > originally, this schema opens a bit wider window for the writer to get
> > in the middle and cause the reader to fail locking but I don't think
> > it would result in any visible regression.
>
> It is definitely fine for writer to acquire the lock while reader is
> doing trylock, no?
Yes.
>
On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > {
> > > > /* Check before locking. A race might cause false locked result. */
> > > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > return false;
> > > >
> > > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > return false;
> > > >
> > > > + /* If atomic_t overflows, restore and fail to lock. */
> > > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > + return false;
> > > > + }
> > > > +
> > > > /*
> > > > * Overflow might produce false locked result.
> > > > * False unlocked result is impossible because we modify and check
> > > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > * modification invalidates all existing locks.
> > > > */
> > > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > - up_read(&vma->vm_lock->lock);
> > > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > return false;
> > > > }
> > >
> > > With this change readers can cause writers to starve.
> > > What about checking waitqueue_active() before or after increasing
> > > vma->vm_lock->count?
> >
> > I don't understand how readers can starve a writer. Readers do
> > atomic_inc_unless_negative() so a writer can always force readers
> > to fail.
>
> I think the point here was that if page faults keep occuring and they
> prevent vm_lock->count from reaching 0 then a writer will be blocked
> and there is no reader throttling mechanism (no max time that writer
> will be waiting).
Perhaps I misunderstood your description; I thought that a _waiting_
writer would make the count negative, not a successfully acquiring
writer.
On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > {
> > > > > /* Check before locking. A race might cause false locked result. */
> > > > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > return false;
> > > > >
> > > > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > return false;
> > > > >
> > > > > + /* If atomic_t overflows, restore and fail to lock. */
> > > > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > + return false;
> > > > > + }
> > > > > +
> > > > > /*
> > > > > * Overflow might produce false locked result.
> > > > > * False unlocked result is impossible because we modify and check
> > > > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > * modification invalidates all existing locks.
> > > > > */
> > > > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > - up_read(&vma->vm_lock->lock);
> > > > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > return false;
> > > > > }
> > > >
> > > > With this change readers can cause writers to starve.
> > > > What about checking waitqueue_active() before or after increasing
> > > > vma->vm_lock->count?
> > >
> > > I don't understand how readers can starve a writer. Readers do
> > > atomic_inc_unless_negative() so a writer can always force readers
> > > to fail.
> >
> > I think the point here was that if page faults keep occuring and they
> > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > and there is no reader throttling mechanism (no max time that writer
> > will be waiting).
>
> Perhaps I misunderstood your description; I thought that a _waiting_
> writer would make the count negative, not a successfully acquiring
> writer.
A waiting writer does not modify the counter; instead it is placed on
the wait queue, and the last reader, which sets the count to 0 while
releasing its read lock, will wake it up. Once the writer is woken it
will try to set the count to negative and, if successful, will own the
lock; otherwise it goes back to sleep.
On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> */
> *new = data_race(*orig);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> + vma_init_lock(new);
> dup_anon_vma_name(orig, new);
> }
> return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> seqcount_init(&mm->write_protect_seq);
> mmap_init_lock(mm);
> INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> + WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif
The mm shouldn't be visible yet, so why WRITE_ONCE?
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the the
> vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> differ. If they match then there must be a vma exclusive lock
> held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
> vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
> held exclusively and the current mm counter is noted to the vma
> side. This will allow multiple vmas to be locked under a single
> mmap_lock write lock (e.g. during vma merging). The vma counter
> is modified under exclusive vma lock.
I didn't realize one more thing.
Unlike a standard write lock, this implementation allows vma_write_lock
to be called multiple times under a single mmap_lock. In a sense
it is more of a mark_vma_potentially_modified than a lock.
> - write unlock - (vma_write_unlock_mm) is a batch release of all
> vma locks held. It doesn't pair with a specific
> vma_write_lock! It is done before exclusive mmap_lock is
> released by incrementing mm sequence counter (mm_lock_seq).
> - write downgrade - if the mmap_lock is downgraded to the read
> lock all vma write locks are released as well (effectivelly
> same as write unlock).
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> Move VMA flag modification (which now implies VMA locking) before
> anon_vma_lock_write to match the locking order of page fault handler.
Does this changelog assume per-vma locking in the #PF?
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
I have to say I was struggling a bit with the above and only understood
what you mean by reading the patch several times. I would phrase it like
this (feel free to use if you consider this to be an improvement).
Introduce a per-VMA rw_semaphore. The lock implementation relies on
per-vma and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_read_trylock) requires the
  vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
  differ. If they match then there must be a vma exclusive lock
  held somewhere.
- read unlock - (implemented by vma_read_unlock) is a trivial
  vma->lock unlock.
- write lock - (vma_write_lock) requires the mmap_lock to be
  held exclusively and the current mm counter is noted on the vma
  side. This will allow multiple vmas to be locked under a single
  mmap_lock write lock (e.g. during vma merging). The vma counter
  is modified under the exclusive vma lock.
- write unlock - (vma_write_unlock_mm) is a batch release of all
  vma locks held. It doesn't pair with a specific
  vma_write_lock! It is done before the exclusive mmap_lock is
  released by incrementing the mm sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read
  lock, all vma write locks are released as well (effectively the
  same as write unlock).
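To make those rules concrete, a usage sketch of a multi-vma update (an
illustration based on the description above, not code taken from the
patchset) would be:

	mmap_write_lock(mm);
	vma_write_lock(vma1);	/* notes mm_lock_seq into vma1 */
	vma_write_lock(vma2);	/* several vmas marked under one mmap_lock */
	/* ... modify the vma tree: merge, split, remap ... */
	mmap_write_unlock(mm);	/* vma_write_unlock_mm(): mm_lock_seq++ drops them all */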
> VMA lock is placed on the cache line boundary so that its 'count' field
> falls into the first cache line while the rest of the fields fall into
> the second cache line. This lets the 'count' field to be cached with
> other frequently accessed fields and used quickly in uncontended case
> while 'owner' and other fields used in the contended case will not
> invalidate the first cache line while waiting on the lock.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/mm.h | 80 +++++++++++++++++++++++++++++++++++++++
> include/linux/mm_types.h | 8 ++++
> include/linux/mmap_lock.h | 13 +++++++
> kernel/fork.c | 4 ++
> mm/init-mm.c | 3 ++
> 5 files changed, 108 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..ec2c4c227d51 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -612,6 +612,85 @@ struct vm_operations_struct {
> unsigned long addr);
> };
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_init_lock(struct vm_area_struct *vma)
> +{
> + init_rwsem(&vma->lock);
> + vma->vm_lock_seq = -1;
> +}
> +
> +static inline void vma_write_lock(struct vm_area_struct *vma)
> +{
> + int mm_lock_seq;
> +
> + mmap_assert_write_locked(vma->vm_mm);
> +
> + /*
> + * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> + * mm->mm_lock_seq can't be concurrently modified.
> + */
> + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> + if (vma->vm_lock_seq == mm_lock_seq)
> + return;
> +
> + down_write(&vma->lock);
> + vma->vm_lock_seq = mm_lock_seq;
> + up_write(&vma->lock);
> +}
> +
> +/*
> + * Try to read-lock a vma. The function is allowed to occasionally yield false
> + * locked result to avoid performance overhead, in which case we fall back to
> + * using mmap_lock. The function should never yield false unlocked result.
> + */
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +{
> + /* Check before locking. A race might cause false locked result. */
> + if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> + return false;
> +
> + if (unlikely(down_read_trylock(&vma->lock) == 0))
> + return false;
> +
> + /*
> + * Overflow might produce false locked result.
> + * False unlocked result is impossible because we modify and check
> + * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> + * modification invalidates all existing locks.
> + */
> + if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> + up_read(&vma->lock);
> + return false;
> + }
> + return true;
> +}
> +
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> + up_read(&vma->lock);
> +}
> +
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + /*
> + * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> + * mm->mm_lock_seq can't be concurrently modified.
> + */
> + VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> + { return false; }
> +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> {
> static const struct vm_operations_struct dummy_vm_ops = {};
> @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> + vma_init_lock(vma);
> }
>
> static inline void vma_set_anonymous(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d5cdec1314fe..5f7c5ca89931 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -555,6 +555,11 @@ struct vm_area_struct {
> pgprot_t vm_page_prot;
> unsigned long vm_flags; /* Flags, see mm.h. */
>
> +#ifdef CONFIG_PER_VMA_LOCK
> + int vm_lock_seq;
> + struct rw_semaphore lock;
> +#endif
> +
> /*
> * For areas with an address space and backing store,
> * linkage into the address_space->i_mmap interval tree.
> @@ -680,6 +685,9 @@ struct mm_struct {
> * init_mm.mmlist, and are protected
> * by mmlist_lock
> */
> +#ifdef CONFIG_PER_VMA_LOCK
> + int mm_lock_seq;
> +#endif
>
>
> unsigned long hiwater_rss; /* High-watermark of RSS usage */
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index e49ba91bb1f0..40facd4c398b 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
> VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> }
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> +{
> + mmap_assert_write_locked(mm);
> + /* No races during update due to exclusive mmap_lock being held */
> + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> +}
> +#else
> +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> +#endif
> +
> static inline void mmap_init_lock(struct mm_struct *mm)
> {
> init_rwsem(&mm->mmap_lock);
> @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
> static inline void mmap_write_unlock(struct mm_struct *mm)
> {
> __mmap_lock_trace_released(mm, true);
> + vma_write_unlock_mm(mm);
> up_write(&mm->mmap_lock);
> }
>
> static inline void mmap_write_downgrade(struct mm_struct *mm)
> {
> __mmap_lock_trace_acquire_returned(mm, false, true);
> + vma_write_unlock_mm(mm);
> downgrade_write(&mm->mmap_lock);
> }
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> */
> *new = data_race(*orig);
> INIT_LIST_HEAD(&new->anon_vma_chain);
> + vma_init_lock(new);
> dup_anon_vma_name(orig, new);
> }
> return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> seqcount_init(&mm->write_protect_seq);
> mmap_init_lock(mm);
> INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> + WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif
> mm_pgtables_bytes_init(mm);
> mm->map_count = 0;
> mm->locked_vm = 0;
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index c9327abb771c..33269314e060 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
> .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> +#ifdef CONFIG_PER_VMA_LOCK
> + .mm_lock_seq = 0,
> +#endif
> .user_ns = &init_user_ns,
> .cpu_bitmap = CPU_BITS_NONE,
> #ifdef CONFIG_IOMMU_SVA
> --
> 2.39.0
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
[...]
> void vm_area_free(struct vm_area_struct *vma)
> {
> free_anon_vma_name(vma);
> +#ifdef CONFIG_PER_VMA_LOCK
> + call_rcu(&vma->vm_rcu, __vm_area_free);
> +#else
> kmem_cache_free(vm_area_cachep, vma);
> +#endif
Is it safe to have a vma with an already freed vma_name? I suspect this is
safe because of the mmap_lock, but is there any reason to split the freeing
process and have this potential UAF lurking?
> }
>
> static void account_kernel_stack(struct task_struct *tsk, int account)
> --
> 2.39.0
--
Michal Hocko
SUSE Labs
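One way to avoid that window, as a sketch (assuming __vm_area_free is the
call_rcu callback used above and vm_rcu is the rcu_head embedded in the
vma), would be to free the name from the RCU callback as well:

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/*
	 * Free the anon name only after the grace period, together with the
	 * vma itself, so there is no window where the vma is still reachable
	 * but its name has already been freed.
	 */
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}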
On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> Protect VMA from concurrent page fault handler while collapsing a huge
> page. Page fault handler needs a stable PMD to use PTL and relies on
> per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> not be detected by a page fault handler without proper locking.
I am struggling with this changelog, maybe because my recollection of
the THP collapsing subtleties is weak. But aren't you just trying to say
that #PF handling and THP collapsing currently need to be mutually
exclusive, so in order to keep that assumption you have to mark the vma
write-locked?
Also it is not really clear to me how that handles other vmas which can
share the same THP?
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> Assert there are no holders of VMA lock for reading when it is about to be
> destroyed.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/mm.h | 8 ++++++++
> kernel/fork.c | 2 ++
> 2 files changed, 10 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 594e835bad9c..c464fc8a514c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> }
>
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> + vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> + vma);
Do we really need to check vm_lock_seq? rwsem_is_locked should tell
us something is wrong on its own, no? This could be somebody racing with
the vma destruction and using the write lock. Unlikely, but I do not see
why we should narrow the debugging scope.
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
Could you describe why only anonymous vmas are handled at this stage and
what (roughly) has to be done to support other vmas? lock_vma_under_rcu
doesn't seem to have any anonymous-vma-specific requirements AFAICS.
Also, isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
the naming is really the most important part, but the rcu locking is
internal to the function, so why should we spread this implementation
detail to the world...
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/mm.h | 3 +++
> mm/memory.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 54 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c464fc8a514c..d0fddf6a1de9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> vma);
> }
>
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address);
> +
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ece18548db1..a658e26d965d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> }
> EXPORT_SYMBOL_GPL(handle_mm_fault);
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +/*
> + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> + * stable and not isolated. If the VMA is not found or is being modified the
> + * function returns NULL.
> + */
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma, *validate;
> +
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> +retry:
> + if (!vma)
> + goto inval;
> +
> + /* Only anonymous vmas are supported for now */
> + if (!vma_is_anonymous(vma))
> + goto inval;
> +
> + if (!vma_read_trylock(vma))
> + goto inval;
> +
> + /* Check since vm_start/vm_end might change before we lock the VMA */
> + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> + vma_read_unlock(vma);
> + goto inval;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + mas.index = address;
> + validate = mas_walk(&mas);
> + if (validate != vma) {
> + vma_read_unlock(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one. */
> + vma = validate;
> + goto retry;
> + }
> +
> + rcu_read_unlock();
> + return vma;
> +inval:
> + rcu_read_unlock();
> + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> + return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> #ifndef __PAGETABLE_P4D_FOLDED
> /*
> * Allocate p4d page table.
> --
> 2.39.0
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.
What kind of regressions?
> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.
Please add some data to justify this additional complexity.
--
Michal Hocko
SUSE Labs
+locking maintainers
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
[...]
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> + up_read(&vma->lock);
> +}
One thing that might be gnarly here is that I think you might not be
allowed to use up_read() to fully release ownership of an object -
from what I remember, I think that up_read() (unlike something like
spin_unlock()) can access the lock object after it's already been
acquired by someone else. So if you want to protect against concurrent
deletion, this might have to be something like:
rcu_read_lock(); /* keeps vma alive */
up_read(&vma->lock);
rcu_read_unlock();
But I'm not entirely sure about that, the locking folks might know better.
Also, it might not matter given that the rw_semaphore part is removed
in the current patch 41/41 anyway...
On Tue, Jan 17, 2023 at 12:34 AM Hillf Danton <[email protected]> wrote:
>
> On Mon, 16 Jan 2023 20:52:45 -0800 Suren Baghdasaryan <[email protected]>
> > On Mon, Jan 16, 2023 at 7:16 PM Hillf Danton <[email protected]> wrote:
> > > No you are not.
> >
> > I'm not wrong or the other way around? Please expand a bit.
>
> You are not wrong.
Ok, I think if I rewrite the vma_read_trylock() we should be fine?:
static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	int count, new;

	/* Check before locking. A race might cause false locked result. */
	if (READ_ONCE(vma->vm_lock->lock_seq) ==
	    READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	count = atomic_read(&vma->vm_lock->count);
	for (;;) {
		/*
		 * Is the VMA write-locked? Overflow might produce false
		 * locked result.
		 * False unlocked result is impossible because we modify and
		 * check vma->vm_lock_seq under vma->vm_lock protection and
		 * mm->mm_lock_seq modification invalidates all existing locks.
		 */
		if (count < 0)
			return false;

		new = count + 1;
		/* If atomic_t overflows, fail to lock. */
		if (new < 0)
			return false;

		/*
		 * Atomic RMW will provide an implicit mb on success to pair
		 * with the smp_wmb in vma_write_lock; on failure we retry.
		 */
		new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
		if (new == count)
			break;
		count = new;
		cpu_relax();
	}
	if (unlikely(READ_ONCE(vma->vm_lock->lock_seq) ==
		     READ_ONCE(vma->vm_mm->mm_lock_seq))) {
		if (atomic_dec_and_test(&vma->vm_lock->count))
			wake_up(&vma->vm_mm->vma_writer_wait);
		return false;
	}
	return true;
}
> > >
> > > If the writer lock owner is preempted by a reader while releasing lock,
> > >
> > > set count to zero
> > > <-- preempt
> > > wake up waiters
> > >
> > > then lock is owned by reader but with read waiters.
> > >
> > > That is buggy if write waiter starvation is allowed in this patchset.
> >
> > I don't quite understand your point here. Readers don't wait, so there
> > can't be "read waiters". Could you please expand with a race diagram
> > maybe?
>
> cpu3 cpu2
> --- ---
> taskA bond to cpu3
> down_write(&mm->mmap_lock);
> vma_write_lock L
> taskB fail to take L for read
> taskC fail to take mmap_lock for write
> taskD fail to take L for read
> vma_write_unlock_mm(mm);
>
> preempted by taskE
> taskE take L for read and
> read waiters of L, taskB and taskD,
> should be woken up
>
> up_write(&mm->mmap_lock);
Readers never wait for the vma lock; that's why we have only
vma_read_trylock and no vma_read_lock. In your scenario taskB and
taskD will fall back to taking mmap_lock for read after vma_read_trylock
fails. Once taskA does up_write(mmap_lock) they will be woken up,
since they are blocked on taking mmap_lock for read.
>
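For reference, the fallback being described is roughly this shape in the
fault path (a simplified sketch; flags, regs and error handling are
assumed from the surrounding arch handler and the details differ per
architecture):

	/* Fast path: look up and read-lock the vma without mmap_lock. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address, flags, regs);
		vma_read_unlock(vma);
	} else {
		/* Fall back to the classic path under mmap_lock. */
		mmap_read_lock(mm);
		vma = find_vma(mm, address);
		if (vma)
			fault = handle_mm_fault(vma, address, flags, regs);
		mmap_read_unlock(mm);
	}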
On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > {
> > > > > > /* Check before locking. A race might cause false locked result. */
> > > > > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > return false;
> > > > > >
> > > > > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > return false;
> > > > > >
> > > > > > + /* If atomic_t overflows, restore and fail to lock. */
> > > > > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > + return false;
> > > > > > + }
> > > > > > +
> > > > > > /*
> > > > > > * Overflow might produce false locked result.
> > > > > > * False unlocked result is impossible because we modify and check
> > > > > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > > * modification invalidates all existing locks.
> > > > > > */
> > > > > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > - up_read(&vma->vm_lock->lock);
> > > > > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > return false;
> > > > > > }
> > > > >
> > > > > With this change readers can cause writers to starve.
> > > > > What about checking waitqueue_active() before or after increasing
> > > > > vma->vm_lock->count?
> > > >
> > > > I don't understand how readers can starve a writer. Readers do
> > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > to fail.
> > >
> > > I think the point here was that if page faults keep occuring and they
> > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > and there is no reader throttling mechanism (no max time that writer
> > > will be waiting).
> >
> > Perhaps I misunderstood your description; I thought that a _waiting_
> > writer would make the count negative, not a successfully acquiring
> > writer.
>
> A waiting writer does not modify the counter, instead it's placed on
> the wait queue and the last reader which sets the count to 0 while
> releasing its read lock will wake it up. Once the writer is woken it
> will try to set the count to negative and if successful will own the
> lock, otherwise it goes back to sleep.
Then yes, that's a starvable lock. Preventing starvation on the mmap
sem was the original motivation for making rwsems non-starvable, so
changing that behaviour now seems like a bad idea. For efficiency, I'd
suggest that a waiting writer set the top bit of the counter. That way,
all new readers will back off without needing to check a second variable
and old readers will know that they *may* need to do the wakeup when
atomic_sub_return_release() is negative.
(rwsem.c has a more complex bitfield, but I don't think we need to go
that far; the important point is that the waiting writer indicates its
presence in the count field so that readers can modify their behaviour)
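A minimal sketch of that suggestion (an illustration, not code from this
series; VMA_LOCK_WRITER is a made-up name for the top bit):

#define VMA_LOCK_WRITER	INT_MIN		/* top bit: a writer is waiting */

static inline void vma_write_lock_sketch(struct vm_area_struct *vma)
{
	/*
	 * Advertise the waiting writer: count goes negative, so new readers
	 * back off in atomic_inc_unless_negative() without a second check.
	 */
	atomic_or(VMA_LOCK_WRITER, &vma->vm_lock->count);
	/* Wait for the readers that got in before us to drain. */
	wait_event(vma->vm_mm->vma_writer_wait,
		   atomic_read(&vma->vm_lock->count) == VMA_LOCK_WRITER);
	/* ... update lock_seq, then clear the flag when unlocking ... */
}

static inline void vma_read_unlock_sketch(struct vm_area_struct *vma)
{
	/*
	 * A negative result means a writer may be waiting and may need the
	 * wakeup; release ordering keeps the critical section before it.
	 */
	if (atomic_sub_return_release(1, &vma->vm_lock->count) < 0)
		wake_up(&vma->vm_mm->vma_writer_wait);
}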
On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > {
> > > > > - up_read(&vma->vm_lock->lock);
> > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> > Task A:
> > atomic_dec_and_test(&vma->vm_lock->count)
> > Task B:
> > munmap()
> > write lock
> > free VMA
> > synchronize_rcu()
> > VMA is really freed
> > wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur. I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock. But maybe we should
> > have that assertion?
>
> Yep, that's what this patch is doing
> https://lore.kernel.org/all/[email protected]/
> by calling vma_assert_no_reader() from __vm_area_free().
That's not enough though. Task A still has a pointer to vma after it
has called atomic_dec_and_test(), even after vma has been freed by
Task B, and before Task A dereferences vma->vm_mm.
On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > {
> > int count, new;
> >
> > /* Check before locking. A race might cause false locked result. */
> > if (READ_ONCE(vma->vm_lock->lock_seq) ==
> > READ_ONCE(vma->vm_mm->mm_lock_seq))
> > return false;
> >
> > count = atomic_read(&vma->vm_lock->count);
> > for (;;) {
> > /*
> > * Is the VMA write-locked? Overflow might produce false locked result.
> > * False unlocked result is impossible because we modify and check
> > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > * modification invalidates all existing locks.
> > */
> > if (count < 0)
> > return false;
> >
> > new = count + 1;
> > /* If atomic_t overflows, fail to lock. */
> > if (new < 0)
> > return false;
> >
> > /*
> > * Atomic RMW will provide implicit mb on success to pair with
> > * smp_wmb in vma_write_lock, on failure we retry.
> > */
> > new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> > if (new == count)
> > break;
> > count = new;
> > cpu_relax();
>
> The cpu_relax() is exactly the wrong thing to do here. See this thread:
> https://lore.kernel.org/linux-fsdevel/[email protected]/
Thanks for the pointer, Matthew. I think we can safely remove
cpu_relax() since it's unlikely the count is constantly changing under
a reader.
>
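A possible shape of that loop without the cpu_relax(), sketched here with atomic_try_cmpxchg() so a failed attempt simply retries with the value it just observed (illustrative only, not the patchset's code):

#include <linux/atomic.h>
#include <linux/kernel.h>

static inline bool sketch_count_inc_unless_negative(atomic_t *count)
{
	int old = atomic_read(count);

	do {
		if (old < 0)		/* write-locked */
			return false;
		if (old == INT_MAX)	/* would overflow into negative */
			return false;
	} while (!atomic_try_cmpxchg(count, &old, old + 1));

	return true;
}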
On Tue, Jan 17, 2023 at 11:00 AM Jann Horn <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > > with a simpler structure:
> > > > > > > [...]
> > > > > > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > > {
> > > > > > > > - up_read(&vma->vm_lock->lock);
> > > > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > > }
> > > > > > >
> > > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > > >
> > > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > > is write-locked.
> > > > >
> > > > > Yes, but ...
> > > > >
> > > > > Task A:
> > > > > atomic_dec_and_test(&vma->vm_lock->count)
> > > > > Task B:
> > > > > munmap()
> > > > > write lock
> > > > > free VMA
> > > > > synchronize_rcu()
> > > > > VMA is really freed
> > > > > wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >
> > > > > ... vma is freed.
> > > > >
> > > > > Now, I think this doesn't occur. I'm pretty sure that every caller of
> > > > > vma_read_unlock() is holding the RCU read lock. But maybe we should
> > > > > have that assertion?
> > > >
> > > > Yep, that's what this patch is doing
> > > > https://lore.kernel.org/all/[email protected]/
> > > > by calling vma_assert_no_reader() from __vm_area_free().
> > >
> > > That's not enough though. Task A still has a pointer to vma after it
> > > has called atomic_dec_and_test(), even after vma has been freed by
> > > Task B, and before Task A dereferences vma->vm_mm.
> >
> > Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> > local variable and call mmgrab() before atomic_dec_and_test(), then
> > use it in wake_up() and call mmdrop(). Is that what you are thinking?
>
> You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
> for page fault handling must be keeping the mm_struct alive.
Good point. Will update in the next revision to store mm before
dropping the count. Thanks for all the comments, folks!
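The agreed-upon fix would look roughly like the sketch below (field names follow the quoted patch, not the final code): cache vma->vm_mm before the decrement so the vma is never touched after the read reference that keeps it stable is dropped.

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/* The caller's page fault keeps the mm_struct alive. */
	struct mm_struct *mm = vma->vm_mm;

	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&mm->vma_writer_wait);
}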
On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > {
> > > > - up_read(&vma->vm_lock->lock);
> > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(&vma->vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur. I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock. But maybe we should
> have that assertion?
Yep, that's what this patch is doing
https://lore.kernel.org/all/[email protected]/
by calling vma_assert_no_reader() from __vm_area_free().
>
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> vma->lock being part of the vm_area_struct causes performance regression
> during page faults because during contention its count and owner fields
> are constantly updated and having other parts of vm_area_struct used
> during page fault handling next to them causes constant cache line
> bouncing. Fix that by moving the lock outside of the vm_area_struct.
> All attempts to keep vma->lock inside vm_area_struct in a separate
> cache line still produce performance regression especially on NUMA
> machines. Smallest regression was achieved when lock is placed in the
> fourth cache line but that bloats vm_area_struct to 256 bytes.
Just checking: When you tested putting the lock in different cache
lines, did you force the slab allocator to actually store the
vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
on the slab or with a ____cacheline_aligned_in_smp on the struct
definition)?
On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> >
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > [...]
> > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > {
> > > - up_read(&vma->vm_lock->lock);
> > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > }
> >
> > I haven't properly reviewed this, but this bit looks like a
> > use-after-free because you're accessing the vma after dropping your
> > reference on it. You'd have to first look up the vma->vm_mm, then do
> > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > touching the vma. Or alternatively wrap the whole thing in an RCU
> > read-side critical section if the VMA is freed with RCU delay.
>
> vm_lock->count does not control the lifetime of the VMA, it's a
> counter of how many readers took the lock or it's negative if the lock
> is write-locked.
Yes, but ...
Task A:
atomic_dec_and_test(&vma->vm_lock->count)
Task B:
munmap()
write lock
free VMA
synchronize_rcu()
VMA is really freed
wake_up(&vma->vm_mm->vma_writer_wait);
... vma is freed.
Now, I think this doesn't occur. I'm pretty sure that every caller of
vma_read_unlock() is holding the RCU read lock. But maybe we should
have that assertion?
On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > {
> > > > - up_read(&vma->vm_lock->lock);
> > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(&vma->vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur. I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock. But maybe we should
> have that assertion?
I don't see that. When do_user_addr_fault() is calling
vma_read_unlock(), there's no RCU read lock held, right?
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
[...]
> static inline void vma_read_unlock(struct vm_area_struct *vma)
> {
> - up_read(&vma->vm_lock->lock);
> + if (atomic_dec_and_test(&vma->vm_lock->count))
> + wake_up(&vma->vm_mm->vma_writer_wait);
> }
I haven't properly reviewed this, but this bit looks like a
use-after-free because you're accessing the vma after dropping your
reference on it. You'd have to first look up the vma->vm_mm, then do
the atomic_dec_and_test(), and afterwards do the wake_up() without
touching the vma. Or alternatively wrap the whole thing in an RCU
read-side critical section if the VMA is freed with RCU delay.
On Tue, Jan 17, 2023 at 10:34 AM Jann Horn <[email protected]> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > vma->lock being part of the vm_area_struct causes performance regression
> > during page faults because during contention its count and owner fields
> > are constantly updated and having other parts of vm_area_struct used
> > during page fault handling next to them causes constant cache line
> > bouncing. Fix that by moving the lock outside of the vm_area_struct.
> > All attempts to keep vma->lock inside vm_area_struct in a separate
> > cache line still produce performance regression especially on NUMA
> > machines. Smallest regression was achieved when lock is placed in the
> > fourth cache line but that bloats vm_area_struct to 256 bytes.
>
> Just checking: When you tested putting the lock in different cache
> lines, did you force the slab allocator to actually store the
> vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
> on the slab or with a ____cacheline_aligned_in_smp on the struct
> definition)?
Yep, I tried all these combinations and still saw a noticeable regression.
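For completeness, the two alignment knobs being discussed, in sketch form (the struct and cache names are placeholders, not the patchset's):

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/rwsem.h>
#include <linux/slab.h>

struct vma_sketch {
	unsigned long vm_start;		/* hot fault-path fields ... */
	unsigned long vm_end;
	/* b) force the lock onto its own cache line inside the struct */
	struct rw_semaphore lock ____cacheline_aligned_in_smp;
};

static struct kmem_cache *vma_sketch_cachep;

static int __init vma_sketch_cache_init(void)
{
	/* a) make the slab hand out cacheline-aligned objects */
	vma_sketch_cachep = kmem_cache_create("vma_sketch",
					      sizeof(struct vma_sketch), 0,
					      SLAB_HWCACHE_ALIGN, NULL);
	return vma_sketch_cachep ? 0 : -ENOMEM;
}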
On Tue, Jan 17, 2023 at 10:21:28AM -0800, Suren Baghdasaryan wrote:
> static inline bool vma_read_trylock(struct vm_area_struct *vma)
> {
> int count, new;
>
> /* Check before locking. A race might cause false locked result. */
> if (READ_ONCE(vma->vm_lock->lock_seq) ==
> READ_ONCE(vma->vm_mm->mm_lock_seq))
> return false;
>
> count = atomic_read(&vma->vm_lock->count);
> for (;;) {
> /*
> * Is the VMA write-locked? Overflow might produce false locked result.
> * False unlocked result is impossible because we modify and check
> * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> * modification invalidates all existing locks.
> */
> if (count < 0)
> return false;
>
> new = count + 1;
> /* If atomic_t overflows, fail to lock. */
> if (new < 0)
> return false;
>
> /*
> * Atomic RMW will provide implicit mb on success to pair with
> * smp_wmb in vma_write_lock, on failure we retry.
> */
> new = atomic_cmpxchg(&vma->vm_lock->count, count, new);
> if (new == count)
> break;
> count = new;
> cpu_relax();
The cpu_relax() is exactly the wrong thing to do here. See this thread:
https://lore.kernel.org/linux-fsdevel/[email protected]/
On Tue, Jan 17, 2023 at 10:23 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 11:14:38AM +0000, Hyeonggon Yoo wrote:
> > > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct *vma)
> > > > > > > static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > > {
> > > > > > > /* Check before locking. A race might cause false locked result. */
> > > > > > > - if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > + if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > return false;
> > > > > > >
> > > > > > > - if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > > + if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > > return false;
> > > > > > >
> > > > > > > + /* If atomic_t overflows, restore and fail to lock. */
> > > > > > > + if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > + return false;
> > > > > > > + }
> > > > > > > +
> > > > > > > /*
> > > > > > > * Overflow might produce false locked result.
> > > > > > > * False unlocked result is impossible because we modify and check
> > > > > > > * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
> > > > > > > * modification invalidates all existing locks.
> > > > > > > */
> > > > > > > - if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > - up_read(&vma->vm_lock->lock);
> > > > > > > + if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > return false;
> > > > > > > }
> > > > > >
> > > > > > With this change readers can cause writers to starve.
> > > > > > What about checking waitqueue_active() before or after increasing
> > > > > > vma->vm_lock->count?
> > > > >
> > > > > I don't understand how readers can starve a writer. Readers do
> > > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > > to fail.
> > > >
> > > > I think the point here was that if page faults keep occurring and they
> > > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > > and there is no reader throttling mechanism (no max time that writer
> > > > will be waiting).
> > >
> > > Perhaps I misunderstood your description; I thought that a _waiting_
> > > writer would make the count negative, not a successfully acquiring
> > > writer.
> >
> > A waiting writer does not modify the counter, instead it's placed on
> > the wait queue and the last reader which sets the count to 0 while
> > releasing its read lock will wake it up. Once the writer is woken it
> > will try to set the count to negative and if successful will own the
> > lock, otherwise it goes back to sleep.
>
> Then yes, that's a starvable lock. Preventing starvation on the mmap
> sem was the original motivation for making rwsems non-starvable, so
> changing that behaviour now seems like a bad idea. For efficiency, I'd
> suggest that a waiting writer set the top bit of the counter. That way,
> all new readers will back off without needing to check a second variable
> and old readers will know that they *may* need to do the wakeup when
> atomic_sub_return_release() is negative.
>
> (rwsem.c has a more complex bitfield, but I don't think we need to go
> that far; the important point is that the waiting writer indicates its
> presence in the count field so that readers can modify their behaviour)
Got it. Ok, I think we can figure something out to check if there are
waiting write-lockers and prevent new readers from taking the lock.
On Tue, Jan 17, 2023 at 10:36 AM Jann Horn <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > {
> > > > > - up_read(&vma->vm_lock->lock);
> > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> > Task A:
> > atomic_dec_and_test(&vma->vm_lock->count)
> > Task B:
> > munmap()
> > write lock
> > free VMA
> > synchronize_rcu()
> > VMA is really freed
> > wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur. I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock. But maybe we should
> > have that assertion?
>
> I don't see that. When do_user_addr_fault() is calling
> vma_read_unlock(), there's no RCU read lock held, right?
We free VMAs using call_rcu() after removing them from the VMA tree. OTOH
page fault handlers search for VMAs from inside an RCU read section and
call vma_read_unlock() from there, see
https://lore.kernel.org/all/[email protected]/.
Once we take the VMA read lock, the VMA can't be write-locked, and
anyone destroying or isolating the VMA needs to write-lock it first.
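The calling pattern under discussion looks roughly like this on the arch fault path (a sketch assuming lock_vma_under_rcu() and FAULT_FLAG_VMA_LOCK as in the quoted patches); note that vma_read_unlock() indeed runs outside the RCU read section used for the lookup:

static vm_fault_t sketch_fault_path(struct mm_struct *mm,
				    unsigned long address, unsigned int flags,
				    struct pt_regs *regs)
{
	struct vm_area_struct *vma;
	vm_fault_t fault;

	vma = lock_vma_under_rcu(mm, address);	/* RCU held only inside */
	if (!vma)
		return VM_FAULT_RETRY;		/* caller retries under mmap_lock */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_read_unlock(vma);			/* no rcu_read_lock() here */

	return fault;
}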
On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan <[email protected]> wrote:
> On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > with a simpler structure:
> > > > > > [...]
> > > > > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > {
> > > > > > > - up_read(&vma->vm_lock->lock);
> > > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > }
> > > > > >
> > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > >
> > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > is write-locked.
> > > >
> > > > Yes, but ...
> > > >
> > > > Task A:
> > > > atomic_dec_and_test(&vma->vm_lock->count)
> > > > Task B:
> > > > munmap()
> > > > write lock
> > > > free VMA
> > > > synchronize_rcu()
> > > > VMA is really freed
> > > > wake_up(&vma->vm_mm->vma_writer_wait);
> > > >
> > > > ... vma is freed.
> > > >
> > > > Now, I think this doesn't occur. I'm pretty sure that every caller of
> > > > vma_read_unlock() is holding the RCU read lock. But maybe we should
> > > > have that assertion?
> > >
> > > Yep, that's what this patch is doing
> > > https://lore.kernel.org/all/[email protected]/
> > > by calling vma_assert_no_reader() from __vm_area_free().
> >
> > That's not enough though. Task A still has a pointer to vma after it
> > has called atomic_dec_and_test(), even after vma has been freed by
> > Task B, and before Task A dereferences vma->vm_mm.
>
> Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> local variable and call mmgrab() before atomic_dec_and_test(), then
> use it in wake_up() and call mmdrop(). Is that what you are thinking?
You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
for page fault handling must be keeping the mm_struct alive.
On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > with a simpler structure:
> > > > > [...]
> > > > > > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > {
> > > > > > - up_read(&vma->vm_lock->lock);
> > > > > > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > + wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > }
> > > > >
> > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > use-after-free because you're accessing the vma after dropping your
> > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > read-side critical section if the VMA is freed with RCU delay.
> > > >
> > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > counter of how many readers took the lock or it's negative if the lock
> > > > is write-locked.
> > >
> > > Yes, but ...
> > >
> > > Task A:
> > > atomic_dec_and_test(&vma->vm_lock->count)
> > > Task B:
> > > munmap()
> > > write lock
> > > free VMA
> > > synchronize_rcu()
> > > VMA is really freed
> > > wake_up(&vma->vm_mm->vma_writer_wait);
> > >
> > > ... vma is freed.
> > >
> > > Now, I think this doesn't occur. I'm pretty sure that every caller of
> > > vma_read_unlock() is holding the RCU read lock. But maybe we should
> > > have that assertion?
> >
> > Yep, that's what this patch is doing
> > https://lore.kernel.org/all/[email protected]/
> > by calling vma_assert_no_reader() from __vm_area_free().
>
> That's not enough though. Task A still has a pointer to vma after it
> has called atomic_dec_and_test(), even after vma has been freed by
> Task B, and before Task A dereferences vma->vm_mm.
Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
local variable and call mmgrab() before atomic_dec_and_test(), then
use it in wake_up() and call mmdrop(). Is that what you are thinking?
On Tue, Jan 17, 2023 at 10:12 AM Jann Horn <[email protected]> wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> [...]
> > static inline void vma_read_unlock(struct vm_area_struct *vma)
> > {
> > - up_read(&vma->vm_lock->lock);
> > + if (atomic_dec_and_test(&vma->vm_lock->count))
> > + wake_up(&vma->vm_mm->vma_writer_wait);
> > }
>
> I haven't properly reviewed this, but this bit looks like a
> use-after-free because you're accessing the vma after dropping your
> reference on it. You'd have to first look up the vma->vm_mm, then do
> the atomic_dec_and_test(), and afterwards do the wake_up() without
> touching the vma. Or alternatively wrap the whole thing in an RCU
> read-side critical section if the VMA is freed with RCU delay.
vm_lock->count does not control the lifetime of the VMA, it's a
counter of how many readers took the lock or it's negative if the lock
is write-locked.
On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > >
> > > I have to say I was struggling a bit with the above and only understood
> > > what you mean by reading the patch several times. I would phrase it like
> > > this (feel free to use if you consider this to be an improvement).
> > >
> > > Introduce a per-VMA rw_semaphore. The lock implementation relies on
> > > per-vma and per-mm sequence counters to note exclusive locking:
> > > - read lock - (implemented by vma_read_trylock) requires the
> > > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > differ. If they match then there must be a vma exclusive lock
> > > held somewhere.
> > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > vma->lock unlock.
> > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > held exclusively and the current mm counter is noted to the vma
> > > side. This will allow multiple vmas to be locked under a single
> > > mmap_lock write lock (e.g. during vma merging). The vma counter
> > > is modified under exclusive vma lock.
> >
> > Didn't realize one more thing.
> > Unlike standard write lock this implementation allows to be
> > called multiple times under a single mmap_lock. In a sense
> > it is more of mark_vma_potentially_modified than a lock.
>
> In the RFC it was called vma_mark_locked() originally and renames were
> discussed in the email thread ending here:
> https://lore.kernel.org/all/[email protected]/.
> If other names are preferable I'm open to changing them.
I don't want to bikeshed this, but rather than locking it seems to be
more:
vma_start_read()
vma_end_read()
vma_start_write()
vma_end_write()
vma_downgrade_write()
... and that these are _implemented_ with locks (in part) is an
implementation detail?
Would that reduce people's confusion?
> >
> > > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > > vma locks held. It doesn't pair with a specific
> > > vma_write_lock! It is done before exclusive mmap_lock is
> > > released by incrementing mm sequence counter (mm_lock_seq).
> > > - write downgrade - if the mmap_lock is downgraded to the read
> > > lock all vma write locks are released as well (effectively
> > > same as write unlock).
> > --
> > Michal Hocko
> > SUSE Labs
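A minimal sketch of the sequence-count scheme Michal summarizes above, using naming along the lines Matthew suggests (field names follow the quoted patch; this is not the final code):

static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* Matching sequence counters mean the vma is write-locked. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;
	if (!down_read_trylock(&vma->lock))
		return false;
	/* Re-check: a writer may have marked the vma in the meantime. */
	if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
		up_read(&vma->lock);
		return false;
	}
	return true;
}

static inline void vma_end_read(struct vm_area_struct *vma)
{
	up_read(&vma->lock);
}

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

	mmap_assert_write_locked(vma->vm_mm);
	if (vma->vm_lock_seq == mm_lock_seq)
		return;			/* already marked in this cycle */
	down_write(&vma->lock);
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->lock);
}

static inline void vma_end_write_all(struct mm_struct *mm)
{
	mmap_assert_write_locked(mm);
	/* One increment releases every vma marked in this cycle. */
	WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
}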
On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <[email protected]> wrote:
> On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while collapsing a huge
> > page. Page fault handler needs a stable PMD to use PTL and relies on
> > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > not be detected by a page fault handler without proper locking.
>
> I am struggling with this changelog. Maybe because my recollection of
> the THP collapsing subtleties is weak. But aren't you just trying to say
> that the current #PF handling and THP collapsing need to be mutually
> exclusive currently so in order to keep that assumption you have mark
> the vma write locked?
>
> Also it is not really clear to me how that handles other vmas which can
> share the same thp?
It's not about the hugepage itself, it's about how the THP collapse
operation frees page tables.
Before this series, page tables can be walked under any one of the
mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This series adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that one.
There is a codepath in khugepaged that iterates through all mappings
of a file to zap page tables (retract_page_tables()), which locks each
visited mm with mmap_write_trylock() and now also does
vma_write_lock().
I think one aspect of this patch that might cause trouble later on, if
support for non-anonymous VMAs is added, is that retract_page_tables()
now does vma_write_lock() while holding the mapping lock; the page
fault handling path would probably take the locks the other way
around, leading to a deadlock? So the vma_write_lock() in
retract_page_tables() might have to become a trylock later on.
Related: Please add the new VMA lock to the big lock ordering comments
at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
add file VMA support.)
On Tue, Jan 17, 2023 at 8:51 PM Jann Horn <[email protected]> wrote:
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > handling under VMA lock and retry holding mmap_lock. This can be handled
> > more gracefully in the future.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Suggested-by: Peter Xu <[email protected]>
> > ---
> > mm/memory.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 20806bc8b4eb..12508f4d845a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > if (!vma->anon_vma)
> > goto inval;
> >
> > + /*
> > + * Due to the possibility of userfault handler dropping mmap_lock, avoid
> > + * it for now and fall back to page fault handling under mmap_lock.
> > + */
> > + if (userfaultfd_armed(vma))
> > + goto inval;
>
> This looks racy wrt concurrent userfaultfd_register(). I think you'll
> want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
I still think this change is needed...
> and ensure that the userfaultfd code write-locks the VMA before
> changing the __VM_UFFD_FLAGS in vma->vm_flags.
Ah, but now I see you already took care of this half of the issue with
the reset_vm_flags() change in
https://lore.kernel.org/linux-mm/[email protected]/
.
> > if (!vma_read_trylock(vma))
> > goto inval;
> >
> > --
> > 2.39.0
> >
On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
[...]
> > Then yes, that's a starvable lock. Preventing starvation on the mmap
> > sem was the original motivation for making rwsems non-starvable, so
> > changing that behaviour now seems like a bad idea. For efficiency, I'd
> > suggest that a waiting writer set the top bit of the counter. That way,
> > all new readers will back off without needing to check a second variable
> > and old readers will know that they *may* need to do the wakeup when
> > atomic_sub_return_release() is negative.
> >
> > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > that far; the important point is that the waiting writer indicates its
> > presence in the count field so that readers can modify their behaviour)
>
> Got it. Ok, I think we can figure something out to check if there are
> waiting write-lockers and prevent new readers from taking the lock.
Reinventing locking primitives is a ticket to weird bugs. I would stick
with the rwsem and deal with performance fallouts after it is clear that
the core idea is generally acceptable and based on actual real life
numbers. This whole thing is quite big enough that we do not have to go
through "is this new synchronization primitive correct and behaving
reasonably" exercise.
--
Michal Hocko
SUSE Labs
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Suggested-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 20806bc8b4eb..12508f4d845a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma->anon_vma)
> goto inval;
>
> + /*
> + * Due to the possibility of userfault handler dropping mmap_lock, avoid
> + * it for now and fall back to page fault handling under mmap_lock.
> + */
> + if (userfaultfd_armed(vma))
> + goto inval;
This looks racy wrt concurrent userfaultfd_register(). I think you'll
want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
and ensure that the userfaultfd code write-locks the VMA before
changing the __VM_UFFD_FLAGS in vma->vm_flags.
> if (!vma_read_trylock(vma))
> goto inval;
>
> --
> 2.39.0
>
On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <[email protected]> wrote:
> On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <[email protected]> wrote:
> >
> > +locking maintainers
>
> Thanks! I'll CC the locking maintainers in the next posting.
>
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > + up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else. So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
>
> But for deleting VMA one would need to write-lock the vma->lock first,
> which I assume can't happen until this up_read() is complete. Is that
> assumption wrong?
__up_read() does:
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
RWSEM_FLAG_WAITERS)) {
clear_nonspinnable(sem);
rwsem_wake(sem);
}
The atomic_long_add_return_release() is the point where we are doing
the main lock-releasing.
So if a reader dropped the read-lock while someone else was waiting on
the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
lock together with it, the reader also does clear_nonspinnable() and
rwsem_wake() afterwards.
But in rwsem_down_write_slowpath(), after we've set
RWSEM_FLAG_WAITERS, we can return successfully immediately once
rwsem_try_write_lock() sees that there are no active readers or
writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
succeeds). We're not necessarily waiting for the "nonspinnable" bit or
the wake.
So yeah, I think down_write() can return successfully before up_read()
is done with its memory accesses.
(Spinlocks are different - the kernel relies on being able to drop
references via spin_unlock() in some places.)
On Tue, Jan 17, 2023 at 1:54 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the
> > > > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > > differ. If they match then there must be a vma exclusive lock
> > > > held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > > vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > > held exclusively and the current mm counter is noted to the vma
> > > > side. This will allow multiple vmas to be locked under a single
> > > > mmap_lock write lock (e.g. during vma merging). The vma counter
> > > > is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> >
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/[email protected]/.
> > If other names are preferable I'm open to changing them.
>
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
>
> vma_start_read()
> vma_end_read()
> vma_start_write()
> vma_end_write()
> vma_downgrade_write()
A couple of corrections: we would have to have vma_start_tryread() and
vma_end_write_all(). Also, there is no vma_downgrade_write();
mmap_write_downgrade() simply does vma_end_write_all().
>
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?
>
> Would that reduce people's confusion?
>
> > >
> > > > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > > > vma locks held. It doesn't pair with a specific
> > > > vma_write_lock! It is done before exclusive mmap_lock is
> > > > released by incrementing mm sequence counter (mm_lock_seq).
> > > > - write downgrade - if the mmap_lock is downgraded to the read
> > > > lock all vma write locks are released as well (effectively
> > > > same as write unlock).
> > > --
> > > Michal Hocko
> > > SUSE Labs
On Tue, Jan 17, 2023 at 1:46 PM Jann Horn <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <[email protected]> wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <[email protected]> wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > + up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
>
> __up_read() does:
>
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> RWSEM_FLAG_WAITERS)) {
> clear_nonspinnable(sem);
> rwsem_wake(sem);
> }
>
> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
>
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
>
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.
>
> (Spinlocks are different - the kernel relies on being able to drop
> references via spin_unlock() in some places.)
Thanks for bringing this up. I can add rcu_read_lock()/rcu_read_unlock()
as you suggested and that would fix the issue because we free VMAs from
call_rcu(). However, this feels to me like an issue of rw_semaphore
design: this locking pattern is unsafe and might lead to a UAF.
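That is, something along the lines of the pattern Jann quoted, which leans on the call_rcu() freeing to keep the rwsem's memory valid until up_read() has finished touching it:

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	rcu_read_lock();	/* keep the vma (and its rwsem) alive */
	up_read(&vma->lock);
	rcu_read_unlock();
}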
On Tue, Jan 17, 2023 at 7:07 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > */
> > *new = data_race(*orig);
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > + vma_init_lock(new);
> > dup_anon_vma_name(orig, new);
> > }
> > return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > seqcount_init(&mm->write_protect_seq);
> > mmap_init_lock(mm);
> > INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
>
> The mm shouldn't be visible so why WRITE_ONCE?
True. Will change to a simple assignment.
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 12:31 PM Michal Hocko <[email protected]> wrote:
>
> On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
> [...]
> > > Then yes, that's a starvable lock. Preventing starvation on the mmap
> > > sem was the original motivation for making rwsems non-starvable, so
> > > changing that behaviour now seems like a bad idea. For efficiency, I'd
> > > suggest that a waiting writer set the top bit of the counter. That way,
> > > all new readers will back off without needing to check a second variable
> > > and old readers will know that they *may* need to do the wakeup when
> > > atomic_sub_return_release() is negative.
> > >
> > > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > > that far; the important point is that the waiting writer indicates its
> > > presence in the count field so that readers can modify their behaviour)
> >
> > Got it. Ok, I think we can figure something out to check if there are
> > waiting write-lockers and prevent new readers from taking the lock.
>
> Reinventing locking primitives is a ticket to weird bugs. I would stick
> with the rwsem and deal with performance fallouts after it is clear that
> the core idea is generally acceptable and based on actual real life
> numbers. This whole thing is quite big enough that we do not have to go
> through "is this new synchronization primitive correct and behaving
> reasonably" exercise.
Point taken. That's one of the reasons I kept this patch separate.
I'll drop this last patch from the series for now. One correction
though: this would not be a performance fallout but a memory
consumption one.
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 12:36 PM Jann Horn <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 8:51 PM Jann Horn <[email protected]> wrote:
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan <[email protected]> wrote:
> > > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > Suggested-by: Peter Xu <[email protected]>
> > > ---
> > > mm/memory.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 20806bc8b4eb..12508f4d845a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > if (!vma->anon_vma)
> > > goto inval;
> > >
> > > + /*
> > > + * Due to the possibility of userfault handler dropping mmap_lock, avoid
> > > + * it for now and fall back to page fault handling under mmap_lock.
> > > + */
> > > + if (userfaultfd_armed(vma))
> > > + goto inval;
> >
> > This looks racy wrt concurrent userfaultfd_register(). I think you'll
> > want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
>
> I still think this change is needed...
Yes, I think you are right. I'll move the check after locking the VMA. Thanks!
>
> > and ensure that the userfaultfd code write-locks the VMA before
> > changing the __VM_UFFD_FLAGS in vma->vm_flags.
>
> Ah, but now I see you already took care of this half of the issue with
> the reset_vm_flags() change in
> https://lore.kernel.org/linux-mm/[email protected]/
> .
>
>
> > > if (!vma_read_trylock(vma))
> > > goto inval;
> > >
> > > --
> > > 2.39.0
> > >
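One possible shape of that reordering inside lock_vma_under_rcu() (a sketch, not the v2 patch): do the userfaultfd check only after the vma is read-locked, and drop the lock before falling back.

	if (!vma_read_trylock(vma))
		goto inval;

	/*
	 * Due to the possibility of the userfault handler dropping
	 * mmap_lock, avoid it for now and fall back to page fault
	 * handling under mmap_lock. Checking after locking closes the
	 * race with a concurrent userfaultfd_register(), which
	 * write-locks the vma before setting __VM_UFFD_FLAGS.
	 */
	if (userfaultfd_armed(vma)) {
		vma_read_unlock(vma);
		goto inval;
	}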
On Tue, Jan 17, 2023 at 12:28 PM Jann Horn <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <[email protected]> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
>
>
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
>
> Related: Please add the new VMA lock to the big lock ordering comments
> at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
> add file VMA support.)
Thanks for the clarifications and the warning. I'll add appropriate
comments and will take this deadlocking scenario into account when
later implementing support for file-backed page faults.
On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <[email protected]> wrote:
>
> +locking maintainers
Thanks! I'll CC the locking maintainers in the next posting.
>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > + up_read(&vma->lock);
> > +}
>
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else. So if you want to protect against concurrent
> deletion, this might have to be something like:
>
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();
But for deleting VMA one would need to write-lock the vma->lock first,
which I assume can't happen until this up_read() is complete. Is that
assumption wrong?
>
> But I'm not entirely sure about that, the locking folks might know better.
>
> Also, it might not matter given that the rw_semaphore part is removed
> in the current patch 41/41 anyway...
This does matter because Michal suggested dropping that last 41/41
patch for now.
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
[...]
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma, *validate;
> +
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> +retry:
> + if (!vma)
> + goto inval;
> +
> + /* Only anonymous vmas are supported for now */
> + if (!vma_is_anonymous(vma))
> + goto inval;
> +
> + if (!vma_read_trylock(vma))
> + goto inval;
> +
> + /* Check since vm_start/vm_end might change before we lock the VMA */
> + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> + vma_read_unlock(vma);
> + goto inval;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + mas.index = address;
> + validate = mas_walk(&mas);
Question for Maple Tree experts:
Are you allowed to use mas_walk() like this? If the first mas_walk()
call encountered a single-entry tree, it would store mas->node =
MAS_ROOT, right? And then the second call would go into
mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
true, and then mas_state_walk() would return the result of
mas_start(), which is NULL? And we'd end up with mas_walk() returning
NULL on the second run even though the tree hasn't changed?
> + if (validate != vma) {
> + vma_read_unlock(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one. */
> + vma = validate;
> + goto retry;
> + }
> +
> + rcu_read_unlock();
> + return vma;
> +inval:
> + rcu_read_unlock();
> + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> + return NULL;
> +}
On Tue, Jan 17, 2023 at 7:04 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the the
> vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> differ. If they match then there must be a vma exclusive lock
> held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
> vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
> held exclusively and the current mm counter is noted to the vma
> side. This will allow multiple vmas to be locked under a single
> mmap_lock write lock (e.g. during vma merging). The vma counter
> is modified under exclusive vma lock.
> - write unlock - (vma_write_unlock_mm) is a batch release of all
> vma locks held. It doesn't pair with a specific
> vma_write_lock! It is done before exclusive mmap_lock is
> released by incrementing mm sequence counter (mm_lock_seq).
> - write downgrade - if the mmap_lock is downgraded to the read
> lock all vma write locks are released as well (effectivelly
> same as write unlock).
Thanks for the suggestion, Michal. I'll definitely reuse your description.
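To make the write-side usage above concrete, here is a minimal sketch of how an mm writer is expected to use these primitives; the update function and its body are placeholders made up for illustration, only the locking calls come from this patchset:

```
/*
 * Illustrative only: the update itself is a placeholder, the locking
 * calls are the ones introduced by this patchset.
 */
static void example_update(struct mm_struct *mm,
			   struct vm_area_struct *vma1,
			   struct vm_area_struct *vma2)
{
	mmap_write_lock(mm);

	/*
	 * Mark every VMA that will be modified. Readers that later call
	 * vma_read_trylock() see vm_lock_seq == mm_lock_seq and fall
	 * back to mmap_lock.
	 */
	vma_write_lock(vma1);
	vma_write_lock(vma2);

	/* ... modify vma1 and vma2 (merge, split, etc.) ... */

	/*
	 * mmap_write_unlock() calls vma_write_unlock_mm(), which bumps
	 * mm->mm_lock_seq and thereby releases all marked VMAs in bulk.
	 */
	mmap_write_unlock(mm);
}
```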
>
> > VMA lock is placed on the cache line boundary so that its 'count' field
> > falls into the first cache line while the rest of the fields fall into
> > the second cache line. This lets the 'count' field to be cached with
> > other frequently accessed fields and used quickly in uncontended case
> > while 'owner' and other fields used in the contended case will not
> > invalidate the first cache line while waiting on the lock.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/mm.h | 80 +++++++++++++++++++++++++++++++++++++++
> > include/linux/mm_types.h | 8 ++++
> > include/linux/mmap_lock.h | 13 +++++++
> > kernel/fork.c | 4 ++
> > mm/init-mm.c | 3 ++
> > 5 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..ec2c4c227d51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -612,6 +612,85 @@ struct vm_operations_struct {
> > unsigned long addr);
> > };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > + init_rwsem(&vma->lock);
> > + vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_write_lock(struct vm_area_struct *vma)
> > +{
> > + int mm_lock_seq;
> > +
> > + mmap_assert_write_locked(vma->vm_mm);
> > +
> > + /*
> > + * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > + * mm->mm_lock_seq can't be concurrently modified.
> > + */
> > + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > + if (vma->vm_lock_seq == mm_lock_seq)
> > + return;
> > +
> > + down_write(&vma->lock);
> > + vma->vm_lock_seq = mm_lock_seq;
> > + up_write(&vma->lock);
> > +}
> > +
> > +/*
> > + * Try to read-lock a vma. The function is allowed to occasionally yield false
> > + * locked result to avoid performance overhead, in which case we fall back to
> > + * using mmap_lock. The function should never yield false unlocked result.
> > + */
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > + /* Check before locking. A race might cause false locked result. */
> > + if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + return false;
> > +
> > + if (unlikely(down_read_trylock(&vma->lock) == 0))
> > + return false;
> > +
> > + /*
> > + * Overflow might produce false locked result.
> > + * False unlocked result is impossible because we modify and check
> > + * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> > + * modification invalidates all existing locks.
> > + */
> > + if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > + up_read(&vma->lock);
> > + return false;
> > + }
> > + return true;
> > +}
> > +
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > + up_read(&vma->lock);
> > +}
> > +
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > +{
> > + mmap_assert_write_locked(vma->vm_mm);
> > + /*
> > + * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > + * mm->mm_lock_seq can't be concurrently modified.
> > + */
> > + VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > +static inline void vma_write_lock(struct vm_area_struct *vma) {}
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > + { return false; }
> > +static inline void vma_read_unlock(struct vm_area_struct *vma) {}
> > +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > {
> > static const struct vm_operations_struct dummy_vm_ops = {};
> > @@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > vma->vm_mm = mm;
> > vma->vm_ops = &dummy_vm_ops;
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > + vma_init_lock(vma);
> > }
> >
> > static inline void vma_set_anonymous(struct vm_area_struct *vma)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index d5cdec1314fe..5f7c5ca89931 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -555,6 +555,11 @@ struct vm_area_struct {
> > pgprot_t vm_page_prot;
> > unsigned long vm_flags; /* Flags, see mm.h. */
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + int vm_lock_seq;
> > + struct rw_semaphore lock;
> > +#endif
> > +
> > /*
> > * For areas with an address space and backing store,
> > * linkage into the address_space->i_mmap interval tree.
> > @@ -680,6 +685,9 @@ struct mm_struct {
> > * init_mm.mmlist, and are protected
> > * by mmlist_lock
> > */
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + int mm_lock_seq;
> > +#endif
> >
> >
> > unsigned long hiwater_rss; /* High-watermark of RSS usage */
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index e49ba91bb1f0..40facd4c398b 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
> > VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > }
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm)
> > +{
> > + mmap_assert_write_locked(mm);
> > + /* No races during update due to exclusive mmap_lock being held */
> > + WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
> > +}
> > +#else
> > +static inline void vma_write_unlock_mm(struct mm_struct *mm) {}
> > +#endif
> > +
> > static inline void mmap_init_lock(struct mm_struct *mm)
> > {
> > init_rwsem(&mm->mmap_lock);
> > @@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
> > static inline void mmap_write_unlock(struct mm_struct *mm)
> > {
> > __mmap_lock_trace_released(mm, true);
> > + vma_write_unlock_mm(mm);
> > up_write(&mm->mmap_lock);
> > }
> >
> > static inline void mmap_write_downgrade(struct mm_struct *mm)
> > {
> > __mmap_lock_trace_acquire_returned(mm, false, true);
> > + vma_write_unlock_mm(mm);
> > downgrade_write(&mm->mmap_lock);
> > }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
> > */
> > *new = data_race(*orig);
> > INIT_LIST_HEAD(&new->anon_vma_chain);
> > + vma_init_lock(new);
> > dup_anon_vma_name(orig, new);
> > }
> > return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > seqcount_init(&mm->write_protect_seq);
> > mmap_init_lock(mm);
> > INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
> > mm_pgtables_bytes_init(mm);
> > mm->map_count = 0;
> > mm->locked_vm = 0;
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index c9327abb771c..33269314e060 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -37,6 +37,9 @@ struct mm_struct init_mm = {
> > .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> > .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> > .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + .mm_lock_seq = 0,
> > +#endif
> > .user_ns = &init_user_ns,
> > .cpu_bitmap = CPU_BITS_NONE,
> > #ifdef CONFIG_IOMMU_SVA
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs
* Jann Horn <[email protected]> [230117 16:04]:
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
> [...]
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma, *validate;
> > +
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > +retry:
> > + if (!vma)
> > + goto inval;
> > +
> > + /* Only anonymous vmas are supported for now */
> > + if (!vma_is_anonymous(vma))
> > + goto inval;
> > +
> > + if (!vma_read_trylock(vma))
> > + goto inval;
> > +
> > + /* Check since vm_start/vm_end might change before we lock the VMA */
> > + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > + vma_read_unlock(vma);
> > + goto inval;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + mas.index = address;
> > + validate = mas_walk(&mas);
>
> Question for Maple Tree experts:
>
> Are you allowed to use mas_walk() like this? If the first mas_walk()
> call encountered a single-entry tree, it would store mas->node =
> MAS_ROOT, right? And then the second call would go into
> mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
> true, and then mas_state_walk() would return the result of
> mas_start(), which is NULL? And we'd end up with mas_walk() returning
> NULL on the second run even though the tree hasn't changed?
This is safe for VMAs. There might be a bug in the tree regarding
re-walking with a pointer, but it won't matter here.
A single entry tree will be a pointer if the entry is of the range 0 - 0
(mas.index == 0, mas.last == 0). This would be a zero sized VMA - which
is not valid.
The second walk will check if the maple node is dead and restart the
walk if it is dead. If the node isn't dead (almost always the case),
then it will be a very quick walk.
After a mas_walk(), the maple state has mas.index = vma->vm_start
and mas.last = (vma->vm_end - 1). The address is set prior to the second
walk in case of a vma split, where mas.index from the first walk could
end up on the other side of the split from the address.
>
> > + if (validate != vma) {
> > + vma_read_unlock(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one. */
> > + vma = validate;
> > + goto retry;
> > + }
> > +
> > + rcu_read_unlock();
> > + return vma;
> > +inval:
> > + rcu_read_unlock();
> > + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > + return NULL;
> > +}
On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> >
> > I have to say I was struggling a bit with the above and only understood
> > what you mean by reading the patch several times. I would phrase it like
> > this (feel free to use if you consider this to be an improvement).
> >
> > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > per-vma and per-mm sequence counters to note exclusive locking:
> > - read lock - (implemented by vma_read_trylock) requires the the
> > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > differ. If they match then there must be a vma exclusive lock
> > held somewhere.
> > - read unlock - (implemented by vma_read_unlock) is a trivial
> > vma->lock unlock.
> > - write lock - (vma_write_lock) requires the mmap_lock to be
> > held exclusively and the current mm counter is noted to the vma
> > side. This will allow multiple vmas to be locked under a single
> > mmap_lock write lock (e.g. during vma merging). The vma counter
> > is modified under exclusive vma lock.
>
> Didn't realize one more thing.
> Unlike standard write lock this implementation allows to be
> called multiple times under a single mmap_lock. In a sense
> it is more of mark_vma_potentially_modified than a lock.
In the RFC it was called vma_mark_locked() originally and renames were
discussed in the email thread ending here:
https://lore.kernel.org/all/[email protected]/.
If other names are preferable I'm open to changing them.
>
> > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > vma locks held. It doesn't pair with a specific
> > vma_write_lock! It is done before exclusive mmap_lock is
> > released by incrementing mm sequence counter (mm_lock_seq).
> > - write downgrade - if the mmap_lock is downgraded to the read
> > lock all vma write locks are released as well (effectivelly
> > same as write unlock).
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 02:36:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 1:46 PM Jann Horn <[email protected]> wrote:
> > On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <[email protected]> wrote:
> > > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <[email protected]> wrote:
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else. So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > >
> > > But for deleting VMA one would need to write-lock the vma->lock first,
> > > which I assume can't happen until this up_read() is complete. Is that
> > > assumption wrong?
> >
> > __up_read() does:
> >
> > rwsem_clear_reader_owned(sem);
> > tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> > DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> > if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> > RWSEM_FLAG_WAITERS)) {
> > clear_nonspinnable(sem);
> > rwsem_wake(sem);
> > }
> >
> > The atomic_long_add_return_release() is the point where we are doing
> > the main lock-releasing.
> >
> > So if a reader dropped the read-lock while someone else was waiting on
> > the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> > lock together with it, the reader also does clear_nonspinnable() and
> > rwsem_wake() afterwards.
> > But in rwsem_down_write_slowpath(), after we've set
> > RWSEM_FLAG_WAITERS, we can return successfully immediately once
> > rwsem_try_write_lock() sees that there are no active readers or
> > writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> > succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> > the wake.
> >
> > So yeah, I think down_write() can return successfully before up_read()
> > is done with its memory accesses.
> >
> > (Spinlocks are different - the kernel relies on being able to drop
> > references via spin_unlock() in some places.)
>
> Thanks for bringing this up. I can add rcu_read_{lock/unlock) as you
> suggested and that would fix the issue because we free VMAs from
> call_rcu(). However this feels to me as an issue of rw_semaphore
> design that this locking pattern is unsafe and might lead to UAF.
We have/had this problem with normal mutexes too. It was the impetus
for adding the struct completion which is very careful to not touch
anything after the completion is, well, completed.
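For reference, the adjusted unlock following this suggestion would look roughly like the sketch below; it relies on VMAs being freed via call_rcu(), so the RCU read-side section keeps the rw_semaphore alive until up_read() has finished touching it:

```
static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/*
	 * up_read() may still touch the rwsem after another thread has
	 * acquired it, so pin the VMA (freed via call_rcu()) until the
	 * unlock is completely done with the lock word.
	 */
	rcu_read_lock();
	up_read(&vma->lock);
	rcu_read_unlock();
}
```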
On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
>
> Could you describe why only anonymous vmas are handled at this stage and
> what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> doesn't seem to have any anonymous vma specific requirements AFAICS.
TBH I haven't spent too much time looking into file-backed page faults
yet but a couple of tasks I can think of are:
- Ensure that all vma->vm_ops->fault() handlers do not rely on
mmap_lock being read-locked;
- vma->vm_file freeing, like VMA freeing, will need to be done after an
RCU grace period since page fault handlers use it. This will require
some caution because simply adding it into __vm_area_free(), called via
call_rcu(), will cause the corresponding fops->release() to be called
asynchronously (see the sketch below). I had to solve this issue in the
out-of-tree SPF implementation when the asynchronously called
snd_pcm_release() was problematic.
I'm sure I'm missing more potential issues and maybe Matthew and
Michel can pinpoint more things to resolve here?
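To illustrate the vm_file point above (purely hypothetical, not a proposal), naively dropping the file reference from the RCU callback would look like the sketch below, and that is exactly what makes fops->release() run asynchronously:

```
/* Hypothetical sketch of the problem described above, not a proposal. */
static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/*
	 * Dropping the file here means fput() -> fops->release() runs
	 * asynchronously relative to munmap()/exit(), which drivers
	 * like snd_pcm_release() did not tolerate in the SPF work.
	 */
	if (vma->vm_file)
		fput(vma->vm_file);
	kmem_cache_free(vm_area_cachep, vma);
}
```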
>
> Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
> the naming is really the most important part but the rcu locking is
> internal to the function so why should we spread this implementation
> detail to the world...
I wanted the name to indicate that the lookup is done with no locks
held. But I'm open to suggestions.
>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/mm.h | 3 +++
> > mm/memory.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c464fc8a514c..d0fddf6a1de9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > vma);
> > }
> >
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > + unsigned long address);
> > +
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ece18548db1..a658e26d965d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
> > }
> > EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +/*
> > + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
> > + * stable and not isolated. If the VMA is not found or is being modified the
> > + * function returns NULL.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma, *validate;
> > +
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > +retry:
> > + if (!vma)
> > + goto inval;
> > +
> > + /* Only anonymous vmas are supported for now */
> > + if (!vma_is_anonymous(vma))
> > + goto inval;
> > +
> > + if (!vma_read_trylock(vma))
> > + goto inval;
> > +
> > + /* Check since vm_start/vm_end might change before we lock the VMA */
> > + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > + vma_read_unlock(vma);
> > + goto inval;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + mas.index = address;
> > + validate = mas_walk(&mas);
> > + if (validate != vma) {
> > + vma_read_unlock(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one. */
> > + vma = validate;
> > + goto retry;
> > + }
> > +
> > + rcu_read_unlock();
> > + return vma;
> > +inval:
> > + rcu_read_unlock();
> > + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > + return NULL;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > #ifndef __PAGETABLE_P4D_FOLDED
> > /*
> > * Allocate p4d page table.
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions.
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.
Sorry, should have done that in the first place. A 4.3% regression was
noticed when running execl test from unixbench suite. spawn test also
showed 1.6% regression. Profiling revealed that vma freeing was taking
longer due to call_rcu() which is slow when RCU callback offloading is
enabled. I asked Paul McKenney and he explained to me that because the
callbacks are offloaded to some other kthread, possibly running on
some other CPU, it is necessary to use explicit locking. Locking on a
per-call_rcu() basis would result in excessive contention during
callback flooding. So, by batching call_rcu() work we cut that
overhead and reduce this lock contention.
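Roughly, the batching looks like the sketch below; the vm_free_next field, the helper names and the batch size are made up for illustration, flushing of a partially filled batch is omitted, and the actual patch differs in detail:

```
#define VMA_FREE_BATCH	32	/* illustrative threshold */

static struct vm_area_struct *vma_free_head;	/* pending, not yet queued */
static int vma_free_count;
static DEFINE_SPINLOCK(vma_free_lock);

static void vm_area_free_batch_cb(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* One grace period has elapsed for the whole batch; free it. */
	while (vma) {
		struct vm_area_struct *next = vma->vm_free_next;

		kmem_cache_free(vm_area_cachep, vma);
		vma = next;
	}
}

void vm_area_free(struct vm_area_struct *vma)
{
	struct vm_area_struct *batch = NULL;

	spin_lock(&vma_free_lock);
	vma->vm_free_next = vma_free_head;	/* hypothetical field */
	vma_free_head = vma;
	if (++vma_free_count >= VMA_FREE_BATCH) {
		batch = vma_free_head;
		vma_free_head = NULL;
		vma_free_count = 0;
	}
	spin_unlock(&vma_free_lock);

	/* One call_rcu() per batch instead of one per VMA. */
	if (batch)
		call_rcu(&batch->vm_rcu, vm_area_free_batch_cb);
}
```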
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > Move VMA flag modification (which now implies VMA locking) before
> > anon_vma_lock_write to match the locking order of page fault handler.
>
> Does this changelog assumes per vma locking in the #PF?
Hmm, you are right. Page fault handlers do not use per-vma locks yet
but the changelog already talks about that. Maybe I should change it
to simply:
```
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked.
```
Is that better?
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 6:25 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
> [...]
> > void vm_area_free(struct vm_area_struct *vma)
> > {
> > free_anon_vma_name(vma);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + call_rcu(&vma->vm_rcu, __vm_area_free);
> > +#else
> > kmem_cache_free(vm_area_cachep, vma);
> > +#endif
>
> Is it safe to have vma with already freed vma_name? I suspect this is
> safe because of mmap_lock but is there any reason to split the freeing
> process and have this potential UAF lurking?
It should be safe because VMA is either locked or has been isolated
while locked, so no page fault handlers should have access to it. But
you are right, moving free_anon_vma_name() into __vm_area_free() does
seem safer. Will make the change in the next rev.
>
> > }
> >
> > static void account_kernel_stack(struct task_struct *tsk, int account)
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs
On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > Assert there are no holders of VMA lock for reading when it is about to be
> > destroyed.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/mm.h | 8 ++++++++
> > kernel/fork.c | 2 ++
> > 2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 594e835bad9c..c464fc8a514c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > }
> >
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > +{
> > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > + vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > + vma);
>
> Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> us something is wrong on its own, no? This could be somebody racing with
> the vma destruction and using the write lock. Unlikely but I do not see
> why to narrow debugging scope.
I wanted to ensure there are no page fault handlers (read-lockers)
when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
could trigger if someone is concurrently calling vma_write_lock(). But
I don't think we expect someone to be write-locking the VMA while we
are destroying it, so you are right, I'm overcomplicating things here.
I think I can get rid of vma_assert_no_reader() and add
VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
__vm_area_free(). WDYT?
> --
> Michal Hocko
> SUSE Labs
>
On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > page fault handling. When VMA is not found, can't be locked or changes
> > > after being locked, the function returns NULL. The lookup is performed
> > > under RCU protection to prevent the found VMA from being destroyed before
> > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > the results.
> > > For now only anonymous VMAs can be searched this way. In other cases the
> > > function returns NULL.
> >
> > Could you describe why only anonymous vmas are handled at this stage and
> > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > doesn't seem to have any anonymous vma specific requirements AFAICS.
>
> TBH I haven't spent too much time looking into file-backed page faults
> yet but a couple of tasks I can think of are:
> - Ensure that all vma->vm_ops->fault() handlers do not rely on
> mmap_lock being read-locked;
I think this way lies madness. There are just too many device drivers
that implement ->fault. My plan is to call the ->map_pages() method
under RCU without even read-locking the VMA. If that doesn't satisfy
the fault, then drop all the way back to taking the mmap_sem for read
before calling into ->fault.
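A very rough sketch of that flow (not real kernel code; try_map_pages_under_rcu() is a hypothetical stand-in for calling ->map_pages() with only RCU protection):

```
static vm_fault_t file_fault_lockless(struct vm_fault *vmf)
{
	vm_fault_t ret;

	/* Fast path: try to map pages from the page cache under RCU only. */
	rcu_read_lock();
	ret = try_map_pages_under_rcu(vmf);	/* hypothetical helper */
	rcu_read_unlock();

	if (ret != VM_FAULT_RETRY)
		return ret;	/* satisfied without taking any lock */

	/*
	 * Not resident (or otherwise needs ->fault()): tell the caller
	 * to retry the whole fault with mmap_lock held for read.
	 */
	return VM_FAULT_RETRY;
}
```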
On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > anon_vma_lock_write to match the locking order of page fault handler.
> >
> > Does this changelog assumes per vma locking in the #PF?
>
> Hmm, you are right. Page fault handlers do not use per-vma locks yet
> but the changelog already talks about that. Maybe I should change it
> to simply:
> ```
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked.
Because ....
Without that additional reasoning it is not really clear why that is
needed and it seems arbitrary.
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 21:54:58, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the the
> > > > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > > differ. If they match then there must be a vma exclusive lock
> > > > held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > > vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > > held exclusively and the current mm counter is noted to the vma
> > > > side. This will allow multiple vmas to be locked under a single
> > > > mmap_lock write lock (e.g. during vma merging). The vma counter
> > > > is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> >
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/[email protected]/.
> > If other names are preferable I'm open to changing them.
>
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
>
> vma_start_read()
> vma_end_read()
> vma_start_write()
> vma_end_write()
> vma_downgrade_write()
>
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?
Agreed!
> Would that reduce people's confusion?
Yes I believe that naming it less like a locking primitive will clarify
it. vma_{start,end}_[try]read is better indeed. I am wondering about the
write side of things because that is where things get confusing. There
is no explicit write lock nor unlock. vma_start_write sounds better than
the vma_write_lock but it still lacks that pairing with vma_end_write
which is never the right thing to call. Wouldn't vma_mark_modified and
vma_publish_changes describe the scheme better?
The downgrade case is probably the least interesting one because that is
just a one-off thing that can be completely hidden from any code besides
mmap_write_downgrade, so I wouldn't be too concerned about that one.
But as you say, no need to bikeshed this too much. Great naming is hard
and if the scheme is documented properly we can live with suboptimal
naming as well.
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 21:28:06, Jann Horn wrote:
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <[email protected]> wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
OK, I see. This would be a great addendum to the changelog.
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
This, right?
#PF                          retract_page_tables
vma_read_lock
                             i_mmap_lock_write
i_mmap_lock_read
                             vma_write_lock
I might be missing something but I have only found huge_pmd_share to be
called from the #PF path. That one should be safe as it cannot be a
target for THP. Not that it would matter much because such a dependency
chain would be really subtle.
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
> <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > Assert there are no holders of VMA lock for reading when it is about to be
> > > destroyed.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > include/linux/mm.h | 8 ++++++++
> > > kernel/fork.c | 2 ++
> > > 2 files changed, 10 insertions(+)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 594e835bad9c..c464fc8a514c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > > }
> > >
> > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > +{
> > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > + vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > + vma);
> >
> > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > us something is wrong on its own, no? This could be somebody racing with
> > the vma destruction and using the write lock. Unlikely but I do not see
> > why to narrow debugging scope.
>
> I wanted to ensure there are no page fault handlers (read-lockers)
> when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> could trigger if someone is concurrently calling vma_write_lock(). But
> I don't think we expect someone to be write-locking the VMA while we
That would be UAF, no?
> are destroying it, so you are right, I'm overcomplicating things here.
> I think I can get rid of vma_assert_no_reader() and add
> VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> __vm_area_free(). WDYT?
Yes, that adds some debugging. Not sure it is really necessary but it
is a VM_BUG_ON so why not.
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed.
> >
> > What kind of regressions.
> >
> > > To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > Please add some data to justify this additional complexity.
>
> Sorry, should have done that in the first place. A 4.3% regression was
> noticed when running execl test from unixbench suite. spawn test also
> showed 1.6% regression. Profiling revealed that vma freeing was taking
> longer due to call_rcu() which is slow when RCU callback offloading is
> enabled.
Could you be more specific? vma freeing is async with RCU so how
come this has resulted in a regression? Is there any heavy
rcu_synchronize in the exec path? That would be interesting
information.
--
Michal Hocko
SUSE Labs
On Tue 17-01-23 19:02:55, Jann Horn wrote:
> +locking maintainers
>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > + up_read(&vma->lock);
> > +}
>
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else.
Yes, I think you are right. From a look into the code it seems that
the UAF is quite unlikely as there is a ton of work to be done between
vma_write_lock used to prepare vma for removal and actual removal.
That doesn't make it less of a problem though.
> So if you want to protect against concurrent
> deletion, this might have to be something like:
>
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();
>
> But I'm not entirely sure about that, the locking folks might know better.
I am not a locking expert but to me it looks like this should work
because the final cleanup would have to happen rcu_read_unlock.
Thanks, I have completely missed this aspect of the locking when looking
into the code.
Btw. looking at this again I have fully realized how hard it is actually
to see that vm_area_free is guaranteed to sync up with ongoing readers.
vma manipulation functions like __adjust_vma make my head spin. Would it
make more sense to have a rcu style synchronization point in
vm_area_free directly before call_rcu? This would add an overhead of
uncontended down_write of course.
--
Michal Hocko
SUSE Labs
On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko <[email protected]> wrote:
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <[email protected]> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                          retract_page_tables
> vma_read_lock
>                              i_mmap_lock_write
> i_mmap_lock_read
>                              vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.
Oops, yeah. Now that I'm looking closer I also don't see a path from
the #PF path to i_mmap_lock_read. Sorry for sending you on a wild
goose chase.
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> Page fault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to write-lock all
> VMAs and prevent this race with fault handlers that would hold VMA locks.
> VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> locking order as in page fault handlers.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/mmap.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30c7d1c5206e..a256deca0bc0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> * of mm/rmap.c:
> * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> * hugetlb mapping);
> + * - all vmas marked locked
The existing comment above says that this is an *ordered* listing of
which locks are taken.
> * - all i_mmap_rwsem locks;
> * - all anon_vma->rwseml
> *
> @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> mas_for_each(&mas, vma, ULONG_MAX) {
> if (signal_pending(current))
> goto out_unlock;
> + vma_write_lock(vma);
> if (vma->vm_file && vma->vm_file->f_mapping &&
> is_vm_hugetlb_page(vma))
> vm_lock_mapping(mm, vma->vm_file->f_mapping);
Note that multiple VMAs can have the same ->f_mapping, so with this,
the lock ordering between VMA locks and the mapping locks of hugetlb
VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
->f_mapping, then the following operations happen:
1. lock VMA 1
2. lock mapping of VMAs 1 and 2
3. lock VMA 2
4. [second vm_lock_mapping() is a no-op]
So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
took the mapping lock first.
The existing code has one loop per lock type to ensure that the locks
really are taken in the specified order, even when some of the locks
are associated with multiple VMAs.
If we don't care about the ordering between these two, maybe that's
fine and you just have to adjust the comment; but it would be clearer
to add a separate loop for the VMA locks.
> @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> }
> + vma_write_unlock_mm(mm);
>
> mutex_unlock(&mm_all_locks_mutex);
> }
> --
> 2.39.0
>
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
> On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > +locking maintainers
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > + up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
>
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.
>
> > So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
> >
> > But I'm not entirely sure about that, the locking folks might know better.
>
> I am not a locking expert but to me it looks like this should work
> because the final cleanup would have to happen rcu_read_unlock.
>
> Thanks, I have completely missed this aspect of the locking when looking
> into the code.
>
> Btw. looking at this again I have fully realized how hard it is actually
> to see that vm_area_free is guaranteed to sync up with ongoing readers.
> vma manipulation functions like __adjust_vma make my head spin. Would it
> make more sense to have a rcu style synchronization point in
> vm_area_free directly before call_rcu? This would add an overhead of
> uncontended down_write of course.
Something along those lines might be a good idea, but I think that
rather than synchronizing the removal, it should maybe be something
that splats (and bails out?) if it detects pending readers. If we get
to vm_area_free() on a VMA that has pending readers, we might already
be in a lot of trouble because the concurrent readers might have been
traversing page tables while we were tearing them down or fun stuff
like that.
I think maybe Suren was already talking about something like that in
another part of this patch series but I don't remember...
On Wed 18-01-23 14:23:32, Jann Horn wrote:
> On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
> > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > +locking maintainers
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > + up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else.
> >
> > Yes, I think you are right. From a look into the code it seems that
> > the UAF is quite unlikely as there is a ton of work to be done between
> > vma_write_lock used to prepare vma for removal and actual removal.
> > That doesn't make it less of a problem though.
> >
> > > So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> > >
> > > But I'm not entirely sure about that, the locking folks might know better.
> >
> > I am not a locking expert but to me it looks like this should work
> > because the final cleanup would have to happen rcu_read_unlock.
> >
> > Thanks, I have completely missed this aspect of the locking when looking
> > into the code.
> >
> > Btw. looking at this again I have fully realized how hard it is actually
> > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > vma manipulation functions like __adjust_vma make my head spin. Would it
> > make more sense to have a rcu style synchronization point in
> > vm_area_free directly before call_rcu? This would add an overhead of
> > uncontended down_write of course.
>
> Something along those lines might be a good idea, but I think that
> rather than synchronizing the removal, it should maybe be something
> that splats (and bails out?) if it detects pending readers. If we get
> to vm_area_free() on a VMA that has pending readers, we might already
> be in a lot of trouble because the concurrent readers might have been
> traversing page tables while we were tearing them down or fun stuff
> like that.
>
> I think maybe Suren was already talking about something like that in
> another part of this patch series but I don't remember...
This http://lkml.kernel.org/r/[email protected]?
--
Michal Hocko
SUSE Labs
On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
> > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > +locking maintainers
> > > >
> > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > > [...]
> > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > +{
> > > > > + up_read(&vma->lock);
> > > > > +}
> > > >
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else.
> > >
> > > Yes, I think you are right. From a look into the code it seems that
> > > the UAF is quite unlikely as there is a ton of work to be done between
> > > vma_write_lock used to prepare vma for removal and actual removal.
> > > That doesn't make it less of a problem though.
> > >
> > > > So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > > >
> > > > But I'm not entirely sure about that, the locking folks might know better.
> > >
> > > I am not a locking expert but to me it looks like this should work
> > > because the final cleanup would have to happen rcu_read_unlock.
> > >
> > > Thanks, I have completely missed this aspect of the locking when looking
> > > into the code.
> > >
> > > Btw. looking at this again I have fully realized how hard it is actually
> > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > make more sense to have a rcu style synchronization point in
> > > vm_area_free directly before call_rcu? This would add an overhead of
> > > uncontended down_write of course.
> >
> > Something along those lines might be a good idea, but I think that
> > rather than synchronizing the removal, it should maybe be something
> > that splats (and bails out?) if it detects pending readers. If we get
> > to vm_area_free() on a VMA that has pending readers, we might already
> > be in a lot of trouble because the concurrent readers might have been
> > traversing page tables while we were tearing them down or fun stuff
> > like that.
> >
> > I think maybe Suren was already talking about something like that in
> > another part of this patch series but I don't remember...
>
> This http://lkml.kernel.org/r/[email protected]?
Yes, I spent a lot of time ensuring that __adjust_vma locks the right
VMAs and that VMAs are freed or isolated under VMA write lock
protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
Michal mentioned gets hit, then it's a bug in my design and I'll have
to fix it. But please, let's not add synchronize_rcu() in
vm_area_free(). That will slow down any path that frees a VMA,
especially the exit path, which might be freeing thousands of them. I
had an SPF version with synchronize_rcu() in the vm_area_free() and
phone vendors started yelling at me the very next day. call_rcu() with
CONFIG_RCU_NOCB_CPU (which Android uses for power saving purposes) is
already bad enough to show up in the benchmarks and that's why I had
to add call_rcu() batching in
https://lore.kernel.org/all/[email protected].
>
> --
> Michal Hocko
> SUSE Labs
>
On Wed, Jan 18, 2023 at 4:51 AM Jann Horn <[email protected]> wrote:
>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > Page fault handlers might need to fire MMU notifications while a new
> > notifier is being registered. Modify mm_take_all_locks to write-lock all
> > VMAs and prevent this race with fault handlers that would hold VMA locks.
> > VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> > locking order as in page fault handlers.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/mmap.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 30c7d1c5206e..a256deca0bc0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> > * of mm/rmap.c:
> > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> > * hugetlb mapping);
> > + * - all vmas marked locked
>
> The existing comment above says that this is an *ordered* listing of
> which locks are taken.
>
> > * - all i_mmap_rwsem locks;
> > * - all anon_vma->rwseml
> > *
> > @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> > mas_for_each(&mas, vma, ULONG_MAX) {
> > if (signal_pending(current))
> > goto out_unlock;
> > + vma_write_lock(vma);
> > if (vma->vm_file && vma->vm_file->f_mapping &&
> > is_vm_hugetlb_page(vma))
> > vm_lock_mapping(mm, vma->vm_file->f_mapping);
>
> Note that multiple VMAs can have the same ->f_mapping, so with this,
> the lock ordering between VMA locks and the mapping locks of hugetlb
> VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
> ->f_mapping, then the following operations happen:
>
> 1. lock VMA 1
> 2. lock mapping of VMAs 1 and 2
> 3. lock VMA 2
> 4. [second vm_lock_mapping() is a no-op]
>
> So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
> took the mapping lock first.
>
> The existing code has one loop per lock type to ensure that the locks
> really are taken in the specified order, even when some of the locks
> are associated with multiple VMAs.
>
> If we don't care about the ordering between these two, maybe that's
> fine and you just have to adjust the comment; but it would be clearer
> to add a separate loop for the VMA locks.
Oh, thanks for pointing out this detail. A separate loop is definitely
needed here. Will do that in the next respin.
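A minimal sketch of that separate loop, reusing the mas_for_each() iteration
style already in mm_take_all_locks() and the vma_write_lock() helper from this
series; a sketch of the idea, not the final patch:

	/*
	 * Hedged sketch: write-lock every VMA in its own pass so that the
	 * VMA lock is always taken before any i_mmap_rwsem, even when
	 * several VMAs share the same ->f_mapping.
	 */
	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		vma_write_lock(vma);
	}

	mas_set(&mas, 0);
	mas_for_each(&mas, vma, ULONG_MAX) {
		if (signal_pending(current))
			goto out_unlock;
		if (vma->vm_file && vma->vm_file->f_mapping &&
		    is_vm_hugetlb_page(vma))
			vm_lock_mapping(mm, vma->vm_file->f_mapping);
	}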
>
> > @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> > if (vma->vm_file && vma->vm_file->f_mapping)
> > vm_unlock_mapping(vma->vm_file->f_mapping);
> > }
> > + vma_write_unlock_mm(mm);
> >
> > mutex_unlock(&mm_all_locks_mutex);
> > }
> > --
> > 2.39.0
> >
On Wed, Jan 18, 2023 at 1:43 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Tue 17-01-23 17:53:00, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
> > <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > > > Assert there are no holders of VMA lock for reading when it is about to be
> > > > destroyed.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > include/linux/mm.h | 8 ++++++++
> > > > kernel/fork.c | 2 ++
> > > > 2 files changed, 10 insertions(+)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 594e835bad9c..c464fc8a514c 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > > > }
> > > >
> > > > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > > > +{
> > > > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > > > + vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > > > + vma);
> > >
> > > Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> > > us something is wrong on its own, no? This could be somebody racing with
> > > the vma destruction and using the write lock. Unlikely but I do not see
> > > why to narrow debugging scope.
> >
> > I wanted to ensure there are no page fault handlers (read-lockers)
> > when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
> > could trigger if someone is concurrently calling vma_write_lock(). But
> > I don't think we expect someone to be write-locking the VMA while we
>
> That would be UAF, no?
Yes. That's why what I have is overkill (and it's also racy).
>
> > are destroying it, so you are right, I'm overcomplicating things here.
> > I think I can get rid of vma_assert_no_reader() and add
> > VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
> > __vm_area_free(). WDYT?
>
> Yes, that adds some debugging. Not sure it is really necessary buyt it
> is VM_BUG_ON so why not.
I would like to keep it if possible. If it triggers, that would be a
clear signal of what the issue is. Otherwise such a corner case might
be hard to debug.
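A minimal sketch of that simpler check, assuming the vm_rcu and lock field
names used in this series and the existing vm_area_cachep slab; a sketch of
the idea rather than the actual patch:

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* No reader (or writer) may still hold the per-VMA lock here. */
	VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
	kmem_cache_free(vm_area_cachep, vma);
}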
> --
> Michal Hocko
> SUSE Labs
>
On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed.
> > >
> > > What kind of regressions.
> > >
> > > > To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > Please add some data to justify this additional complexity.
> >
> > Sorry, should have done that in the first place. A 4.3% regression was
> > noticed when running execl test from unixbench suite. spawn test also
> > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > longer due to call_rcu() which is slow when RCU callback offloading is
> > enabled.
>
> Could you be more specific? vma freeing is async with the RCU so how
> come this has resulted in a regression? Is there any heavy
> rcu_synchronize in the exec path? That would be an interesting
> information.
No, there is no heavy rcu_synchronize() or any other additional
synchronous load in the exit path. It's the call_rcu() which can block
the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
other call_rcu()'s going on in parallel. Note that call_rcu() calls
rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
revealed that this function was taking multiple ms (don't recall the
actual number, sorry). Paul's explanation implied that this happens
due to contention on the locks taken in this function. For more
in-depth details I'll have to ask Paul for help :) This code is quite
complex and I don't know all the details of RCU implementation.
> --
> Michal Hocko
> SUSE Labs
On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > anon_vma_lock_write to match the locking order of page fault handler.
> > >
> > > Does this changelog assumes per vma locking in the #PF?
> >
> > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > but the changelog already talks about that. Maybe I should change it
> > to simply:
> > ```
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked.
>
> Because ....
because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.
>
> Withtout that additional reasonaning it is not really clear why that is
> needed and seems arbitrary.
Would the above be a good reasoning?
>
> --
> Michal Hocko
> SUSE Labs
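For illustration, a hedged sketch of the ordering being discussed;
vma_write_lock() is the helper from this series, the flag value is a
placeholder, and the surrounding __vma_adjust() code is omitted:

	/* Lock and modify the VMA first ... */
	vma_write_lock(vma);
	vma->vm_flags = new_flags;	/* placeholder flag update only */

	/* ... so the THP adjustment below also runs under the VMA write lock. */
	vma_adjust_trans_huge(vma, start, end, adjust_next);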
On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed.
> > > >
> > > > What kind of regressions.
> > > >
> > > > > To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > Please add some data to justify this additional complexity.
> > >
> > > Sorry, should have done that in the first place. A 4.3% regression was
> > > noticed when running execl test from unixbench suite. spawn test also
> > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > enabled.
> >
> > Could you be more specific? vma freeing is async with the RCU so how
> > come this has resulted in a regression? Is there any heavy
> > rcu_synchronize in the exec path? That would be an interesting
> > information.
>
> No, there is no heavy rcu_synchronize() or any other additional
> synchronous load in the exit path. It's the call_rcu() which can block
> the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> other call_rcu()'s going on in parallel. Note that call_rcu() calls
> rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> revealed that this function was taking multiple ms (don't recall the
> actual number, sorry). Paul's explanation implied that this happens
> due to contention on the locks taken in this function. For more
> in-depth details I'll have to ask Paul for help :) This code is quite
> complex and I don't know all the details of RCU implementation.
There are a couple of possibilities here.
First, if I am remembering correctly, the time between the call_rcu()
and invocation of the corresponding callback was taking multiple seconds,
but that was because the kernel was built with CONFIG_LAZY_RCU=y in
order to save power by batching RCU work over multiple call_rcu()
invocations. If this is causing a problem for a given call site, the
shiny new call_rcu_hurry() can be used instead. Doing this gets back
to the old-school non-laziness, but can of course consume more power.
Second, there is a much shorter one-jiffy delay between the call_rcu()
and the invocation of the corresponding callback in kernels built with
either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
of this delay is to avoid lock contention, and so this delay is incurred
only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
invoking call_rcu() at least 16 times within a given jiffy will incur
the added delay. The reason for this delay is the use of a separate
->nocb_bypass list. As Suren says, this bypass list is used to reduce
lock contention on the main ->cblist. This is not needed in old-school
kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
(including most datacenter kernels) because in that case the callbacks
enqueued by call_rcu() are touched only by the corresponding CPU, so
that there is no need for locks.
Third, if you are instead seeing multiple milliseconds of CPU consumed by
call_rcu() in the common case (for example, without the aid of interrupts,
NMIs, or SMIs), please do let me know. That sounds to me like a bug.
Or have I lost track of some other slow case?
Thanx, Paul
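For reference, a hedged one-line sketch of the call_rcu_hurry() option Paul
mentions, assuming the vm_rcu field and the RCU callback used for VMA freeing
in this series:

	/* Same signature as call_rcu(), but opts out of lazy batching,
	 * trading some power for lower callback latency. */
	call_rcu_hurry(&vma->vm_rcu, __vm_area_free);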
On Wed, Jan 18, 2023 at 02:26:39PM +0800, Hillf Danton wrote:
> On Tue, Jan 17, 2023 at 10:27 AM Matthew Wilcox <[email protected]> wrote:
> >
> > The cpu_relax() is exactly the wrong thing to do here. See this thread:
> > https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> If you are right, feel free to go and remove every cpu_relax() under the
> kernel/locking directory.
I see you didn't read the whole thread where Linus points out that a
cmpxchg() loop is fundamentally different from a spinlock.
On Wed, Jan 18, 2023 at 1:40 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko <[email protected]> wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
I'll add Jann's description in the changelog. Thanks Jann!
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF retract_page_tables
> vma_read_lock
> i_mmap_lock_write
> i_mmap_lock_read
> vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.
> --
> Michal Hocko
> SUSE Labs
On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed.
> > > > >
> > > > > What kind of regressions.
> > > > >
> > > > > > To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > Please add some data to justify this additional complexity.
> > > >
> > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > noticed when running execl test from unixbench suite. spawn test also
> > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > enabled.
> > >
> > > Could you be more specific? vma freeing is async with the RCU so how
> > > come this has resulted in a regression? Is there any heavy
> > > rcu_synchronize in the exec path? That would be an interesting
> > > information.
> >
> > No, there is no heavy rcu_synchronize() or any other additional
> > synchronous load in the exit path. It's the call_rcu() which can block
> > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > revealed that this function was taking multiple ms (don't recall the
> > actual number, sorry). Paul's explanation implied that this happens
> > due to contention on the locks taken in this function. For more
> > in-depth details I'll have to ask Paul for help :) This code is quite
> > complex and I don't know all the details of RCU implementation.
>
> There are a couple of possibilities here.
>
> First, if I am remembering correctly, the time between the call_rcu()
> and invocation of the corresponding callback was taking multiple seconds,
> but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> order to save power by batching RCU work over multiple call_rcu()
> invocations. If this is causing a problem for a given call site, the
> shiny new call_rcu_hurry() can be used instead. Doing this gets back
> to the old-school non-laziness, but can of course consume more power.
That would not be the case because CONFIG_LAZY_RCU was not an option
at the time I was profiling this issue.
Lazy RCU would be a great option to replace this patch but
unfortunately it's not the default behavior, so I would still have to
implement this batching in case lazy RCU is not enabled.
>
> Second, there is a much shorter one-jiffy delay between the call_rcu()
> and the invocation of the corresponding callback in kernels built with
> either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> of this delay is to avoid lock contention, and so this delay is incurred
> only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> invoking call_rcu() at least 16 times within a given jiffy will incur
> the added delay. The reason for this delay is the use of a separate
> ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> lock contention on the main ->cblist. This is not needed in old-school
> kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> (including most datacenter kernels) because in that case the callbacks
> enqueued by call_rcu() are touched only by the corresponding CPU, so
> that there is no need for locks.
I believe this is the reason in my profiled case.
>
> Third, if you are instead seeing multiple milliseconds of CPU consumed by
> call_rcu() in the common case (for example, without the aid of interrupts,
> NMIs, or SMIs), please do let me know. That sounds to me like a bug.
I don't think I've seen such a case.
Thanks for clarifications, Paul!
>
> Or have I lost track of some other slow case?
>
> Thanx, Paul
On Wed, Jan 18, 2023 at 11:01:08AM -0800, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Jan 18, 2023 at 10:04:39AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 1:49 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Tue 17-01-23 17:19:46, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed.
> > > > > >
> > > > > > What kind of regressions.
> > > > > >
> > > > > > > To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > Please add some data to justify this additional complexity.
> > > > >
> > > > > Sorry, should have done that in the first place. A 4.3% regression was
> > > > > noticed when running execl test from unixbench suite. spawn test also
> > > > > showed 1.6% regression. Profiling revealed that vma freeing was taking
> > > > > longer due to call_rcu() which is slow when RCU callback offloading is
> > > > > enabled.
> > > >
> > > > Could you be more specific? vma freeing is async with the RCU so how
> > > > come this has resulted in a regression? Is there any heavy
> > > > rcu_synchronize in the exec path? That would be an interesting
> > > > information.
> > >
> > > No, there is no heavy rcu_synchronize() or any other additional
> > > synchronous load in the exit path. It's the call_rcu() which can block
> > > the caller if CONFIG_RCU_NOCB_CPU is enabled and there are lots of
> > > other call_rcu()'s going on in parallel. Note that call_rcu() calls
> > > rcu_nocb_try_bypass() if CONFIG_RCU_NOCB_CPU is enabled and profiling
> > > revealed that this function was taking multiple ms (don't recall the
> > > actual number, sorry). Paul's explanation implied that this happens
> > > due to contention on the locks taken in this function. For more
> > > in-depth details I'll have to ask Paul for help :) This code is quite
> > > complex and I don't know all the details of RCU implementation.
> >
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations. If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead. Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
>
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Laxy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
>
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay. The reason for this delay is the use of a separate
> > ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist. This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
>
> I believe this is the reason in my profiled case.
>
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know. That sounds to me like a bug.
>
> I don't think I've seen such a case.
Whew!!! ;-)
> Thanks for clarifications, Paul!
No problem!
Thanx, Paul
On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> <[email protected]> wrote:
> >
> > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
> > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > +locking maintainers
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > locked.
> > > > > [...]
> > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + up_read(&vma->lock);
> > > > > > +}
> > > > >
> > > > > One thing that might be gnarly here is that I think you might not be
> > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > from what I remember, I think that up_read() (unlike something like
> > > > > spin_unlock()) can access the lock object after it's already been
> > > > > acquired by someone else.
> > > >
> > > > Yes, I think you are right. From a look into the code it seems that
> > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > That doesn't make it less of a problem though.
> > > >
> > > > > So if you want to protect against concurrent
> > > > > deletion, this might have to be something like:
> > > > >
> > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > up_read(&vma->lock);
> > > > > rcu_read_unlock();
> > > > >
> > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > >
> > > > I am not a locking expert but to me it looks like this should work
> > > > because the final cleanup would have to happen rcu_read_unlock.
> > > >
> > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > into the code.
> > > >
> > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > make more sense to have a rcu style synchronization point in
> > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > uncontended down_write of course.
> > >
> > > Something along those lines might be a good idea, but I think that
> > > rather than synchronizing the removal, it should maybe be something
> > > that splats (and bails out?) if it detects pending readers. If we get
> > > to vm_area_free() on a VMA that has pending readers, we might already
> > > be in a lot of trouble because the concurrent readers might have been
> > > traversing page tables while we were tearing them down or fun stuff
> > > like that.
> > >
> > > I think maybe Suren was already talking about something like that in
> > > another part of this patch series but I don't remember...
> >
> > This http://lkml.kernel.org/r/[email protected]?
>
> Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> VMAs and that VMAs are freed or isolated under VMA write lock
> protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> Michal mentioned gets hit then it's a bug in my design and I'll have
> to fix it. But please, let's not add synchronize_rcu() in the
> vm_area_free().
Just to clarify. I didn't suggest adding synchronize_rcu into
vm_area_free. What I really meant was a synchronize_rcu-like primitive to
effectively synchronize with any potential pending read locker (so
something like vma_write_lock, or whatever it is called). The point is
that vma freeing is an event all readers should be notified about.
This can be done explicitly for each and every vma before vm_area_free
is called, but this is just hard to review and easy to break over time.
See my point?
--
Michal Hocko
SUSE Labs
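A minimal sketch of the kind of synchronization point Michal describes,
assuming the per-VMA rw_semaphore (vma->lock) and RCU freeing from this
series; the uncontended down_write()/up_write() pair is the overhead he
mentions:

void vm_area_free(struct vm_area_struct *vma)
{
	/*
	 * Sketch only: drain any pending read lockers (page fault
	 * handlers) before the VMA is handed to RCU for freeing.
	 */
	down_write(&vma->lock);
	up_write(&vma->lock);

	call_rcu(&vma->vm_rcu, __vm_area_free);
}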
On Tue, Jan 17, 2023 at 6:44 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > > page fault handling. When VMA is not found, can't be locked or changes
> > > > after being locked, the function returns NULL. The lookup is performed
> > > > under RCU protection to prevent the found VMA from being destroyed before
> > > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > > the results.
> > > > For now only anonymous VMAs can be searched this way. In other cases the
> > > > function returns NULL.
> > >
> > > Could you describe why only anonymous vmas are handled at this stage and
> > > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > > doesn't seem to have any anonymous vma specific requirements AFAICS.
> >
> > TBH I haven't spent too much time looking into file-backed page faults
> > yet but a couple of tasks I can think of are:
> > - Ensure that all vma->vm_ops->fault() handlers do not rely on
> > mmap_lock being read-locked;
>
> I think this way lies madness. There are just too many device drivers
> that implement ->fault. My plan is to call the ->map_pages() method
> under RCU without even read-locking the VMA. If that doesn't satisfy
> the fault, then drop all the way back to taking the mmap_sem for read
> before calling into ->fault.
Sounds reasonable to me but I guess the devil is in the details...
>
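A hedged control-flow sketch of the fallback Matthew outlines above;
do_fault_under_rcu() is a made-up helper name, while ->map_pages(), ->fault()
and mmap_read_lock() are the existing interfaces:

	vm_fault_t ret;

	/* First try to satisfy the fault via ->map_pages() under RCU,
	 * without taking even the per-VMA read lock. */
	rcu_read_lock();
	ret = do_fault_under_rcu(vmf);	/* made-up helper calling ->map_pages() */
	rcu_read_unlock();
	if (!(ret & VM_FAULT_RETRY))
		return ret;

	/* Not satisfied: fall all the way back to mmap_lock in read mode
	 * before calling into ->fault(). */
	mmap_read_lock(vmf->vma->vm_mm);
	ret = vmf->vma->vm_ops->fault(vmf);
	mmap_read_unlock(vmf->vma->vm_mm);
	return ret;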
On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <[email protected]> wrote:
>
> On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > > >
> > > > > Does this changelog assumes per vma locking in the #PF?
> > > >
> > > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > > but the changelog already talks about that. Maybe I should change it
> > > > to simply:
> > > > ```
> > > > Move VMA flag modification (which now implies VMA locking) before
> > > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > > has been locked.
> > >
> > > Because ....
> >
> > because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
>
> So it will become:
> Move VMA flag modification (which now implies VMA locking) before
> vma_adjust_trans_huge() to ensure the modifications are done after VMA
> has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.
>
> which is effectivelly saying
> vma_adjust_trans_huge() modifies the VMA and such modifications should
> be done under VMA write-lock protection so move VMA flag modifications
> before so all of them are covered by the same write protection.
>
> right?
Yes, and the wording in the latter version is simpler to understand
IMO, so I would like to adopt it. Do you agree?
> --
> Michal Hocko
> SUSE Labs
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
>
> On Wed 18-01-23 09:36:44, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
> > <[email protected]> wrote:
> > >
> > > On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > > > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <[email protected]> wrote:
> > > > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > > > +locking maintainers
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > > > locked.
> > > > > > [...]
> > > > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > + up_read(&vma->lock);
> > > > > > > +}
> > > > > >
> > > > > > One thing that might be gnarly here is that I think you might not be
> > > > > > allowed to use up_read() to fully release ownership of an object -
> > > > > > from what I remember, I think that up_read() (unlike something like
> > > > > > spin_unlock()) can access the lock object after it's already been
> > > > > > acquired by someone else.
> > > > >
> > > > > Yes, I think you are right. From a look into the code it seems that
> > > > > the UAF is quite unlikely as there is a ton of work to be done between
> > > > > vma_write_lock used to prepare vma for removal and actual removal.
> > > > > That doesn't make it less of a problem though.
> > > > >
> > > > > > So if you want to protect against concurrent
> > > > > > deletion, this might have to be something like:
> > > > > >
> > > > > > rcu_read_lock(); /* keeps vma alive */
> > > > > > up_read(&vma->lock);
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > But I'm not entirely sure about that, the locking folks might know better.
> > > > >
> > > > > I am not a locking expert but to me it looks like this should work
> > > > > because the final cleanup would have to happen rcu_read_unlock.
> > > > >
> > > > > Thanks, I have completely missed this aspect of the locking when looking
> > > > > into the code.
> > > > >
> > > > > Btw. looking at this again I have fully realized how hard it is actually
> > > > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > > > make more sense to have a rcu style synchronization point in
> > > > > vm_area_free directly before call_rcu? This would add an overhead of
> > > > > uncontended down_write of course.
> > > >
> > > > Something along those lines might be a good idea, but I think that
> > > > rather than synchronizing the removal, it should maybe be something
> > > > that splats (and bails out?) if it detects pending readers. If we get
> > > > to vm_area_free() on a VMA that has pending readers, we might already
> > > > be in a lot of trouble because the concurrent readers might have been
> > > > traversing page tables while we were tearing them down or fun stuff
> > > > like that.
> > > >
> > > > I think maybe Suren was already talking about something like that in
> > > > another part of this patch series but I don't remember...
> > >
> > > This http://lkml.kernel.org/r/[email protected]?
> >
> > Yes, I spent a lot of time ensuring that __adjust_vma locks the right
> > VMAs and that VMAs are freed or isolated under VMA write lock
> > protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
> > Michal mentioned gets hit then it's a bug in my design and I'll have
> > to fix it. But please, let's not add synchronize_rcu() in the
> > vm_area_free().
>
> Just to clarify. I didn't suggest to add synchronize_rcu into
> vm_area_free. What I really meant was synchronize_rcu like primitive to
> effectivelly synchronize with any potential pending read locker (so
> something like vma_write_lock (or whatever it is called). The point is
> that vma freeing is an event all readers should be notified about.
I don't think readers need to be notified if we are ensuring that the
VMA is not used by anyone else and is not reachable by the readers.
This is currently done by write-locking the VMA either before removing
it from the tree or before freeing it.
> This can be done explicitly for each and every vma before vm_area_free
> is called but this is just hard to review and easy to break over time.
> See my point?
I understand your point now, and if we really need that, one way would
be to have a VMA refcount (like Laurent had in his version of the SPF
implementation). I don't think the current implementation needs that
level of VMA lifetime control unless I missed some location that should
take the lock and does not.
>
> --
> Michal Hocko
> SUSE Labs
On Wed 18-01-23 10:09:29, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:23 AM Michal Hocko <[email protected]> wrote:
> >
> > On Tue 17-01-23 18:01:01, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > > > > Move VMA flag modification (which now implies VMA locking) before
> > > > > anon_vma_lock_write to match the locking order of page fault handler.
> > > >
> > > > Does this changelog assumes per vma locking in the #PF?
> > >
> > > Hmm, you are right. Page fault handlers do not use per-vma locks yet
> > > but the changelog already talks about that. Maybe I should change it
> > > to simply:
> > > ```
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked.
> >
> > Because ....
>
> because vma_adjust_trans_huge() modifies the VMA and such
> modifications should be done under VMA write-lock protection.
So it will become:
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
modifications should be done under VMA write-lock protection.
which is effectively saying
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done under VMA write-lock protection so move VMA flag modifications
before so all of them are covered by the same write protection.
right?
--
Michal Hocko
SUSE Labs
On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <[email protected]> wrote:
[...]
> > So it will become:
> > Move VMA flag modification (which now implies VMA locking) before
> > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > modifications should be done under VMA write-lock protection.
> >
> > which is effectivelly saying
> > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > be done under VMA write-lock protection so move VMA flag modifications
> > before so all of them are covered by the same write protection.
> >
> > right?
>
> Yes, and the wording in the latter version is simpler to understand
> IMO, so I would like to adopt it. Do you agree?
of course.
--
Michal Hocko
SUSE Labs
On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
[...]
> > There are a couple of possibilities here.
> >
> > First, if I am remembering correctly, the time between the call_rcu()
> > and invocation of the corresponding callback was taking multiple seconds,
> > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > order to save power by batching RCU work over multiple call_rcu()
> > invocations. If this is causing a problem for a given call site, the
> > shiny new call_rcu_hurry() can be used instead. Doing this gets back
> > to the old-school non-laziness, but can of course consume more power.
>
> That would not be the case because CONFIG_LAZY_RCU was not an option
> at the time I was profiling this issue.
> Laxy RCU would be a great option to replace this patch but
> unfortunately it's not the default behavior, so I would still have to
> implement this batching in case lazy RCU is not enabled.
>
> >
> > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > and the invocation of the corresponding callback in kernels built with
> > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> > of this delay is to avoid lock contention, and so this delay is incurred
> > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > invoking call_rcu() at least 16 times within a given jiffy will incur
> > the added delay. The reason for this delay is the use of a separate
> > ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> > lock contention on the main ->cblist. This is not needed in old-school
> > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > (including most datacenter kernels) because in that case the callbacks
> > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > that there is no need for locks.
>
> I believe this is the reason in my profiled case.
>
> >
> > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > call_rcu() in the common case (for example, without the aid of interrupts,
> > NMIs, or SMIs), please do let me know. That sounds to me like a bug.
>
> I don't think I've seen such a case.
> Thanks for clarifications, Paul!
Thanks for the explanation, Paul. I have to say this has caught me by
surprise. There are just not enough details about the benchmark to
understand what is going on, but I find it rather surprising that
call_rcu can induce a higher overhead than the actual kmem_cache_free
which is the callback. My naive understanding has been that call_rcu is
a really fast way to defer execution to an RCU-safe context to do the
final cleanup.
--
Michal Hocko
SUSE Labs
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed. To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.
After some more clarification I can understand how call_rcu might not be
super happy about thousands of callbacks to be invoked and I do agree
that this is not really optimal.
On the other hand I do not like this solution much either.
VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
much with processes with a huge number of vmas either. It would still be
thousands of callbacks to be scheduled without a good reason.
Instead, are there any other cases than remove_vma that need this
batching? We could easily just link all the vmas into a linked list and
use a single call_rcu instead, no? This would both simplify the
implementation and remove the scaling issue, and we would not have to
argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
--
Michal Hocko
SUSE Labs
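A hedged sketch of the single-call_rcu variant suggested above: the dying VMAs
queued during one mm update are handed to RCU as one batch. The vma_free_batch
structure and function names are illustrative only, the vm_free_list field
matches the list linkage used later in this thread, and allocation-failure
handling is omitted:

struct vma_free_batch {
	struct rcu_head		rcu;
	struct list_head	head;
};

static void free_vma_batch_rcu(struct rcu_head *head)
{
	struct vma_free_batch *batch = container_of(head, struct vma_free_batch, rcu);
	struct vm_area_struct *vma, *next;

	/* One RCU callback frees every VMA queued during the update. */
	list_for_each_entry_safe(vma, next, &batch->head, vm_free_list)
		kmem_cache_free(vm_area_cachep, vma);
	kfree(batch);
}

/* Called once at the end of an mm update instead of one call_rcu() per VMA. */
static void drain_free_vmas(struct mm_struct *mm)
{
	struct vma_free_batch *batch = kmalloc(sizeof(*batch), GFP_KERNEL);

	if (!batch)
		return;		/* sketch: a real version needs a fallback here */

	spin_lock(&mm->vma_free_list.lock);
	list_replace_init(&mm->vma_free_list.head, &batch->head);
	mm->vma_free_list.size = 0;
	spin_unlock(&mm->vma_free_list.lock);

	call_rcu(&batch->rcu, free_vma_batch_rcu);
}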
On Thu, Jan 19, 2023 at 1:31 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 18-01-23 13:48:13, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 1:33 PM Michal Hocko <[email protected]> wrote:
> [...]
> > > So it will become:
> > > Move VMA flag modification (which now implies VMA locking) before
> > > vma_adjust_trans_huge() to ensure the modifications are done after VMA
> > > has been locked. Because vma_adjust_trans_huge() modifies the VMA and such
> > > modifications should be done under VMA write-lock protection.
> > >
> > > which is effectivelly saying
> > > vma_adjust_trans_huge() modifies the VMA and such modifications should
> > > be done under VMA write-lock protection so move VMA flag modifications
> > > before so all of them are covered by the same write protection.
> > >
> > > right?
> >
> > Yes, and the wording in the latter version is simpler to understand
> > IMO, so I would like to adopt it. Do you agree?
>
> of course.
Will update in the next respin. Thanks!
> --
> Michal Hocko
> SUSE Labs
On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> After some more clarification I can understand how call_rcu might not be
> super happy about thousands of callbacks to be invoked and I do agree
> that this is not really optimal.
>
> On the other hand I do not like this solution much either.
> VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> much with processes with a huge number of vmas either. It would still be
> in housands of callbacks to be scheduled without a good reason.
>
> Instead, are there any other cases than remove_vma that need this
> batching? We could easily just link all the vmas into linked list and
> use a single call_rcu instead, no? This would both simplify the
> implementation, remove the scaling issue as well and we do not have to
> argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
Yes, I agree the solution is not stellar. I wanted something simple
but this is probably too simple. OTOH keeping all dead vm_area_structs
on the list without hooking up a shrinker (additional complexity) does
not sound too appealing either. WDYT about time-domain throttling to
limit draining the list to, say, once per second, like this:
void vm_area_free(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	bool drain;

	free_anon_vma_name(vma);

	spin_lock(&mm->vma_free_list.lock);
	list_add(&vma->vm_free_list, &mm->vma_free_list.head);
	mm->vma_free_list.size++;
-	drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+	drain = jiffies > mm->last_drain_tm + HZ;

	spin_unlock(&mm->vma_free_list.lock);

-	if (drain)
+	if (drain) {
		drain_free_vmas(mm);
+		mm->last_drain_tm = jiffies;
+	}
}
Ultimately we want to prevent very frequent call_rcu() calls, so
throttling in the time domain seems appropriate. That's the simplest
way I can think of to address your concern about a quick spike in VMA
freeing. It does not place any restriction on the list size, and we
might accumulate excessive dead vm_area_structs if, after a large spike,
there are no further vm_area_free() calls, but I don't know if that's a
real problem, so I'm not sure we should be addressing it at this time. WDYT?
>
> --
> Michal Hocko
> SUSE Labs
On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > in housands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
>
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either. WDYT about time domain throttling to
> limit draining the list to say once per second like this:
>
> void vm_area_free(struct vm_area_struct *vma)
> {
> struct mm_struct *mm = vma->vm_mm;
> bool drain;
>
> free_anon_vma_name(vma);
>
> spin_lock(&mm->vma_free_list.lock);
> list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> mm->vma_free_list.size++;
> - drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> + drain = jiffies > mm->last_drain_tm + HZ;
>
> spin_unlock(&mm->vma_free_list.lock);
>
> - if (drain)
> + if (drain) {
> drain_free_vmas(mm);
> + mm->last_drain_tm = jiffies;
> + }
> }
>
> Ultimately we want to prevent very frequent call_rcu() calls, so
> throttling in the time domain seems appropriate. That's the simplest
> way I can think of to address your concern about a quick spike in VMA
> freeing. It does not place any restriction on the list size and we
> might have excessive dead vm_area_structs if after a large spike there
> are no vm_area_free() calls but I don't know if that's a real problem,
> so not sure we should be addressing it at this time. WDYT?
Just to double-check, we really did try the very frequent call_rcu()
invocations and we really did see a problem, correct?
Although it is not perfect, call_rcu() is designed to take a fair amount
of abuse. So if we didn't see a real problem, the frequent call_rcu()
invocations might be a bit simpler.
Thanx, Paul
On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> [...]
> > > There are a couple of possibilities here.
> > >
> > > First, if I am remembering correctly, the time between the call_rcu()
> > > and invocation of the corresponding callback was taking multiple seconds,
> > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > order to save power by batching RCU work over multiple call_rcu()
> > > invocations. If this is causing a problem for a given call site, the
> > > shiny new call_rcu_hurry() can be used instead. Doing this gets back
> > > to the old-school non-laziness, but can of course consume more power.
> >
> > That would not be the case because CONFIG_LAZY_RCU was not an option
> > at the time I was profiling this issue.
> > Laxy RCU would be a great option to replace this patch but
> > unfortunately it's not the default behavior, so I would still have to
> > implement this batching in case lazy RCU is not enabled.
> >
> > >
> > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > and the invocation of the corresponding callback in kernels built with
> > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> > > of this delay is to avoid lock contention, and so this delay is incurred
> > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > the added delay. The reason for this delay is the use of a separate
> > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> > > lock contention on the main ->cblist. This is not needed in old-school
> > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > (including most datacenter kernels) because in that case the callbacks
> > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > that there is no need for locks.
> >
> > I believe this is the reason in my profiled case.
> >
> > >
> > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > NMIs, or SMIs), please do let me know. That sounds to me like a bug.
> >
> > I don't think I've seen such a case.
> > Thanks for clarifications, Paul!
>
> Thanks for the explanation Paul. I have to say this has caught me as a
> surprise. There are just not enough details about the benchmark to
> understand what is going on but I find it rather surprising that
> call_rcu can induce a higher overhead than the actual kmem_cache_free
> which is the callback. My naive understanding has been that call_rcu is
> really fast way to defer the execution to the RCU safe context to do the
> final cleanup.
If I am following along correctly (ha!), then your "induce a higher
overhead" should be something like "induce a higher to-kfree() latency".
Of course, there already is a higher latency-to-kfree via call_rcu()
than via a direct call to kfree(), and callback-offload CPUs that are
being flooded with callbacks raise that latency a jiffy or so more in
order to avoid lock contention.
If this becomes a problem, the callback-offloading code can be a bit
smarter about avoiding lock contention, but need to see a real problem
before I make that change. But if there is a real problem I will of
course fix it.
Or did I miss a turn in this discussion?
Thanx, Paul
On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > thousands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either. WDYT about time domain throttling to
> > limit draining the list to say once per second like this:
> >
> > void vm_area_free(struct vm_area_struct *vma)
> > {
> >         struct mm_struct *mm = vma->vm_mm;
> >         bool drain;
> >
> >         free_anon_vma_name(vma);
> >
> >         spin_lock(&mm->vma_free_list.lock);
> >         list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> >         mm->vma_free_list.size++;
> > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > +       drain = jiffies > mm->last_drain_tm + HZ;
> >
> >         spin_unlock(&mm->vma_free_list.lock);
> >
> > -       if (drain)
> > +       if (drain) {
> >                 drain_free_vmas(mm);
> > +               mm->last_drain_tm = jiffies;
> > +       }
> > }
> >
> > Ultimately we want to prevent very frequent call_rcu() calls, so
> > throttling in the time domain seems appropriate. That's the simplest
> > way I can think of to address your concern about a quick spike in VMA
> > freeing. It does not place any restriction on the list size and we
> > might have excessive dead vm_area_structs if after a large spike there
> > are no vm_area_free() calls but I don't know if that's a real problem,
> > so not sure we should be addressing it at this time. WDYT?
>
> Just to double-check, we really did try the very frequent call_rcu()
> invocations and we really did see a problem, correct?
Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
regressions when a process exits and all its VMAs get destroyed,
causing a flood of call_rcu()'s.
>
> Although it is not perfect, call_rcu() is designed to take a fair amount
> of abuse. So if we didn't see a real problem, the frequent call_rcu()
> invocations might be a bit simpler.
>
> Thanx, Paul
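
To make the proposal above concrete, here is a minimal sketch of what
drain_free_vmas() could look like under the batching scheme being discussed:
detach the whole per-mm free list under the spinlock and hand the batch to a
single call_rcu(). The vm_rcu and vm_free_next fields are illustrative
assumptions, not the patch's exact layout.

static void drain_free_vmas_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);
        struct vm_area_struct *next;

        /* Free every VMA in the detached batch. */
        while (vma) {
                next = vma->vm_free_next;
                kmem_cache_free(vm_area_cachep, vma);
                vma = next;
        }
}

void drain_free_vmas(struct mm_struct *mm)
{
        struct vm_area_struct *first = NULL, *vma, *tmp;

        spin_lock(&mm->vma_free_list.lock);
        list_for_each_entry_safe(vma, tmp, &mm->vma_free_list.head, vm_free_list) {
                list_del(&vma->vm_free_list);
                vma->vm_free_next = first;      /* chain for the RCU callback */
                first = vma;
        }
        mm->vma_free_list.size = 0;
        spin_unlock(&mm->vma_free_list.lock);

        if (first)
                call_rcu(&first->vm_rcu, drain_free_vmas_rcu_cb);
}
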
On Thu, Jan 19, 2023 at 11:47:36AM -0800, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 11:20 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Thu, Jan 19, 2023 at 10:52:03AM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either. WDYT about time domain throttling to
> > > limit draining the list to say once per second like this:
> > >
> > > void vm_area_free(struct vm_area_struct *vma)
> > > {
> > >         struct mm_struct *mm = vma->vm_mm;
> > >         bool drain;
> > >
> > >         free_anon_vma_name(vma);
> > >
> > >         spin_lock(&mm->vma_free_list.lock);
> > >         list_add(&vma->vm_free_list, &mm->vma_free_list.head);
> > >         mm->vma_free_list.size++;
> > > -       drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
> > > +       drain = jiffies > mm->last_drain_tm + HZ;
> > >
> > >         spin_unlock(&mm->vma_free_list.lock);
> > >
> > > -       if (drain)
> > > +       if (drain) {
> > >                 drain_free_vmas(mm);
> > > +               mm->last_drain_tm = jiffies;
> > > +       }
> > > }
> > >
> > > Ultimately we want to prevent very frequent call_rcu() calls, so
> > > throttling in the time domain seems appropriate. That's the simplest
> > > way I can think of to address your concern about a quick spike in VMA
> > > freeing. It does not place any restriction on the list size and we
> > > might have excessive dead vm_area_structs if after a large spike there
> > > are no vm_area_free() calls but I don't know if that's a real problem,
> > > so not sure we should be addressing it at this time. WDYT?
> >
> > Just to double-check, we really did try the very frequent call_rcu()
> > invocations and we really did see a problem, correct?
>
> Correct. More specifically with CONFIG_RCU_NOCB_CPU=y we saw
> regressions when a process exits and all its VMAs get destroyed,
> causing a flood of call_rcu()'s.
Thank you for the reminder, real problem needs solution. ;-)
Thanx, Paul
> > Although it is not perfect, call_rcu() is designed to take a fair amount
> > of abuse. So if we didn't see a real problem, the frequent call_rcu()
> > invocations might be a bit simpler.
On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > call_rcu() can take a long time when callback offloading is enabled.
> > > Its use in the vm_area_free can cause regressions in the exit path when
> > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > a list and free them in groups using one call_rcu() call per group.
> >
> > After some more clarification I can understand how call_rcu might not be
> > super happy about thousands of callbacks to be invoked and I do agree
> > that this is not really optimal.
> >
> > On the other hand I do not like this solution much either.
> > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > much with processes with a huge number of vmas either. It would still be
> > thousands of callbacks to be scheduled without a good reason.
> >
> > Instead, are there any other cases than remove_vma that need this
> > batching? We could easily just link all the vmas into linked list and
> > use a single call_rcu instead, no? This would both simplify the
> > implementation, remove the scaling issue as well and we do not have to
> > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
>
> Yes, I agree the solution is not stellar. I wanted something simple
> but this is probably too simple. OTOH keeping all dead vm_area_structs
> on the list without hooking up a shrinker (additional complexity) does
> not sound too appealing either.
I suspect you have missed my idea. I do not really want to keep the list
around or any shrinker. It is dead simple. Collect all vmas in
remove_vma and then call_rcu the whole list at once after the whole list
(be it from exit_mmap or remove_mt). See?
--
Michal Hocko
SUSE Labs
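
A rough sketch of the shape Michal suggests above, reusing the illustrative
vm_free_next chaining and callback from the earlier sketch: the removal path
collects vmas locally and the caller issues exactly one call_rcu() for the
whole batch once it is done. The helper names are hypothetical.

/* Caller-owned batch head; no per-mm list and no shrinker needed. */
static void vma_batch_add(struct vm_area_struct **batch,
                          struct vm_area_struct *vma)
{
        vma->vm_free_next = *batch;
        *batch = vma;
}

static void vma_batch_free_rcu(struct vm_area_struct *batch)
{
        /* One grace period and one callback for the whole removal. */
        if (batch)
                call_rcu(&batch->vm_rcu, drain_free_vmas_rcu_cb);
}

In this shape exit_mmap() and remove_mt() would call vma_batch_add() from
their remove_vma() loops and vma_batch_free_rcu() once at the end.
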
On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> > [...]
> > > > There are a couple of possibilities here.
> > > >
> > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > order to save power by batching RCU work over multiple call_rcu()
> > > > invocations. If this is causing a problem for a given call site, the
> > > > shiny new call_rcu_hurry() can be used instead. Doing this gets back
> > > > to the old-school non-laziness, but can of course consume more power.
> > >
> > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > at the time I was profiling this issue.
> > > Lazy RCU would be a great option to replace this patch but
> > > unfortunately it's not the default behavior, so I would still have to
> > > implement this batching in case lazy RCU is not enabled.
> > >
> > > >
> > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > and the invocation of the corresponding callback in kernels built with
> > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > the added delay. The reason for this delay is the use of a separate
> > > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> > > > lock contention on the main ->cblist. This is not needed in old-school
> > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > (including most datacenter kernels) because in that case the callbacks
> > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > that there is no need for locks.
> > >
> > > I believe this is the reason in my profiled case.
> > >
> > > >
> > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > NMIs, or SMIs), please do let me know. That sounds to me like a bug.
> > >
> > > I don't think I've seen such a case.
> > > Thanks for clarifications, Paul!
> >
> > Thanks for the explanation Paul. I have to say this has caught me by
> > surprise. There are just not enough details about the benchmark to
> > understand what is going on but I find it rather surprising that
> > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > which is the callback. My naive understanding has been that call_rcu is
> > a really fast way to defer the execution to an RCU-safe context to do
> > the final cleanup.
>
> If I am following along correctly (ha!), then your "induce a higher
> overhead" should be something like "induce a higher to-kfree() latency".
Yes, this is expected.
> Of course, there already is a higher latency-to-kfree via call_rcu()
> than via a direct call to kfree(), and callback-offload CPUs that are
> being flooded with callbacks raise that latency a jiffy or so more in
> order to avoid lock contention.
>
> If this becomes a problem, the callback-offloading code can be a bit
> smarter about avoiding lock contention, but I need to see a real problem
> before I make that change. But if there is a real problem I will of
> course fix it.
I believe Suren claims that the call_rcu is really visible in the
exit_mmap case. The time to free the actual vmas shouldn't really be
material for that path. If the freeing happens much later, there could be
some side effects from increased memory consumption, but those should be
marginal. How fast exit_mmap really is should only depend on direct
calls from that path.
But I guess we need some specific numbers from Suren to be sure what is
going on here.
Thanks!
--
Michal Hocko
SUSE Labs
On Fri, Jan 20, 2023 at 09:57:05AM +0100, Michal Hocko wrote:
> On Thu 19-01-23 11:17:07, Paul E. McKenney wrote:
> > On Thu, Jan 19, 2023 at 01:52:14PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-23 11:01:08, Suren Baghdasaryan wrote:
> > > > On Wed, Jan 18, 2023 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> > > [...]
> > > > > There are a couple of possibilities here.
> > > > >
> > > > > First, if I am remembering correctly, the time between the call_rcu()
> > > > > and invocation of the corresponding callback was taking multiple seconds,
> > > > > but that was because the kernel was built with CONFIG_LAZY_RCU=y in
> > > > > order to save power by batching RCU work over multiple call_rcu()
> > > > > invocations. If this is causing a problem for a given call site, the
> > > > > shiny new call_rcu_hurry() can be used instead. Doing this gets back
> > > > > to the old-school non-laziness, but can of course consume more power.
> > > >
> > > > That would not be the case because CONFIG_LAZY_RCU was not an option
> > > > at the time I was profiling this issue.
> > > > Lazy RCU would be a great option to replace this patch but
> > > > unfortunately it's not the default behavior, so I would still have to
> > > > implement this batching in case lazy RCU is not enabled.
> > > >
> > > > >
> > > > > Second, there is a much shorter one-jiffy delay between the call_rcu()
> > > > > and the invocation of the corresponding callback in kernels built with
> > > > > either CONFIG_NO_HZ_FULL=y (but only on CPUs mentioned in the nohz_full
> > > > > or rcu_nocbs kernel boot parameters) or CONFIG_RCU_NOCB_CPU=y (but only
> > > > > on CPUs mentioned in the rcu_nocbs kernel boot parameters). The purpose
> > > > > of this delay is to avoid lock contention, and so this delay is incurred
> > > > > only on CPUs that are queuing callbacks at a rate exceeding 16K/second.
> > > > > This is reduced to a per-jiffy limit, so on a HZ=1000 system, a CPU
> > > > > invoking call_rcu() at least 16 times within a given jiffy will incur
> > > > > the added delay. The reason for this delay is the use of a separate
> > > > > ->nocb_bypass list. As Suren says, this bypass list is used to reduce
> > > > > lock contention on the main ->cblist. This is not needed in old-school
> > > > > kernels built without either CONFIG_NO_HZ_FULL=y or CONFIG_RCU_NOCB_CPU=y
> > > > > (including most datacenter kernels) because in that case the callbacks
> > > > > enqueued by call_rcu() are touched only by the corresponding CPU, so
> > > > > that there is no need for locks.
> > > >
> > > > I believe this is the reason in my profiled case.
> > > >
> > > > >
> > > > > Third, if you are instead seeing multiple milliseconds of CPU consumed by
> > > > > call_rcu() in the common case (for example, without the aid of interrupts,
> > > > > NMIs, or SMIs), please do let me know. That sounds to me like a bug.
> > > >
> > > > I don't think I've seen such a case.
> > > > Thanks for clarifications, Paul!
> > >
> > > Thanks for the explanation Paul. I have to say this has caught me by
> > > surprise. There are just not enough details about the benchmark to
> > > understand what is going on but I find it rather surprising that
> > > call_rcu can induce a higher overhead than the actual kmem_cache_free
> > > which is the callback. My naive understanding has been that call_rcu is
> > > a really fast way to defer the execution to an RCU-safe context to do
> > > the final cleanup.
> >
> > If I am following along correctly (ha!), then your "induce a higher
> > overhead" should be something like "induce a higher to-kfree() latency".
>
> Yes, this is expected.
>
> > Of course, there already is a higher latency-to-kfree via call_rcu()
> > than via a direct call to kfree(), and callback-offload CPUs that are
> > being flooded with callbacks raise that latency a jiffy or so more in
> > order to avoid lock contention.
> >
> > If this becomes a problem, the callback-offloading code can be a bit
> > smarter about avoiding lock contention, but I need to see a real problem
> > before I make that change. But if there is a real problem I will of
> > course fix it.
>
> I believe Suren claims that the call_rcu is really visible in the
> exit_mmap case. The time to free the actual vmas shouldn't really be
> material for that path. If the freeing happens much later, there could be
> some side effects from increased memory consumption, but those should be
> marginal. How fast exit_mmap really is should only depend on direct
> calls from that path.
>
> But I guess we need some specific numbers from Suren to be sure what is
> going on here.
Actually, Suren did discuss these (perhaps offlist) back in August.
I was just being forgetful. :-/
Thanx, Paul
On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > thousands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
>
> Maybe even simpler, since we are hit with this VMA freeing flood
> during exit_mmap (when all VMAs are destroyed), we pass a hint to
> vm_area_free to batch the destruction and all other cases call
> call_rcu()? I don't think there will be other cases of VMA destruction
> floods.
... or have two different call_rcu functions; one for munmap() and
one for exit. It'd be nice to use kmem_cache_free_bulk().
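
A hedged sketch of the kmem_cache_free_bulk() idea Matthew mentions above: an
exit-only RCU callback that walks a batch of dead VMAs (chained through the
same illustrative vm_free_next pointer as in the earlier sketches) and returns
them to the slab cache in chunks rather than one kmem_cache_free() per object.

#define VMA_FREE_BULK 16        /* illustrative chunk size */

static void exit_free_vmas_rcu_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
                                                  vm_rcu);
        void *ptrs[VMA_FREE_BULK];
        size_t nr = 0;

        while (vma) {
                struct vm_area_struct *next = vma->vm_free_next;

                ptrs[nr++] = vma;
                if (nr == VMA_FREE_BULK) {
                        kmem_cache_free_bulk(vm_area_cachep, nr, ptrs);
                        nr = 0;
                }
                vma = next;
        }
        if (nr)
                kmem_cache_free_bulk(vm_area_cachep, nr, ptrs);
}
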
On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
>
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.
Maybe even simpler, since we are hit with this VMA freeing flood
during exit_mmap (when all VMAs are destroyed), we pass a hint to
vm_area_free to batch the destruction and all other cases call
call_rcu()? I don't think there will be other cases of VMA destruction
floods.
>
> >
> > --
> > Michal Hocko
> > SUSE Labs
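
A minimal sketch of the hint idea in Suren's message above: an extra flag
(name assumed here) lets exit_mmap() free the vm_area_struct directly, while
every other caller keeps the per-VMA call_rcu() from the first sketch. Whether
skipping the grace period on exit is actually safe is exactly what the rest of
this thread goes on to debate.

void vm_area_free(struct vm_area_struct *vma, bool unreachable)
{
        free_anon_vma_name(vma);
        if (unreachable) {
                /* exit_mmap(): assume no RCU readers can still see this VMA. */
                kmem_cache_free(vm_area_cachep, vma);
        } else {
                /* munmap() and friends: wait out a grace period first. */
                call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
        }
}
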
On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
>
> On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > a list and free them in groups using one call_rcu() call per group.
> > >
> > > After some more clarification I can understand how call_rcu might not be
> > > super happy about thousands of callbacks to be invoked and I do agree
> > > that this is not really optimal.
> > >
> > > On the other hand I do not like this solution much either.
> > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > much with processes with a huge number of vmas either. It would still be
> > > thousands of callbacks to be scheduled without a good reason.
> > >
> > > Instead, are there any other cases than remove_vma that need this
> > > batching? We could easily just link all the vmas into linked list and
> > > use a single call_rcu instead, no? This would both simplify the
> > > implementation, remove the scaling issue as well and we do not have to
> > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> >
> > Yes, I agree the solution is not stellar. I wanted something simple
> > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > on the list without hooking up a shrinker (additional complexity) does
> > not sound too appealing either.
>
> I suspect you have missed my idea. I do not really want to keep the list
> around or any shrinker. It is dead simple. Collect all vmas in
> remove_vma and then call_rcu the whole list at once after the whole list
> (be it from exit_mmap or remove_mt). See?
Yes, I understood your idea but keeping dead objects until the process
exits even when the system is low on memory (no shrinkers attached)
seems too wasteful. If we do this I would advocate for attaching a
shrinker.
>
> --
> Michal Hocko
> SUSE Labs
On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <[email protected]> wrote:
>
> * Matthew Wilcox <[email protected]> [230120 11:50]:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit. It'd be nice to use kmem_cache_free_bulk().
>
> Do we even need a call_rcu on exit? At the point of freeing the VMAs we
> have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> Once we have obtained the write lock again, I think it's safe to say we
> can just go ahead and free the VMAs directly.
I think that would be still racy if the page fault handler found that
VMA under read-RCU protection but did not lock it yet (no locks are
held yet). If it's preempted, the VMA can be freed and destroyed from
under it without RCU grace period.
>
* Matthew Wilcox <[email protected]> [230120 11:50]:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> >
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
>
> ... or have two different call_rcu functions; one for munmap() and
> one for exit. It'd be nice to use kmem_cache_free_bulk().
Do we even need a call_rcu on exit? At the point of freeing the VMAs we
have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
Once we have obtained the write lock again, I think it's safe to say we
can just go ahead and free the VMAs directly.
On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <[email protected]> wrote:
> >
> > * Matthew Wilcox <[email protected]> [230120 11:50]:
> > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > >
> > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > that this is not really optimal.
> > > > > > > >
> > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > > > >
> > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > >
> > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > not sound too appealing either.
> > > > > >
> > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > (be it from exit_mmap or remove_mt). See?
> > > > >
> > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > shrinker.
> > > >
> > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > vm_area_free to batch the destruction and all other cases call
> > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > floods.
> > >
> > > ... or have two different call_rcu functions; one for munmap() and
> > > one for exit. It'd be nice to use kmem_cache_free_bulk().
> >
> > Do we even need a call_rcu on exit? At the point of freeing the VMAs we
> > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > Once we have obtained the write lock again, I think it's safe to say we
> > can just go ahead and free the VMAs directly.
>
> I think that would be still racy if the page fault handler found that
> VMA under read-RCU protection but did not lock it yet (no locks are
> held yet). If it's preempted, the VMA can be freed and destroyed from
> under it without RCU grace period.
The page fault handler (or whatever other reader -- ptrace, proc, etc)
should have a refcount on the mm_struct, so we can't be in this path
trying to free VMAs. Right?
On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <[email protected]> wrote:
> > >
> > > * Matthew Wilcox <[email protected]> [230120 11:50]:
> > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > >
> > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > that this is not really optimal.
> > > > > > > > >
> > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > > > > >
> > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > >
> > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > not sound too appealing either.
> > > > > > >
> > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > >
> > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > shrinker.
> > > > >
> > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > vm_area_free to batch the destruction and all other cases call
> > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > floods.
> > > >
> > > > ... or have two different call_rcu functions; one for munmap() and
> > > > one for exit. It'd be nice to use kmem_cache_free_bulk().
> > >
> > > Do we even need a call_rcu on exit? At the point of freeing the VMAs we
> > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > Once we have obtained the write lock again, I think it's safe to say we
> > > can just go ahead and free the VMAs directly.
> >
> > I think that would be still racy if the page fault handler found that
> > VMA under read-RCU protection but did not lock it yet (no locks are
> > held yet). If it's preempted, the VMA can be freed and destroyed from
> > under it without RCU grace period.
>
> The page fault handler (or whatever other reader -- ptrace, proc, etc)
> should have a refcount on the mm_struct, so we can't be in this path
> trying to free VMAs. Right?
Hmm. That sounds right. I checked process_mrelease() as well, which
operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
VMAs without freeing them, so we are still good. Michal, do you agree
this is ok?
lock_vma_under_rcu() receives mm as a parameter, so I guess it's
implied that the caller should either mmget() it or operate on
current->mm, so no need to document this requirement?
On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > >
> > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > that this is not really optimal.
> > > > > >
> > > > > > On the other hand I do not like this solution much either.
> > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > >
> > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > >
> > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > not sound too appealing either.
> > > >
> > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > (be it from exit_mmap or remove_mt). See?
> > >
> > > Yes, I understood your idea but keeping dead objects until the process
> > > exits even when the system is low on memory (no shrinkers attached)
> > > seems too wasteful. If we do this I would advocate for attaching a
> > > shrinker.
> >
> > Maybe even simpler, since we are hit with this VMA freeing flood
> > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > vm_area_free to batch the destruction and all other cases call
> > call_rcu()? I don't think there will be other cases of VMA destruction
> > floods.
>
> ... or have two different call_rcu functions; one for munmap() and
> one for exit. It'd be nice to use kmem_cache_free_bulk().
Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head
structure's field, is much more cache-efficient.
The penalty is that there is no callback function to do any cleanup.
There is just a kfree()/kvfree (bulk version where applicable),
nothing else.
Thanx, Paul
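
To illustrate Paul's point, a tiny sketch of the kfree_rcu(p, field) form on a
generic kmalloc()-ed object (struct foo is hypothetical): naming the rcu_head
field is all that is needed and no callback function is written, at the cost
of not being able to run any per-object cleanup before the free.

struct foo {
        struct rcu_head rcu;
        int payload;
};

static void foo_free(struct foo *f)
{
        /* Queued for kfree() after a grace period; batched internally. */
        kfree_rcu(f, rcu);
}
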
On Fri, Jan 20, 2023 at 9:21 AM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Jan 20, 2023 at 04:49:42PM +0000, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > >
> > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > that this is not really optimal.
> > > > > > >
> > > > > > > On the other hand I do not like this solution much either.
> > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > > >
> > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > >
> > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > not sound too appealing either.
> > > > >
> > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > (be it from exit_mmap or remove_mt). See?
> > > >
> > > > Yes, I understood your idea but keeping dead objects until the process
> > > > exits even when the system is low on memory (no shrinkers attached)
> > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > shrinker.
> > >
> > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > vm_area_free to batch the destruction and all other cases call
> > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > floods.
> >
> > ... or have two different call_rcu functions; one for munmap() and
> > one for exit. It'd be nice to use kmem_cache_free_bulk().
>
> Good point, kfree_rcu(p, r) where "r" is the name of the rcu_head
> structure's field, is much more cache-efficient.
>
> The penalty is that there is no callback function to do any cleanup.
> There is just a kfree()/kvfree (bulk version where applicable),
> nothing else.
If Liam's suggestion works then we won't need anything additional. We
will free the vm_area_structs directly on process exit and use
call_rcu() in all other cases. Let's see if Michal knows of any case
which still needs an RCU grace period during exit_mmap.
>
> Thanx, Paul
>
* Suren Baghdasaryan <[email protected]> [230120 12:50]:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Jan 20, 2023 at 09:17:46AM -0800, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:08 AM Liam R. Howlett <[email protected]> wrote:
> > > >
> > > > * Matthew Wilcox <[email protected]> [230120 11:50]:
> > > > > On Fri, Jan 20, 2023 at 08:45:21AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 8:20 AM Suren Baghdasaryan <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > > > > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > > > > > > >
> > > > > > > > > > After some more clarification I can understand how call_rcu might not be
> > > > > > > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > > > > > > that this is not really optimal.
> > > > > > > > > >
> > > > > > > > > > On the other hand I do not like this solution much either.
> > > > > > > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > > > > > > much with processes with a huge number of vmas either. It would still be
> > > > > > > > > > thousands of callbacks to be scheduled without a good reason.
> > > > > > > > > >
> > > > > > > > > > Instead, are there any other cases than remove_vma that need this
> > > > > > > > > > batching? We could easily just link all the vmas into linked list and
> > > > > > > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > > > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > > > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > > > > > > >
> > > > > > > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > > > > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > > > > > > on the list without hooking up a shrinker (additional complexity) does
> > > > > > > > > not sound too appealing either.
> > > > > > > >
> > > > > > > > I suspect you have missed my idea. I do not really want to keep the list
> > > > > > > > around or any shrinker. It is dead simple. Collect all vmas in
> > > > > > > > remove_vma and then call_rcu the whole list at once after the whole list
> > > > > > > > (be it from exit_mmap or remove_mt). See?
> > > > > > >
> > > > > > > Yes, I understood your idea but keeping dead objects until the process
> > > > > > > exits even when the system is low on memory (no shrinkers attached)
> > > > > > > seems too wasteful. If we do this I would advocate for attaching a
> > > > > > > shrinker.
> > > > > >
> > > > > > Maybe even simpler, since we are hit with this VMA freeing flood
> > > > > > during exit_mmap (when all VMAs are destroyed), we pass a hint to
> > > > > > vm_area_free to batch the destruction and all other cases call
> > > > > > call_rcu()? I don't think there will be other cases of VMA destruction
> > > > > > floods.
> > > > >
> > > > > ... or have two different call_rcu functions; one for munmap() and
> > > > > one for exit. It'd be nice to use kmem_cache_free_bulk().
> > > >
> > > > Do we even need a call_rcu on exit? At the point of freeing the VMAs we
> > > > have set the MMF_OOM_SKIP bit and unmapped the vmas under the read lock.
> > > > Once we have obtained the write lock again, I think it's safe to say we
> > > > can just go ahead and free the VMAs directly.
> > >
> > > I think that would be still racy if the page fault handler found that
> > > VMA under read-RCU protection but did not lock it yet (no locks are
> > > held yet). If it's preempted, the VMA can be freed and destroyed from
> > > under it without RCU grace period.
> >
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs. Right?
>
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?
>
> lock_vma_under_rcu() receives mm as a parameter, so I guess it's
> implied that the caller should either mmget() it or operate on
> current->mm, so no need to document this requirement?
It is also implied by the vma->vm_mm link. Otherwise any RCU holder of
the VMA could have an unsafe pointer. In fact, if this isn't true then
we need to change the callers to take the ref count to avoid just this
scenario.
On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
[...]
> > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > should have a refcount on the mm_struct, so we can't be in this path
> > trying to free VMAs. Right?
>
> Hmm. That sounds right. I checked process_mrelease() as well, which
> operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> VMAs without freeing them, so we are still good. Michal, do you agree
> this is ok?
Don't we need RCU protection for the vma lifetime assurance? Jann has
already shown how rwsem is not safe wrt unlock and free without RCU.
--
Michal Hocko
SUSE Labs
On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> >
> > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > a list and free them in groups using one call_rcu() call per group.
> > > >
> > > > After some more clarification I can understand how call_rcu might not be
> > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > that this is not really optimal.
> > > >
> > > > On the other hand I do not like this solution much either.
> > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > much with processes with a huge number of vmas either. It would still be
> > > > thousands of callbacks to be scheduled without a good reason.
> > > >
> > > > Instead, are there any other cases than remove_vma that need this
> > > > batching? We could easily just link all the vmas into linked list and
> > > > use a single call_rcu instead, no? This would both simplify the
> > > > implementation, remove the scaling issue as well and we do not have to
> > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > >
> > > Yes, I agree the solution is not stellar. I wanted something simple
> > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > on the list without hooking up a shrinker (additional complexity) does
> > > not sound too appealing either.
> >
> > I suspect you have missed my idea. I do not really want to keep the list
> > around or any shrinker. It is dead simple. Collect all vmas in
> > remove_vma and then call_rcu the whole list at once after the whole list
> > (be it from exit_mmap or remove_mt). See?
>
> Yes, I understood your idea but keeping dead objects until the process
> exits even when the system is low on memory (no shrinkers attached)
> seems too wasteful. If we do this I would advocate for attaching a
> shrinker.
I am still not sure we are on the same page here. No, vmas shouldn't lie
around until the process exits. I am really suggesting queuing only for
remove_vma paths. You can have a different rcu callback than the one
used for trivial single vma removal paths.
--
Michal Hocko
SUSE Labs
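A rough sketch of the batching Michal is suggesting here, purely for
illustration: it assumes the rcu_head the patchset adds to vm_area_struct
is named vm_rcu, invents a vm_free_next pointer for chaining detached
VMAs, and uses kmem_cache_free() as a stand-in for whatever vm_area_free()
ultimately does:

/* Queue a VMA detached in remove_vma()/remove_mt() onto a local list. */
static void vm_area_free_queue(struct vm_area_struct *vma,
                               struct vm_area_struct **list)
{
        vma->vm_free_next = *list;      /* hypothetical chaining field */
        *list = vma;
}

/* RCU callback that frees the whole batch in one pass. */
static void vm_area_free_list_cb(struct rcu_head *head)
{
        struct vm_area_struct *vma, *next;

        vma = container_of(head, struct vm_area_struct, vm_rcu);
        while (vma) {
                next = vma->vm_free_next;
                kmem_cache_free(vm_area_cachep, vma);
                vma = next;
        }
}

/* One call_rcu() per unmap operation instead of one per VMA. */
static void vm_area_free_batch(struct vm_area_struct *list)
{
        if (list)
                call_rcu(&list->vm_rcu, vm_area_free_list_cb);
}

The trivial single-VMA removal paths would keep their existing per-VMA
callback, as noted above.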
On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> [...]
> > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > should have a refcount on the mm_struct, so we can't be in this path
> > > trying to free VMAs. Right?
> >
> > Hmm. That sounds right. I checked process_mrelease() as well, which
> > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > VMAs without freeing them, so we are still good. Michal, do you agree
> > this is ok?
>
> Don't we need RCU protection for the vma lifetime assurance? Jann has
> already shown how rwsem is not safe wrt unlock and free without RCU.
Jann's case requires a thread freeing the VMA to be blocked on vma
write lock waiting for the vma real lock to be released by a page
fault handler. However exit_mmap() means mm->mm_users==0, which in
turn suggests that there are no racing page fault handlers and no new
page fault handlers will appear. Is that a correct assumption? If so,
then races with page fault handlers can't happen while in exit_mmap().
Any other path (other than page fault handlers), accesses vma->lock
under protection of mmap_lock (for read or write, does not matter).
One exception is when we operate on an isolated VMA, then we don't
need mmap_lock protection, but exit_mmap() does not deal with isolated
VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
protection of mmap_lock in write mode, so races with anything other
than page fault handler should be safe as they are today.
That said, the future possible users of lock_vma_under_rcu() using VMA
without mmap_lock protection will have to ensure mm's stability while
they are using the obtained VMA. IOW they should elevate mm's refcount
and keep it elevated for as long as they are using that VMA, not dropping
it before vma->lock is released. I guess it would be a good idea to document
that requirement in lock_vma_under_rcu() comments if we decide to take
this route.
>
> --
> Michal Hocko
> SUSE Labs
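As an illustration of the requirement Suren describes above for future
lockless users of lock_vma_under_rcu(): the mm must stay pinned until after
the VMA lock is dropped. The error handling and do_something_with() below
are made up for the example:

/* Hypothetical caller that is not a page fault handler. */
if (!mmget_not_zero(mm))                /* pin mm->mm_users first */
        return -ESRCH;

vma = lock_vma_under_rcu(mm, addr);
if (vma) {
        do_something_with(vma);
        vma_read_unlock(vma);           /* drop vma->lock ... */
}
mmput(mm);                              /* ... and only then unpin the mm */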
On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > [...]
> > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > trying to free VMAs. Right?
> > >
> > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > this is ok?
> >
> > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > already shown how rwsem is not safe wrt unlock and free without RCU.
>
> Jann's case requires a thread freeing the VMA to be blocked on vma
> write lock waiting for the vma real lock to be released by a page
> fault handler. However exit_mmap() means mm->mm_users==0, which in
> turn suggests that there are no racing page fault handlers and no new
> page fault handlers will appear. Is that a correct assumption? If so,
> then races with page fault handlers can't happen while in exit_mmap().
> Any other path (other than page fault handlers), accesses vma->lock
> under protection of mmap_lock (for read or write, does not matter).
> One exception is when we operate on an isolated VMA, then we don't
> need mmap_lock protection, but exit_mmap() does not deal with isolated
> VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> protection of mmap_lock in write mode, so races with anything other
> than page fault handler should be safe as they are today.
I do not see you talking about #PF (RCU + vma read lock protected) with
munmap. It is my understanding that the latter will synchronize over per
vma lock (along with mmap_lock exclusive locking). But then we are back
to the lifetime guarantees, or do I miss anything.
--
Michal Hocko
SUSE Labs
On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > > [...]
> > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > trying to free VMAs. Right?
> > > >
> > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > this is ok?
> > >
> > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > already shown how rwsem is not safe wrt unlock and free without RCU.
> >
> > Jann's case requires a thread freeing the VMA to be blocked on vma
> > write lock waiting for the vma real lock to be released by a page
> > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > turn suggests that there are no racing page fault handlers and no new
> > page fault handlers will appear. Is that a correct assumption? If so,
> > then races with page fault handlers can't happen while in exit_mmap().
> > Any other path (other than page fault handlers), accesses vma->lock
> > under protection of mmap_lock (for read or write, does not matter).
> > One exception is when we operate on an isolated VMA, then we don't
> > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > protection of mmap_lock in write mode, so races with anything other
> > than page fault handler should be safe as they are today.
>
> I do not see you talking about #PF (RCU + vma read lock protected) with
> munmap. It is my understanding that the latter will synchronize over per
> vma lock (along with mmap_lock exclusive locking). But then we are back
> to the lifetime guarantees, or do I miss anything.
munmap() or any VMA-freeing operation other than exit_mmap() will free
using call_rcu(), as implemented today. The suggestion is to free VMAs
directly, without RCU grace period only when done from exit_mmap().
That's because a VMA freeing flood has been seen so far only in the case
of exit_mmap() and we assume other cases are not heavy enough for the
call_rcu() flood to cause regressions. That assumption might prove
false but we can deal with that once we know it needs fixing.
> --
> Michal Hocko
> SUSE Labs
On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > > > [...]
> > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > trying to free VMAs. Right?
> > > > >
> > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > this is ok?
> > > >
> > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > >
> > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > write lock waiting for the vma real lock to be released by a page
> > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > turn suggests that there are no racing page fault handlers and no new
> > > page fault handlers will appear. Is that a correct assumption? If so,
> > > then races with page fault handlers can't happen while in exit_mmap().
> > > Any other path (other than page fault handlers), accesses vma->lock
> > > under protection of mmap_lock (for read or write, does not matter).
> > > One exception is when we operate on an isolated VMA, then we don't
> > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > protection of mmap_lock in write mode, so races with anything other
> > > than page fault handler should be safe as they are today.
> >
> > I do not see you talking about #PF (RCU + vma read lock protected) with
> > munmap. It is my understanding that the latter will synchronize over per
> > vma lock (along with mmap_lock exclusive locking). But then we are back
> > to the lifetime guarantees, or do I miss anything.
>
> munmap() or any VMA-freeing operation other than exit_mmap() will free
> using call_rcu(), as implemented today. The suggestion is to free VMAs
> directly, without RCU grace period only when done from exit_mmap().
OK, I have clearly missed that. This makes more sense but it also adds
some more complexity and assumptions - harder to maintain code in the
end. Whoever wants to touch this scheme in future would have to
re-evaluate all of them. So, I would just avoid that special casing if
that is feasible.
Dealing with the flood of call_rcu during exit_mmap is a trivial thing
to deal with as proposed elsewhere (just batch all of them in a single
run). This will surely add some more code but at least the locking would
be consistent.
--
Michal Hocko
SUSE Labs
On Mon, Jan 23, 2023 at 1:59 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Fri 20-01-23 08:20:43, Suren Baghdasaryan wrote:
> > On Fri, Jan 20, 2023 at 12:52 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Thu 19-01-23 10:52:03, Suren Baghdasaryan wrote:
> > > > On Thu, Jan 19, 2023 at 4:59 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > > > > > call_rcu() can take a long time when callback offloading is enabled.
> > > > > > Its use in the vm_area_free can cause regressions in the exit path when
> > > > > > multiple VMAs are being freed. To minimize that impact, place VMAs into
> > > > > > a list and free them in groups using one call_rcu() call per group.
> > > > >
> > > > > After some more clarification I can understand how call_rcu might not be
> > > > > super happy about thousands of callbacks to be invoked and I do agree
> > > > > that this is not really optimal.
> > > > >
> > > > > On the other hand I do not like this solution much either.
> > > > > VM_AREA_FREE_LIST_MAX is arbitrary and it won't really help all that
> > > > > much with processes with a huge number of vmas either. It would still be
> > > > > thousands of callbacks to be scheduled without a good reason.
> > > > >
> > > > > Instead, are there any other cases than remove_vma that need this
> > > > > batching? We could easily just link all the vmas into linked list and
> > > > > use a single call_rcu instead, no? This would both simplify the
> > > > > implementation, remove the scaling issue as well and we do not have to
> > > > > argue whether VM_AREA_FREE_LIST_MAX should be epsilon or epsilon + 1.
> > > >
> > > > Yes, I agree the solution is not stellar. I wanted something simple
> > > > but this is probably too simple. OTOH keeping all dead vm_area_structs
> > > > on the list without hooking up a shrinker (additional complexity) does
> > > > not sound too appealing either.
> > >
> > > I suspect you have missed my idea. I do not really want to keep the list
> > > around or any shrinker. It is dead simple. Collect all vmas in
> > > remove_vma and then call_rcu the whole list at once after the whole list
> > > (be it from exit_mmap or remove_mt). See?
> >
> > Yes, I understood your idea but keeping dead objects until the process
> > exits even when the system is low on memory (no shrinkers attached)
> > seems too wasteful. If we do this I would advocate for attaching a
> > shrinker.
>
> I am still not sure we are on the same page here. No, vmas shouldn't lie
> around until the process exits. I am really suggesting queuing only for
> remove_vma paths. You can have a different rcu callback than the one
> used for trivial single vma removal paths.
Oh, I also missed the remove_mt() part and thought you wanted to drain
the list only in exit_mmap(). I think that's a good option!
>
> --
> Michal Hocko
> SUSE Labs
On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <[email protected]> wrote:
>
> On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > > > > [...]
> > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > trying to free VMAs. Right?
> > > > > >
> > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > this is ok?
> > > > >
> > > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > > >
> > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > write lock waiting for the vma real lock to be released by a page
> > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > turn suggests that there are no racing page fault handlers and no new
> > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > under protection of mmap_lock (for read or write, does not matter).
> > > > One exception is when we operate on an isolated VMA, then we don't
> > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > protection of mmap_lock in write mode, so races with anything other
> > > > than page fault handler should be safe as they are today.
> > >
> > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > munmap. It is my understanding that the latter will synchronize over per
> > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > to the lifetime guarantees, or do I miss anything.
> >
> > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > directly, without RCU grace period only when done from exit_mmap().
>
> OK, I have clearly missed that. This makes more sense but it also adds
> some more complexity and assumptions - harder to maintain code in the
> end. Whoever wants to touch this scheme in future would have to
> re-evaluate all of them. So, I would just avoid that special casing if
> that is feasible.
Ok, I understand your point.
>
> Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> to deal with as proposed elsewhere (just batch all of them in a single
> run). This will surely add some more code but at least the locking would
> be consistent.
Yes, batching the vmas into a list and draining it in remove_mt() and
exit_mmap() as you suggested makes sense to me and is quite simple.
Let's do that if nobody has objections.
> --
> Michal Hocko
> SUSE Labs
On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> > > > > >
> > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > > > > > [...]
> > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > trying to free VMAs. Right?
> > > > > > >
> > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > this is ok?
> > > > > >
> > > > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > > > >
> > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > write lock waiting for the vma real lock to be released by a page
> > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > than page fault handler should be safe as they are today.
> > > >
> > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > munmap. It is my understanding that the latter will synchronize over per
> > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > to the lifetime guarantees, or do I miss anything.
> > >
> > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > directly, without RCU grace period only when done from exit_mmap().
> >
> > OK, I have clearly missed that. This makes more sense but it also adds
> > some more complexity and assumptions - harder to maintain code in the
> > end. Whoever wants to touch this scheme in future would have to
> > re-evaluate all of them. So, I would just avoid that special casing if
> > that is feasible.
>
> Ok, I understand your point.
>
> >
> > Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> > to deal with as proposed elsewhere (just batch all of them in a single
> > run). This will surely add some more code but at least the locking would
> > be consistent.
>
> Yes, batching the vmas into a list and draining it in remove_mt() and
> exit_mmap() as you suggested makes sense to me and is quite simple.
> Let's do that if nobody has objections.
I object. We *know* nobody has a reference to any of the VMAs because
you have to have a refcount on the mm before you can get a reference
to a VMA. If Michal is saying that somebody could do:
mmget(mm);
vma = find_vma(mm);
lock_vma(vma);
mmput(mm);
vma->a = b;
unlock_vma(mm, vma);
then that's something we'd catch in review -- you obviously can't use
the mm after you've dropped your reference to it.
Having all this extra code to solve two problems badly is a very poor
choice. We have two distinct problems, each of which has a simple,
efficient solution.
On Mon, Jan 23, 2023 at 10:23 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 23, 2023 at 9:16 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Mon 23-01-23 09:07:34, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 23, 2023 at 8:55 AM Michal Hocko <[email protected]> wrote:
> > > > >
> > > > > On Mon 23-01-23 08:22:53, Suren Baghdasaryan wrote:
> > > > > > On Mon, Jan 23, 2023 at 1:56 AM Michal Hocko <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri 20-01-23 09:50:01, Suren Baghdasaryan wrote:
> > > > > > > > On Fri, Jan 20, 2023 at 9:32 AM Matthew Wilcox <[email protected]> wrote:
> > > > > > > [...]
> > > > > > > > > The page fault handler (or whatever other reader -- ptrace, proc, etc)
> > > > > > > > > should have a refcount on the mm_struct, so we can't be in this path
> > > > > > > > > trying to free VMAs. Right?
> > > > > > > >
> > > > > > > > Hmm. That sounds right. I checked process_mrelease() as well, which
> > > > > > > > operated on mm with only mmgrab()+mmap_read_lock() but it only unmaps
> > > > > > > > VMAs without freeing them, so we are still good. Michal, do you agree
> > > > > > > > this is ok?
> > > > > > >
> > > > > > > Don't we need RCU protection for the vma lifetime assurance? Jann has
> > > > > > > already shown how rwsem is not safe wrt unlock and free without RCU.
> > > > > >
> > > > > > Jann's case requires a thread freeing the VMA to be blocked on vma
> > > > > > write lock waiting for the vma real lock to be released by a page
> > > > > > fault handler. However exit_mmap() means mm->mm_users==0, which in
> > > > > > turn suggests that there are no racing page fault handlers and no new
> > > > > > page fault handlers will appear. Is that a correct assumption? If so,
> > > > > > then races with page fault handlers can't happen while in exit_mmap().
> > > > > > Any other path (other than page fault handlers), accesses vma->lock
> > > > > > under protection of mmap_lock (for read or write, does not matter).
> > > > > > One exception is when we operate on an isolated VMA, then we don't
> > > > > > need mmap_lock protection, but exit_mmap() does not deal with isolated
> > > > > > VMAs, so out of scope here. exit_mmap() frees vm_area_structs under
> > > > > > protection of mmap_lock in write mode, so races with anything other
> > > > > > than page fault handler should be safe as they are today.
> > > > >
> > > > > I do not see you talking about #PF (RCU + vma read lock protected) with
> > > > > munmap. It is my understanding that the latter will synchronize over per
> > > > > vma lock (along with mmap_lock exclusive locking). But then we are back
> > > > > to the lifetime guarantees, or do I miss anything.
> > > >
> > > > munmap() or any VMA-freeing operation other than exit_mmap() will free
> > > > using call_rcu(), as implemented today. The suggestion is to free VMAs
> > > > directly, without RCU grace period only when done from exit_mmap().
> > >
> > > OK, I have clearly missed that. This makes more sense but it also adds
> > > some more complexity and assumptions - harder to maintain code in the
> > > end. Whoever wants to touch this scheme in future would have to
> > > re-evaluate all of them. So, I would just avoid that special casing if
> > > that is feasible.
> >
> > Ok, I understand your point.
> >
> > >
> > > Dealing with the flood of call_rcu during exit_mmap is a trivial thing
> > > to deal with as proposed elsewhere (just batch all of them in a single
> > > run). This will surely add some more code but at least the locking would
> > > be consistent.
> >
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
>
> I object. We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA. If Michal is saying that somebody could do:
>
> mmget(mm);
> vma = find_vma(mm);
> lock_vma(vma);
> mmput(mm);
> vma->a = b;
> unlock_vma(mm, vma);
More precisely, it's:
mmget(mm);
vma = lock_vma_under_rcu(mm, addr); -> calls vma_read_trylock(vma)
mmput(mm);
vma->a = b;
vma_read_unlock(vma);
To be fair, vma_read_unlock() does not take mm as a parameter, so one
could have an impression that mm doesn't need to be pinned at the time
of its call.
>
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.
>
> Having all this extra code to solve two problems badly is a very poor
> choice. We have two distinct problems, each of which has a simple,
> efficient solution.
>
On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
[...]
> > Yes, batching the vmas into a list and draining it in remove_mt() and
> > exit_mmap() as you suggested makes sense to me and is quite simple.
> > Let's do that if nobody has objections.
>
> I object. We *know* nobody has a reference to any of the VMAs because
> you have to have a refcount on the mm before you can get a reference
> to a VMA. If Michal is saying that somebody could do:
>
> mmget(mm);
> vma = find_vma(mm);
> lock_vma(vma);
> mmput(mm);
> vma->a = b;
> unlock_vma(mm, vma);
>
> then that's something we'd catch in review -- you obviously can't use
> the mm after you've dropped your reference to it.
I am not claiming this is possible now. I do not think we want to have
something like that in the future either but that is really hard to
envision. I am claiming that it is subtle and potentially error prone to
have two different ways of mass vma freeing wrt. locking. Also, don't we
have a very similar situation during last munmaps?
--
Michal Hocko
SUSE Labs
On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> [...]
> > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > Let's do that if nobody has objections.
> >
> > I object. We *know* nobody has a reference to any of the VMAs because
> > you have to have a refcount on the mm before you can get a reference
> > to a VMA. If Michal is saying that somebody could do:
> >
> > mmget(mm);
> > vma = find_vma(mm);
> > lock_vma(vma);
> > mmput(mm);
> > vma->a = b;
> > unlock_vma(mm, vma);
> >
> > then that's something we'd catch in review -- you obviously can't use
> > the mm after you've dropped your reference to it.
>
> I am not claiming this is possible now. I do not think we want to have
> something like that in the future either but that is really hard to
> envision. I am claiming that it is subtle and potentially error prone to
> have two different ways of mass vma freeing wrt. locking. Also, don't we
> have a very similar situation during last munmaps?
We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that.
There are two cases; there's munmap(), which typically frees a single
VMA (yes, theoretically, you can free hundreds of VMAs with a single
call which spans multiple VMAs, but in practice that doesn't happen),
and there's exit_mmap() which happens on exec() and exit().
For the munmap() case, just RCU-free each one individually. For the
exit_mmap() case, there's no need to use RCU because nobody should still
have a VMA pointer after calling mmdrop() [1]
[1] Sorry, the above example should have been mmgrab()/mmdrop(), not
mmget()/mmput(); you're not allowed to look at the VMA list with an
mmget(), you need to have grabbed.
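A sketch of the split Matthew describes: the regular unmap path defers each
free past an RCU grace period, while exit_mmap() frees directly because no
lockless reader can still see the VMA once mm_users has dropped to zero.
Function names and the vm_rcu/vm_area_cachep details are illustrative, not
the patchset's exact interface:

static void __vm_area_free(struct rcu_head *head)
{
        struct vm_area_struct *vma =
                container_of(head, struct vm_area_struct, vm_rcu);

        kmem_cache_free(vm_area_cachep, vma);
}

/*
 * munmap() and friends: a racing lock_vma_under_rcu() may still hold a
 * pointer to this VMA, so wait for a grace period before freeing it.
 */
void vm_area_free(struct vm_area_struct *vma)
{
        call_rcu(&vma->vm_rcu, __vm_area_free);
}

/*
 * exit_mmap(): mm->mm_users == 0, so no new lockless lookups can start
 * and the VMA can be freed immediately.
 */
void vm_area_free_unreachable(struct vm_area_struct *vma)
{
        kmem_cache_free(vm_area_cachep, vma);
}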
On Mon, Jan 23, 2023 at 11:31 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > >
> > > I object. We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA. If Michal is saying that somebody could do:
> > >
> > > mmget(mm);
> > > vma = find_vma(mm);
> > > lock_vma(vma);
> > > mmput(mm);
> > > vma->a = b;
> > > unlock_vma(mm, vma);
> > >
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> >
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
>
> We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().
>
> For the munmap() case, just RCU-free each one individually. For the
> exit_mmap() case, there's no need to use RCU because nobody should still
> have a VMA pointer after calling mmdrop() [1]
>
> [1] Sorry, the above example should have been mmgrab()/mmdrop(), not
> mmget()/mmput(); you're not allowed to look at the VMA list with an
> mmget(), you need to have grabbed.
I think it's clear that this would work with the current code and that
the concern is about any future possible misuse. So, it would be
preferable to proactively prevent such misuse.
vma_write_lock() and vma_write_unlock_mm() both have
mmap_assert_write_locked(), so they always happen under mmap_lock
protection and therefore do not pose any danger. The only issue we
need to be careful with is calling
vma_read_trylock()/vma_read_unlock() outside of mmap_lock protection
while mm is unstable. I don't think doing mmget/mmput inside these
functions is called for but maybe some assertions would prevent future
misuse?
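One possible shape for the assertion Suren suggests, if the rule is that
vma_read_trylock()/vma_read_unlock() may only be called while mm_users is
elevated; the helper name is made up and the exact condition would need
checking against all callers:

static inline void vma_assert_mm_pinned(struct vm_area_struct *vma)
{
        /* Lockless VMA users are expected to hold an mm_users reference. */
        VM_WARN_ON_ONCE(!atomic_read(&vma->vm_mm->mm_users));
}

Called at the top of vma_read_trylock() and vma_read_unlock(), this would
flag a caller that drops its mm reference too early, without adding
mmget()/mmput() to the fast path.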
On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > [...]
> > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > Let's do that if nobody has objections.
> > >
> > > I object. We *know* nobody has a reference to any of the VMAs because
> > > you have to have a refcount on the mm before you can get a reference
> > > to a VMA. If Michal is saying that somebody could do:
> > >
> > > mmget(mm);
> > > vma = find_vma(mm);
> > > lock_vma(vma);
> > > mmput(mm);
> > > vma->a = b;
> > > unlock_vma(mm, vma);
> > >
> > > then that's something we'd catch in review -- you obviously can't use
> > > the mm after you've dropped your reference to it.
> >
> > I am not claiming this is possible now. I do not think we want to have
> > something like that in the future either but that is really hard to
> > envision. I am claiming that it is subtle and potentially error prone to
> > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > have a very similar situation during last munmaps?
>
> We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that.
> There are two cases; there's munmap(), which typically frees a single
> VMA (yes, theoretically, you can free hundreds of VMAs with a single
> call which spans multiple VMAs, but in practice that doesn't happen),
> and there's exit_mmap() which happens on exec() and exit().
This requires special casing remove_vma for those two different paths
(exit_mmap and remove_mt). If you ask me, that sounds like suboptimal
code just to avoid handling a potentially large munmap, which might very
well be a rare thing as you say. But haven't we learned that sooner or
later we will find out there is somebody that cares after all? Anyway, this is not
something I care about all that much. It is just weird to special case
exit_mmap, if you ask me. Up to Suren to decide which way he wants to
go. I just really didn't like the initial implementation of batching
based on a completely arbitrary batch limit and lazy freeing.
--
Michal Hocko
SUSE Labs
On Mon, Jan 23, 2023 at 12:00 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > >
> > > > I object. We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA. If Michal is saying that somebody could do:
> > > >
> > > > mmget(mm);
> > > > vma = find_vma(mm);
> > > > lock_vma(vma);
> > > > mmput(mm);
> > > > vma->a = b;
> > > > unlock_vma(mm, vma);
> > > >
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > >
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> >
> > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
>
> This requires special casing remove_vma for those two different paths
> (exit_mmap and remove_mt). If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody that cares after all? Anyway, this is not
> something I care about all that much. It is just weird to special case
> exit_mmap, if you ask me. Up to Suren to decide which way he wants to
> go. I just really didn't like the initial implementation of batching
> based on a completely arbitrary batch limit and lazy freeing.
I would prefer to go with the simplest sufficient solution. A
potential issue with a large munmap might prove to be real but I think
we know how to easily fix that with batching if the issue ever
materializes (I'll have a fix ready implementing Michal's suggestion).
So, I suggest going with Liam's/Matthew's solution and converting to
Michal's solution if a regression shows up anywhere else. Would that be
acceptable?
> --
> Michal Hocko
> SUSE Labs
* Michal Hocko <[email protected]> [230123 15:00]:
> On Mon 23-01-23 19:30:43, Matthew Wilcox wrote:
> > On Mon, Jan 23, 2023 at 08:18:37PM +0100, Michal Hocko wrote:
> > > On Mon 23-01-23 18:23:08, Matthew Wilcox wrote:
> > > > On Mon, Jan 23, 2023 at 09:46:20AM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > > > Yes, batching the vmas into a list and draining it in remove_mt() and
> > > > > exit_mmap() as you suggested makes sense to me and is quite simple.
> > > > > Let's do that if nobody has objections.
> > > >
> > > > I object. We *know* nobody has a reference to any of the VMAs because
> > > > you have to have a refcount on the mm before you can get a reference
> > > > to a VMA. If Michal is saying that somebody could do:
> > > >
> > > > mmget(mm);
> > > > vma = find_vma(mm);
> > > > lock_vma(vma);
> > > > mmput(mm);
> > > > vma->a = b;
> > > > unlock_vma(mm, vma);
> > > >
> > > > then that's something we'd catch in review -- you obviously can't use
> > > > the mm after you've dropped your reference to it.
> > >
> > > I am not claiming this is possible now. I do not think we want to have
> > > something like that in the future either but that is really hard to
> > > envision. I am claiming that it is subtle and potentially error prone to
> > > have two different ways of mass vma freeing wrt. locking. Also, don't we
> > > have a very similar situation during last munmaps?
> >
> > We shouldn't have two ways of mass VMA freeing. Nobody's suggesting that.
> > There are two cases; there's munmap(), which typically frees a single
> > VMA (yes, theoretically, you can free hundreds of VMAs with a single
> > call which spans multiple VMAs, but in practice that doesn't happen),
> > and there's exit_mmap() which happens on exec() and exit().
>
> This requires special casing remove_vma for those two different paths
> (exit_mmap and remove_mt). If you ask me, that sounds like suboptimal
> code just to avoid handling a potentially large munmap, which might very
> well be a rare thing as you say. But haven't we learned that sooner or
> later we will find out there is somebody that cares after all? Anyway, this is not
> something I care about all that much. It is just weird to special case
> exit_mmap, if you ask me.
exit_mmap() is already a special case for locking (and statistics).
This exists today to optimize the special exit scenario. I don't think
it's a question of sub-optimal code but of what we can get away with not
doing in the case of the process exit.
On Tue, Jan 17, 2023 at 10:45:25PM +0100, Jann Horn wrote:
Hi Jann,
> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan <[email protected]> wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn <[email protected]> wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > + up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
>
> __up_read() does:
>
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> RWSEM_FLAG_WAITERS)) {
> clear_nonspinnable(sem);
> rwsem_wake(sem);
> }
This sequence is covered by preempt_disable()/preempt_enable().
Wouldn't it preserve the RCU grace period until after __up_read()
has exited?
> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
>
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
>
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.
Thanks!
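For reference, the change Jann outlines earlier in the thread would make the
reader-side unlock look roughly like this, assuming the VMA is freed via
call_rcu() so the RCU read-side critical section keeps it alive until
up_read() has finished touching the rwsem:

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
        /*
         * A writer in down_write() can be granted the lock as soon as the
         * reader count drops, i.e. before up_read() has finished its own
         * accesses to the rwsem, and may then free the VMA. Hold the RCU
         * read lock so a call_rcu()-deferred free cannot complete while
         * up_read() is still using vma->lock.
         */
        rcu_read_lock();
        up_read(&vma->lock);
        rcu_read_unlock();
}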