Subject: [PATCH v7 0/9] Huge page-table entries for TTM

From: Thomas Hellstrom (VMware) <[email protected]>

In order to reduce CPU usage [1] and in theory TLB misses this patchset enables
huge- and giant page-table entries for TTM and TTM-enabled graphics drivers.

Patch 1 and 2 introduce a vma_is_special_huge() function to make the mm code
take the same path as DAX when splitting huge- and giant page table entries,
(which currently means zapping the page-table entry and rely on re-faulting).

Patch 3 makes the mm code split existing huge page-table entries
on huge_fault fallbacks. Typically on COW or on buffer-objects that want
write-notify. COW and write-notification is always done on the lowest
page-table level. See the patch log message for additional considerations.

Patch 4 introduces functions to allow the graphics drivers to manipulate
the caching- and encryption flags of huge page-table entries without ugly
hacks.

Patch 5 implements the huge_fault handler in TTM.
This enables huge page-table entries, provided that the kernel is configured
to support transhuge pages, either by default or using madvise().
However, they are unlikely to be inserted unless the kernel buffer object
pfns and user-space addresses align perfectly. There are various options
here, but since buffer objects that reside in system pages typically start
at huge page boundaries if they are backed by huge pages, we try to enforce
buffer object starting pfns and user-space addresses to be huge page-size
aligned if their size exceeds a huge page-size. If pud-size transhuge
("giant") pages are enabled by the arch, the same holds for those.

Patch 6 implements a specialized huge_fault handler for vmwgfx.
The vmwgfx driver may perform dirty-tracking and needs some special code
to handle that correctly.

Patch 7 implements a drm helper to align user-space addresses according
to the above scheme, if possible.

Patch 8 implements a TTM range manager for vmwgfx that does the same for
graphics IO memory. This may later be reused by other graphics drivers
if necessary.

Patch 9 finally hooks up the helpers of patch 7 and 8 to the vmwgfx driver.
A similar change is needed for graphics drivers that want a reasonable
likelyhood of actually using huge page-table entries.

If a buffer object size is not huge-page or giant-page aligned,
its size will NOT be inflated by this patchset. This means that the buffer
object tail will use smaller size page-table entries and thus no memory
overhead occurs. Drivers that want to pay the memory overhead price need to
implement their own scheme to inflate buffer-object sizes.

PMD size huge page-table-entries have been tested with vmwgfx and found to
work well both with system memory backed and IO memory backed buffer objects.

PUD size giant page-table-entries have seen limited (fault and COW) testing
using a modified kernel (to support 1GB page allocations) and a fake vmwgfx
TTM memory type. The vmwgfx driver does otherwise not support 1GB-size IO
memory resources.

This patch series is now about to become a pull request.
Thomas

Changes since RFC:
* Check for buffer objects present in contigous IO Memory (Christian König)
* Rebased on the vmwgfx emulated coherent memory functionality. That rebase
adds patch 5.
Changes since v1:
* Make the new TTM range manager vmwgfx-specific. (Christian König)
* Minor fixes for configs that don't support or only partially support
transhuge pages.
Changes since v2:
* Minor coding style and doc fixes in patch 5/9 (Christian König)
* Patch 5/9 doesn't touch mm. Remove from the patch title.
Changes since v3:
* Added reviews and acks
* Implemented ugly but generic ttm_pgprot_is_wrprotecting() instead of arch
specific code.
Changes since v4:
* Added timings (Andrew Morton)
* Updated function documentation (Andrew Morton)
Changes since v5:
* Fix drm build error with !CONFIG_MMU
(Reported-by: kbuild test robot <[email protected]>)
Changes since v6:
* drm_file.c new includes also conditioned on CONFIG_TRANSPARENT_HUGEPAGE
* checkpatch complained about formatting of a commit message - fixed.
* Updated Thomas' email address
* Added acks from Andrew Morton

[1]
The below test program generates the following gnu time output when run on a
vmwgfx-enabled kernel without the patch series:

4.78user 6.02system 0:10.91elapsed 99%CPU (0avgtext+0avgdata 1624maxresident)k
0inputs+0outputs (0major+640077minor)pagefaults 0swaps

and with the patch series:

1.71user 3.60system 0:05.40elapsed 98%CPU (0avgtext+0avgdata 1656maxresident)k
0inputs+0outputs (0major+20079minor)pagefaults 0swaps

A consistent number of reduced graphics page-faults can be seen with normal
graphics applications, but probably due to the aggressive buffer object
caching in vmwgfx user-space drivers the CPU time reduction is within
error limits.

#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <xf86drm.h>

static void checkerr(int ret, const char *name)
{
if (ret < 0) {
perror(name);
exit(-1);
}
}

int main(int agc, const char *argv[])
{
struct drm_mode_create_dumb c_arg = {0};
struct drm_mode_map_dumb m_arg = {0};
struct drm_mode_destroy_dumb d_arg = {0};
int ret, i, fd;
void *map;

fd = open("/dev/dri/card0", O_RDWR);
checkerr(fd, argv[0]);

for (i = 0; i < 10000; ++i) {
c_arg.bpp = 32;
c_arg.width = 1024;
c_arg.height = 1024;
ret = drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &c_arg);
checkerr(fd, argv[0]);

m_arg.handle = c_arg.handle;
ret = drmIoctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &m_arg);
checkerr(fd, argv[0]);

map = mmap(NULL, c_arg.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
m_arg.offset);
checkerr(map == MAP_FAILED ? -1 : 0, argv[0]);

(void) madvise((void *) map, c_arg.size, MADV_HUGEPAGE);
memset(map, 0x67, c_arg.size);
munmap(map, c_arg.size);

d_arg.handle = c_arg.handle;
ret = drmIoctl(fd, DRM_IOCTL_MODE_DESTROY_DUMB, &d_arg);
checkerr(ret, argv[0]);
}

close(fd);
}

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Dan Williams <[email protected]>



Subject: [PATCH v7 8/9] drm/vmwgfx: Introduce a huge page aligning TTM range manager

From: "Thomas Hellstrom (VMware)" <[email protected]>

Using huge page-table entries requires that the physical address of the
start of a buffer object is huge page size aligned.
Make a special version of the TTM range manager that accomplishes this,
but falls back to a smaller page size alignment (PUD->PMD, PMD->NORMAL)
to avoid eviction.
If other drivers want to use it in the future, it can be made a
TTM generic helper. Note that drivers can force eviction for a certain
alignment by assigning the TTM GPU alignment correspondingly.

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Thomas Hellstrom (VMware) <[email protected]>
Reviewed-by: Roland Scheidegger <[email protected]>
Acked-by: Christian König <[email protected]>
---
drivers/gpu/drm/vmwgfx/Makefile | 1 +
drivers/gpu/drm/vmwgfx/vmwgfx_drv.h | 7 ++
drivers/gpu/drm/vmwgfx/vmwgfx_thp.c | 166 ++++++++++++++++++++++++++++
3 files changed, 174 insertions(+)
create mode 100644 drivers/gpu/drm/vmwgfx/vmwgfx_thp.c

diff --git a/drivers/gpu/drm/vmwgfx/Makefile b/drivers/gpu/drm/vmwgfx/Makefile
index c877a21a0739..421dd2a497a5 100644
--- a/drivers/gpu/drm/vmwgfx/Makefile
+++ b/drivers/gpu/drm/vmwgfx/Makefile
@@ -11,4 +11,5 @@ vmwgfx-y := vmwgfx_execbuf.o vmwgfx_gmr.o vmwgfx_kms.o vmwgfx_drv.o \
vmwgfx_validation.o vmwgfx_page_dirty.o \
ttm_object.o ttm_lock.o

+vmwgfx-$(CONFIG_TRANSPARENT_HUGEPAGE) += vmwgfx_thp.o
obj-$(CONFIG_DRM_VMWGFX) := vmwgfx.o
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
index 6fc8d5c171c6..d19d28c13671 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_drv.h
@@ -1407,6 +1407,13 @@ vm_fault_t vmw_bo_vm_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size);
#endif

+/* Transparent hugepage support - vmwgfx_thp.c */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern const struct ttm_mem_type_manager_func vmw_thp_func;
+#else
+#define vmw_thp_func ttm_bo_manager_func
+#endif
+
/**
* VMW_DEBUG_KMS - Debug output for kernel mode-setting
*
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_thp.c b/drivers/gpu/drm/vmwgfx/vmwgfx_thp.c
new file mode 100644
index 000000000000..b7c816ba7166
--- /dev/null
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_thp.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+/*
+ * Huge page-table-entry support for IO memory.
+ *
+ * Copyright (C) 2007-2019 Vmware, Inc. All rights reservedd.
+ */
+#include "vmwgfx_drv.h"
+#include <drm/ttm/ttm_module.h>
+#include <drm/ttm/ttm_bo_driver.h>
+#include <drm/ttm/ttm_placement.h>
+
+/**
+ * struct vmw_thp_manager - Range manager implementing huge page alignment
+ *
+ * @mm: The underlying range manager. Protected by @lock.
+ * @lock: Manager lock.
+ */
+struct vmw_thp_manager {
+ struct drm_mm mm;
+ spinlock_t lock;
+};
+
+static int vmw_thp_insert_aligned(struct drm_mm *mm, struct drm_mm_node *node,
+ unsigned long align_pages,
+ const struct ttm_place *place,
+ struct ttm_mem_reg *mem,
+ unsigned long lpfn,
+ enum drm_mm_insert_mode mode)
+{
+ if (align_pages >= mem->page_alignment &&
+ (!mem->page_alignment || align_pages % mem->page_alignment == 0)) {
+ return drm_mm_insert_node_in_range(mm, node,
+ mem->num_pages,
+ align_pages, 0,
+ place->fpfn, lpfn, mode);
+ }
+
+ return -ENOSPC;
+}
+
+static int vmw_thp_get_node(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ const struct ttm_place *place,
+ struct ttm_mem_reg *mem)
+{
+ struct vmw_thp_manager *rman = (struct vmw_thp_manager *) man->priv;
+ struct drm_mm *mm = &rman->mm;
+ struct drm_mm_node *node;
+ unsigned long align_pages;
+ unsigned long lpfn;
+ enum drm_mm_insert_mode mode = DRM_MM_INSERT_BEST;
+ int ret;
+
+ node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ lpfn = place->lpfn;
+ if (!lpfn)
+ lpfn = man->size;
+
+ mode = DRM_MM_INSERT_BEST;
+ if (place->flags & TTM_PL_FLAG_TOPDOWN)
+ mode = DRM_MM_INSERT_HIGH;
+
+ spin_lock(&rman->lock);
+ if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) {
+ align_pages = (HPAGE_PUD_SIZE >> PAGE_SHIFT);
+ if (mem->num_pages >= align_pages) {
+ ret = vmw_thp_insert_aligned(mm, node, align_pages,
+ place, mem, lpfn, mode);
+ if (!ret)
+ goto found_unlock;
+ }
+ }
+
+ align_pages = (HPAGE_PMD_SIZE >> PAGE_SHIFT);
+ if (mem->num_pages >= align_pages) {
+ ret = vmw_thp_insert_aligned(mm, node, align_pages, place, mem,
+ lpfn, mode);
+ if (!ret)
+ goto found_unlock;
+ }
+
+ ret = drm_mm_insert_node_in_range(mm, node, mem->num_pages,
+ mem->page_alignment, 0,
+ place->fpfn, lpfn, mode);
+found_unlock:
+ spin_unlock(&rman->lock);
+
+ if (unlikely(ret)) {
+ kfree(node);
+ } else {
+ mem->mm_node = node;
+ mem->start = node->start;
+ }
+
+ return 0;
+}
+
+
+
+static void vmw_thp_put_node(struct ttm_mem_type_manager *man,
+ struct ttm_mem_reg *mem)
+{
+ struct vmw_thp_manager *rman = (struct vmw_thp_manager *) man->priv;
+
+ if (mem->mm_node) {
+ spin_lock(&rman->lock);
+ drm_mm_remove_node(mem->mm_node);
+ spin_unlock(&rman->lock);
+
+ kfree(mem->mm_node);
+ mem->mm_node = NULL;
+ }
+}
+
+static int vmw_thp_init(struct ttm_mem_type_manager *man,
+ unsigned long p_size)
+{
+ struct vmw_thp_manager *rman;
+
+ rman = kzalloc(sizeof(*rman), GFP_KERNEL);
+ if (!rman)
+ return -ENOMEM;
+
+ drm_mm_init(&rman->mm, 0, p_size);
+ spin_lock_init(&rman->lock);
+ man->priv = rman;
+ return 0;
+}
+
+static int vmw_thp_takedown(struct ttm_mem_type_manager *man)
+{
+ struct vmw_thp_manager *rman = (struct vmw_thp_manager *) man->priv;
+ struct drm_mm *mm = &rman->mm;
+
+ spin_lock(&rman->lock);
+ if (drm_mm_clean(mm)) {
+ drm_mm_takedown(mm);
+ spin_unlock(&rman->lock);
+ kfree(rman);
+ man->priv = NULL;
+ return 0;
+ }
+ spin_unlock(&rman->lock);
+ return -EBUSY;
+}
+
+static void vmw_thp_debug(struct ttm_mem_type_manager *man,
+ struct drm_printer *printer)
+{
+ struct vmw_thp_manager *rman = (struct vmw_thp_manager *) man->priv;
+
+ spin_lock(&rman->lock);
+ drm_mm_print(&rman->mm, printer);
+ spin_unlock(&rman->lock);
+}
+
+const struct ttm_mem_type_manager_func vmw_thp_func = {
+ .init = vmw_thp_init,
+ .takedown = vmw_thp_takedown,
+ .get_node = vmw_thp_get_node,
+ .put_node = vmw_thp_put_node,
+ .debug = vmw_thp_debug
+};
--
2.21.1

Subject: [PATCH v7 3/9] mm: Split huge pages on write-notify or COW

From: "Thomas Hellstrom (VMware)" <[email protected]>

The functions wp_huge_pmd() and wp_huge_pud() currently relies on the
huge_fault() callback to split huge page table entries if needed.
However for module users that requires export of the split_huge_xxx()
functionality which may be undesired. Instead split pre-existing huge
page-table entries on VM_FAULT_FALLBACK return.

We currently only do COW and write-notify on the PTE level, so if the
huge_fault() handler returns VM_FAULT_FALLBACK on wp faults,
split the huge pages and page-table entries. Also do this for huge PUDs
if there is no huge_fault() handler and the vma is not anonymous, similar
to how it's done for PMDs.

Note that fs/dax.c still does the splitting in the huge_fault() handler,
but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd()
if needed.

Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: "Jérôme Glisse" <[email protected]>
Cc: "Christian König" <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Thomas Hellstrom (VMware) <[email protected]>
Acked-by: Christian König <[email protected]>
Acked-by: Andrew Morton <[email protected]>
---
mm/memory.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e8bfdf0d9d1d..efa59b1b109c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3951,11 +3951,14 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
{
if (vma_is_anonymous(vmf->vma))
return do_huge_pmd_wp_page(vmf, orig_pmd);
- if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
+ if (vmf->vma->vm_ops->huge_fault) {
+ vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);

- /* COW handled on pte level: split pmd */
- VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
+ }
+
+ /* COW or write-notify handled on pte level: split pmd. */
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

return VM_FAULT_FALLBACK;
@@ -3968,12 +3971,20 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)

static vm_fault_t create_huge_pud(struct vm_fault *vmf)
{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
/* No support for anonymous transparent PUD pages yet */
if (vma_is_anonymous(vmf->vma))
- return VM_FAULT_FALLBACK;
- if (vmf->vma->vm_ops->huge_fault)
- return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
+ goto split;
+ if (vmf->vma->vm_ops->huge_fault) {
+ vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
+
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
+ }
+split:
+ /* COW or write-notify not handled on PUD level: split pud.*/
+ __split_huge_pud(vmf->vma, vmf->pud, vmf->address);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
return VM_FAULT_FALLBACK;
}
--
2.21.1