This applies on top of the earlier vdso pvclock series I sent out.
Once that lands in -tip, this will apply to -tip.
This series cleans up the hack that is our vvar mapping. We currently
initialize the vvar mapping as a special mapping vma backed by nothing
whatsoever and then we abuse remap_pfn_range to populate it.
This cheats the mm core, probably breaks under various evil madvise
workloads, and prevents handling faults in more interesting ways.
To clean it up, this series:
- Adds a special mapping .fault operation
- Adds a vm_insert_pfn_prot helper
- Uses the new .fault infrastructure in x86's vdso and vvar mappings
- Hardens the HPET mapping, mitigating an HW attack surface that bothers me
I'd appreciate some review from the mm folks. Also, akpm, if you're
okay with this whole series going in through -tip, that would be
great -- it will avoid splitting it across two releases.
Andy Lutomirski (6):
mm: Add a vm_special_mapping .fault method
mm: Add vm_insert_pfn_prot
x86/vdso: Track each mm's loaded vdso image as well as its base
x86,vdso: Use .fault for the vdso text mapping
x86,vdso: Use .fault instead of remap_pfn_range for the vvar mapping
x86/vdso: Disallow vvar access to vclock IO for never-used vclocks
arch/x86/entry/vdso/vdso2c.h | 7 --
arch/x86/entry/vdso/vma.c | 124 ++++++++++++++++++++------------
arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++-
arch/x86/include/asm/clocksource.h | 9 +--
arch/x86/include/asm/mmu.h | 3 +-
arch/x86/include/asm/vdso.h | 3 -
arch/x86/include/asm/vgtod.h | 6 ++
include/linux/mm.h | 2 +
include/linux/mm_types.h | 19 ++++-
mm/memory.c | 25 ++++++-
mm/mmap.c | 13 ++--
11 files changed, 150 insertions(+), 70 deletions(-)
--
2.5.0
From: Andy Lutomirski <[email protected]>
Requiring special mappings to give a list of struct pages is
inflexible: it prevents sane use of IO memory in a special mapping,
it's inefficient (it requires arch code to initialize a list of
struct pages, and it requires the mm core to walk the entire list
just to figure out how long it is), and it prevents arch code from
doing anything fancy when a special mapping fault occurs.
Add a .fault method as an alternative to filling in a .pages array.
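For illustration, a hypothetical user of the new hook would look roughly
like the sketch below (my_special_fault, my_mapping and my_backing_page
are made-up names; the real conversion of the x86 vdso comes later in
this series):

static struct page *my_backing_page;	/* hypothetical, allocated elsewhere */

static int my_special_fault(const struct vm_special_mapping *sm,
			    struct vm_area_struct *vma,
			    struct vm_fault *vmf)
{
	/* Resolve the fault however we like; here, hand back one page. */
	if (vmf->pgoff != 0)
		return VM_FAULT_SIGBUS;

	vmf->page = my_backing_page;
	get_page(vmf->page);
	return 0;
}

static const struct vm_special_mapping my_mapping = {
	.name  = "[example]",
	.fault = my_special_fault,	/* .pages can stay NULL */
};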
Signed-off-by: Andy Lutomirski <[email protected]>
---
include/linux/mm_types.h | 19 ++++++++++++++++++-
mm/mmap.c | 13 +++++++++----
2 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f8d1492a114f..3d315d373daf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm)
}
#endif
+struct vm_fault;
+
struct vm_special_mapping
{
- const char *name;
+ const char *name; /* The name, e.g. "[vdso]". */
+
+ /*
+ * If .fault is not provided, this is points to a
+ * NULL-terminated array of pages that back the special mapping.
+ *
+ * This must not be NULL unless .fault is provided.
+ */
struct page **pages;
+
+ /*
+ * If non-NULL, then this is called to resolve page faults
+ * on the special mapping. If used, .pages is not checked.
+ */
+ int (*fault)(const struct vm_special_mapping *sm,
+ struct vm_area_struct *vma,
+ struct vm_fault *vmf);
};
enum tlb_flush_reason {
diff --git a/mm/mmap.c b/mm/mmap.c
index 2ce04a649f6b..f717453b1a57 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma,
pgoff_t pgoff;
struct page **pages;
- if (vma->vm_ops == &legacy_special_mapping_vmops)
+ if (vma->vm_ops == &legacy_special_mapping_vmops) {
pages = vma->vm_private_data;
- else
- pages = ((struct vm_special_mapping *)vma->vm_private_data)->
- pages;
+ } else {
+ struct vm_special_mapping *sm = vma->vm_private_data;
+
+ if (sm->fault)
+ return sm->fault(sm, vma, vmf);
+
+ pages = sm->pages;
+ }
for (pgoff = vmf->pgoff; pgoff && *pages; ++pages)
pgoff--;
--
2.5.0
The x86 vvar mapping contains pages with differing cacheability
flags. This is currently only supported using (io_)remap_pfn_range,
but those functions can't be used inside page faults.
Add vm_insert_pfn_prot to support varying cacheability within the
same non-COW VMA in a more sane manner.
x86 needs this to avoid a CRIU-breaking and memory-wasting explosion
of VMAs when supporting userspace access to the HPET.
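As a sketch of the intended use (example_io_fault and example_io_phys
are made-up names; the real caller is the vvar fault handler added
later in this series), a .fault handler could map an uncacheable IO
page like this:

static phys_addr_t example_io_phys;	/* hypothetical MMIO address */

static int example_io_fault(const struct vm_special_mapping *sm,
			    struct vm_area_struct *vma,
			    struct vm_fault *vmf)
{
	/* Insert the pfn with UC protections instead of vm_page_prot. */
	if (vm_insert_pfn_prot(vma, (unsigned long)vmf->virtual_address,
			       example_io_phys >> PAGE_SHIFT,
			       pgprot_noncached(PAGE_READONLY)) == 0)
		return VM_FAULT_NOPAGE;

	return VM_FAULT_SIGBUS;
}

The corresponding VMA would be created with VM_PFNMAP and VM_IO, as the
vvar patch later in this series does.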
Signed-off-by: Andy Lutomirski <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 25 +++++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bad7793788..87ef1d7730ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2080,6 +2080,8 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
+int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, pgprot_t pgprot);
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index c387430f06c3..a29f0b90fc56 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1564,8 +1564,29 @@ out:
int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn)
{
+ return vm_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pfn);
+
+/**
+ * vm_insert_pfn_prot - insert single pfn into user vma with specified pgprot
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pfn: source kernel pfn
+ * @pgprot: pgprot flags for the inserted page
+ *
+ * This is exactly like vm_insert_pfn, except that it allows drivers to
+ * override pgprot on a per-page basis.
+ *
+ * This only makes sense for IO mappings, and it makes no sense for
+ * cow mappings. In general, using multiple vmas is preferable;
+ * vm_insert_pfn_prot should only be used if using multiple VMAs is
+ * impractical.
+ */
+int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, pgprot_t pgprot)
+{
int ret;
- pgprot_t pgprot = vma->vm_page_prot;
/*
* Technically, architectures with pte_special can avoid all these
* restrictions (same for remap_pfn_range). However we would like
@@ -1587,7 +1608,7 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
return ret;
}
-EXPORT_SYMBOL(vm_insert_pfn);
+EXPORT_SYMBOL(vm_insert_pfn_prot);
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn)
--
2.5.0
As we start to do more intelligent things with the vdso at runtime
(as opposed to just at mm initialization time), we'll need to know
which vdso is in use.
In principle, we could guess based on the mm type, but that's
over-complicated and error-prone. Instead, just track it in the mmu
context.
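The payoff comes in the next patches: a fault handler can recover the
image for the faulting mm with something along the lines of

	const struct vdso_image *image = vma->vm_mm->context.vdso_image;

	if (!image)
		return VM_FAULT_SIGBUS;		/* this mm has no vdso */

instead of guessing.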
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vma.c | 1 +
arch/x86/include/asm/mmu.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index b8f69e264ac4..80b021067bd6 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -121,6 +121,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
text_start = addr - image->sym_vvar_start;
current->mm->context.vdso = (void __user *)text_start;
+ current->mm->context.vdso_image = image;
/*
* MAYWRITE to allow gdb to COW and set breakpoints
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 55234d5e7160..1ea0baef1175 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -19,7 +19,8 @@ typedef struct {
#endif
struct mutex lock;
- void __user *vdso;
+ void __user *vdso; /* vdso base address */
+ const struct vdso_image *vdso_image; /* vdso image in use */
atomic_t perf_rdpmc_allowed; /* nonzero if rdpmc is allowed */
} mm_context_t;
--
2.5.0
The old scheme for mapping the vdso text is rather complicated. vdso2c
generates a struct vm_special_mapping and a blank .pages array of the
correct size for each vdso image. Init code in vdso/vma.c populates
the .pages array for each vdso image, and the mapping code selects
the appropriate struct vm_special_mapping.
With .fault, we can use a less roundabout approach: vdso_fault
just returns the appropriate page for the selected vdso image.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vdso2c.h | 7 -------
arch/x86/entry/vdso/vma.c | 26 +++++++++++++++++++-------
arch/x86/include/asm/vdso.h | 3 ---
3 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 0224987556ce..abe961c7c71c 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -150,16 +150,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
}
fprintf(outfile, "\n};\n\n");
- fprintf(outfile, "static struct page *pages[%lu];\n\n",
- mapping_size / 4096);
-
fprintf(outfile, "const struct vdso_image %s = {\n", name);
fprintf(outfile, "\t.data = raw_data,\n");
fprintf(outfile, "\t.size = %lu,\n", mapping_size);
- fprintf(outfile, "\t.text_mapping = {\n");
- fprintf(outfile, "\t\t.name = \"[vdso]\",\n");
- fprintf(outfile, "\t\t.pages = pages,\n");
- fprintf(outfile, "\t},\n");
if (alt_sec) {
fprintf(outfile, "\t.alt = %lu,\n",
(unsigned long)GET_LE(&alt_sec->sh_offset));
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 80b021067bd6..eb50d7c1f161 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -27,13 +27,7 @@ unsigned int __read_mostly vdso64_enabled = 1;
void __init init_vdso_image(const struct vdso_image *image)
{
- int i;
- int npages = (image->size) / PAGE_SIZE;
-
BUG_ON(image->size % PAGE_SIZE != 0);
- for (i = 0; i < npages; i++)
- image->text_mapping.pages[i] =
- virt_to_page(image->data + i*PAGE_SIZE);
apply_alternatives((struct alt_instr *)(image->data + image->alt),
(struct alt_instr *)(image->data + image->alt +
@@ -90,6 +84,24 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
#endif
}
+static int vdso_fault(const struct vm_special_mapping *sm,
+ struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ const struct vdso_image *image = vma->vm_mm->context.vdso_image;
+
+ if (!image || (vmf->pgoff << PAGE_SHIFT) >= image->size)
+ return VM_FAULT_SIGBUS;
+
+ vmf->page = virt_to_page(image->data + (vmf->pgoff << PAGE_SHIFT));
+ get_page(vmf->page);
+ return 0;
+}
+
+static const struct vm_special_mapping text_mapping = {
+ .name = "[vdso]",
+ .fault = vdso_fault,
+};
+
static int map_vdso(const struct vdso_image *image, bool calculate_addr)
{
struct mm_struct *mm = current->mm;
@@ -131,7 +143,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
image->size,
VM_READ|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- &image->text_mapping);
+ &text_mapping);
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index deabaf9759b6..43dc55be524e 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -13,9 +13,6 @@ struct vdso_image {
void *data;
unsigned long size; /* Always a multiple of PAGE_SIZE */
- /* text_mapping.pages is big enough for data/size page pointers */
- struct vm_special_mapping text_mapping;
-
unsigned long alt, alt_len;
long sym_vvar_start; /* Negative offset to the vvar area */
--
2.5.0
Using .fault instead of remap_pfn_range for the vvar mapping is IMO
much less ugly, and it also opens the door to disallowing unprivileged
userspace HPET access on systems with usable TSCs.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vma.c | 97 ++++++++++++++++++++++++++++-------------------
1 file changed, 57 insertions(+), 40 deletions(-)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index eb50d7c1f161..02221e98b83f 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -102,18 +102,69 @@ static const struct vm_special_mapping text_mapping = {
.fault = vdso_fault,
};
+static int vvar_fault(const struct vm_special_mapping *sm,
+ struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ const struct vdso_image *image = vma->vm_mm->context.vdso_image;
+ long sym_offset;
+ int ret = -EFAULT;
+
+ if (!image)
+ return VM_FAULT_SIGBUS;
+
+ sym_offset = (long)(vmf->pgoff << PAGE_SHIFT) +
+ image->sym_vvar_start;
+
+ /*
+ * Sanity check: a symbol offset of zero means that the page
+ * does not exist for this vdso image, not that the page is at
+ * offset zero relative to the text mapping. This should be
+ * impossible here, because sym_offset should only be zero for
+ * the page past the end of the vvar mapping.
+ */
+ if (sym_offset == 0)
+ return VM_FAULT_SIGBUS;
+
+ if (sym_offset == image->sym_vvar_page) {
+ ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
+ __pa_symbol(&__vvar_page) >> PAGE_SHIFT);
+ } else if (sym_offset == image->sym_hpet_page) {
+#ifdef CONFIG_HPET_TIMER
+ if (hpet_address) {
+ ret = vm_insert_pfn_prot(
+ vma,
+ (unsigned long)vmf->virtual_address,
+ hpet_address >> PAGE_SHIFT,
+ pgprot_noncached(PAGE_READONLY));
+ }
+#endif
+ } else if (sym_offset == image->sym_pvclock_page) {
+ struct pvclock_vsyscall_time_info *pvti =
+ pvclock_pvti_cpu0_va();
+ if (pvti) {
+ ret = vm_insert_pfn(
+ vma,
+ (unsigned long)vmf->virtual_address,
+ __pa(pvti) >> PAGE_SHIFT);
+ }
+ }
+
+ if (ret == 0)
+ return VM_FAULT_NOPAGE;
+
+ return VM_FAULT_SIGBUS;
+}
+
static int map_vdso(const struct vdso_image *image, bool calculate_addr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long addr, text_start;
int ret = 0;
- static struct page *no_pages[] = {NULL};
- static struct vm_special_mapping vvar_mapping = {
+ static const struct vm_special_mapping vvar_mapping = {
.name = "[vvar]",
- .pages = no_pages,
+ .fault = vvar_fault,
};
- struct pvclock_vsyscall_time_info *pvti;
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -153,7 +204,8 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
vma = _install_special_mapping(mm,
addr,
-image->sym_vvar_start,
- VM_READ|VM_MAYREAD,
+ VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
+ VM_PFNMAP,
&vvar_mapping);
if (IS_ERR(vma)) {
@@ -161,41 +213,6 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
goto up_fail;
}
- if (image->sym_vvar_page)
- ret = remap_pfn_range(vma,
- text_start + image->sym_vvar_page,
- __pa_symbol(&__vvar_page) >> PAGE_SHIFT,
- PAGE_SIZE,
- PAGE_READONLY);
-
- if (ret)
- goto up_fail;
-
-#ifdef CONFIG_HPET_TIMER
- if (hpet_address && image->sym_hpet_page) {
- ret = io_remap_pfn_range(vma,
- text_start + image->sym_hpet_page,
- hpet_address >> PAGE_SHIFT,
- PAGE_SIZE,
- pgprot_noncached(PAGE_READONLY));
-
- if (ret)
- goto up_fail;
- }
-#endif
-
- pvti = pvclock_pvti_cpu0_va();
- if (pvti && image->sym_pvclock_page) {
- ret = remap_pfn_range(vma,
- text_start + image->sym_pvclock_page,
- __pa(pvti) >> PAGE_SHIFT,
- PAGE_SIZE,
- PAGE_READONLY);
-
- if (ret)
- goto up_fail;
- }
-
up_fail:
if (ret)
current->mm->context.vdso = NULL;
--
2.5.0
It makes me uncomfortable that even modern systems grant every
process direct read access to the HPET.
While fixing this for real without regressing anything is a mess
(unmapping the HPET is tricky because we don't adequately track all
the mappings), we can do almost as well by tracking which vclocks
have ever been used and only allowing pages associated with used
vclocks to be faulted in.
This will cause rogue programs that try to peek at the HPET to get
SIGBUS instead on most systems.
We can't restrict faults to vclock pages that are associated with
the currently selected vclock due to a race: a process could start
to access the HPET for the first time and race against a switch away
from the HPET as the current clocksource. We can't segfault the
process trying to peek at the HPET in this case, even though the
process isn't going to do anything useful with the data.
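Condensed, the mechanism is a bitmask of vclock modes that have ever
been enabled: the timekeeping update path ORs the current vclock mode
into the mask, and the vdso fault handler refuses to map a page whose
vclock bit was never set.  Roughly:

	/* update_vsyscall(): remember every vclock mode ever selected */
	WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << vclock_mode));

	/* vvar_fault(): only expose the HPET if its vclock was ever used */
	if (hpet_address && vclock_was_used(VCLOCK_HPET))
		ret = vm_insert_pfn_prot(vma,
					 (unsigned long)vmf->virtual_address,
					 hpet_address >> PAGE_SHIFT,
					 pgprot_noncached(PAGE_READONLY));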
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/entry/vdso/vma.c | 4 ++--
arch/x86/entry/vsyscall/vsyscall_gtod.c | 9 ++++++++-
arch/x86/include/asm/clocksource.h | 9 +++++----
arch/x86/include/asm/vgtod.h | 6 ++++++
4 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 02221e98b83f..aa4f5b99f1f3 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -130,7 +130,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
__pa_symbol(&__vvar_page) >> PAGE_SHIFT);
} else if (sym_offset == image->sym_hpet_page) {
#ifdef CONFIG_HPET_TIMER
- if (hpet_address) {
+ if (hpet_address && vclock_was_used(VCLOCK_HPET)) {
ret = vm_insert_pfn_prot(
vma,
(unsigned long)vmf->virtual_address,
@@ -141,7 +141,7 @@ static int vvar_fault(const struct vm_special_mapping *sm,
} else if (sym_offset == image->sym_pvclock_page) {
struct pvclock_vsyscall_time_info *pvti =
pvclock_pvti_cpu0_va();
- if (pvti) {
+ if (pvti && vclock_was_used(VCLOCK_PVCLOCK)) {
ret = vm_insert_pfn(
vma,
(unsigned long)vmf->virtual_address,
diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c
index 51e330416995..0fb3a104ac62 100644
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -16,6 +16,8 @@
#include <asm/vgtod.h>
#include <asm/vvar.h>
+int vclocks_used __read_mostly;
+
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);
void update_vsyscall_tz(void)
@@ -26,12 +28,17 @@ void update_vsyscall_tz(void)
void update_vsyscall(struct timekeeper *tk)
{
+ int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+ /* Mark the new vclock used. */
+ BUILD_BUG_ON(VCLOCK_MAX >= 32);
+ WRITE_ONCE(vclocks_used, READ_ONCE(vclocks_used) | (1 << vclock_mode));
+
gtod_write_begin(vdata);
/* copy vsyscall data */
- vdata->vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
+ vdata->vclock_mode = vclock_mode;
vdata->cycle_last = tk->tkr_mono.cycle_last;
vdata->mask = tk->tkr_mono.mask;
vdata->mult = tk->tkr_mono.mult;
diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
index eda81dc0f4ae..d194266acb28 100644
--- a/arch/x86/include/asm/clocksource.h
+++ b/arch/x86/include/asm/clocksource.h
@@ -3,10 +3,11 @@
#ifndef _ASM_X86_CLOCKSOURCE_H
#define _ASM_X86_CLOCKSOURCE_H
-#define VCLOCK_NONE 0 /* No vDSO clock available. */
-#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */
-#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */
-#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
+#define VCLOCK_NONE 0 /* No vDSO clock available. */
+#define VCLOCK_TSC 1 /* vDSO should use vread_tsc. */
+#define VCLOCK_HPET 2 /* vDSO should use vread_hpet. */
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
+#define VCLOCK_MAX 3
struct arch_clocksource_data {
int vclock_mode;
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index f556c4843aa1..e728699db774 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -37,6 +37,12 @@ struct vsyscall_gtod_data {
};
extern struct vsyscall_gtod_data vsyscall_gtod_data;
+extern int vclocks_used;
+static inline bool vclock_was_used(int vclock)
+{
+ return READ_ONCE(vclocks_used) & (1 << vclock);
+}
+
static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
{
unsigned ret;
--
2.5.0
On Thu, 10 Dec 2015 19:21:42 -0800 Andy Lutomirski <[email protected]> wrote:
> From: Andy Lutomirski <[email protected]>
>
> Requiring special mappings to give a list of struct pages is
> inflexible: it prevents sane use of IO memory in a special mapping,
> it's inefficient (it requires arch code to initialize a list of
> struct pages, and it requires the mm core to walk the entire list
> just to figure out how long it is), and it prevents arch code from
> doing anything fancy when a special mapping fault occurs.
>
> Add a .fault method as an alternative to filling in a .pages array.
>
> ...
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -568,10 +568,27 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm)
> }
> #endif
>
> +struct vm_fault;
> +
> struct vm_special_mapping
> {
We may as well fix the code layout while we're in there.
> - const char *name;
> + const char *name; /* The name, e.g. "[vdso]". */
> +
> + /*
> + * If .fault is not provided, this is points to a
s/is//
> + * NULL-terminated array of pages that back the special mapping.
> + *
> + * This must not be NULL unless .fault is provided.
> + */
> struct page **pages;
> +
> + /*
> + * If non-NULL, then this is called to resolve page faults
> + * on the special mapping. If used, .pages is not checked.
> + */
> + int (*fault)(const struct vm_special_mapping *sm,
> + struct vm_area_struct *vma,
> + struct vm_fault *vmf);
> };
>
> enum tlb_flush_reason {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2ce04a649f6b..f717453b1a57 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3030,11 +3030,16 @@ static int special_mapping_fault(struct vm_area_struct *vma,
> pgoff_t pgoff;
> struct page **pages;
>
> - if (vma->vm_ops == &legacy_special_mapping_vmops)
> + if (vma->vm_ops == &legacy_special_mapping_vmops) {
> pages = vma->vm_private_data;
> - else
> - pages = ((struct vm_special_mapping *)vma->vm_private_data)->
> - pages;
> + } else {
> + struct vm_special_mapping *sm = vma->vm_private_data;
> +
> + if (sm->fault)
> + return sm->fault(sm, vma, vmf);
> +
> + pages = sm->pages;
> + }
>
> for (pgoff = vmf->pgoff; pgoff && *pages; ++pages)
> pgoff--;
Otherwise looks OK. I'll assume this will be merged via an x86 tree.
On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski <[email protected]> wrote:
> The x86 vvar mapping contains pages with differing cacheability
> flags. This is currently only supported using (io_)remap_pfn_range,
> but those functions can't be used inside page faults.
Foggy. What does "support" mean here?
> Add vm_insert_pfn_prot to support varying cacheability within the
> same non-COW VMA in a more sane manner.
Here, "support" presumably means "insertion of pfns". Can we spell all
this out more completely please?
> x86 needs this to avoid a CRIU-breaking and memory-wasting explosion
> of VMAs when supporting userspace access to the HPET.
>
Otherwise Ack.
On Fri, Dec 11, 2015 at 2:33 PM, Andrew Morton
<[email protected]> wrote:
> On Thu, 10 Dec 2015 19:21:43 -0800 Andy Lutomirski <[email protected]> wrote:
>
>> The x86 vvar mapping contains pages with differing cacheability
>> flags. This is currently only supported using (io_)remap_pfn_range,
>> but those functions can't be used inside page faults.
>
> Foggy. What does "support" mean here?
We currently have a hack in which every x86 mm has a "vvar" vma that
has a .fault handler that always fails (it's the vm_special_mapping
fault handler backed by an empty pages array). To make everything
work, at mm startup, the vdso code uses remap_pfn_range and
io_remap_pfn_range to poke the pfns into the page tables.
I'd much rather implement this using the new .fault mechanism.  The
canonical way to implement .fault seems to be vm_insert_pfn, but
vm_insert_pfn doesn't allow setting per-page cacheability.
Unfortunately, one of the three x86 vvar pages needs to be uncacheable
because it's a genuine IO page, so I can't use vm_insert_pfn.
I suppose I could just call io_remap_pfn_range from .fault, but I
think that's frowned upon. Admittedly, I wasn't really sure *why*
that's frowned upon. This goes way back to 2007
(e0dc0d8f4a327d033bfb63d43f113d5f31d11b3c) when .fault got fancier.
>
>> Add vm_insert_pfn_prot to support varying cacheability within the
>> same non-COW VMA in a more sane manner.
>
> Here, "support" presumably means "insertion of pfns". Can we spell all
> this out more completely please?
Yes, will fix.
>
>> x86 needs this to avoid a CRIU-breaking and memory-wasting explosion
>> of VMAs when supporting userspace access to the HPET.
>>
>
> Otherwise Ack.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
* Andrew Morton <[email protected]> wrote:
> > + } else {
> > + struct vm_special_mapping *sm = vma->vm_private_data;
> > +
> > + if (sm->fault)
> > + return sm->fault(sm, vma, vmf);
> > +
> > + pages = sm->pages;
> > + }
> >
> > for (pgoff = vmf->pgoff; pgoff && *pages; ++pages)
> > pgoff--;
>
> Otherwise looks OK. I'll assume this will be merged via an x86 tree.
Yeah, was hoping to be able to do that with your Acked-by.
Thanks,
Ingo