2015-07-14 15:39:38

by Konstantin Khlebnikov

Subject: [PATCHSET v4 0/5] pagemap: make usable for non-privileged users

This patchset makes pagemap usable again in a safe way (after the Rowhammer
bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged
users but hides PFNs from them.

It also adds a 'map-exclusive' bit, which is set if the page is mapped only
here: this helps in estimating the working set without exposing PFNs and
allows distinguishing CoWed from non-CoWed private anonymous pages.

The second patch removes the page-shift bits and completes the migration to
the new pagemap format: the soft-dirty and mmap-exclusive flags are available
only in the new format.
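
For readers who want to poke at the new format, here is a minimal userspace
sketch that decodes one pagemap entry (the bit positions are the ones this
series settles on; the helper itself is illustrative, and with patch 4
applied the PFN field simply reads back as zero for unprivileged users):

#include <stdint.h>
#include <stdio.h>

#define PM_PFRAME_MASK		((1ULL << 55) - 1)	/* bits 0-54 */
#define PM_SOFT_DIRTY		(1ULL << 55)
#define PM_MMAP_EXCLUSIVE	(1ULL << 56)
#define PM_FILE			(1ULL << 61)
#define PM_SWAP			(1ULL << 62)
#define PM_PRESENT		(1ULL << 63)

static void decode_pme(uint64_t pme)
{
	if (pme & PM_PRESENT)
		printf("present pfn=%llu%s%s%s\n",
		       (unsigned long long)(pme & PM_PFRAME_MASK),
		       (pme & PM_FILE) ? " file" : "",
		       (pme & PM_MMAP_EXCLUSIVE) ? " exclusive" : "",
		       (pme & PM_SOFT_DIRTY) ? " soft-dirty" : "");
	else if (pme & PM_SWAP)
		printf("swapped type=%llu offset=%llu\n",
		       (unsigned long long)(pme & 0x1f),	/* bits 0-4 */
		       (unsigned long long)((pme & PM_PFRAME_MASK) >> 5));
	else
		printf("not present\n");
}

int main(void)
{
	decode_pme(PM_PRESENT | PM_FILE | PM_MMAP_EXCLUSIVE | 12345);
	return 0;
}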

Changes since v3:
* patches reordered: cleanup now in second patch
* update pagemap for hugetlb, add missing 'FILE' bit
* fix PM_PFRAME_BITS: it's 55, not 54 as in previous versions

---

Konstantin Khlebnikov (5):
pagemap: check permissions and capabilities at open time
pagemap: switch to the new format and do some cleanup
pagemap: rework hugetlb and thp report
pagemap: hide physical addresses from non-privileged users
pagemap: add mmap-exclusive bit for marking pages mapped only here


Documentation/vm/pagemap.txt | 3
fs/proc/task_mmu.c | 267 ++++++++++++++++++------------------------
tools/vm/page-types.c | 35 +++---
3 files changed, 137 insertions(+), 168 deletions(-)

--
Konstantin


2015-07-14 15:39:37

by Konstantin Khlebnikov

Subject: [PATCH v4 1/5] pagemap: check permissions and capabilities at open time

This patch moves permission checks from pagemap_read() into pagemap_open().

A pointer to the mm is saved in file->private_data. This reference pins only
the mm_struct itself; /proc/*/mem, maps, and smaps already work the same way.
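
As a toy model of the two reference counts this relies on (illustrative
userspace code, not the kernel's): the reference taken at open time pins only
the mm_struct allocation (released with mmdrop), while each read briefly
raises mm_users (the atomic_inc_not_zero(&mm->mm_users) in the hunk below)
and bails out once the owner task has exited:

#include <stdatomic.h>
#include <stdbool.h>

struct toy_mm {
	atomic_int mm_count;	/* pins the struct itself (cf. mmdrop) */
	atomic_int mm_users;	/* pins the address space (cf. mmput) */
};

/*
 * read() side: take a users reference only if the address space is
 * still alive, mirroring atomic_inc_not_zero() in pagemap_read().
 */
static bool toy_mmget_not_zero(struct toy_mm *mm)
{
	int users = atomic_load(&mm->mm_users);

	while (users != 0)
		if (atomic_compare_exchange_weak(&mm->mm_users,
						 &users, users + 1))
			return true;
	return false;
}

int main(void)
{
	struct toy_mm mm = { .mm_count = 1, .mm_users = 1 };

	return toy_mmget_not_zero(&mm) ? 0 : 1;
}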

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
---
fs/proc/task_mmu.c | 48 ++++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091881d4..270bf7cbc8a5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1227,40 +1227,33 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task = get_proc_task(file_inode(file));
- struct mm_struct *mm;
+ struct mm_struct *mm = file->private_data;
struct pagemapread pm;
- int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
- int copied = 0;
+ int ret = 0, copied = 0;

- if (!task)
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
goto out;

ret = -EINVAL;
/* file position must be aligned */
if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
- goto out_task;
+ goto out_mm;

ret = 0;
if (!count)
- goto out_task;
+ goto out_mm;

pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
if (!pm.buffer)
- goto out_task;
-
- mm = mm_access(task, PTRACE_MODE_READ);
- ret = PTR_ERR(mm);
- if (!mm || IS_ERR(mm))
- goto out_free;
+ goto out_mm;

pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -1273,10 +1266,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
src = *ppos;
svpfn = src / PM_ENTRY_BYTES;
start_vaddr = svpfn << PAGE_SHIFT;
- end_vaddr = TASK_SIZE_OF(task);
+ end_vaddr = mm->task_size;

/* watch out for wraparound */
- if (svpfn > TASK_SIZE_OF(task) >> PAGE_SHIFT)
+ if (svpfn > mm->task_size >> PAGE_SHIFT)
start_vaddr = end_vaddr;

/*
@@ -1303,7 +1296,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
len = min(count, PM_ENTRY_BYTES * pm.pos);
if (copy_to_user(buf, pm.buffer, len)) {
ret = -EFAULT;
- goto out_mm;
+ goto out_free;
}
copied += len;
buf += len;
@@ -1313,24 +1306,38 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!ret || ret == PM_END_OF_BUFFER)
ret = copied;

-out_mm:
- mmput(mm);
out_free:
kfree(pm.buffer);
-out_task:
- put_task_struct(task);
+out_mm:
+ mmput(mm);
out:
return ret;
}

static int pagemap_open(struct inode *inode, struct file *file)
{
+ struct mm_struct *mm;
+
/* do not disclose physical addresses: attack vector */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
"to stop being page-shift some time soon. See the "
"linux/Documentation/vm/pagemap.txt for details.\n");
+
+ mm = proc_mem_open(inode, PTRACE_MODE_READ);
+ if (IS_ERR(mm))
+ return PTR_ERR(mm);
+ file->private_data = mm;
+ return 0;
+}
+
+static int pagemap_release(struct inode *inode, struct file *file)
+{
+ struct mm_struct *mm = file->private_data;
+
+ if (mm)
+ mmdrop(mm);
return 0;
}

@@ -1338,6 +1345,7 @@ const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
.open = pagemap_open,
+ .release = pagemap_release,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */

2015-07-14 15:46:16

by Konstantin Khlebnikov

Subject: [PATCH v4 2/5] pagemap: switch to the new format and do some cleanup

This patch removes the page-shift bits (scheduled for removal since 3.11) and
completes the migration to the new bit layout. It also cleans up the messy
macros.
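
The resulting layout can be checked standalone; a quick sanity test with
userspace stand-ins for the kernel's GENMASK_ULL/BIT_ULL helpers (the
assertions are mine, the constants are the patch's):

#include <stdint.h>

#define GENMASK_ULL(h, l) \
	(((~0ULL) >> (63 - (h))) & ((~0ULL) << (l)))
#define BIT_ULL(n)	(1ULL << (n))

#define PM_PFRAME_BITS	55
#define PM_PFRAME_MASK	GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
#define PM_SOFT_DIRTY	BIT_ULL(55)
#define PM_FILE		BIT_ULL(61)
#define PM_SWAP		BIT_ULL(62)
#define PM_PRESENT	BIT_ULL(63)

/* the PFN field is bits 0-54 and no flag overlaps it */
_Static_assert(PM_PFRAME_MASK == (1ULL << 55) - 1, "pfn field is bits 0-54");
_Static_assert((PM_SOFT_DIRTY & PM_PFRAME_MASK) == 0, "bit 55 is a flag");
_Static_assert(((PM_FILE | PM_SWAP | PM_PRESENT) >> 61) == 7, "top three flags");

int main(void) { return 0; }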

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/proc/task_mmu.c | 150 +++++++++++++++++--------------------------------
tools/vm/page-types.c | 25 +++-----
2 files changed, 61 insertions(+), 114 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 270bf7cbc8a5..c05db6acdc35 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

-/*
- * We do not want to have constant page-shift bits sitting in
- * pagemap entries and are about to reuse them some time soon.
- *
- * Here's the "migration strategy":
- * 1. when the system boots these bits remain what they are,
- * but a warning about future change is printed in log;
- * 2. once anyone clears soft-dirty bits via clear_refs file,
- * these flag is set to denote, that user is aware of the
- * new API and those page-shift bits change their meaning.
- * The respective warning is printed in dmesg;
- * 3. In a couple of releases we will remove all the mentions
- * of page-shift in pagemap entries.
- */
-
-static bool soft_dirty_cleared __read_mostly;
-
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
@@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;

- if (type == CLEAR_REFS_SOFT_DIRTY) {
- soft_dirty_cleared = true;
- pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
- " See the linux/Documentation/vm/pagemap.txt for "
- "details.\n");
- }
-
task = get_proc_task(file_inode(file));
if (!task)
return -ESRCH;
@@ -961,36 +937,24 @@ typedef struct {
struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
- bool v2;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
#define PAGEMAP_WALK_MASK (PMD_MASK)

-#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-/* in "new" pagemap pshift bits are occupied with more status bits */
-#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
-
-#define __PM_SOFT_DIRTY (1LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
+#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
+#define PM_SOFT_DIRTY BIT_ULL(55)
+#define PM_FILE BIT_ULL(61)
+#define PM_SWAP BIT_ULL(62)
+#define PM_PRESENT BIT_ULL(63)
+
#define PM_END_OF_BUFFER 1

-static inline pagemap_entry_t make_pme(u64 val)
+static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
{
- return (pagemap_entry_t) { .pme = val };
+ return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
}

static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
@@ -1011,7 +975,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

while (addr < end) {
struct vm_area_struct *vma = find_vma(walk->mm, addr);
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+ pagemap_entry_t pme = make_pme(0, 0);
/* End of address space hole, which we mark as non-present. */
unsigned long hole_end;

@@ -1031,7 +995,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

/* Addresses in the VMA. */
if (vma->vm_flags & VM_SOFTDIRTY)
- pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
+ pme = make_pme(0, PM_SOFT_DIRTY);
for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
err = add_to_pagemap(addr, &pme, pm);
if (err)
@@ -1042,63 +1006,61 @@ out:
return err;
}

-static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
+static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame, flags;
+ u64 frame = 0, flags = 0;
struct page *page = NULL;
- int flags2 = 0;

if (pte_present(pte)) {
frame = pte_pfn(pte);
- flags = PM_PRESENT;
+ flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
entry = pte_to_swp_entry(pte);
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
- flags = PM_SWAP;
+ flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
- } else {
- if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 |= __PM_SOFT_DIRTY;
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
- return;
}

if (page && !PageAnon(page))
flags |= PM_FILE;
- if ((vma->vm_flags & VM_SOFTDIRTY))
- flags2 |= __PM_SOFT_DIRTY;
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ flags |= PM_SOFT_DIRTY;

- *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
+ return make_pme(frame, flags);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
+ u64 frame = 0;
+
/*
* Currently pmd for thp is always present because thp can not be
* swapped-out, migrated, or HWPOISONed (split in such cases instead.)
* This if-check is just to prepare for future implementation.
*/
- if (pmd_present(pmd))
- *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
- | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
- else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
+ if (pmd_present(pmd)) {
+ frame = pmd_pfn(pmd) + offset;
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}
#else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
+ return make_pme(0, 0);
}
#endif

@@ -1112,12 +1074,10 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
int err = 0;

if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- int pmd_flags2;
+ u64 flags = 0;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
- pmd_flags2 = __PM_SOFT_DIRTY;
- else
- pmd_flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

for (; addr != end; addr += PAGE_SIZE) {
unsigned long offset;
@@ -1125,7 +1085,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
+ pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1145,7 +1105,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

- pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+ pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1158,16 +1118,17 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pte_t pte, int offset, int flags2)
+static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
+ pte_t pte, int offset, u64 flags)
{
- if (pte_present(pte))
- *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) |
- PM_STATUS2(pm->v2, flags2) |
- PM_PRESENT);
- else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
- PM_STATUS2(pm->v2, flags2));
+ u64 frame = 0;
+
+ if (pte_present(pte)) {
+ frame = pte_pfn(pte) + offset;
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}

/* This function walks within one hugetlb entry in the single call */
@@ -1178,17 +1139,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
int err = 0;
- int flags2;
+ u64 flags = 0;
pagemap_entry_t pme;

if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 = __PM_SOFT_DIRTY;
- else
- flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
- huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
+ pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
@@ -1209,7 +1168,8 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bits 56-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
@@ -1248,7 +1208,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!count)
goto out_mm;

- pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1321,9 +1280,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
/* do not disclose physical addresses: attack vector */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
- pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
- "to stop being page-shift some time soon. See the "
- "linux/Documentation/vm/pagemap.txt for details.\n");

mm = proc_mem_open(inode, PTRACE_MODE_READ);
if (IS_ERR(mm))
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 8bdf16b8ba60..603ec916716b 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -57,23 +57,14 @@
* pagemap kernel ABI bits
*/

-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define __PM_SOFT_DIRTY (1LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
-
+#define PM_ENTRY_BYTES 8
+#define PM_PFRAME_BITS 55
+#define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1)
+#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+#define PM_SOFT_DIRTY (1ULL << 55)
+#define PM_FILE (1ULL << 61)
+#define PM_SWAP (1ULL << 62)
+#define PM_PRESENT (1ULL << 63)

/*
* kernel page flags

2015-07-14 15:38:33

by Konstantin Khlebnikov

Subject: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

This patch moves pmd dissection out of the reporting loop: huge pages
are reported as a bunch of normal pages with contiguous PFNs.

Add the missing "FILE" bit for hugetlb vmas.
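
A quick best-effort userspace check of the contiguous-PFN reporting (PFNs are
only visible with CAP_SYS_ADMIN; THP backing and a 4K page size are assumed,
so treat this as a smoke test rather than a reliable assertion):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD on x86-64 */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t pme[2];
	int fd;

	madvise(p, len, MADV_HUGEPAGE);
	memset(p, 1, len);		/* fault the range in */

	fd = open("/proc/self/pagemap", O_RDONLY);
	pread(fd, pme, sizeof(pme), (uintptr_t)p / 4096 * 8);
	printf("pfn[0]=%llu pfn[1]=%llu (consecutive if backed by a THP)\n",
	       (unsigned long long)(pme[0] & ((1ULL << 55) - 1)),
	       (unsigned long long)(pme[1] & ((1ULL << 55) - 1)));
	close(fd);
	return 0;
}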

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/proc/task_mmu.c | 100 +++++++++++++++++++++++-----------------------------
1 file changed, 44 insertions(+), 56 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c05db6acdc35..040721fa405a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1038,33 +1038,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
return make_pme(frame, flags);
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
- pmd_t pmd, int offset, u64 flags)
-{
- u64 frame = 0;
-
- /*
- * Currently pmd for thp is always present because thp can not be
- * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
- * This if-check is just to prepare for future implementation.
- */
- if (pmd_present(pmd)) {
- frame = pmd_pfn(pmd) + offset;
- flags |= PM_PRESENT;
- }
-
- return make_pme(frame, flags);
-}
-#else
-static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
- pmd_t pmd, int offset, u64 flags)
-{
- return make_pme(0, 0);
-}
-#endif
-
-static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
@@ -1073,35 +1047,48 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte, *orig_pte;
int err = 0;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- u64 flags = 0;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_trans_huge_lock(pmdp, vma, &ptl) == 1) {
+ u64 flags = 0, frame = 0;
+ pmd_t pmd = *pmdp;

- if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
+ if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;

+ /*
+ * Currently pmd for thp is always present because thp
+ * can not be swapped-out, migrated, or HWPOISONed
+ * (split in such cases instead.)
+ * This if-check is just to prepare for future implementation.
+ */
+ if (pmd_present(pmd)) {
+ flags |= PM_PRESENT;
+ frame = pmd_pfn(pmd) +
+ ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ }
+
for (; addr != end; addr += PAGE_SIZE) {
- unsigned long offset;
- pagemap_entry_t pme;
+ pagemap_entry_t pme = make_pme(frame, flags);

- offset = (addr & ~PAGEMAP_WALK_MASK) >>
- PAGE_SHIFT;
- pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
+ if (flags & PM_PRESENT)
+ frame++;
}
spin_unlock(ptl);
return err;
}

- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmdp))
return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
* We can assume that @vma always points to a valid one and @end never
* goes beyond vma->vm_end.
*/
- orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

@@ -1118,39 +1105,40 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

#ifdef CONFIG_HUGETLB_PAGE
-static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
- pte_t pte, int offset, u64 flags)
-{
- u64 frame = 0;
-
- if (pte_present(pte)) {
- frame = pte_pfn(pte) + offset;
- flags |= PM_PRESENT;
- }
-
- return make_pme(frame, flags);
-}
-
/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
+static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
+ u64 flags = 0, frame = 0;
int err = 0;
- u64 flags = 0;
- pagemap_entry_t pme;
+ pte_t pte;

if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

+ pte = huge_ptep_get(ptep);
+ if (pte_present(pte)) {
+ struct page *page = pte_page(pte);
+
+ if (!PageAnon(page))
+ flags |= PM_FILE;
+
+ flags |= PM_PRESENT;
+ frame = pte_pfn(pte) +
+ ((addr & ~hmask) >> PAGE_SHIFT);
+ }
+
for (; addr != end; addr += PAGE_SIZE) {
- int offset = (addr & ~hmask) >> PAGE_SHIFT;
- pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
+ pagemap_entry_t pme = make_pme(frame, flags);
+
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
+ if (flags & PM_PRESENT)
+ frame++;
}

cond_resched();
@@ -1214,7 +1202,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!pm.buffer)
goto out_mm;

- pagemap_walk.pmd_entry = pagemap_pte_range;
+ pagemap_walk.pmd_entry = pagemap_pmd_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
#ifdef CONFIG_HUGETLB_PAGE
pagemap_walk.hugetlb_entry = pagemap_hugetlb_range;

2015-07-14 15:38:32

by Konstantin Khlebnikov

Subject: [PATCH v4 4/5] pagemap: hide physical addresses from non-privileged users

This patch makes pagemap readable for normal users but hides physical
addresses from them; for some use-cases the PFN isn't required at all.
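
A minimal unprivileged demonstration of the new behaviour (with this patch
the open succeeds and the PFN field reads back as zero; the flag bits are
still reported):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t pme = 0;
	int fd = open("/proc/self/pagemap", O_RDONLY);	/* no longer -EPERM */
	uintptr_t va = (uintptr_t)&pme;		/* a stack page, just touched */

	pread(fd, &pme, sizeof(pme), va / sysconf(_SC_PAGESIZE) * 8);
	printf("present=%d pfn=%llu (zero without CAP_SYS_ADMIN)\n",
	       (int)(pme >> 63),
	       (unsigned long long)(pme & ((1ULL << 55) - 1)));
	close(fd);
	return 0;
}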

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Link: http://lkml.kernel.org/r/[email protected]
---
fs/proc/task_mmu.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 040721fa405a..3a5d338ea219 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -937,6 +937,7 @@ typedef struct {
struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
+ bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
@@ -1013,7 +1014,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
struct page *page = NULL;

if (pte_present(pte)) {
- frame = pte_pfn(pte);
+ if (pm->show_pfn)
+ frame = pte_pfn(pte);
flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
@@ -1063,8 +1065,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
*/
if (pmd_present(pmd)) {
flags |= PM_PRESENT;
- frame = pmd_pfn(pmd) +
- ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+ if (pm->show_pfn)
+ frame = pmd_pfn(pmd) +
+ ((addr & ~PMD_MASK) >> PAGE_SHIFT);
}

for (; addr != end; addr += PAGE_SIZE) {
@@ -1073,7 +1076,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
- if (flags & PM_PRESENT)
+ if (pm->show_pfn && (flags & PM_PRESENT))
frame++;
}
spin_unlock(ptl);
@@ -1127,8 +1130,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
flags |= PM_FILE;

flags |= PM_PRESENT;
- frame = pte_pfn(pte) +
- ((addr & ~hmask) >> PAGE_SHIFT);
+ if (pm->show_pfn)
+ frame = pte_pfn(pte) +
+ ((addr & ~hmask) >> PAGE_SHIFT);
}

for (; addr != end; addr += PAGE_SIZE) {
@@ -1137,7 +1141,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
- if (flags & PM_PRESENT)
+ if (pm->show_pfn && (flags & PM_PRESENT))
frame++;
}

@@ -1196,6 +1200,9 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!count)
goto out_mm;

+ /* do not disclose physical addresses: attack vector */
+ pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
+
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1265,10 +1272,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
{
struct mm_struct *mm;

- /* do not disclose physical addresses: attack vector */
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
mm = proc_mem_open(inode, PTRACE_MODE_READ);
if (IS_ERR(mm))
return PTR_ERR(mm);

2015-07-14 15:38:30

by Konstantin Khlebnikov

Subject: [PATCH v4 5/5] pagemap: add mmap-exclusive bit for marking pages mapped only here

This patch sets bit 56 in pagemap if the page is mapped only once.
It allows detecting exclusively used pages without exposing the PFN:

present  file  exclusive  state
   0       0       0      non-present
   1       1       0      file page mapped somewhere else
   1       1       1      file page mapped only here
   1       0       0      anon non-CoWed page (shared with parent/child)
   1       0       1      anon CoWed page (or never forked)

CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

The mmap-exclusive bit doesn't reflect potential page-sharing via the
swapcache: a page could be mapped once but have several swap-ptes which
point to it. An application can detect that via the swap bit in the pagemap
entry and touch the pte through /proc/pid/mem to get the real information.
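
The table above translates directly into a decode helper; a short sketch
(the bit numbers are from this patch, the state strings from the table):

#include <stdint.h>
#include <stdio.h>

static const char *pme_state(uint64_t pme)
{
	int file = !!(pme & (1ULL << 61));	/* PM_FILE */
	int excl = !!(pme & (1ULL << 56));	/* PM_MMAP_EXCLUSIVE */

	if (!(pme & (1ULL << 63)))		/* PM_PRESENT */
		return "non-present";
	if (file)
		return excl ? "file page mapped only here"
			    : "file page mapped somewhere else";
	return excl ? "anon CoWed page (or never forked)"
		    : "anon non-CoWed page (shared with parent/child)";
}

int main(void)
{
	printf("%s\n", pme_state((1ULL << 63) | (1ULL << 56)));
	return 0;
}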

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Requested-by: Mark Williamson <[email protected]>
Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
---
Documentation/vm/pagemap.txt | 3 ++-
fs/proc/task_mmu.c | 14 +++++++++++++-
tools/vm/page-types.c | 10 ++++++++++
3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..3cfbbb333ea1 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,8 @@ There are three components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
- * Bits 56-60 zero
+ * Bit 56 page exlusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3a5d338ea219..bac4c97f8ff8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -947,6 +947,7 @@ struct pagemapread {
#define PM_PFRAME_BITS 55
#define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
#define PM_SOFT_DIRTY BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
#define PM_FILE BIT_ULL(61)
#define PM_SWAP BIT_ULL(62)
#define PM_PRESENT BIT_ULL(63)
@@ -1034,6 +1035,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,

if (page && !PageAnon(page))
flags |= PM_FILE;
+ if (page && page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

@@ -1064,6 +1067,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
* This if-check is just to prepare for future implementation.
*/
if (pmd_present(pmd)) {
+ struct page *page = pmd_page(pmd);
+
+ if (page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
+
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
@@ -1129,6 +1137,9 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
if (!PageAnon(page))
flags |= PM_FILE;

+ if (page_mapcount(page) == 1)
+ flags |= PM_MMAP_EXCLUSIVE;
+
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
@@ -1161,7 +1172,8 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
- * Bits 56-60 zero
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 603ec916716b..7f73fa32a590 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -62,6 +62,7 @@
#define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1)
#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
#define PM_SOFT_DIRTY (1ULL << 55)
+#define PM_MMAP_EXCLUSIVE (1ULL << 56)
#define PM_FILE (1ULL << 61)
#define PM_SWAP (1ULL << 62)
#define PM_PRESENT (1ULL << 63)
@@ -91,6 +92,8 @@
#define KPF_SLOB_FREE 49
#define KPF_SLUB_FROZEN 50
#define KPF_SLUB_DEBUG 51
+#define KPF_FILE 62
+#define KPF_MMAP_EXCLUSIVE 63

#define KPF_ALL_BITS ((uint64_t)~0ULL)
#define KPF_HACKERS_BITS (0xffffULL << 32)
@@ -140,6 +143,9 @@ static const char * const page_flag_names[] = {
[KPF_SLOB_FREE] = "P:slob_free",
[KPF_SLUB_FROZEN] = "A:slub_frozen",
[KPF_SLUB_DEBUG] = "E:slub_debug",
+
+ [KPF_FILE] = "F:file",
+ [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive",
};


@@ -443,6 +449,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)

if (pme & PM_SOFT_DIRTY)
flags |= BIT(SOFTDIRTY);
+ if (pme & PM_FILE)
+ flags |= BIT(FILE);
+ if (pme & PM_MMAP_EXCLUSIVE)
+ flags |= BIT(MMAP_EXCLUSIVE);

return flags;
}

2015-07-14 18:52:55

by Andrew Morton

Subject: Re: [PATCHSET v4 0/5] pagemap: make usable for non-privileged users

On Tue, 14 Jul 2015 18:37:34 +0300 Konstantin Khlebnikov <[email protected]> wrote:

> This patchset makes pagemap usable again in a safe way (after the Rowhammer
> bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged
> users but hides PFNs from them.

Documentation/vm/pagemap.txt hasn't been updated to describe these
privilege issues?

> It also adds a 'map-exclusive' bit, which is set if the page is mapped only
> here: this helps in estimating the working set without exposing PFNs and
> allows distinguishing CoWed from non-CoWed private anonymous pages.
>
> The second patch removes the page-shift bits and completes the migration to
> the new pagemap format: the soft-dirty and mmap-exclusive flags are
> available only in the new format.

I'm not really seeing a description of the new format in these
changelogs. Precisely what got removed, what got added and which
capabilities change the output in what manner?

2015-07-14 20:15:45

by Konstantin Khlebnikov

Subject: Re: [PATCHSET v4 0/5] pagemap: make usable for non-privileged users

On Tue, Jul 14, 2015 at 9:52 PM, Andrew Morton
<[email protected]> wrote:
> On Tue, 14 Jul 2015 18:37:34 +0300 Konstantin Khlebnikov <[email protected]> wrote:
>
>> This patchset makes pagemap usable again in a safe way (after the Rowhammer
>> bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged
>> users but hides PFNs from them.
>
> Documentation/vm/pagemap.txt hasn't been updated to describe these
> privilege issues?

Will do. Too much time passed between versions; I planned to do it but forgot.

>
>> It also adds a 'map-exclusive' bit, which is set if the page is mapped only
>> here: this helps in estimating the working set without exposing PFNs and
>> allows distinguishing CoWed from non-CoWed private anonymous pages.
>>
>> The second patch removes the page-shift bits and completes the migration to
>> the new pagemap format: the soft-dirty and mmap-exclusive flags are
>> available only in the new format.
>
> I'm not really seeing a description of the new format in these
> changelogs. Precisely what got removed, what got added and which
> capabilities change the output in what manner?

Now the PFN field (bits 0-54) is zero if the task that opened pagemap has no
CAP_SYS_ADMIN (system-wide).

In the v2 format the page-shift bits (55-60) are now used for flags:
55 - soft-dirty (added for checkpoint-restore, I guess)
56 - mmap-exclusive (added in the last patch)
57-60 - free for use

I'll document the history of these changes.


2015-07-16 18:47:48

by Konstantin Khlebnikov

Subject: [PATCH] pagemap: update documentation

Notes about recent changes.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
Documentation/vm/pagemap.txt | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 3cfbbb333ea1..aab39aa7dd8f 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,12 +16,17 @@ There are three components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
- * Bit 56 page exlusively mapped
+ * Bit 56 page exclusively mapped (since 4.2)
* Bits 57-60 zero
- * Bit 61 page is file-page or shared-anon
+ * Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present

+ Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs:
+ for unprivileged users from 4.0 till 4.2 open fails with -EPERM, starting
+ from from 4.2 PFN field is zeroed if user has no CAP_SYS_ADMIN capability.
+ Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
+
If the page is not present but in swap, then the PFN contains an
encoding of the swap file number and the page's offset into the
swap. Unmapped pages return a null PFN. This allows determining
@@ -160,3 +165,8 @@ Other notes:
Reading from any of the files will return -EINVAL if you are not starting
the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
into the file), or if the size of the read is not a multiple of 8 bytes.
+
+Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
+always 12 at most architectures). Since Linux 3.11 their meaning changes
+after first clear of soft-dirty bits. Since Linux 4.2 they are used for
+flags unconditionally.
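
Putting the documented rules together (8-byte entries, reads must start on an
8-byte boundary), a sketch of fetching the entry that covers an arbitrary
virtual address of the current process (the helper name and error handling
are illustrative):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* offset of the entry for va: (va / page_size) * 8, aligned by construction */
static uint64_t pagemap_entry(int fd, uintptr_t va)
{
	uint64_t pme = 0;
	long page_size = sysconf(_SC_PAGESIZE);

	if (pread(fd, &pme, sizeof(pme),
		  (off_t)(va / page_size) * 8) != sizeof(pme))
		return 0;
	return pme;
}

int main(void)
{
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int x = 42;

	printf("entry=0x%016llx\n",
	       (unsigned long long)pagemap_entry(fd, (uintptr_t)&x));
	close(fd);
	return 0;
}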

2015-07-19 11:10:44

by Kirill A. Shutemov

Subject: Re: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

On Tue, Jul 14, 2015 at 06:37:39PM +0300, Konstantin Khlebnikov wrote:
> @@ -1073,35 +1047,48 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t *pte, *orig_pte;
> int err = 0;
>
> - if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> - u64 flags = 0;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + if (pmd_trans_huge_lock(pmdp, vma, &ptl) == 1) {

#ifdef is redundant. pmd_trans_huge_lock() always returns 0 for !THP.

--
Kirill A. Shutemov

2015-07-21 08:35:41

by Naoya Horiguchi

Subject: Re: [PATCH v4 2/5] pagemap: switch to the new format and do some cleanup

On Tue, Jul 14, 2015 at 06:37:37PM +0300, Konstantin Khlebnikov wrote:
> This patch removes the page-shift bits (scheduled for removal since 3.11)
> and completes the migration to the new bit layout. It also cleans up the
> messy macros.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-21 08:35:50

by Naoya Horiguchi

Subject: Re: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

On Tue, Jul 14, 2015 at 06:37:39PM +0300, Konstantin Khlebnikov wrote:
> This patch moves pmd dissection out of the reporting loop: huge pages
> are reported as a bunch of normal pages with contiguous PFNs.
>
> Add the missing "FILE" bit for hugetlb vmas.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>

With Kirill's comment about the #ifdef reflected, I'm OK with this patch.

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-21 08:35:31

by Naoya Horiguchi

Subject: Re: [PATCH v4 1/5] pagemap: check permissions and capabilities at open time

On Tue, Jul 14, 2015 at 06:37:35PM +0300, Konstantin Khlebnikov wrote:
> This patch moves permission checks from pagemap_read() into pagemap_open().
>
> A pointer to the mm is saved in file->private_data. This reference pins only
> the mm_struct itself; /proc/*/mem, maps, and smaps already work the same way.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-21 08:36:11

by Naoya Horiguchi

Subject: Re: [PATCH v4 4/5] pagemap: hide physical addresses from non-privileged users

On Tue, Jul 14, 2015 at 06:37:47PM +0300, Konstantin Khlebnikov wrote:
> This patch makes pagemap readable for normal users but hides physical
> addresses from them; for some use-cases the PFN isn't required at all.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> fs/proc/task_mmu.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 040721fa405a..3a5d338ea219 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -937,6 +937,7 @@ typedef struct {
> struct pagemapread {
> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
> pagemap_entry_t *buffer;
> + bool show_pfn;
> };
>
> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
> @@ -1013,7 +1014,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> struct page *page = NULL;
>
> if (pte_present(pte)) {
> - frame = pte_pfn(pte);
> + if (pm->show_pfn)
> + frame = pte_pfn(pte);
> flags |= PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> if (pte_soft_dirty(pte))

Don't you need the same if (pm->show_pfn) check in the is_swap_pte path, too?
(although I don't think it can be exploited by a Rowhammer attack ...)

Thanks,
Naoya Horiguchi

2015-07-21 08:36:22

by Naoya Horiguchi

Subject: Re: [PATCH v4 5/5] pagemap: add mmap-exclusive bit for marking pages mapped only here

On Tue, Jul 14, 2015 at 06:37:49PM +0300, Konstantin Khlebnikov wrote:
> This patch sets bit 56 in pagemap if the page is mapped only once.
> It allows detecting exclusively used pages without exposing the PFN:
>
> present  file  exclusive  state
>    0       0       0      non-present
>    1       1       0      file page mapped somewhere else
>    1       1       1      file page mapped only here
>    1       0       0      anon non-CoWed page (shared with parent/child)
>    1       0       1      anon CoWed page (or never forked)
>
> CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.
>
> The mmap-exclusive bit doesn't reflect potential page-sharing via the
> swapcache: a page could be mapped once but have several swap-ptes which
> point to it. An application can detect that via the swap bit in the pagemap
> entry and touch the pte through /proc/pid/mem to get the real information.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Requested-by: Mark Williamson <[email protected]>
> Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-21 08:36:45

by Naoya Horiguchi

Subject: Re: [PATCH] pagemap: update documentation

On Thu, Jul 16, 2015 at 09:47:42PM +0300, Konstantin Khlebnikov wrote:
> Notes about recent changes.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---
> Documentation/vm/pagemap.txt | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 3cfbbb333ea1..aab39aa7dd8f 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -16,12 +16,17 @@ There are three components to pagemap:
> * Bits 0-4 swap type if swapped
> * Bits 5-54 swap offset if swapped
> * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> - * Bit 56 page exlusively mapped
> + * Bit 56 page exclusively mapped (since 4.2)
> * Bits 57-60 zero
> - * Bit 61 page is file-page or shared-anon
> + * Bit 61 page is file-page or shared-anon (since 3.5)
> * Bit 62 page swapped
> * Bit 63 page present
>
> + Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs:
> + for unprivileged users from 4.0 till 4.2 open fails with -EPERM, starting

I'm expecting that this patch will be merged before 4.2 is released, so if that's
right, stating "till 4.2" might be incorrect.

> + from from 4.2 PFN field is zeroed if user has no CAP_SYS_ADMIN capability.

"from" duplicates ...

Thanks,
Naoya Horiguchi

> + Reason: information about PFNs helps in exploiting Rowhammer vulnerability.
> +
> If the page is not present but in swap, then the PFN contains an
> encoding of the swap file number and the page's offset into the
> swap. Unmapped pages return a null PFN. This allows determining
> @@ -160,3 +165,8 @@ Other notes:
> Reading from any of the files will return -EINVAL if you are not starting
> the read on an 8-byte boundary (e.g., if you sought an odd number of bytes
> into the file), or if the size of the read is not a multiple of 8 bytes.
> +
> +Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
> +always 12 at most architectures). Since Linux 3.11 their meaning changes
> +after first clear of soft-dirty bits. Since Linux 4.2 they are used for
> +flags unconditionally.

2015-07-21 08:40:14

by Konstantin Khlebnikov

Subject: Re: [PATCH v4 4/5] pagemap: hide physical addresses from non-privileged users

On Tue, Jul 21, 2015 at 11:11 AM, Naoya Horiguchi
<[email protected]> wrote:
> On Tue, Jul 14, 2015 at 06:37:47PM +0300, Konstantin Khlebnikov wrote:
>> This patch makes pagemap readable for normal users and hides physical
>> addresses from them. For some use-cases PFN isn't required at all.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
>> Link: http://lkml.kernel.org/r/[email protected]
>> ---
>> fs/proc/task_mmu.c | 25 ++++++++++++++-----------
>> 1 file changed, 14 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 040721fa405a..3a5d338ea219 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -937,6 +937,7 @@ typedef struct {
>> struct pagemapread {
>> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
>> pagemap_entry_t *buffer;
>> + bool show_pfn;
>> };
>>
>> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
>> @@ -1013,7 +1014,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> struct page *page = NULL;
>>
>> if (pte_present(pte)) {
>> - frame = pte_pfn(pte);
>> + if (pm->show_pfn)
>> + frame = pte_pfn(pte);
>> flags |= PM_PRESENT;
>> page = vm_normal_page(vma, addr, pte);
>> if (pte_soft_dirty(pte))
>
> Don't you need the same if (pm->show_pfn) check in the is_swap_pte path, too?
> (although I don't think it can be exploited by a Rowhammer attack ...)

Yeah, but I see no reason for that.
Except perhaps swap on a ramdrive, but that's too weird =)

>
> Thanks,
> Naoya Horiguchi

2015-07-21 08:47:29

by Konstantin Khlebnikov

Subject: Re: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

On Tue, Jul 21, 2015 at 11:00 AM, Naoya Horiguchi
<[email protected]> wrote:
> On Tue, Jul 14, 2015 at 06:37:39PM +0300, Konstantin Khlebnikov wrote:
>> This patch moves pmd dissection out of the reporting loop: huge pages
>> are reported as a bunch of normal pages with contiguous PFNs.
>>
>> Add the missing "FILE" bit for hugetlb vmas.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>
> With Kirill's comment about the #ifdef reflected, I'm OK with this patch.

That ifdef works mostly as documentation: "all thp magic happens here".
I'd like to keep it, if two redundant lines aren't a big deal.

>
> Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-24 17:34:23

by Mark Williamson

Subject: Re: [PATCHSET v4 0/5] pagemap: make usable for non-privileged users

Hi Konstantin,

Thank you for the further update - I tested this patchset against our
code and it allows our software to work correctly (with minor userland
changes, as before).

I'll follow up with review messages but there aren't really any
concerns that I can see.

Cheers,
Mark

On Tue, Jul 14, 2015 at 4:37 PM, Konstantin Khlebnikov
<[email protected]> wrote:
> This patchset makes pagemap usable again in a safe way (after the Rowhammer
> bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged
> users but hides PFNs from them.
>
> It also adds a 'map-exclusive' bit, which is set if the page is mapped only
> here: this helps in estimating the working set without exposing PFNs and
> allows distinguishing CoWed from non-CoWed private anonymous pages.
>
> The second patch removes the page-shift bits and completes the migration to
> the new pagemap format: the soft-dirty and mmap-exclusive flags are
> available only in the new format.
>
> Changes since v3:
> * patches reordered: cleanup now in second patch
> * update pagemap for hugetlb, add missing 'FILE' bit
> * fix PM_PFRAME_BITS: it's 55, not 54 as in previous versions
>
> ---
>
> Konstantin Khlebnikov (5):
> pagemap: check permissions and capabilities at open time
> pagemap: switch to the new format and do some cleanup
> pagemap: rework hugetlb and thp report
> pagemap: hide physical addresses from non-privileged users
> pagemap: add mmap-exclusive bit for marking pages mapped only here
>
>
> Documentation/vm/pagemap.txt | 3
> fs/proc/task_mmu.c | 267 ++++++++++++++++++------------------------
> tools/vm/page-types.c | 35 +++---
> 3 files changed, 137 insertions(+), 168 deletions(-)
>
> --
> Konstantin

2015-07-24 18:16:42

by Mark Williamson

Subject: Re: [PATCH v4 1/5] pagemap: check permissions and capabilities at open time

(within the limits of my understanding of the mm code)
Reviewed-by: Mark Williamson <[email protected]>

On Tue, Jul 21, 2015 at 9:06 AM, Naoya Horiguchi
<[email protected]> wrote:
> On Tue, Jul 14, 2015 at 06:37:35PM +0300, Konstantin Khlebnikov wrote:
>> This patch moves permission checks from pagemap_read() into pagemap_open().
>>
>> A pointer to the mm is saved in file->private_data. This reference pins only
>> the mm_struct itself; /proc/*/mem, maps, and smaps already work the same way.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
>
> Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-24 18:17:51

by Mark Williamson

Subject: Re: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

Reviewed-by: Mark Williamson <[email protected]>

On Tue, Jul 21, 2015 at 9:43 AM, Konstantin Khlebnikov <[email protected]> wrote:
> On Tue, Jul 21, 2015 at 11:00 AM, Naoya Horiguchi
> <[email protected]> wrote:
>> On Tue, Jul 14, 2015 at 06:37:39PM +0300, Konstantin Khlebnikov wrote:
>>> This patch moves pmd dissection out of the reporting loop: huge pages
>>> are reported as a bunch of normal pages with contiguous PFNs.
>>>
>>> Add the missing "FILE" bit for hugetlb vmas.
>>>
>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>>
>> With Kirill's comment about the #ifdef reflected, I'm OK with this patch.
>
> That ifdef works mostly as documentation: "all thp magic happens here".
> I'd like to keep it, if two redundant lines aren't a big deal.
>
>>
>> Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-24 18:18:21

by Mark Williamson

Subject: Re: [PATCH v4 4/5] pagemap: hide physical addresses from non-privileged users

Reviewed-by: Mark Williamson <[email protected]>

On Tue, Jul 21, 2015 at 9:39 AM, Konstantin Khlebnikov <[email protected]> wrote:
> On Tue, Jul 21, 2015 at 11:11 AM, Naoya Horiguchi
> <[email protected]> wrote:
>> On Tue, Jul 14, 2015 at 06:37:47PM +0300, Konstantin Khlebnikov wrote:
>>> This patch makes pagemap readable for normal users but hides physical
>>> addresses from them; for some use-cases the PFN isn't required at all.
>>>
>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>>> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
>>> Link: http://lkml.kernel.org/r/[email protected]
>>> ---
>>> fs/proc/task_mmu.c | 25 ++++++++++++++-----------
>>> 1 file changed, 14 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 040721fa405a..3a5d338ea219 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -937,6 +937,7 @@ typedef struct {
>>> struct pagemapread {
>>> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
>>> pagemap_entry_t *buffer;
>>> + bool show_pfn;
>>> };
>>>
>>> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
>>> @@ -1013,7 +1014,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>> struct page *page = NULL;
>>>
>>> if (pte_present(pte)) {
>>> - frame = pte_pfn(pte);
>>> + if (pm->show_pfn)
>>> + frame = pte_pfn(pte);
>>> flags |= PM_PRESENT;
>>> page = vm_normal_page(vma, addr, pte);
>>> if (pte_soft_dirty(pte))
>>
>> Don't you need the same if (pm->show_pfn) check in the is_swap_pte path, too?
>> (although I don't think it can be exploited by a Rowhammer attack ...)
>
> Yeah, but I see no reason for that.
> Except perhaps swap on a ramdrive, but that's too weird =)
>
>>
>> Thanks,
>> Naoya Horiguchi

2015-07-24 18:18:40

by Mark Williamson

Subject: Re: [PATCH v4 5/5] pagemap: add mmap-exclusive bit for marking pages mapped only here

Reviewed-by: Mark Williamson <[email protected]>

On Tue, Jul 21, 2015 at 9:17 AM, Naoya Horiguchi
<[email protected]> wrote:
> On Tue, Jul 14, 2015 at 06:37:49PM +0300, Konstantin Khlebnikov wrote:
>> This patch sets bit 56 in pagemap if the page is mapped only once.
>> It allows detecting exclusively used pages without exposing the PFN:
>>
>> present  file  exclusive  state
>>    0       0       0      non-present
>>    1       1       0      file page mapped somewhere else
>>    1       1       1      file page mapped only here
>>    1       0       0      anon non-CoWed page (shared with parent/child)
>>    1       0       1      anon CoWed page (or never forked)
>>
>> CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.
>>
>> The mmap-exclusive bit doesn't reflect potential page-sharing via the
>> swapcache: a page could be mapped once but have several swap-ptes which
>> point to it. An application can detect that via the swap bit in the pagemap
>> entry and touch the pte through /proc/pid/mem to get the real information.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> Requested-by: Mark Williamson <[email protected]>
>> Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
>
> Reviewed-by: Naoya Horiguchi <[email protected]>

2015-07-24 18:19:09

by Mark Williamson

Subject: Re: [PATCH v4 3/5] pagemap: rework hugetlb and thp report

(my review on this patch comes with the caveat that the specifics of
hugetlb / thp are a bit outside my experience)

On Fri, Jul 24, 2015 at 7:17 PM, Mark Williamson
<[email protected]> wrote:
> Reviewed-by: Mark Williamson <[email protected]>
>
> On Tue, Jul 21, 2015 at 9:43 AM, Konstantin Khlebnikov <[email protected]> wrote:
>> On Tue, Jul 21, 2015 at 11:00 AM, Naoya Horiguchi
>> <[email protected]> wrote:
>>> On Tue, Jul 14, 2015 at 06:37:39PM +0300, Konstantin Khlebnikov wrote:
>>>> This patch moves pmd dissection out of the reporting loop: huge pages
>>>> are reported as a bunch of normal pages with contiguous PFNs.
>>>>
>>>> Add the missing "FILE" bit for hugetlb vmas.
>>>>
>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>>>
>>> With Kirill's comment about the #ifdef reflected, I'm OK with this patch.
>>
>> That ifdef works mostly as documentation: "all thp magic happens here".
>> I'd like to keep it, if two redundant lines aren't a big deal.
>>
>>>
>>> Reviewed-by: Naoya Horiguchi <[email protected]>