2015-06-09 20:00:20

by Konstantin Khlebnikov

Subject: [PATCHSET v3 0/4] pagemap: make usable for non-privileged users

This patchset makes pagemap usable again in a safe way. It adds an
'mmap-exclusive' bit, which is set if the page is mapped only here, and
restores access for non-privileged users while hiding the PFN from them.

The last patch removes the page-shift bits and completes the migration to
the new pagemap format: the soft-dirty and mmap-exclusive flags are
available only in the new format.
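
For illustration, a non-privileged process could consume the new format
along these lines (a minimal sketch of ours, not part of the patchset;
it assumes the bit positions introduced by this series):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PM_MMAP_EXCLUSIVE (1ULL << 56)
#define PM_PRESENT        (1ULL << 63)

int main(void)
{
    uint64_t ent;
    long psize = sysconf(_SC_PAGESIZE);
    void *addr = &ent;                  /* any mapped address will do */
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 1;
    /* one 64-bit entry per virtual page, indexed by virtual page number */
    if (pread(fd, &ent, sizeof(ent),
              (uintptr_t)addr / psize * sizeof(ent)) != sizeof(ent))
        return 1;
    printf("present=%d exclusive=%d\n",
           !!(ent & PM_PRESENT), !!(ent & PM_MMAP_EXCLUSIVE));
    close(fd);
    return 0;
}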

v3: check permissions in ->open

---

Konstantin Khlebnikov (4):
pagemap: check permissions and capabilities at open time
pagemap: add mmap-exclusive bit for marking pages mapped only here
pagemap: hide physical addresses from non-privileged users
pagemap: switch to the new format and do some cleanup


Documentation/vm/pagemap.txt | 3 -
fs/proc/task_mmu.c | 219 +++++++++++++++++++-----------------------
tools/vm/page-types.c | 35 +++----
3 files changed, 118 insertions(+), 139 deletions(-)

--
Signature


2015-06-09 20:00:31

by Konstantin Khlebnikov

Subject: [PATCH v3 1/4] pagemap: check permissions and capabilities at open time

From: Konstantin Khlebnikov <[email protected]>

This patch moves permission checks from pagemap_read() into pagemap_open().

A pointer to the mm is saved in file->private_data. This reference pins
only the mm_struct itself. /proc/*/mem, maps, and smaps already work in
the same way.
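
Checking at open() also follows the usual UNIX model: permissions are
settled when the file is opened, so a privileged opener can later drop
privileges, or hand the descriptor to a less-privileged process, and reads
keep working. A rough userspace sketch of that pattern (our own example,
not part of the patch; uid 65534 for "nobody" is an assumption):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* run as root so that both open() and setuid() succeed */
int main(void)
{
    uint64_t ent;
    int fd = open("/proc/self/pagemap", O_RDONLY); /* checks happen here */

    if (fd < 0)
        return 1;
    if (setuid(65534))                             /* drop to "nobody" */
        return 1;
    /* still works: the mm reference and permissions were taken at open */
    return pread(fd, &ent, sizeof(ent), 0) == sizeof(ent) ? 0 : 1;
}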

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
---
fs/proc/task_mmu.c | 48 ++++++++++++++++++++++++++++--------------------
1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d..21bc251 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1227,40 +1227,33 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task = get_proc_task(file_inode(file));
- struct mm_struct *mm;
+ struct mm_struct *mm = file->private_data;
struct pagemapread pm;
- int ret = -ESRCH;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
unsigned long start_vaddr;
unsigned long end_vaddr;
- int copied = 0;
+ int ret = 0, copied = 0;

- if (!task)
+ if (!mm || !atomic_inc_not_zero(&mm->mm_users))
goto out;

ret = -EINVAL;
/* file position must be aligned */
if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
- goto out_task;
+ goto out_mm;

ret = 0;
if (!count)
- goto out_task;
+ goto out_mm;

pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
if (!pm.buffer)
- goto out_task;
-
- mm = mm_access(task, PTRACE_MODE_READ);
- ret = PTR_ERR(mm);
- if (!mm || IS_ERR(mm))
- goto out_free;
+ goto out_mm;

pagemap_walk.pmd_entry = pagemap_pte_range;
pagemap_walk.pte_hole = pagemap_pte_hole;
@@ -1273,10 +1266,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
src = *ppos;
svpfn = src / PM_ENTRY_BYTES;
start_vaddr = svpfn << PAGE_SHIFT;
- end_vaddr = TASK_SIZE_OF(task);
+ end_vaddr = mm->task_size;

/* watch out for wraparound */
- if (svpfn > TASK_SIZE_OF(task) >> PAGE_SHIFT)
+ if (svpfn > mm->task_size >> PAGE_SHIFT)
start_vaddr = end_vaddr;

/*
@@ -1303,7 +1296,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
len = min(count, PM_ENTRY_BYTES * pm.pos);
if (copy_to_user(buf, pm.buffer, len)) {
ret = -EFAULT;
- goto out_mm;
+ goto out_free;
}
copied += len;
buf += len;
@@ -1313,24 +1306,38 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!ret || ret == PM_END_OF_BUFFER)
ret = copied;

-out_mm:
- mmput(mm);
out_free:
kfree(pm.buffer);
-out_task:
- put_task_struct(task);
+out_mm:
+ mmput(mm);
out:
return ret;
}

static int pagemap_open(struct inode *inode, struct file *file)
{
+ struct mm_struct *mm;
+
/* do not disclose physical addresses: attack vector */
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
"to stop being page-shift some time soon. See the "
"linux/Documentation/vm/pagemap.txt for details.\n");
+
+ mm = proc_mem_open(inode, PTRACE_MODE_READ);
+ if (IS_ERR(mm))
+ return PTR_ERR(mm);
+ file->private_data = mm;
+ return 0;
+}
+
+static int pagemap_release(struct inode *inode, struct file *file)
+{
+ struct mm_struct *mm = file->private_data;
+
+ if (mm)
+ mmdrop(mm);
return 0;
}

@@ -1338,6 +1345,7 @@ const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
.open = pagemap_open,
+ .release = pagemap_release,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */

2015-06-09 20:00:41

by Konstantin Khlebnikov

Subject: [PATCH v3 2/4] pagemap: add mmap-exclusive bit for marking pages mapped only here

From: Konstantin Khlebnikov <[email protected]>

This patch sets bit 56 in a pagemap entry if the page is mapped only once.
This allows detecting exclusively used pages without exposing the PFN:

present  file  exclusive   state
0        0     0           non-present
1        1     0           file page mapped somewhere else
1        1     1           file page mapped only here
1        0     0           anon non-CoWed page (shared with parent/child)
1        0     1           anon CoWed page (or never forked)

CoWed pages in MAP_FILE|MAP_PRIVATE areas are anon in this context.

The mmap-exclusive bit doesn't reflect potential page sharing via the swap
cache: a page could be mapped once yet have several swap ptes pointing to
it. An application can detect that case via the swap bit in the pagemap
entry and touch the pte through /proc/pid/mem to get the real information.
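
Spelled out as code, the table above reads like this (a sketch of ours,
not from the patch; the bit positions are the ones this series assigns):

#include <stdint.h>

#define PM_MMAP_EXCLUSIVE (1ULL << 56)
#define PM_FILE           (1ULL << 61)
#define PM_PRESENT        (1ULL << 63)

/* classify one pagemap entry according to the state table above */
static const char *classify(uint64_t pme)
{
    if (!(pme & PM_PRESENT))
        return "non-present";
    if (pme & PM_FILE)
        return (pme & PM_MMAP_EXCLUSIVE) ?
            "file page mapped only here" :
            "file page mapped somewhere else";
    return (pme & PM_MMAP_EXCLUSIVE) ?
        "anon CoWed page (or never forked)" :
        "anon non-CoWed page (shared with parent/child)";
}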

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

---

v2:
* handle transparent huge pages
* invert bit and rename shared -> exclusive (less confusing name)
---
Documentation/vm/pagemap.txt | 3 ++-
fs/proc/task_mmu.c | 10 ++++++++++
tools/vm/page-types.c | 12 ++++++++++++
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc17..3cfbbb3 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,8 @@ There are three components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
- * Bits 56-60 zero
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 21bc251..b02e38f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -982,6 +982,7 @@ struct pagemapread {
#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))

#define __PM_SOFT_DIRTY (1LL)
+#define __PM_MMAP_EXCLUSIVE (2LL)
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
#define PM_FILE PM_STATUS(1LL)
@@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,

if (page && !PageAnon(page))
flags |= PM_FILE;
+ if (page && page_mapcount(page) == 1)
+ flags2 |= __PM_MMAP_EXCLUSIVE;
if ((vma->vm_flags & VM_SOFTDIRTY))
flags2 |= __PM_SOFT_DIRTY;

@@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
else
pmd_flags2 = 0;

+ if (pmd_present(*pmd)) {
+ struct page *page = pmd_page(*pmd);
+
+ if (page_mapcount(page) == 1)
+ pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+ }
+
for (; addr != end; addr += PAGE_SIZE) {
unsigned long offset;
pagemap_entry_t pme;
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 8bdf16b..3a9f193 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -70,9 +70,12 @@
#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)

#define __PM_SOFT_DIRTY (1LL)
+#define __PM_MMAP_EXCLUSIVE (2LL)
#define PM_PRESENT PM_STATUS(4LL)
#define PM_SWAP PM_STATUS(2LL)
+#define PM_FILE PM_STATUS(1LL)
#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
+#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)


/*
@@ -100,6 +103,8 @@
#define KPF_SLOB_FREE 49
#define KPF_SLUB_FROZEN 50
#define KPF_SLUB_DEBUG 51
+#define KPF_FILE 62
+#define KPF_MMAP_EXCLUSIVE 63

#define KPF_ALL_BITS ((uint64_t)~0ULL)
#define KPF_HACKERS_BITS (0xffffULL << 32)
@@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
[KPF_SLOB_FREE] = "P:slob_free",
[KPF_SLUB_FROZEN] = "A:slub_frozen",
[KPF_SLUB_DEBUG] = "E:slub_debug",
+
+ [KPF_FILE] = "F:file",
+ [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive",
};


@@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)

if (pme & PM_SOFT_DIRTY)
flags |= BIT(SOFTDIRTY);
+ if (pme & PM_FILE)
+ flags |= BIT(FILE);
+ if (pme & PM_MMAP_EXCLUSIVE)
+ flags |= BIT(MMAP_EXCLUSIVE);

return flags;
}

2015-06-09 20:00:53

by Konstantin Khlebnikov

Subject: [PATCH v3 3/4] pagemap: hide physical addresses from non-privileged users

From: Konstantin Khlebnikov <[email protected]>

This patch makes pagemap readable by normal users again, but hides the
physical addresses from them. For some use cases the PFN isn't required at
all: the flags give information about presence, page type (anon/file/swap),
the soft-dirty mark, and a hint about the page mapcount state: exclusive
(mapcount == 1) or shared (mapcount > 1).
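
For an unprivileged reader the net effect is that the PFN field (bits 0-54)
reads back as zero while the flag bits stay meaningful. A small sketch
(ours, not from the patch):

#include <stdint.h>
#include <stdio.h>

#define PM_PFRAME_MASK ((1ULL << 55) - 1)   /* bits 0-54 */
#define PM_SWAP        (1ULL << 62)
#define PM_PRESENT     (1ULL << 63)

static void report(uint64_t pme)
{
    printf("present=%d swap=%d", !!(pme & PM_PRESENT), !!(pme & PM_SWAP));
    if (pme & PM_PRESENT)
        /* zero unless the file was opened with CAP_SYS_ADMIN */
        printf(" pfn=%llu", (unsigned long long)(pme & PM_PFRAME_MASK));
    printf("\n");
}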

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
Link: http://lkml.kernel.org/r/[email protected]

---

v3: get capabilities from file
---
fs/proc/task_mmu.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b02e38f..f1b9ae8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -962,6 +962,7 @@ struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
bool v2;
+ bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
@@ -1046,12 +1047,13 @@ out:
static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame, flags;
+ u64 frame = 0, flags;
struct page *page = NULL;
int flags2 = 0;

if (pte_present(pte)) {
- frame = pte_pfn(pte);
+ if (pm->show_pfn)
+ frame = pte_pfn(pte);
flags = PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
@@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pmd_t pmd, int offset, int pmd_flags2)
{
+ u64 frame = 0;
+
/*
* Currently pmd for thp is always present because thp can not be
* swapped-out, migrated, or HWPOISONed (split in such cases instead.)
* This if-check is just to prepare for future implementation.
*/
- if (pmd_present(pmd))
- *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
- | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
- else
+ if (pmd_present(pmd)) {
+ if (pm->show_pfn)
+ frame = pmd_pfn(pmd) + offset;
+ *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+ PM_STATUS2(pm->v2, pmd_flags2));
+ } else
*pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
}
#else
@@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
pte_t pte, int offset, int flags2)
{
- if (pte_present(pte))
- *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) |
- PM_STATUS2(pm->v2, flags2) |
- PM_PRESENT);
- else
+ u64 frame = 0;
+
+ if (pte_present(pte)) {
+ if (pm->show_pfn)
+ frame = pte_pfn(pte) + offset;
+ *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
+ PM_STATUS2(pm->v2, flags2));
+ } else
*pme = make_pme(PM_NOT_PRESENT(pm->v2) |
PM_STATUS2(pm->v2, flags2));
}
@@ -1258,6 +1267,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!count)
goto out_mm;

+ /* do not disclose physical addresses: attack vector */
+ pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
@@ -1328,9 +1339,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
{
struct mm_struct *mm;

- /* do not disclose physical addresses: attack vector */
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
"to stop being page-shift some time soon. See the "
"linux/Documentation/vm/pagemap.txt for details.\n");

2015-06-09 20:01:34

by Konstantin Khlebnikov

Subject: [PATCH v3 4/4] pagemap: switch to the new format and do some cleanup

From: Konstantin Khlebnikov <[email protected]>

This patch removes the page-shift bits (scheduled for removal since 3.11)
and completes the migration to the new bit layout. It also cleans up the
messy macros.
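
For reference, GENMASK_ULL() and BIT_ULL() expand to plain shifts, so the
new layout can be mirrored in plain C as below (a sketch of ours, written
with the corrected PM_PFRAME_BITS spelling; see the nitpick later in the
thread):

#include <stdint.h>

#define PM_PFRAME_BITS    54
#define PM_PFRAME_MASK    ((1ULL << PM_PFRAME_BITS) - 1) /* GENMASK_ULL(53, 0) */
#define PM_SOFT_DIRTY     (1ULL << 55)                   /* BIT_ULL(55) */
#define PM_MMAP_EXCLUSIVE (1ULL << 56)
#define PM_FILE           (1ULL << 61)
#define PM_SWAP           (1ULL << 62)
#define PM_PRESENT        (1ULL << 63)

/* same construction as the kernel's make_pme() after this patch */
static inline uint64_t make_pme(uint64_t frame, uint64_t flags)
{
    return (frame & PM_PFRAME_MASK) | flags;
}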

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/proc/task_mmu.c | 147 ++++++++++++++++---------------------------------
tools/vm/page-types.c | 29 +++-------
2 files changed, 58 insertions(+), 118 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f1b9ae8..0e134bf 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

-/*
- * We do not want to have constant page-shift bits sitting in
- * pagemap entries and are about to reuse them some time soon.
- *
- * Here's the "migration strategy":
- * 1. when the system boots these bits remain what they are,
- * but a warning about future change is printed in log;
- * 2. once anyone clears soft-dirty bits via clear_refs file,
- * these flag is set to denote, that user is aware of the
- * new API and those page-shift bits change their meaning.
- * The respective warning is printed in dmesg;
- * 3. In a couple of releases we will remove all the mentions
- * of page-shift in pagemap entries.
- */
-
-static bool soft_dirty_cleared __read_mostly;
-
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
@@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;

- if (type == CLEAR_REFS_SOFT_DIRTY) {
- soft_dirty_cleared = true;
- pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
- " See the linux/Documentation/vm/pagemap.txt for "
- "details.\n");
- }
-
task = get_proc_task(file_inode(file));
if (!task)
return -ESRCH;
@@ -961,38 +937,26 @@ typedef struct {
struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
- bool v2;
bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
#define PAGEMAP_WALK_MASK (PMD_MASK)

-#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-/* in "new" pagemap pshift bits are occupied with more status bits */
-#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
+#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
+#define PM_PFEAME_BITS 54
+#define PM_PFRAME_MASK GENMASK_ULL(PM_PFEAME_BITS - 1, 0)
+#define PM_SOFT_DIRTY BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
+#define PM_FILE BIT_ULL(61)
+#define PM_SWAP BIT_ULL(62)
+#define PM_PRESENT BIT_ULL(63)
+
#define PM_END_OF_BUFFER 1

-static inline pagemap_entry_t make_pme(u64 val)
+static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
{
- return (pagemap_entry_t) { .pme = val };
+ return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
}

static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
@@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

while (addr < end) {
struct vm_area_struct *vma = find_vma(walk->mm, addr);
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+ pagemap_entry_t pme = make_pme(0, 0);
/* End of address space hole, which we mark as non-present. */
unsigned long hole_end;

@@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

/* Addresses in the VMA. */
if (vma->vm_flags & VM_SOFTDIRTY)
- pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
+ pme = make_pme(0, PM_SOFT_DIRTY);
for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
err = add_to_pagemap(addr, &pme, pm);
if (err)
@@ -1044,50 +1008,44 @@ out:
return err;
}

-static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
+static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame = 0, flags;
+ u64 frame = 0, flags = 0;
struct page *page = NULL;
- int flags2 = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte);
- flags = PM_PRESENT;
+ flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
entry = pte_to_swp_entry(pte);
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
- flags = PM_SWAP;
+ flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
- } else {
- if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 |= __PM_SOFT_DIRTY;
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
- return;
}

if (page && !PageAnon(page))
flags |= PM_FILE;
if (page && page_mapcount(page) == 1)
- flags2 |= __PM_MMAP_EXCLUSIVE;
- if ((vma->vm_flags & VM_SOFTDIRTY))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_MMAP_EXCLUSIVE;
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ flags |= PM_SOFT_DIRTY;

- *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
+ return make_pme(frame, flags);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
u64 frame = 0;

@@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
if (pmd_present(pmd)) {
if (pm->show_pfn)
frame = pmd_pfn(pmd) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, pmd_flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}
#else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
+ return make_pme(0, 0);
}
#endif

@@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
int err = 0;

if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- int pmd_flags2;
+ u64 flags = 0;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
- pmd_flags2 = __PM_SOFT_DIRTY;
- else
- pmd_flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

if (pmd_present(*pmd)) {
struct page *page = pmd_page(*pmd);

if (page_mapcount(page) == 1)
- pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+ flags |= PM_MMAP_EXCLUSIVE;
}

for (; addr != end; addr += PAGE_SIZE) {
@@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
+ pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

- pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+ pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pte_t pte, int offset, int flags2)
+static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
+ pte_t pte, int offset, u64 flags)
{
u64 frame = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
- PM_STATUS2(pm->v2, flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}

/* This function walks within one hugetlb entry in the single call */
@@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
int err = 0;
- int flags2;
+ u64 flags = 0;
pagemap_entry_t pme;

if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 = __PM_SOFT_DIRTY;
- else
- flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
- huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
+ pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
@@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
@@ -1269,7 +1225,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,

/* do not disclose physical addresses: attack vector */
pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
- pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1339,10 +1294,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
{
struct mm_struct *mm;

- pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
- "to stop being page-shift some time soon. See the "
- "linux/Documentation/vm/pagemap.txt for details.\n");
-
mm = proc_mem_open(inode, PTRACE_MODE_READ);
if (IS_ERR(mm))
return PTR_ERR(mm);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 3a9f193..1fa872e 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -57,26 +57,15 @@
* pagemap kernel ABI bits
*/

-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
-#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
-
+#define PM_ENTRY_BYTES 8
+#define PM_PFEAME_BITS 54
+#define PM_PFRAME_MASK ((1LL << PM_PFEAME_BITS) - 1)
+#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+#define PM_SOFT_DIRTY (1ULL << 55)
+#define PM_MMAP_EXCLUSIVE (1ULL << 56)
+#define PM_FILE (1ULL << 61)
+#define PM_SWAP (1ULL << 62)
+#define PM_PRESENT (1ULL << 63)

/*
* kernel page flags

2015-06-12 18:44:39

by Mark Williamson

Subject: Re: [PATCH v3 1/4] pagemap: check permissions and capabilities at open time

This looks good from our side - thanks!

Reviewed-by: [email protected]
Tested-by: [email protected]

On Tue, Jun 9, 2015 at 9:00 PM, Konstantin Khlebnikov <[email protected]> wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch moves permission checks from pagemap_read() into pagemap_open().
>
> A pointer to the mm is saved in file->private_data. This reference pins
> only the mm_struct itself. /proc/*/mem, maps, and smaps already work in
> the same way.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
> ---
> fs/proc/task_mmu.c | 48 ++++++++++++++++++++++++++++--------------------
> 1 file changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6dee68d..21bc251 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1227,40 +1227,33 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
> static ssize_t pagemap_read(struct file *file, char __user *buf,
> size_t count, loff_t *ppos)
> {
> - struct task_struct *task = get_proc_task(file_inode(file));
> - struct mm_struct *mm;
> + struct mm_struct *mm = file->private_data;
> struct pagemapread pm;
> - int ret = -ESRCH;
> struct mm_walk pagemap_walk = {};
> unsigned long src;
> unsigned long svpfn;
> unsigned long start_vaddr;
> unsigned long end_vaddr;
> - int copied = 0;
> + int ret = 0, copied = 0;
>
> - if (!task)
> + if (!mm || !atomic_inc_not_zero(&mm->mm_users))
> goto out;
>
> ret = -EINVAL;
> /* file position must be aligned */
> if ((*ppos % PM_ENTRY_BYTES) || (count % PM_ENTRY_BYTES))
> - goto out_task;
> + goto out_mm;
>
> ret = 0;
> if (!count)
> - goto out_task;
> + goto out_mm;
>
> pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> ret = -ENOMEM;
> if (!pm.buffer)
> - goto out_task;
> -
> - mm = mm_access(task, PTRACE_MODE_READ);
> - ret = PTR_ERR(mm);
> - if (!mm || IS_ERR(mm))
> - goto out_free;
> + goto out_mm;
>
> pagemap_walk.pmd_entry = pagemap_pte_range;
> pagemap_walk.pte_hole = pagemap_pte_hole;
> @@ -1273,10 +1266,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> src = *ppos;
> svpfn = src / PM_ENTRY_BYTES;
> start_vaddr = svpfn << PAGE_SHIFT;
> - end_vaddr = TASK_SIZE_OF(task);
> + end_vaddr = mm->task_size;
>
> /* watch out for wraparound */
> - if (svpfn > TASK_SIZE_OF(task) >> PAGE_SHIFT)
> + if (svpfn > mm->task_size >> PAGE_SHIFT)
> start_vaddr = end_vaddr;
>
> /*
> @@ -1303,7 +1296,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> len = min(count, PM_ENTRY_BYTES * pm.pos);
> if (copy_to_user(buf, pm.buffer, len)) {
> ret = -EFAULT;
> - goto out_mm;
> + goto out_free;
> }
> copied += len;
> buf += len;
> @@ -1313,24 +1306,38 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!ret || ret == PM_END_OF_BUFFER)
> ret = copied;
>
> -out_mm:
> - mmput(mm);
> out_free:
> kfree(pm.buffer);
> -out_task:
> - put_task_struct(task);
> +out_mm:
> + mmput(mm);
> out:
> return ret;
> }
>
> static int pagemap_open(struct inode *inode, struct file *file)
> {
> + struct mm_struct *mm;
> +
> /* do not disclose physical addresses: attack vector */
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
> pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
> "to stop being page-shift some time soon. See the "
> "linux/Documentation/vm/pagemap.txt for details.\n");
> +
> + mm = proc_mem_open(inode, PTRACE_MODE_READ);
> + if (IS_ERR(mm))
> + return PTR_ERR(mm);
> + file->private_data = mm;
> + return 0;
> +}
> +
> +static int pagemap_release(struct inode *inode, struct file *file)
> +{
> + struct mm_struct *mm = file->private_data;
> +
> + if (mm)
> + mmdrop(mm);
> return 0;
> }
>
> @@ -1338,6 +1345,7 @@ const struct file_operations proc_pagemap_operations = {
> .llseek = mem_lseek, /* borrow this */
> .read = pagemap_read,
> .open = pagemap_open,
> + .release = pagemap_release,
> };
> #endif /* CONFIG_PROC_PAGE_MONITOR */
>
>

2015-06-12 18:47:00

by Mark Williamson

Subject: Re: [PATCH v3 2/4] pagemap: add mmap-exclusive bit for marking pages mapped only here

This looks good from our side - thanks!

Reviewed-by: [email protected]
Tested-by: [email protected]

On Tue, Jun 9, 2015 at 9:00 PM, Konstantin Khlebnikov <[email protected]> wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch sets bit 56 in a pagemap entry if the page is mapped only once.
> This allows detecting exclusively used pages without exposing the PFN:
>
> present  file  exclusive   state
> 0        0     0           non-present
> 1        1     0           file page mapped somewhere else
> 1        1     1           file page mapped only here
> 1        0     0           anon non-CoWed page (shared with parent/child)
> 1        0     1           anon CoWed page (or never forked)
>
> CoWed pages in MAP_FILE|MAP_PRIVATE areas are anon in this context.
>
> The mmap-exclusive bit doesn't reflect potential page sharing via the swap
> cache: a page could be mapped once yet have several swap ptes pointing to
> it. An application can detect that case via the swap bit in the pagemap
> entry and touch the pte through /proc/pid/mem to get the real information.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
>
> ---
>
> v2:
> * handle transparent huge pages
> * invert bit and rename shared -> exclusive (less confusing name)
> ---
> Documentation/vm/pagemap.txt | 3 ++-
> fs/proc/task_mmu.c | 10 ++++++++++
> tools/vm/page-types.c | 12 ++++++++++++
> 3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index 6bfbc17..3cfbbb3 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -16,7 +16,8 @@ There are three components to pagemap:
> * Bits 0-4 swap type if swapped
> * Bits 5-54 swap offset if swapped
> * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> - * Bits 56-60 zero
> + * Bit 56 page exclusively mapped
> + * Bits 57-60 zero
> * Bit 61 page is file-page or shared-anon
> * Bit 62 page swapped
> * Bit 63 page present
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 21bc251..b02e38f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -982,6 +982,7 @@ struct pagemapread {
> #define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
>
> #define __PM_SOFT_DIRTY (1LL)
> +#define __PM_MMAP_EXCLUSIVE (2LL)
> #define PM_PRESENT PM_STATUS(4LL)
> #define PM_SWAP PM_STATUS(2LL)
> #define PM_FILE PM_STATUS(1LL)
> @@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
>
> if (page && !PageAnon(page))
> flags |= PM_FILE;
> + if (page && page_mapcount(page) == 1)
> + flags2 |= __PM_MMAP_EXCLUSIVE;
> if ((vma->vm_flags & VM_SOFTDIRTY))
> flags2 |= __PM_SOFT_DIRTY;
>
> @@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> else
> pmd_flags2 = 0;
>
> + if (pmd_present(*pmd)) {
> + struct page *page = pmd_page(*pmd);
> +
> + if (page_mapcount(page) == 1)
> + pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> + }
> +
> for (; addr != end; addr += PAGE_SIZE) {
> unsigned long offset;
> pagemap_entry_t pme;
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 8bdf16b..3a9f193 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -70,9 +70,12 @@
> #define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
>
> #define __PM_SOFT_DIRTY (1LL)
> +#define __PM_MMAP_EXCLUSIVE (2LL)
> #define PM_PRESENT PM_STATUS(4LL)
> #define PM_SWAP PM_STATUS(2LL)
> +#define PM_FILE PM_STATUS(1LL)
> #define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
> +#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
>
>
> /*
> @@ -100,6 +103,8 @@
> #define KPF_SLOB_FREE 49
> #define KPF_SLUB_FROZEN 50
> #define KPF_SLUB_DEBUG 51
> +#define KPF_FILE 62
> +#define KPF_MMAP_EXCLUSIVE 63
>
> #define KPF_ALL_BITS ((uint64_t)~0ULL)
> #define KPF_HACKERS_BITS (0xffffULL << 32)
> @@ -149,6 +154,9 @@ static const char * const page_flag_names[] = {
> [KPF_SLOB_FREE] = "P:slob_free",
> [KPF_SLUB_FROZEN] = "A:slub_frozen",
> [KPF_SLUB_DEBUG] = "E:slub_debug",
> +
> + [KPF_FILE] = "F:file",
> + [KPF_MMAP_EXCLUSIVE] = "1:mmap_exclusive",
> };
>
>
> @@ -452,6 +460,10 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
>
> if (pme & PM_SOFT_DIRTY)
> flags |= BIT(SOFTDIRTY);
> + if (pme & PM_FILE)
> + flags |= BIT(FILE);
> + if (pme & PM_MMAP_EXCLUSIVE)
> + flags |= BIT(MMAP_EXCLUSIVE);
>
> return flags;
> }
>

2015-06-12 18:47:10

by Mark Williamson

Subject: Re: [PATCH v3 3/4] pagemap: hide physical addresses from non-privileged users

This looks good from our side - thanks!

Reviewed-by: [email protected]
Tested-by: [email protected]

On Tue, Jun 9, 2015 at 9:00 PM, Konstantin Khlebnikov <[email protected]> wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch makes pagemap readable by normal users again, but hides the
> physical addresses from them. For some use cases the PFN isn't required at
> all: the flags give information about presence, page type (anon/file/swap),
> the soft-dirty mark, and a hint about the page mapcount state: exclusive
> (mapcount == 1) or shared (mapcount > 1).
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
> Link: http://lkml.kernel.org/r/[email protected]
>
> ---
>
> v3: get capabilities from file
> ---
> fs/proc/task_mmu.c | 36 ++++++++++++++++++++++--------------
> 1 file changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b02e38f..f1b9ae8 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -962,6 +962,7 @@ struct pagemapread {
> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
> pagemap_entry_t *buffer;
> bool v2;
> + bool show_pfn;
> };
>
> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
> @@ -1046,12 +1047,13 @@ out:
> static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> {
> - u64 frame, flags;
> + u64 frame = 0, flags;
> struct page *page = NULL;
> int flags2 = 0;
>
> if (pte_present(pte)) {
> - frame = pte_pfn(pte);
> + if (pm->show_pfn)
> + frame = pte_pfn(pte);
> flags = PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> if (pte_soft_dirty(pte))
> @@ -1087,15 +1089,19 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> pmd_t pmd, int offset, int pmd_flags2)
> {
> + u64 frame = 0;
> +
> /*
> * Currently pmd for thp is always present because thp can not be
> * swapped-out, migrated, or HWPOISONed (split in such cases instead.)
> * This if-check is just to prepare for future implementation.
> */
> - if (pmd_present(pmd))
> - *pme = make_pme(PM_PFRAME(pmd_pfn(pmd) + offset)
> - | PM_STATUS2(pm->v2, pmd_flags2) | PM_PRESENT);
> - else
> + if (pmd_present(pmd)) {
> + if (pm->show_pfn)
> + frame = pmd_pfn(pmd) + offset;
> + *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> + PM_STATUS2(pm->v2, pmd_flags2));
> + } else
> *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
> }
> #else
> @@ -1171,11 +1177,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> pte_t pte, int offset, int flags2)
> {
> - if (pte_present(pte))
> - *pme = make_pme(PM_PFRAME(pte_pfn(pte) + offset) |
> - PM_STATUS2(pm->v2, flags2) |
> - PM_PRESENT);
> - else
> + u64 frame = 0;
> +
> + if (pte_present(pte)) {
> + if (pm->show_pfn)
> + frame = pte_pfn(pte) + offset;
> + *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> + PM_STATUS2(pm->v2, flags2));
> + } else
> *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
> PM_STATUS2(pm->v2, flags2));
> }
> @@ -1258,6 +1267,8 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
> if (!count)
> goto out_mm;
>
> + /* do not disclose physical addresses: attack vector */
> + pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
> pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> @@ -1328,9 +1339,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
> {
> struct mm_struct *mm;
>
> - /* do not disclose physical addresses: attack vector */
> - if (!capable(CAP_SYS_ADMIN))
> - return -EPERM;
> pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
> "to stop being page-shift some time soon. See the "
> "linux/Documentation/vm/pagemap.txt for details.\n");
>

2015-06-12 18:49:53

by Mark Williamson

Subject: Re: [PATCH v3 4/4] pagemap: switch to the new format and do some cleanup

One tiny nitpick / typo, inline below - functionally, this looks good
from our side...

Reviewed-by: [email protected]
Tested-by: [email protected]

On Tue, Jun 9, 2015 at 9:00 PM, Konstantin Khlebnikov <[email protected]> wrote:
> From: Konstantin Khlebnikov <[email protected]>

<...snip...>

> -#define __PM_SOFT_DIRTY (1LL)
> -#define __PM_MMAP_EXCLUSIVE (2LL)
> -#define PM_PRESENT PM_STATUS(4LL)
> -#define PM_SWAP PM_STATUS(2LL)
> -#define PM_FILE PM_STATUS(1LL)
> -#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
> +#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
> +#define PM_PFEAME_BITS 54
> +#define PM_PFRAME_MASK GENMASK_ULL(PM_PFEAME_BITS - 1, 0)

s/PM_PFEAME_BITS/PM_PFRAME_BITS/ I presume?

> +#define PM_SOFT_DIRTY BIT_ULL(55)
> +#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
> +#define PM_FILE BIT_ULL(61)
> +#define PM_SWAP BIT_ULL(62)
> +#define PM_PRESENT BIT_ULL(63)
> +
> #define PM_END_OF_BUFFER 1
>
> -static inline pagemap_entry_t make_pme(u64 val)
> +static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
> {
> - return (pagemap_entry_t) { .pme = val };
> + return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
> }
>
> static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
> @@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>
> while (addr < end) {
> struct vm_area_struct *vma = find_vma(walk->mm, addr);
> - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
> + pagemap_entry_t pme = make_pme(0, 0);
> /* End of address space hole, which we mark as non-present. */
> unsigned long hole_end;
>
> @@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>
> /* Addresses in the VMA. */
> if (vma->vm_flags & VM_SOFTDIRTY)
> - pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
> + pme = make_pme(0, PM_SOFT_DIRTY);
> for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> @@ -1044,50 +1008,44 @@ out:
> return err;
> }
>
> -static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> +static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> {
> - u64 frame = 0, flags;
> + u64 frame = 0, flags = 0;
> struct page *page = NULL;
> - int flags2 = 0;
>
> if (pte_present(pte)) {
> if (pm->show_pfn)
> frame = pte_pfn(pte);
> - flags = PM_PRESENT;
> + flags |= PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> if (pte_soft_dirty(pte))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_SOFT_DIRTY;
> } else if (is_swap_pte(pte)) {
> swp_entry_t entry;
> if (pte_swp_soft_dirty(pte))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_SOFT_DIRTY;
> entry = pte_to_swp_entry(pte);
> frame = swp_type(entry) |
> (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> - flags = PM_SWAP;
> + flags |= PM_SWAP;
> if (is_migration_entry(entry))
> page = migration_entry_to_page(entry);
> - } else {
> - if (vma->vm_flags & VM_SOFTDIRTY)
> - flags2 |= __PM_SOFT_DIRTY;
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
> - return;
> }
>
> if (page && !PageAnon(page))
> flags |= PM_FILE;
> if (page && page_mapcount(page) == 1)
> - flags2 |= __PM_MMAP_EXCLUSIVE;
> - if ((vma->vm_flags & VM_SOFTDIRTY))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_MMAP_EXCLUSIVE;
> + if (vma->vm_flags & VM_SOFTDIRTY)
> + flags |= PM_SOFT_DIRTY;
>
> - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
> + return make_pme(frame, flags);
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pmd_t pmd, int offset, int pmd_flags2)
> +static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
> + pmd_t pmd, int offset, u64 flags)
> {
> u64 frame = 0;
>
> @@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
> if (pmd_present(pmd)) {
> if (pm->show_pfn)
> frame = pmd_pfn(pmd) + offset;
> - *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> - PM_STATUS2(pm->v2, pmd_flags2));
> - } else
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
> + flags |= PM_PRESENT;
> + }
> +
> + return make_pme(frame, flags);
> }
> #else
> -static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pmd_t pmd, int offset, int pmd_flags2)
> +static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
> + pmd_t pmd, int offset, u64 flags)
> {
> + return make_pme(0, 0);
> }
> #endif
>
> @@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> int err = 0;
>
> if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> - int pmd_flags2;
> + u64 flags = 0;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
> - pmd_flags2 = __PM_SOFT_DIRTY;
> - else
> - pmd_flags2 = 0;
> + flags |= PM_SOFT_DIRTY;
>
> if (pmd_present(*pmd)) {
> struct page *page = pmd_page(*pmd);
>
> if (page_mapcount(page) == 1)
> - pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> + flags |= PM_MMAP_EXCLUSIVE;
> }
>
> for (; addr != end; addr += PAGE_SIZE) {
> @@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> offset = (addr & ~PAGEMAP_WALK_MASK) >>
> PAGE_SHIFT;
> - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
> + pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> break;
> @@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> for (; addr < end; pte++, addr += PAGE_SIZE) {
> pagemap_entry_t pme;
>
> - pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
> + pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> break;
> @@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> }
>
> #ifdef CONFIG_HUGETLB_PAGE
> -static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pte_t pte, int offset, int flags2)
> +static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
> + pte_t pte, int offset, u64 flags)
> {
> u64 frame = 0;
>
> if (pte_present(pte)) {
> if (pm->show_pfn)
> frame = pte_pfn(pte) + offset;
> - *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> - PM_STATUS2(pm->v2, flags2));
> - } else
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
> - PM_STATUS2(pm->v2, flags2));
> + flags |= PM_PRESENT;
> + }
> +
> + return make_pme(frame, flags);
> }
>
> /* This function walks within one hugetlb entry in the single call */
> @@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
> struct pagemapread *pm = walk->private;
> struct vm_area_struct *vma = walk->vma;
> int err = 0;
> - int flags2;
> + u64 flags = 0;
> pagemap_entry_t pme;
>
> if (vma->vm_flags & VM_SOFTDIRTY)
> - flags2 = __PM_SOFT_DIRTY;
> - else
> - flags2 = 0;
> + flags |= PM_SOFT_DIRTY;
>
> for (; addr != end; addr += PAGE_SIZE) {
> int offset = (addr & ~hmask) >> PAGE_SHIFT;
> - huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
> + pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> return err;
> @@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
> * Bits 0-54 page frame number (PFN) if present
> * Bits 0-4 swap type if swapped
> * Bits 5-54 swap offset if swapped
> - * Bits 55-60 page shift (page size = 1<<page shift)
> + * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> + * Bit 56 page exclusively mapped
> + * Bits 57-60 zero
> * Bit 61 page is file-page or shared-anon
> * Bit 62 page swapped
> * Bit 63 page present
> @@ -1269,7 +1225,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>
> /* do not disclose physical addresses: attack vector */
> pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
> - pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> ret = -ENOMEM;
> @@ -1339,10 +1294,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
> {
> struct mm_struct *mm;
>
> - pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
> - "to stop being page-shift some time soon. See the "
> - "linux/Documentation/vm/pagemap.txt for details.\n");
> -
> mm = proc_mem_open(inode, PTRACE_MODE_READ);
> if (IS_ERR(mm))
> return PTR_ERR(mm);
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 3a9f193..1fa872e 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -57,26 +57,15 @@
> * pagemap kernel ABI bits
> */
>
> -#define PM_ENTRY_BYTES sizeof(uint64_t)
> -#define PM_STATUS_BITS 3
> -#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
> -#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
> -#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
> -#define PM_PSHIFT_BITS 6
> -#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
> -#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
> -#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
> -#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
> -#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
> -
> -#define __PM_SOFT_DIRTY (1LL)
> -#define __PM_MMAP_EXCLUSIVE (2LL)
> -#define PM_PRESENT PM_STATUS(4LL)
> -#define PM_SWAP PM_STATUS(2LL)
> -#define PM_FILE PM_STATUS(1LL)
> -#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
> -#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
> -
> +#define PM_ENTRY_BYTES 8
> +#define PM_PFEAME_BITS 54
> +#define PM_PFRAME_MASK ((1LL << PM_PFEAME_BITS) - 1)
> +#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
> +#define PM_SOFT_DIRTY (1ULL << 55)
> +#define PM_MMAP_EXCLUSIVE (1ULL << 56)
> +#define PM_FILE (1ULL << 61)
> +#define PM_SWAP (1ULL << 62)
> +#define PM_PRESENT (1ULL << 63)
>
> /*
> * kernel page flags
>

2015-06-12 18:59:35

by Mark Williamson

Subject: Re: [PATCHSET v3 0/4] pagemap: make usable for non-privileged users

Hi Konstantin,

Thanks very much for your help on this.

From our side, I've tested our application against a patched kernel
and I confirm that the functionality can replace what we lost when
PFNs were removed from /proc/PID/pagemap. This addresses the
functionality regression from our PoV (just requires minor userspace
changes on our part, which is fine).

I also reviewed the patch content and everything seemed good to me.

We're keen to see these get into mainline, so let us know if there's
anything we can do to help.

Cheers,
Mark

On Tue, Jun 9, 2015 at 9:00 PM, Konstantin Khlebnikov <[email protected]> wrote:
> This patchset makes pagemap usable again in a safe way. It adds an
> 'mmap-exclusive' bit, which is set if the page is mapped only here, and
> restores access for non-privileged users while hiding the PFN from them.
>
> The last patch removes the page-shift bits and completes the migration to
> the new pagemap format: the soft-dirty and mmap-exclusive flags are
> available only in the new format.
>
> v3: check permissions in ->open
>
> ---
>
> Konstantin Khlebnikov (4):
> pagemap: check permissions and capabilities at open time
> pagemap: add mmap-exclusive bit for marking pages mapped only here
> pagemap: hide physical addresses from non-privileged users
> pagemap: switch to the new format and do some cleanup
>
>
> Documentation/vm/pagemap.txt | 3 -
> fs/proc/task_mmu.c | 219 +++++++++++++++++++-----------------------
> tools/vm/page-types.c | 35 +++----
> 3 files changed, 118 insertions(+), 139 deletions(-)
>
> --
> Signature

2015-06-15 05:57:04

by Konstantin Khlebnikov

Subject: [PATCH v4] pagemap: switch to the new format and do some cleanup

From: Konstantin Khlebnikov <[email protected]>

This patch removes the page-shift bits (scheduled for removal since 3.11)
and completes the migration to the new bit layout. It also cleans up the
messy macros.

Signed-off-by: Konstantin Khlebnikov <[email protected]>

---

v4: fix misprint PM_PFEAME_BITS -> PM_PFRAME_BITS
---
fs/proc/task_mmu.c | 147 ++++++++++++++++---------------------------------
tools/vm/page-types.c | 29 +++-------
2 files changed, 58 insertions(+), 118 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f1b9ae8..99fa2ae 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
.release = proc_map_release,
};

-/*
- * We do not want to have constant page-shift bits sitting in
- * pagemap entries and are about to reuse them some time soon.
- *
- * Here's the "migration strategy":
- * 1. when the system boots these bits remain what they are,
- * but a warning about future change is printed in log;
- * 2. once anyone clears soft-dirty bits via clear_refs file,
- * these flag is set to denote, that user is aware of the
- * new API and those page-shift bits change their meaning.
- * The respective warning is printed in dmesg;
- * 3. In a couple of releases we will remove all the mentions
- * of page-shift in pagemap entries.
- */
-
-static bool soft_dirty_cleared __read_mostly;
-
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
@@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
return -EINVAL;

- if (type == CLEAR_REFS_SOFT_DIRTY) {
- soft_dirty_cleared = true;
- pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
- " See the linux/Documentation/vm/pagemap.txt for "
- "details.\n");
- }
-
task = get_proc_task(file_inode(file));
if (!task)
return -ESRCH;
@@ -961,38 +937,26 @@ typedef struct {
struct pagemapread {
int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
pagemap_entry_t *buffer;
- bool v2;
bool show_pfn;
};

#define PAGEMAP_WALK_SIZE (PMD_SIZE)
#define PAGEMAP_WALK_MASK (PMD_MASK)

-#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-/* in "new" pagemap pshift bits are occupied with more status bits */
-#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
+#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
+#define PM_PFRAME_BITS 54
+#define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
+#define PM_SOFT_DIRTY BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
+#define PM_FILE BIT_ULL(61)
+#define PM_SWAP BIT_ULL(62)
+#define PM_PRESENT BIT_ULL(63)
+
#define PM_END_OF_BUFFER 1

-static inline pagemap_entry_t make_pme(u64 val)
+static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
{
- return (pagemap_entry_t) { .pme = val };
+ return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
}

static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
@@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

while (addr < end) {
struct vm_area_struct *vma = find_vma(walk->mm, addr);
- pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+ pagemap_entry_t pme = make_pme(0, 0);
/* End of address space hole, which we mark as non-present. */
unsigned long hole_end;

@@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,

/* Addresses in the VMA. */
if (vma->vm_flags & VM_SOFTDIRTY)
- pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
+ pme = make_pme(0, PM_SOFT_DIRTY);
for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
err = add_to_pagemap(addr, &pme, pm);
if (err)
@@ -1044,50 +1008,44 @@ out:
return err;
}

-static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
+static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- u64 frame = 0, flags;
+ u64 frame = 0, flags = 0;
struct page *page = NULL;
- int flags2 = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte);
- flags = PM_PRESENT;
+ flags |= PM_PRESENT;
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
} else if (is_swap_pte(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_SOFT_DIRTY;
entry = pte_to_swp_entry(pte);
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
- flags = PM_SWAP;
+ flags |= PM_SWAP;
if (is_migration_entry(entry))
page = migration_entry_to_page(entry);
- } else {
- if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 |= __PM_SOFT_DIRTY;
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
- return;
}

if (page && !PageAnon(page))
flags |= PM_FILE;
if (page && page_mapcount(page) == 1)
- flags2 |= __PM_MMAP_EXCLUSIVE;
- if ((vma->vm_flags & VM_SOFTDIRTY))
- flags2 |= __PM_SOFT_DIRTY;
+ flags |= PM_MMAP_EXCLUSIVE;
+ if (vma->vm_flags & VM_SOFTDIRTY)
+ flags |= PM_SOFT_DIRTY;

- *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
+ return make_pme(frame, flags);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
u64 frame = 0;

@@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
if (pmd_present(pmd)) {
if (pm->show_pfn)
frame = pmd_pfn(pmd) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, pmd_flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}
#else
-static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pmd_t pmd, int offset, int pmd_flags2)
+static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
+ pmd_t pmd, int offset, u64 flags)
{
+ return make_pme(0, 0);
}
#endif

@@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
int err = 0;

if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- int pmd_flags2;
+ u64 flags = 0;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
- pmd_flags2 = __PM_SOFT_DIRTY;
- else
- pmd_flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

if (pmd_present(*pmd)) {
struct page *page = pmd_page(*pmd);

if (page_mapcount(page) == 1)
- pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
+ flags |= PM_MMAP_EXCLUSIVE;
}

for (; addr != end; addr += PAGE_SIZE) {
@@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,

offset = (addr & ~PAGEMAP_WALK_MASK) >>
PAGE_SHIFT;
- thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
+ pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

- pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+ pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
err = add_to_pagemap(addr, &pme, pm);
if (err)
break;
@@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
- pte_t pte, int offset, int flags2)
+static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
+ pte_t pte, int offset, u64 flags)
{
u64 frame = 0;

if (pte_present(pte)) {
if (pm->show_pfn)
frame = pte_pfn(pte) + offset;
- *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
- PM_STATUS2(pm->v2, flags2));
- } else
- *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
- PM_STATUS2(pm->v2, flags2));
+ flags |= PM_PRESENT;
+ }
+
+ return make_pme(frame, flags);
}

/* This function walks within one hugetlb entry in the single call */
@@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
int err = 0;
- int flags2;
+ u64 flags = 0;
pagemap_entry_t pme;

if (vma->vm_flags & VM_SOFTDIRTY)
- flags2 = __PM_SOFT_DIRTY;
- else
- flags2 = 0;
+ flags |= PM_SOFT_DIRTY;

for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
- huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
+ pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
err = add_to_pagemap(addr, &pme, pm);
if (err)
return err;
@@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
* Bits 0-54 page frame number (PFN) if present
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
- * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
+ * Bit 56 page exclusively mapped
+ * Bits 57-60 zero
* Bit 61 page is file-page or shared-anon
* Bit 62 page swapped
* Bit 63 page present
@@ -1269,7 +1225,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,

/* do not disclose physical addresses: attack vector */
pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
- pm.v2 = soft_dirty_cleared;
pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
ret = -ENOMEM;
@@ -1339,10 +1294,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
{
struct mm_struct *mm;

- pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
- "to stop being page-shift some time soon. See the "
- "linux/Documentation/vm/pagemap.txt for details.\n");
-
mm = proc_mem_open(inode, PTRACE_MODE_READ);
if (IS_ERR(mm))
return PTR_ERR(mm);
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 3a9f193..e1d5ff8 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -57,26 +57,15 @@
* pagemap kernel ABI bits
*/

-#define PM_ENTRY_BYTES sizeof(uint64_t)
-#define PM_STATUS_BITS 3
-#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
-#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
-#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
-#define PM_PSHIFT_BITS 6
-#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
-#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
-#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
-#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
-#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
-
-#define __PM_SOFT_DIRTY (1LL)
-#define __PM_MMAP_EXCLUSIVE (2LL)
-#define PM_PRESENT PM_STATUS(4LL)
-#define PM_SWAP PM_STATUS(2LL)
-#define PM_FILE PM_STATUS(1LL)
-#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
-#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
-
+#define PM_ENTRY_BYTES 8
+#define PM_PFRAME_BITS 54
+#define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1)
+#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+#define PM_SOFT_DIRTY (1ULL << 55)
+#define PM_MMAP_EXCLUSIVE (1ULL << 56)
+#define PM_FILE (1ULL << 61)
+#define PM_SWAP (1ULL << 62)
+#define PM_PRESENT (1ULL << 63)

/*
* kernel page flags

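For reference, decoding an entry in the new flat format from userspace looks
roughly like the sketch below. It is illustrative only and not part of the
patchset: the file name and argument handling are invented for the example,
the bit positions mirror the defines above, and once the whole series is
applied the frame field reads back as zero for readers without CAP_SYS_ADMIN.

/* pagemap-peek.c: print the flags of one /proc/<pid>/pagemap entry.
 * Minimal sketch of the new format; error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define PM_PFRAME_MASK	((1ULL << 54) - 1)	/* low 54 bits, as in PM_PFRAME_BITS above */
#define PM_SOFT_DIRTY	(1ULL << 55)
#define PM_MMAP_EXCLUSIVE	(1ULL << 56)
#define PM_FILE		(1ULL << 61)
#define PM_SWAP		(1ULL << 62)
#define PM_PRESENT	(1ULL << 63)

int main(int argc, char **argv)
{
	unsigned long vaddr;
	uint64_t pme;
	char path[64];
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <hex-vaddr>\n", argv[0]);
		return 1;
	}
	vaddr = strtoul(argv[2], NULL, 16);
	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return 1;
	/* one 8-byte entry per virtual page */
	if (pread(fd, &pme, sizeof(pme),
		  (off_t)(vaddr / sysconf(_SC_PAGESIZE)) * 8) != sizeof(pme)) {
		close(fd);
		return 1;
	}
	printf("present=%d swap=%d file/shared=%d exclusive=%d soft-dirty=%d frame=%llu\n",
	       !!(pme & PM_PRESENT), !!(pme & PM_SWAP), !!(pme & PM_FILE),
	       !!(pme & PM_MMAP_EXCLUSIVE), !!(pme & PM_SOFT_DIRTY),
	       (unsigned long long)(pme & PM_PFRAME_MASK));
	close(fd);
	return 0;
}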
2015-06-15 14:57:16

by Mark Williamson

Subject: Re: [PATCH v4] pagemap: switch to the new format and do some cleanup

Thanks! No outstanding issues with the patchset from our side.

Reviewed-by: [email protected]

On Mon, Jun 15, 2015 at 6:56 AM, Konstantin Khlebnikov <[email protected]> wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch removes page-shift bits (scheduled for removal since 3.11) and
> completes the migration to the new bit layout. It also cleans up the messy macros.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>
> ---
>
> v4: fix misprint PM_PFEAME_BITS -> PM_PFRAME_BITS
> ---
> fs/proc/task_mmu.c | 147 ++++++++++++++++---------------------------------
> tools/vm/page-types.c | 29 +++-------
> 2 files changed, 58 insertions(+), 118 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f1b9ae8..99fa2ae 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -710,23 +710,6 @@ const struct file_operations proc_tid_smaps_operations = {
> .release = proc_map_release,
> };
>
> -/*
> - * We do not want to have constant page-shift bits sitting in
> - * pagemap entries and are about to reuse them some time soon.
> - *
> - * Here's the "migration strategy":
> - * 1. when the system boots these bits remain what they are,
> - * but a warning about future change is printed in log;
> - * 2. once anyone clears soft-dirty bits via clear_refs file,
> - * these flag is set to denote, that user is aware of the
> - * new API and those page-shift bits change their meaning.
> - * The respective warning is printed in dmesg;
> - * 3. In a couple of releases we will remove all the mentions
> - * of page-shift in pagemap entries.
> - */
> -
> -static bool soft_dirty_cleared __read_mostly;
> -
> enum clear_refs_types {
> CLEAR_REFS_ALL = 1,
> CLEAR_REFS_ANON,
> @@ -887,13 +870,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
> return -EINVAL;
>
> - if (type == CLEAR_REFS_SOFT_DIRTY) {
> - soft_dirty_cleared = true;
> - pr_warn_once("The pagemap bits 55-60 has changed their meaning!"
> - " See the linux/Documentation/vm/pagemap.txt for "
> - "details.\n");
> - }
> -
> task = get_proc_task(file_inode(file));
> if (!task)
> return -ESRCH;
> @@ -961,38 +937,26 @@ typedef struct {
> struct pagemapread {
> int pos, len; /* units: PM_ENTRY_BYTES, not bytes */
> pagemap_entry_t *buffer;
> - bool v2;
> bool show_pfn;
> };
>
> #define PAGEMAP_WALK_SIZE (PMD_SIZE)
> #define PAGEMAP_WALK_MASK (PMD_MASK)
>
> -#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
> -#define PM_STATUS_BITS 3
> -#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
> -#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
> -#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
> -#define PM_PSHIFT_BITS 6
> -#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
> -#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
> -#define __PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
> -#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
> -#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
> -/* in "new" pagemap pshift bits are occupied with more status bits */
> -#define PM_STATUS2(v2, x) (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
> -
> -#define __PM_SOFT_DIRTY (1LL)
> -#define __PM_MMAP_EXCLUSIVE (2LL)
> -#define PM_PRESENT PM_STATUS(4LL)
> -#define PM_SWAP PM_STATUS(2LL)
> -#define PM_FILE PM_STATUS(1LL)
> -#define PM_NOT_PRESENT(v2) PM_STATUS2(v2, 0)
> +#define PM_ENTRY_BYTES sizeof(pagemap_entry_t)
> +#define PM_PFRAME_BITS 54
> +#define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
> +#define PM_SOFT_DIRTY BIT_ULL(55)
> +#define PM_MMAP_EXCLUSIVE BIT_ULL(56)
> +#define PM_FILE BIT_ULL(61)
> +#define PM_SWAP BIT_ULL(62)
> +#define PM_PRESENT BIT_ULL(63)
> +
> #define PM_END_OF_BUFFER 1
>
> -static inline pagemap_entry_t make_pme(u64 val)
> +static inline pagemap_entry_t make_pme(u64 frame, u64 flags)
> {
> - return (pagemap_entry_t) { .pme = val };
> + return (pagemap_entry_t) { .pme = (frame & PM_PFRAME_MASK) | flags };
> }
>
> static int add_to_pagemap(unsigned long addr, pagemap_entry_t *pme,
> @@ -1013,7 +977,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>
> while (addr < end) {
> struct vm_area_struct *vma = find_vma(walk->mm, addr);
> - pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
> + pagemap_entry_t pme = make_pme(0, 0);
> /* End of address space hole, which we mark as non-present. */
> unsigned long hole_end;
>
> @@ -1033,7 +997,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>
> /* Addresses in the VMA. */
> if (vma->vm_flags & VM_SOFTDIRTY)
> - pme.pme |= PM_STATUS2(pm->v2, __PM_SOFT_DIRTY);
> + pme = make_pme(0, PM_SOFT_DIRTY);
> for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> @@ -1044,50 +1008,44 @@ out:
> return err;
> }
>
> -static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> +static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> {
> - u64 frame = 0, flags;
> + u64 frame = 0, flags = 0;
> struct page *page = NULL;
> - int flags2 = 0;
>
> if (pte_present(pte)) {
> if (pm->show_pfn)
> frame = pte_pfn(pte);
> - flags = PM_PRESENT;
> + flags |= PM_PRESENT;
> page = vm_normal_page(vma, addr, pte);
> if (pte_soft_dirty(pte))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_SOFT_DIRTY;
> } else if (is_swap_pte(pte)) {
> swp_entry_t entry;
> if (pte_swp_soft_dirty(pte))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_SOFT_DIRTY;
> entry = pte_to_swp_entry(pte);
> frame = swp_type(entry) |
> (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> - flags = PM_SWAP;
> + flags |= PM_SWAP;
> if (is_migration_entry(entry))
> page = migration_entry_to_page(entry);
> - } else {
> - if (vma->vm_flags & VM_SOFTDIRTY)
> - flags2 |= __PM_SOFT_DIRTY;
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
> - return;
> }
>
> if (page && !PageAnon(page))
> flags |= PM_FILE;
> if (page && page_mapcount(page) == 1)
> - flags2 |= __PM_MMAP_EXCLUSIVE;
> - if ((vma->vm_flags & VM_SOFTDIRTY))
> - flags2 |= __PM_SOFT_DIRTY;
> + flags |= PM_MMAP_EXCLUSIVE;
> + if (vma->vm_flags & VM_SOFTDIRTY)
> + flags |= PM_SOFT_DIRTY;
>
> - *pme = make_pme(PM_PFRAME(frame) | PM_STATUS2(pm->v2, flags2) | flags);
> + return make_pme(frame, flags);
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pmd_t pmd, int offset, int pmd_flags2)
> +static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
> + pmd_t pmd, int offset, u64 flags)
> {
> u64 frame = 0;
>
> @@ -1099,15 +1057,16 @@ static void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *p
> if (pmd_present(pmd)) {
> if (pm->show_pfn)
> frame = pmd_pfn(pmd) + offset;
> - *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> - PM_STATUS2(pm->v2, pmd_flags2));
> - } else
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, pmd_flags2));
> + flags |= PM_PRESENT;
> + }
> +
> + return make_pme(frame, flags);
> }
> #else
> -static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pmd_t pmd, int offset, int pmd_flags2)
> +static pagemap_entry_t thp_pmd_to_pagemap_entry(struct pagemapread *pm,
> + pmd_t pmd, int offset, u64 flags)
> {
> + return make_pme(0, 0);
> }
> #endif
>
> @@ -1121,18 +1080,16 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> int err = 0;
>
> if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> - int pmd_flags2;
> + u64 flags = 0;
>
> if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
> - pmd_flags2 = __PM_SOFT_DIRTY;
> - else
> - pmd_flags2 = 0;
> + flags |= PM_SOFT_DIRTY;
>
> if (pmd_present(*pmd)) {
> struct page *page = pmd_page(*pmd);
>
> if (page_mapcount(page) == 1)
> - pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> + flags |= PM_MMAP_EXCLUSIVE;
> }
>
> for (; addr != end; addr += PAGE_SIZE) {
> @@ -1141,7 +1098,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> offset = (addr & ~PAGEMAP_WALK_MASK) >>
> PAGE_SHIFT;
> - thp_pmd_to_pagemap_entry(&pme, pm, *pmd, offset, pmd_flags2);
> + pme = thp_pmd_to_pagemap_entry(pm, *pmd, offset, flags);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> break;
> @@ -1161,7 +1118,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> for (; addr < end; pte++, addr += PAGE_SIZE) {
> pagemap_entry_t pme;
>
> - pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
> + pme = pte_to_pagemap_entry(pm, vma, addr, *pte);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> break;
> @@ -1174,19 +1131,18 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> }
>
> #ifdef CONFIG_HUGETLB_PAGE
> -static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
> - pte_t pte, int offset, int flags2)
> +static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
> + pte_t pte, int offset, u64 flags)
> {
> u64 frame = 0;
>
> if (pte_present(pte)) {
> if (pm->show_pfn)
> frame = pte_pfn(pte) + offset;
> - *pme = make_pme(PM_PFRAME(frame) | PM_PRESENT |
> - PM_STATUS2(pm->v2, flags2));
> - } else
> - *pme = make_pme(PM_NOT_PRESENT(pm->v2) |
> - PM_STATUS2(pm->v2, flags2));
> + flags |= PM_PRESENT;
> + }
> +
> + return make_pme(frame, flags);
> }
>
> /* This function walks within one hugetlb entry in the single call */
> @@ -1197,17 +1153,15 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
> struct pagemapread *pm = walk->private;
> struct vm_area_struct *vma = walk->vma;
> int err = 0;
> - int flags2;
> + u64 flags = 0;
> pagemap_entry_t pme;
>
> if (vma->vm_flags & VM_SOFTDIRTY)
> - flags2 = __PM_SOFT_DIRTY;
> - else
> - flags2 = 0;
> + flags |= PM_SOFT_DIRTY;
>
> for (; addr != end; addr += PAGE_SIZE) {
> int offset = (addr & ~hmask) >> PAGE_SHIFT;
> - huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
> + pme = huge_pte_to_pagemap_entry(pm, *pte, offset, flags);
> err = add_to_pagemap(addr, &pme, pm);
> if (err)
> return err;
> @@ -1228,7 +1182,9 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
> * Bits 0-54 page frame number (PFN) if present
> * Bits 0-4 swap type if swapped
> * Bits 5-54 swap offset if swapped
> - * Bits 55-60 page shift (page size = 1<<page shift)
> + * Bit 55 pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
> + * Bit 56 page exclusively mapped
> + * Bits 57-60 zero
> * Bit 61 page is file-page or shared-anon
> * Bit 62 page swapped
> * Bit 63 page present
> @@ -1269,7 +1225,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
>
> /* do not disclose physical addresses: attack vector */
> pm.show_pfn = file_ns_capable(file, &init_user_ns, CAP_SYS_ADMIN);
> - pm.v2 = soft_dirty_cleared;
> pm.len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> pm.buffer = kmalloc(pm.len * PM_ENTRY_BYTES, GFP_TEMPORARY);
> ret = -ENOMEM;
> @@ -1339,10 +1294,6 @@ static int pagemap_open(struct inode *inode, struct file *file)
> {
> struct mm_struct *mm;
>
> - pr_warn_once("Bits 55-60 of /proc/PID/pagemap entries are about "
> - "to stop being page-shift some time soon. See the "
> - "linux/Documentation/vm/pagemap.txt for details.\n");
> -
> mm = proc_mem_open(inode, PTRACE_MODE_READ);
> if (IS_ERR(mm))
> return PTR_ERR(mm);
> diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
> index 3a9f193..e1d5ff8 100644
> --- a/tools/vm/page-types.c
> +++ b/tools/vm/page-types.c
> @@ -57,26 +57,15 @@
> * pagemap kernel ABI bits
> */
>
> -#define PM_ENTRY_BYTES sizeof(uint64_t)
> -#define PM_STATUS_BITS 3
> -#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
> -#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
> -#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
> -#define PM_PSHIFT_BITS 6
> -#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
> -#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
> -#define __PM_PSHIFT(x) (((uint64_t) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
> -#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
> -#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
> -
> -#define __PM_SOFT_DIRTY (1LL)
> -#define __PM_MMAP_EXCLUSIVE (2LL)
> -#define PM_PRESENT PM_STATUS(4LL)
> -#define PM_SWAP PM_STATUS(2LL)
> -#define PM_FILE PM_STATUS(1LL)
> -#define PM_SOFT_DIRTY __PM_PSHIFT(__PM_SOFT_DIRTY)
> -#define PM_MMAP_EXCLUSIVE __PM_PSHIFT(__PM_MMAP_EXCLUSIVE)
> -
> +#define PM_ENTRY_BYTES 8
> +#define PM_PFRAME_BITS 54
> +#define PM_PFRAME_MASK ((1LL << PM_PFRAME_BITS) - 1)
> +#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
> +#define PM_SOFT_DIRTY (1ULL << 55)
> +#define PM_MMAP_EXCLUSIVE (1ULL << 56)
> +#define PM_FILE (1ULL << 61)
> +#define PM_SWAP (1ULL << 62)
> +#define PM_PRESENT (1ULL << 63)
>
> /*
> * kernel page flags
>

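One note on the helpers the v4 macros switch to: GENMASK_ULL() and BIT_ULL()
come from the kernel's bitops headers, which userspace cannot include, and
that is presumably why tools/vm/page-types.c keeps open-coded constants while
task_mmu.c does not. The compile-time check below is illustrative only; the
macro bodies are written out locally to mirror the kernel's definitions and
confirm the two spellings produce the same constants:

#define BIT_ULL(nr)		(1ULL << (nr))
#define GENMASK_ULL(h, l)	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* PM_PFRAME_MASK: GENMASK_ULL(PM_PFRAME_BITS - 1, 0) with PM_PFRAME_BITS = 54 */
_Static_assert(GENMASK_ULL(53, 0) == (1ULL << 54) - 1,
	       "kernel and page-types.c masks match");
/* PM_MMAP_EXCLUSIVE: BIT_ULL(56) */
_Static_assert(BIT_ULL(56) == 1ULL << 56, "same bit either way");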
2015-06-16 21:29:49

by Andrew Morton

Subject: Re: [PATCH v4] pagemap: switch to the new format and do some cleanup

On Mon, 15 Jun 2015 08:56:49 +0300 Konstantin Khlebnikov <[email protected]> wrote:

> This patch removes page-shift bits (scheduled for removal since 3.11) and
> completes the migration to the new bit layout. It also cleans up the messy macros.

hm, I can't find any kernel version to which this patch comes close to
applying.

2015-06-17 04:59:41

by Konstantin Khlebnikov

Subject: Re: [PATCH v4] pagemap: switch to the new format and do some cleanup

On Wed, Jun 17, 2015 at 12:29 AM, Andrew Morton
<[email protected]> wrote:
> On Mon, 15 Jun 2015 08:56:49 +0300 Konstantin Khlebnikov <[email protected]> wrote:
>
>> This patch removes page-shift bits (scheduled for removal since 3.11) and
>> completes the migration to the new bit layout. It also cleans up the messy macros.
>
> hm, I can't find any kernel version to which this patch comes close to
> applying.

This patchset applies to 4.1-rc8 and current mmotm without problems.
I guess you've tried to pick this patch alone, without the previous changes.

2015-06-17 06:40:57

by Konstantin Khlebnikov

Subject: Re: [PATCH v4] pagemap: switch to the new format and do some cleanup

On Wed, Jun 17, 2015 at 7:59 AM, Konstantin Khlebnikov <[email protected]> wrote:
> On Wed, Jun 17, 2015 at 12:29 AM, Andrew Morton
> <[email protected]> wrote:
>> On Mon, 15 Jun 2015 08:56:49 +0300 Konstantin Khlebnikov <[email protected]> wrote:
>>
>>> This patch removes page-shift bits (scheduled for removal since 3.11) and
>>> completes the migration to the new bit layout. It also cleans up the messy macros.
>>
>> hm, I can't find any kernel version to which this patch comes close to
>> applying.
>
> This patchset applies to 4.1-rc8 and current mmotm without problems.
> I guess you've tried to pick this patch alone, without the previous changes.

My bad. I sent the single v4 patch as a reply to the v3 patch and forgot
'4/4' in the subject. That's the fourth patch in the patchset.

Here is the v3 patchset cover letter: https://lkml.org/lkml/2015/6/9/804
"[PATCHSET v3 0/4] pagemap: make useable for non-privilege users"

2015-06-17 07:59:53

by Naoya Horiguchi

Subject: Re: [PATCH v3 1/4] pagemap: check permissions and capabilities at open time

On Tue, Jun 09, 2015 at 11:00:15PM +0300, Konstantin Khlebnikov wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch moves permission checks from pagemap_read() into pagemap_open().
>
> Pointer to mm is saved in file->private_data. This reference pins only
> mm_struct itself. /proc/*/mem, maps, smaps already work in the same way.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com

Reviewed-by: Naoya Horiguchi <[email protected]>

2015-06-17 08:13:53

by Naoya Horiguchi

Subject: Re: [PATCH v3 2/4] pagemap: add mmap-exclusive bit for marking pages mapped only here

On Tue, Jun 09, 2015 at 11:00:17PM +0300, Konstantin Khlebnikov wrote:
> From: Konstantin Khlebnikov <[email protected]>
>
> This patch sets bit 56 in pagemap if the page is mapped only once.
> It allows detecting exclusively used pages without exposing the PFN:
>
> present file exclusive state
> 0 0 0 non-present
> 1 1 0 file page mapped somewhere else
> 1 1 1 file page mapped only here
> 1 0 0 anon non-CoWed page (shared with parent/child)
> 1 0 1 anon CoWed page (or never forked)
>
> CoWed pages in MAP_FILE|MAP_PRIVATE areas are anon in this context.
>
> The mmap-exclusive bit doesn't reflect potential page sharing via the
> swapcache: a page could be mapped once but have several swap ptes which
> point to it. An application can detect that via the swap bit in the
> pagemap entry and touch the pte via /proc/pid/mem to get the real
> information.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Link: http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
>
> ---
>
> v2:
> * handle transparent huge pages
> * invert bit and rename shared -> exclusive (less confusing name)
> ---
...

> @@ -1119,6 +1122,13 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> else
> pmd_flags2 = 0;
>
> + if (pmd_present(*pmd)) {
> + struct page *page = pmd_page(*pmd);
> +
> + if (page_mapcount(page) == 1)
> + pmd_flags2 |= __PM_MMAP_EXCLUSIVE;
> + }
> +

Could you do the same thing for huge_pte_to_pagemap_entry(), too?

Thanks,
Naoya Horiguchi
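
For context, the change Naoya is asking for might look roughly like the
sketch below. This is not a patch from the thread; it assumes the helpers
as refactored in patch 4/4 above, and the version actually merged may differ:

static pagemap_entry_t huge_pte_to_pagemap_entry(struct pagemapread *pm,
		pte_t pte, int offset, u64 flags)
{
	u64 frame = 0;

	if (pte_present(pte)) {
		struct page *page = pte_page(pte);

		/* hugetlb page mapped in exactly one mm: mark it exclusive */
		if (page_mapcount(page) == 1)
			flags |= PM_MMAP_EXCLUSIVE;

		if (pm->show_pfn)
			frame = pte_pfn(pte) + offset;
		flags |= PM_PRESENT;
	}

	return make_pme(frame, flags);
}

As in the pmd path above, the mapcount test could be hoisted out of the
caller's per-page loop; it sits inside the helper here only to keep the
sketch short.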